Operations
Health, capacity, snapshots, backup, and upgrades for Ursula clusters.
For routine cluster-wide operations (bootstrap, graceful upgrade, node migration, scale up/down, restart-recovery), use ursulactl against an orchestrator. The endpoints documented on this page are the lower-level HTTP surface those operations build on, useful for ad-hoc inspection and break-glass repair on a single node.
Health and readiness
curl -f http://127.0.0.1:4437/healthz
curl -f http://127.0.0.1:4437/readyz/local
curl -f http://127.0.0.1:4437/readyz/cluster
| Endpoint | OK when |
|---|---|
| /healthz | Process is alive. |
| /readyz/local | Local recovery and hot-payload rehydration complete. |
| /readyz/cluster | Local payload is ready, membership has voters, and a leader is known. |
/readyz/cluster is the right gate for rolling-restart automation. Don't move to the next node until the just-restarted node passes it.
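A minimal gate for scripts, assuming the default listen address; the retry cadence and timeout are illustrative:
# Block until this node passes the cluster readiness gate, or fail after ~5 minutes.
ready=false
for _ in $(seq 1 60); do
  if curl -sf http://127.0.0.1:4437/readyz/cluster >/dev/null; then
    ready=true
    break
  fi
  sleep 5
done
[ "$ready" = true ] || { echo "node never became cluster-ready" >&2; exit 1; }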
Cluster status
curl http://127.0.0.1:4437/cluster/status
Returns node ID, leader, term, membership (voters + learners), Raft state, last-log / last-applied indices, and payload-readiness flags. The fields are documented in Troubleshooting → Diagnostic endpoints.
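For quick checks, individual fields pull cleanly with jq (assumed installed); leader_id is the field name referenced in the rolling-restart steps below:
# Print the current leader's node ID
curl -s http://127.0.0.1:4437/cluster/status | jq '.leader_id'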
Metrics
curl http://127.0.0.1:4437/metrics
Prometheus text format. Series families:
- Public HTTP request counters and latency
- Raft RPC request counters, durations, in-flight, and last-success age (ursula_raft_rpc_*)
- Snapshot transfer counters and latency (ursula_raft_snapshot_*)
- Hot-payload rehydrate counters (ursula_rehydrate_*)
- Write backpressure state (ursula_write_backpressure_*)
- Live connection and watch fan-out
Start triage with ursula_raft_rpc_* when replication slows or elections look unstable, ursula_raft_snapshot_* when follower catch-up shifts into snapshot install, and ursula_rehydrate_* when a restarting node stays payload-incomplete.
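To skim a single family without a Prometheus server in the loop, grep the raw exposition using the prefixes listed above:
# Dump just the Raft RPC series for quick inspection
curl -s http://127.0.0.1:4437/metrics | grep '^ursula_raft_rpc_'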
Capacity inspection
curl http://127.0.0.1:4437/cluster/hot-cold-stats
Returns a snapshot of hot bytes, cold bytes, per-stream hot tallies, flush backlog, current backpressure state and reason, and cache sizes. This is the right endpoint to check when you're deciding whether to raise the hot-payload thresholds documented in Configuration → Runtime tuning.
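A quick way to pull just the capacity-relevant fields with jq. The field names below (hot_bytes, backpressure) are assumptions for illustration, not a documented contract; inspect the full response with plain curl first:
# Field names are illustrative; verify against the real response shape.
curl -s http://127.0.0.1:4437/cluster/hot-cold-stats | jq '{hot_bytes, backpressure}'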
Manual Raft snapshots and log compaction
ursulactl and the orchestrator handle snapshots automatically as part of upgrade and migration flows. For one-off operator work, the raw HTTP endpoints are:
# Take a Raft snapshot of cluster state on this node
curl -X POST http://127.0.0.1:4437/cluster/trigger-snapshot
# After a snapshot, the leader can purge log entries below the snapshot index
curl -X POST http://127.0.0.1:4437/cluster/purge-log
trigger-snapshot queues the snapshot and returns immediately; the result lands in cold storage asynchronously. purge-log is only safe after a snapshot has been taken and replicated.
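Because the purge must be issued on the leader, a small guard is worth it during break-glass work. This sketch assumes jq and a NODE_ID variable holding this node's own ID, and it still leaves verifying snapshot replication to you:
# Refuse to purge unless this node is currently the leader.
LEADER=$(curl -s http://127.0.0.1:4437/cluster/status | jq -r '.leader_id')
if [ "$LEADER" = "$NODE_ID" ]; then
  curl -X POST http://127.0.0.1:4437/cluster/purge-log
else
  echo "not the leader (leader is $LEADER); refusing to purge" >&2
fi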
Rolling restarts (manual)
For one-off manual restarts without the orchestrator:
- Pick a follower (not the current leader; confirm via /cluster/status.leader_id).
- Stop the process.
- Restart it pointing at the same server.data_dir and config.
- Wait for /readyz/cluster to return 200.
- Move to the next follower.
- To restart the leader, first transfer leadership with POST /cluster/transfer-leader, wait for the new leader to be elected, then proceed as for a follower (a scripted sketch follows).
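A sketch of the leader step, assuming a systemd unit named ursula and a NODE_ID variable holding this node's own ID (both are placeholders); followers skip the transfer and start at the restart:
# Hand off leadership, then restart like any follower.
curl -X POST http://127.0.0.1:4437/cluster/transfer-leader
# Wait until leadership has moved elsewhere before taking the process down.
until [ "$(curl -s http://127.0.0.1:4437/cluster/status | jq -r '.leader_id')" != "$NODE_ID" ]; do
  sleep 2
done
systemctl restart ursula
# Gate on cluster readiness before touching the next node.
until curl -sf http://127.0.0.1:4437/readyz/cluster >/dev/null; do sleep 5; done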
For repeatable rolling restarts across a cluster, use ursulactl operations restart-recovery instead. It encodes leader transfer, readiness waits, and stability windows.
Logs
RUST_LOG=ursula=info ./ursula serve --config /etc/ursula/ursula.toml
RUST_LOG=ursula=debug ./ursula serve --config /etc/ursula/ursula.toml
info is the expected baseline for production. debug is helpful when chasing replication or rehydrate issues but is verbose enough that you'll want to redirect to a file.
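For example (log path illustrative):
RUST_LOG=ursula=debug ./ursula serve --config /etc/ursula/ursula.toml 2>&1 | tee /var/log/ursula/debug.log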
Backup and disaster recovery
Ursula does not provide a backup or restore tool. There is no --export flag, no cluster-state dump, and no "restore from snapshot file" command. Plan your recovery story accordingly.
What you can rely on:
- Quorum durability. Acknowledged writes are durable as long as a majority of voters survives. With three voters in three AZs, any single-AZ outage is recoverable.
- Cold-tier durability. Once a Raft snapshot or chunk has been flushed to S3, it inherits S3-grade durability. The window of acknowledged-but-unflushed data is bounded by the configured flush interval (seconds).
- Recovery invariants. A node refuses to start if its data directory is inconsistent (e.g., logs purged without a covering snapshot). This protects against silent corruption on restart.
What this means in practice:
- For node-level loss, replace the node and re-join the cluster via ursulactl operations node-migration. The new node rehydrates from peers and cold storage.
- For total cluster loss (all voters gone simultaneously), you can recover only what's in S3: the last persisted Raft snapshot plus any chunks. There is no tooling to spin up a fresh cluster from those S3 objects today. Treat full-cluster loss as needing a bespoke recovery procedure.
- For accidental deletion of a specific stream, there is no point-in-time restore. Application-level snapshots (PUT /{bucket}/{stream}/snapshot/{offset}) are the only application-visible undo and are the writer's responsibility.
If you have stricter RTO/RPO needs, snapshot the S3 bucket out-of-band (versioning, cross-region replication, or scheduled object copies) so you at least retain the durable chunks for offline recovery analysis.
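A hedged example of that out-of-band copy using standard AWS CLI commands; bucket names and scheduling are placeholders:
# Turn on versioning so chunk overwrites and deletes stay recoverable.
aws s3api put-bucket-versioning --bucket my-ursula-bucket \
  --versioning-configuration Status=Enabled
# Periodic copy of durable chunks to a separate bucket (run from cron or similar).
aws s3 sync s3://my-ursula-bucket s3://my-ursula-bucket-backup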
Upgrades
Ursula is at v0.x. There is no on-disk format compatibility shim and no migration code. Across versions, expect to rebuild clusters. The runtime will not refuse to start on a stale format, but it also will not migrate anything.
For routine same-version rolling restarts and for upgrades where the team has confirmed the new build is on-disk compatible with the running one, use:
ursulactl operations graceful-upgrade \
--target-version <version> \
--node-id 1 --node-id 2 --node-id 3 \
--wait
This drains each node, transfers leadership when needed, installs the new artifact, and waits for readiness before moving on. See Control CLI → Cluster lifecycle operations.
For upgrades that are not known-compatible (or any time you bump across a breaking change), the operational shape is:
- Stop writes at the edge (proxy / application).
- Drain the cold flush backlog (let the hot tier empty; check /cluster/hot-cold-stats; a wait loop is sketched after this list).
- Tear the cluster down.
- Bring up the new version against fresh data directories.
- Reload application state from your own sources of truth.
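A sketch of the drain wait. The flush_backlog field name is an assumption; substitute whatever your build's /cluster/hot-cold-stats response actually calls the backlog counter:
# Wait until the cold flush backlog is empty before tearing down.
# "flush_backlog" is an assumed field name; verify against the real response.
until [ "$(curl -s http://127.0.0.1:4437/cluster/hot-cold-stats | jq -r '.flush_backlog')" = "0" ]; do
  echo "backlog not drained yet..."
  sleep 10
done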
This is intentionally heavy-handed. It will get lighter as v1.0 approaches and on-disk format compatibility is committed to.