Operations

Health, capacity, snapshots, backup, and upgrades for Ursula clusters.

For routine cluster-wide operations (bootstrap, graceful upgrade, node migration, scale up/down, restart-recovery), use ursulactl against an orchestrator. The endpoints documented on this page are the lower-level HTTP surface those operations build on, useful for ad-hoc inspection and break-glass repair on a single node.

Health and readiness

curl -f http://127.0.0.1:4437/healthz
curl -f http://127.0.0.1:4437/readyz/local
curl -f http://127.0.0.1:4437/readyz/cluster
Endpoint          OK when
/healthz          Process is alive.
/readyz/local     Local recovery and hot-payload rehydration complete.
/readyz/cluster   Local payload is ready, membership has voters, and a leader is known.

/readyz/cluster is the right gate for rolling-restart automation. Don't move to the next node until the just-restarted node passes it.
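A minimal readiness gate for a restart script might look like this; the 5-second poll interval and any overall timeout are yours to choose, not Ursula defaults:

# Block until the just-restarted node passes the cluster readiness gate
until curl -sf http://127.0.0.1:4437/readyz/cluster >/dev/null; do
  sleep 5
done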

Cluster status

curl http://127.0.0.1:4437/cluster/status

Returns node ID, leader, term, membership (voters + learners), Raft state, last-log / last-applied indices, and payload-readiness flags. The fields are documented in Troubleshooting → Diagnostic endpoints.
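For ad-hoc checks, jq over the response works well. leader_id is referenced in the rolling-restart steps below; node_id here is an assumed name for the node's own ID field, so verify it against a real response.

# Show this node's ID and the current leader; node_id is an assumed field name
curl -s http://127.0.0.1:4437/cluster/status | jq '{node_id, leader_id}'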

Metrics

curl http://127.0.0.1:4437/metrics

Prometheus text format. Series families:

  • Public HTTP request counters and latency
  • Raft RPC request counters, durations, in-flight, and last-success age (ursula_raft_rpc_*)
  • Snapshot transfer counters and latency (ursula_raft_snapshot_*)
  • Hot-payload rehydrate counters (ursula_rehydrate_*)
  • Write backpressure state (ursula_write_backpressure_*)
  • Live connection and watch fan-out

Start triage with ursula_raft_rpc_* when replication slows or elections look unstable, ursula_raft_snapshot_* when follower catch-up shifts into snapshot install, and ursula_rehydrate_* when a restarting node stays payload-incomplete.
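To pull just one family during triage, a plain grep over the text format is enough:

# Spot-check Raft RPC health on a single node
curl -s http://127.0.0.1:4437/metrics | grep '^ursula_raft_rpc_'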

Capacity inspection

curl http://127.0.0.1:4437/cluster/hot-cold-stats

Returns a snapshot of hot bytes, cold bytes, per-stream hot tallies, flush backlog, current backpressure state and reason, and cache sizes. This is the right endpoint to check when you're deciding whether to raise the hot-payload thresholds documented in Configuration → Runtime tuning.
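When scripting against it, something like the following is a reasonable sketch; hot_bytes and backpressure are assumed field names for the response shape, so check a live response before depending on them.

# Assumed field names (hot_bytes, backpressure); verify against a live response
curl -s http://127.0.0.1:4437/cluster/hot-cold-stats | jq '{hot_bytes, backpressure}'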

Manual Raft snapshots and log compaction

ursulactl and the orchestrator handle snapshots automatically as part of upgrade and migration flows. For one-off operator work, the raw HTTP endpoints are:

# Take a Raft snapshot of cluster state on this node
curl -X POST http://127.0.0.1:4437/cluster/trigger-snapshot

# After a snapshot, the leader can purge log entries below the snapshot index
curl -X POST http://127.0.0.1:4437/cluster/purge-log

trigger-snapshot queues the snapshot and returns immediately; the result lands in cold storage asynchronously. purge-log is only safe after a snapshot has been taken and replicated.
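Put together, a cautious one-off sequence looks like this; neither endpoint returns a completion signal, so the replication check in the middle is on the operator.

# 1. Queue a snapshot; the call returns before the snapshot is durable
curl -X POST http://127.0.0.1:4437/cluster/trigger-snapshot
# 2. Confirm out-of-band that the snapshot has landed in cold storage and
#    replicated (e.g. via the ursula_raft_snapshot_* metric families)
# 3. Only then purge, on the leader
curl -X POST http://127.0.0.1:4437/cluster/purge-log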

Rolling restarts (manual)

For one-off manual restarts without the orchestrator (a scripted sketch follows the list):

  1. Pick a follower (not the current leader; confirm via /cluster/status.leader_id).
  2. Stop the process.
  3. Restart it pointing at the same server.data_dir and config.
  4. Wait for /readyz/cluster to return 200.
  5. Move to the next follower.
  6. To restart the leader, first transfer leadership with POST /cluster/transfer-leader, wait for the new leader to be elected, then proceed as for a follower.
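Scripted, the follower portion of that loop might look like the following sketch; systemd units, node hostnames, and the poll cadence are assumptions about your environment, not anything Ursula ships.

# Sketch: restart followers one at a time, gating on cluster readiness.
# Assumes systemd-managed processes and resolvable node hostnames.
for node in ursula-2 ursula-3; do   # followers only; handle the leader last
  ssh "$node" systemctl restart ursula
  until curl -sf "http://$node:4437/readyz/cluster" >/dev/null; do
    sleep 5
  done
done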

For repeatable rolling restarts across a cluster, use ursulactl operations restart-recovery instead. It encodes leader transfer, readiness waits, and stability windows.

Logs

RUST_LOG=ursula=info  ./ursula serve --config /etc/ursula/ursula.toml
RUST_LOG=ursula=debug ./ursula serve --config /etc/ursula/ursula.toml

info is the expected baseline for production. debug is helpful when chasing replication or rehydrate issues but is verbose enough that you'll want to redirect to a file.
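For a debug run, redirecting to a file looks like this (assuming, as is common for Rust services, that logs go to stderr; the file path is a placeholder):

RUST_LOG=ursula=debug ./ursula serve --config /etc/ursula/ursula.toml 2>>/var/tmp/ursula-debug.log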

Backup and disaster recovery

Ursula does not provide a backup or restore tool. There is no --export flag, no cluster-state dump, and no "restore from snapshot file" command. Plan your recovery story accordingly.

What you can rely on:

  • Quorum durability. Acknowledged writes are durable as long as a majority of voters survives. With three voters in three AZs, any single-AZ outage is recoverable.
  • Cold-tier durability. Once a Raft snapshot or chunk has been flushed to S3, it inherits S3-grade durability. The window of acknowledged-but-unflushed data is bounded by the configured flush interval (seconds).
  • Recovery invariants. A node refuses to start if its data directory is inconsistent (e.g., logs purged without a covering snapshot). This protects against silent corruption on restart.

What this means in practice:

  • For node-level loss, replace the node and re-join the cluster via ursulactl operations node-migration. The new node rehydrates from peers and cold storage.
  • For total cluster loss (all voters gone simultaneously), you can recover only what's in S3: the last persisted Raft snapshot plus any chunks. There is no tooling to spin up a fresh cluster from those S3 objects today. Treat full-cluster loss as needing a bespoke recovery procedure.
  • For accidental deletion of a specific stream, there is no point-in-time restore. Application-level snapshots (PUT /{bucket}/{stream}/snapshot/{offset}) are the only application-visible undo and are the writer's responsibility.

If you have stricter RTO/RPO needs, snapshot the S3 bucket out-of-band (versioning, cross-region replication, or scheduled object copies) so you at least retain the durable chunks for offline recovery analysis.
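With the AWS CLI, for example, enabling versioning is one call; the bucket name below is a placeholder for your cold-storage bucket.

# Enable object versioning on the cold-storage bucket (placeholder name)
aws s3api put-bucket-versioning \
  --bucket my-ursula-cold \
  --versioning-configuration Status=Enabled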

Upgrades

Ursula is at v0.x. There is no on-disk format compatibility shim and no migration code. Across versions, expect to rebuild clusters. The runtime will not refuse to start on a stale format, but it also will not migrate anything.

For routine same-version rolling restarts and for upgrades where the team has confirmed the new build is on-disk compatible with the running one, use:

ursulactl operations graceful-upgrade \
  --target-version <version> \
  --node-id 1 --node-id 2 --node-id 3 \
  --wait

This drains each node, transfers leadership when needed, installs the new artifact, and waits for readiness before moving on. See Control CLI → Cluster lifecycle operations.

For upgrades that are not known-compatible (or any time you bump across a breaking change), the operational shape is:

  1. Stop writes at the edge (proxy / application).
  2. Drain the cold flush backlog (let the hot tier empty; check /cluster/hot-cold-stats, as in the sketch at the end of this section).
  3. Tear the cluster down.
  4. Bring up the new version against fresh data directories.
  5. Reload application state from your own sources of truth.

This is intentionally heavy-handed. It will get lighter as v1.0 approaches and on-disk format compatibility is committed to.
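For step 2 of that procedure, a drain gate might look like the following; flush_backlog is an assumed field name for the backlog figure /cluster/hot-cold-stats reports, so verify it against a live response first.

# Wait for the cold flush backlog to drain before teardown;
# flush_backlog is an assumed field name, check a real response first
until [ "$(curl -s http://127.0.0.1:4437/cluster/hot-cold-stats | jq '.flush_backlog')" = "0" ]; do
  sleep 10
done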