Troubleshooting

Common Ursula failure modes, what they look like, how to diagnose them, and what to do.

When something goes wrong, start with the three diagnostic endpoints rather than guessing from logs. Most operational symptoms have a clear signal in /cluster/status or /metrics.

Diagnostic endpoints

# Membership, leader, log/apply indices, payload readiness
curl http://NODE:4437/cluster/status

# Prometheus series for HTTP, Raft, snapshot, rehydrate, backpressure
curl http://NODE:4437/metrics

# Readiness: distinguishes local-recovery state from cluster-quorum state
curl http://NODE:4437/readyz/local     # local payload recovered?
curl http://NODE:4437/readyz/cluster   # voters set AND leader known?

Key fields in /cluster/status:

Field: what it tells you

  • state: initialized once membership is set; uninitialized means bootstrap hasn't happened on any reachable peer
  • is_leader, leader_id: who's driving replication right now
  • voters, learners: cluster composition
  • last_log_index, last_applied_index: replication progress; the gap between them is apply lag
  • local_ready, payload_incomplete: hot-payload rehydration state on this node
  • membership_change_in_progress: add/remove-voter operation in flight
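The two derived numbers you usually want from that response are the apply lag and the quorum size. A minimal sketch, assuming the field names in the table above and a JSON-decoded status body (this is illustrative, not an official client):

```python
# Sketch: derive apply lag and quorum size from a parsed /cluster/status body.
# Field names follow the table above.

def apply_lag(status: dict) -> int:
    """Entries committed to the log but not yet applied on this node."""
    return status["last_log_index"] - status["last_applied_index"]

def quorum_size(status: dict) -> int:
    """Voters needed for progress: a strict majority of the voter set."""
    return len(status["voters"]) // 2 + 1
```

If `apply_lag` keeps growing, see the replication section below; if fewer than `quorum_size` voters are reachable, writes will stall.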

Key Prometheus series:

Metric: use for

  • ursula_raft_apply_lag_entries: how far this node is behind committed log entries
  • ursula_raft_current_leader: 0 when no leader is known
  • ursula_write_backpressure_active: 1 when 429s are being issued
  • ursula_write_backpressure_reason_active{reason}: which threshold tripped (hot_bytes, apply_lag, …)
  • ursula_raft_rpc_*: peer RPC health; start here when replication stalls
  • ursula_raft_snapshot_*: snapshot install volume and duration
  • ursula_rehydrate_*: hot-payload rehydrate pulls (post-restart)
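When scripting against /metrics, the backpressure-reason gauge is the one that needs a little parsing. A hedged sketch that pulls the active reasons out of raw Prometheus exposition text, assuming the series name above and a single `reason` label:

```python
# Sketch: list the backpressure reasons currently active, given the raw
# text of /metrics. Assumes standard Prometheus exposition format.

def active_backpressure_reasons(metrics_text: str) -> list[str]:
    reasons = []
    for line in metrics_text.splitlines():
        if line.startswith("ursula_write_backpressure_reason_active{"):
            labels, _, value = line.partition("} ")
            if value.strip() and float(value) > 0:
                # labels looks like: ...active{reason="hot_bytes"
                reasons.append(labels.split('reason="', 1)[1].rstrip('"'))
    return reasons
```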

Startup failures

Process exits immediately with config error

Common causes:

  • Missing $ENV_VAR referenced in [cold]. Example error: URSULA_AWS_ACCESS_KEY_ID is not set. Ursula fails fast rather than running with a half-set credential. Export the variable or use the AWS SDK credential chain (omit the explicit s3_* keys).
  • cold.backend = "fs" in multi-node mode. Cluster mode requires shared cold storage. Switch to s3 or remove cluster.routes for single-node.
  • bootstrap_node_id missing in multi-node config. When cluster.routes is non-empty, every node must set cluster.bootstrap_node_id explicitly.
  • election_timeout_max_ms ≤ election_timeout_min_ms. Raft needs a non-empty randomization window, so max must be strictly greater than min.
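Several of these can be caught before the process ever starts. A minimal preflight sketch over a parsed config dict; the key paths mirror the option names in the bullets above (the `raft` section name for the election timeouts is an assumption), and this is not an official tool:

```python
# Hypothetical preflight validator mirroring the startup failure modes above.
# cfg is the parsed config as nested dicts.

def preflight(cfg: dict) -> list[str]:
    problems = []
    multi_node = bool(cfg.get("cluster", {}).get("routes"))
    if multi_node and cfg.get("cold", {}).get("backend") == "fs":
        problems.append("cluster mode requires shared cold storage (use s3, not fs)")
    if multi_node and not cfg.get("cluster", {}).get("bootstrap_node_id"):
        problems.append("cluster.bootstrap_node_id must be set when cluster.routes is non-empty")
    raft = cfg.get("raft", {})  # section name assumed
    if raft.get("election_timeout_max_ms", 1) <= raft.get("election_timeout_min_ms", 0):
        problems.append("election_timeout_max_ms must exceed election_timeout_min_ms")
    return problems
```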

Process exits with I/O error on data directory

The server.data_dir must be writable. Check ownership and permissions on the directory and its subpaths (wal/, meta/, snapshots/, cold_cache/). If you're running under systemd, confirm ReadWritePaths= covers the data dir.

Process panics with RocksDB lock-held error

Another ursula process is still holding the RocksDB lock on server.data_dir. Confirm with lsof / fuser. If the previous process crashed, the lock file may need manual removal, but check the on-disk state isn't being concurrently used first.

Port already in use

server.listen is bound by another process. Pick another port or stop the conflicting one. Note that the public HTTP and inter-node gRPC share a single listener.

Bootstrap and cluster join

/readyz/cluster returns 503 indefinitely on a new cluster

Check /cluster/status. If state is uninitialized and voters is empty:

  • Only the node whose cluster.bootstrap_node_id matches its own node_id is allowed to initialize the cluster. Confirm exactly one node has that match.
  • The bootstrap node needs reachable routes to its declared peers. From the bootstrap node, curl http://PEER:4437/healthz should succeed.
  • If the bootstrap node is reachable but other peers can't reach it, fix the server.advertise value; it must be the address peers should connect to, not the loopback.
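The first bullet's "exactly one match" rule is easy to verify mechanically across your node configs. A small sketch, assuming each node's config is reduced to its node_id and bootstrap_node_id (illustrative only):

```python
# Sketch: check that exactly one node will attempt cluster bootstrap.
# nodes: [{"node_id": ..., "bootstrap_node_id": ...}, ...]

def bootstrap_check(nodes: list[dict]) -> str:
    matches = [n["node_id"] for n in nodes if n["node_id"] == n["bootstrap_node_id"]]
    if len(matches) == 1:
        return f"ok: {matches[0]} will bootstrap"
    if not matches:
        return "error: no node matches its bootstrap_node_id; the cluster will never initialize"
    return f"error: multiple bootstrap candidates: {matches}"
```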

New node added via /cluster/add-learner never becomes voter

Check the learner's /cluster/status:

  • payload_incomplete: true means hot-payload rehydration isn't finished. Wait. Adding a voter while the learner is still rehydrating risks losing quorum if another node fails simultaneously.
  • local_ready: false after a long wait usually points to a problem reading from cold storage; check S3 permissions and ursula_rehydrate_requests_total{result} for failures.
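If you automate promotion (by whatever mechanism your deployment uses), gate it on the two fields above. A minimal sketch over the learner's parsed /cluster/status:

```python
# Sketch: only promote a learner to voter once rehydration has finished,
# per the bullets above. Defaults are pessimistic if fields are missing.

def safe_to_promote(learner_status: dict) -> bool:
    return (learner_status.get("local_ready", False)
            and not learner_status.get("payload_incomplete", True))
```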

Rolling restart of a follower causes prolonged unavailability

After restart, the node must rehydrate hot payloads from cold storage before serving. Wait for /readyz/cluster on the restarted node before restarting the next one. Track progress with ursula_rehydrate_bytes_total and ursula_rehydrate_duration_seconds.

Write path

429 Too Many Requests with Retry-After

Backpressure is active. Which threshold triggered determines the fix. Check ursula_write_backpressure_reason_active{reason}:

Reason: meaning, and fix

  • hot_bytes: the global hot tier exceeds FDS_V3_WRITE_BACKPRESSURE_MAX_HOT_BYTES. Fix: cold flush isn't keeping up; check S3 / cold backend health.
  • stream_hot_bytes: a single stream exceeds its per-stream cap. Fix: slow down writes on that stream, or raise FDS_V3_WRITE_BACKPRESSURE_MAX_STREAM_HOT_BYTES.
  • flush_backlog_bytes: the flush queue is too large. Fix: same as hot_bytes; the cold backend is the bottleneck.
  • apply_lag: state-machine apply lag is too large. Fix: followers can't keep up; investigate Raft RPC and CPU pressure.

Honor Retry-After. Hammering the server during backpressure will not get the write through faster.
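A client-side retry loop that honors Retry-After might look like this; `do_append` stands in for your actual HTTP call and returns (status, headers), and `sleep` is injectable so the loop is testable (a sketch, not an official client):

```python
import time

# Sketch: honor Retry-After on 429 instead of hammering the server.
# do_append() -> (status_code, headers_dict); headers keys as in the doc.

def append_with_backoff(do_append, max_attempts: int = 5, sleep=time.sleep):
    for _ in range(max_attempts):
        status, headers = do_append()
        if status != 429:
            return status
        # Server tells us exactly how long to wait; default 1s if absent.
        sleep(float(headers.get("Retry-After", 1)))
    return 429
```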

409 Conflict with producer_seq_conflict

The producer sent the same Producer-Id with an unexpected Producer-Epoch / Producer-Seq. The response body contains expected_seq. Common patterns:

  • A client crashed and restarted without incrementing Producer-Epoch. Bump epoch on restart.
  • Two writers share the same Producer-Id. Each writer needs a distinct ID.
  • A retry skipped a sequence number. Producer sequences must be contiguous within an epoch.
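The bookkeeping those bullets imply is small. A hedged sketch of per-writer producer state, using the header names above (the class itself is illustrative; how you persist the epoch across restarts is up to you):

```python
# Sketch: one writer per Producer-Id, epoch bumped on every restart,
# sequence numbers contiguous within an epoch.

class ProducerSession:
    def __init__(self, producer_id: str, epoch: int):
        self.producer_id = producer_id
        self.epoch = epoch  # bump this each time the process restarts
        self.seq = 0        # contiguous within an epoch; no gaps

    def headers(self) -> dict:
        h = {
            "Producer-Id": self.producer_id,
            "Producer-Epoch": str(self.epoch),
            "Producer-Seq": str(self.seq),
        }
        self.seq += 1       # next append uses the next sequence number
        return h
```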

412 Precondition Failed on conditional write

Someone else wrote to the stream since you last fetched the ETag. Re-read with HEAD, re-derive the new state, retry with the updated If-Match.
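That read-modify-write loop generalizes to any conditional write. A minimal compare-and-set sketch with injected I/O (`fetch` and `put` stand in for your HEAD and conditional POST; not an official client):

```python
# Sketch: retry a conditional write after 412 by re-reading and
# re-deriving state, as described above.
# fetch() -> (etag, state); put(new_state, if_match_etag) -> status code.

def cas_update(fetch, put, transform, max_attempts: int = 3):
    for _ in range(max_attempts):
        etag, state = fetch()
        status = put(transform(state), etag)
        if status != 412:
            return status
    raise RuntimeError("conditional write kept losing the race; giving up")
```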

404 Not Found on POST /{bucket}/{stream}

The bucket doesn't exist or the stream hasn't been created. PUT /{bucket} and PUT /{bucket}/{stream} first, then append. PUT on existing resources is idempotent.

503 on writes with growing apply_lag

Followers can't apply log entries fast enough. Causes:

  • Quorum lost (some voters down). /cluster/status will show fewer than ⌊n/2⌋+1 reachable voters. Bring the downed nodes back or formally remove them with /cluster/remove-node.
  • Disk or CPU pressure on followers. ursula_raft_rpc_duration_seconds will be elevated on the affected peer.
  • Snapshot install in progress for a follower. Wait, or raise raft.install_snapshot_timeout_ms if you have a slow link.

Cold flush failures (writes succeed but data isn't reaching S3)

Check the logs for cold-flush errors. Most common:

  • S3 bucket doesn't exist or wrong region.
  • IAM permissions missing. Required: s3:GetObject, s3:PutObject, s3:ListBucket, s3:DeleteObject.
  • s3_endpoint set to a value the host can't reach (typo, private endpoint not on the route).

Hot-tier-only operation will eventually trigger hot_bytes backpressure, so this masquerades as a write-path issue once the buffer fills.
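The permissions bullet above translates into an IAM policy along these lines; YOUR_BUCKET is a placeholder, and note that s3:ListBucket applies to the bucket ARN while the object actions apply to objects under it:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::YOUR_BUCKET/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::YOUR_BUCKET"
    }
  ]
}
```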

Read path

410 Gone with stream-earliest-offset header

The requested offset has been trimmed by a snapshot. The response includes the earliest still-available offset. Two options:

  • Use /bootstrap to fetch the latest snapshot plus retained updates, then continue from there.
  • Seek forward to stream-earliest-offset and accept the gap.

Either way, do not request the original offset again. It's gone.
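For readers that take the second option, the recovery rule is mechanical. A small sketch, assuming the response header name above and lowercase header keys (the bootstrap-vs-seek choice stays with the caller; this sketch just seeks):

```python
# Sketch: after a 410, resume from the earliest retained offset rather
# than retrying the trimmed one.

def next_read_offset(status: int, headers: dict, requested: int) -> int:
    if status == 410:
        return int(headers["stream-earliest-offset"])
    return requested
```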

404 Not Found on read of a previously valid stream

Either the stream was deleted, or its Stream-TTL / Stream-Expires-At expired. TTL is set at creation; expired streams return 404 rather than 410.

SSE connections drop after ~30–60 seconds

A proxy or load balancer is closing idle connections faster than Ursula's keepalive heartbeat. Either:

  • Raise the proxy's idle timeout above the heartbeat interval, or
  • Place a TCP-level keepalive on the connection.

Heartbeats are emitted as SSE comment lines, so a working connection should show traffic even with no data events.
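If the proxy in front of Ursula happens to be nginx, the relevant knobs look roughly like this. The location path and upstream name are placeholders; the point is that the read timeout must exceed the heartbeat interval and buffering must be off so heartbeats actually reach the client:

```nginx
location /streams/ {                 # placeholder path
    proxy_pass http://ursula_upstream;   # placeholder upstream
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_buffering off;             # don't hold SSE events in the proxy
    proxy_read_timeout 300s;         # must exceed the heartbeat interval
}
```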

Live tail seems "stuck" but the stream is being appended to

The reader is connected to a follower that's lagging behind. The Stream-Up-To-Date response header indicates whether the serving node has caught up. To force leader-side reads, target the leader directly (its ID is in /cluster/status.leader_id).

Replication and consensus

/cluster/status.leader_id is null for more than a few seconds

The cluster has no leader. Either an election is in progress or quorum is lost. Check ursula_raft_current_leader across nodes:

  • If multiple nodes briefly claim leader, election is flapping. Likely network or clock issues. Inspect ursula_raft_rpc_* for high RPC durations.
  • If no node claims leader, fewer than ⌊n/2⌋+1 voters are reachable. Restore reachability or remove unreachable voters from membership.

Snapshot install times out

ursula_raft_snapshot_duration_seconds{result="error"} ticks up; the affected follower stays behind.

  • Network bandwidth between leader and follower is too low for the snapshot size within raft.install_snapshot_timeout_ms.
  • Cold storage on the follower is unreachable (snapshot transfer sends a blob reference; the follower pulls from cold directly).

Raise the timeout, or fix the underlying network / cold-storage path.

ursula_raft_apply_lag_entries grows unbounded on a follower

That follower can't apply log entries as fast as the leader produces them. Causes:

  • Disk pressure on the follower. iostat on the data volume will show it.
  • CPU saturation. The follower's state machine apply runs in user space.
  • A pathologically large single stream that's flushing slowly.

Trigger a snapshot on the leader if the lag exceeds what catch-up can recover, or take the follower out of the voter set until it catches up.

Still stuck?

Open a GitHub issue with:

  • /cluster/status output from every node
  • curl http://NODE:4437/metrics | grep -E 'ursula_raft_|ursula_write_backpressure|ursula_rehydrate'
  • The last ~200 log lines at RUST_LOG=ursula=debug
  • Your config (with secrets redacted)