# Troubleshooting

Common Ursula failure modes, what they look like, how to diagnose them, and what to do.

When something goes wrong, start with the diagnostic endpoints below rather than guessing from logs. Most operational symptoms have a clear signal in `/cluster/status` or `/metrics`.
## Diagnostic endpoints

```bash
# Membership, leader, log/apply indices, payload readiness
curl http://NODE:4437/cluster/status

# Prometheus series for HTTP, Raft, snapshot, rehydrate, backpressure
curl http://NODE:4437/metrics

# Readiness: distinguishes local-recovery state from cluster-quorum state
curl http://NODE:4437/readyz/local     # local payload recovered?
curl http://NODE:4437/readyz/cluster   # voters set AND leader known?
```
Key fields in `/cluster/status`:

| Field | What it tells you |
|---|---|
| `state` | `initialized` once membership is set; `uninitialized` means bootstrap hasn't happened on any reachable peer |
| `is_leader`, `leader_id` | Who's driving replication right now |
| `voters`, `learners` | Cluster composition |
| `last_log_index`, `last_applied_index` | Replication progress; the gap between them is the apply lag |
| `local_ready`, `payload_incomplete` | Hot-payload rehydration state on this node |
| `membership_change_in_progress` | Add/remove-voter operation in flight |
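To pull just these fields from one node, a `jq` one-liner works (assuming `jq` is installed; field names as listed above):

```bash
# Summarize the cluster view of one node.
curl -s http://NODE:4437/cluster/status \
  | jq '{state, is_leader, leader_id, voters, learners,
         last_log_index, last_applied_index,
         local_ready, payload_incomplete, membership_change_in_progress}'
```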
Key Prometheus series:

| Metric | Use for |
|---|---|
| `ursula_raft_apply_lag_entries` | How far this node is behind committed log entries |
| `ursula_raft_current_leader` | 0 when no leader is known |
| `ursula_write_backpressure_active` | 1 when 429s are being issued |
| `ursula_write_backpressure_reason_active{reason}` | Which threshold tripped (`hot_bytes`, `apply_lag`, …) |
| `ursula_raft_rpc_*` | Peer RPC health; start here when replication stalls |
| `ursula_raft_snapshot_*` | Snapshot install volume and duration |
| `ursula_rehydrate_*` | Hot-payload rehydrate pulls (post-restart) |
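During an incident it helps to sweep the most telling series across all nodes at once. A minimal sketch, with a `NODES` list you substitute yourself:

```bash
# Spot-check leader visibility, apply lag, and backpressure on every node.
NODES="node-a node-b node-c"   # placeholder: your node hostnames
for n in $NODES; do
  echo "== $n =="
  curl -s "http://$n:4437/metrics" \
    | grep -E '^ursula_(raft_current_leader|raft_apply_lag_entries|write_backpressure_active)'
done
```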
## Startup failures

### Process exits immediately with config error

Common causes (a pre-flight sketch follows the list):

- Missing `$ENV_VAR` referenced in `[cold]`. Example error: `URSULA_AWS_ACCESS_KEY_ID is not set`. Ursula fails fast rather than running with a half-set credential. Export the variable or use the AWS SDK credential chain (omit the explicit `s3_*` keys).
- `cold.backend = "fs"` in multi-node mode. Cluster mode requires shared cold storage. Switch to `s3`, or remove `cluster.routes` to run single-node.
- `bootstrap_node_id` missing in multi-node config. When `cluster.routes` is non-empty, every node must set `cluster.bootstrap_node_id` explicitly.
- `election_timeout_max_ms` ≤ `election_timeout_min_ms`. Raft needs a non-empty randomization window.
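A minimal pre-flight check that catches the first and last of these before startup. The config path is an assumption; point it at wherever your config actually lives:

```bash
# Fail early if the credential referenced in [cold] isn't exported
# (skip this if you rely on the AWS SDK credential chain instead).
: "${URSULA_AWS_ACCESS_KEY_ID:?not set; export it or drop the explicit s3_* keys}"

# Eyeball the settings the startup checks care about (path is illustrative).
grep -E 'backend|bootstrap_node_id|election_timeout_(min|max)_ms' /etc/ursula/ursula.toml
```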
### Process exits with I/O error on data directory

The `server.data_dir` must be writable. Check ownership and permissions on the directory and its subpaths (`wal/`, `meta/`, `snapshots/`, `cold_cache/`). If you're running under systemd, confirm `ReadWritePaths=` covers the data dir.
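A quick way to verify this from a shell; the `ursula` user, `ursula` systemd unit, and `/var/lib/ursula` path are placeholders for whatever your deployment uses:

```bash
DATA_DIR=/var/lib/ursula          # placeholder: your server.data_dir
ls -ld "$DATA_DIR" "$DATA_DIR"/{wal,meta,snapshots,cold_cache} 2>/dev/null

# Can the service account actually create and delete a file there?
sudo -u ursula sh -c "touch '$DATA_DIR/.write_test' && rm '$DATA_DIR/.write_test'"

# Under systemd, make sure sandboxing doesn't mount the data dir read-only.
systemctl cat ursula | grep -E 'ReadWritePaths|ProtectSystem'
```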
### Process panics with RocksDB lock-held error

Another `ursula` process is still holding the RocksDB lock on `server.data_dir`. Confirm with `lsof` / `fuser`. If the previous process crashed, the lock file may need manual removal, but first check that the on-disk state isn't being used concurrently.
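For example (the data-dir path is a placeholder):

```bash
DATA_DIR=/var/lib/ursula          # placeholder: your server.data_dir
# Is a previous ursula process still alive, or does anything have files open there?
pgrep -a ursula
fuser -v "$DATA_DIR" 2>/dev/null || lsof +D "$DATA_DIR"
```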
### Port already in use

`server.listen` is bound by another process. Pick another port or stop the conflicting one. Note that the public HTTP and inter-node gRPC share a single listener.
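Finding the conflicting process is a one-liner, assuming the default port 4437 used in the examples above:

```bash
# Show whatever is already listening on the Ursula port.
ss -ltnp | grep ':4437'
```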
## Bootstrap and cluster join

### `/readyz/cluster` returns 503 indefinitely on a new cluster

Check `/cluster/status`. If `state` is `uninitialized` and `voters` is empty, work through the following (scripted below):

- Only the node whose `cluster.bootstrap_node_id` matches its own `node_id` is allowed to initialize the cluster. Confirm exactly one node has that match.
- The bootstrap node needs reachable routes to its declared peers. From the bootstrap node, `curl http://PEER:4437/healthz` should succeed.
- If the bootstrap node is reachable but other peers can't reach it, fix the `server.advertise` value; it must be the address peers should connect to, not the loopback.
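A sketch of those checks, run from the node you expect to bootstrap; the `PEERS` list is a placeholder for the hosts in `cluster.routes`:

```bash
# 1. What does this node think the cluster state is?
curl -s http://localhost:4437/cluster/status | jq '{state, voters}'

# 2. Can this node reach every declared peer?
PEERS="node-b node-c"             # placeholder
for p in $PEERS; do
  echo -n "$p: "
  curl -s -o /dev/null -w '%{http_code}\n' "http://$p:4437/healthz"
done

# 3. Can the peers reach *this* node at its advertised address? Run from a peer:
#    curl -s -o /dev/null -w '%{http_code}\n' http://<server.advertise>/healthz
```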
### New node added via `/cluster/add-learner` never becomes a voter

Check the learner's `/cluster/status` (a wait loop is sketched below):

- `payload_incomplete: true` means hot-payload rehydration isn't finished. Wait. Adding a voter while the learner is still rehydrating risks losing quorum if another node fails simultaneously.
- `local_ready: false` after a long wait usually points to a problem reading from cold storage; check S3 permissions and `ursula_rehydrate_requests_total{result}` for failures.
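Before promoting the learner, a simple poll on its rehydration state avoids that quorum risk; `LEARNER` is a placeholder hostname:

```bash
LEARNER=node-d                    # placeholder: the new learner's host
# Block until the learner reports a complete local payload.
until curl -s "http://$LEARNER:4437/cluster/status" \
        | jq -e '.local_ready == true and .payload_incomplete == false' >/dev/null; do
  echo "still rehydrating..."
  sleep 10
done
echo "learner is ready to be promoted"
```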
### Rolling restart of a follower causes prolonged unavailability

After restart, the node must rehydrate hot payloads from cold storage before serving. Wait for `/readyz/cluster` on the restarted node before restarting the next one. Track progress with `ursula_rehydrate_bytes_total` and `ursula_rehydrate_duration_seconds`.
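The rolling restart then reduces to "restart, wait for readiness, move on". A sketch, assuming SSH access and a systemd unit named `ursula`; substitute however you actually run the service:

```bash
NODES="node-a node-b node-c"      # placeholder: restart order
for n in $NODES; do
  ssh "$n" sudo systemctl restart ursula   # assumption: systemd unit named 'ursula'
  # Don't touch the next node until this one rejoins the cluster.
  until curl -sf "http://$n:4437/readyz/cluster" >/dev/null; do
    sleep 5
  done
  echo "$n is ready"
done
```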
## Write path

### 429 Too Many Requests with `Retry-After`

Backpressure is active. The threshold that tripped determines the fix; check `ursula_write_backpressure_reason_active{reason}`:
| Reason | Meaning | Fix |
|---|---|---|
| `hot_bytes` | Global hot tier exceeds `FDS_V3_WRITE_BACKPRESSURE_MAX_HOT_BYTES` | Cold flush isn't keeping up; check S3 / cold-backend health |
| `stream_hot_bytes` | A single stream exceeds its per-stream cap | Slow down writes on that stream, or raise `FDS_V3_WRITE_BACKPRESSURE_MAX_STREAM_HOT_BYTES` |
| `flush_backlog_bytes` | Flush queue too large | Same as `hot_bytes`; the cold backend is the bottleneck |
| `apply_lag` | State-machine apply lag too large | Followers can't keep up; investigate Raft RPC and CPU pressure |
Honor `Retry-After`. Hammering the server during backpressure will not get the write through faster. A retry sketch follows.
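A minimal loop that actually honors the header, assuming the `POST /{bucket}/{stream}` append described later in this section and a placeholder payload file:

```bash
# Retry only after the server-suggested delay (fall back to 1s if the header is missing).
while :; do
  code=$(curl -s -o /dev/null -D /tmp/hdrs -w '%{http_code}' \
    -X POST "http://NODE:4437/$BUCKET/$STREAM" --data-binary @batch.json)
  [ "$code" != "429" ] && break
  delay=$(awk 'tolower($1)=="retry-after:" {print $2+0}' /tmp/hdrs)
  sleep "${delay:-1}"
done
echo "final status: $code"
```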
### 409 Conflict with `producer_seq_conflict`

The producer sent the same `Producer-Id` with an unexpected `Producer-Epoch` / `Producer-Seq`. The response body contains `expected_seq`. Common patterns:

- A client crashed and restarted without incrementing `Producer-Epoch`. Bump the epoch on restart (sketched below).
- Two writers share the same `Producer-Id`. Each writer needs a distinct ID.
- A retry skipped a sequence number. Producer sequences must be contiguous within an epoch.
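A sketch of the restart case, assuming the header names above and an epoch the client persists on disk (the path and producer ID are illustrative):

```bash
# On restart, bump the persisted epoch and reset the per-epoch sequence to 0.
EPOCH_FILE=/var/lib/myapp/producer_epoch        # placeholder persistence
EPOCH=$(( $(cat "$EPOCH_FILE") + 1 ))
echo "$EPOCH" > "$EPOCH_FILE"

curl -s -X POST "http://NODE:4437/$BUCKET/$STREAM" \
  -H "Producer-Id: myapp-writer-1" \
  -H "Producer-Epoch: $EPOCH" \
  -H "Producer-Seq: 0" \
  --data-binary @batch.json
```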
### 412 Precondition Failed on conditional write

Someone else wrote to the stream since you last fetched the ETag. Re-read with `HEAD`, re-derive the new state, and retry with the updated `If-Match`.
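Roughly, the read-modify-write retry looks like this; how you rebuild the body against the fresh state is application-specific:

```bash
# Fetch the current ETag, rebuild the write against the fresh state, then retry.
etag=$(curl -sI "http://NODE:4437/$BUCKET/$STREAM" \
        | awk 'tolower($1)=="etag:" {gsub(/\r/,""); print $2}')
curl -s -X POST "http://NODE:4437/$BUCKET/$STREAM" \
  -H "If-Match: $etag" \
  --data-binary @rebuilt-batch.json
```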
### 404 Not Found on `POST /{bucket}/{stream}`

The bucket doesn't exist or the stream hasn't been created. `PUT /{bucket}` and `PUT /{bucket}/{stream}` first, then append. `PUT` on existing resources is idempotent.
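In order, and safe to re-run since `PUT` is idempotent (`@batch.json` stands in for your payload):

```bash
curl -s -X PUT  "http://NODE:4437/$BUCKET"                                    # create (or confirm) the bucket
curl -s -X PUT  "http://NODE:4437/$BUCKET/$STREAM"                            # create (or confirm) the stream
curl -s -X POST "http://NODE:4437/$BUCKET/$STREAM" --data-binary @batch.json  # then append
```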
### 503 on writes with growing `apply_lag`

Followers can't apply log entries fast enough. Causes:

- Quorum lost (some voters down). `/cluster/status` will show fewer than ⌊n/2⌋+1 reachable voters. Bring the downed nodes back or formally remove them with `/cluster/remove-node`.
- Disk or CPU pressure on followers. `ursula_raft_rpc_duration_seconds` will be elevated on the affected peer.
- Snapshot install in progress for a follower. Wait, or raise `raft.install_snapshot_timeout_ms` if you have a slow link.
### Cold flush failures (writes succeed but data isn't reaching S3)

Check the logs for cold-flush errors. Most common:

- S3 bucket doesn't exist or is in the wrong region.
- IAM permissions missing. Required: `s3:GetObject`, `s3:PutObject`, `s3:ListBucket`, `s3:DeleteObject`.
- `s3_endpoint` set to a value the host can't reach (typo, or a private endpoint not on the route).

Hot-tier-only operation will eventually trigger `hot_bytes` backpressure, so this masquerades as a write-path issue once the buffer fills.
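Quick checks from the affected node, assuming the AWS CLI is installed; `$BUCKET` is your cold-storage bucket, and the `--endpoint-url` flag only applies if `s3_endpoint` is set:

```bash
# Who is this node authenticating as, and can it list/write the cold bucket?
aws sts get-caller-identity
aws s3 ls "s3://$BUCKET"
echo probe | aws s3 cp - "s3://$BUCKET/ursula-probe" && aws s3 rm "s3://$BUCKET/ursula-probe"

# If s3_endpoint is set, add --endpoint-url "$S3_ENDPOINT" to the commands above
# and confirm the host can reach it at all:
curl -s -o /dev/null -w '%{http_code}\n' "$S3_ENDPOINT"
```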
## Read path

### 410 Gone with `stream-earliest-offset` header

The requested offset has been trimmed by a snapshot. The response includes the earliest still-available offset. Two options:

- Use `/bootstrap` to fetch the latest snapshot plus retained updates, then continue from there.
- Seek forward to `stream-earliest-offset` and accept the gap.

Either way, do not request the original offset again. It's gone.
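To see what the server is advertising, replay the failing request and keep only the response headers; `$READ_URL` stands for whatever read returned the 410:

```bash
# Capture the earliest offset still available for this stream.
curl -s -D - -o /dev/null "$READ_URL" | grep -i '^stream-earliest-offset:'
```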
### 404 Not Found on read of a previously valid stream

Either the stream was deleted, or its `Stream-TTL` / `Stream-Expires-At` expired. TTL is set at creation; expired streams return 404 rather than 410.
### SSE connections drop after ~30–60 seconds
A proxy or load balancer is closing idle connections faster than Ursula's keepalive heartbeat. Either:
- Raise the proxy's idle timeout above the heartbeat interval, or
- Place a TCP-level keepalive on the connection.
Heartbeats are emitted as SSE comment lines, so a working connection should show traffic even with no data events.
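To confirm heartbeats make it through the proxy, tail the stream with buffering disabled; `$SSE_URL` is whatever live-tail URL your client uses:

```bash
# Comment lines (starting with ':') should appear periodically even with no data events.
curl -N -s -H 'Accept: text/event-stream' "$SSE_URL"
```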
### Live tail seems "stuck" but the stream is being appended to

The reader is connected to a follower that's lagging behind. The `Stream-Up-To-Date` response header indicates whether the serving node has caught up. To force leader-side reads, target the leader directly (its ID is in `/cluster/status` under `leader_id`).
## Replication and consensus

### `/cluster/status.leader_id` is null for more than a few seconds

The cluster has no leader. Either an election is in progress or quorum is lost. Check `ursula_raft_current_leader` across nodes (a loop is sketched below):

- If multiple nodes briefly claim leadership, the election is flapping, likely due to network or clock issues. Inspect `ursula_raft_rpc_*` for high RPC durations.
- If no node claims leadership, fewer than ⌊n/2⌋+1 voters are reachable. Restore reachability or remove unreachable voters from membership.
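Checking the metric everywhere makes the two cases easy to tell apart; `NODES` is a placeholder:

```bash
NODES="node-a node-b node-c"      # placeholder: all voters
for n in $NODES; do
  echo -n "$n: "
  curl -s "http://$n:4437/metrics" | grep '^ursula_raft_current_leader'
done
# All zeros => no node sees a leader (quorum problem); values changing between runs => flapping election.
```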
### Snapshot install times out

`ursula_raft_snapshot_duration_seconds{result="error"}` ticks up; the affected follower stays behind.

- Network bandwidth between leader and follower is too low to transfer the snapshot within `raft.install_snapshot_timeout_ms`.
- Cold storage on the follower is unreachable (the snapshot transfer sends a blob reference; the follower pulls from cold directly).

Raise the timeout, or fix the underlying network / cold-storage path.
### `ursula_raft_apply_lag_entries` grows unbounded on a follower

That follower can't apply log entries as fast as the leader produces them. Causes:

- Disk pressure on the follower. `iostat` on the data volume will show it.
- CPU saturation. The follower's state-machine apply runs in user space.
- A pathologically large single stream that's flushing slowly.

Trigger a snapshot on the leader if the lag exceeds what catch-up can recover, or take the follower out of the voter set until it catches up.
## Still stuck?

Open a GitHub issue with the following (a collection sketch follows the list):

- `/cluster/status` output from every node
- `curl http://NODE:4437/metrics | grep -E 'ursula_raft_|ursula_write_backpressure|ursula_rehydrate'`
- The last ~200 log lines at `RUST_LOG=ursula=debug`
- Your config (with secrets redacted)
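A rough way to gather most of that bundle; the node list, systemd unit name, and output paths are placeholders:

```bash
NODES="node-a node-b node-c"      # placeholder
OUT=ursula-report; mkdir -p "$OUT"
for n in $NODES; do
  curl -s "http://$n:4437/cluster/status" > "$OUT/$n-status.json"
  curl -s "http://$n:4437/metrics" \
    | grep -E 'ursula_raft_|ursula_write_backpressure|ursula_rehydrate' > "$OUT/$n-metrics.txt"
done
# Logs: assumes journald and a unit named 'ursula'; adjust for your setup.
journalctl -u ursula -n 200 --no-pager > "$OUT/$(hostname)-logs.txt"
tar czf ursula-report.tgz "$OUT"
```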