Troubleshooting

Common Ursula failure modes, what they look like, how to diagnose them, and what to do.

When something goes wrong, start with the three diagnostic endpoints rather than guessing from logs. Most operational symptoms have a clear signal in /cluster/status or /metrics.

Diagnostic endpoints

# Membership, leader, log/apply indices, payload readiness
curl http://NODE:4437/cluster/status

# Prometheus series for HTTP, Raft, snapshot, rehydrate, backpressure
curl http://NODE:4437/metrics

# Readiness: distinguishes local-recovery state from cluster-quorum state
curl http://NODE:4437/readyz/local     # local payload recovered?
curl http://NODE:4437/readyz/cluster   # voters set AND leader known?

Key fields in /cluster/status:

Field: what it tells you

  • state: initialized once membership is set; uninitialized means bootstrap hasn't happened on any reachable peer
  • is_leader, leader_id: who's driving replication right now
  • voters, learners: cluster composition
  • last_log_index, last_applied_index: replication progress; the gap between them is apply lag
  • local_ready, payload_incomplete: hot-payload rehydration state on this node
  • membership_change_in_progress: add/remove-voter operation in flight
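The two derived numbers you usually want from that response are the apply lag and the quorum size. A minimal sketch, assuming the field names in the table above and a JSON-decoded status body (this is illustrative, not an official client):

```python
# Sketch: derive apply lag and quorum size from a parsed /cluster/status body.
# Field names follow the table above.

def apply_lag(status: dict) -> int:
    """Entries committed to the log but not yet applied on this node."""
    return status["last_log_index"] - status["last_applied_index"]

def quorum_size(status: dict) -> int:
    """Voters needed for progress: a strict majority of the voter set."""
    return len(status["voters"]) // 2 + 1
```

If `apply_lag` keeps growing, see the replication section below; if fewer than `quorum_size` voters are reachable, writes will stall.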

Key Prometheus series:

Metric: use for

  • ursula_raft_apply_lag_entries: how far this node is behind committed log entries
  • ursula_raft_current_leader: 0 when no leader is known
  • ursula_write_backpressure_active: 1 when 429s are being issued
  • ursula_write_backpressure_reason_active{reason}: which threshold tripped (hot_bytes, apply_lag, …)
  • ursula_raft_rpc_*: peer RPC health; start here when replication stalls
  • ursula_raft_snapshot_*: snapshot install volume and duration
  • ursula_rehydrate_*: hot-payload rehydrate pulls (post-restart)
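When scripting against /metrics, the backpressure-reason gauge is the one that needs a little parsing. A hedged sketch that pulls the active reasons out of raw Prometheus exposition text, assuming the series name above and a single `reason` label:

```python
# Sketch: list the backpressure reasons currently active, given the raw
# text of /metrics. Assumes standard Prometheus exposition format.

def active_backpressure_reasons(metrics_text: str) -> list[str]:
    reasons = []
    for line in metrics_text.splitlines():
        if line.startswith("ursula_write_backpressure_reason_active{"):
            labels, _, value = line.partition("} ")
            if value.strip() and float(value) > 0:
                # labels looks like: ...active{reason="hot_bytes"
                reasons.append(labels.split('reason="', 1)[1].rstrip('"'))
    return reasons
```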

Startup failures

Process exits immediately with config error

Common causes:

  • Missing $ENV_VAR referenced in [cold]. Example error: URSULA_AWS_ACCESS_KEY_ID is not set. Ursula fails fast rather than running with a half-set credential. Export the variable or use the AWS SDK credential chain (omit the explicit s3_* keys).
  • cold.backend = "fs" in multi-node mode. Cluster mode requires shared cold storage. Switch to s3 or remove cluster.routes for single-node.
  • bootstrap_node_id missing in multi-node config. When cluster.routes is non-empty, every node must set cluster.bootstrap_node_id explicitly.
  • election_timeout_max_ms ≤ election_timeout_min_ms. Raft needs a non-empty randomization window, so max must be strictly greater than min.
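Several of these can be caught before the process ever starts. A minimal preflight sketch over a parsed config dict; the key paths mirror the option names in the bullets above (the `raft` section name for the election timeouts is an assumption), and this is not an official tool:

```python
# Hypothetical preflight validator mirroring the startup failure modes above.
# cfg is the parsed config as nested dicts.

def preflight(cfg: dict) -> list[str]:
    problems = []
    multi_node = bool(cfg.get("cluster", {}).get("routes"))
    if multi_node and cfg.get("cold", {}).get("backend") == "fs":
        problems.append("cluster mode requires shared cold storage (use s3, not fs)")
    if multi_node and not cfg.get("cluster", {}).get("bootstrap_node_id"):
        problems.append("cluster.bootstrap_node_id must be set when cluster.routes is non-empty")
    raft = cfg.get("raft", {})  # section name assumed
    if raft.get("election_timeout_max_ms", 1) <= raft.get("election_timeout_min_ms", 0):
        problems.append("election_timeout_max_ms must exceed election_timeout_min_ms")
    return problems
```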

Process exits with I/O error on data directory

The server.data_dir must be writable. Check ownership and permissions on the directory and its subpaths (wal/, meta/, snapshots/, cold_cache/). If you're running under systemd, confirm ReadWritePaths= covers the data dir.

Process panics with RocksDB lock-held error

Another ursula process is still holding the RocksDB lock on server.data_dir. Confirm with lsof / fuser. If the previous process crashed, the lock file may need manual removal, but check the on-disk state isn't being concurrently used first.

Port already in use

server.listen is bound by another process. Pick another port or stop the conflicting one. Note that the public HTTP and inter-node gRPC share a single listener.

Bootstrap and cluster join

/readyz/cluster returns 503 indefinitely on a new cluster

Check /cluster/status. If state is uninitialized and voters is empty:

  • Only the node whose cluster.bootstrap_node_id matches its own node_id is allowed to initialize the cluster. Confirm exactly one node has that match.
  • The bootstrap node needs reachable routes to its declared peers. From the bootstrap node, curl http://PEER:4437/healthz should succeed.
  • If the bootstrap node is reachable but other peers can't reach it, fix the server.advertise value; it must be the address peers should connect to, not the loopback.
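The first bullet's "exactly one match" rule is easy to verify mechanically across your node configs. A small sketch, assuming each node's config is reduced to its node_id and bootstrap_node_id (illustrative only):

```python
# Sketch: check that exactly one node will attempt cluster bootstrap.
# nodes: [{"node_id": ..., "bootstrap_node_id": ...}, ...]

def bootstrap_check(nodes: list[dict]) -> str:
    matches = [n["node_id"] for n in nodes if n["node_id"] == n["bootstrap_node_id"]]
    if len(matches) == 1:
        return f"ok: {matches[0]} will bootstrap"
    if not matches:
        return "error: no node matches its bootstrap_node_id; the cluster will never initialize"
    return f"error: multiple bootstrap candidates: {matches}"
```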

New node added via /cluster/add-learner never becomes voter

Check the learner's /cluster/status:

  • payload_incomplete: true means hot-payload rehydration isn't finished. Wait. Adding a voter while the learner is still rehydrating risks losing quorum if another node fails simultaneously.
  • local_ready: false after a long wait usually points to a problem reading from cold storage; check S3 permissions and ursula_rehydrate_requests_total{result} for failures.
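If you automate promotion (by whatever mechanism your deployment uses), gate it on the two fields above. A minimal sketch over the learner's parsed /cluster/status:

```python
# Sketch: only promote a learner to voter once rehydration has finished,
# per the bullets above. Defaults are pessimistic if fields are missing.

def safe_to_promote(learner_status: dict) -> bool:
    return (learner_status.get("local_ready", False)
            and not learner_status.get("payload_incomplete", True))
```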

Rolling restart of a follower causes prolonged unavailability

After restart, the node must rehydrate hot payloads from cold storage before serving. Wait for /readyz/cluster on the restarted node before restarting the next one. Track progress with ursula_rehydrate_bytes_total and ursula_rehydrate_duration_seconds.

Write path

429 Too Many Requests with Retry-After

Backpressure is active. Which threshold triggered determines the fix. Check ursula_write_backpressure_reason_active{reason}:

Reason: meaning, and fix

  • hot_bytes: the global hot tier exceeds FDS_V3_WRITE_BACKPRESSURE_MAX_HOT_BYTES. Fix: cold flush isn't keeping up; check S3 / cold backend health.
  • stream_hot_bytes: a single stream exceeds its per-stream cap. Fix: slow down writes on that stream, or raise FDS_V3_WRITE_BACKPRESSURE_MAX_STREAM_HOT_BYTES.
  • flush_backlog_bytes: the flush queue is too large. Fix: same as hot_bytes; the cold backend is the bottleneck.
  • apply_lag: state-machine apply lag is too large. Fix: followers can't keep up; investigate Raft RPC and CPU pressure.

Honor Retry-After. Hammering the server during backpressure will not get the write through faster.
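A client-side retry loop that honors Retry-After might look like this; `do_append` stands in for your actual HTTP call and returns (status, headers), and `sleep` is injectable so the loop is testable (a sketch, not an official client):

```python
import time

# Sketch: honor Retry-After on 429 instead of hammering the server.
# do_append() -> (status_code, headers_dict); headers keys as in the doc.

def append_with_backoff(do_append, max_attempts: int = 5, sleep=time.sleep):
    for _ in range(max_attempts):
        status, headers = do_append()
        if status != 429:
            return status
        # Server tells us exactly how long to wait; default 1s if absent.
        sleep(float(headers.get("Retry-After", 1)))
    return 429
```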

409 Conflict with producer_seq_conflict

The producer sent the same Producer-Id with an unexpected Producer-Epoch / Producer-Seq. The response body contains expected_seq. Common patterns:

  • A client crashed and restarted without incrementing Producer-Epoch. Bump epoch on restart.
  • Two writers share the same Producer-Id. Each writer needs a distinct ID.
  • A retry skipped a sequence number. Producer sequences must be contiguous within an epoch.
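The bookkeeping those bullets imply is small. A hedged sketch of per-writer producer state, using the header names above (the class itself is illustrative; how you persist the epoch across restarts is up to you):

```python
# Sketch: one writer per Producer-Id, epoch bumped on every restart,
# sequence numbers contiguous within an epoch.

class ProducerSession:
    def __init__(self, producer_id: str, epoch: int):
        self.producer_id = producer_id
        self.epoch = epoch  # bump this each time the process restarts
        self.seq = 0        # contiguous within an epoch; no gaps

    def headers(self) -> dict:
        h = {
            "Producer-Id": self.producer_id,
            "Producer-Epoch": str(self.epoch),
            "Producer-Seq": str(self.seq),
        }
        self.seq += 1       # next append uses the next sequence number
        return h
```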

412 Precondition Failed on conditional write

Someone else wrote to the stream since you last fetched the ETag. Re-read with HEAD, re-derive the new state, retry with the updated If-Match.
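That read-modify-write loop generalizes to any conditional write. A minimal compare-and-set sketch with injected I/O (`fetch` and `put` stand in for your HEAD and conditional POST; not an official client):

```python
# Sketch: retry a conditional write after 412 by re-reading and
# re-deriving state, as described above.
# fetch() -> (etag, state); put(new_state, if_match_etag) -> status code.

def cas_update(fetch, put, transform, max_attempts: int = 3):
    for _ in range(max_attempts):
        etag, state = fetch()
        status = put(transform(state), etag)
        if status != 412:
            return status
    raise RuntimeError("conditional write kept losing the race; giving up")
```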

404 Not Found on POST /{bucket}/{stream}

The bucket doesn't exist or the stream hasn't been created. PUT /{bucket} and PUT /{bucket}/{stream} first, then append. PUT on existing resources is idempotent.

503 on writes with growing apply_lag

Followers can't apply log entries fast enough. Causes:

  • Quorum lost (some voters down). /cluster/status will show fewer than ⌊n/2⌋+1 reachable voters. Bring the downed nodes back or formally remove them with /cluster/remove-node.
  • Disk or CPU pressure on followers. ursula_raft_rpc_duration_seconds will be elevated on the affected peer.
  • Snapshot install in progress for a follower. Wait, or raise raft.install_snapshot_timeout_ms if you have a slow link.

Cold flush failures (writes succeed but data isn't reaching S3)

Check the logs for cold-flush errors. Most common:

  • S3 bucket doesn't exist or wrong region.
  • IAM permissions missing. Required: s3:GetObject, s3:PutObject, s3:ListBucket, s3:DeleteObject.
  • s3_endpoint set to a value the host can't reach (typo, private endpoint not on the route).

Hot-tier-only operation will eventually trigger hot_bytes backpressure, so this masquerades as a write-path issue once the buffer fills.
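The permissions bullet above translates into an IAM policy along these lines; YOUR_BUCKET is a placeholder, and note that s3:ListBucket applies to the bucket ARN while the object actions apply to objects under it:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::YOUR_BUCKET/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::YOUR_BUCKET"
    }
  ]
}
```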

Read path

410 Gone with stream-earliest-offset header

The requested offset has been trimmed by a snapshot. The response includes the earliest still-available offset. Two options:

  • Use /bootstrap to fetch the latest snapshot plus retained updates, then continue from there.
  • Seek forward to stream-earliest-offset and accept the gap.

Either way, do not request the original offset again. It's gone.
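For readers that take the second option, the recovery rule is mechanical. A small sketch, assuming the response header name above and lowercase header keys (the bootstrap-vs-seek choice stays with the caller; this sketch just seeks):

```python
# Sketch: after a 410, resume from the earliest retained offset rather
# than retrying the trimmed one.

def next_read_offset(status: int, headers: dict, requested: int) -> int:
    if status == 410:
        return int(headers["stream-earliest-offset"])
    return requested
```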

404 Not Found on read of a previously valid stream

Either the stream was deleted, or its Stream-TTL / Stream-Expires-At expired. TTL is set at creation; expired streams return 404 rather than 410.

SSE connections drop after ~30–60 seconds

A proxy or load balancer is closing idle connections faster than Ursula's keepalive heartbeat. Either:

  • Raise the proxy's idle timeout above the heartbeat interval, or
  • Place a TCP-level keepalive on the connection.

Heartbeats are emitted as SSE comment lines, so a working connection should show traffic even with no data events.
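If the proxy in front of Ursula happens to be nginx, the relevant knobs look roughly like this. The location path and upstream name are placeholders; the point is that the read timeout must exceed the heartbeat interval and buffering must be off so heartbeats actually reach the client:

```nginx
location /streams/ {                 # placeholder path
    proxy_pass http://ursula_upstream;   # placeholder upstream
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_buffering off;             # don't hold SSE events in the proxy
    proxy_read_timeout 300s;         # must exceed the heartbeat interval
}
```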

Live tail seems "stuck" but the stream is being appended to

The reader is connected to a follower that's lagging behind. The Stream-Up-To-Date response header indicates whether the serving node has caught up. To force leader-side reads, target the leader directly (its ID is in /cluster/status.leader_id).

Replication and consensus

/cluster/status.leader_id is null for more than a few seconds

The cluster has no leader. Either an election is in progress or quorum is lost. Check ursula_raft_current_leader across nodes:

  • If multiple nodes briefly claim leader, election is flapping. Likely network or clock issues. Inspect ursula_raft_rpc_* for high RPC durations.
  • If no node claims leader, fewer than ⌊n/2⌋+1 voters are reachable. Restore reachability or remove unreachable voters from membership.

Snapshot install times out

ursula_raft_snapshot_duration_seconds{result="error"} ticks up; the affected follower stays behind.

  • Network bandwidth between leader and follower is too low for the snapshot size within raft.install_snapshot_timeout_ms.
  • Cold storage on the follower is unreachable (snapshot transfer sends a blob reference; the follower pulls from cold directly).

Raise the timeout, or fix the underlying network / cold-storage path.

ursula_raft_apply_lag_entries grows unbounded on a follower

That follower can't apply log entries as fast as the leader produces them. Causes:

  • Disk pressure on the follower. iostat on the data volume will show it.
  • CPU saturation. The follower's state machine apply runs in user space.
  • A pathologically large single stream that's flushing slowly.

Trigger a snapshot on the leader if the lag exceeds what catch-up can recover, or take the follower out of the voter set until it catches up.

Still stuck?

Open a GitHub issue with:

  • /cluster/status output from every node
  • curl http://NODE:4437/metrics | grep -E 'ursula_raft_|ursula_write_backpressure|ursula_rehydrate'
  • The last ~200 log lines at RUST_LOG=ursula=debug
  • Your config (with secrets redacted)