Troubleshooting
Start at /__ursula/metrics. Most operational symptoms have a clear signal in the JSON snapshot.
Diagnostic surface
curl http://NODE:4437/__ursula/metrics | jq .
Useful jq selectors:
| What you need | jq filter |
|---|---|
| Per-group leader / term | `.raft_groups[] |
| Hot bytes per group | `.raft_groups[] |
| Cold backpressure events | .cold_backpressure_events_total |
| Per-core mailbox depth | `.cores[] |
| Live-read watcher counts | `.cores[] |
| HTTP error counters | .http.responses_by_status |
Ursula does not expose /healthz, /readyz, Prometheus /metrics, or /cluster/status. The JSON snapshot is the source of truth.
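If you would rather script the check than eyeball jq output, a minimal Python sketch along these lines may help. It assumes `raft_groups` is a list of objects that each carry a `leader_id` field, as the selectors above suggest; `fetch_snapshot` and `leaderless_groups` are hypothetical helper names, not part of Ursula.

```python
import json
from urllib.request import urlopen


def fetch_snapshot(base_url, timeout=5.0):
    """GET /__ursula/metrics from one node and parse the JSON body."""
    with urlopen(f"{base_url}/__ursula/metrics", timeout=timeout) as resp:
        return json.load(resp)


def leaderless_groups(snapshot):
    """Return the raft_groups entries whose leader_id is 0 or missing."""
    # Assumption: each entry in raft_groups is an object with a leader_id field.
    return [g for g in snapshot.get("raft_groups", []) if not g.get("leader_id")]
```

Typical use would be `leaderless_groups(fetch_snapshot("http://NODE:4437"))`, run once per node.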
Startup failures
raft.wal.path is required when WAL backend is 'disk'
Set raft.wal.path when raft.wal.backend = "disk". Use raft.wal.backend = "memory" for volatile tests.
raft.node_id must be non-zero
Static Raft mode requires a unique raft.node_id in the config file or a --node-id override.
raft.peers must include this node id
The local raft.node_id must appear in [[raft.peers]]. The peer list always includes self.
storage.cold.s3.bucket is required when cold backend is 's3'
Set storage.cold.s3.bucket, or set storage.cold.backend = "none" if you don't want cold storage.
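The four configuration checks above can be run against a parsed config before starting a node. The sketch below is illustrative only: it takes the config as a nested dict, ignores CLI overrides such as --node-id, only looks at values set explicitly (it does not know Ursula's defaults), and assumes each [[raft.peers]] entry stores its id under a `node_id` key, which the text above does not confirm.

```python
def check_config(cfg):
    """Return the startup errors this config would likely trigger."""
    errors = []
    raft = cfg.get("raft", {})
    wal = raft.get("wal", {})
    cold = cfg.get("storage", {}).get("cold", {})

    if wal.get("backend") == "disk" and not wal.get("path"):
        errors.append("raft.wal.path is required when WAL backend is 'disk'")

    node_id = raft.get("node_id", 0)
    if node_id == 0:
        errors.append("raft.node_id must be non-zero")

    # Assumption: each [[raft.peers]] entry carries its id under "node_id".
    peer_ids = {peer.get("node_id") for peer in raft.get("peers", [])}
    if node_id not in peer_ids:
        errors.append("raft.peers must include this node id")

    if cold.get("backend") == "s3" and not cold.get("s3", {}).get("bucket"):
        errors.append("storage.cold.s3.bucket is required when cold backend is 's3'")

    return errors
```

An empty list means none of these four checks fired. It says nothing about ports or directory permissions.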
Port already in use
server.listen is taken. Public HTTP and inter-node gRPC share the same port unless server.cluster_listen is configured separately.
I/O error on raft.wal.path
The directory must be writable. There is no lock file to clear; Ursula trusts the directory.
Bootstrap and cluster join
A fresh cluster never elects leaders
/__ursula/metrics on each node should show non-zero leader_id and a stable current_term. If groups stay leaderless:
- First start must include raft.init_membership_per_group = true (or raft.init_membership = true) on every voting node.
- Every peer URL must be reachable from every node.
- All peers must use the same raft.wal.backend mode. A mismatch causes silent join failures.
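One way to check for "non-zero leader_id and a stable current_term" is to take two snapshots from the same node a few seconds apart and compare them. The sketch below assumes each `raft_groups` entry exposes `leader_id` and `current_term` plus an identifier under a `group_id` key. That last key name is a guess based on the {group_id} URL parameter used elsewhere in this page.

```python
def unsettled_groups(before, after):
    """Compare two snapshots from one node taken a few seconds apart.

    Returns {group_id: reason} for groups that have no leader or whose
    term changed between the two samples.
    """
    # Assumption: raft_groups entries have group_id, leader_id, current_term.
    earlier_terms = {g["group_id"]: g["current_term"] for g in before["raft_groups"]}
    problems = {}
    for g in after["raft_groups"]:
        gid = g["group_id"]
        if g["leader_id"] == 0:
            problems[gid] = "no leader"
        elif earlier_terms.get(gid) != g["current_term"]:
            problems[gid] = "term changed"
    return problems
```

A single term change right after startup is normal. A group that keeps showing up across repeated comparisons is the one to investigate.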
Replacing or restarting a node leaves it leaderless
Restart with the same raft.node_id and peer list. A fresh raft.wal.path means rehydrating from peers, which can be slow on cold-storage-only history.
Write path
503 with Retry-After on append
Cold-write backpressure for that group. Either the hot ring exceeds storage.cold.max_hot_size_per_group (cold flush isn't keeping up), or the cold backend is slow or erroring out.
Raise the per-group ceiling, lower storage.cold.flush_interval, raise storage.cold.flush_max_concurrency, or fix the cold backend.
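On the client side, a 503 with Retry-After is a signal to wait and resend rather than fail. A minimal sketch of the delay calculation follows. It assumes Retry-After carries whole seconds (the HTTP-date form falls through to the backoff branch), and the cap and backoff constants are arbitrary illustration values, not Ursula recommendations.

```python
def retry_delay(status, headers, attempt, cap=30.0):
    """Seconds to wait before retrying an append, or None if no retry applies."""
    if status != 503:
        return None
    value = headers.get("Retry-After", "").strip()
    if value.isdigit():
        # Server-provided delay in seconds, bounded by the local cap.
        return min(float(value), cap)
    # Fallback: exponential backoff of 0.5 s, 1 s, 2 s, ... up to the cap.
    return min(0.5 * 2 ** attempt, cap)
```

`headers` is any mapping of response headers. With a plain dict the lookup is case-sensitive, so normalize header names first if your HTTP client does not.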
409 Conflict with producer-expected-seq / producer-received-seq
Out-of-order producer headers. Response tells you what was expected:
- producer-expected-seq: N is the next allowed sequence
- producer-received-seq: M is what the client sent
Common causes:
- Producer restart without bumping Producer-Epoch. Bump every restart.
- Two writers share the same Producer-Id. Use distinct IDs.
- A retry skipped a sequence number. Producer sequences must be contiguous within an epoch.
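The three causes above all come down to client-side bookkeeping. The sketch below shows one way a producer might track it. Several details are assumptions: the `Producer-Seq` header name, sequences restarting at 0 in a new epoch, and the `ProducerState` class itself, which is hypothetical. In practice the epoch would need to be persisted so that a restart can increment it.

```python
from dataclasses import dataclass


@dataclass
class ProducerState:
    producer_id: str  # must be unique per writer
    epoch: int = 0
    next_seq: int = 0

    def restart(self):
        """Start a new epoch after a process restart."""
        self.epoch += 1
        self.next_seq = 0  # assumption: sequences restart at 0 in each epoch

    def headers(self):
        """Headers for the next append; a retry reuses the same values."""
        return {
            "Producer-Id": self.producer_id,
            "Producer-Epoch": str(self.epoch),
            "Producer-Seq": str(self.next_seq),  # assumed header name
        }

    def ack(self):
        """Advance by exactly one, and only after the server accepted the append."""
        self.next_seq += 1
```

Because `ack()` is the only place the sequence advances, a retried request resends the same sequence number instead of skipping one.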
409 Conflict from Stream-Seq
The supplied Stream-Seq is not lexicographically greater than the last accepted value. Re-read and re-derive, or use producer dedup instead.
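Lexicographic comparison is a common source of surprise when Stream-Seq values are derived from integers, because "10" sorts before "9". A small sketch of one workaround, assuming your Stream-Seq values are decimal counters that you are free to zero-pad (`stream_seq` is a hypothetical helper):

```python
def stream_seq(counter, width=20):
    """Zero-pad a counter so string order matches numeric order."""
    return str(counter).zfill(width)


assert "10" < "9"                      # unpadded: string order disagrees with numeric order
assert stream_seq(9) < stream_seq(10)  # padded: string order matches numeric order
```

Pick a width that comfortably covers the largest counter you expect, since changing it later breaks the ordering against older values.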
404 Not Found on POST /{bucket}/{stream}
The stream hasn't been created. Create with PUT /{bucket}/{stream} first; there is no implicit creation on POST.
400 Bad Request on JSON appends
JSON streams are parsed and normalized on the server. Malformed JSON or empty arrays without allow_empty_array are rejected. Send valid JSON or switch to application/octet-stream.
Read path
Stream-Up-To-Date: false
More committed data exists past this response. Page forward with Stream-Next-Offset. Not an error.
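A catch-up read is then a loop that follows Stream-Next-Offset until the server reports it is up to date. The sketch below keeps the HTTP call out of the picture: `fetch` is a caller-supplied function that performs the GET for a given offset and returns `(headers, body)`. How the offset is passed to the server is not covered on this page, so it is left to that function.

```python
def read_all(fetch, offset):
    """Page forward until the server reports Stream-Up-To-Date: true.

    `fetch(offset)` is supplied by the caller and returns (headers, body).
    Returns the list of bodies and the offset to resume from.
    """
    chunks = []
    while True:
        headers, body = fetch(offset)
        chunks.append(body)
        next_offset = headers.get("Stream-Next-Offset")
        up_to_date = headers.get("Stream-Up-To-Date", "").lower() == "true"
        # Also stop if the server gives no new offset, to avoid looping forever.
        if up_to_date or next_offset is None or next_offset == offset:
            return chunks, next_offset or offset
        offset = next_offset
```

The returned offset is the place to resume from on the next poll or live read.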
SSE drops after 30-60 s idle
A proxy or load balancer is closing idle TCP connections. Raise its idle timeout or enable TCP keepalive on the proxy-to-Ursula leg. Ursula doesn't emit SSE keepalive comments.
SSE on a binary stream returns base64 text
Expected. SSE wire format is text-only, so binary stream data events carry raw base64 text and the Stream-Sse-Data-Encoding: base64 header signals it. See binary SSE.
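A client can branch on that header to recover the original bytes. A minimal sketch, where `data` is the text of one SSE data field and `headers` is the response header mapping (`decode_sse_data` is a hypothetical helper name):

```python
import base64


def decode_sse_data(data, headers):
    """Return the payload bytes for one SSE data field."""
    if headers.get("Stream-Sse-Data-Encoding", "").lower() == "base64":
        # Binary stream: the event text is base64 and must be decoded.
        return base64.b64decode(data)
    # Text stream: the event text is the payload itself.
    return data.encode("utf-8")
```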
Reads stall for hundreds of ms for cold offsets
The first read of a cold offset triggers an S3 GetObject range read. It runs off the actor turn so it doesn't block other commands, but the request still pays S3 latency.
Replication and consensus
A group has no leader (leader_id == 0)
Either an election is in progress or quorum is lost for that group. Only streams hashed to that group are affected.
- Multiple nodes briefly claim leader: election is flapping. Check peer reachability and CPU pressure.
- No node claims leader: fewer than n/2+1 voters are reachable. Restore reachability.
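The n/2+1 threshold uses integer division, which is worth spelling out for even-sized clusters. A short sketch of the arithmetic (function names are illustrative):

```python
def quorum(voters):
    """Smallest majority of `voters`: floor(n / 2) + 1."""
    return voters // 2 + 1


def can_elect(voters, reachable):
    """True if enough voters are reachable to elect a leader."""
    return reachable >= quorum(voters)
```

A 3-voter group needs 2 reachable voters and a 5-voter group needs 3. A 4-voter group also needs 3, so it tolerates only one failure, the same as a 3-voter group.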
One follower lags behind
State-machine apply runs on the follower's owner core. If CPU is pinned, last_applied trails. Check:
- Per-core mailbox depth (sustained queueing means saturation).
- Disk pressure if raft.wal.backend = "disk" (fsync latency).
- A pathological hot stream forcing constant cold flushes (hot_bytes_total per group).
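To see which follower trails and by how much, you can compare last_applied for the same group across nodes. The sketch below assumes last_applied is reported inside each `raft_groups` entry and that entries carry a `group_id` key. Both are guesses about the snapshot layout, so adjust them to what your nodes return.

```python
def apply_lag(snapshots):
    """Map group_id -> {node: entries behind the most advanced node}.

    `snapshots` maps a node name to its parsed /__ursula/metrics JSON.
    """
    applied = {}
    for node, snap in snapshots.items():
        for g in snap["raft_groups"]:
            # Assumption: group_id and last_applied live on each group entry.
            applied.setdefault(g["group_id"], {})[node] = g["last_applied"]
    return {
        gid: {node: max(per_node.values()) - idx for node, idx in per_node.items()}
        for gid, per_node in applied.items()
    }
```

Snapshots are not taken at the same instant, so small non-zero values are noise. A lag that grows across repeated samples is the signal.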
Manually trigger a snapshot
curl -X POST http://NODE:4437/__ursula/raft/{group_id}/snapshot
curl -X POST http://NODE:4437/__ursula/raft/{group_id}/purge
Purge only after the snapshot replicates. Otherwise a slow follower may need log entries you just dropped.
Cold flush issues
Writes succeed but data never reaches S3
- storage.cold.backend is none, or set to memory when you expected s3.
- IAM permissions missing. Required: s3:GetObject, s3:PutObject, s3:ListBucket, s3:DeleteObject.
- storage.cold.s3.endpoint unreachable.
- Cold flush worker stalled (cold_backpressure_events_total climbing).
If hot bytes grow unbounded, you'll eventually hit storage.cold.max_hot_size_per_group and writes start returning 503. Fix the backend before that.
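A cheap early warning is to diff two snapshots from the same node and flag a rising backpressure counter or growing hot bytes. The sketch below assumes `hot_bytes_total` sits on each `raft_groups` entry next to a `group_id` key (the key name is a guess). `cold_backpressure_events_total` is read from the top level, as in the selector table above.

```python
def cold_flush_warnings(before, after):
    """Compare two snapshots from one node and list cold-flush warning signs."""
    warnings = []
    delta = after["cold_backpressure_events_total"] - before["cold_backpressure_events_total"]
    if delta > 0:
        warnings.append(f"{delta} new cold backpressure events")

    # Assumption: each group entry has group_id and hot_bytes_total.
    earlier = {g["group_id"]: g["hot_bytes_total"] for g in before["raft_groups"]}
    for g in after["raft_groups"]:
        if g["hot_bytes_total"] > earlier.get(g["group_id"], 0):
            warnings.append(
                f"group {g['group_id']}: hot bytes grew to {g['hot_bytes_total']}"
            )
    return warnings
```

Hot bytes rising between two samples is normal under write load. Treat it as a problem only when it keeps rising across many samples without ever dropping after a flush.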
Still stuck?
Open a GitHub issue with:
- /__ursula/metrics from every node (with timestamps)
- Config file and CLI overrides per node (redact credentials)
- Last ~200 log lines at RUST_LOG=ursula_runtime=debug,ursula_raft=debug,ursula=info