Troubleshooting

Start at /__ursula/metrics. Most operational symptoms have a clear signal in the JSON snapshot.

Diagnostic surface

curl http://NODE:4437/__ursula/metrics | jq .

Useful jq selectors:

What you need	jq filter
Per-group leader / term	`.raft_groups[]
Hot bytes per group	`.raft_groups[]
Cold backpressure events	`.cold_backpressure_events_total`
Per-core mailbox depth	`.cores[]
Live-read watcher counts	`.cores[]
HTTP error counters	`.http.responses_by_status`

Ursula does not expose /healthz, /readyz, Prometheus /metrics, or /cluster/status. The JSON snapshot is the source of truth.
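
To script the same check, here is a minimal Python sketch that pulls the snapshot and condenses the top-level counters. It assumes `.http.responses_by_status` is a mapping from status code to count; the helper names `fetch_metrics`, `server_errors`, and `summarize` are illustrative and not part of Ursula.

```python
import json
from urllib.request import urlopen


def fetch_metrics(node, timeout=5.0):
    """Fetch and parse the JSON snapshot from one node given as host:port."""
    with urlopen(f"http://{node}/__ursula/metrics", timeout=timeout) as resp:
        return json.load(resp)


def server_errors(snapshot):
    """Sum 5xx responses. Assumes responses_by_status maps status code -> count."""
    by_status = snapshot.get("http", {}).get("responses_by_status", {})
    return sum(n for status, n in by_status.items() if str(status).startswith("5"))


def summarize(snapshot):
    """Condense one snapshot into the counters most symptoms start from."""
    return {
        "cold_backpressure_events_total": snapshot.get("cold_backpressure_events_total", 0),
        "http_5xx": server_errors(snapshot),
        "raft_groups": len(snapshot.get("raft_groups", [])),
        "cores": len(snapshot.get("cores", [])),
    }
```

Typical use would be `summarize(fetch_metrics("NODE:4437"))` against each node in turn.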

Startup failures

`raft.wal.path is required when WAL backend is 'disk'`

Set raft.wal.path when raft.wal.backend = "disk". Use raft.wal.backend = "memory" for volatile tests.

`raft.node_id must be non-zero`

Static Raft mode requires a unique raft.node_id in the config file or a --node-id override.

`raft.peers` must include this node id

The local raft.node_id must appear in [[raft.peers]]. The peer list always includes self.

`storage.cold.s3.bucket is required when cold backend is 's3'`

Set storage.cold.s3.bucket, or set storage.cold.backend = "none" if you don't want cold storage.

Port already in use

Another process is already bound to the server.listen address. Public HTTP and inter-node gRPC share the same port unless server.cluster_listen is configured separately.

I/O error on `raft.wal.path`

The directory must be writable. There is no lock file to clear; Ursula trusts the directory.

Bootstrap and cluster join

A fresh cluster never elects leaders

/__ursula/metrics on each node should show a non-zero leader_id and a stable current_term for every group. If groups stay leaderless:

The first start must set raft.init_membership_per_group = true (or raft.init_membership = true) on every voting node.
Every peer URL must be reachable from every node.
All peers must use the same raft.wal.backend mode. A mismatch causes silent join failures.
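
A small Python sketch of the leader check, run against snapshots that have already been parsed into dicts. It assumes each entry of `.raft_groups[]` carries `leader_id` and `current_term` alongside a group identifier; the key name `group_id` is an assumption.

```python
def leaderless_groups(snapshot):
    """Ids of groups whose leader_id is 0 (or missing) in one node's snapshot."""
    return [
        g.get("group_id")  # assumed key name
        for g in snapshot.get("raft_groups", [])
        if g.get("leader_id", 0) == 0
    ]


def terms_stable(earlier, later):
    """True if no group's current_term changed between two snapshots of one node."""
    before = {g.get("group_id"): g.get("current_term")
              for g in earlier.get("raft_groups", [])}
    return all(before.get(g.get("group_id")) == g.get("current_term")
               for g in later.get("raft_groups", []))
```

Take two snapshots a few seconds apart: an empty `leaderless_groups` result plus `terms_stable(...) == True` matches the healthy state described above.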

Replacing or restarting a node leaves it leaderless

Restart with the same raft.node_id and peer list. If raft.wal.path is fresh (empty), the node has to rehydrate from its peers, which can be slow when most of the history lives only in cold storage.

Write path

`503` with `Retry-After` on append

Cold-write backpressure for that group. Either the hot ring exceeds storage.cold.max_hot_size_per_group (cold flush isn't keeping up), or the cold backend is slow or erroring out.

Raise the per-group ceiling, lower storage.cold.flush_interval, raise storage.cold.flush_max_concurrency, or fix the cold backend.
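
On the client side, honor Retry-After instead of retrying immediately. The sketch below is one way to pick a delay in Python. It assumes Retry-After carries an integer number of seconds and falls back to capped exponential backoff otherwise; the defaults for `base` and `cap` are arbitrary.

```python
def retry_delay(retry_after, attempt, base=0.5, cap=30.0):
    """Seconds to wait before retrying an append that returned 503.

    retry_after: raw Retry-After header value, or None if absent.
    attempt:     zero-based count of retries already made.
    """
    if retry_after is not None:
        try:
            # Assumption: the header is delay-seconds, not an HTTP date.
            return min(max(float(int(retry_after.strip())), 0.0), cap)
        except ValueError:
            pass  # unparseable value: fall through to backoff
    return min(base * (2 ** attempt), cap)
```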

`409 Conflict` with `producer-expected-seq` / `producer-received-seq`

Out-of-order producer headers. The response headers tell you what was expected:

producer-expected-seq: N is the next allowed sequence
producer-received-seq: M is what the client sent

Common causes:

Producer restarted without bumping Producer-Epoch. Bump it on every restart.
Two writers share the same Producer-Id. Use distinct IDs.
A retry skipped a sequence number. Producer sequences must be contiguous within an epoch.
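
The rules above can be sketched as client-side state in Python. Several details are assumptions: the request header name `Producer-Seq`, sequences restarting at 0 in a new epoch, and the interpretation strings returned by `classify_conflict`. Check them against your client library before relying on them.

```python
from dataclasses import dataclass


@dataclass
class ProducerState:
    producer_id: str   # must be unique per writer
    epoch: int         # bump on every restart
    next_seq: int = 0  # assumption: sequences start at 0 within an epoch

    def headers(self):
        """Headers for the next append. A retry reuses the same sequence."""
        return {
            "Producer-Id": self.producer_id,
            "Producer-Epoch": str(self.epoch),
            "Producer-Seq": str(self.next_seq),  # assumed header name
        }

    def ack(self):
        """Advance only after the server accepted the append."""
        self.next_seq += 1

    def restarted(self):
        """State to use after a process restart: new epoch, fresh sequence."""
        return ProducerState(self.producer_id, self.epoch + 1)


def classify_conflict(expected, received):
    """Rough reading of producer-expected-seq vs producer-received-seq."""
    if received > expected:
        return "gap"      # the client skipped a sequence number
    if received < expected:
        return "behind"   # stale state: restart without epoch bump, or shared id
    return "in-order"
```

Because `headers()` does not advance the sequence, a retried request carries the same number and the sequence stays contiguous.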

`409 Conflict` from `Stream-Seq`

The supplied Stream-Seq is not lexicographically greater than the last accepted value. Re-read and re-derive, or use producer dedup instead.
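
Lexicographic ordering is a common trap when Stream-Seq is derived from a counter: as strings, "10" sorts before "9". A minimal Python sketch of one workaround, zero-padding to a fixed width (the helper name and the width of 20 digits are illustrative choices):

```python
def stream_seq(counter, width=20):
    """Encode a non-negative counter so string order matches numeric order.

    20 digits is enough for any unsigned 64-bit value.
    """
    if counter < 0:
        raise ValueError("counter must be non-negative")
    return f"{counter:0{width}d}"


# Plain decimal strings compare character by character:
assert "10" < "9"
# Zero-padded values compare the way the numbers do:
assert stream_seq(9) < stream_seq(10)
```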

`404 Not Found` on `POST /{bucket}/{stream}`

The stream hasn't been created. Create with PUT /{bucket}/{stream} first; there is no implicit creation on POST.
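
A hedged Python sketch of the create-then-append order, building the two requests with the standard library. Only the methods and the /{bucket}/{stream} path come from this page; whether PUT needs extra headers or a body is not covered here, so treat the request shapes as assumptions.

```python
from urllib.parse import quote
from urllib.request import Request


def stream_url(base_url, bucket, stream):
    """Build the /{bucket}/{stream} URL with path segments percent-encoded."""
    return f"{base_url.rstrip('/')}/{quote(bucket, safe='')}/{quote(stream, safe='')}"


def create_request(base_url, bucket, stream):
    """PUT creates the stream. It must succeed before any POST."""
    return Request(stream_url(base_url, bucket, stream), method="PUT")


def append_request(base_url, bucket, stream, payload,
                   content_type="application/octet-stream"):
    """POST appends to an existing stream. It returns 404 if the stream is missing."""
    return Request(stream_url(base_url, bucket, stream), data=payload,
                   method="POST", headers={"Content-Type": content_type})
```

Send each request with `urllib.request.urlopen(...)`, creating first and appending second.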

`400 Bad Request` on JSON appends

JSON streams are parsed and normalized on the server. Malformed JSON or empty arrays without allow_empty_array are rejected. Send valid JSON or switch to application/octet-stream.
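
Validating on the client before sending gives a clearer error than a bare 400. A minimal Python sketch that mirrors the two rejection reasons above; the server's exact rules may be stricter, and `check_json_append` is an illustrative name.

```python
import json


def _reject_constant(name):
    # Python's parser accepts NaN and Infinity, which are not valid JSON.
    raise ValueError(f"non-standard JSON constant {name}")


def check_json_append(body, allow_empty_array=False):
    """Return None if the body looks acceptable, else a short reason string."""
    try:
        value = json.loads(body, parse_constant=_reject_constant)
    except (ValueError, TypeError) as exc:
        return f"malformed JSON: {exc}"
    if value == [] and not allow_empty_array:
        return "empty array without allow_empty_array"
    return None
```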

Read path

`Stream-Up-To-Date: false`

More committed data exists past this response. Page forward with Stream-Next-Offset. Not an error.
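
The paging loop can be sketched in Python as follows. The `fetch` callable is hypothetical: it stands in for whatever performs the read at a given offset and returns the body plus response headers, since the request format for offsets is not covered on this page. Header lookups assume exact casing.

```python
def read_to_tail(fetch, offset):
    """Page forward until the server reports the stream is up to date.

    fetch(offset) -> (body_bytes, headers_dict)   # hypothetical callable
    Returns (list_of_bodies, next_offset).
    """
    chunks = []
    while True:
        body, headers = fetch(offset)
        chunks.append(body)
        if headers.get("Stream-Up-To-Date", "").lower() != "false":
            return chunks, headers.get("Stream-Next-Offset", offset)
        next_offset = headers["Stream-Next-Offset"]
        if next_offset == offset:
            raise RuntimeError("server reported more data but offset did not advance")
        offset = next_offset
```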

SSE drops after 30-60 s idle

A proxy or load balancer is closing idle TCP connections. Raise its idle timeout or enable TCP keepalive on the proxy-to-Ursula leg. Ursula doesn't emit SSE keepalive comments.

SSE on a binary stream returns base64 text

Expected. SSE wire format is text-only, so binary stream data events carry raw base64 text and the Stream-Sse-Data-Encoding: base64 header signals it. See binary SSE.
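
Consumers need to decode the payload themselves. A minimal Python sketch, assuming you already have the event's data field as a string and the value of the Stream-Sse-Data-Encoding response header (`decode_sse_data` is an illustrative name):

```python
import base64


def decode_sse_data(data, encoding_header=None):
    """Return the event payload as bytes.

    data:            the SSE event's data field, as text
    encoding_header: value of Stream-Sse-Data-Encoding, or None if absent
    """
    if (encoding_header or "").strip().lower() == "base64":
        # b64decode ignores embedded newlines in its default, non-validating mode.
        return base64.b64decode(data)
    return data.encode("utf-8")
```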

Reads stall for hundreds of ms for cold offsets

The first read of a cold offset triggers an S3 GetObject range read. It runs off the actor turn so it doesn't block other commands, but the request still pays S3 latency.

Replication and consensus

A group has no leader (`leader_id == 0`)

Either an election is in progress or quorum is lost for that group. Only streams hashed to that group are affected.

Multiple nodes briefly claim leader: election is flapping. Check peer reachability and CPU pressure.
No node claims leader: fewer than ⌊n/2⌋+1 of the group's n voters are reachable, so no quorum can form. Restore reachability.
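
The quorum arithmetic, as a short Python sketch (standard Raft majority, with integer division):

```python
def quorum(voters):
    """Smallest majority of a group with the given number of voters."""
    if voters < 1:
        raise ValueError("a group needs at least one voter")
    return voters // 2 + 1


def has_quorum(reachable, voters):
    """True if enough voters are reachable to elect a leader."""
    return reachable >= quorum(voters)
```

A 3-voter group tolerates one unreachable voter and a 5-voter group tolerates two. A 4-voter group still tolerates only one.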

One follower lags behind

State-machine apply runs on the follower's owner core. If that core is saturated, last_applied trails. Check:

Per-core mailbox depth (sustained queueing means saturation).
Disk pressure if raft.wal.backend = "disk" (fsync latency).
A pathological hot stream forcing constant cold flushes (hot_bytes_total per group).
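
To see which follower is behind, compare last_applied for the same group across nodes. A Python sketch over already-parsed snapshots; it assumes `last_applied` and a `group_id` key appear on each `.raft_groups[]` entry, which this page does not confirm.

```python
def apply_lag(snapshots):
    """Entries each node trails the most advanced node by, per group.

    snapshots: {node_name: parsed /__ursula/metrics snapshot}
    Returns    {group_id: {node_name: lag}}
    """
    applied = {}
    for node, snap in snapshots.items():
        for g in snap.get("raft_groups", []):
            # Assumed keys: group_id, last_applied.
            applied.setdefault(g["group_id"], {})[node] = g["last_applied"]
    return {
        gid: {node: max(per_node.values()) - value
              for node, value in per_node.items()}
        for gid, per_node in applied.items()
    }
```

A lag that keeps growing between polls points at the saturation causes listed above. A small, steady lag is normal replication delay.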

Manually trigger a snapshot

curl -X POST http://NODE:4437/__ursula/raft/{group_id}/snapshot
curl -X POST http://NODE:4437/__ursula/raft/{group_id}/purge

Purge only after the snapshot replicates. Otherwise a slow follower may need log entries you just dropped.

Cold flush issues

Writes succeed but data never reaches S3

storage.cold.backend is none, or set to memory when you expected s3.
IAM permissions missing. Required: s3:GetObject, s3:PutObject, s3:ListBucket, s3:DeleteObject.
storage.cold.s3.endpoint unreachable.
Cold flush worker stalled (cold_backpressure_events_total climbing).

If hot bytes grow unbounded, you'll eventually hit storage.cold.max_hot_size_per_group and writes start returning 503. Fix the backend before that.
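
To catch this early, compare each group's hot bytes with the configured ceiling. A Python sketch, assuming `hot_bytes_total` and `group_id` appear on each `.raft_groups[]` entry and that storage.cold.max_hot_size_per_group is expressed in bytes; the 0.8 threshold is an arbitrary example.

```python
def hot_usage(snapshot, max_hot_size_per_group):
    """Fraction of the per-group hot ceiling currently used, keyed by group id."""
    return {
        g["group_id"]: g["hot_bytes_total"] / max_hot_size_per_group  # assumed keys
        for g in snapshot.get("raft_groups", [])
    }


def groups_near_limit(snapshot, max_hot_size_per_group, threshold=0.8):
    """Group ids at or above the threshold, i.e. close to returning 503."""
    usage = hot_usage(snapshot, max_hot_size_per_group)
    return sorted(gid for gid, frac in usage.items() if frac >= threshold)
```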

Still stuck?

Open a GitHub issue with:

/__ursula/metrics from every node (with timestamps)
Config file and CLI overrides per node (redact credentials)
Last ~200 log lines at RUST_LOG=ursula_runtime=debug,ursula_raft=debug,ursula=info