Troubleshooting

Start at /__ursula/metrics. Most operational symptoms have a clear signal in the JSON snapshot.

Diagnostic surface

curl http://NODE:4437/__ursula/metrics | jq .

Useful jq selectors:

What you need                 jq filter
Per-group leader / term       .raft_groups[]
Hot bytes per group           .raft_groups[]
Cold backpressure events      .cold_backpressure_events_total
Per-core mailbox depth        .cores[]
Live-read watcher counts      .cores[]
HTTP error counters           .http.responses_by_status

Ursula does not expose /healthz, /readyz, Prometheus /metrics, or /cluster/status. The JSON snapshot is the source of truth.
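
Because the snapshot is plain JSON, a first triage pass is easy to script. The sketch below fetches the snapshot from one node and lists groups that have no leader. It assumes each raft_groups entry carries group_id and leader_id keys; group_id and the exact nesting are assumptions, so adjust them to what your snapshot actually contains.

```python
import json
import urllib.request


def fetch_snapshot(base_url, timeout=5.0):
    """Fetch the JSON snapshot from /__ursula/metrics on one node."""
    with urllib.request.urlopen(f"{base_url}/__ursula/metrics", timeout=timeout) as resp:
        return json.load(resp)


def leaderless_groups(snapshot):
    """Return ids of groups whose leader_id is 0 or missing.

    Assumes snapshot["raft_groups"] is a list of objects with "group_id"
    and "leader_id" keys; "group_id" is a hypothetical field name.
    """
    return [
        g.get("group_id")
        for g in snapshot.get("raft_groups", [])
        if not g.get("leader_id")
    ]
```

For example, leaderless_groups(fetch_snapshot("http://NODE:4437")) returns an empty list when every group on that node reports a leader.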

Startup failures

raft.wal.path is required when WAL backend is 'disk'

Set raft.wal.path when raft.wal.backend = "disk". Use raft.wal.backend = "memory" for volatile tests.

raft.node_id must be non-zero

Static Raft mode requires a unique raft.node_id in the config file or a --node-id override.

raft.peers must include this node id

The local raft.node_id must appear in [[raft.peers]]. The peer list always includes self.

storage.cold.s3.bucket is required when cold backend is 's3'

Set storage.cold.s3.bucket, or set storage.cold.backend = "none" if you don't want cold storage.

Port already in use

server.listen is taken. Public HTTP and inter-node gRPC share the same port unless server.cluster_listen is configured separately.

I/O error on raft.wal.path

The directory must be writable by the Ursula process. There is no lock file to clear; Ursula trusts the directory.
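
The configuration-related errors in this section can be caught before launch. The following is a minimal pre-flight sketch, not Ursula's own validation: it assumes the config file has already been parsed into nested dicts (for example with tomllib), that each [[raft.peers]] entry has a node_id key (an assumption), and it only checks backends that are set explicitly, since defaults are not covered here.

```python
def validate_config(cfg, node_id_override=None):
    """Return startup error messages for a parsed config dict.

    The "node_id" key inside each peer entry is a hypothetical name;
    node_id_override stands in for the --node-id command-line override.
    """
    errors = []
    raft = cfg.get("raft", {})
    wal = raft.get("wal", {})
    cold = cfg.get("storage", {}).get("cold", {})

    # A disk-backed WAL needs a directory to write to.
    if wal.get("backend") == "disk" and not wal.get("path"):
        errors.append("raft.wal.path is required when WAL backend is 'disk'")

    # The node id must be non-zero and listed among the peers.
    node_id = node_id_override or raft.get("node_id", 0)
    if not node_id:
        errors.append("raft.node_id must be non-zero")
    elif node_id not in {p.get("node_id") for p in raft.get("peers", [])}:
        errors.append("raft.peers must include this node id")

    # An S3 cold backend needs a bucket name.
    if cold.get("backend") == "s3" and not cold.get("s3", {}).get("bucket"):
        errors.append("storage.cold.s3.bucket is required when cold backend is 's3'")
    return errors
```

An empty list means none of these four checks failed; it says nothing about port conflicts or directory permissions.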

Bootstrap and cluster join

A fresh cluster never elects leaders

/__ursula/metrics on each node should show non-zero leader_id and a stable current_term. If groups stay leaderless:

  • First start must include raft.init_membership_per_group = true (or raft.init_membership = true) on every voting node.
  • Every peer URL must be reachable from every node.
  • All peers must use the same raft.wal.backend mode. A mismatch causes join failures that are not reported as errors.

Replacing or restarting a node leaves it leaderless

Restart with the same raft.node_id and peer list. Starting with an empty raft.wal.path forces the node to rehydrate from peers, which can be slow when most of the history lives only in cold storage.

Write path

503 with Retry-After on append

Cold-write backpressure for that group. Either the hot ring exceeds storage.cold.max_hot_size_per_group (cold flush isn't keeping up), or the cold backend is slow or erroring out.

Raise the per-group ceiling, lower storage.cold.flush_interval, raise storage.cold.flush_max_concurrency, or fix the cold backend.
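
On the client side, the usual response to a 503 is to wait for the interval in Retry-After and try again. The sketch below shows one way to do that. The send callable, the lowercase header keys, and the retry limits are assumptions for illustration, and only the delay-in-seconds form of Retry-After is handled.

```python
import time


def append_with_backoff(send, max_attempts=5, default_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-503 status or attempts run out.

    send is a caller-supplied function returning (status, headers), where
    headers is a dict with lowercase keys. Returns (status, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        status, headers = send()
        if status != 503 or attempt == max_attempts:
            return status, attempt
        try:
            delay = float(headers.get("retry-after", default_delay))
        except ValueError:
            # Not a plain number of seconds; fall back to the default.
            delay = default_delay
        sleep(max(delay, 0.0))
```

Retrying only hides the symptom. If 503s persist, the tuning options above still apply.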

409 Conflict with producer-expected-seq / producer-received-seq

Out-of-order producer headers. The response headers tell you what was expected and what arrived:

  • producer-expected-seq: N is the next allowed sequence
  • producer-received-seq: M is what the client sent

Common causes:

  • Producer restart without bumping Producer-Epoch. Bump every restart.
  • Two writers share the same Producer-Id. Use distinct IDs.
  • A retry skipped a sequence number. Producer sequences must be contiguous within an epoch.
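
Comparing the two response headers narrows down which cause applies. The sketch below is a heuristic built from the list above, not a rule the server states: a received value above the expected one suggests a skipped sequence, and a value below it suggests a restart without a new epoch or a second writer. The lowercase header keys are an assumption about how your HTTP client exposes them.

```python
def classify_producer_conflict(headers):
    """Interpret the sequence headers on a 409 response.

    headers is a dict with lowercase keys. The returned labels are
    hypothetical names used only by this sketch.
    """
    expected = int(headers["producer-expected-seq"])
    received = int(headers["producer-received-seq"])
    if received > expected:
        return "gap"      # a sequence number was skipped, e.g. by a retry
    if received < expected:
        return "stale"    # restart without a new epoch, or a second writer
    return "unknown"      # equal values should not conflict; inspect further
```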

409 Conflict from Stream-Seq

The supplied Stream-Seq is not lexicographically greater than the last accepted value. Re-read and re-derive, or use producer dedup instead.
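
Lexicographic comparison is a common source of this conflict when Stream-Seq is derived from a counter, because unpadded decimal strings do not sort in numeric order. The sketch below shows the pitfall and one way around it with fixed-width zero padding. The stream_seq helper and the width of 20 are illustrative choices, not a format Ursula requires.

```python
def stream_seq(counter, width=20):
    """Format an integer counter as a fixed-width, zero-padded string."""
    if counter < 0 or len(str(counter)) > width:
        raise ValueError("counter does not fit the chosen width")
    return str(counter).zfill(width)


# Unpadded decimal strings sort incorrectly as text.
assert "10" < "9"
# Padded values compare the same way as the integers they encode.
assert stream_seq(9) < stream_seq(10)
```

Pick the width once per stream. Changing it later breaks the ordering between old and new values.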

404 Not Found on POST /{bucket}/{stream}

The stream hasn't been created. Create with PUT /{bucket}/{stream} first; there is no implicit creation on POST.

400 Bad Request on JSON appends

JSON streams are parsed and normalized on the server. Malformed JSON or empty arrays without allow_empty_array are rejected. Send valid JSON or switch to application/octet-stream.

Read path

Stream-Up-To-Date: false

More committed data exists past this response. Page forward with Stream-Next-Offset. Not an error.
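
A catch-up read is therefore a loop that follows Stream-Next-Offset until the server reports it is up to date. The sketch below leaves the HTTP call to a caller-supplied fetch function, because how the offset is passed on the request is not covered here. It also assumes lowercase header keys and that the up-to-date value is the literal string "true".

```python
def read_to_tail(fetch, start_offset, max_pages=1000):
    """Collect response bodies until Stream-Up-To-Date is true.

    fetch(offset) is caller-supplied and returns (headers, body).
    Returns (chunks, next_offset) so a later read can resume.
    """
    chunks = []
    offset = start_offset
    for _ in range(max_pages):
        headers, body = fetch(offset)
        chunks.append(body)
        # Remember where the next read should start.
        offset = headers.get("stream-next-offset", offset)
        if headers.get("stream-up-to-date", "").lower() == "true":
            break
    return chunks, offset
```

The max_pages cap keeps the loop bounded if a busy stream never reports that the reader has caught up.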

SSE drops after 30-60 s idle

A proxy or load balancer is closing idle TCP. Raise its idle timeout or enable TCP keepalive on the proxy-to-Ursula leg. Ursula doesn't emit SSE keepalive comments.

SSE on a binary stream returns base64 text

Expected. SSE wire format is text-only, so binary stream data events carry raw base64 text and the Stream-Sse-Data-Encoding: base64 header signals it. See binary SSE.
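
Recovering the original bytes on the client is a base64 decode of the event's data field. The sketch below handles one event at a time. Joining multiple data: lines before decoding is an assumption about how a payload might be split across lines, so verify it against real responses.

```python
import base64


def decode_sse_data(event_text):
    """Decode the base64 payload of one SSE event into bytes.

    event_text is the raw text of a single event, i.e. the lines up to
    the blank-line terminator.
    """
    parts = []
    for line in event_text.splitlines():
        if line.startswith("data:"):
            parts.append(line[len("data:"):].strip())
    return base64.b64decode("".join(parts))
```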

Reads stall for hundreds of ms for cold offsets

The first read of a cold offset triggers an S3 GetObject range read. It runs off the actor turn so it doesn't block other commands, but the request still pays S3 latency.

Replication and consensus

A group has no leader (leader_id == 0)

Either an election is in progress or quorum is lost for that group. Only streams hashed to that group are affected.

  • Multiple nodes briefly claim leader: election is flapping. Check peer reachability and CPU pressure.
  • No node claims leader: fewer than a majority of the n voters (n/2+1, rounded down) are reachable. Restore reachability.
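
The majority threshold is simple integer arithmetic, sketched here to make the failure tolerance of common cluster sizes explicit. The function names are illustrative.

```python
def quorum_size(voters):
    """Smallest number of voters that forms a majority."""
    return voters // 2 + 1


def has_quorum(voters, reachable):
    """True when enough voters are reachable to elect a leader."""
    return reachable >= quorum_size(voters)


# Print voters, quorum size, and how many unreachable voters are tolerated.
for n in (3, 4, 5):
    print(n, quorum_size(n), n - quorum_size(n))
```

A 3-voter group tolerates one unreachable voter, and so does a 4-voter group; a 5-voter group tolerates two.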

One follower lags behind

State-machine apply runs on the follower's owner core. If CPU is pinned, last_applied trails. Check:

  • Per-core mailbox depth (sustained queueing means saturation).
  • Disk pressure if raft.wal.backend = "disk" (fsync latency).
  • A pathological hot stream forcing constant cold flushes (hot_bytes_total per group).
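
To see which follower trails and by how much, compare last_applied for the same group across nodes. The sketch below takes snapshots already fetched from each node. It assumes each raft_groups entry exposes group_id and last_applied; the key names and nesting are assumptions about the snapshot layout.

```python
def apply_lag(snapshots):
    """Map group id -> {node: entries behind the most advanced node}.

    snapshots maps a node name to its parsed /__ursula/metrics JSON.
    """
    applied = {}
    for node, snap in snapshots.items():
        for g in snap.get("raft_groups", []):
            applied.setdefault(g["group_id"], {})[node] = g["last_applied"]
    return {
        gid: {node: max(per_node.values()) - v for node, v in per_node.items()}
        for gid, per_node in applied.items()
    }
```

Snapshots taken at slightly different times will show small, shifting gaps. A lag that grows across repeated samples is the signal worth chasing.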

Manually trigger a snapshot

curl -X POST http://NODE:4437/__ursula/raft/{group_id}/snapshot
curl -X POST http://NODE:4437/__ursula/raft/{group_id}/purge

Purge only after the snapshot replicates. Otherwise a slow follower may need log entries you just dropped.

Cold flush issues

Writes succeed but data never reaches S3

  • storage.cold.backend is none, or set to memory when you expected s3.
  • IAM permissions missing. Required: s3:GetObject, s3:PutObject, s3:ListBucket, s3:DeleteObject.
  • storage.cold.s3.endpoint unreachable.
  • Cold flush worker stalled (cold_backpressure_events_total climbing).

If hot bytes grow unbounded, you'll eventually hit storage.cold.max_hot_size_per_group and writes start returning 503. Fix the backend before that.

Still stuck?

Open a GitHub issue with:

  • /__ursula/metrics from every node (with timestamps)
  • Config file and CLI overrides per node (redact credentials)
  • Last ~200 log lines at RUST_LOG=ursula_runtime=debug,ursula_raft=debug,ursula=info