Operations

The first tool to reach for is ursulactl. It covers the raft-aware verbs operators run most often — rolling restart, status, readiness gate. This page covers everything around it: the metrics shape ursulactl reads, the admin endpoints it (and your custom tooling) can call, the SSH-side lifecycle that still lives in Python, and the operational policies ursulactl does not encode (backups, log levels, S3 cleanup).

Tooling map

Surface | When to reach for it
------- | --------------------
ursulactl | Day-2 cluster ops over HTTP: restart, status, wait-ready.
scripts/ursula_ec2.py | SSH/IAM/EC2 lifecycle: push binaries, write systemd units, S3 cleanup, drive the benchmark client.
/__ursula/metrics and the /__ursula/raft/... admin endpoints | Custom tooling. ursulactl uses these underneath; the surface is small and stable enough to script directly.

There is no Prometheus scrape and no general-purpose orchestrator yet.

Metrics

curl http://127.0.0.1:4437/__ursula/metrics | jq .

The JSON snapshot covers per-core mailbox depth, append/read counters, latency histograms, per-group leader and last_applied, hot/cold bytes, cold-flush backlog, HTTP status counters, live-read watchers, and cold-write admission state. Start here when triaging slowness, lag, or 503s. ursulactl status is a friendlier read of the leadership-related fields across the whole cluster.
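
For custom tooling, the same snapshot can be read from Python. The sketch below is only illustrative: this page does not show the JSON layout, so the top-level "groups" list and its "id", "leader", and "last_applied" keys are assumptions, as are the function names. Adjust the field access to whatever your build actually returns.

```python
import json
import urllib.request


def fetch_metrics(base_url="http://127.0.0.1:4437"):
    """GET the JSON metrics snapshot from one node."""
    with urllib.request.urlopen(f"{base_url}/__ursula/metrics", timeout=5) as resp:
        return json.load(resp)


def groups_by_leader(snapshot):
    """Map each leader node ID to the group IDs it currently leads.

    ASSUMPTION: the snapshot has a "groups" list whose items carry "id" and
    "leader" keys. A leader of None is treated as "no known leader".
    """
    by_leader = {}
    for group in snapshot.get("groups", []):
        by_leader.setdefault(group.get("leader"), []).append(group["id"])
    return by_leader
```

Calling groups_by_leader(fetch_metrics()) against a live node then gives a quick view of how leadership is spread; groups filed under None are the ones to look at first when triaging 503s.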

Admin endpoints

These are the primitives ursulactl and any custom operator tooling builds on. Each call is local to one node; to act on every group, loop over the IDs in metrics.

# Force a cold flush for one stream (skip the timer)
curl -X POST http://127.0.0.1:4437/__ursula/flush-cold/demo/hello

# Trigger a Raft snapshot for one group
curl -X POST http://127.0.0.1:4437/__ursula/raft/42/snapshot

# Purge log entries below the last snapshot index for one group
curl -X POST http://127.0.0.1:4437/__ursula/raft/42/purge

# Add a learner (non-voting replica) to one group
curl -X POST http://127.0.0.1:4437/__ursula/raft/42/learners/4

# Hand leadership of one group to another voter (used by `ursulactl restart`)
curl -X POST http://127.0.0.1:4437/__ursula/raft/42/leader/transfer/2

The leader-transfer endpoint refuses with 409 Conflict if the node receiving the request isn't the current leader of that group, and with 400 Bad Request if the target isn't a voter. This is why ursulactl is the safer way to chain these calls: it consults metrics first.
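
If you script leader transfers yourself, the two refusal cases can be checked locally before any request is sent. This is a minimal sketch, not how ursulactl is implemented: the per-group "id", "leader", and "voters" fields are an assumed shape for data you have already pulled from metrics, the function names are hypothetical, and the order of the two checks is a guess.

```python
BASE_URL = "http://127.0.0.1:4437"


def transfer_url(group_id, target, base_url=BASE_URL):
    """Build the leader-transfer URL used in the curl example above."""
    return f"{base_url}/__ursula/raft/{group_id}/leader/transfer/{target}"


def precheck_transfer(local_node, leader, voters, target):
    """Return the status the endpoint is documented to refuse with, or None.

    409: the node we would call is not the group's current leader.
    400: the transfer target is not a voter in the group.
    """
    if leader != local_node:
        return 409
    if target not in voters:
        return 400
    return None


def plan_transfers(local_node, groups, target):
    """List the transfer URLs that should be accepted when sent to local_node.

    ASSUMPTION: each group is a dict with "id", "leader", and "voters" keys.
    """
    return [
        transfer_url(group["id"], target)
        for group in groups
        if precheck_transfer(local_node, group["leader"], group["voters"], target) is None
    ]
```

Each URL in the returned plan would then be sent as a POST to the local node, one group at a time, rechecking metrics between calls since leadership can move on its own.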

SSH-side lifecycle

scripts/ursula_ec2.py drives a static EC2 manifest (instance IDs, IPs, port, binary path, cold env, peer list) over SSH. It complements ursulactl — push binaries and start daemons here, then switch to ursulactl for raft-aware verbs.

# Push a fresh binary to every server in the manifest
python3 scripts/ursula_ec2.py --config cluster.json upload-binary \
  --target servers --local ./target/release/ursula --remote /tmp/ursula

# Bring daemons up, then gate on readiness via ursulactl
python3 scripts/ursula_ec2.py --config cluster.json start
ursulactl wait-ready --config cluster.json --expected-groups 256

# Stop the cluster (kills the pid recorded in the configured pid file)
python3 scripts/ursula_ec2.py --config cluster.json stop

# Drive the benchmark workload from the configured client host
python3 scripts/ursula_ec2.py --config cluster.json perf-many \
  --processes 4 --bucket-prefix benchcmp-mp

perf-many rotates entrypoints across service nodes by default; use --target-mode first only to reproduce a single-ingress run.

stop kills only the PID recorded in the pid file rather than running pkill, because a broad pattern can match the SSH cleanup command itself. If the pid file is stale, kill the process by hand.
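
Before killing by hand, it helps to confirm whether the recorded PID is still alive. The sketch below shows one way to do that on a POSIX host. It assumes the pid file holds a single integer, which this page does not state, and the helper names are hypothetical.

```python
import os


def read_pid(pid_file):
    """Read the PID from a pid file (ASSUMPTION: one integer, optional newline)."""
    with open(pid_file) as fh:
        return int(fh.read().strip())


def pid_is_running(pid):
    """Return True if a process with this PID exists (POSIX only).

    Signal 0 delivers nothing; it only performs the existence and permission
    checks, so the target process is not affected.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the pid file is stale
    except PermissionError:
        return True   # process exists but belongs to another user
    return True
```

A live PID is not proof that the process is ursula, because PIDs can be reused after a reboot or a long uptime. Check the command line (for example with ps -p PID -o args) before sending a signal.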

The Python script still exposes status and wait-ready, kept around for environments where ursulactl isn't deployed yet. They print the same numbers but via SSH-curl — prefer ursulactl when you have a choice. Verbs scheduled to migrate to ursulactl once SSH transport lands: upload-binary, install-binary, install-service, install-chaos-agent, install-faultd, deploy-chaos. AWS deployment scaffolding (IAM / EC2 lifecycle / security groups) stays in Python permanently.

Cleaning S3

python3 scripts/ursula_ec2.py --config cluster.json cleanup-s3 \
  --root ursula-test-20260518T000000Z

Deletes everything under one storage.cold.root prefix in the manifest's bucket. Run after benchmark sweeps.

Backup and disaster recovery

Ursula has no backup or restore tool: no --export, no cluster dump, no "restore from snapshot file". Plan recovery accordingly.

What you can rely on:

  • Quorum durability. Acknowledged writes survive as long as a majority of voters survives. Three voters placed in three different AZs tolerate the loss of any single AZ.
  • Cold-tier durability. Once flushed to S3, a chunk inherits S3-grade durability. The unflushed window is bounded by the flush interval (seconds by default).
  • No on-disk migration. v0.x does not promise stable on-disk formats. The runtime won't refuse to start on stale data, but it won't migrate either. Cross-version upgrades currently mean rebuild + replay from external truth.
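
The quorum claim above is plain majority arithmetic, and it can be sanity-checked for any placement. This is a small illustrative sketch; the function names and the AZ labels are made up for the example.

```python
def majority(voters):
    """Smallest number of voters that forms a quorum."""
    return voters // 2 + 1


def tolerated_failures(voters):
    """How many voters can be lost while a majority still survives."""
    return voters - majority(voters)


def survives_any_single_az_loss(voters_per_az):
    """True if a quorum remains after losing any one AZ entirely.

    voters_per_az maps an AZ name to the number of voters placed there.
    """
    total = sum(voters_per_az.values())
    return all(total - lost >= majority(total) for lost in voters_per_az.values())
```

With one voter in each of three AZs, losing any AZ leaves two of three voters, which is still a majority. Placing two of the three voters in the same AZ breaks that guarantee, because losing that AZ leaves only one voter.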

Node-level loss: replace the host with the same node_id and cluster config; it rehydrates from peers. Use ursulactl wait-ready afterwards to confirm the replacement is voting and caught up before declaring the recovery done.

Total-cluster loss: the durable data on S3 is what you have, and there is no tooling yet to materialize a fresh cluster from those objects.

Logs

RUST_LOG=ursula=info,ursula_runtime=info,ursula_raft=info is the baseline. Bump to debug for one crate when chasing a subsystem:

RUST_LOG=ursula_raft=debug ./target/release/ursula ...

debug is verbose under sustained load; redirect to a file.
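
If a wrapper script launches the daemon, the filter string can be assembled rather than typed by hand. The sketch below keeps the baseline for every crate and raises only the one being investigated, which differs slightly from the single-crate example above. The helper name is hypothetical; the crate names are the three from the baseline.

```python
BASELINE = {"ursula": "info", "ursula_runtime": "info", "ursula_raft": "info"}


def rust_log(overrides=None):
    """Build a RUST_LOG value from the baseline plus per-crate overrides.

    The result uses the comma-separated target=level form shown above.
    """
    levels = dict(BASELINE)
    levels.update(overrides or {})
    return ",".join(f"{crate}={level}" for crate, level in levels.items())


if __name__ == "__main__":
    # Baseline everywhere, debug for the raft crate only.
    print(rust_log({"ursula_raft": "debug"}))
```

The printed value can be exported as RUST_LOG before starting the binary, with stderr redirected to a file as recommended above.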