Observability
Ursula emits OpenTelemetry traces and metrics over OTLP, and keeps a built-in JSON metrics endpoint for quick inspection. Telemetry export is off by default: with no collector configured, the server behaves as it would without telemetry support, and the hot read/write path pays nothing for export.
Enabling OTLP export
Set the standard OpenTelemetry environment variables before starting ursula:
# Send traces and metrics to a collector (Tempo, Jaeger, the OTel Collector, …).
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4318"
# Optional: head-sampling ratio for new root traces (0.0–1.0, default 1.0).
export OTEL_TRACES_SAMPLER_ARG="0.1"
ursula --listen 0.0.0.0:8080 # …
The exporter uses OTLP over HTTP/protobuf (port 4318), reusing the server's
HTTP stack rather than opening a second gRPC tree alongside Raft. When the
endpoint is unset, the process stays on the stderr fmt logger driven by
RUST_LOG (default info).
Spans batch-export continuously and metrics export on a fixed interval
(OTEL_METRIC_EXPORT_INTERVAL, milliseconds). A final flush on clean shutdown
is only a backstop — don't rely on it to observe data.
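The startup decision described above can be sketched as a small pure function. This is an illustration under stated assumptions, not Ursula's actual code: `TelemetryMode`, `signal_url`, and `resolve_telemetry` are hypothetical names, and treating an empty endpoint as unset is an assumption. The `v1/traces` and `v1/metrics` suffixes follow the OTLP/HTTP convention for a base endpoint.

```rust
use std::time::Duration;

/// How telemetry is wired up at startup (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub enum TelemetryMode {
    /// No collector configured: stderr fmt logger only.
    FmtOnly,
    /// Collector configured: export over OTLP/HTTP.
    Otlp {
        traces_url: String,
        metrics_url: String,
        metric_interval: Duration,
    },
}

/// Per-signal URL for OTLP/HTTP: the base endpoint plus `v1/<signal>`.
pub fn signal_url(endpoint: &str, signal: &str) -> String {
    format!("{}/v1/{}", endpoint.trim_end_matches('/'), signal)
}

/// Decide the mode from an environment lookup. Taking the lookup as a
/// closure keeps the function testable; real code could pass
/// `|k| std::env::var(k).ok()`.
pub fn resolve_telemetry(get: impl Fn(&str) -> Option<String>) -> TelemetryMode {
    // Assumption: an empty endpoint is treated the same as an unset one.
    let endpoint = match get("OTEL_EXPORTER_OTLP_ENDPOINT") {
        Some(e) if !e.trim().is_empty() => e.trim().to_string(),
        _ => return TelemetryMode::FmtOnly,
    };
    // The interval is in milliseconds; fall back to 60 s when it is
    // unset or does not parse as an integer.
    let interval_ms = get("OTEL_METRIC_EXPORT_INTERVAL")
        .and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(60_000);
    TelemetryMode::Otlp {
        traces_url: signal_url(&endpoint, "traces"),
        metrics_url: signal_url(&endpoint, "metrics"),
        metric_interval: Duration::from_millis(interval_ms),
    }
}
```

Calling `resolve_telemetry(|k| std::env::var(k).ok())` with the variables from the shell snippet above would yield the `Otlp` variant; with nothing set it yields `FmtOnly`.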
Identifying the node
Every span and metric carries the process's OpenTelemetry resource. The
server sets service.name=ursula and, when running with a Raft node id,
service.instance.id=<node_id> so a cluster's nodes are distinguishable. Add
more resource attributes (e.g. host.name, region, AZ) with the standard
OTEL_RESOURCE_ATTRIBUTES variable. Their values are bounded by cluster size,
so they are resource dimensions, not high-cardinality labels:
export OTEL_RESOURCE_ATTRIBUTES="host.name=$(hostname),deployment.environment=prod"
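To make the shape of the resulting resource concrete, here is a rough sketch of how the comma-separated `key=value` list might be combined with the attributes the server sets itself. The function names and the `u64` node id type are assumptions for illustration. The sketch does not percent-decode values, and it does not decide which side wins when a key appears twice.

```rust
/// Parse `k1=v1,k2=v2` into pairs, skipping entries without `=` or with an
/// empty key. Values are kept verbatim (no percent-decoding in this sketch).
pub fn parse_resource_attributes(raw: &str) -> Vec<(String, String)> {
    raw.split(',')
        .filter_map(|pair| {
            let (key, value) = pair.split_once('=')?;
            let key = key.trim();
            if key.is_empty() {
                return None;
            }
            Some((key.to_string(), value.trim().to_string()))
        })
        .collect()
}

/// Build the attribute list for one node: the server-set attributes first,
/// then whatever the environment variable adds.
/// Assumption: node ids are integers; the source only shows `<node_id>`.
pub fn node_resource(node_id: Option<u64>, raw_env: Option<&str>) -> Vec<(String, String)> {
    let mut attrs = vec![("service.name".to_string(), "ursula".to_string())];
    if let Some(id) = node_id {
        attrs.push(("service.instance.id".to_string(), id.to_string()));
    }
    if let Some(raw) = raw_env {
        attrs.extend(parse_resource_attributes(raw));
    }
    attrs
}
```

Because every attribute here is fixed for the lifetime of the process, the number of distinct resources is bounded by the number of nodes.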
Traces
A single request produces one boundary span (http.append, http.read, or
http.head) tagged with bucket and stream — never the payload. That span
is propagated across the internal actor mailboxes, so on-core work
(core.append, core.read, …) nests underneath it. When a follower forwards a
read to the leader, the W3C traceparent rides the Raft gRPC request, so the
follower and leader appear in the same trace end to end.
Trace context travels only in gRPC transport metadata; it is never written to the Raft log, so state-machine determinism and replay are unaffected.
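For readers unfamiliar with the header that rides the forwarded request, the following sketch parses and re-emits a version-00 W3C `traceparent` value (`00-<trace-id>-<parent-id>-<flags>`). The `TraceParent` type is hypothetical and is not taken from Ursula's code; a real deployment would normally rely on the OpenTelemetry propagator instead.

```rust
/// Parsed W3C `traceparent` header, version 00 (hypothetical helper type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TraceParent {
    pub trace_id: u128,
    pub parent_id: u64,
    pub flags: u8,
}

impl TraceParent {
    /// Parse `00-<32 hex>-<16 hex>-<2 hex>`; returns None on malformed input.
    pub fn parse(header: &str) -> Option<Self> {
        let mut parts = header.trim().split('-');
        let (version, trace_id, parent_id, flags) =
            (parts.next()?, parts.next()?, parts.next()?, parts.next()?);
        if parts.next().is_some() || version != "00" {
            return None;
        }
        if trace_id.len() != 32 || parent_id.len() != 16 || flags.len() != 2 {
            return None;
        }
        // The format requires lowercase hex digits only.
        let lower_hex = |s: &str| s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
        if !(lower_hex(trace_id) && lower_hex(parent_id) && lower_hex(flags)) {
            return None;
        }
        let trace_id = u128::from_str_radix(trace_id, 16).ok()?;
        let parent_id = u64::from_str_radix(parent_id, 16).ok()?;
        let flags = u8::from_str_radix(flags, 16).ok()?;
        // All-zero ids are invalid.
        if trace_id == 0 || parent_id == 0 {
            return None;
        }
        Some(Self { trace_id, parent_id, flags })
    }

    /// The lowest flag bit marks the trace as sampled.
    pub fn sampled(&self) -> bool {
        self.flags & 0x01 == 0x01
    }

    /// Context for the next hop: same trace id, the caller's span as parent.
    pub fn child(&self, span_id: u64) -> Self {
        Self { trace_id: self.trace_id, parent_id: span_id, flags: self.flags }
    }

    /// Serialize back to the header form.
    pub fn to_header(&self) -> String {
        format!("00-{:032x}-{:016x}-{:02x}", self.trace_id, self.parent_id, self.flags)
    }
}
```

The trace id stays constant across the follower and the leader, which is what lets both appear in one trace; only the parent span id changes per hop.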
Sampling and the hot path
Request-boundary spans (http.*) are info level, so production traces always
carry one span per request. The on-core spans below them (core.append,
core.read, …) are debug level: filtered out at the default info and
compiled out of release builds entirely (release-max-info, on by default), so
the hot read/write path pays nothing for them. Measured cost:
| Operation | Cost |
|---|---|
| info boundary span (export off, fmt only) | ~285 ns |
| debug span filtered at info | ~1 ns |
| cross-mailbox context capture | ~5 ns |
For high-throughput deployments, lower OTEL_TRACES_SAMPLER_ARG to bound
export volume.
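As a rough illustration of what a head-sampling ratio does, the sketch below makes a deterministic keep/drop decision from the trace id, so every node that sees the same trace reaches the same verdict. It is a simplified model of ratio-based sampling, not the OpenTelemetry SDK's exact algorithm. The function names are hypothetical, and clamping out-of-range values is an assumption.

```rust
/// Read a sampling ratio from an optional string, defaulting to 1.0.
/// Assumption: out-of-range values are clamped and unparsable ones ignored.
pub fn parse_sample_ratio(raw: Option<&str>) -> f64 {
    match raw.and_then(|s| s.trim().parse::<f64>().ok()) {
        Some(r) if r.is_finite() => r.clamp(0.0, 1.0),
        _ => 1.0,
    }
}

/// Decide whether a new root trace is kept. The low 64 bits of the trace id
/// are reduced to a 63-bit value and compared against `ratio * 2^63`.
pub fn should_sample(trace_id: u128, ratio: f64) -> bool {
    if ratio.is_nan() || ratio <= 0.0 {
        return false;
    }
    if ratio >= 1.0 {
        return true;
    }
    let value = (trace_id as u64) >> 1;
    let threshold = (ratio * (1u64 << 63) as f64) as u64;
    value < threshold
}
```

With a ratio of 0.1, roughly one in ten uniformly distributed trace ids falls under the threshold, so export volume scales down by about the same factor while each kept trace stays complete.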
Seeing on-core detail
The core.* spans only materialize in a build without release-max-info (e.g.
a debug build), and only when their target is enabled. Use a crate-scoped
filter — never a global debug:
# Good: only Ursula's own crates emit debug spans.
export RUST_LOG="info,ursula_runtime=debug,ursula_raft=debug"
A global RUST_LOG=debug turns on debug spans/logs from OpenRaft, the
OpenTelemetry SDK, reqwest, and every other dependency, which floods the
collector and bloats the OTLP payload. Scope the filter to the ursula_*
crates instead.
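The difference between the scoped and the global filter can be seen in a toy model of directive matching: each `target=level` directive applies to targets that start with its prefix, a bare level is the global default, and the longest matching prefix wins. This is a simplified sketch of how such filters generally behave, not the real filter implementation, and it ignores the compile-time `release-max-info` cap.

```rust
/// Log levels ordered from least to most verbose.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

fn parse_level(s: &str) -> Option<Level> {
    match s.trim().to_ascii_lowercase().as_str() {
        "error" => Some(Level::Error),
        "warn" => Some(Level::Warn),
        "info" => Some(Level::Info),
        "debug" => Some(Level::Debug),
        "trace" => Some(Level::Trace),
        _ => None,
    }
}

/// Most verbose level allowed for `target` under a RUST_LOG-style filter.
pub fn max_level_for(filter: &str, target: &str) -> Option<Level> {
    let mut best: Option<(usize, Level)> = None;
    for directive in filter.split(',').map(str::trim).filter(|d| !d.is_empty()) {
        // `crate=level` scopes a level to a prefix; a bare level is global.
        let (prefix, level) = match directive.split_once('=') {
            Some((t, l)) => (t.trim(), parse_level(l)),
            None => ("", parse_level(directive)),
        };
        let level = match level {
            Some(l) => l,
            None => continue, // skip directives this sketch cannot parse
        };
        let more_specific = best.map_or(true, |(len, _)| prefix.len() >= len);
        if target.starts_with(prefix) && more_specific {
            best = Some((prefix.len(), level));
        }
    }
    best.map(|(_, level)| level)
}

/// True when an event or span at `level` from `target` passes the filter.
pub fn enabled(filter: &str, target: &str, level: Level) -> bool {
    max_level_for(filter, target).map_or(false, |max| level <= max)
}
```

Under `info,ursula_runtime=debug,ursula_raft=debug`, a debug span from `ursula_runtime::core` passes while one from a dependency such as `openraft::raft` is dropped; under a bare `debug`, both pass, which is what floods the collector.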
tokio-console needs tokio's TRACE-level instrumentation, which release-max-info compiles out. Build with --no-default-features --features tokio-console when using it.
Metrics
Two surfaces expose the same counters:
- OTLP: when an endpoint is configured, runtime metrics are exported
  periodically as OpenTelemetry instruments (ursula.appends.accepted,
  ursula.mutations.applied, ursula.raft_apply.ns, ursula.wal.records,
  ursula.group_mailbox.depth, …), ready for Prometheus/Grafana via a collector.
- JSON: GET /__ursula/metrics returns the full per-core/per-group snapshot
  for quick inspection and is what ursulactl status reads.
Metrics are gathered on the hot path with lock-free per-core counters. The OTLP layer is an export-time bridge: it reads a snapshot at the collection interval rather than calling the metrics SDK per record, so enabling export adds nothing to the append/read path.
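The pattern described here, per-core atomic counters on the hot path and a snapshot read at export time, can be sketched with the standard library alone. The type and field names below are hypothetical and only loosely mirror the instrument names listed above. The 64-byte alignment is an assumption about cache-line size, intended to keep cores from sharing a line.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Counters owned by one core. Aligned to 64 bytes (assumed cache-line
/// size) so that two cores do not contend on the same line.
#[derive(Default)]
#[repr(align(64))]
pub struct CoreCounters {
    appends_accepted: AtomicU64,
    wal_records: AtomicU64,
}

/// Point-in-time totals summed across cores.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Snapshot {
    pub appends_accepted: u64,
    pub wal_records: u64,
}

pub struct Metrics {
    cores: Vec<CoreCounters>,
}

impl Metrics {
    pub fn new(core_count: usize) -> Self {
        Self { cores: (0..core_count).map(|_| CoreCounters::default()).collect() }
    }

    /// Hot path: two relaxed atomic adds, no locks and no SDK calls.
    pub fn record_append(&self, core: usize, wal_records: u64) {
        let c = &self.cores[core];
        c.appends_accepted.fetch_add(1, Ordering::Relaxed);
        c.wal_records.fetch_add(wal_records, Ordering::Relaxed);
    }

    /// Export path: called once per collection interval by the bridge.
    /// The totals are not a single atomic cut across cores, which is
    /// acceptable for monotonic counters.
    pub fn snapshot(&self) -> Snapshot {
        let mut snap = Snapshot { appends_accepted: 0, wal_records: 0 };
        for c in &self.cores {
            snap.appends_accepted += c.appends_accepted.load(Ordering::Relaxed);
            snap.wal_records += c.wal_records.load(Ordering::Relaxed);
        }
        snap
    }
}
```

In this arrangement the cost of enabling export is confined to `snapshot`, which runs once per interval regardless of request rate.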
The OTLP collection interval defaults to 60s. For a quick check, shorten it so readings arrive without waiting (and without relying on the shutdown flush):
export OTEL_METRIC_EXPORT_INTERVAL=1000 # milliseconds