Observability

Ursula emits OpenTelemetry traces and metrics over OTLP, and keeps a built-in JSON metrics endpoint for quick inspection. Telemetry export is off by default: with no collector configured, the server behaves as it would without telemetry support, and the hot read/write path pays nothing.

Enabling OTLP export

Set the standard OpenTelemetry environment variables before starting ursula:

# Send traces and metrics to a collector (Tempo, Jaeger, the OTel Collector, …).
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4318"

# Optional: head-sampling ratio for new root traces (0.0–1.0, default 1.0).
export OTEL_TRACES_SAMPLER_ARG="0.1"

ursula --listen 0.0.0.0:8080 # …

The exporter uses OTLP over HTTP/protobuf (port 4318), reusing the server's HTTP stack rather than opening a second gRPC tree alongside Raft. When the endpoint is unset, the process stays on the stderr fmt logger driven by RUST_LOG (default info).
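
The startup decision described above can be sketched as follows. This is a minimal illustration, not Ursula's code: the TelemetryMode type and telemetry_mode function are hypothetical, and treating an empty value as unset is an assumption. The /v1/traces and /v1/metrics suffixes are the per-signal paths that OTLP/HTTP exporters append to a base endpoint.

```rust
/// Where telemetry goes, decided once at startup (hypothetical type).
#[derive(Debug, PartialEq)]
enum TelemetryMode {
    /// No collector configured: stderr fmt logger only.
    StderrOnly,
    /// OTLP over HTTP/protobuf, one URL per signal.
    Otlp { traces_url: String, metrics_url: String },
}

/// `endpoint` is the raw value of OTEL_EXPORTER_OTLP_ENDPOINT, if set.
fn telemetry_mode(endpoint: Option<&str>) -> TelemetryMode {
    // Assumption: an empty or whitespace-only value counts as unset.
    let base = match endpoint.map(str::trim) {
        Some(e) if !e.is_empty() => e.trim_end_matches('/'),
        _ => return TelemetryMode::StderrOnly,
    };
    TelemetryMode::Otlp {
        traces_url: format!("{base}/v1/traces"),
        metrics_url: format!("{base}/v1/metrics"),
    }
}
```

In a real binary the argument would come from something like std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT").ok().as_deref().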

Spans batch-export continuously and metrics export on a fixed interval (OTEL_METRIC_EXPORT_INTERVAL, milliseconds). A final flush on clean shutdown is only a backstop — don't rely on it to observe data.

Identifying the node

Every span and metric carries the process's OpenTelemetry resource. The server sets service.name=ursula and, when running with a Raft node id, service.instance.id=<node_id>, so a cluster's nodes are distinguishable. Because the number of distinct values is bounded by cluster size, extra attributes such as host.name, region, or AZ are resource dimensions, not high-cardinality labels. Add them with the standard variable:

export OTEL_RESOURCE_ATTRIBUTES="host.name=$(hostname),deployment.environment=prod"
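
For readers unfamiliar with the variable's shape, here is a rough sketch of how a comma-separated key=value list like the one above breaks down into attributes. The function name is hypothetical, and the sketch skips the percent-decoding that a full OpenTelemetry SDK applies to values.

```rust
/// Split an OTEL_RESOURCE_ATTRIBUTES-style string into (key, value) pairs.
/// Simplified: no percent-decoding; malformed entries are dropped silently.
fn parse_resource_attributes(raw: &str) -> Vec<(String, String)> {
    raw.split(',')
        .filter_map(|pair| {
            // Entries without '=' (including empty ones) are skipped.
            let (key, value) = pair.split_once('=')?;
            let (key, value) = (key.trim(), value.trim());
            if key.is_empty() {
                None
            } else {
                Some((key.to_string(), value.to_string()))
            }
        })
        .collect()
}
```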

Traces

A single request produces one boundary span (http.append, http.read, or http.head) tagged with bucket and stream — never the payload. That span is propagated across the internal actor mailboxes, so on-core work (core.append, core.read, …) nests underneath it. When a follower forwards a read to the leader, the W3C traceparent rides the Raft gRPC request, so the follower and leader appear in the same trace end to end.

Trace context travels only in gRPC transport metadata; it is never written to the Raft log, so state-machine determinism and replay are unaffected.
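
The traceparent value that rides the forwarded request follows the W3C Trace Context format: version, 16-byte trace id, 8-byte parent span id, and flags, all lowercase hex. The sketch below encodes and decodes version 00 of that header. The TraceParent struct and function names are illustrative, not Ursula's internal types.

```rust
/// Decoded W3C traceparent header, version 00 (illustrative type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct TraceParent {
    trace_id: u128,
    parent_id: u64,
    sampled: bool,
}

fn is_lower_hex(s: &str) -> bool {
    s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

fn format_traceparent(tp: &TraceParent) -> String {
    format!("00-{:032x}-{:016x}-{:02x}", tp.trace_id, tp.parent_id, tp.sampled as u8)
}

fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let mut parts = header.trim().split('-');
    let (version, trace, parent, flags) =
        (parts.next()?, parts.next()?, parts.next()?, parts.next()?);
    // Version 00 has exactly four fields with fixed widths.
    if parts.next().is_some() || version != "00" {
        return None;
    }
    if trace.len() != 32 || parent.len() != 16 || flags.len() != 2 {
        return None;
    }
    if !(is_lower_hex(trace) && is_lower_hex(parent) && is_lower_hex(flags)) {
        return None;
    }
    let trace_id = u128::from_str_radix(trace, 16).ok()?;
    let parent_id = u64::from_str_radix(parent, 16).ok()?;
    let flags = u8::from_str_radix(flags, 16).ok()?;
    // All-zero ids are invalid per the specification.
    if trace_id == 0 || parent_id == 0 {
        return None;
    }
    Some(TraceParent { trace_id, parent_id, sampled: flags & 0x01 == 0x01 })
}
```

Because the header carries everything needed to continue the trace, the receiving node can reconstruct the parent context from transport metadata alone, which is consistent with nothing being written to the Raft log.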

Sampling and the hot path

Request-boundary spans (http.*) are info level, so production traces always carry one span per request. The on-core spans below them (core.append, core.read, …) are debug level: they are filtered out at the default info level and compiled out of release builds entirely by the release-max-info feature (on by default), so the hot read/write path pays nothing for them. Measured cost:

Operation                                    Cost
info boundary span (export off, fmt only)    ~285 ns
debug span filtered at info                  ~1 ns
cross-mailbox context capture                ~5 ns

For high-throughput deployments, lower OTEL_TRACES_SAMPLER_ARG to bound export volume.
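
To make the effect of the ratio concrete, here is a sketch of ratio-based head sampling in the style of common trace-id-ratio samplers. It is an illustration under assumptions, not Ursula's or the OpenTelemetry SDK's exact code: the function names are hypothetical, and falling back to 1.0 on an unparseable value is an assumption.

```rust
/// Parse a sampler ratio, clamped to [0.0, 1.0].
/// Assumption: missing or unparseable input falls back to the default 1.0.
fn parse_ratio(raw: Option<&str>) -> f64 {
    raw.and_then(|s| s.trim().parse::<f64>().ok())
        .filter(|r| r.is_finite())
        .map(|r| r.clamp(0.0, 1.0))
        .unwrap_or(1.0)
}

/// Decide at the root span whether to keep a trace.
/// The decision depends only on the trace id, so it is deterministic:
/// every process that sees the same id reaches the same answer.
fn should_sample(trace_id: u128, ratio: f64) -> bool {
    if ratio >= 1.0 {
        return true;
    }
    if !(ratio > 0.0) {
        return false; // covers 0.0, negatives, and NaN
    }
    // Map the ratio onto a 63-bit threshold and compare it with the
    // low 64 bits of the trace id, shifted into the same 63-bit range.
    let bound = (ratio * (1u64 << 63) as f64) as u64;
    let x = (trace_id as u64) >> 1;
    x < bound
}
```

With a ratio of 0.1, roughly one in ten uniformly distributed trace ids falls under the threshold, so export volume drops by about the same factor.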

Seeing on-core detail

The core.* spans only materialize in a build without release-max-info (e.g. a debug build), and only when their target is enabled. Use a crate-scoped filter — never a global debug:

# Good: only Ursula's own crates emit debug spans.
export RUST_LOG="info,ursula_runtime=debug,ursula_raft=debug"

A global RUST_LOG=debug turns on debug spans/logs from OpenRaft, the OpenTelemetry SDK, reqwest, and every other dependency, which floods the collector and bloats the OTLP payload. Scope the filter to the ursula_* crates instead.
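
The difference between a scoped and a global filter can be modelled in a few lines. The sketch below is a deliberately simplified model of RUST_LOG-style directives (a bare level sets the global maximum, target=level overrides it for one crate, and the most specific match wins). It is not the real filter implementation, which supports more syntax, and all names here are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

fn parse_level(s: &str) -> Option<Level> {
    match s.trim().to_ascii_lowercase().as_str() {
        "error" => Some(Level::Error),
        "warn" => Some(Level::Warn),
        "info" => Some(Level::Info),
        "debug" => Some(Level::Debug),
        "trace" => Some(Level::Trace),
        _ => None,
    }
}

/// Most verbose level allowed for `target` under `filter`, if any.
fn max_level_for(filter: &str, target: &str) -> Option<Level> {
    let mut best: Option<(usize, Level)> = None;
    for directive in filter.split(',').map(str::trim) {
        let (prefix, level) = match directive.split_once('=') {
            Some((p, l)) => (p.trim(), parse_level(l)),
            None => ("", parse_level(directive)), // bare level: global default
        };
        let level = match level {
            Some(l) => l,
            None => continue, // skip directives this model does not understand
        };
        let matches = prefix.is_empty()
            || target == prefix
            || target
                .strip_prefix(prefix)
                .map_or(false, |rest| rest.starts_with("::"));
        // Longer (more specific) prefixes override shorter ones.
        if matches && best.map_or(true, |(len, _)| prefix.len() >= len) {
            best = Some((prefix.len(), level));
        }
    }
    best.map(|(_, level)| level)
}

fn enabled(filter: &str, target: &str, level: Level) -> bool {
    max_level_for(filter, target).map_or(false, |max| level <= max)
}
```

Under the scoped filter, a debug span with target ursula_runtime::core is enabled while one from a dependency such as openraft stays capped at info. Under a bare debug filter, both are enabled.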

tokio-console needs tokio's TRACE-level instrumentation, which release-max-info compiles out. Build with --no-default-features --features tokio-console when using it.

Metrics

Two surfaces expose the same counters:

  • OTLP: when an endpoint is configured, runtime metrics are exported periodically as OpenTelemetry instruments (ursula.appends.accepted, ursula.mutations.applied, ursula.raft_apply.ns, ursula.wal.records, ursula.group_mailbox.depth, …), ready for Prometheus/Grafana via a collector.
  • JSON: GET /__ursula/metrics returns the full per-core/per-group snapshot for quick inspection and is what ursulactl status reads.

Metrics are gathered on the hot path with lock-free per-core counters. The OTLP layer is an export-time bridge: it reads a snapshot at the collection interval rather than calling the metrics SDK per record, so enabling export adds nothing to the append/read path.
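
The pattern described above (per-core atomic counters on the hot path, summed only when an exporter asks for a snapshot) might look roughly like this. It is a sketch with hypothetical type and field names, not Ursula's actual structures, and the 64-byte alignment is an assumed cache-line size.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// One counter block per core. The alignment keeps each block on its own
/// cache line (64 bytes assumed) so cores do not contend on shared lines.
#[derive(Default)]
#[repr(align(64))]
struct CoreCounters {
    appends_accepted: AtomicU64,
    wal_records: AtomicU64,
}

/// Point-in-time totals handed to an exporter (hypothetical shape).
#[derive(Debug, PartialEq)]
struct Snapshot {
    appends_accepted: u64,
    wal_records: u64,
}

struct Metrics {
    cores: Vec<CoreCounters>,
}

impl Metrics {
    fn new(core_count: usize) -> Self {
        Self { cores: (0..core_count).map(|_| CoreCounters::default()).collect() }
    }

    /// Hot path: a single relaxed atomic add, no locks, no SDK call.
    fn record_append(&self, core: usize, wal_records: u64) {
        let c = &self.cores[core];
        c.appends_accepted.fetch_add(1, Ordering::Relaxed);
        c.wal_records.fetch_add(wal_records, Ordering::Relaxed);
    }

    /// Export path: runs once per collection interval and sums all cores.
    fn snapshot(&self) -> Snapshot {
        let mut snap = Snapshot { appends_accepted: 0, wal_records: 0 };
        for c in &self.cores {
            snap.appends_accepted += c.appends_accepted.load(Ordering::Relaxed);
            snap.wal_records += c.wal_records.load(Ordering::Relaxed);
        }
        snap
    }
}
```

Relaxed ordering is enough here because the counters are independent statistics: a snapshot may lag in-flight increments slightly, but no other memory depends on them.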

The OTLP collection interval defaults to 60s. For a quick check, shorten it so readings arrive without waiting (and without relying on the shutdown flush):

export OTEL_METRIC_EXPORT_INTERVAL=1000  # milliseconds