Overview

The problem

Building a durable stream directly on S3 hits two walls: S3 has no append (every write either pays per-PUT or batches and eats the latency), and its conditional writes are optimistic compare-and-set, so concurrent writers retry-storm with exponential latency (Chroma's wal3 post covers the mechanics). The market splits on which wall to lean against: S2 writes through S3 Express to get under 50 ms but pays around 7x per-GB; WarpStream batches to S3 Standard and accepts 250 ms+ p50.

How Ursula works

Ursula keeps S3 off the write path entirely. Writes go to a cluster of nodes that coordinates through Raft consensus. Once a majority of replicas has persisted a record, it is committed, with no compare-and-set races and no retry storms. S3 is the cold tier.

Two structural choices shape the rest of the design:

  1. Thread-per-core shard ownership. Every stream is statically hashed to one specific CPU core on each node. That core handles all reads and writes for its streams, in order, with no cross-core synchronization on the hot path.
  2. Multi-Raft. Ursula runs hundreds to thousands of small Raft groups per cluster, not one log per node. Each stream hashes to a group, each group pins to a core. An unhealthy follower for one group does not stall traffic for another.

Together, write throughput scales with healthy cores across the cluster instead of with any single leader's bandwidth, and failure domains stay per-group rather than per-node.

The stack at a glance

Ursula is async Rust. The pieces that show up in the diagrams below:

  • axum is the HTTP server library that terminates public client requests.
  • tokio is the async runtime. Each core runs its own single-threaded executor (tokio::current_thread) so work on that core never migrates.
  • tonic is the gRPC library used for peer-to-peer Raft traffic between nodes.
  • OpenRaft is the Raft consensus implementation Ursula plugs into.

The Rust-named components in diagrams (ShardRuntime, GroupActor, GroupEngine, StreamStateMachine) are the structures inside one node, glossed in the sections below.

Interface

The API is the Durable Streams protocol, an open MIT-licensed spec published by Electric: create a stream, append, read from any offset, tail in real time over Server-Sent Events (SSE), with exactly-once write semantics. Plain HTTP plus SSE, with no custom binary protocol or mandatory client library. See the API overview and the extensions spec for the append-batch path.

Cluster topology

                          Clients (HTTP / SSE)

                ┌──────────────────┼──────────────────┐
                v                  v                  v
          ┌──────────┐       ┌──────────┐       ┌──────────┐
          │  Node A  │       │  Node B  │       │  Node C  │
          └────┬─────┘       └────┬─────┘       └────┬─────┘
            cores 0..N         cores 0..N         cores 0..N
               │                  │                  │
               └─── gRPC peer-to-peer Raft traffic ──┘

                                   v
                       ┌────────────────────────┐
                       │ Cold storage (S3)      │
                       │ per-stream chunks      │
                       └────────────────────────┘

Every stream is replicated to all voting members of its Raft group. Hot data lives in memory on each replica. Cold data lives in a shared object store any replica can read from on demand.

Inside a node

   HTTP / SSE request

          v
   route(bucket_id, stream_id)

          v
   owner core

          v
   Raft group
      │       │
      │       └── replicate to peer nodes

      ├── hot state and live watchers

      └── cold chunks in S3

A request enters the HTTP server, routing maps the bucket and stream name to one owner core and one Raft group, and the rest of the request runs inside that group. The group owns its hot bytes, live-tail watchers, producer dedup state, and Raft state machine, so mutable stream state does not cross cores after routing.

The replaceable group engine

The GroupEngine trait is the seam between the runtime and a group's storage and replication strategy. Three engines ship today:

  • In-memory, non-replicated. Used in tests and as a performance baseline.
  • Disk-backed Raft engine (--storage-backend disk --disk-path DIR). OpenRaft over a per-group write-ahead log of protobuf-framed records under DIR/raft-log, plus a shared per-core journal. Durable across restarts.
  • Memory-log Raft engine (--storage-backend memory). OpenRaft with an in-memory log store. Fast iteration, no persistence.

A factory selected at startup picks which engine each group opens, leaving the runtime code identical across modes.

Write path

HTTP ---> Runtime ---> core mailbox ---> GroupActor

                                       ├─ leader? ──no─---> forward to leader (gRPC)

                                       v yes
                                  Raft.client_write

                                       v
                            WAL append + fsync, replicate to followers

                                       v commit
                                  StateMachine::apply

                                       v
                          notify_read_watchers (same turn, no lock)

                                       v
                                AppendResponse -> client

The leader writes the entry to its write-ahead log (WAL), fsyncs, and replicates to followers in parallel. Once a quorum has persisted the entry, the state machine applies it and the response goes back to the client. Producer dedup headers (exactly-once writes) are checked inside the state machine, so retries of an already-applied append return the original offset rather than appending again. Stream-Seq provides a single-writer ordering guard for callers that prefer that style (conditional writes).

Storage layout

Each Raft group has two storage planes: a replicated log and a stream state machine.

              Raft group

        ┌─────────┴─────────┐
        v                   v
  replicated log       state machine
  command order        streams in this group
  quorum durability    offsets / metadata / dedup

                    ┌─────────┴─────────┐
                    v                   v
              hot ring memory      cold chunk refs
              recent bytes         S3 objects

The Raft log is the ordering and durability boundary for writes. A write is acknowledged only after a majority of that group's replicas persist the entry. The entry carries the stream command and metadata needed to apply it deterministically.

The stream data plane lives in the group's state machine. A group owns many streams, keyed by (bucket_id, stream_id). For each stream it tracks metadata, offsets, producer dedup state, live snapshots, hot byte segments, and cold chunk references.

Recent bytes stay in an in-memory hot ring on every replica. When a group's hot bytes exceed the configured cold-flush threshold, a background worker uploads older contiguous segments to the cold backend, usually S3. After upload succeeds, Ursula commits a metadata update through Raft that publishes the cold chunk reference and advances the stream's hot start offset.

Reads assemble one logical stream from both tiers. The state machine first builds a read plan: which byte ranges are still hot and which ranges are cold chunk references. Hot bytes are copied from memory; cold ranges are fetched with object-store range reads. Any replica that has applied the cold chunk metadata can serve those cold reads, as long as all nodes point at the same S3 bucket/prefix.

Read path and cold offload

HTTP ---> Runtime ---> core mailbox ---> GroupActor

                                       ├─ leader? ──no─---> forward to leader

                                       v yes
                          engine.read_stream_parts (under state-machine lock)

                                       v release lock
                       GroupReadStreamParts { plan, cold_store }

                                       v tokio::spawn (off the actor turn)
                          materialize: hot bytes + S3 range reads

                                       v
                                ReadStreamResponse -> client

Under the state-machine lock the runtime computes only a read plan: which segments are still in the hot ring and which live as cold chunks in S3. Then it releases the lock. Materialization, including any S3 GetObject range reads, runs in a separately spawned task, so a slow cold fetch never blocks other commands waiting on the same group. Followers serve replicated historical catch-up bytes from their local applied state when the read does not need a leader-owned access mutation. Tail-sensitive responses such as HEAD, open-stream Stream-Up-To-Date: true, and live reads stay on the leader path (see offsets).

SSE live tailing

When a read finds no new data and the response carries Stream-Up-To-Date: true, the runtime registers a watcher inside the owning group actor. The watcher map is per-actor and mutated only inside that actor's turn, with no mutex and no awaiting while holding a lock. When a later append commits on the same group, the actor wakes the matching watchers in the same turn, deduplicates them by request shape, and dispatches materialization through the same fast path the read uses.

For deployment topology (voting layout across availability zones, non-voting replicas, S3 prefix layout), see the operations guide.