# Ursula - full documentation

> Open-source Distributed Durable Streams over HTTP, backed by S3.

Source: https://ursula.tonbo.io

---

# Ursula

> A self-hosted, distributed server for append-only event timelines. Thread-per-core, multi-Raft, S3-backed.

> [!NOTE]
> Ursula is built by Tonbo.

Ursula is a self-hosted, distributed server for the replayable, append-only event timelines behind document edits, agent runs, workflows, and chat. It speaks the [Durable Streams Protocol](/docs/specs/durable-stream) over plain HTTP and SSE. Quorum-replicated in-memory front-ends give you single-digit-millisecond appends and live tail on top of S3 durability, with no separate broker, no batched 250 ms commits, and no S3 Express bill.

- **HTTP-native.** `PUT` creates, `POST` appends, `GET` replays or tails with long-poll/SSE. Any HTTP client is a valid client.
- **One timeline per resource.** A stream per document, session, workflow, room, or agent run, instead of a few high-throughput pipelines.
- **Thread-per-core × multi-Raft.** Each stream hashes to one Raft group and one owner core. Hot-path requests touch that core only, with no cross-core synchronization. Hundreds to thousands of small groups co-exist per node, so a slow follower for one group never stalls another.
- **Hot ring + S3 cold tier.** Writes commit at Raft quorum in an in-memory ring. Older segments flush to S3 in the background. A single `GET` transparently spans both tiers.

## Run locally

For now, Ursula builds from Rust source. Pre-built release binaries are on the way.
Start a single in-memory node (fast iteration, no persistence):

```bash
cargo run --bin ursula
```

Create a stream, append bytes, and read them back:

```bash
curl -X PUT http://127.0.0.1:4437/demo
curl -X PUT http://127.0.0.1:4437/demo/hello
curl -X POST http://127.0.0.1:4437/demo/hello \
  -H 'Content-Type: application/octet-stream' \
  --data-binary 'hello world'
curl 'http://127.0.0.1:4437/demo/hello?offset=-1'
```

## Learn more

- **[Why Ursula](/docs/why-ursula):** the four properties Ursula keeps that other servers force you to trade
- **[Architecture](/docs/architecture/overview):** thread-per-core, multi-Raft, hot/cold tiers
- **[Protocol Spec](/docs/specs/durable-stream):** Durable Streams plus [Ursula's extensions](/docs/specs/extensions)
- **[ursulactl](/docs/cli):** the operator CLI for a running cluster: rolling restart, status, readiness gate

## Credits

- **[ElectricSQL](https://electric-sql.com/)** for the original Durable Streams Protocol that Ursula implements.
- **[Loro](https://loro.dev/)** for the snapshot and replay extension design that Ursula adopted on top of the base protocol.

---

# Quick Start

> Run Ursula locally, create a stream, append data, and subscribe with SSE.

> For now, Ursula builds from Rust source. Pre-built release binaries are on the way.

Start a local node from the workspace root. The default is in-memory (no on-disk state, good for kicking the tires):

```bash
cargo run --bin ursula
```

It binds `127.0.0.1:4437`, picks a core count from your CPU, and uses the in-memory engine. For local WAL-backed persistence instead, point `--wal-dir` at an empty directory:

```bash
cargo run --bin ursula -- --wal-dir ./data/wal
```

Then drive the HTTP API with `curl` in another terminal.

## Acknowledge the bucket

Bucket creation in the current build is an idempotent acknowledgement: Ursula returns `201` whether or not the name has been seen before. It's still worth issuing because clients and docs assume the call:

```bash
curl -X PUT http://127.0.0.1:4437/demo
```

## Create a stream

```bash
curl -X PUT http://127.0.0.1:4437/demo/hello
```

## Append data

```bash
curl -X POST http://127.0.0.1:4437/demo/hello \
  -H 'Content-Type: application/octet-stream' \
  --data-binary 'first message'

curl -X POST http://127.0.0.1:4437/demo/hello \
  -H 'Content-Type: application/octet-stream' \
  --data-binary 'second message'
```

Each successful append returns `204 No Content` with a `Stream-Next-Offset` header. Add `Producer-Id` / `Producer-Epoch` / `Producer-Seq` if you need [exactly-once retries](/docs/concepts/exactly-once-writes).

## Read everything from the beginning

```bash
curl -i 'http://127.0.0.1:4437/demo/hello?offset=-1'
```

The body contains the appended bytes. Response headers include `Stream-Next-Offset`, `Stream-Up-To-Date`, and an `ETag`.

## Subscribe for live updates

Open a second terminal:

```bash
curl -N 'http://127.0.0.1:4437/demo/hello?offset=-1&live=sse'
```

`-N` disables `curl`'s output buffering so SSE events print as they arrive. Append more data from the first terminal and you'll see it arrive immediately as `event: data` lines.

Binary streams are delivered as a JSON envelope with a base64-encoded payload (`Stream-Sse-Data-Encoding: base64`). See [binary SSE](/docs/concepts/binary-sse) for details.

## Inspect runtime state

For a multi-node cluster the canonical day-2 tool is [`ursulactl`](/docs/cli). It speaks to every node, summarises leadership, and is what you'll use for rolling restarts.
For a single local node you can either point it at a one-line manifest:

```bash
cat > /tmp/local.json <<'JSON'
{"nodes": [{"id": 1, "http_url": "http://127.0.0.1:4437", "host": "127.0.0.1"}]}
JSON
ursulactl status --config /tmp/local.json
```

Or hit the underlying JSON endpoint directly:

```bash
curl http://127.0.0.1:4437/__ursula/metrics
```

The raw endpoint is also what `ursulactl` consumes; reach for it when you want the full snapshot or are building custom tooling.

## Next steps

- [ursulactl](/docs/cli): the operator CLI you'll use once a cluster is up
- [Run locally](/docs/run-locally): flags, environment variables, and durability modes
- [Deploy a cluster](/docs/deploy-cluster): static gRPC Raft cluster shape
- [Configure S3](/docs/configure-s3): shared cold storage for clusters
- [API overview](/docs/api/overview): the HTTP surface Ursula currently exposes
- [Streams](/docs/concepts/streams): the core stream abstraction and lifecycle
- [Architecture](/docs/architecture/overview): thread-per-core, multi-Raft internals

---

# Why Ursula

> The four properties Ursula keeps that other servers force you to trade, and the per-entity timeline model that makes the pattern simple to build on.

## What Ursula keeps

A new generation of event streams lives outside the broker network. Document editors, agents, and durable workflows need timelines that browsers, mobile apps, and serverless functions can read, write, and tail over the public internet. That asks for HTTP-native, distributed, S3-backed infrastructure, not the SDK-locked, single-network shape Kafka-style brokers were built for.

The [Durable Streams Protocol](/docs/specs/durable-stream) nails that wire format, but its reference server is a single process: a node loss is data loss. The other servers we evaluated each force you to give up one of four things this primitive deserves to keep:

- **Open-source self-hosting.**
- **Low write latency** (sub-50 ms appends, no batching window required).
- **Plain S3 economics** (cold tier on standard S3, no S3 Express tier, no per-GB SaaS markup).
- **Quorum-replicated durability** (acknowledged writes survive a single-node failure).

Ursula keeps all four. For the head-to-head numbers, see [How Ursula compares](/docs/competitive-comparison).

## One timeline per entity

Most event systems multiplex many entities into shared channels and rebuild per-entity state downstream. Ursula inverts that: each document, session, task, room, or agent run gets its own durable timeline. Writers append directly to it; readers resume from its offsets.

Treat it as a per-entity durable log runtime, purpose-built for that pattern rather than a general event backbone.

## What you get from one stream per entity

- **Replayable recovery.** When a worker, sandbox, or agent restarts, replay its log and continue.
- **Live tails with simple clients.** Catch-up reads, long-poll, or SSE, all over plain HTTP.
- **Lifecycle in the same primitive.** Snapshots, bootstrap, and TTL ride the timeline instead of being separate infrastructure.

## The tradeoff

Ursula makes one timeline per entity cheap. In exchange, it isn't the most general primitive for cross-system event distribution or arbitrary stream processing. Keeping the per-entity model simple is what makes recovery, inspection, and operation tractable for long-running application state.

For latency and durability numbers, see [Architecture](/docs/architecture/overview) and [Competitive comparison](/docs/competitive-comparison).

---

# Clients

> Talking to Ursula from any HTTP client.

Ursula speaks plain HTTP and Server-Sent Events. There is no required client library. Any HTTP client in any language works.

The examples elsewhere in these docs use `curl` because it's universal. The same routes, headers, and query parameters apply to every other client.
## Minimal examples

```bash
# create a bucket and stream
curl -X PUT http://127.0.0.1:4437/demo
curl -X PUT http://127.0.0.1:4437/demo/hello

# append
curl -X POST http://127.0.0.1:4437/demo/hello \
  -H 'Content-Type: application/octet-stream' \
  --data-binary 'hello world'

# catch-up read
curl 'http://127.0.0.1:4437/demo/hello?offset=-1'

# live tail (-N disables curl's output buffering)
curl -N 'http://127.0.0.1:4437/demo/hello?offset=-1&live=sse'
```

```python
import requests

base = "http://127.0.0.1:4437"

requests.put(f"{base}/demo")
requests.put(f"{base}/demo/hello")
requests.post(
    f"{base}/demo/hello",
    headers={"Content-Type": "application/octet-stream"},
    data=b"hello world",
)

# catch-up read
resp = requests.get(f"{base}/demo/hello", params={"offset": -1})
print(resp.content)

# live tail with SSE
with requests.get(
    f"{base}/demo/hello",
    params={"offset": -1, "live": "sse"},
    stream=True,
) as r:
    for line in r.iter_lines():
        if line:
            print(line.decode())
```

```ts
const base = "http://127.0.0.1:4437";

await fetch(`${base}/demo`, { method: "PUT" });
await fetch(`${base}/demo/hello`, { method: "PUT" });
await fetch(`${base}/demo/hello`, {
  method: "POST",
  headers: { "Content-Type": "application/octet-stream" },
  body: "hello world",
});

// catch-up read
const data = await (await fetch(`${base}/demo/hello?offset=-1`)).text();

// live tail with native EventSource
const es = new EventSource(`${base}/demo/hello?offset=-1&live=sse`);
es.addEventListener("data", (e) => console.log(e.data));
```

## Notes for client implementers

- After every read and append, the server returns `Stream-Next-Offset`. Track it. Use it as the `offset` query parameter on the next read. Don't construct offsets manually. They're opaque.
- For binary streams over SSE, `data` events arrive as a JSON envelope (`{ encoding, contentType, payload }`) with a base64 payload. See [Binary SSE](/docs/concepts/binary-sse).
- For exactly-once writes, send `Producer-Id`, `Producer-Epoch`, `Producer-Seq` headers and retry on network errors. The server deduplicates. See [Exactly-once writes](/docs/concepts/exactly-once-writes).
- For conditional writes, use `Stream-Seq` with a monotonically increasing token. (`If-Match` is part of the Durable Streams Protocol but is not yet implemented in Ursula.) See [Conditional writes](/docs/concepts/conditional-writes).

---

# Overview

> How Ursula realizes the Durable Streams protocol with thread-per-core shards, multi-Raft replication, and an in-memory hot ring backed by S3 cold storage.

## The problem

Building a durable stream directly on S3 hits two walls: S3 has no append (every write either pays per-PUT or batches and eats the latency), and its conditional writes are optimistic compare-and-set, so concurrent writers retry-storm with exponential latency ([Chroma's wal3 post](https://www.trychroma.com/engineering/wal3) covers the mechanics).

The market splits on which wall to lean against: [S2](https://s2.dev/blog/intro) writes through S3 Express to get under 50 ms but pays around 7x per-GB; [WarpStream](https://www.warpstream.com/blog/minimizing-s3-api-costs-with-warpstream) batches to S3 Standard and accepts 250 ms+ p50.

## How Ursula works

Ursula keeps S3 off the write path entirely. Writes go to a cluster of nodes that coordinates through [Raft consensus](https://en.wikipedia.org/wiki/Raft_(algorithm)). Once a majority of replicas has persisted a record, it is committed, with no compare-and-set races and no retry storms. S3 is the cold tier.

Two structural choices shape the rest of the design:

1. **[Thread-per-core](https://seastar.io/shared-nothing/) shard ownership.** Every stream is statically hashed to one specific CPU core on each node. That core handles all reads and writes for its streams, in order, with no cross-core synchronization on the hot path.
2. **[Multi-Raft](https://tikv.org/deep-dive/scalability/multi-raft/).** Ursula runs hundreds to thousands of small Raft groups per cluster, not one log per node.
Each stream hashes to a group, each group pins to a core. An unhealthy follower for one group does not stall traffic for another.

Together, write throughput scales with healthy cores across the cluster instead of with any single leader's bandwidth, and failure domains stay per-group rather than per-node.

## The stack at a glance

Ursula is async Rust. The pieces that show up in the diagrams below:

- **axum** is the HTTP server library that terminates public client requests.
- **tokio** is the async runtime. Each core runs its own single-threaded executor (`tokio::current_thread`) so work on that core never migrates.
- **tonic** is the gRPC library used for peer-to-peer Raft traffic between nodes.
- **OpenRaft** is the Raft consensus implementation Ursula plugs into.

The Rust-named components in diagrams (`ShardRuntime`, `GroupActor`, `GroupEngine`, `StreamStateMachine`) are the structures inside one node, glossed in the sections below.

## Interface

The API is the [Durable Streams protocol](https://durablestreams.com/), an open MIT-licensed spec published by [Electric](https://electric-sql.com/): create a stream, append, read from any offset, tail in real time over Server-Sent Events (SSE), with exactly-once write semantics. Plain HTTP plus SSE, with no custom binary protocol or mandatory client library.

See the [API overview](/docs/api/overview) and the [extensions spec](/docs/specs/extensions) for the append-batch path.

## Cluster topology

```
                Clients (HTTP / SSE)
                          │
       ┌──────────────────┼──────────────────┐
       v                  v                  v
  ┌──────────┐       ┌──────────┐       ┌──────────┐
  │  Node A  │       │  Node B  │       │  Node C  │
  └────┬─────┘       └────┬─────┘       └────┬─────┘
   cores 0..N         cores 0..N         cores 0..N
       │                  │                  │
       └─── gRPC peer-to-peer Raft traffic ──┘
                          │
                          v
             ┌────────────────────────┐
             │ Cold storage (S3 / fs) │
             │   per-stream chunks    │
             └────────────────────────┘
```

Every stream is replicated to all voting members of its Raft group. Hot data lives in memory on each replica.
Cold data lives in a shared object store any replica can read from on demand.

## Inside a node

```
HTTP / SSE request
        │
        v
route(bucket_id, stream_id)
        │
        v
    owner core
        │
        v
    Raft group
    │        │
    │        └── replicate to peer nodes
    │
    ├── hot state and live watchers
    │
    └── cold chunks in S3 / fs
```

A request enters the HTTP server, routing maps the bucket and stream name to one owner core and one Raft group, and the rest of the request runs inside that group. The group owns its hot bytes, live-tail watchers, producer dedup state, and Raft state machine, so mutable stream state does not cross cores after routing.

## The replaceable group engine

The `GroupEngine` trait is the seam between the runtime and a group's storage and replication strategy. Three engines ship today:

- **In-memory, non-replicated.** Used in tests and as a performance baseline.
- **WAL-backed Raft engine** (`--wal-dir`). OpenRaft over a per-group write-ahead log of protobuf-framed records, plus a shared per-core journal. Durable across restarts.
- **Memory-log Raft engine** (`--raft-memory`). OpenRaft with an in-memory log store. Fast iteration, no persistence.

A factory selected at startup picks which engine each group opens, leaving the runtime code identical across modes.

## Write path

```
HTTP ---> Runtime ---> core mailbox ---> GroupActor
                                             │
                                             ├─ leader? ──no─---> forward to leader (gRPC)
                                             │
                                             v yes
                                     Raft.client_write
                                             │
                                             v
                         WAL append + fsync, replicate to followers
                                             │
                                             v commit
                                    StateMachine::apply
                                             │
                                             v
                         notify_read_watchers (same turn, no lock)
                                             │
                                             v
                                  AppendResponse -> client
```

The leader writes the entry to its write-ahead log (WAL), fsyncs, and replicates to followers in parallel. Once a quorum has persisted the entry, the state machine applies it and the response goes back to the client.

Producer dedup headers ([exactly-once writes](/docs/concepts/exactly-once-writes)) are checked inside the state machine, so retries of an already-applied append return the original offset rather than appending again.
`Stream-Seq` provides a single-writer ordering guard for callers that prefer that style ([conditional writes](/docs/concepts/conditional-writes)).

## Storage layout

Each Raft group has two storage planes: a replicated log and a stream state machine.

```
                 Raft group
                     │
           ┌─────────┴─────────┐
           v                   v
    replicated log       state machine
    command order        streams in this group
    quorum durability    offsets / metadata / dedup
                               │
                     ┌─────────┴─────────┐
                     v                   v
              hot ring memory     cold chunk refs
              recent bytes        S3 / fs objects
```

The Raft log is the ordering and durability boundary for writes. A write is acknowledged only after a majority of that group's replicas persist the entry. The entry carries the stream command and metadata needed to apply it deterministically.

The stream data plane lives in the group's state machine. A group owns many streams, keyed by `(bucket_id, stream_id)`. For each stream it tracks metadata, offsets, producer dedup state, live snapshots, hot byte segments, and cold chunk references.

Recent bytes stay in an in-memory hot ring on every replica. When a group's hot bytes exceed the configured cold-flush threshold, a background worker uploads older contiguous segments to the cold backend, usually S3. After upload succeeds, Ursula commits a metadata update through Raft that publishes the cold chunk reference and advances the stream's hot start offset.

Reads assemble one logical stream from both tiers. The state machine first builds a read plan: which byte ranges are still hot and which ranges are cold chunk references. Hot bytes are copied from memory; cold ranges are fetched with object-store range reads. Any replica that has applied the cold chunk metadata can serve those cold reads, as long as all nodes point at the same S3 bucket/prefix.

## Read path and cold offload

```
HTTP ---> Runtime ---> core mailbox ---> GroupActor
                                             │
                                             ├─ leader? ──no─---> forward to leader
                                             │
                                             v yes
                    engine.read_stream_parts (under state-machine lock)
                                             │
                                             v release lock
                         GroupReadStreamParts { plan, cold_store }
                                             │
                                             v
                             tokio::spawn (off the actor turn)
                          materialize: hot bytes + S3 range reads
                                             │
                                             v
                                 ReadStreamResponse -> client
```

Under the state-machine lock the runtime computes only a read plan: which segments are still in the hot ring and which live as cold chunks in S3. Then it releases the lock. Materialization, including any S3 `GetObject` range reads, runs in a separately spawned task, so a slow cold fetch never blocks other commands waiting on the same group.

Followers serve replicated historical catch-up bytes from their local applied state when the read does not need a leader-owned access mutation. Tail-sensitive responses such as `HEAD`, open-stream `Stream-Up-To-Date: true`, and live reads stay on the leader path (see [offsets](/docs/concepts/offsets)).

## SSE live tailing

When a read finds no new data and the response carries `Stream-Up-To-Date: true`, the runtime registers a watcher inside the owning group actor. The watcher map is per-actor and mutated only inside that actor's turn, with no mutex and no awaiting while holding a lock.

When a later append commits on the same group, the actor wakes the matching watchers in the same turn, deduplicates them by request shape, and dispatches materialization through the same fast path the read uses.

For deployment topology (voting layout across availability zones, non-voting replicas, S3 prefix layout), see the [operations guide](/docs/operations).

---

# Comparisons

> How Ursula relates to other stream and storage systems for application-facing timelines.

Ursula's niche: distributed Durable Streams over HTTP, one timeline per application resource, served from a thread-per-core multi-Raft cluster.
The frame that drove the design: every other server we evaluated forces you to give up one of four things this primitive deserves to keep: open-source self-hosting, low write latency, plain S3 economics, or quorum-replicated durability. Each section below maps a competitor against that four-way frame.

## Object storage (S3 directly)

S3 is excellent for durable immutable blobs. As an append interface it isn't: every append is a new object, conditional writes are optimistic and retry-storm under contention, and live readers must poll.

Ursula uses S3 for the cold tier and snapshots, never the write path. Appends commit at Raft quorum in an in-memory hot ring; a background flush carries older segments to S3. The hot/cold boundary is invisible: one `GET` serves both.

## Durable Streams reference server (Electric)

The reference server implements the same protocol and is designed to embed and run as a single process. HA requires building another layer on top.

Ursula provides that layer natively (per-group Raft, quorum replication, transparent leader forwarding, snapshot publication, cold offload), at the cost of running a multi-node cluster. Both speak the same protocol, so clients are interchangeable.

## S2 / S2 Lite

S2 is an object-storage-backed stream API. Strong acknowledged durability comes from writing through S3 Express One Zone. Self-hosted S2 Lite is a single serving process; the managed product uses its own API.

| | Ursula | S2 |
| --- | --- | --- |
| Protocol | Durable Streams HTTP | Custom |
| Write path | Raft quorum in replicated hot ring, no S3 | S3 Express |
| Cold tier | S3 Standard | S3 Express (~7× per-GB) |
| Shape | Raft cluster | Single-process self-host (Lite) |

## Kafka / Redpanda

Kafka and Redpanda target high-throughput pipelines: topics, partitions, consumer groups. Redpanda shares one idea with Ursula: **thread-per-core**, with each core owning its partitions.

Differences:

- **Primitive.** Topic + partition + consumer group vs. one independently-addressable durable stream per resource. Ursula is built for thousands of small streams, not a few high-throughput pipelines.
- **Protocol.** Kafka wire protocol with partition-aware clients vs. plain HTTP + SSE.
- **Group placement.** Redpanda strictly owns one Raft group per core (Seastar reactor). Ursula co-schedules many small groups on one `tokio::current_thread` runtime per core, trading a bit of within-core isolation for the flexibility of group count ≫ core count.
- **Maturity.** Kafka and Redpanda have years of production hardening; Ursula is `v0.x` with intentionally minimal operations.

## etcd / single-Raft KV

etcd is strongly-consistent KV over a single Raft log: exemplary at small consistent state, but capped at one leader's serialization throughput.

Ursula's multi-Raft layout (hundreds to thousands of small groups across cores and nodes) lets aggregate throughput scale with cluster size. The cost: more groups to manage and no single-log linearizability across the namespace. Per-stream linearizability is preserved.

## When to pick which

Reach for **Ursula** when the model is many small timelines (one per resource), clients are diverse HTTP callers, and you want replay + live tail + snapshot in one primitive on commodity infrastructure with S3 Standard as the cold tier.

Reach for something else when:

- A few high-throughput topics with consumer groups -> **Kafka / Redpanda**
- Single-process deployment -> **Durable Streams reference server**
- Managed service, willing to change API -> **S2**
- Durable immutable blobs, no append/tail -> **S3 directly**
- Small consistent KV across the whole cluster -> **etcd**

---

# Streams

> Ursula's core append-only stream primitive, with naming, content type, and lifecycle.

A stream is an append-only byte sequence addressed by a URL: `/{bucket}/{stream}`. Once data is written it cannot be modified or reordered. Only new data can be appended.
## Naming

Stream IDs may contain any byte that is not `/`, `\0`, or `..`. Maximum stream ID length is 122 bytes, and the combined bucket-plus-stream path is also capped at 122 bytes. The literal name `streams` is reserved (it's the bucket-level listing endpoint).

A workspace app might use `/workspace-a/doc-123`. An agent system might use `/agents/run-2026-05-13-abc`. Anything within the rules above is fair.

## Content type

A stream's content type is set on creation (PUT) or on first append (POST) and defaults to `application/octet-stream` if no `Content-Type` header is supplied. Subsequent appends must declare the same content type, or the server rejects them with `400`.

The content type rides along on reads so clients can dispatch on it without inspecting payloads.

## Lifecycle

Every stream is in one of these states:

- **Open.** Accepts appends. The default on creation.
- **Closed.** Sealed by `Stream-Closed: true` on a POST. Readers receive an EOF signal. Further appends return `409`. Close is irreversible.
- **Expired.** Past its `Stream-TTL` or `Stream-Expires-At` deadline. Reads and writes return `404`. Expired streams are eventually garbage-collected.
- **Deleted.** Removed by `DELETE`. Reads return `404`.

`Stream-TTL` (seconds, relative) and `Stream-Expires-At` (RFC 3339 absolute) are mutually exclusive. Sending both yields `400`.

## Related

- [Buckets](/docs/concepts/buckets): how streams are grouped
- [Snapshots](/docs/concepts/snapshots): compacting a long stream
- [Bootstrap](/docs/concepts/bootstrap): efficient first-load for new clients

---

# Buckets

> Buckets group related streams into namespaces.

Streams are organized into buckets. A bucket is a namespace, like a folder that holds related streams under one URL prefix:

```
/{bucket}/{stream}
```

A collaborative editing app might use one bucket per workspace, with one stream per document: `/workspace-a/doc-1`, `/workspace-a/doc-2`. An agent platform might use one bucket per tenant: `/tenant-acme/run-xyz`.

## Naming

Bucket IDs match the regex `[a-z0-9_-]{4,64}`: 4–64 bytes, lowercase ASCII letters and digits plus `_` and `-`. Uppercase letters, `/`, `.`, and most punctuation are rejected with `400`.

The combined bucket-plus-stream URL path also has a 122-byte ceiling. With a 64-byte bucket name, you have 58 bytes left for the stream ID.

## Lifecycle

- **`PUT /{bucket}`** creates the bucket. Idempotent if it already exists.
- **`GET /{bucket}`** returns metadata.
- **`DELETE /{bucket}`** removes the bucket but only when empty. If any stream still exists, `DELETE` returns `409 Conflict` with `bucket_not_empty`. There is no cascading delete. Remove streams first.
- **`GET /{bucket}/streams`** lists streams in the bucket with optional prefix filtering and cursor-based pagination (page size up to 1000, default 1000).

## Related

- [Streams](/docs/concepts/streams)
- [API: list streams](/docs/api/list-streams)

---

# Offsets

> Position tokens inside a stream, with how to read, resume, and follow safely.

An offset is a position inside a stream. Clients use offsets to read from a specific point or resume where they left off.

## Format

Offsets are numeric (`u64` byte positions internally) but are returned to clients as 20-character zero-padded decimal strings, for example `"00000000000000000042"`, so they sort lexicographically. Treat them as opaque tokens: read the value from the server's response and pass it back unchanged.

Two special values are accepted on read requests:

- `offset=-1`: start from the very beginning of retained data (the earliest still-available offset).
- `offset=now`: start from the stream's current tail. Useful for "only new data" subscriptions.

## Response headers

After every read or append, the server returns two related headers:

- **`Stream-Next-Offset`**: always present. The numeric position to use for the next request.
- **`Stream-Cursor`**: set on live (long-poll, SSE) responses. An opaque token that bundles the stream identity and epoch with the offset, so a reconnecting client lands on the same stream version it was reading before.

For pure catch-up reads, `Stream-Next-Offset` is enough. For live tailing across reconnects, prefer `Stream-Cursor` (passed as `?cursor=`). It surfaces stream re-creation as a clean error rather than silently re-reading new data under the same name.

## Stability across snapshots

Offsets are byte positions, not sequence numbers, and they're stable across snapshot publishes. Publishing a snapshot at offset 100 doesn't renumber later offsets. It only enables the server to garbage-collect data before offset 100.

Reads to trimmed offsets return `410 Gone` with a `stream-earliest-offset` header pointing at the first still-available position.

## Related

- [Read modes](/docs/concepts/read-modes)
- [Snapshots](/docs/concepts/snapshots)
- [Bootstrap](/docs/concepts/bootstrap): recovery when offsets you remember have been trimmed

---

# Read Modes

> Compare catch-up, long-poll, and SSE reads for Ursula clients.

Streams support three read modes, all via `GET`:

- **Catch-up**: `GET /b/s?offset=-1` returns all data from the given offset immediately. Use this to sync historical data.
- **Long-poll**: `GET /b/s?offset=...&live=long-poll` returns immediately if data is available, otherwise holds the connection until new data arrives or a timeout. Good for simple polling loops.
- **SSE**: `GET /b/s?offset=...&live=sse` opens a persistent Server-Sent Events connection. The server pushes new data as it arrives. This is the recommended mode for real-time frontends (`EventSource` in the browser works out of the box).

---

# Exactly-Once Writes

> Producer identifiers, epochs, and sequence numbers deliver deduplicated appends.

Network retries can produce duplicate appends. Ursula provides server-side deduplication so your application doesn't need its own bookkeeping.

Three headers form the identity:

- **`Producer-Id`**: a stable client identifier (UUID, hostname, etc.).
- **`Producer-Epoch`**: bumped on producer restart. Must be ≥ the last epoch the server saw from this producer.
- **`Producer-Seq`**: a per-epoch sequence number. Starts at `0` for a new epoch and must increase by exactly `1` per append.

Both epoch and seq are capped at `2^53 − 1` so they survive a round-trip through JSON.

## How dedup works

The server records the highest `(epoch, seq)` accepted from each `Producer-Id` per stream. On an append:

- **Exact duplicate** (same epoch and same seq as already accepted): silently deduplicated. The response carries the existing offset.
- **Next in sequence** (`seq = last_seq + 1`): accepted.
- **Out of order** (seq skips ahead or goes backward): rejected with `409 producer_seq_conflict`. The response body includes the `expected_seq` the server wanted.
- **Epoch regression** (new epoch less than last accepted): rejected with `409`.

## Restart and epoch hygiene

When a producer crashes and restarts, **bump `Producer-Epoch`**. Otherwise the server still expects the next contiguous seq within the old epoch, and the producer (which has lost its in-memory seq counter) will collide. A new epoch starts the seq counter back at `0`.

## Per-stream state

Dedup state is scoped per stream. The same `Producer-Id` writing to two different streams maintains two independent epoch/seq counters, so you do not have to elect a single global writer per producer.

The server retains only the highest accepted `(epoch, seq)` per `(producer_id, stream)` pair. Older epochs are dropped once a higher epoch is observed.

## When you don't need this

For append-only workloads where each write is independent (event logging, audit trails), exactly-once headers are optional. Just POST without them. Add them when retries are real and double-applies would be visible.
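
## A sketch of the dedup decision

The rules listed under "How dedup works" can be modelled as a small pure function. The following Python sketch is illustrative only and is not Ursula's implementation: `check_append`, `Decision`, and the `State` dictionary are hypothetical names, and two behaviours are assumptions the text above does not spell out (an unknown producer must start at seq `0`, and a higher epoch arriving with a non-zero seq is treated as a sequence conflict).

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

MAX_SAFE = 2**53 - 1  # cap stated above so values survive a JSON round-trip


@dataclass(frozen=True)
class Decision:
    status: str                         # "accepted", "duplicate", or "conflict"
    expected_seq: Optional[int] = None  # set when the server wanted another seq


# Highest accepted (epoch, seq), keyed by (producer_id, stream).
State = Dict[Tuple[str, str], Tuple[int, int]]


def check_append(state: State, producer_id: str, stream: str,
                 epoch: int, seq: int) -> Decision:
    """Model the dedup decision for one append; updates `state` on accept."""
    if not (0 <= epoch <= MAX_SAFE and 0 <= seq <= MAX_SAFE):
        raise ValueError("epoch and seq must be in [0, 2^53 - 1]")

    key = (producer_id, stream)
    last = state.get(key)

    if last is None or epoch > last[0]:
        # New producer (assumption) or new epoch: the counter restarts at 0.
        if seq != 0:
            return Decision("conflict", expected_seq=0)
        state[key] = (epoch, 0)
        return Decision("accepted")

    last_epoch, last_seq = last
    if epoch < last_epoch:
        return Decision("conflict")            # epoch regression
    if seq == last_seq:
        return Decision("duplicate")           # retry of the last accepted append
    if seq == last_seq + 1:
        state[key] = (epoch, seq)
        return Decision("accepted")            # next in sequence
    return Decision("conflict", expected_seq=last_seq + 1)  # gap or backward
```

Because the state is keyed by `(producer_id, stream)`, the same producer writing to two streams gets two independent counters, which matches the per-stream scoping described above.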
## Related - [Conditional writes](/docs/concepts/conditional-writes): coordinate multiple writers, different problem - [API: append](/docs/api/append) --- # Conditional Writes > Coordinate writers using Stream-Seq to enforce ordering without locks or transactions. Writers appending to the same stream sometimes need coordination. Ursula currently supports one form of optimistic concurrency control via request headers, without locks or transactions. > [!NOTE] > The Durable Streams Protocol also specifies an `If-Match` ETag-based precondition for multi-writer guards. Ursula does not yet implement `If-Match`. For current deployments use `Stream-Seq` or producer epoch/seq headers ([exactly-once writes](/docs/concepts/exactly-once-writes)) for ordering and dedup. ## Stream-Seq `Stream-Seq` is a client-supplied monotonic sequence token. The server tracks the last accepted value per stream and rejects any append whose `Stream-Seq` is not lexicographically greater than the previous one. This is useful when a single logical writer wants to enforce ordering without relying on server-side ETags. For instance, an agent that numbers its steps and wants the server to reject out-of-order delivery. ## When to use - **Stream-Seq**: single-writer ordering. "Reject this if my writes arrive out of order." - **Producer-Id / Producer-Epoch / Producer-Seq** ([exactly-once writes](/docs/concepts/exactly-once-writes)): deduplicate retries from a producer that may resend the same logical append after a network hiccup or restart. - **Neither**: append-only workloads where every write is independent (e.g. event logging). Just POST. --- # Snapshots > Compact stream history and accelerate first-load for new clients. As a stream grows, replaying it from `offset=-1` becomes expensive. Snapshots solve this by storing a compacted representation of all data up to a given offset. 
``` PUT /{bucket}/{stream}/snapshot/{offset} publish a snapshot GET /{bucket}/{stream}/snapshot read the latest snapshot GET /{bucket}/{stream}/snapshot/{offset} read a specific snapshot ``` ## Who publishes Snapshots are an **application-level** operation. Ursula stores them but does not produce them. The writer that knows how to compute the merged state is responsible for publishing. A CRDT editor, for example, periodically computes the merged document and publishes it as the latest snapshot. An agent system might snapshot accumulated tool-call state. Application snapshots are separate from the Raft-internal snapshots Ursula takes for replication and recovery. The two never collide. ## Size and content type Snapshot bodies are capped at **128 MiB**. The snapshot has its own content type (separate from the stream's), stored as `snapshot_content_type` and defaulting to `application/octet-stream`. A binary CRDT document can be snapshotted under `application/octet-stream` even when the stream itself carries `application/json` deltas. ## Garbage collection Publishing a snapshot at offset N tells Ursula that data before N is no longer needed for application replay. Cold-tier chunks below N are enqueued for asynchronous GC immediately on publish. Reads to trimmed offsets return `410 Gone`. Clients should switch to the snapshot via [`/bootstrap`](/docs/concepts/bootstrap) or seek forward. ## Immutability Snapshots are immutable once published. The `DELETE /{bucket}/{stream}/snapshot/{offset}` endpoint exists for protocol-level completeness, but Ursula refuses every delete: targeting the latest snapshot returns `409`, anything else returns `404`. To replace a snapshot, publish a new one at a higher offset. 
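
Publishing at a higher offset means building the `{offset}` path segment, which the API reference describes as a 20-character zero-padded decimal token. The sketch below formats that token and assembles a snapshot URL. The helper names `snapshot_offset_token` and `snapshot_url` are hypothetical, and the base URL is the local default used elsewhere in these docs.

```bash
# snapshot_offset_token OFFSET
# Formats a numeric offset as a 20-character zero-padded decimal token.
# Pass a plain decimal without leading zeros (printf would read "042" as octal).
snapshot_offset_token() {
  printf '%020d' "$1"
}

# snapshot_url BASE BUCKET STREAM OFFSET   (hypothetical helper)
# Builds the /{bucket}/{stream}/snapshot/{offset} URL for a publish or read.
snapshot_url() {
  printf '%s/%s/%s/snapshot/%s' "$1" "$2" "$3" "$(snapshot_offset_token "$4")"
}

snapshot_url http://127.0.0.1:4437 demo hello 42; echo
```

The resulting URL can be passed to `curl -X PUT` with a `Content-Type` header and `--data-binary` body to publish, or to a plain `curl` to read the snapshot back.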
## Related - [Bootstrap](/docs/concepts/bootstrap): fetch latest snapshot plus post-snapshot updates in one request - [API: publish snapshot](/docs/api/publish-snapshot) - [API: read snapshot](/docs/api/read-snapshot) --- # Bootstrap > Initialize a new client with one request, fetching snapshot plus post-snapshot updates. Bootstrap is the new-client initialization endpoint. A single request returns everything a client needs to catch up: 1. The latest snapshot (if any) 2. All updates after the snapshot, up to the stream's current tail ``` GET /{bucket}/{stream}/bootstrap ``` The response is `multipart/mixed`: - **First part.** The snapshot. If no snapshot exists, this part is present but empty. - **Subsequent parts.** Incremental updates from the snapshot offset (or from the earliest retained offset, if no snapshot exists). Each part has its own `Content-Type`: the snapshot part uses `snapshot_content_type`. Update parts use the stream's content type. The client does not need to know in advance whether a snapshot exists. The same parsing path handles both cases. ## When to use bootstrap vs catch-up | You want… | Use | | --- | --- | | Latest state, fastest first-load | `/bootstrap` | | Replay every byte from the beginning | `GET /{b}/{s}?offset=-1` | | Resume from a known offset that's still retained | `GET /{b}/{s}?offset=N` | | Resume from an offset the server has since trimmed (`410 Gone`) | `/bootstrap`, then continue from the snapshot offset | ## Live continuation Bootstrap does not accept `?live=sse`. Combining the multipart body with an SSE event stream is rejected with `400`. To go live after bootstrap, finish reading the bootstrap response, then open a separate `GET /{b}/{s}?offset=&live=sse` (or pass the `cursor` returned by bootstrap). ## Related - [Snapshots](/docs/concepts/snapshots) - [Read modes](/docs/concepts/read-modes) --- # Durability and Consistency > Understand Ursula's durability guarantees, linearizable writes, and hot-cold storage model. 
## Durability An append is acknowledged once a **majority of voters** has replicated it. A single node failure cannot lose an acknowledged write. Acknowledged data lives in replicated hot state across the cluster, and a background worker flushes older segments to S3 on a configurable interval, after which the data inherits S3-grade durability. In a cross-region deployment with a 5-second flush interval, per-message durability is approximately 9-10 nines. The window where acknowledged-but-unflushed data is at risk from a simultaneous multi-region failure is measured in seconds. ## Consistency Writes are **linearizable**. Each Raft group serializes appends through its current leader, which assigns a total order. Catch-up reads may be served by any replica that has applied the relevant stream state. Followers can return already replicated historical bytes locally because stream positions are immutable. Tail-sensitive reads still preserve protocol-visible semantics: `HEAD`, `Stream-Up-To-Date: true`, `Stream-Closed: true`, and `offset=now` are generated by the leader path unless the follower has applied a terminal closed state. Requests that need a write-side access transition, such as expiry or TTL touch, are also routed to the leader. `Stream-Up-To-Date` on each read tells the client whether more committed data exists past `next_offset`. `false` means keep paging. ## Availability and durability (standard layout) Three voting replicas across availability zones, plus two non-voting replicas in a second region. | Property | What it means | |---|---| | **Write availability** | Writes continue as long as a majority of voters is healthy. The layout tolerates any single voting-AZ failure. Non-voting replicas hold extra copies but do not vote. | | **Read availability** | Any reachable replica can serve replicated historical catch-up data locally. 
Reads that need fresh tail metadata, leader-owned state changes, or live watcher ownership still track write availability per group, not per cluster. | | **Per-message durability** | ~9-10 nines with a 5-second flush interval. | | **Annual zero-loss probability** | ~3-4 nines. Probability of no data-loss events in a year. | --- # Binary SSE > How Ursula delivers binary payloads over the Server-Sent Events transport. The SSE wire format is text-only: every `data:` field must be valid UTF-8. For streams with text-shaped content types (`text/*`, `application/json`), SSE works out of the box: each event's `data` field carries the bytes verbatim. For other content types (`application/octet-stream`, custom binary), Ursula wraps each event in a JSON envelope and base64-encodes the payload: ```json { "encoding": "base64", "contentType": "application/octet-stream", "payload": "aGVsbG8=" } ``` The choice is automatic, determined by the stream's content type. There is no client opt-in. What you get is what the stream's content type implied. The response advertises the choice with the `Stream-Sse-Data-Encoding` header: `base64` for wrapped streams, absent for raw. ## Decoding For wrapped streams, decode each event as JSON, then base64-decode `payload`. For raw streams, the `data` field is already the payload (UTF-8 text or JSON). Browser `EventSource` works for both modes. Only the `data` handler differs. ## Related - [Read modes](/docs/concepts/read-modes) - [API: read](/docs/api/read) - [Length-prefixed framing](/docs/concepts/len-prefixed-framing): when one SSE event contains multiple application records --- # Length-Prefixed Framing > A simple convention for delimiting messages inside raw byte streams. Streams are raw byte sequences. The protocol does not impose message boundaries. Every append and every read returns whatever bytes are in flight, possibly mid-record. 
For streams carrying many small messages (CRDT operations, agent events), Ursula recommends a simple framing convention: ``` [4-byte big-endian length][payload bytes] ``` Each application-level append writes one framed record. Each read returns a concatenation of framed records that the client parses incrementally. ## Server is frame-agnostic The server **does not** validate or enforce frames. It stores and serves raw bytes. That means: - A read may end mid-frame. The client must handle partial frames and resume parsing on the next read. - An ill-formed frame (wrong length, truncated payload) is the producer's bug, not a server-rejectable error. - The framing convention is per-application. Writers and readers of a stream must agree on it. ## When you don't need framing - Streams using `application/json` typically use newline-delimited JSON or a single self-describing document. JSON has its own delimiters. - Streams holding one logical blob per stream don't need framing at all. Use framing when one stream carries many independently meaningful messages and the client wants to dispatch on each one without scanning. ## Related - [Streams](/docs/concepts/streams) - [Binary SSE](/docs/concepts/binary-sse): over SSE, each event is one delivery. Framing lets one event carry multiple records. --- # API overview > Learn the public Ursula HTTP routes for creating, appending, reading, and replaying streams. Ursula exposes a small public HTTP API for durable append-only streams. Most users only need the `/{bucket}/{stream}` route family, which maps cleanly to buckets and stream IDs. This API is Ursula's implementation of the [Durable Streams Protocol](/docs/specs/durable-stream), defined by the durable-streams community, with additional route families for compatibility and deployment needs. 
## Bucket operations

| Method | Path | Description |
| ------ | ---- | ----------- |
| `PUT` | `/{bucket}` | [Create a bucket](/docs/api/create-bucket) |

Bucket-level `GET` (metadata), `DELETE`, and `GET /{bucket}/streams` (list) are part of the Durable Streams Protocol but are not yet implemented in Ursula. Track stream existence at the application layer for now.

## Stream operations

| Method | Path | Description |
| ------ | ---- | ----------- |
| `PUT` | `/{bucket}/{stream}` | [Create a stream](/docs/api/create-stream) |
| `POST` | `/{bucket}/{stream}` | [Append data or close](/docs/api/append) |
| `POST` | `/{bucket}/{stream}/append-batch` | Batched append. See [extensions spec](/docs/specs/extensions#append-batch) |
| `GET` | `/{bucket}/{stream}` | [Read (catch-up, long-poll, SSE)](/docs/api/read) |
| `HEAD` | `/{bucket}/{stream}` | [Get stream metadata](/docs/api/head-stream) |
| `DELETE` | `/{bucket}/{stream}` | [Delete a stream](/docs/api/delete-stream) |

## Bootstrap and snapshots

| Method | Path | Description |
| ------ | ---- | ----------- |
| `GET` | `/{bucket}/{stream}/bootstrap` | [Snapshot + tail replay](/docs/api/bootstrap) |
| `GET` | `/{bucket}/{stream}/snapshot` | [Read latest snapshot](/docs/api/read-snapshot) |
| `GET` | `/{bucket}/{stream}/snapshot/{offset}` | [Read snapshot at offset](/docs/api/read-snapshot) |
| `PUT` | `/{bucket}/{stream}/snapshot/{offset}` | [Publish a snapshot](/docs/api/publish-snapshot) |

## v1 compatibility

A flat [v1 compatibility layer](/docs/api/v1-compatibility) is also available under `/v1/stream/{path}` for simpler integrations that don't need explicit bucket management.
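
The v1 compatibility page describes how a flat path maps onto the bucket/stream model: the first `/` separates bucket from stream, and a path with no `/` lands in the `_default` bucket. A minimal sketch of that rule follows. The function name `v1_split` is hypothetical, and the sketch only models the mapping, it does not call the server.

```bash
# v1_split PATH
# PATH is the part of a v1 URL after /v1/stream/.
# Prints "BUCKET STREAM" following the documented mapping rule.
v1_split() {
  case "$1" in
    */*) echo "${1%%/*} ${1#*/}" ;;   # first "/" separates bucket from stream
    *)   echo "_default $1" ;;        # no "/": stream lives in the _default bucket
  esac
}

v1_split mystream
v1_split workspace/mystream
v1_split a/b/c
```

Everything after the first `/` stays in the stream ID, so nested v1 paths become stream IDs that themselves contain slashes.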
## Common request patterns

### Create a bucket and stream

```bash
curl -X PUT http://127.0.0.1:4437/demo
curl -X PUT http://127.0.0.1:4437/demo/hello
```

### Append data

```bash
curl -X POST http://127.0.0.1:4437/demo/hello \
  -H 'Content-Type: application/octet-stream' \
  --data-binary 'hello world'
```

### Read from the beginning

```bash
curl 'http://127.0.0.1:4437/demo/hello?offset=-1'
```

### Subscribe with SSE

```bash
curl 'http://127.0.0.1:4437/demo/hello?offset=-1&live=sse'
```

## Related concepts

- [Read modes](/docs/concepts/read-modes): catch-up vs long-poll vs SSE
- [Bootstrap](/docs/concepts/bootstrap): snapshot + tail recovery
- [Snapshots](/docs/concepts/snapshots): snapshot lifecycle
- [Exactly-once writes](/docs/concepts/exactly-once-writes): producer deduplication
- [Conditional writes](/docs/concepts/conditional-writes): sequence-based writes with `Stream-Seq` (`If-Match` ETag preconditions are not yet implemented)

---

# Create bucket

> Create a bucket namespace for streams.

Bucket ID. The Durable Streams Protocol specifies `[a-z0-9_-]{4,64}`. Client-side conformance is recommended even though Ursula does not currently enforce the regex.

## Response

| Status | Meaning |
| ------ | ------- |
| `201` | Bucket created (idempotent - Ursula returns `201` whether or not the bucket existed before). |

> [!NOTE]
> In the current Ursula implementation `PUT /{bucket}` is a no-op acknowledgement and always returns `201`. Bucket existence is implicit. Streams created under any bucket name will succeed. Validation of the bucket ID (`400`) and conflict detection (`409`) are part of the Durable Streams Protocol but are not yet wired up here.

```bash Example
curl -X PUT http://127.0.0.1:4437/demo
```

---

# Create stream

> Create a new append-only stream within a bucket.

Bucket ID. The Durable Streams Protocol specifies `[a-z0-9_-]{4,64}`. Ursula does not currently enforce this regex, but client-side conformance is recommended.

Stream ID within the bucket.
Cannot contain `\0` and segments cannot equal `..`. Note: Ursula does not currently enforce the Durable Streams Protocol's 122-byte stream-ID ceiling - clients should still respect it for conformance. Content type of the initial payload (e.g. `application/json`). Becomes the stream's content type. Set to `true` to close the stream immediately after creation. Time-to-live in seconds. The stream will expire after this duration. Absolute expiration timestamp (RFC 3339). Mutually exclusive with `Stream-TTL`. Client-supplied monotonic sequence token. Rejects creates whose `Stream-Seq` is not lexicographically greater than the previous value seen for this stream. Producer identity for [exactly-once writes](/docs/concepts/exactly-once-writes). Producer epoch (must accompany `Producer-Id`). Producer sequence number (must accompany `Producer-Id`). Optional initial payload. If provided, becomes the first entry in the stream. ## Response | Status | Meaning | | ------ | ------------------------------------------------------------------------- | | `201` | Stream created. | | `200` | Stream already exists (idempotent). | | `400` | Invalid stream ID, invalid headers, or bad JSON payload. | | `409` | Stream already exists with different content type, or sequence conflict. | Response headers include `Location`, `Content-Type`, `Stream-Next-Offset`, and lifetime headers (`Stream-TTL` / `Stream-Expires-At`) when set. `Stream-Closed: true` is set if the create request also closed the stream. `ETag` is set on reads only. ```bash Create empty stream curl -X PUT http://127.0.0.1:4437/demo/hello ``` ```bash Create with initial payload curl -X PUT http://127.0.0.1:4437/demo/hello \ -H 'Content-Type: application/json' \ --data-binary '{"msg": "first entry"}' ``` ```bash Create with TTL curl -X PUT http://127.0.0.1:4437/demo/ephemeral \ -H 'Stream-TTL: 3600' ``` > [!NOTE] > If a stream with the same ID already exists and has identical configuration, the response is `200 OK` (idempotent). 
If the existing stream differs in content type, closed state, or retention, the response is `409 Conflict`. --- # Append > Append data to an existing stream, or close it. Bucket ID. Stream ID within the bucket. Must match the stream's content type (set at PUT or first POST). Required when the body is non-empty. Mismatch returns `400`. Set to `true` to close the stream after this append. Client-supplied monotonic sequence token. The server tracks the last accepted value per stream and rejects any append whose `Stream-Seq` is not lexicographically greater than the previous one. Stable producer identity (UUID, hostname, etc.) for [exactly-once writes](/docs/concepts/exactly-once-writes). Dedup state is per-stream. Producer epoch. Bumped on producer restart. Must be ≥ the last epoch the server accepted for this `Producer-Id`. A new epoch resets the seq counter. Max value `2^53 − 1`. Producer sequence number. Starts at `0` for a new epoch and must increase by exactly `1` per append. Exact `(epoch, seq)` duplicates are silently deduplicated. Max value `2^53 − 1`. The bytes to append. Must not be empty unless `Stream-Closed: true` is set (close-only request). ## Response | Status | Meaning | | ------ | ------------------------------------------------------------------ | | `204` | Append successful (default success response, no body). | | `200` | Append successful with body - returned when a `Producer-Id` was supplied and the append was not deduplicated, so the response carries producer ack headers. | | `400` | Empty body without `Stream-Closed: true`, missing content type, or bad JSON. | | `404` | Stream not found. | | `409` | Stream is already closed, or sequence/producer conflict. | | `503` | Cold-write backpressure. Retry after the duration in `Retry-After`. | Response headers include `Stream-Next-Offset` (always). When a `Producer-Id` was supplied, the server echoes the accepted `Producer-Epoch` and `Producer-Seq` so the producer can confirm what was durably recorded. 
`Stream-Closed: true` is set if this request closed the stream. `ETag` is set on reads only, not on appends. ```bash Append binary curl -X POST http://127.0.0.1:4437/demo/hello \ -H 'Content-Type: application/octet-stream' \ --data-binary 'hello world' ``` ```bash Append JSON curl -X POST http://127.0.0.1:4437/demo/hello \ -H 'Content-Type: application/json' \ --data-binary '{"event": "click", "ts": 1711234567}' ``` ```bash Close a stream curl -X POST http://127.0.0.1:4437/demo/hello \ -H 'Stream-Closed: true' ``` > [!NOTE] > Appends to JSON streams are validated and normalized. The server may coalesce multiple concurrent appends into a single batch for performance. --- # Read stream > Read data from a stream using catch-up, long-poll, or SSE modes. Bucket ID. Stream ID within the bucket. Starting offset. Use `-1` to read from the beginning, or a numeric offset. Opaque cursor token returned by a previous read. Alternative to `offset`. Alias for `cursor`. Live mode: `sse` for Server-Sent Events, `long-poll` for long-polling. Omit for catch-up read. Maximum bytes to return in a single response (catch-up and long-poll only). ## Read modes No `live` parameter. Returns all available data from the given offset immediately. ```bash curl 'http://127.0.0.1:4437/demo/hello?offset=-1' ``` `live=long-poll`. Returns immediately if data is available, otherwise holds the connection until new data arrives or a ~3 second timeout. ```bash curl 'http://127.0.0.1:4437/demo/hello?offset=42&live=long-poll' ``` `live=sse`. Opens a persistent Server-Sent Events connection. The server pushes data events as new entries are appended. Includes periodic heartbeat comments. ```bash curl 'http://127.0.0.1:4437/demo/hello?offset=-1&live=sse' ``` ## Response | Status | Meaning | | ------ | ------------------------------------------------------------- | | `200` | Data returned (catch-up or long-poll with data). | | `204` | No new data at the requested offset (catch-up only). 
| | `400` | Invalid offset or live mode. | | `404` | Stream not found or expired. | | `410` | Requested offset has been trimmed (data no longer available). | Response headers include `Stream-Next-Offset`, `Stream-Cursor`, `ETag`, `Stream-Up-To-Date`, `Stream-Closed`, and `Content-Type`. ## SSE event format In SSE mode, the server sends: - **Data events** (`event: data`): stream payload in the `data` field. For binary streams, data is base64-encoded (controlled by the `Stream-Sse-Data-Encoding` header). - **Control events** (`event: control`): JSON metadata including the current offset and stream state. - **Heartbeat comments**: periodic `:` lines to keep the connection alive through proxies. ```bash Catch-up curl 'http://127.0.0.1:4437/demo/hello?offset=-1' ``` ```bash Long-poll curl 'http://127.0.0.1:4437/demo/hello?offset=42&live=long-poll' ``` ```bash SSE tail curl 'http://127.0.0.1:4437/demo/hello?offset=-1&live=sse' ``` > [!NOTE] > See [read modes](/docs/concepts/read-modes), [binary SSE](/docs/concepts/binary-sse), and [offsets](/docs/concepts/offsets) for more details. --- # Head stream > Get stream metadata without reading its content. Bucket ID. Stream ID within the bucket. ETag for conditional request. Returns `304` if the stream state has not changed. ## Response | Status | Meaning | | ------ | ------------------------------------------------------ | | `200` | Stream found. | | `304` | Stream state unchanged (when `If-None-Match` matches). | | `404` | Stream not found or expired. | Response headers include: | Header | Description | | ------------------------ | --------------------------------------------------------- | | `Content-Type` | The stream's content type. | | `Stream-Next-Offset` | The next writable offset (= current length). | | `ETag` | Stream state ETag (encodes offset and open/closed state). | | `Stream-Closed` | Present and `true` if the stream is closed. | | `Stream-Snapshot-Offset` | Present if a snapshot exists, showing its offset. 
| | `Stream-TTL` | Remaining TTL in seconds (if a TTL was set). | | `Stream-Expires-At` | Expiration timestamp (if set). | | `Cache-Control` | `no-store`. | ```bash Example curl -I http://127.0.0.1:4437/demo/hello ``` > [!TIP] > Use `If-None-Match` with a previously received `ETag` to efficiently poll for state changes without transferring data. --- # Publish snapshot > Upload a snapshot blob at a specific stream offset. Publishes a new snapshot blob at the given offset. The snapshot replaces any previously published snapshot and enables cold storage cleanup of data before this offset. Bucket ID. Stream ID. Stream offset this snapshot represents: the 20-character zero-padded decimal token. Content type of the snapshot blob, stored separately from the stream's own content type. Defaults to `application/octet-stream`. The snapshot blob bytes. Maximum size is **128 MiB**. Larger bodies return `413`. ## Response | Status | Meaning | | ------ | --------------------------------------------------------------- | | `204` | Snapshot published successfully. | | `400` | Invalid offset or content type. | | `404` | Stream not found or expired. | | `409` | Stale publish (a newer snapshot exists) or offset out of range. | | `410` | Snapshot offset is too old (data already garbage-collected). | | `413` | Snapshot body exceeds the maximum allowed size. | Response headers include `Stream-Next-Offset` and `Stream-Snapshot-Offset`. ```bash Example curl -X PUT 'http://127.0.0.1:4437/demo/hello/snapshot/00000000000000000042' \ -H 'Content-Type: application/json' \ --data-binary '{"state": "aggregated snapshot data"}' ``` --- ## Delete snapshot `DELETE /{bucket}/{stream}/snapshot/{offset}` Attempting to delete the current visible snapshot is not allowed. | Status | Meaning | | ------ | ----------------------------------- | | `404` | No snapshot at the given offset. | | `409` | Cannot delete the current snapshot. | > [!NOTE] > Snapshots are immutable once published. 
Each publish creates a new blob with a unique key, so retries cannot overwrite a committed snapshot. After a snapshot is committed, cold storage chunks before the snapshot offset are automatically cleaned up. --- # Read snapshot > Read the latest snapshot or a snapshot at a specific offset. ## Latest snapshot `GET /{bucket}/{stream}/snapshot` redirects to the latest published snapshot's offset-specific URL. Bucket ID. Stream ID. | Status | Meaning | | ------ | ------------------------------------------------------------- | | `307` | Redirect to `/{bucket}/{stream}/snapshot/{offset}`. | | `404` | Stream not found, expired, or no snapshot has been published. | Response headers include `Location`, `Stream-Next-Offset`, `Stream-Snapshot-Offset`, and `Stream-Up-To-Date`. --- ## Snapshot at offset `GET /{bucket}/{stream}/snapshot/{offset}` returns the snapshot blob at a specific offset. Snapshot offset: the 20-character zero-padded decimal token returned by previous reads or `Stream-Snapshot-Offset` headers. | Status | Meaning | | ------ | -------------------------------------- | | `200` | Snapshot blob returned. | | `404` | Stream, snapshot, or offset not found. | Response headers include `Content-Type`, `Stream-Next-Offset`, `Stream-Snapshot-Offset`, `Stream-Up-To-Date`, and `Stream-Closed`. ```bash Follow redirect to latest curl -L 'http://127.0.0.1:4437/demo/hello/snapshot' ``` ```bash Read specific offset curl 'http://127.0.0.1:4437/demo/hello/snapshot/00000000000000000042' ``` > [!NOTE] > Snapshot reads go through a linearizable freshness check to ensure you see the latest published snapshot. If the snapshot blob hasn't replicated to the current node yet, the request may be redirected to the leader. > [!NOTE] > See [snapshots](/docs/concepts/snapshots) for the snapshot lifecycle. --- # Bootstrap > Recover full stream state from snapshot plus retained updates in a single request. 
Returns the stream's latest snapshot (if any) plus all retained updates after the snapshot offset, packed as a `multipart/mixed` response. This is the recommended way to initialize a client that needs the complete current state of a stream. Bucket ID. Stream ID within the bucket. Bootstrap does not accept `?live=sse`. Combining the multipart body with an SSE event stream is rejected with `400`. To go live after bootstrap, finish the multipart response, then open a separate `GET /{bucket}/{stream}?offset=&live=sse`. ## Response | Status | Meaning | | ------ | --------------------------------------------- | | `200` | Bootstrap response with snapshot and updates. | | `400` | Invalid query parameters (including `live=sse`). | | `404` | Stream not found or expired. | | `410` | Requested offset has been trimmed. | Response headers include: | Header | Description | | ------------------------ | ---------------------------------------------------------------- | | `Content-Type` | `multipart/mixed; boundary=...` | | `Stream-Next-Offset` | The offset after the last included update. | | `Stream-Snapshot-Offset` | The snapshot offset (or `none` if no snapshot exists). | | `Stream-Up-To-Date` | `true` if the response includes all data up to the tail. | | `Stream-Closed` | Present and `true` if the stream is closed and fully caught up. | ## Response body The body is a `multipart/mixed` message: **Snapshot part** The first part is the snapshot blob (or an empty part if no snapshot exists). **Update parts** Subsequent parts are individual update messages appended after the snapshot offset. For JSON streams, each update is a separate `application/json` part. ```bash Example curl 'http://127.0.0.1:4437/demo/hello/bootstrap' ``` > [!TIP] > After bootstrapping, switch to [SSE reads](/docs/api/read) with `live=sse` starting from the `Stream-Next-Offset` to receive real-time updates. 
> [!NOTE] > See [bootstrap](/docs/concepts/bootstrap) and [snapshots](/docs/concepts/snapshots) for the conceptual model. --- # Delete stream > Delete a stream and its data. Bucket ID. Stream ID within the bucket. ## Response | Status | Meaning | | ------ | ----------------- | | `204` | Stream deleted. | | `404` | Stream not found. | ```bash Example curl -X DELETE http://127.0.0.1:4437/demo/hello ``` > [!WARNING] > Deletion is permanent. The stream's data will be asynchronously garbage-collected after the delete is committed. --- # v1 compatibility > Flat Durable Streams protocol routes under /v1/stream/. Ursula provides a compatibility layer under `/v1/stream/` that implements the [Durable Streams Protocol](/docs/specs/durable-stream), defined by the durable-streams community, with a flat path model. This is useful when you want a simpler path structure without explicit bucket management. ## Path mapping The v1 path is mapped to the bucket/stream model using the first `/` as the separator: | v1 path | Bucket | Stream | | ------------------------------- | ---------- | ---------- | | `/v1/stream/mystream` | `_default` | `mystream` | | `/v1/stream/workspace/mystream` | `workspace`| `mystream` | | `/v1/stream/a/b/c` | `a` | `b/c` | Paths without a `/` are placed in the `_default` bucket. ## Routes These routes accept the same headers, query parameters, and body formats as their `/{bucket}/{stream}` equivalents: | Method | Path | Equivalent | | -------- | ------------------- | ----------------------------------- | | `PUT` | `/v1/stream/{path}` | [Create stream](/docs/api/create-stream) | | `POST` | `/v1/stream/{path}` | [Append](/docs/api/append) | | `GET` | `/v1/stream/{path}` | [Read](/docs/api/read) | | `HEAD` | `/v1/stream/{path}` | [Head stream](/docs/api/head-stream) | | `DELETE` | `/v1/stream/{path}` | [Delete stream](/docs/api/delete-stream) | ## Differences from bucketed routes - Buckets do not need to be created beforehand. 
Streams are created without bucket-existence validation.
- No bucket-level operations (create/get/delete bucket, list streams). Use the bucketed routes for those.
- No snapshot or bootstrap routes. Use the bucketed equivalents.
- Stream close uses the `Stream-Closed: true` header on POST, not a path suffix.

```bash Create
curl -X PUT http://127.0.0.1:4437/v1/stream/mystream
```

```bash Append
curl -X POST http://127.0.0.1:4437/v1/stream/mystream \
  -H 'Content-Type: application/octet-stream' \
  --data-binary 'hello'
```

```bash Read
curl 'http://127.0.0.1:4437/v1/stream/mystream?offset=-1'
```

```bash SSE tail
curl 'http://127.0.0.1:4437/v1/stream/mystream?offset=-1&live=sse'
```

---

# Install

> Build Ursula from source. Ships two binaries: the `ursula` server and the `ursulactl` operator CLI.

Ursula is currently installed from source. Pre-built release binaries are on the way. The build is plain Rust plus the standard C toolchain that Cargo expects, with no embedded KV dependency.

Ursula ships **two binaries** you will use side by side:

- **`ursula`**: the server daemon. Runs on every node in the cluster.
- **`ursulactl`**: the operator CLI. Lives on your control machine and talks to each node's HTTP surface. See [ursulactl](/docs/cli) for the verbs it exposes.

## Prerequisites

Required:

- Rust stable and Cargo
- a C compiler (for transitive native crates)
- `pkg-config`

On macOS:

```bash
brew install rust pkg-config
```

On Debian or Ubuntu:

```bash
sudo apt-get update
sudo apt-get install -y build-essential pkg-config curl
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

## Clone and build

```bash
git clone https://github.com/tonbo-io/ursula.git
cd ursula
cargo build --release -p ursula -p ursula-ctl
```

The resulting binaries land at `target/release/ursula` and `target/release/ursulactl`. Ship `ursula` to every node; keep `ursulactl` wherever you run operational tooling (laptop, CI runner, bastion host).
To work on the whole workspace: ```bash cargo build --workspace cargo test --workspace cargo clippy --workspace --all-targets -- -D warnings ``` `cargo build --workspace` compiles every crate. For just the server use `cargo build -p ursula`; for just the CLI use `cargo build -p ursula-ctl`. ## Verify ```bash ./target/release/ursula --help ./target/release/ursulactl --help ``` Both should print supported CLI flags. Then run a short smoke server (defaults to `127.0.0.1:4437`, in-memory engine, core count from your CPU): ```bash ./target/release/ursula & ``` For a single local node, `ursulactl` needs a one-node manifest to talk to: ```bash cat > /tmp/local.json <<'JSON' {"nodes": [{"id": 1, "http_url": "http://127.0.0.1:4437", "host": "127.0.0.1"}]} JSON ./target/release/ursulactl status --config /tmp/local.json ``` Or, for a quick eyeball without writing a manifest, hit the HTTP surface directly: ```bash curl -X PUT http://127.0.0.1:4437/demo curl -X PUT http://127.0.0.1:4437/demo/hello curl -X POST -H 'Content-Type: text/plain' --data 'hi' http://127.0.0.1:4437/demo/hello curl 'http://127.0.0.1:4437/demo/hello?offset=-1' ``` Kill the background process when you're done. ## Next - [Quick start](/docs/quick-start) — drive the HTTP API end-to-end - [Run locally](/docs/run-locally) — durability modes and flags - [Deploy a cluster](/docs/deploy-cluster) — static gRPC Raft cluster - [ursulactl](/docs/cli) — the verbs you'll use once the cluster is up - [Configure S3](/docs/configure-s3) — shared cold storage --- # Run locally > Start a single-node Ursula server and verify the HTTP API. Ursula's HTTP server is the `ursula` binary. It takes its configuration from CLI flags and a small set of environment variables - there is no config file yet. ## Default (in-process, non-replicated) With no storage flag, Ursula runs an in-process non-replicated default runtime (no Raft at all). Nothing is persisted. 
Best for state-machine smoke tests and first-impression testing: ```bash cargo run --bin ursula ``` It binds `127.0.0.1:4437`, picks a core count from your CPU, and uses the in-memory engine. ## In-memory Raft OpenRaft with an in-memory log. Fast iteration, no persistence, but the runtime exercises the real Raft path: ```bash cargo run --bin ursula -- --raft-memory ``` ## WAL-backed single node Persistent local storage. Streams survive a restart: ```bash cargo run --bin ursula -- --wal-dir ./data/wal ``` The `--wal-dir` directory will contain protobuf-framed Raft log records per group plus a shared `journal.bin` per core. Removing the directory wipes state cleanly. ## Smoke test ```bash curl -X PUT http://127.0.0.1:4437/demo curl -X PUT http://127.0.0.1:4437/demo/hello curl -X POST http://127.0.0.1:4437/demo/hello \ -H 'Content-Type: application/octet-stream' \ --data-binary 'hello' curl 'http://127.0.0.1:4437/demo/hello?offset=-1' ``` Inspect runtime metrics (placement, mailbox depth, per-core counters): ```bash curl http://127.0.0.1:4437/__ursula/metrics | jq . ``` ## Flags | Flag | Default | Notes | | ---- | ------- | ----- | | `--listen ADDR` | `127.0.0.1:4437` | TCP bind address | | `--core-count N` | `available_parallelism` | Worker threads (each pinned to its mailbox event loop) | | `--raft-group-count N` | `core_count × 16` | Total Raft groups. Higher = better stream-to-group hash spread | | `--raft-memory` | off | In-memory Raft log. 
No persistence | | `--wal-dir DIR` | none | Per-group WAL + per-core journal under `DIR` | | `--raft-log-dir DIR` | none | Durable Raft log directory (used with static gRPC cluster mode) | | `--raft-node-id ID` | none | Required for static gRPC cluster mode | | `--raft-peer ID=URL` | none (repeatable) | Static peer list for the cluster | | `--raft-cluster-config FILE` | none | JSON file equivalent of `--raft-node-id` + `--raft-peer` | | `--raft-init-membership[-per-group]` | off | One-time membership bootstrap | Storage flags (`--raft-memory`, `--wal-dir`, `--raft-log-dir`) are mutually exclusive. ## Environment variables A small set of knobs are read at startup: - `URSULA_LIVE_READ_MAX_WAITERS_PER_CORE` (default `65536`) - SSE waiter cap per core - `URSULA_COLD_MAX_HOT_BYTES_PER_GROUP` (default 64 MiB) - when a group's hot ring exceeds this, cold flush kicks in - `URSULA_COLD_BACKEND` (`fs` or `s3`), `URSULA_COLD_S3_*`, `URSULA_COLD_ROOT`, `URSULA_COLD_FLUSH_*` - cold-storage configuration. See [configure S3](/docs/configure-s3) - `URSULA_TOKIO_CONSOLE` - when set, initializes the tokio-console subscriber (requires building with the `tokio-console` feature) ## Next - [Deploy a cluster](/docs/deploy-cluster) - three-node static gRPC Raft - [Configure S3](/docs/configure-s3) - shared cold storage - [Operations](/docs/operations) - the EC2 helper and runtime introspection --- # Configuration > Every Ursula configuration knob - CLI flags and environment variables. Ursula is configured via CLI flags and environment variables. There is no TOML or YAML configuration file today. ``` defaults < environment variables < CLI flags ``` ## CLI flags | Flag | Default | Notes | | ---- | ------- | ----- | | `--listen ADDR` | `127.0.0.1:4437` | TCP bind address for the HTTP + gRPC-Raft listener | | `--core-count N` | `available_parallelism` | Worker threads (each pinned to its mailbox event loop) | | `--raft-group-count N` | `core_count × 16` | Total Raft groups. 
Higher = better stream-to-group hash spread | | `--raft-memory` | off | In-memory Raft log. Volatile | | `--wal-dir DIR` | none | Per-group protobuf Raft log records + per-core `journal.bin` for single-node persistence | | `--raft-log-dir DIR` | none | Durable Raft log directory (use with static gRPC cluster) | | `--raft-node-id ID` | none | Required for static gRPC cluster mode | | `--raft-peer ID=URL` | none | Static peer list (repeatable: one per node, including self) | | `--raft-cluster-config FILE` | none | JSON file equivalent to `--raft-node-id` + `--raft-peer` lines | | `--raft-init-membership` | off | One-time cluster-wide membership bootstrap (first start only) | | `--raft-init-membership-per-group` | off | Same, but bootstrap each Raft group independently | Storage flags (`--raft-memory`, `--wal-dir`, `--raft-log-dir`) are mutually exclusive. Static gRPC cluster mode requires either `--raft-memory` or `--raft-log-dir`. It does not accept `--wal-dir`. ## Cluster config file The `--raft-cluster-config FILE` JSON shape: ```json { "node_id": 1, "init_membership_per_group": true, "peers": [ {"node_id": 1, "url": "http://10.0.0.1:4437"}, {"node_id": 2, "url": "http://10.0.0.2:4437"}, {"node_id": 3, "url": "http://10.0.0.3:4437"} ] } ``` Same file on every host. Only `node_id` at the top level changes. ## Cold storage environment variables Cold backend is selected entirely via `URSULA_COLD_*` env vars. See [Configure S3](/docs/configure-s3) for the full reference. Minimum set: | Variable | Required when | Notes | | -------- | ------------- | ----- | | `URSULA_COLD_BACKEND` | optional | `s3` or `fs`. Default: no cold backend configured. | | `URSULA_COLD_S3_BUCKET` | `BACKEND=s3` | Bucket name. | | `URSULA_COLD_S3_REGION` | `BACKEND=s3` | AWS region. | | `URSULA_COLD_S3_ENDPOINT` | optional | Set for S3-compatible stores. | | `URSULA_COLD_S3_ACCESS_KEY_ID`, `_SECRET_ACCESS_KEY`, `_SESSION_TOKEN` | optional | Explicit credentials. 
Omit to use the AWS SDK credential chain. | | `URSULA_COLD_ROOT` | optional | Prefix for all cold keys. Defaults to `data/cold` (`fs`) or unprefixed (`s3`). | ## Flush and admission tuning | Variable | Default | Purpose | | -------- | ------- | ------- | | `URSULA_COLD_FLUSH_INTERVAL_MS` | `1000` | Background flush worker tick interval | | `URSULA_COLD_FLUSH_MIN_HOT_BYTES` | `8 MiB` | Skip flush if a group's hot ring is smaller | | `URSULA_COLD_FLUSH_MAX_BYTES` | `8 MiB` | Max bytes flushed per group per tick | | `URSULA_COLD_FLUSH_MAX_CONCURRENCY` | `4` | Parallel cold writes | | `URSULA_COLD_MAX_HOT_BYTES_PER_GROUP` | `64 MiB` | Per-group hot ceiling - appends return `503` (cold backpressure) above this | | `URSULA_LIVE_READ_MAX_WAITERS_PER_CORE` | `65536` | Hard cap on SSE waiters held per core | ## Diagnostic | Variable | Purpose | | -------- | ------- | | `URSULA_TOKIO_CONSOLE` | When set and the binary was built with the `tokio-console` feature, initializes the console subscriber on startup | | `RUST_LOG` | Standard tracing filter `ursula=info,ursula_runtime=info,ursula_raft=info` is a reasonable baseline | ## Minimal examples In-memory single node: ```bash ursula \ --listen 127.0.0.1:4437 \ --core-count 4 \ --raft-group-count 64 \ --raft-memory ``` WAL-backed single node: ```bash ursula \ --listen 127.0.0.1:4437 \ --core-count 4 \ --raft-group-count 64 \ --wal-dir ./data/wal ``` Three-node durable cluster (same command on each host, `node_id` from the cluster config): ```bash URSULA_COLD_BACKEND=s3 \ URSULA_COLD_S3_BUCKET=my-ursula-bucket \ URSULA_COLD_S3_REGION=us-east-1 \ ursula \ --listen 0.0.0.0:4437 \ --core-count 16 \ --raft-group-count 256 \ --raft-log-dir /var/lib/ursula/raft \ --raft-cluster-config /etc/ursula/cluster.json ``` See [Deploy a cluster](/docs/deploy-cluster) for the full walk-through and [Configure S3](/docs/configure-s3) for cold-tier specifics. 
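
For rough capacity planning, the flush and admission defaults listed earlier on this page imply a per-group drain ceiling. The sketch below is back-of-the-envelope arithmetic, not Ursula code: it assumes at most `URSULA_COLD_FLUSH_MAX_BYTES` leaves one group's hot ring per tick, and it ignores the shared concurrency limit and the minimum-hot-bytes threshold. The function names are hypothetical.

```python
MIB = 1024 * 1024


def max_drain_rate(flush_max_bytes: int, flush_interval_ms: int) -> float:
    """Upper bound on bytes/second one group can flush (assumed model)."""
    return flush_max_bytes * 1000 / flush_interval_ms


def seconds_until_backpressure(
    ingest_rate: float,
    hot_bytes: int = 0,
    max_hot_bytes: int = 64 * MIB,
    flush_max_bytes: int = 8 * MIB,
    flush_interval_ms: int = 1000,
):
    """Seconds until the hot ring reaches the ceiling, or None if flush keeps up."""
    drain = max_drain_rate(flush_max_bytes, flush_interval_ms)
    if ingest_rate <= drain:
        return None  # sustained ingest fits under the assumed drain ceiling
    return max(0.0, (max_hot_bytes - hot_bytes) / (ingest_rate - drain))
```

Under these assumptions, a group that ingests faster than its drain ceiling fills the remaining hot-ring headroom at the difference between the two rates, which is when appends would start returning `503`.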
--- # Configure cold storage > Configure Ursula cold storage with S3-compatible object storage or a local filesystem. Ursula keeps recent data in an in-memory hot ring on every replica and flushes older segments and snapshot blobs to a cold backend. Multi-node clusters should use S3 or an S3-compatible object store so every replica can read chunks flushed by any leader. Configuration is entirely environment-variable based. There is no config file. ## S3 backend ```bash export URSULA_COLD_BACKEND=s3 export URSULA_COLD_S3_BUCKET=my-ursula-bucket export URSULA_COLD_S3_REGION=us-east-1 export URSULA_COLD_ROOT=ursula-prod-20260518 # optional prefix ``` Standard AWS credential discovery applies - Ursula reads credentials via the AWS SDK chain (instance profile, env vars, profile, etc.). Explicit credentials can be supplied via: ```bash export URSULA_COLD_S3_ACCESS_KEY_ID=AKIA... export URSULA_COLD_S3_SECRET_ACCESS_KEY=... export URSULA_COLD_S3_SESSION_TOKEN=... # optional, for STS ``` For S3-compatible stores (MinIO, R2, etc.): ```bash export URSULA_COLD_S3_ENDPOINT=http://127.0.0.1:9000 ``` ## Filesystem backend For single-node smoke testing only - multi-node clusters cannot share filesystem chunks across hosts: ```bash export URSULA_COLD_BACKEND=fs export URSULA_COLD_ROOT=./data/cold # optional; defaults to data/cold ``` ## Flush tuning The cold-flush worker runs in the background and decides when to move hot segments to the cold backend. Defaults are conservative. Tune them under load. 

| Variable | Default | Meaning |
| -------- | ------- | ------- |
| `URSULA_COLD_FLUSH_INTERVAL_MS` | `1000` | Background tick interval |
| `URSULA_COLD_FLUSH_MIN_HOT_BYTES` | `8 MiB` | Don't flush a group whose hot ring is below this size |
| `URSULA_COLD_FLUSH_MAX_BYTES` | `8 MiB` | Max bytes flushed per group per tick |
| `URSULA_COLD_FLUSH_MAX_CONCURRENCY` | `4` | Parallel cold writes in flight |
| `URSULA_COLD_MAX_HOT_BYTES_PER_GROUP` | `64 MiB` | Backpressure ceiling: appends to a group above this size return `503` until cold flush catches up |

A typical benchmark profile drops the interval to `200ms`, raises concurrency to `32`, and bumps the per-group ceiling.

## Operating notes

- The `URSULA_COLD_ROOT` prefix is prepended to every cold key. Use a date-stamped value (`ursula-prod-20260518T000000Z`) when running benchmarks so you can clean up afterwards without touching production data.
- `scripts/ursula_ec2.py --config cluster.json cleanup-s3 --root ` deletes a single root prefix from the bucket configured in the manifest. There is no built-in retention policy beyond Ursula's snapshot GC.
- Cold reads can be served by any replica that has applied the stream metadata referencing the cold chunks. All replicas must point at the same shared bucket so follower reads and post-leadership-change reads can fetch the same objects on demand.

---

# Deploy a cluster

> Run a three-node Ursula cluster with static-membership Raft and shared cold storage.

Production-style Ursula runs as a static-membership Raft cluster: a fixed list of nodes with stable IDs, gRPC peer-to-peer Raft traffic, and a shared S3 cold backend. The migration benchmark target is three voting nodes across availability zones.
## Topology Each node needs: - a stable Raft `node_id` - a TCP listen address (the same address every peer dials over gRPC) - the full peer list (including itself) - the same storage mode (`--raft-memory` or `--raft-log-dir DIR`) on every peer - a one-time membership initializer on the first start (`--raft-init-membership-per-group`) - a shared cold backend (filesystem path for single-host smoke, S3 for production) There is no separate "advertise" address. The `--listen` address must be reachable by other peers. ## Cluster config file The `--raft-cluster-config` flag accepts a small JSON file shared across all nodes - only `node_id` differs per host: ```json { "node_id": 1, "init_membership_per_group": true, "peers": [ {"node_id": 1, "url": "http://10.0.0.1:4437"}, {"node_id": 2, "url": "http://10.0.0.2:4437"}, {"node_id": 3, "url": "http://10.0.0.3:4437"} ] } ``` Same file on each host, only the top-level `node_id` field changes. ## Start node 1 ```bash ursula \ --listen 0.0.0.0:4437 \ --core-count 16 \ --raft-group-count 256 \ --raft-log-dir /var/lib/ursula/raft \ --raft-cluster-config /etc/ursula/cluster.json ``` `init_membership_per_group: true` only needs to be present on the very first start of a fresh cluster. After that, Ursula remembers membership. Flip it to `false` (or remove it) for subsequent starts. ## Start nodes 2 and 3 Same command, same JSON file with `node_id` adjusted to 2 and 3 respectively. Use `--raft-memory` instead of `--raft-log-dir` if you want a non-durable test cluster. 
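
Because the cluster config is identical on every host except for the top-level `node_id`, the per-node files can be generated from one peer list. The following is a minimal sketch, assuming the JSON shape shown above; `render_cluster_configs` is a hypothetical helper and not part of Ursula or `ursulactl`.

```python
import json


def render_cluster_configs(peer_urls: dict[int, str], first_start: bool) -> dict[int, str]:
    """Return {node_id: JSON text} with one config per node in the peer list."""
    # The peer list is shared verbatim by every node and includes the node itself.
    peers = [{"node_id": nid, "url": url} for nid, url in sorted(peer_urls.items())]
    configs = {}
    for nid in sorted(peer_urls):
        config = {
            "node_id": nid,  # the only field that differs per host
            "init_membership_per_group": first_start,  # true only on the very first start
            "peers": peers,
        }
        configs[nid] = json.dumps(config, indent=2)
    return configs


if __name__ == "__main__":
    urls = {n: f"http://10.0.0.{n}:4437" for n in (1, 2, 3)}
    for node_id, text in render_cluster_configs(urls, first_start=True).items():
        print(f"# cluster.json for node {node_id}\n{text}\n")
```

Rendering the files again with `first_start=False` gives the configs to use on later restarts, once membership has been bootstrapped.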

## Verify with ursulactl

Once daemons are up on every node, point [`ursulactl`](/docs/cli) at a one-file manifest of the cluster and block until each Raft group has elected a leader:

```bash
cat > cluster-manifest.json <<'JSON'
{
  "nodes": [
    {"id": 1, "http_url": "http://10.0.0.1:4437", "host": "10.0.0.1"},
    {"id": 2, "http_url": "http://10.0.0.2:4437", "host": "10.0.0.2"},
    {"id": 3, "http_url": "http://10.0.0.3:4437", "host": "10.0.0.3"}
  ]
}
JSON
ursulactl wait-ready --config cluster-manifest.json --expected-groups 256
ursulactl status --config cluster-manifest.json
```

`wait-ready` returns non-zero with a single-line reason if the timeout elapses. `status` prints one line per node showing the Raft group count and per-leader group counts as observed from that node's metrics. Healthy clusters report the same distribution from every reporter.

If you don't have `ursulactl` handy, the raw metrics endpoint works as a fallback:

```bash
for host in 10.0.0.1 10.0.0.2 10.0.0.3; do
  curl -s "http://$host:4437/__ursula/metrics" | jq '.raft_groups | length'
done
```

## Storage exclusivity

These flags are mutually exclusive. Pick one storage mode per cluster:

- `--raft-memory` - fast and volatile. State does not survive a restart.
- `--raft-log-dir DIR` - durable gRPC-Raft mode (recommended for clusters).
- `--wal-dir DIR` - single-node WAL mode. **Not compatible with static gRPC Raft.**

## Cold storage

A shared object store is recommended for any multi-node deployment because peers need to read each other's flushed chunks. Configure via environment variables on every node:

```bash
URSULA_COLD_BACKEND=s3
URSULA_COLD_S3_BUCKET=my-ursula-bucket
URSULA_COLD_S3_REGION=us-east-1
URSULA_COLD_ROOT=ursula-prod-20260518
```

See [configure S3](/docs/configure-s3) for the full set of `URSULA_COLD_*` variables.

## Operating the cluster

The first tool for day-2 work is [`ursulactl`](/docs/cli):

- `ursulactl restart` - drain-aware rolling restart with applied-index catch-up gates.
- `ursulactl status` — leadership distribution per node. - `ursulactl wait-ready` — gate scripts on group + leader counts. For SSH/AWS-side plumbing — pushing binaries, writing systemd units, EC2 Instance Connect, S3 cleanup — use `scripts/ursula_ec2.py`. See [operations](/docs/operations) for the full split. The HTTP admin surface underneath both tools is small and stable enough to script against directly: - `GET /__ursula/metrics` — per-node JSON snapshot. - `POST /__ursula/raft/{group_id}/snapshot` — manually trigger a Raft snapshot. - `POST /__ursula/raft/{group_id}/purge` — purge stale log entries. - `POST /__ursula/raft/{group_id}/learners/{node_id}` — add a learner (non-voting) replica. - `POST /__ursula/raft/{group_id}/leader/transfer/{node_id}` — hand off leadership to another voter (the primitive `ursulactl restart` builds on). - `POST /__ursula/flush-cold/{bucket}/{stream}` — force a cold flush for one stream. ## Limits in the current build - Membership changes (adding/removing voters after bootstrap) are not yet exposed as a routine workflow. Learners can be added via the admin endpoint above. - `ursulactl restart` packages the safe rolling restart loop (drain → restart-cmd → wait-ready); version-upgrade tooling that diffs binaries before restart is not yet packaged. - There is no zero-downtime config reload. Restart the process to pick up new flags or env vars. --- # Security > What Ursula does and does not handle in the security model, and how to deploy it safely. > [!WARNING] > Ursula does not terminate TLS, authenticate clients, or restrict admin endpoints. Treat the listening port as fully trusted. Run it on a private network behind a reverse proxy that owns TLS termination and request authentication. The current `v0.x` security model is deliberately narrow. Ursula is built to slot behind your existing edge layer, not to be one. 
## What Ursula does - **Quorum-acknowledged writes.** An append is acknowledged only after a majority of voters has replicated it. - **Per-group backpressure.** When a group's hot ring exceeds `URSULA_COLD_MAX_HOT_BYTES_PER_GROUP`, appends to that group return `503` with `Retry-After` until cold flush catches up. Per-group, not global or per-client. - **Stream-level isolation.** Streams hash to disjoint Raft groups and disjoint owner cores. A hot stream on one group cannot starve writes on a different group on a different core. ## What Ursula does not do Handle the following outside Ursula: - **TLS / HTTPS.** The public listener serves plain HTTP. No built-in `rustls`. - **Inter-node encryption.** Peer gRPC (Raft heartbeats, append-entries, snapshots, forwarded writes) runs over h2c. Peers must share a private network. - **API authentication.** No bearer-token, API-key, mTLS, or signed-request validation on the public API. Any caller with network reach can create buckets, append, read, delete. - **Authorization / multi-tenancy.** No per-user, per-bucket, or per-scope ACLs. Enforce isolation upstream. - **Admin endpoint isolation.** `/__ursula/metrics`, `/__ursula/flush-cold/*`, `/__ursula/raft/*`, and the public stream endpoints share the same listener with no auth gate. - **Per-client rate limiting.** A single noisy client can saturate a core's mailbox or a group's hot ring. - **Health/readiness endpoints.** No `/healthz` or `/readyz`. Use `/__ursula/metrics` as a process-alive probe (it serves only after the runtime initializes). - **At-rest encryption.** Hot ring is in memory. WAL and Raft log directories live on disk in plaintext. Use full-disk encryption at the host level. CORS is permissive (`Access-Control-Allow-Origin: *`). Restrict at the proxy for browser traffic. 
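
Since there is no `/healthz`, a process-alive probe has to go through `/__ursula/metrics`. The sketch below is one way to do that from Python with only the standard library. It assumes the endpoint answers `200` with a JSON body once the runtime is up; `is_alive` is a hypothetical helper, and the check says nothing about Raft readiness.

```python
import json
import urllib.error
import urllib.request


def is_alive(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the node serves a parseable metrics snapshot."""
    url = base_url.rstrip("/") + "/__ursula/metrics"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return False
            json.loads(resp.read())  # assumed: the snapshot is a JSON document
            return True
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, bad URL, or non-JSON body.
        return False


if __name__ == "__main__":
    print(is_alive("http://127.0.0.1:4437"))
```

Run a probe like this from inside the private network, since the checklist below recommends denying `/__ursula/*` to public traffic.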

## Recommended deployment

```
        Untrusted internet
                 │
                 v
      ┌─────────────────────┐
      │    Reverse proxy    │   TLS, authn, per-client
      │ (nginx / Envoy / …) │   rate limiting, CORS
      └──────────┬──────────┘
                 │  plain HTTP, private network
    ┌────────────┼────────────┐
    v            v            v
┌────────┐   ┌────────┐   ┌────────┐
│ Ursula │   │ Ursula │   │ Ursula │
│  node  │   │  node  │   │  node  │
└────────┘   └────────┘   └────────┘
    ↕ gRPC h2c on private network (Raft replication)
```

### Checklist

- **Bind to the private interface.** `--listen 10.0.0.X:4437` or a security group / firewall so the listener is unreachable from public networks.
- **Terminate TLS at the proxy.** Ursula stays plain HTTP on the internal side.
- **Authenticate at the proxy.** Validate the caller (OAuth2, mTLS, signed requests) and reject unauthenticated traffic before it reaches Ursula.
- **Block admin paths from public traffic.** Deny `/__ursula/*` from the public listener; allow it only on an internal or ops network.
- **Use IAM roles for S3.** Omit the static `URSULA_COLD_S3_ACCESS_KEY_ID` / `_SECRET_ACCESS_KEY` and let the AWS SDK credential chain discover credentials.
- **Encrypt data volumes.** Apply full-disk encryption to `--wal-dir` / `--raft-log-dir`.
- **Keep peer traffic private.** Never route gRPC peer traffic across the public internet.

## Reporting vulnerabilities

Open a GitHub Security Advisory on [tonbo-io/ursula](https://github.com/tonbo-io/ursula/security/advisories) rather than a public issue.

---

# Operations

> Day-2 surfaces around `ursulactl`: metrics, admin endpoints, SSH-side lifecycle, S3 cleanup, backup posture, and logs.

The first tool to reach for is [`ursulactl`](/docs/cli). It covers the raft-aware verbs operators run most often: rolling restart, status, readiness gate.
This page covers everything **around** it: the metrics shape ursulactl reads, the admin endpoints it (and your custom tooling) can call, the SSH-side lifecycle that still lives in Python, and the operational policies ursulactl does not encode (backups, log levels, S3 cleanup). ## Tooling map | Surface | When to reach for it | |---------|---------------------| | [`ursulactl`](/docs/cli) | Day-2 cluster ops over HTTP: restart, status, wait-ready. | | `scripts/ursula_ec2.py` | SSH/IAM/EC2 lifecycle: push binaries, write systemd units, S3 cleanup, drive the benchmark client. | | `/__ursula/metrics` and the `/__ursula/raft/...` admin endpoints | Custom tooling. ursulactl uses these underneath; the surface is small and stable enough to script directly. | There is no Prometheus scrape and no general-purpose orchestrator yet. ## Metrics ```bash curl http://127.0.0.1:4437/__ursula/metrics | jq . ``` The JSON snapshot covers per-core mailbox depth, append/read counters, latency histograms, per-group leader and `last_applied`, hot/cold bytes, cold-flush backlog, HTTP status counters, live-read watchers, and cold-write admission state. Start here when triaging slowness, lag, or `503`s. `ursulactl status` is a friendlier read of the leadership-related fields across the whole cluster. ## Admin endpoints These are the primitives `ursulactl` and any custom operator tooling builds on. Each call is local to one node; to act on every group, loop over the IDs in metrics. 
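
The "loop over the IDs in metrics" step can be sketched in Python. This is a minimal illustration, assuming the metrics snapshot exposes a `raft_groups` array whose entries carry an `id` field (the jq filters used elsewhere in these docs suggest that shape). `admin_urls_for_all_groups` is a hypothetical helper, not part of Ursula or `ursulactl`; it only builds URLs and leaves the `POST` calls to the caller.

```python
import json

# Per-group admin actions that take no extra path argument.
ALLOWED_ACTIONS = ("snapshot", "purge")


def admin_urls_for_all_groups(base_url: str, metrics_json: str, action: str) -> list[str]:
    """Build one admin URL per Raft group listed in a node's metrics snapshot."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    metrics = json.loads(metrics_json)
    # Assumed shape: {"raft_groups": [{"id": 0, ...}, {"id": 1, ...}, ...]}
    group_ids = sorted(group["id"] for group in metrics["raft_groups"])
    root = base_url.rstrip("/")
    return [f"{root}/__ursula/raft/{gid}/{action}" for gid in group_ids]


if __name__ == "__main__":
    sample = '{"raft_groups": [{"id": 1}, {"id": 0}]}'
    for url in admin_urls_for_all_groups("http://127.0.0.1:4437", sample, "snapshot"):
        print("POST", url)
```

Because each call is local to one node, a real script would fetch the snapshot from the node it intends to act on and send the requests to that same node.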

```bash
# Force a cold flush for one stream (skip the timer)
curl -X POST http://127.0.0.1:4437/__ursula/flush-cold/demo/hello

# Trigger a Raft snapshot for one group
curl -X POST http://127.0.0.1:4437/__ursula/raft/42/snapshot

# Purge log entries below the last snapshot index for one group
curl -X POST http://127.0.0.1:4437/__ursula/raft/42/purge

# Add a learner (non-voting replica) to one group
curl -X POST http://127.0.0.1:4437/__ursula/raft/42/learners/4

# Hand leadership of one group to another voter (used by `ursulactl restart`)
curl -X POST http://127.0.0.1:4437/__ursula/raft/42/leader/transfer/2
```

The leader-transfer endpoint refuses with `409 Conflict` if the node receiving the request isn't the group's current leader, and with `400 Bad Request` if the target isn't a voter. This is why `ursulactl` is the safer way to chain these calls: it consults metrics first.

## SSH-side lifecycle

`scripts/ursula_ec2.py` drives a static EC2 manifest (instance IDs, IPs, port, binary path, cold env, peer list) over SSH. It complements `ursulactl`: push binaries and start daemons here, then switch to `ursulactl` for raft-aware verbs.

```bash
# Push a fresh binary to every server in the manifest
python3 scripts/ursula_ec2.py --config cluster.json upload-binary \
  --target servers --local ./target/release/ursula --remote /tmp/ursula

# Bring daemons up, then gate on readiness via ursulactl
python3 scripts/ursula_ec2.py --config cluster.json start
ursulactl wait-ready --config cluster.json --expected-groups 256

# Stop the cluster (kills the pid recorded in the configured pid file)
python3 scripts/ursula_ec2.py --config cluster.json stop

# Drive the benchmark workload from the configured client host
python3 scripts/ursula_ec2.py --config cluster.json perf-many \
  --processes 4 --bucket-prefix benchcmp-mp
```

`perf-many` rotates entrypoints across service nodes by default; use `--target-mode first` only to reproduce a single-ingress run.
`stop` kills only the recorded PID rather than running `pkill`, because broad patterns can match the SSH cleanup command itself. If a pid file is stale, kill by hand. The Python script still exposes `status` and `wait-ready`, kept around for environments where `ursulactl` isn't deployed yet. They print the same numbers but via SSH-curl — prefer `ursulactl` when you have a choice. Verbs scheduled to migrate to `ursulactl` once SSH transport lands: `upload-binary`, `install-binary`, `install-service`, `install-chaos-agent`, `install-faultd`, `deploy-chaos`. AWS deployment scaffolding (IAM / EC2 lifecycle / security groups) stays in Python permanently. ## Cleaning S3 ```bash python3 scripts/ursula_ec2.py --config cluster.json cleanup-s3 \ --root ursula-test-20260518T000000Z ``` Deletes everything under one `URSULA_COLD_ROOT` prefix in the manifest's bucket. Run after benchmark sweeps. ## Backup and disaster recovery > [!WARNING] > Ursula has no backup or restore tool: no `--export`, no cluster dump, no "restore from snapshot file". Plan recovery accordingly. What you can rely on: - **Quorum durability.** Acknowledged writes survive as long as a majority of voters survives. Three voters across AZs tolerate any single-AZ outage. - **Cold-tier durability.** Once flushed to S3, a chunk inherits S3-grade durability. The unflushed window is bounded by the flush interval (seconds by default). - **No on-disk migration.** `v0.x` does not promise stable on-disk formats. The runtime won't refuse to start on stale data, but it won't migrate either. Cross-version upgrades currently mean rebuild + replay from external truth. Node-level loss: replace the host with the same `node_id` and cluster config; it rehydrates from peers. Use `ursulactl wait-ready` afterwards to confirm the replacement is voting and caught up before declaring the recovery done. 
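
The quorum-durability guarantee above is plain majority arithmetic. A small sketch makes the numbers concrete; the helper names are illustrative and not Ursula APIs.

```python
def quorum_size(voters: int) -> int:
    """Smallest majority of a voter set."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority still survives."""
    return voters - quorum_size(voters)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(f"{n} voters: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} lost")
```

With three voters spread one per availability zone, the quorum is two, so losing any single zone still leaves a majority. A fourth voter raises the quorum to three without raising the number of failures tolerated.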
Total-cluster loss: the durable data on S3 is what you have, and there is no tooling yet to materialize a fresh cluster from those objects. ## Logs `RUST_LOG=ursula=info,ursula_runtime=info,ursula_raft=info` is the baseline. Bump to `debug` for one crate when chasing a subsystem: ```bash RUST_LOG=ursula_raft=debug ./target/release/ursula ... ``` `debug` is verbose under sustained load; redirect to a file. --- # Troubleshooting > Common Ursula failure modes, what they look like, how to diagnose them, and what to do. Start at `/__ursula/metrics`. Most operational symptoms have a clear signal in the JSON snapshot. ## Diagnostic surface ```bash curl http://NODE:4437/__ursula/metrics | jq . ``` Useful jq selectors: | What you need | jq filter | | ------------- | --------- | | Per-group leader / term | `.raft_groups[] | {id, leader_id, current_term, last_applied}` | | Hot bytes per group | `.raft_groups[] | {id, hot_bytes_total}` | | Cold backpressure events | `.cold_backpressure_events_total` | | Per-core mailbox depth | `.cores[] | {core_id, mailbox_pending}` | | Live-read watcher counts | `.cores[] | {core_id, live_read_watchers_active}` | | HTTP error counters | `.http.responses_by_status` | Ursula does **not** expose `/healthz`, `/readyz`, Prometheus `/metrics`, or `/cluster/status`. The JSON snapshot is the source of truth. ## Startup failures ### `use only one of --wal-dir or --raft-log-dir` Storage flags are mutually exclusive. Pick one of `--raft-memory`, `--wal-dir`, `--raft-log-dir`. ### `static gRPC Raft does not support --wal-dir` `--wal-dir` is the standalone WAL runtime. Cluster mode needs `--raft-log-dir` or `--raft-memory`. ### `static gRPC Raft requires --raft-node-id` / `--raft-peer ID=URL` Any of `--raft-node-id`, `--raft-peer`, `--raft-init-membership`, or `--raft-cluster-config` flips Ursula to static gRPC mode and requires the full peer list plus local node ID. Use `--raft-cluster-config FILE` to avoid repeating peer lines on every host. 
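
The startup checks described so far can be approximated before launching the daemon. The sketch below is a hypothetical pre-flight helper, not Ursula's own validation code: the flag names come from the flag tables, the rules paraphrase the errors above, and the returned messages are this sketch's wording rather than Ursula's exact output.

```python
STORAGE_FLAGS = ("--raft-memory", "--wal-dir", "--raft-log-dir")
# Flags that, per the section above, switch Ursula into static gRPC mode.
STATIC_TRIGGERS = (
    "--raft-node-id",
    "--raft-peer",
    "--raft-init-membership",
    "--raft-cluster-config",
)


def preflight(flags: set[str]) -> list[str]:
    """Return a list of likely startup problems for a set of flag names."""
    problems = []
    storage = [flag for flag in STORAGE_FLAGS if flag in flags]
    if len(storage) > 1:
        problems.append("storage flags are mutually exclusive: " + ", ".join(storage))
    if any(flag in flags for flag in STATIC_TRIGGERS):
        if "--wal-dir" in flags:
            problems.append("static gRPC Raft does not support --wal-dir")
        # A cluster config file stands in for --raft-node-id plus --raft-peer.
        if "--raft-cluster-config" not in flags:
            missing = [f for f in ("--raft-node-id", "--raft-peer") if f not in flags]
            if missing:
                problems.append("static gRPC Raft requires " + " and ".join(missing))
    return problems


if __name__ == "__main__":
    print(preflight({"--raft-node-id", "--wal-dir"}))
```

An empty list means none of these particular rules were tripped; it does not guarantee that the server will start.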
### `--raft-peer must include this node id N` The local `--raft-node-id` must appear in the peer list. The peer list always includes self. ### `URSULA_COLD_S3_BUCKET is required when URSULA_COLD_BACKEND=s3` Leave `URSULA_COLD_BACKEND` unset if you don't want cold storage. ### Port already in use `--listen` is taken. Public HTTP and inter-node gRPC share the same port. ### I/O error on `--wal-dir` or `--raft-log-dir` The directory must be writable. No lock file to clear, Ursula trusts the directory. ## Bootstrap and cluster join ### A fresh cluster never elects leaders `/__ursula/metrics` on each node should show non-zero `leader_id` and a stable `current_term`. If groups stay leaderless: - First start must include `--raft-init-membership-per-group` (or `--raft-init-membership`) on every voting node. - Every peer URL must be reachable from every node. - All peers must use the same storage mode. Mismatch causes silent join failures. ### Replacing or restarting a node leaves it leaderless Restart with the **same** `--raft-node-id` and `--raft-cluster-config`. A fresh storage directory means rehydrating from peers, which can be slow on cold-storage-only history. ## Write path ### `503` with `Retry-After` on append Cold-write backpressure for that group. Either the hot ring exceeds `URSULA_COLD_MAX_HOT_BYTES_PER_GROUP` (cold flush isn't keeping up), or the cold backend is slow or erroring out. Raise the per-group ceiling, lower `URSULA_COLD_FLUSH_INTERVAL_MS`, raise `URSULA_COLD_FLUSH_MAX_CONCURRENCY`, or fix the cold backend. ### `409 Conflict` with `producer-expected-seq` / `producer-received-seq` Out-of-order producer headers. Response tells you what was expected: - `producer-expected-seq: N` is the next allowed sequence - `producer-received-seq: M` is what the client sent Common causes: - Producer restart without bumping `Producer-Epoch`. Bump every restart. - Two writers share the same `Producer-Id`. Use distinct IDs. - A retry skipped a sequence number. 
Producer sequences must be contiguous within an epoch. ### `409 Conflict` from `Stream-Seq` The supplied `Stream-Seq` is not lexicographically greater than the last accepted value. Re-read and re-derive, or use producer dedup instead. ### `404 Not Found` on `POST /{bucket}/{stream}` The stream hasn't been created. Create with `PUT /{bucket}/{stream}` first; there is no implicit creation on POST. ### `400 Bad Request` on JSON appends JSON streams are parsed and normalized on the server. Malformed JSON or empty arrays without `allow_empty_array` are rejected. Send valid JSON or switch to `application/octet-stream`. ## Read path ### `Stream-Up-To-Date: false` More committed data exists past this response. Page forward with `Stream-Next-Offset`. Not an error. ### SSE drops after 30-60 s idle A proxy or load balancer is closing idle TCP. Raise its idle timeout or enable TCP keepalive on the proxy-to-Ursula leg. Ursula doesn't emit SSE keepalive comments. ### SSE on a binary stream returns base64 JSON Expected. SSE wire format is text-only, so binary streams are wrapped in JSON and the `Stream-Sse-Data-Encoding: base64` header signals it. See [binary SSE](/docs/concepts/binary-sse). ### Reads stall for hundreds of ms for cold offsets The first read of a cold offset triggers an S3 `GetObject` range read. It runs off the actor turn so it doesn't block other commands, but the request still pays S3 latency. ## Replication and consensus ### A group has no leader (`leader_id == 0`) Either an election is in progress or quorum is lost for that group. Only streams hashed to that group are affected. - Multiple nodes briefly claim leader: election is flapping. Check peer reachability and CPU pressure. - No node claims leader: fewer than `n/2+1` voters are reachable. Restore reachability. ### One follower lags behind State-machine apply runs on the follower's owner core. If CPU is pinned, `last_applied` trails. Check: - Per-core mailbox depth (sustained queueing means saturation). 
- Disk pressure if `--raft-log-dir` is set (fsync latency).
- A pathological hot stream forcing constant cold flushes (`hot_bytes_total` per group).

### Manually trigger a snapshot

```bash
curl -X POST http://NODE:4437/__ursula/raft/{group_id}/snapshot
curl -X POST http://NODE:4437/__ursula/raft/{group_id}/purge
```

Purge only after the snapshot replicates. Otherwise a slow follower may need log entries you just dropped.

## Cold flush issues

### Writes succeed but data never reaches S3

- `URSULA_COLD_BACKEND` not set, or set to `fs` when you expected `s3`.
- IAM permissions missing. Required: `s3:GetObject`, `s3:PutObject`, `s3:ListBucket`, `s3:DeleteObject`.
- `URSULA_COLD_S3_ENDPOINT` unreachable.
- Cold flush worker stalled (`cold_backpressure_events_total` climbing).

If hot bytes grow unbounded, you'll eventually hit `URSULA_COLD_MAX_HOT_BYTES_PER_GROUP` and writes start returning `503`. Fix the backend before that.

## Still stuck?

Open a [GitHub issue](https://github.com/tonbo-io/ursula/issues) with:

- `/__ursula/metrics` from every node (with timestamps)
- CLI flags per node
- `URSULA_*` env vars (redact credentials)
- Last ~200 log lines at `RUST_LOG=ursula_runtime=debug,ursula_raft=debug,ursula=info`

---

# Durable Streams Protocol

> The base HTTP protocol specification for creating, appending to, and reading from durable, append-only byte streams.

> [!NOTE]
> This page is a verbatim mirror of the upstream Durable Streams Protocol specification, authored by ElectricSQL. The authoritative source is [github.com/durable-streams/durable-streams](https://github.com/durable-streams/durable-streams/blob/main/PROTOCOL.md). Please open issues and pull requests against that repository, not Ursula's. Ursula's own [extensions](/docs/specs/extensions) are documented separately.
**Document:** Durable Streams Protocol
**Version:** 1.0
**Author:** ElectricSQL

---

## Abstract

This document specifies the Durable Streams Protocol, an HTTP-based protocol for creating, appending to, and reading from durable, append-only byte streams. The protocol provides a simple, web-native primitive for applications requiring ordered, replayable data streams with support for catch-up reads, live tailing, and explicit stream closure (EOF). It is designed to be a foundation for higher-level abstractions such as event sourcing, database synchronization, collaborative editing, AI conversation histories, and finite response streaming.

## Copyright Notice

Copyright (c) 2025 ElectricSQL

## Table of Contents

1. [Introduction](#1-introduction)
2. [Terminology](#2-terminology)
3. [Protocol Overview](#3-protocol-overview)
4. [Stream Model](#4-stream-model)
   - 4.1. [Stream Closure](#41-stream-closure)
5. [HTTP Operations](#5-http-operations)
   - 5.1. [Create Stream](#51-create-stream)
   - 5.2. [Append to Stream](#52-append-to-stream)
     - 5.2.1. [Idempotent Producers](#521-idempotent-producers)
   - 5.3. [Close Stream](#53-close-stream)
   - 5.4. [Delete Stream](#54-delete-stream)
   - 5.5. [Stream Metadata](#55-stream-metadata)
   - 5.6. [Read Stream - Catch-up](#56-read-stream---catch-up)
   - 5.7. [Read Stream - Live (Long-poll)](#57-read-stream---live-long-poll)
   - 5.8. [Read Stream - Live (SSE)](#58-read-stream---live-sse)
6. [Offsets](#6-offsets)
7. [Content Types](#7-content-types)
8. [Caching and Collapsing](#8-caching-and-collapsing)
9. [Extensibility](#9-extensibility)
10. [Security Considerations](#10-security-considerations)
11. [IANA Considerations](#11-iana-considerations)
12. [References](#12-references)

---

## 1. Introduction

Modern web and cloud applications frequently require ordered, durable sequences of data that can be replayed from arbitrary points and tailed in real time.
Common use cases include:

- Database synchronization and change feeds
- Event-sourced architectures
- Collaborative editing and CRDTs
- AI conversation histories and token streaming
- Workflow execution histories
- Real-time application state updates
- Finite response streaming (proxied HTTP responses, job outputs, file transfers)

While these patterns are widespread, the web platform lacks a simple, first-class primitive for durable streams. Applications typically implement ad-hoc solutions using combinations of databases, queues, and polling mechanisms, each reinventing similar offset-based replay semantics.

The Durable Streams Protocol provides a minimal HTTP-based interface for durable, append-only byte streams. It is intentionally low-level and byte-oriented, allowing higher-level abstractions to be built on top without protocol changes.

## 2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

**Stream**: A URL-addressable, append-only byte stream that can be read and written to. A stream is simply a URL. The protocol defines how to interact with that URL using HTTP methods, query parameters, and headers. Streams are durable and immutable by position. New data can only be appended.

**Offset**: An opaque, lexicographically sortable token that identifies a position within a stream. Clients use offsets to resume reading from a specific previously reached point.

**Content Type**: A MIME type set on stream creation that describes the format of the stream's bytes. The content type is returned on reads and may be used by clients to interpret message boundaries.

**Tail Offset**: The offset immediately after the last byte in the stream. This is the position where new appends will be written.
**Closed Stream**: A stream that has been explicitly closed by a writer. Once closed, a stream is in a terminal state: no further appends are permitted, and readers can observe the closure as an end-of-stream (EOF) signal. Closure is durable and monotonic - once closed, a stream remains closed.

## 3. Protocol Overview

The Durable Streams Protocol is an HTTP-based protocol that operates on URLs. A stream is simply a URL. The protocol defines how to interact with that URL using standard HTTP methods, query parameters, and custom headers.

The protocol defines operations to create, append to, read, close, delete, and query metadata for streams. Reads have three modes: catch-up, long-poll, and Server-Sent Events (SSE). The primary operations are:

1. **Create**: Establish a new stream at a URL with optional initial content (PUT)
2. **Append**: Add bytes to the end of an existing stream (POST)
3. **Close**: Transition a stream to closed state, optionally with a final append (POST with `Stream-Closed: true`)
4. **Read**: Retrieve bytes starting from a given offset, with support for catch-up and live modes (GET)
5. **Delete**: Remove a stream (DELETE)
6. **Head**: Query stream metadata without transferring data (HEAD)

The protocol does not prescribe a specific URL structure. Servers may organize streams using any URL scheme they choose (e.g., `/v1/stream/{path}`, `/{id}`, or domain-specific paths). The protocol is defined by the HTTP methods, query parameters, and headers applied to any stream URL.

Streams support arbitrary content types. The protocol operates at the byte level, leaving message framing and schema interpretation to clients.

**Independent Read/Write Implementation**: Servers **MAY** implement the read and write paths independently. For example, a database synchronization server may only implement the read path and use its own injection system for writes, while a collaborative editing service might implement both paths.

## 4. Stream Model

A stream is an append-only sequence of bytes with the following properties:

- **Durability**: Once written and acknowledged, bytes persist until the stream is deleted or expired
- **Immutability by Position**: Bytes at a given offset never change. New data is only appended
- **Ordering**: Bytes are strictly ordered by offset
- **Content Type**: Each stream has a MIME content type set at creation
- **TTL/Expiry**: Streams may have optional time-to-live or absolute expiry times
- **Retention**: Servers **MAY** implement retention policies that drop data older than a certain age while the stream continues. If a stream is deleted a new stream **SHOULD NOT** be created at the same URL.
- **Stream State**: A stream is either **open** (accepts appends) or **closed** (no further appends permitted). Streams start in the open state and transition to closed via an explicit close operation. This transition is **durable** (persisted) and **monotonic** (once closed, a stream cannot be reopened).

Clients track their position in a stream using offsets. Offsets are opaque to clients but are lexicographically sortable, allowing clients to determine ordering and resume from any point.

### 4.1. Stream Closure

Stream closure provides an explicit end-of-stream (EOF) signal that allows readers to distinguish between "no data yet" and "no more data ever."
This is essential for finite streams where writers need to signal completion, such as:

- Proxied HTTP responses that have finished streaming
- Completed job outputs or workflow executions
- Finalized conversation histories or document streams

**Properties of stream closure:**

- **Durable**: The closed state is persisted and survives server restarts
- **Monotonic**: Once closed, a stream cannot be reopened
- **Idempotent**: Closing an already-closed stream succeeds (or returns a stable "already closed" response)
- **Observable**: Readers can detect closure and treat it as EOF
- **Atomic with final append**: Writers can atomically append a final message and close in a single operation

After closure, the stream's data remains fully readable. Only new appends are rejected.

**Stream-Closed Header Value:** The `Stream-Closed` header uses the value `true` (case-insensitive) to indicate closure. Servers **MUST** treat the header as present only when its value is exactly `true` (case-insensitive comparison). Other values such as `false`, `yes`, `1`, or empty string **MUST** be treated as if the header were absent. Servers **SHOULD NOT** return error responses for non-`true` values. They simply ignore the header.

## 5. HTTP Operations

The protocol defines operations that are applied to a stream URL. The examples in this section use `{stream-url}` to represent any stream URL. Servers may implement any URL structure they choose. The protocol is defined by the HTTP methods, query parameters, and headers.

### 5.1. Create Stream

#### Request

```
PUT {stream-url}
```

Where `{stream-url}` is any URL that identifies the stream to be created.

Creates a new stream. If the stream already exists at `{stream-url}`, the server **MUST** either:

- return `200 OK` if the existing stream's configuration (content type, TTL/expiry, and closure status) matches the request, or
- return `409 Conflict` if it does not.
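
The create-or-ensure rule above can be sketched in a few lines of Python. This is an illustration only, not part of the specification: `StreamConfig` and `create_status` are hypothetical names, and comparing the whole configuration with plain equality is an assumption about how a server might implement "matches the request".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StreamConfig:
    """Configuration compared on a repeated PUT (hypothetical shape)."""
    content_type: str
    ttl_seconds: Optional[int] = None
    expires_at: Optional[str] = None
    closed: bool = False


def create_status(existing: Optional[StreamConfig], requested: StreamConfig) -> int:
    """Return the HTTP status a server would send for PUT {stream-url}."""
    if existing is None:
        return 201  # no stream at this URL yet: create it
    if existing == requested:
        return 200  # same configuration: idempotent success
    return 409  # stream exists with a different configuration
```

A repeated `PUT` with the same content type and closure status therefore yields `200`, while changing either one yields `409`.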
This provides idempotent "create or ensure exists" semantics aligned with HTTP PUT expectations.

**Closure status matching:** When checking for idempotent success (200 OK), servers **MUST** compare the `Stream-Closed` header in the request against the stream's current closure status. For example:

- `PUT /stream` (no `Stream-Closed`) to an **open** stream with matching config -> `200 OK`
- `PUT /stream` (no `Stream-Closed`) to a **closed** stream -> `409 Conflict` (closure status mismatch)
- `PUT /stream + Stream-Closed: true` to a **closed** stream with matching config -> `200 OK`
- `PUT /stream + Stream-Closed: true` to an **open** stream -> `409 Conflict` (closure status mismatch)

#### Request Headers (Optional)

- `Content-Type: `
  - Sets the stream's content type. If omitted, the server **MAY** default to `application/octet-stream`.
- `Stream-TTL: `
  - Sets a relative time-to-live in seconds from creation. The value **MUST** be a non-negative integer in decimal notation without leading zeros, plus signs, decimal points, or scientific notation (e.g., `3600` is valid; `+3600`, `03600`, `3600.0`, and `3.6e3` are not).
- `Stream-Expires-At: `
  - Sets an absolute expiry time as an RFC 3339 timestamp.
  - If both `Stream-TTL` and `Stream-Expires-At` are supplied, servers **SHOULD** reject the request with `400 Bad Request`. Implementations **MAY** define a deterministic precedence rule, but **MUST** document it.
- `Stream-Closed: true` (optional)
  - When present, the stream is created in the **closed** state. Any body provided becomes the complete and final content of the stream.
  - This enables atomic "create and close" semantics for single-message or empty streams that are immediately complete (e.g., cached responses, placeholder errors, pre-computed results).
  - **Examples:**
    - `PUT /stream + Stream-Closed: true` (empty body): Creates an empty, immediately-closed stream (useful for "completed with no output" or error placeholders).
    - `PUT /stream + Stream-Closed: true + body`: Creates a single-shot stream with the body as its complete content (useful for cached responses, pre-computed results).

#### Request Body (Optional)

- Initial stream bytes. If provided, these bytes form the first content of the stream.

#### Response Codes

- `201 Created`: Stream created successfully
- `200 OK`: Stream already exists with matching configuration (idempotent success)
- `409 Conflict`: Stream already exists with different configuration
- `400 Bad Request`: Invalid headers or parameters (including conflicting TTL/expiry)
- `429 Too Many Requests`: Rate limit exceeded

#### Response Headers (on 201 or 200)

- `Location: {stream-url}` (on 201): Servers **SHOULD** include a `Location` header equal to `{stream-url}` in `201 Created` responses.
- `Content-Type: `: The stream's content type
- `Stream-Next-Offset: `: The tail offset after any initial content
- `Stream-Closed: true`: Present when the stream was created in the closed state

### 5.2. Append to Stream

#### Request

```
POST {stream-url}
```

Where `{stream-url}` is the URL of an existing stream.

Appends bytes to the end of an existing stream. Supports both full-body and streaming (chunked) append operations. Optionally closes the stream atomically with the append.

Servers that do not support appends for a given stream **SHOULD** return `405 Method Not Allowed` or `501 Not Implemented` to `POST` requests on that URL.

#### Request Headers

- `Content-Type: `
  - **MUST** match the stream's existing content type when a body is provided. Servers **MUST** return `409 Conflict` when the content type is valid but does not match the stream's configured type.
  - **MAY** be omitted when the request body is empty (i.e., close-only requests with `Stream-Closed: true`). When the request body is empty, servers **MUST NOT** reject based on `Content-Type` and **MAY** ignore it entirely.
    This ensures close-only requests remain robust even when clients/libraries attach default `Content-Type` headers.
- `Transfer-Encoding: chunked` (optional)
  - Indicates a streaming body. Servers **SHOULD** support HTTP/1.1 chunked encoding and HTTP/2 streaming semantics.
- `Stream-Seq: ` (optional)
  - A monotonic, lexicographic writer sequence number for coordination.
  - `Stream-Seq` values are opaque strings that **MUST** compare using simple byte-wise lexicographic ordering. Sequence numbers are scoped per authenticated writer identity (or per stream, depending on implementation). Servers **MUST** document the scope they enforce.
  - If provided and less than or equal to the last appended sequence (as determined by lexicographic comparison), the server **MUST** return `409 Conflict`. Sequence numbers **MUST** be strictly increasing.
- `Stream-Closed: true` (optional)
  - When present with value `true`, the stream is **closed** after the append completes. This is an atomic operation: the body (if any) is appended as the final data, and the stream transitions to the closed state in the same step.
  - If the request body is empty (Content-Length: 0 or no body), the stream is closed without appending any data. This is the only case where an empty POST body is valid.
  - Once closed, the stream rejects all subsequent appends with `409 Conflict` (see below).
  - **Close-only requests are idempotent**: if the stream is already closed and the request includes `Stream-Closed: true` with an empty body, servers **SHOULD** return `204 No Content` with `Stream-Closed: true`.
  - **Append-and-close requests are NOT idempotent** (without idempotent producer headers): if the stream is already closed and the request includes a body but no idempotent producer headers, servers **MUST** return `409 Conflict` with `Stream-Closed: true`, since the body cannot be appended.
    However, if idempotent producer semantics apply and the request matches the `(producerId, epoch, seq)` tuple that performed the closing append, servers treat it as a deduplicated success (see Section 5.2.1).

#### Request Body

- Bytes to append to the stream. Servers **MUST** reject POST requests with an empty body (Content-Length: 0 or no body) with `400 Bad Request`, **unless** the `Stream-Closed: true` header is present (which allows closing without appending data).

#### Response Codes

- `204 No Content`: Append successful (or stream already closed when closing idempotently)
- `400 Bad Request`: Malformed request (invalid header syntax, missing Content-Type, empty body without `Stream-Closed: true`)
- `404 Not Found`: Stream does not exist
- `405 Method Not Allowed` or `501 Not Implemented`: Append not supported for this stream
- `409 Conflict`: Content type mismatch with stream's configured type, sequence regression (if `Stream-Seq` provided), or **stream is closed** (when attempting to append without `Stream-Closed: true`)
- `413 Payload Too Large`: Request body exceeds server limits
- `429 Too Many Requests`: Rate limit exceeded

#### Response Headers (on success)

- `Stream-Next-Offset: `: The new tail offset after the append
- `Stream-Closed: true`: Present when the stream is now closed (either by this request or previously)

#### Response Headers (on 409 Conflict due to closed stream)

When a client attempts to append to a closed stream (without `Stream-Closed: true`), servers **MUST** return:

- `409 Conflict` status code
- `Stream-Closed: true` header
- `Stream-Next-Offset: `: The final offset of the closed stream (useful for clients to know the stream's final position)

This allows clients to detect and handle the "stream already closed" condition programmatically without parsing the response body. Servers **SHOULD** keep the response body empty or use a standardized error format.
Clients **SHOULD NOT** rely on parsing the body to determine the reason for rejection.

**Error Precedence:** When an append request would trigger multiple conflict conditions (e.g., stream is closed AND content type mismatches), servers **SHOULD** check the stream's closed status first. This ensures clients receive the `Stream-Closed: true` header, enabling correct error handling. The recommended precedence is:

1. Stream closed -> `409 Conflict` with `Stream-Closed: true`
2. Content type mismatch -> `409 Conflict`
3. Sequence regression -> `409 Conflict`

### 5.2.1. Idempotent Producers

Durable Streams supports Kafka-style idempotent producers for exactly-once write semantics. This enables fire-and-forget writes with server-side deduplication, eliminating duplicates from client retries.

#### Design

- **Client-provided producer IDs**: Zero RTT overhead, no handshake required
- **Client-declared epochs, server-validated fencing**: Client increments epoch on restart. Server validates monotonicity and fences stale epochs
- **Per-batch sequence numbers**: Separate from `Stream-Seq`, used for retry safety
- **Two-layer sequence design**:
  - Transport layer: `Producer-Id` + `Producer-Epoch` + `Producer-Seq` (retry safety)
  - Application layer: `Stream-Seq` (cross-restart ordering, lexicographic)

#### Request Headers

All three producer headers **MUST** be provided together or none at all. If only some headers are provided, servers **MUST** return `400 Bad Request`.

- `Producer-Id: `
  - Client-supplied stable identifier (e.g., "order-service-1", UUID)
  - **MUST** be a non-empty string.
    Empty values result in `400 Bad Request`
  - Identifies the logical producer across restarts
- `Producer-Epoch: `
  - Client-declared epoch, starting at 0
  - Increment on producer restart to establish a new session
  - Server validates that epoch is monotonically non-decreasing
  - **MUST** be a non-negative integer ≤ 2^53-1 (for JavaScript interoperability)
- `Producer-Seq: `
  - Monotonically increasing sequence number per epoch
  - Starts at 0 for each new epoch
  - Applies per-batch (per HTTP request), not per-message
  - **MUST** be a non-negative integer ≤ 2^53-1 (for JavaScript interoperability)

#### Response Headers

- `Producer-Epoch: `: Echoed back on success (200/204), or current server epoch on stale epoch (403)
- `Producer-Seq: `: On success (200/204), the highest accepted sequence number for this `(stream, producerId, epoch)` tuple. Enables clients to confirm pipelined requests and recover state after crashes.
- `Producer-Expected-Seq: `: On 409 Conflict (sequence gap), the expected sequence
- `Producer-Received-Seq: `: On 409 Conflict (sequence gap), the received sequence

#### Validation Logic

```
# Epoch validation (client-declared, server-validated)
if epoch < state.epoch:
  -> 403 Forbidden
  -> Headers: Producer-Epoch:

if epoch > state.epoch:
  if seq != 0:
    -> 400 Bad Request (new epoch must start at seq=0)
  -> Accept: update state.epoch = epoch, state.lastSeq = 0
  -> 200 OK (new epoch established)

# Same epoch: sequence validation
if seq <= state.lastSeq:
  -> 204 No Content (duplicate, idempotent success)

if seq == state.lastSeq + 1:
  -> Accept, update state.lastSeq = seq
  -> 200 OK

if seq > state.lastSeq + 1:
  -> 409 Conflict
  -> Headers: Producer-Expected-Seq: , Producer-Received-Seq:
```

#### Response Codes (with Producer Headers)

- `200 OK`: Append successful (new data)
- `204 No Content`: Duplicate append (idempotent success, data already exists)
- `400 Bad Request`: Invalid producer headers (e.g., non-integer values, epoch increase with seq != 0)
- `403 Forbidden`:
  Stale producer epoch (zombie fencing). Response includes `Producer-Epoch` header with current server epoch.
- `409 Conflict`: Sequence gap detected. Response includes `Producer-Expected-Seq` and `Producer-Received-Seq` headers.

#### Bootstrap and Restart Flow

1. **Initial start (epoch=0)**:
   - Producer sends `(epoch=0, seq=0)`
   - Server accepts, establishes producer state
2. **Producer restart**:
   - Producer increments local epoch (0 -> 1), resets seq to 0
   - Sends `(epoch=1, seq=0)`
   - Server sees epoch > state.epoch, accepts, updates state
3. **Zombie fencing**:
   - Old producer (zombie) still sending `(epoch=0, seq=N)` gets 403 Forbidden
   - Response includes `Producer-Epoch: 1` header

#### Auto-claim Flow (for ephemeral producers)

For serverless or ephemeral producers without persisted epoch:

1. Producer starts fresh with `(epoch=0, seq=0)`
2. If server has `state.epoch=5`, returns 403 with `Producer-Epoch: 5`
3. Client can retry with `(epoch=6, seq=0)` to claim the producer ID

This is opt-in client behavior and should be used with caution.

#### Concurrency Requirements

Servers **MUST** serialize validation + append operations per `(stream, producerId)` pair. HTTP requests can arrive out-of-order. Without serialization, seq=1 arriving before seq=0 would cause false sequence gaps.

#### Atomicity Requirements

For persistent storage, servers **SHOULD** commit producer state updates and log appends atomically (e.g., in a single database transaction). Non-atomic implementations have a crash window where:

1. Data is appended to the log
2. Crash occurs before producer state is updated
3. On recovery, a retry may be re-accepted, causing duplicate data

**Recovery for non-atomic stores**: Clients can bump their epoch after a crash to establish a clean session. This trades "exactly once within epoch" for "at least once across crashes" which is acceptable for many use cases. Stores **SHOULD** document their atomicity guarantees clearly.
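
As a rough illustration of the validation logic described earlier in this section, the following Python sketch maps one `(epoch, seq)` pair to a status code, response headers, and the next producer state. It is not part of the specification: `ProducerState` and `validate_producer` are hypothetical names, success-response headers are omitted for brevity, and the handling of a previously unseen producer (accept any epoch, require `seq=0`) is an assumption, since the pseudocode only covers existing state.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ProducerState:
    """Last accepted (epoch, seq) for one (stream, producerId) pair."""
    epoch: int
    last_seq: int


Result = Tuple[int, Dict[str, str], Optional[ProducerState]]


def validate_producer(state: Optional[ProducerState], epoch: int, seq: int) -> Result:
    """Return (status, response headers, next state) for one append."""
    if state is None:
        # Assumption: an unseen producer is treated like a new epoch.
        if seq != 0:
            return 400, {}, None
        return 200, {}, ProducerState(epoch, 0)
    if epoch < state.epoch:
        # Stale epoch: fence the zombie and report the current epoch.
        return 403, {"Producer-Epoch": str(state.epoch)}, state
    if epoch > state.epoch:
        if seq != 0:
            return 400, {}, state  # a new epoch must start at seq=0
        return 200, {}, ProducerState(epoch, 0)
    if seq <= state.last_seq:
        return 204, {}, state  # duplicate: idempotent success, nothing appended
    if seq == state.last_seq + 1:
        return 200, {}, ProducerState(epoch, seq)
    # Sequence gap: tell the client which sequence was expected.
    headers = {
        "Producer-Expected-Seq": str(state.last_seq + 1),
        "Producer-Received-Seq": str(seq),
    }
    return 409, headers, state


# Replay a first write, a retry, a gap, a restart, and a zombie write.
state = None
for epoch, seq in [(0, 0), (0, 1), (0, 1), (0, 3), (1, 0), (0, 2)]:
    status, headers, state = validate_producer(state, epoch, seq)
    print(epoch, seq, status, headers)
```

The replay yields statuses 200, 200, 204, 409, 200, and 403 in that order. In a real server the call would run under a per-`(stream, producerId)` lock, and the returned state would be committed together with the log append, as the concurrency and atomicity requirements above describe.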
#### Producer State Cleanup

Servers **MAY** implement TTL-based cleanup for producer state:

- **In-memory stores**: 7 days TTL recommended, clean up on stream access
- **Persistent stores**: Retain as long as stream data exists (stronger guarantee)

After state expiry, the producer is treated as new. A zombie alive past TTL expiry can write again, which is acceptable for testing but persistent stores should use longer retention.

#### Stream Closure with Idempotent Producers

Idempotent producers can close streams using the `Stream-Closed: true` header. The behavior is:

- **Close with final append**: Include body, producer headers, and `Stream-Closed: true`. The append is deduplicated normally, and the stream closes atomically with the final append.
- **Close without append**: Include `Stream-Closed: true` with empty body. Producer headers are optional but if provided, the close operation is still idempotent.
- **Duplicate close**: If the stream was already closed by the same `(producerId, epoch, seq)` tuple, servers **SHOULD** return `204 No Content` with `Stream-Closed: true`.

When a closed stream receives an append from an idempotent producer:

- If the `(producerId, epoch, seq)` matches the request that closed the stream, return `204 No Content` (duplicate/idempotent success) with `Stream-Closed: true`
- Otherwise, return `409 Conflict` with `Stream-Closed: true` (stream is closed, no further appends allowed)

### 5.3. Close Stream

To close a stream without appending data, send a POST request with `Stream-Closed: true` and an empty body:

#### Request

```
POST {stream-url}
Stream-Closed: true
```

#### Response Codes

- `204 No Content`: Stream closed successfully (or already closed - idempotent)
- `404 Not Found`: Stream does not exist
- `405 Method Not Allowed` or `501 Not Implemented`: Append/close not supported for this stream

#### Response Headers

- `Stream-Next-Offset: `: The tail offset (unchanged, since no data was appended)
- `Stream-Closed: true`: Confirms the stream is now closed

This is the canonical "close-only" operation. For atomic "append final message and close", include a request body as described in Section 5.2.

### 5.4. Delete Stream

#### Request

```
DELETE {stream-url}
```

Where `{stream-url}` is the URL of the stream to delete.

Deletes the stream and all its data. In-flight reads may terminate with a `404 Not Found` on subsequent requests after deletion.

#### Response Codes

- `204 No Content`: Stream deleted successfully
- `404 Not Found`: Stream does not exist
- `405 Method Not Allowed` or `501 Not Implemented`: Delete not supported for this stream

### 5.5. Stream Metadata

#### Request

```
HEAD {stream-url}
```

Where `{stream-url}` is the URL of the stream.

Checks stream existence and returns metadata without transferring a body. This is the canonical way to find the tail offset, TTL, expiry information, and **closure status**.

#### Response Codes

- `200 OK`: Stream exists
- `404 Not Found`: Stream does not exist
- `429 Too Many Requests`: Rate limit exceeded

#### Response Headers (on 200)

- `Content-Type: `: The stream's content type
- `Stream-Next-Offset: `: The tail offset (next offset after the current end)
- `Stream-TTL: ` (optional): Remaining time-to-live, if applicable
- `Stream-Expires-At: ` (optional): Absolute expiry time, if applicable
- `Stream-Closed: true` (optional): Present when the stream has been closed.
  Absence indicates the stream is still open.
- `Cache-Control`: See Section 8

#### Caching Guidance

Servers **SHOULD** make `HEAD` responses effectively non-cacheable, for example by returning `Cache-Control: no-store`. Servers **MAY** use `Cache-Control: private, max-age=0, must-revalidate` as an alternative, but `no-store` is recommended to avoid stale tail offsets and closure status.

### 5.6. Read Stream - Catch-up

#### Request

```
GET {stream-url}?offset=
```

Where `{stream-url}` is the URL of the stream.

Returns bytes starting from the specified offset. This is used for catch-up reads when a client needs to replay stream content from a known position.

#### Query Parameters

- `offset` (optional)
  - Start offset token. If omitted, defaults to the stream start (offset -1).

#### Response Codes

- `200 OK`: Data available (or empty body if offset equals tail)
- `400 Bad Request`: Malformed offset or invalid parameters
- `404 Not Found`: Stream does not exist
- `410 Gone`: Offset is before the earliest retained position (retention/compaction)
- `429 Too Many Requests`: Rate limit exceeded

For non-live reads without data beyond the requested offset, servers **SHOULD** return `200 OK` with an empty body and `Stream-Next-Offset` equal to the requested offset. If the stream is closed, this response **MUST** also include `Stream-Closed: true` to signal EOF.

#### Response Headers (on 200)

- `Cache-Control`: Derived from TTL/expiry (see Section 8)
- `ETag: {internal_stream_id}:{start_offset}:{end_offset}`
  - Entity tag for cache validation
- `Stream-Cursor: ` (optional for catch-up, required for live modes)
  - Cursor to echo on subsequent long-poll requests to improve CDN collapsing. Servers **MAY** include this on catch-up reads. It is **required** for live modes when the stream is open (see Sections 5.7, 5.8). Servers **MAY** omit it when `Stream-Closed` is true. Clients **MUST** tolerate its absence when `Stream-Closed` is present.
- `Stream-Next-Offset: `
  - The next offset to read from (for subsequent requests)
- `Stream-Up-To-Date: true`
  - **MUST** be present and set to `true` when the response includes all data available in the stream at the time the response was generated (i.e., when the requested offset has reached the tail and no more data exists).
  - **SHOULD NOT** be present when returning partial data due to server-defined chunk size limits (when more data exists beyond what was returned).
  - Clients **MAY** use this header to determine when they have caught up and can transition to live tailing mode.
  - **Important:** `Stream-Up-To-Date: true` does **NOT** imply EOF. More data may be appended in the future. Only `Stream-Closed: true` indicates that no more data will ever arrive.
- `Stream-Closed: true`
  - **MUST** be present when the stream is closed **and** the client has reached the final offset **at the time the response is generated**. This includes:
    - Responses that return the final chunk of data, when the stream is already closed at response generation time, or
    - Responses with an empty body when the requested offset equals the tail offset of a closed stream (the canonical EOF signal).
  - When present, clients can conclude that no more data will ever be appended and treat this as EOF.
  - **SHOULD NOT** be present when returning partial data from a closed stream (when more data exists between the response and the final offset). In this case, `Stream-Closed: true` will be returned on a subsequent request that reaches the final offset.
  - **Timing note:** If a stream is closed **after** the final chunk was served (or cached), that chunk will not include `Stream-Closed: true`. Clients discover closure by requesting the next offset (`Stream-Next-Offset` from the previous response), which returns an empty body with `Stream-Closed: true`. This is the expected flow when closure occurs between chunk responses or when serving cached chunks.
  - Clients that need to know closure status before reaching the tail **SHOULD** use `HEAD` (see Section 5.5).

#### Response Body

- Bytes from the stream starting at the specified offset, up to a server-defined maximum chunk size.

### 5.7. Read Stream - Live (Long-poll)

#### Request

```
GET {stream-url}?offset=&live=long-poll[&cursor=]
```

Where `{stream-url}` is the URL of the stream.

If no data is available at the specified offset, the server waits up to a timeout for new data to arrive. This enables efficient live tailing without constant polling.

#### Query Parameters

- `offset` (required)
  - The offset to read from. **MUST** be provided.
- `live=long-poll` (required)
  - Indicates long-polling mode.
- `cursor` (optional)
  - Echo of the last `Stream-Cursor` header value from a previous response.
  - Used for collapsing keys in CDN/proxy configurations.

#### Response Codes

- `200 OK`: Data became available within the timeout
- `204 No Content`: Timeout expired with no new data
- `400 Bad Request`: Invalid parameters
- `404 Not Found`: Stream does not exist
- `429 Too Many Requests`: Rate limit exceeded

#### Response Headers (on 200)

- Same as catch-up reads (Section 5.6), plus:
- `Stream-Cursor: `: Servers **MUST** include this header. See Section 8.1.

#### Response Headers (on 204)

- `Stream-Next-Offset: `: Servers **MUST** include a `Stream-Next-Offset` header indicating the current tail offset.
- `Stream-Up-To-Date: true`: Servers **MUST** include this header to indicate the client is caught up with all available data.
- `Stream-Cursor: `: Servers **MUST** include this header when the stream is open. Servers **MAY** omit this header when `Stream-Closed` is true (cursor is unnecessary when no further polling is expected). Clients **MUST** tolerate its absence when `Stream-Closed` is present. See Section 8.1.
- `Stream-Closed: true`: **MUST** be present when the stream is closed (see Section 5.6 for semantics).
  A `204 No Content` with `Stream-Closed: true` indicates EOF.

**EOF Signaling Across Modes:** Clients should treat **either** of the following as EOF, depending on the mode used:

- **Catch-up mode**: `200 OK` with empty body and `Stream-Closed: true`
- **Long-poll mode**: `204 No Content` with `Stream-Closed: true`
- **SSE mode**: `control` event with `streamClosed: true`

In all cases, `Stream-Closed` / `streamClosed` is the definitive EOF signal. The presence of `Stream-Up-To-Date` / `upToDate` alone does **not** indicate EOF - it only means the client has caught up with currently available data, but more may arrive.

#### Stream Closure Behavior in Long-poll Mode

When the stream is closed and the client is already at the tail offset:

- Servers **MUST NOT** wait for the long-poll timeout
- Servers **MUST** immediately return `204 No Content` with `Stream-Closed: true` and `Stream-Up-To-Date: true`

This ensures clients observing a closed stream do not have hanging connections waiting for data that will never arrive.

#### Response Body (on 200)

- New bytes that arrived during the long-poll period.

#### Timeout Behavior

The timeout for long-polling is implementation-defined. Servers **MAY** accept a `timeout` query parameter (in seconds) as a future extension, but this is not required by the base protocol.

### 5.8. Read Stream - Live (SSE)

#### Request

```
GET {stream-url}?offset=<offset>&live=sse
```

Where `{stream-url}` is the URL of the stream. Returns data as a Server-Sent Events (SSE) stream.

SSE mode supports all content types. For streams with `content-type: text/*` or `application/json`, data events carry UTF-8 text directly. For streams with any other `content-type` (binary streams), servers **MUST** automatically base64-encode data events and include the response header `stream-sse-data-encoding: base64`.

SSE responses **MUST** use `Content-Type: text/event-stream` in the HTTP response headers.
When the stream's configured `content-type` is neither `text/*` nor `application/json`, servers **MUST** include the HTTP response header `stream-sse-data-encoding: base64`. Clients **MUST** check for this header and decode data events accordingly.

#### Query Parameters

- `offset` (required)
  - The offset to start reading from.
- `live=sse` (required)
  - Indicates SSE streaming mode.

#### Response Codes

- `200 OK`: Streaming body (SSE format)
- `400 Bad Request`: Invalid parameters
- `404 Not Found`: Stream does not exist
- `429 Too Many Requests`: Rate limit exceeded

#### Response Format

Data is emitted in [Server-Sent Events format](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#event_stream_format).

**Events:**

- `data`: Emitted for each batch of data
  - Each line prefixed with `data:`
  - For binary streams (where `stream-sse-data-encoding: base64` is present), the `data` event payload represents bytes encoded using standard base64 per [RFC 4648](https://www.rfc-editor.org/rfc/rfc4648) (alphabet: A-Z, a-z, 0-9, +, /).
    - Servers **MAY** split the base64 text across multiple `data:` lines within the same SSE `data` event.
    - Clients **MUST** concatenate the `data:` lines for the event (per SSE rules) and **MUST** remove all `\n` and `\r` characters inserted between lines before base64-decoding.
    - The resulting string (after removing `\n` and `\r`) **MUST** be valid base64 text with length that is a multiple of 4 (or empty).
    - If a `data` event's byte payload length is 0, the base64 text **MUST** be the empty string.
    - Base64 encoding affects only `event: data` payloads. `event: control` events remain JSON as specified and are not encoded.
  - When the stream content type is `application/json`, implementations **MAY** batch multiple logical messages into a single SSE `data` event by streaming a JSON array across multiple `data:` lines, as in the example below.
- `control`: Emitted after every data event
  - **MUST** include `streamNextOffset`. See Section 8.1.
  - **MUST** include `streamCursor` when the stream is open. Servers **MAY** omit `streamCursor` when `streamClosed` is true (cursor is unnecessary when no reconnection is expected).
  - **MUST** include `upToDate: true` when the client is caught up with all available data. Note: `streamClosed: true` implies `upToDate: true` (a closed stream at the final offset is by definition up-to-date), so `upToDate` **MAY** be omitted when `streamClosed` is true.
  - **MUST** include `streamClosed: true` when the stream is closed and all data up to the final offset has been sent.
  - Format: JSON object with offset, cursor (when applicable), up-to-date status, and optionally closed status. Field names use camelCase: `streamNextOffset`, `streamCursor`, `upToDate`, and `streamClosed`.

**Example (normal data):**

```
event: data
data: [
data: {"k":"v"},
data: {"k":"w"}
data: ]

event: control
data: {"streamNextOffset":"123456_789","streamCursor":"abc"}
```

**Example (final data with stream closure):**

```
event: data
data: [
data: {"k":"final"}
data: ]

event: control
data: {"streamNextOffset":"123456_999","streamClosed":true}
```

Note: `streamCursor` is omitted when `streamClosed` is true, since clients must not reconnect after receiving a closed signal.

**Client Compatibility:** Clients **MUST** tolerate the absence of `streamCursor` (in SSE) and `Stream-Cursor` (in HTTP headers) when `streamClosed` / `Stream-Closed` is present. Implementations that assume cursor is always present will break when processing closed stream responses.
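
The `data` and `control` handling rules above can be sketched on the client side roughly as follows. This is a minimal illustration, not a reference client: the helper names `parse_sse`, `decode_data`, and `read_control` are hypothetical, and the parser assumes events are separated by blank lines, as in standard SSE framing.

```python
import base64
import json


def parse_sse(text):
    """Split an SSE body into (event_name, data_lines) pairs.

    Assumes events end at a blank line. Each `data:` line contributes one
    entry, with a single leading space stripped.
    """
    events = []
    name, data_lines = "message", []
    for line in text.splitlines():
        if line == "":
            if data_lines:
                events.append((name, data_lines))
            name, data_lines = "message", []
        elif line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data_lines.append(value[1:] if value.startswith(" ") else value)
    if data_lines:
        events.append((name, data_lines))
    return events


def decode_data(data_lines, base64_encoded):
    """Return bytes for binary streams, or text for text/JSON streams."""
    if base64_encoded:
        # Joining without a separator drops the line breaks between `data:`
        # lines; stray CR/LF characters are removed before decoding.
        text = "".join(data_lines).replace("\r", "").replace("\n", "")
        return base64.b64decode(text, validate=True)
    return "\n".join(data_lines)


def read_control(data_lines):
    """Parse a control event, tolerating an absent streamCursor."""
    control = json.loads("\n".join(data_lines))
    closed = control.get("streamClosed", False) is True
    return {
        "next_offset": control["streamNextOffset"],
        "cursor": control.get("streamCursor"),  # may be absent when closed
        "closed": closed,
        # streamClosed implies upToDate, so upToDate may be omitted.
        "up_to_date": closed or control.get("upToDate", False) is True,
    }


sample = (
    "event: data\n"
    "data: AQIDBAUG\n"
    "data: BwgJCg==\n"
    "\n"
    "event: control\n"
    'data: {"streamNextOffset":"123456_999","streamClosed":true}\n'
    "\n"
)
events = parse_sse(sample)
payload = decode_data(events[0][1], base64_encoded=True)
control = read_control(events[1][1])
```

In a real client, `base64_encoded` would be set from the presence of the `stream-sse-data-encoding: base64` response header, and a `closed` control event would stop any reconnect loop.
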
#### Stream Closure Behavior in SSE Mode

When the stream is closed:

- The final `control` event **MUST** include `streamClosed: true`
- After emitting the final control event, servers **MUST** close the SSE connection
- Clients receiving `streamClosed: true` **MUST NOT** attempt to reconnect, as no more data will arrive

If the stream is already closed when an SSE connection is established and the client's offset is at the tail:

- Servers **MUST** immediately emit a `control` event with `streamClosed: true` and `upToDate: true`
- Servers **MUST** then close the connection

**Example (binary stream with automatic base64 encoding):**

```
event: data
data: AQIDBAUG
data: BwgJCg==

event: control
data: {"streamNextOffset":"123456_789","streamCursor":"abc"}
```

#### Connection Lifecycle

- Server **SHOULD** close connections roughly every 60 seconds to enable CDN collapsing
- Client **MUST** reconnect using the last received `streamNextOffset` value from the control event
- Client **MUST NOT** reconnect if the last control event included `streamClosed: true`

## 6. Offsets

Offsets are opaque tokens that identify positions within a stream. They have the following properties:

1. **Opaque**: Clients **MUST NOT** interpret offset structure or meaning
2. **Lexicographically Sortable**: For any two valid offsets for the same stream, a lexicographic comparison determines their relative position in the stream. Clients **MAY** compare offsets lexicographically to determine ordering.
3. **Persistent**: Offsets remain valid for the lifetime of the stream (until deletion or expiration)
4. **Unique**: Each offset identifies exactly one position in the stream. No two positions **MAY** share the same offset.
5. **Strictly Increasing**: Offsets assigned to appended data **MUST** be lexicographically greater than all previously assigned offsets. Server implementations **MUST NOT** use schemes (such as raw UTC timestamps) that can produce duplicate or non-monotonic offsets.
   Time-based identifiers like ULIDs, which combine timestamps with random components to guarantee uniqueness and monotonicity, are acceptable.

**Format**: Offset tokens are opaque, case-sensitive strings. Their internal structure is implementation-defined. Offsets are single tokens and **MUST NOT** contain `,`, `&`, `=`, `?`, or `/` (to avoid conflict with URL query parameter syntax). Servers **SHOULD** use URL-safe characters to avoid encoding issues, but clients **MUST** properly URL-encode offset values when including them in query parameters. Servers **SHOULD** keep offsets reasonably short (under 256 characters) since they appear in every request URL.

**Sentinel Values**: The protocol defines two special offset sentinel values:

- **`-1` (Stream Beginning)**: The special offset value `-1` represents the beginning of the stream. Clients **MAY** use `offset=-1` as an explicit way to request data from the start. This is semantically equivalent to omitting the offset parameter. Servers **MUST** recognize `-1` as a valid offset that returns data from the beginning of the stream.
- **`now` (Current Tail Position)**: The special offset value `now` allows clients to skip all existing data and begin reading from the current tail position. This is useful for applications that only care about future data (e.g., presence tracking, live monitoring, late joiners to a conversation).
  The behavior varies by read mode:

  **Catch-up mode** (`offset=now` without `live` parameter):

  - Servers **MUST** return `200 OK` with an empty response body appropriate to the stream's content type:
    - For `application/json` streams: the body **MUST** be `[]` (empty JSON array), consistent with Section 7.1
    - For all other content types: the body **MUST** be 0 bytes (empty)
  - Servers **MUST** include a `Stream-Next-Offset` header set to the current tail position
  - Servers **MUST** include `Stream-Up-To-Date: true` header
  - Servers **SHOULD** return `Cache-Control: no-store` to prevent caching of the tail offset
  - The response **MUST** contain no data messages, regardless of stream content

  **Long-poll mode** (`offset=now&live=long-poll`):

  - Servers **MUST** immediately begin waiting for new data (no initial empty response)
  - This eliminates a round-trip: clients can subscribe to future data in a single request
  - If new data arrives during the wait, servers return `200 OK` with the new data
  - If the timeout expires, servers return `204 No Content` with `Stream-Up-To-Date: true`
  - The `Stream-Next-Offset` header **MUST** be set to the tail position

  **SSE mode** (`offset=now&live=sse`):

  - Servers **MUST** immediately begin the SSE stream from the tail position
  - The first control event **MUST** include the tail offset in `streamNextOffset`
  - If no data has arrived, the first control event **MUST** include `upToDate: true`
  - If data arrives before the first control event, `upToDate` reflects the current state
  - No historical data is sent.
    Only future data events are streamed

  **Closed streams** (`offset=now` on a closed stream):

  - Regardless of the `live` parameter, servers **MUST** return immediately with the closure signal
  - The response **MUST** include `Stream-Closed: true` and `Stream-Up-To-Date: true` headers
  - The `Stream-Next-Offset` header **MUST** be set to the stream's final (tail) offset
  - For catch-up mode: `200 OK` with empty body (or empty JSON array for JSON streams)
  - For long-poll mode: `204 No Content` (no waiting, immediate return)
  - For SSE mode: The first (and only) control event includes `streamClosed: true` and `upToDate: true`, then the connection closes
  - This ensures clients using `offset=now` can immediately discover that a stream has no future data

**Reserved Values**: The sentinel values `-1` and `now` are reserved by the protocol. Server implementations **MUST NOT** generate these strings as actual stream offsets (in `Stream-Next-Offset` headers or SSE control events). This ensures clients can always distinguish between sentinel requests and real offset values.

The opaque nature of offsets enables important server-side optimizations. For example, offsets may encode chunk file identifiers, allowing catch-up requests to be served directly from object storage without touching the main database.

Clients **MUST** use the `Stream-Next-Offset` value returned in responses for subsequent read requests. They **SHOULD** persist offsets locally (e.g., in browser local storage or a database) to enable resumability after disconnection or restart.

## 7. Content Types

The protocol supports arbitrary MIME content types. Most content types operate at the byte level, leaving message framing and interpretation to clients. The `application/json` content type has special semantics defined below.

**SSE Encoding:**

- SSE mode (Section 5.8) supports all content types. For streams with `content-type: text/*` or `application/json`, data events carry UTF-8 text natively.
  For all other content types, servers automatically base64-encode data events (see Section 5.8).

Clients **MAY** use any content type for their streams, including:

- `application/json` for JSON mode with message boundary preservation
- `application/ndjson` for newline-delimited JSON
- `application/x-protobuf` for Protocol Buffer messages
- `text/plain` for plain text
- Custom types for application-specific formats

### 7.1. JSON Mode

Streams created with `Content-Type: application/json` have special semantics for message boundaries and batch operations.

#### Message Boundaries

For `application/json` streams, servers **MUST** preserve message boundaries. Each POST request stores messages as a distinct unit, and GET responses **MUST** return data as a JSON array containing all messages from the requested offset range.

#### Array Flattening for Batch Operations

When a POST request body contains a JSON array, servers **MUST** flatten exactly one level of the array, treating each element as a separate message. This enables clients to batch multiple messages in a single HTTP request while preserving individual message semantics.

**Examples (direct POST to server):**

- POST body `{"event": "created"}` stores one message: `{"event": "created"}`
- POST body `[{"event": "a"}, {"event": "b"}]` stores two messages: `{"event": "a"}`, `{"event": "b"}`
- POST body `[[1,2], [3,4]]` stores two messages: `[1,2]`, `[3,4]`
- POST body `[[[1,2,3]]]` stores one message: `[[1,2,3]]`

**Note:** Client libraries **MAY** automatically wrap individual values in arrays for batching. For example, a client calling `append({"x": 1})` might send POST body `[{"x": 1}]` to the server, which flattens it to store one message: `{"x": 1}`.

#### Empty Arrays

Servers **MUST** reject POST requests containing empty JSON arrays (`[]`) with `400 Bad Request`. Empty arrays in append operations represent no-op operations with no semantic meaning and likely indicate a client bug.
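
The one-level flattening and empty-array rules for POST bodies can be sketched as below. This is an illustrative helper under stated assumptions, not Ursula's implementation: the function name `messages_from_post_body` is hypothetical, and the `ValueError` stands in for whatever a server would map to `400 Bad Request`.

```python
import json


def messages_from_post_body(body):
    """Return the list of messages a JSON-mode POST body would store.

    Raises ValueError (which a server would translate to 400 Bad Request)
    for invalid JSON or for an empty top-level array.
    """
    # json.JSONDecodeError is a subclass of ValueError, so invalid JSON
    # surfaces through the same exception type.
    value = json.loads(body)
    if isinstance(value, list):
        if not value:
            raise ValueError("empty JSON array is not a valid append")
        # Flatten exactly one level: each element is one message,
        # even if that element is itself an array.
        return value
    # Any non-array value is stored as a single message.
    return [value]


single = messages_from_post_body('{"event": "created"}')
pairs = messages_from_post_body("[[1,2], [3,4]]")
nested = messages_from_post_body("[[[1,2,3]]]")
```

Here `single` holds one message, `pairs` holds the two messages `[1,2]` and `[3,4]`, and `nested` holds the single message `[[1,2,3]]`, matching the examples listed earlier in this section.
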
PUT requests with an empty array body (`[]`) are valid and create an empty stream. The empty array simply means no initial messages are being written.

#### JSON Validation

Servers **MUST** validate that appended data is valid JSON. If validation fails, servers **MUST** return `400 Bad Request` with an appropriate error message.

#### Response Format

GET responses for `application/json` streams **MUST** return `Content-Type: application/json` with a body containing a JSON array of all messages in the requested range:

```http
HTTP/1.1 200 OK
Content-Type: application/json

[{"event":"created"},{"event":"updated"}]
```

If no messages exist in the range, servers **MUST** return an empty JSON array `[]`.

## 8. Caching and Collapsing

### 8.1. Catch-up and Long-poll Reads

For **shared, non-user-specific streams**, servers **SHOULD** return:

```
Cache-Control: public, max-age=60, stale-while-revalidate=300
```

For **streams that may contain user-specific or confidential data**, servers **SHOULD** use `private` instead of `public` and rely on CDN configurations that respect `Authorization` or other cache keys:

```
Cache-Control: private, max-age=60, stale-while-revalidate=300
```

This enables CDN/proxy caching while allowing stale content to be served during revalidation.

**Caching and Stream Closure:** Catch-up chunks remain fully cacheable, including chunks at the tail of the stream. When a chunk is returned, it may or may not be the final chunk - this is unknown until the client requests the next offset. The closure signal is discovered when the client requests the offset **after** the final data:

1. Client reads data and receives `Stream-Next-Offset: X` (the tail offset)
2. Client requests offset `X`
3. If stream is closed: server returns `200 OK` with **empty body** and `Stream-Closed: true`
4. If stream is open: server returns `200 OK` with empty body and `Stream-Up-To-Date: true` (or long-poll/SSE waits for data)

This design ensures:

- All data chunks are cacheable (a chunk that later becomes "final" was still valid data)
- The closure signal is a distinct request/response at the tail offset
- Cached chunks never become "stale" due to closure - clients simply make one more request to discover EOF

**ETag Usage:** Servers **MUST** generate `ETag` headers for GET responses, except for `offset=now` responses. Clients **MAY** use `If-None-Match` with the `ETag` value on repeat catch-up requests. When a client provides a valid `If-None-Match` header that matches the current ETag, servers **MUST** respond with `304 Not Modified` (with no body) instead of re-sending the same data. This is essential for fast loading and efficient bandwidth usage.

**ETag and Stream Closure:** ETags **MUST** vary with the stream's closure status. When a stream is closed (without new data being appended), the ETag **MUST** change to ensure clients do not receive `304 Not Modified` responses that would hide the closure signal. Implementations **SHOULD** include a closure indicator in the ETag format (e.g., appending `:c` to the ETag when the stream is closed).

**Query Parameter Ordering:** For optimal cache behavior, clients **SHOULD** order query parameters lexicographically by key name. This ensures consistent URL serialization across implementations and improves CDN cache hit rates.

**Collapsing:** Clients **SHOULD** echo the `Stream-Cursor` value as `cursor=<cursor>` in subsequent long-poll requests. This, along with the appropriate `Cache-Control` header, enables CDNs and proxies to collapse multiple clients waiting for the same data into a single upstream request.
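
To make the query-ordering, cursor-echo, and ETag-closure points concrete, here is a small sketch. Both helper names (`build_read_url`, `etag_for`) are hypothetical, and the ETag layout is only one possible reading of the suggested `:c` closure indicator; the spec leaves the actual ETag format to the implementation.

```python
from urllib.parse import urlencode


def build_read_url(stream_url, offset, live=None, cursor=None):
    """Build a read URL with query parameters ordered by key name."""
    params = {"offset": offset}
    if live is not None:
        params["live"] = live
    if cursor is not None:
        # Echo the last Stream-Cursor value so shared caches can collapse
        # clients that are waiting on the same data.
        params["cursor"] = cursor
    # Keys are unique, so sorting the (key, value) pairs orders by key:
    # cursor, live, offset. urlencode also percent-encodes the values.
    return stream_url + "?" + urlencode(sorted(params.items()))


def etag_for(base_tag, closed):
    """Hypothetical ETag format that changes when the stream is closed."""
    return f'"{base_tag}:c"' if closed else f'"{base_tag}"'


url = build_read_url(
    "http://127.0.0.1:4437/demo/hello", "123", live="long-poll", cursor="1000"
)
```

Because every client serializes the same logical request to the same URL, a CDN keyed on the full URL sees one cache entry per (`cursor`, `live`, `offset`) combination rather than one per parameter ordering.
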

**Server-Generated Cursors:** To prevent infinite CDN cache loops (where clients receive the same cached empty response indefinitely), servers **MUST** generate cursors on all live mode responses:

- **Long-poll**: `Stream-Cursor` response header
- **SSE**: `streamCursor` field in `control` events

The cursor mechanism works as follows:

1. **Interval-based Calculation**: Servers divide time into fixed intervals (default: 20 seconds) counted from an epoch (default: October 9, 2024 00:00:00 UTC). The cursor value is the interval number as a decimal string.
2. **Cursor Generation**: For each live response, the server calculates the current interval number and returns it as the cursor value.
3. **Monotonic Progression**: Servers **MUST** ensure cursors never go backwards. When a client provides a `cursor` query parameter that is greater than or equal to the current interval number, the server **MUST** return a cursor strictly greater than the client's cursor (by adding random jitter of 1-3600 seconds). This guarantees monotonic progression and prevents cache cycles.
4. **Client Behavior**: Clients **MUST** include the received cursor value as the `cursor` query parameter in subsequent requests. This creates different cache keys as time progresses, ensuring CDN caches eventually expire.

**Example Cursor Flow:**

```
# Client makes initial long-poll request
GET /stream?offset=123&live=long-poll

# Server returns cursor based on current interval (e.g., interval 1000)
< Stream-Cursor: 1000

# Client echoes cursor on next request
GET /stream?offset=123&live=long-poll&cursor=1000

# If still in same interval, server adds jitter and returns advanced cursor
< Stream-Cursor: 1050
```

**Long-poll Caching:** CDNs and proxies **SHOULD NOT** cache `204 No Content` responses from long-poll requests in most cases. Long-poll `200 OK` responses are safe to cache when keyed by `offset`, `cursor`, and authentication credentials.

### 8.2. SSE

SSE connections **SHOULD** be closed by the server approximately every 60 seconds. This enables new clients to collapse onto edge requests rather than maintaining long-lived connections to origin servers.

## 9. Extensibility

The Durable Streams Protocol is designed to be extended for specific use cases and implementations. Extensions **SHOULD** be pure supersets of the base protocol, ensuring compatibility with any client that implements the base protocol.

### 9.1. Protocol Extensions

Implementations **MAY** extend the protocol with additional query parameters, headers, or response fields to support domain-specific semantics. For example, a database synchronization implementation might add query parameters to filter by table or schema, or include additional metadata in response headers.

Extensions **SHOULD** follow these principles:

- **Backward Compatibility**: Extensions **MUST NOT** break base protocol semantics. Clients that do not understand extension parameters or headers **MUST** be able to operate using only base protocol features.
- **Pure Superset**: Extensions **SHOULD** be additive only. New parameters and headers **SHOULD** be optional, and servers **SHOULD** provide sensible defaults or fallback behavior when extensions are not used.
- **Version Independence**: Extensions **SHOULD** work with any version of a client that implements the base protocol. Extension negotiation **MAY** be handled through headers or query parameters, but base protocol operations **MUST** remain functional without extension support.

### 9.2. Authentication Extensions

See Section 10.1 for authentication and authorization details. Implementations **MAY** extend the protocol with authentication-related query parameters or headers (e.g., API keys, OAuth tokens, custom authentication headers).

## 10. Security Considerations

### 10.1. Authentication and Authorization

Authentication and authorization are explicitly out of scope for this protocol specification.
Clients **SHOULD** implement all standard HTTP authentication primitives (e.g., Basic Authentication [RFC7617], Bearer tokens [RFC6750], Digest Authentication [RFC7616]). Implementations **MUST** provide appropriate access controls to prevent unauthorized stream creation, modification, or deletion, but may do so using any mechanism they choose, including extending the protocol with authentication-related parameters or headers as described in Section 9.2.

### 10.2. Multi-tenant Safety

If stream URLs are guessable, servers **MUST** enforce access controls even when using shared caches. Servers **SHOULD** validate and sanitize stream URLs to prevent path traversal attacks and ensure URL components are within acceptable limits.

### 10.3. Untrusted Content

Clients **MUST** treat stream contents as untrusted input and **MUST NOT** evaluate or execute stream data without appropriate validation. This is particularly important for append-only streams used as logs, where log injection attacks are a concern.

### 10.4. Content Type Validation

Servers **MUST** validate that appended content types match the stream's declared content type to prevent type confusion attacks.

### 10.5. Rate Limiting

Servers **SHOULD** implement rate limiting to prevent abuse. The `429 Too Many Requests` response code indicates rate limit exhaustion.

### 10.6. Sequence Validation

The optional `Stream-Seq` header provides protection against out-of-order writes in multi-writer scenarios. Servers **MUST** reject sequence regressions to maintain stream integrity.

### 10.7. Browser Security Headers

When serving streams to browser clients, servers **SHOULD** include the following headers to prevent MIME-sniffing attacks, cross-origin embedding exploits, and cache-related vulnerabilities:

- `X-Content-Type-Options: nosniff`
  - Servers **SHOULD** include this header on all responses. This prevents browsers from MIME-sniffing the response content and potentially executing it as a different content type (e.g., interpreting binary data as HTML/JavaScript).
- `Cross-Origin-Resource-Policy: cross-origin` (or `same-origin`/`same-site`)
  - Servers **SHOULD** include this header to explicitly control cross-origin embedding. Use `cross-origin` to allow cross-origin access via `fetch()`, `same-site` to restrict to the same registrable domain, or `same-origin` for strict same-origin only. This prevents Cross-Origin Read Blocking (CORB) issues and protects against Spectre-like side-channel attacks.
- `Cache-Control: no-store`
  - Servers **SHOULD** include this header on HEAD responses and on responses containing sensitive or user-specific stream data. This prevents intermediate proxies and CDNs from caching potentially sensitive content. For public, non-sensitive historical reads, servers **MAY** use `Cache-Control: public, max-age=60, stale-while-revalidate=300` as described in Section 8.
- `Content-Disposition: attachment` (optional)
  - Servers **MAY** include this header for `application/octet-stream` responses to prevent inline rendering if a user navigates directly to the stream URL.

These headers provide defense-in-depth for scenarios where stream URLs might be accessed outside the intended programmatic fetch context (e.g., direct navigation, malicious cross-origin embedding via `