fix: prevent duplicate offset assignment with etcd lease-based partition ownership#123

klaudworks · 2026-03-01T00:04:07Z

Summary

fixes #122 #120, prepares for #115.

Multiple brokers can receive produce requests for the same partition concurrently. Each broker reads the starting offset from etcd once, increments in-memory, and writes it back as a simple PUT at flush time. When two brokers do this for the same partition, one silently overwrites the other's messages in S3.

How it works

Each partition is now exclusively owned by one broker at a time, enforced through etcd.

Broker side: Before writing to a partition, the broker claims ownership by creating an etcd key (/kafscale/partition-leases/{topic}/{partition}) using an atomic compare-and-swap — only succeeds if no other broker has claimed it. The key is attached to an etcd lease with a 10-second TTL. The broker sends periodic heartbeats to keep the lease alive. If the broker crashes and stops heartbeating, etcd expires the lease after 10 seconds and deletes the key, allowing another broker to take over. On graceful shutdown, the broker revokes the lease immediately so failover is instant. All partition keys share a single etcd lease, so there is only one heartbeat stream per broker regardless of how many partitions it owns.

Proxy side: The proxy watches the etcd lease keys and maintains an in-memory routing table mapping each partition to its owning broker. Produce requests are routed directly to the right broker. If the routing table is stale and a broker rejects a request with NOT_LEADER_OR_FOLLOWER, the proxy retries on a different broker (up to 3 attempts). Requests spanning partitions on different brokers are split, forwarded concurrently, and responses are merged back into a single response for the client.

Changes

Broker: etcd lease-based partition ownership. New PartitionLeaseManager acquires and manages partition leases. handleProduce checks ownership before writing. Leases are released immediately on graceful shutdown.
Proxy: partition-aware produce routing. New PartitionRouter watches lease keys for the routing table. New produce routing path splits, forwards, and merges requests by owning broker. Per-client connection pool avoids re-dialing the same broker on every request.
Protocol: produce request/response codec. When a produce request targets partitions on different brokers, the proxy needs to split it into smaller requests (one per broker) and merge the individual responses back into a single response for the client. This required adding serialization and deserialization for produce requests and responses. Round-trip tested across protocol versions 3-10.
Test infrastructure. Extracted embedded etcd setup into internal/testutil for reuse across etcd_store_test.go, partition_lease_test.go, and partition_router_test.go.

Tests

I could cover the multi-broker behavior well despite relying only on the in-memory etcd.

Follow-up

We should use the same establishes lease mechanism to route consumers to the correct broker for fetches and consumer group coordination. This would fix consumer group coordination #121 and at the same time guarantee cache hits for consumers because they read from the broker that serves the partition and has the relevant segments in cache already. This would spare us all the complex performance optimizations I implemented for #115 and didn't publish yet.

…ion ownership

novatechflow

Thank you @klaudworks

fix: prevent duplicate offset assignment with etcd lease-based partit…

4f77b6d

…ion ownership

klaudworks force-pushed the fix/partition-lease-ownership branch from d5f8206 to 4f77b6d Compare March 1, 2026 00:11

klaudworks requested a review from novatechflow March 1, 2026 00:19

novatechflow approved these changes Mar 1, 2026

View reviewed changes

novatechflow merged commit 578b149 into KafScale:main Mar 1, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: prevent duplicate offset assignment with etcd lease-based partition ownership#123