
bug: two brokers writing to the same partition can produce overlapping offsets #120

@klaudworks

Description


What happens

When two or more brokers receive produce requests for the same partition at the same time, they can both assign the same offset range to different messages. This means two different sets of messages end up in S3 claiming to be at the same offsets, and one silently overwrites or shadows the other.

Why it happens

Each broker keeps a local counter for the next offset to assign. This counter is initialized once, when the broker first opens a partition (read from etcd), and then incremented in memory on every produce. The updated offset is only written back to etcd later, when the segment is flushed to S3, and that write is a plain overwrite rather than a conditional update. So nothing stops two brokers from reading the same starting offset, assigning overlapping ranges, and both writing their segments to S3.
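To make the lost-update concrete, here is a minimal Python sketch of the sequence above (the project's actual language may differ; `MetadataStore` and the key name are hypothetical stand-ins for etcd, not real code from this repo):

```python
# Simulates the current behavior: each broker reads the starting offset once,
# counts in memory, and later writes the result back with a plain overwrite.

class MetadataStore:
    """In-memory stand-in for etcd: key -> (value, revision)."""
    def __init__(self):
        self.data = {}

    def get(self, key, default=0):
        value, _rev = self.data.get(key, (default, 0))
        return value

    def put(self, key, value):
        # Unconditional overwrite -- what the broker does today on flush.
        _, rev = self.data.get(key, (None, 0))
        self.data[key] = (value, rev + 1)

store = MetadataStore()
store.put("next_offset/topic-0", 0)

# Both brokers open the partition and read the same starting offset.
broker_a_next = store.get("next_offset/topic-0")
broker_b_next = store.get("next_offset/topic-0")

# Each assigns offsets 0..9 to its own batch, purely in memory.
broker_a_range = (broker_a_next, broker_a_next + 10)
broker_b_range = (broker_b_next, broker_b_next + 10)

# On flush, both blindly write back; the second put clobbers the first,
# and both segments in S3 now claim offsets 0..9.
store.put("next_offset/topic-0", broker_a_range[1])
store.put("next_offset/topic-0", broker_b_range[1])

assert broker_a_range == broker_b_range == (0, 10)  # overlapping offsets
```

A conditional update (etcd transaction guarded on the key's revision) would make the second write fail instead of silently winning, which is what the lease proposal below builds on.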

The proxy, which sits in front of the brokers, does not look at which partition a produce request is targeting. It just round-robins client connections across available brokers. So if two producers (or even one producer reconnecting) land on different brokers and write to the same partition, the race can occur.
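The proxy's routing can be sketched in a few lines of Python (illustrative only; the real proxy's structure is not shown in this issue):

```python
# Hypothetical sketch of the current proxy routing: plain round-robin over
# brokers, never consulting the partition a produce request targets.
from itertools import cycle

brokers = cycle(["broker-1", "broker-2"])

def route(produce_request):
    # The partition field is ignored entirely.
    return next(brokers)

# Two producers (or one producer reconnecting) writing to the same
# partition land on different brokers, enabling the race:
assert route({"partition": "topic-0"}) == "broker-1"
assert route({"partition": "topic-0"}) == "broker-2"
```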

Proposed Fix

Assumption 1: We cannot properly fix this in the broker alone without introducing some form of (slow) etcd partition lease mechanism.
Assumption 2: We may need more than one proxy running in parallel; otherwise the proxy is a single point of failure and we cannot deploy in HA mode.

Discarded Fix Proposal:

I evaluated a few solutions based on deterministic hashing over a broker list propagated via etcd, but each one had edge cases that ultimately required a proper lease.

Fix Proposal via etcd Leases:

  1. Brokers acquire an etcd lease per partition before writing. When a broker first receives a produce for a partition, it tries to grab a lease key in etcd (one round-trip). If it wins, it reads the current offset, creates the PartitionLog, and starts writing. All subsequent produces are purely local. If it loses, it rejects the request with NOT_LEADER_OR_FOLLOWER. When a broker dies, its leases expire automatically.
  2. Proxies watch the lease keys to know where to route. The proxy watches partition ownership keys in etcd and routes produces to the broker that holds the lease. This is a best-effort optimization because the proxy's view can be briefly stale.
  3. If a broker receives a produce for a partition it doesn't own, it rejects it. The proxy retries on another broker. This covers all edge cases: stale proxy routing table, lease just expired, new partition with no owner yet, scaling events. The broker's local lease check is the single source of truth. The proxy's routing avoids most retries, but the reject-and-retry fallback makes it bulletproof.
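Steps 1 and 3 above can be sketched as follows (a minimal Python simulation; `LeaseStore`, `put_if_absent`, and the key layout are illustrative assumptions, with `put_if_absent` modeling an etcd transaction guarded on the key not yet existing):

```python
# Sketch of per-partition lease acquisition and the reject-and-retry path.

NOT_LEADER_OR_FOLLOWER = "NOT_LEADER_OR_FOLLOWER"

class LeaseStore:
    """In-memory stand-in for etcd leases. put_if_absent models a txn that
    only succeeds if the lease key does not exist yet; expire models the
    lease TTL running out when a broker dies."""
    def __init__(self):
        self.owners = {}  # lease key -> broker id

    def put_if_absent(self, key, broker_id):
        if key in self.owners:
            return False          # another broker holds the lease
        self.owners[key] = broker_id
        return True

    def expire(self, key):
        self.owners.pop(key, None)

store = LeaseStore()

def handle_produce(broker_id, partition):
    key = f"lease/{partition}"
    if store.owners.get(key) == broker_id:
        return "OK"               # already the owner: purely local from here
    if store.put_if_absent(key, broker_id):
        return "OK"               # won the lease: read offset, open the log
    return NOT_LEADER_OR_FOLLOWER # lost: proxy retries on another broker

assert handle_produce("broker-1", "topic-0") == "OK"
assert handle_produce("broker-2", "topic-0") == NOT_LEADER_OR_FOLLOWER
store.expire("lease/topic-0")     # broker-1 dies; its lease expires
assert handle_produce("broker-2", "topic-0") == "OK"
```

The broker-side check is the source of truth regardless of how stale the proxy's view is, which is why the retry fallback in step 3 closes every edge case.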
