Stories by Anton Borisov on Medium

AutoMQ: Shared Storage Architecture Deep Dive

Anton Borisov — Wed, 18 Feb 2026 13:14:42 GMT

Disclaimer: I’ve previously written a broader overview of diskless Kafka proposals and the different ideas behind them, not tied specifically to AutoMQ; you can find it here. This post goes deeper into AutoMQ specifics.

If you’ve run Kafka in production for long enough, you have learned the pattern: everything looks beautifully simple until AWS MSK runs monthly security patches and your cluster decides it’s time for a rebalance.

At small scale, replication feels like a safety blanket: a comforting guarantee that your data exists in multiple places and can survive the loss of any single machine. At cloud scale, that same replication quietly transforms into a tax you pay on every byte, every day, with no end in sight. You’re storing the same data three times because the standard is to set the replication factor (RF) to 3, you’re paying for cross-AZ replication traffic that exists purely to keep those copies in sync, you’re watching partitions go under-replicated after the slightest operational drama, and every time you try to scale out, “add brokers” translates into a data migration that can take hours to complete.

Kafka still works. It’s just that the operational shape of “works” becomes: works, but you’re always one rebalance away from learning something new about your storage and networking bill.

What if we look at it differently:

Make brokers disposable by shifting durability guarantees to shared storage in the cloud.

That’s the whole architectural inversion, and everything else follows as a consequence. In this post, I’ll take AutoMQ as reference and walk through the three designed layers: ElasticLog, S3Stream, and the metadata control plane, then trace a single record through the entire write path, and finally show how the design pays off in zero-copy failover and no-migration scaling.

A mental model: Kafka’s shared-nothing model (left) couples each broker to its own disk and replicates data between them. AutoMQ’s shared-storage model (right) reduces brokers to compute + cache attached to a shared object storage layer. Brokers become disposable because they no longer own the data.

Where the Pain Comes From

Classic Kafka stores partitions on local disks, and durability emerges from replication: leaders write locally, followers fetch those writes, and the in-sync replica set becomes your safety envelope. It’s an elegant design that has powered some of the world’s largest streaming deployments, but it also couples three concerns tightly together: durability (determined by the replication factor and the health of the in-sync replicas aka ISRs), availability (dependent on leader election and how quickly followers can catch up), and scaling (which requires reassigning partitions, i.e. physically moving data).

When those three concerns are coupled, you start seeing the same script play out in different outfits. A broker dies and leader election fires within seconds, but the partitions it owned stay under-replicated until followers finish catching up: minutes, hours, maybe lifetimes. You add brokers to grow the cluster and the reassignment plan looks clean on paper, but behind the scenes it’s copying partition data across the network, competing with production traffic the whole time. Even the simplest capacity ask, “we need more headroom,” translates into a rebalance that has to physically move bytes across disks before anything actually improves.

AutoMQ makes one design choice that would be heresy in many Kafka shops: the replication factor is 1, and the durable copy lives in object storage.

One copy, period. S3’s 99.999999999% durability, eleven nines, handles the rest. And if you accept that single design decision, the dominoes fall in a very specific order: brokers become stateless because there are no local segment files to own, failover becomes a metadata-only operation because there are no segments to copy, and scaling becomes “add compute” because there’s no data to migrate. This isn’t “Kafka but cheaper.” It’s Kafka whose durability mechanism has moved from ISR-based replication to object storage, and that shift changes the operational character of the entire system.

Quick Primer

AutoMQ is Kafka-compatible at the protocol layer, meaning clients connect and produce and consume exactly as they would with standard Kafka, but internally it swaps Kafka’s file-backed log for one backed by shared storage. Three pieces matter for understanding how this works: ElasticLog, which is a Kafka Log implementation where “segments” aren’t files but slices of streams; S3Stream, which is a storage engine that turns S3 into something log-like through a combination of write-ahead logging, caching, and batched uploads; and the metadata control plane, which maintains the mapping that ties each topic-partition to the underlying streams and manages their object lifecycle.

AutoMQ’s three layers inside a broker: Clients connect via the standard Kafka protocol, and ElasticLog translates that into segment semantics over streams. S3Stream handles WAL-based durability and caching underneath. The metadata plane tracks partition-to-stream ownership, making the broker replaceable without moving data.

Architecture at a Glance

Think of AutoMQ as Kafka with the log peeled into three layers, each with a single, cleanly separated responsibility. ElasticLog sits at the top and exists to keep Kafka semantics intact; it’s what makes the system look and feel like Kafka to everything above it. S3Stream sits in the middle and makes S3 usable as a log-like storage backend, handling the messy realities of object storage economics and latency. And the metadata plane sits at the bottom, maintaining the ownership mappings and lifecycle state that allow brokers to be truly replaceable without any data movement.

Layer 1: ElasticLog

Kafka’s normal LogSegment is file-backed, consisting of .log, .index, .timeindex, and .txnindex files. AutoMQ introduces `ElasticLogSegment`, which presents itself as a standard Kafka segment to everything above it. It is backed not by files, but by stream slices:

// core/.../streamaspect/ElasticLogSegment.java
public class ElasticLogSegment extends LogSegment {
    private final ElasticLogFileRecords log;       // Data -> stream slice
    private final ElasticTimeIndex timeIndex;       // Timestamps -> stream slice
    private final ElasticTransactionIndex txnIndex; // Transactions -> stream slice

    public ElasticLogSegment(...) {
        super(null, null, null, null, ...);  // Nulls for all file-based params
        log = new ElasticLogFileRecords(
            sm.loadOrCreateSlice("log" + suffix, meta.log()), ...);
    }
}

Kafka wants to append records and advance the log end offset (LEO) synchronously, whereas S3 wants you to batch, pipeline, and confirm later. These two expectations are fundamentally at odds. AutoMQ resolves this tension by splitting the concerns: the LEO advances immediately in the Kafka-facing layer, while the confirmOffset advances separately after the underlying storage pipeline has actually made the bytes durable.

To prevent “async durability” from quietly turning into “infinite memory consumption,” the write path applies backpressure through a semaphore keyed by bytes, allowing a bounded amount of unconfirmed writes, currently capped at 1GB. When the system reaches saturation, producers block early rather than letting memory silently balloon until something crashes.
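A minimal sketch of a byte-keyed budget like the one described above. The class and method names are hypothetical and this is not AutoMQ's code; it only shows the reserve-on-append, release-on-confirm lifecycle using `java.util.concurrent.Semaphore`.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical byte budget for writes that are appended but not yet confirmed durable. */
class AppendBudget {
    private final Semaphore permits;

    AppendBudget(int maxUnconfirmedBytes) {
        // One permit per byte; a 1GB cap (1 << 30) still fits in an int.
        this.permits = new Semaphore(maxUnconfirmedBytes);
    }

    /** Reserve room for a batch; returns false if the budget stays exhausted until the timeout. */
    boolean tryReserve(int batchBytes, long timeoutMs) {
        try {
            return permits.tryAcquire(batchBytes, timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Called once storage confirms the batch is durable, returning its bytes to the budget. */
    void release(int batchBytes) {
        permits.release(batchBytes);
    }

    int availableBytes() {
        return permits.availablePermits();
    }
}
```

A caller would loop on `tryReserve` while the budget is exhausted, which is what makes the producer slow down instead of the heap growing.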

The read path uses a two-tier executor design. Every consumer fetch first runs on a fast fetch executor (4 threads by default) that attempts to serve entirely from cache. If the data isn’t cached, the request fails fast and gets resubmitted to a slow fetch executor (12 threads) that reads from S3. This keeps cache hits on a small, low-latency thread pool and prevents slow S3 reads from blocking fast ones.
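The two-tier idea can be sketched roughly as follows. Everything here is an illustration under assumptions: the class name, the `Map` standing in for S3, and the use of an empty `Optional` as the "cache miss" signal are all hypothetical, and only the pool sizes come from the text.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical two-tier reader: a cache-only fast pool and an object-store-backed slow pool. */
class TwoTierFetcher implements AutoCloseable {
    private final ExecutorService fastPool = daemonPool(4);   // cache hits only
    private final ExecutorService slowPool = daemonPool(12);  // object-store reads
    private final Map<Long, String> cache = new ConcurrentHashMap<>();
    private final Map<Long, String> objectStore;              // stand-in for S3

    TwoTierFetcher(Map<Long, String> objectStore) {
        this.objectStore = objectStore;
    }

    CompletableFuture<String> fetch(long offset) {
        return CompletableFuture
            // Fast tier: consult the cache only, never touch the object store.
            .supplyAsync(() -> Optional.ofNullable(cache.get(offset)), fastPool)
            .thenCompose(hit -> {
                if (hit.isPresent()) {
                    return CompletableFuture.completedFuture(hit.get());
                }
                // Miss: resubmit to the slow tier so the fast threads stay free.
                return CompletableFuture.supplyAsync(() -> readFromStore(offset), slowPool);
            });
    }

    private String readFromStore(long offset) {
        String value = objectStore.get(offset);
        if (value != null) cache.put(offset, value); // warm the cache for later fetches
        return value;
    }

    boolean isCached(long offset) { return cache.containsKey(offset); }

    private static ExecutorService daemonPool(int threads) {
        return Executors.newFixedThreadPool(threads, r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });
    }

    @Override public void close() {
        fastPool.shutdown();
        slowPool.shutdown();
    }
}
```

The point of the split is isolation: a burst of cold reads can saturate the slow pool without adding queueing delay to requests that would have been cache hits.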

When a segment rolls because it has reached its size or time limit, the old segment’s stream slices are sealed, permanently closed to new writes, with their metadata finalized and offset ranges made stable. This is how AutoMQ achieves an object-store-friendly log shape: you append to the active slice until the segment rolls, then seal it and move on, creating a clean boundary between mutable and immutable data.

At this point you’ve built “Kafka semantics over slices,” but you still need a storage engine that makes those slices efficient to write, read, and manage at scale.

Layer 2: S3Stream

If you hear “Kafka on S3” and imagine every append causing an S3 PUT, stop right there — that’s not an architecture, that’s a bill. S3Stream’s entire reason for existing is to make object storage behave like a log without drowning in request costs and tail latency, and it accomplishes this through a three-stage pipeline where writes land in a durable WAL first, hot reads come from an explicit cache layer, and objects are uploaded to S3 in carefully batched operations to keep the request count sane.

Writes first land in a Write-Ahead Log backed by S3. Rather than issuing a separate S3 PUT for every partition append, which would mean thousands of tiny requests per second, the WAL collects records from all partitions into a shared buffer and flushes them as a single S3 object every 250ms or 8MB, whichever comes first. That one object is a mix of records from different partitions, ordered by arrival time, not by topic. Once the PUT succeeds, the producer gets its ack and the data is durable. Each broker maintains its own WAL; there’s no shared WAL across the cluster.
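The "250ms or 8MB, whichever comes first" rule is easy to get subtly wrong, so here is a small sketch of it. The class is hypothetical, time is passed in explicitly to keep the example deterministic, and the actual PUT is reduced to a counter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical WAL buffer that flushes on a size or age threshold, whichever trips first. */
class WalBatcher {
    static final long MAX_BATCH_BYTES = 8L * 1024 * 1024; // 8MB, per the text
    static final long MAX_BATCH_AGE_MS = 250;             // 250ms, per the text

    private final List<byte[]> pending = new ArrayList<>();
    private long pendingBytes = 0;
    private long firstAppendAtMs = -1;
    private int flushedObjects = 0;

    /** Buffer a record from any partition; returns true if this append triggered a flush. */
    boolean append(byte[] record, long nowMs) {
        if (pending.isEmpty()) firstAppendAtMs = nowMs; // the age clock starts at the first record
        pending.add(record);
        pendingBytes += record.length;
        return maybeFlush(nowMs);
    }

    /** Called by appends and by a periodic timer tick. */
    boolean maybeFlush(long nowMs) {
        if (pending.isEmpty()) return false;
        boolean full = pendingBytes >= MAX_BATCH_BYTES;
        boolean old = nowMs - firstAppendAtMs >= MAX_BATCH_AGE_MS;
        if (!full && !old) return false;
        // A real implementation would issue one object-store PUT here
        // and acknowledge the waiting producers once it succeeds.
        flushedObjects++;
        pending.clear();
        pendingBytes = 0;
        return true;
    }

    int flushedObjects() { return flushedObjects; }
    long pendingBytes() { return pendingBytes; }
}
```

Under light traffic the age threshold dominates and sets the latency floor; under heavy traffic the size threshold dominates and sets the request rate.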

These WAL objects aren’t the permanent home. In the background, an upload pipeline reads them, reorganizes records by stream, and writes proper partition-ordered objects for efficient reads. Then the WAL objects are trimmed. The WAL exists to make small writes cheap and durable; the upload pipeline exists to make reads fast.

The upload pipeline doesn’t just dump WAL contents into S3 one-to-one; it makes deliberate grouping decisions to keep object counts and request costs under control. Large streams that produce 16MB or more per upload cycle get their own dedicated objects, while smaller streams are bundled together into stream set objects, preventing the object store from filling up with millions of tiny files that would each cost a separate GET request to read back.
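The grouping decision boils down to a threshold split. A sketch, with hypothetical names and a plain map of stream ID to bytes produced in the current cycle as input; only the 16MB threshold is taken from the text:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical planner that splits streams into dedicated and bundled uploads. */
class UploadPlanner {
    static final long DEDICATED_THRESHOLD_BYTES = 16L * 1024 * 1024; // 16MB, per the text

    /** Stream IDs that get their own object, and those bundled into one stream set object. */
    static final class Plan {
        final List<Long> dedicated = new ArrayList<>();
        final List<Long> bundled = new ArrayList<>();
    }

    static Plan plan(Map<Long, Long> bytesByStream) {
        Plan plan = new Plan();
        // TreeMap only gives the sketch a deterministic iteration order.
        for (Map.Entry<Long, Long> e : new TreeMap<>(bytesByStream).entrySet()) {
            if (e.getValue() >= DEDICATED_THRESHOLD_BYTES) {
                plan.dedicated.add(e.getKey());
            } else {
                plan.bundled.add(e.getKey());
            }
        }
        return plan;
    }
}
```

With a thousand mostly idle partitions and a handful of hot ones, this yields a handful of dedicated objects plus one bundle per cycle instead of a thousand tiny objects.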

Each upload cycle follows a prepare -> upload -> commit protocol: an object ID is allocated from the controller first, the data is written to S3, and only then is the object registered in KRaft metadata. This ordering ensures that metadata always points to objects that actually exist, and that recovery after a crash mid-upload is deterministic. The controller knows exactly which uploads completed and which were abandoned.
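To see why the ordering matters, here is a toy model of the three phases. The controller and the object store are replaced by in-memory collections and every name is hypothetical; the sketch only demonstrates the invariant that committed metadata never points at a missing object.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-ins for controller metadata and the object store. */
class ThreePhaseUploader {
    private long nextObjectId = 0;
    private final Set<Long> prepared = new HashSet<>();   // IDs handed out, not yet committed
    private final Set<Long> committed = new HashSet<>();  // IDs visible in metadata
    private final Map<Long, byte[]> objectStore = new HashMap<>();

    /** Phase 1: reserve an object ID before any bytes are written. */
    long prepare() {
        long id = nextObjectId++;
        prepared.add(id);
        return id;
    }

    /** Phase 2: write the bytes under the reserved ID. */
    void upload(long objectId, byte[] data) {
        if (!prepared.contains(objectId)) {
            throw new IllegalStateException("unknown object " + objectId);
        }
        objectStore.put(objectId, data);
    }

    /** Phase 3: register the object in metadata, but only if the bytes really exist. */
    void commit(long objectId) {
        if (!objectStore.containsKey(objectId)) {
            throw new IllegalStateException("nothing uploaded for " + objectId);
        }
        prepared.remove(objectId);
        committed.add(objectId);
    }

    /** Recovery: anything prepared but never committed is abandoned and can be deleted. */
    Set<Long> abandonAfterCrash() {
        Set<Long> abandoned = new HashSet<>(prepared);
        for (Long id : abandoned) objectStore.remove(id);
        prepared.clear();
        return abandoned;
    }

    boolean isReadable(long objectId) { return committed.contains(objectId); }
}
```

A crash between upload and commit leaves an orphaned object but no dangling metadata, which is the safe direction to fail in.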

Standard Kafka benefits massively from the operating system’s page cache, which transparently keeps recently-written and recently-read data in memory. But if brokers are stateless and S3 is the durable store, there are no local files for the OS to cache, so you need an explicit hot path instead. S3Stream fills this role with a block-structured in-memory cache, 200MB by default with up to 64 blocks, that serves recent reads at memory speed, comparable to a page cache hit in standard Kafka. This creates a two-tier read path: the in-memory cache serves at sub-millisecond latency, and S3 GET operations handle everything else at 100–500ms. That layering is why the design can credibly claim that hot reads feel Kafka-like while being honest that cold reads are fundamentally different: no amount of caching eliminates that reality for data that’s fallen out of the hot tier.

At this point we have Kafka semantics through ElasticLog and durable, economical storage behavior through S3Stream. The remaining piece is the glue that makes brokers truly replaceable.

Layer 3: Metadata Control Plane

Shared storage solves the durability problem, but it creates a new requirement that didn’t exist in the shared-nothing world. Brokers must be able to answer instantly: “Which shared-storage streams represent this partition?”

AutoMQ maintains a bidirectional mapping in KRaft metadata that provides exactly this:

// metadata/.../S3StreamsMetadataImage.java
private final TimelineHashMap<TopicIdPartition, Set<Long>> partition2streams;
private final TimelineHashMap<Long, TopicIdPartition> stream2partition;

This mapping is the architectural keystone of the entire system. When a broker needs to serve a partition, it looks up the stream IDs in this map. The streams already exist in S3, so there’s nothing to copy and nothing to reconstruct. “Ownership” in AutoMQ is just a pointer, not a physical relationship with data on a local disk.
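As a rough illustration of what a bidirectional partition-to-stream index provides, here is a sketch on plain `HashMap`s. It is not the KRaft-backed structure: the class is hypothetical, and partitions are represented as simple strings such as "orders-0".

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical bidirectional index between partitions and the streams that back them. */
class StreamIndex {
    private final Map<String, Set<Long>> partitionToStreams = new HashMap<>();
    private final Map<Long, String> streamToPartition = new HashMap<>();

    /** Record that a stream belongs to a partition, keeping both directions in sync. */
    void register(String partition, long streamId) {
        partitionToStreams.computeIfAbsent(partition, p -> new HashSet<>()).add(streamId);
        streamToPartition.put(streamId, partition);
    }

    /** Forward lookup: what a broker asks when it starts serving a partition. */
    Set<Long> streamsOf(String partition) {
        return partitionToStreams.getOrDefault(partition, new HashSet<>());
    }

    /** Reverse lookup: which partition a given stream belongs to. */
    String partitionOf(long streamId) {
        return streamToPartition.get(streamId);
    }
}
```

Note that nothing in the index refers to a broker or a disk path, which is why reassigning a partition never has to touch the data.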

Anatomy of a Write

Now that you know the three layers, let’s trace a single producer record through the entire system, from producer.send() to durable acknowledgment, and see where things can go wrong along the way.

Step 1: ElasticLog Accepts the Append

When a producer record batch arrives, ElasticLog.append() does two things in sequence. First, it tries to acquire a permit from the backpressure semaphore, sized to the byte count of the incoming batch:

// ElasticLog.scala
val permit = records.sizeInBytes()
if (!APPEND_PERMIT_SEMAPHORE.tryAcquire(permit)) {
    while (!APPEND_PERMIT_SEMAPHORE.tryAcquire(permit, 1, TimeUnit.SECONDS)) {
        tryAppendStatistics()  // Log stats while the producer waits
    }
}

If the semaphore is exhausted, meaning too many bytes are in flight between “appended” and “confirmed durable”, the producer thread blocks here, retrying every second. This is intentional: it’s better to slow down the producer than to let unconfirmed data silently consume all available memory.

Once the permit is acquired, the record is appended to the active ElasticLogSegment and the log end offset advances immediately. From this moment forward, consumers fetching with isolation.level=read_uncommitted can see the record, even though it isn't durable yet.

Step 2: The Async Flush Enters S3Stream

The segment’s asyncLogFlush() returns a CompletableFuture that propagates through the storage stack until it reaches the WAL. Along the way, two backpressure mechanisms guard against overload: if the LogCache is full, the request enters a backoffRecords queue and retries periodically until space opens up, and if the WAL itself is at physical capacity, it throws OverCapacityException, which triggers a forced upload of the current WAL contents to S3 to make room. Both mechanisms exist to prevent any single stage from silently overwhelming the next.

Step 3: WAL Batches and Writes to S3

The WAL doesn’t write every record individually; that would be financial suicide on S3. Instead, it batches records for up to 250ms or 8MB (defaults), whichever comes first, then issues a single S3 PUT for the entire batch as a WAL object.

When that PUT succeeds, the write is durable. Two things happen immediately: the record lands in LogCache inside the storage layer, making it available for hot reads, and the backpressure semaphore permit is released, reclaiming budget for new writes. Then, in ElasticLog’s completion callback, confirmOffset advances via a compare-and-set loop that ensures monotonic progress even when concurrent appends complete out of order:

// ElasticLog.scala, confirmOffset advances atomically
while (true) {
    val offset = _confirmOffset.get()
    if (offset.messageOffset < endOffset) {
        _confirmOffset.compareAndSet(offset,
            new LogOffsetMetadata(endOffset, activeSegment.baseOffset, activeSegment.size))
        notify = true
    } else {
        break()
    }
}

Once confirmOffset advances, an appendAckQueue notification fires, processed by a single-threaded callback executor that sends the producer acknowledgment, single-threaded specifically to guarantee that acks are delivered in offset order.
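The monotonic-advance pattern in the snippet above can be reduced to a few lines of Java. This is a simplified sketch on a bare `AtomicLong` with a hypothetical class name; the real code tracks a `LogOffsetMetadata` object rather than a single number.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical confirmed-offset tracker that only ever moves forward. */
class ConfirmOffset {
    private final AtomicLong confirmed = new AtomicLong(0);

    /** Advance to endOffset if it is ahead; returns true if this call moved the offset. */
    boolean advanceTo(long endOffset) {
        while (true) {
            long current = confirmed.get();
            if (current >= endOffset) {
                return false; // a later append already confirmed past this point
            }
            if (confirmed.compareAndSet(current, endOffset)) {
                return true;
            }
            // Lost the race: another thread changed the value, so re-read and retry.
        }
    }

    long get() { return confirmed.get(); }
}
```

If the caller does not need to know whether it moved the offset, `confirmed.accumulateAndGet(endOffset, Math::max)` expresses the same idea in one call.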

Step 4: Batched Upload to Permanent Objects

The WAL is a staging area, not a long-term home. In the background, S3Storage runs an upload pipeline that drains WAL contents into permanent S3 objects through the same prepare -> upload -> commit protocol described earlier. The prepare phase allocates an object ID from the controller with a 60-minute TTL that ensures cleanup if the upload is abandoned. The upload phase groups records by stream, giving streams that produce 16MB or more per cycle their own dedicated stream objects while bundling smaller streams into stream set objects. And the commit phase registers the new objects in KRaft metadata and marks the WAL source data for trimming.

The pipeline is designed so that the next cycle’s prepare phase overlaps with the current cycle’s upload, ensuring there’s always forward progress. Once committed, the data has graduated from ephemeral WAL objects to permanent, compaction-eligible storage.

The write path: Records flow through ElasticLog into the in-memory WAL buffer, which flushes batched S3 PUTs. Once S3 confirms the write (1), the buffer places the data into LogCache for hot reads (2). Durability is confirmed before the producer ack fires.

What Happens When Things Go Wrong

The most straightforward failure is a WAL PUT that doesn’t succeed. When this happens, ElasticLogFileRecords marks the segment as FENCED, a terminal state in which all subsequent append attempts throw IOException:

// ElasticLogFileRecords.java
cf.whenComplete((rst, e) -> {
    if (e instanceof IOException) {
        status = ElasticResourceStatus.FENCED;
    }
});

The records that weren’t confirmed are effectively lost to the producer, which will see a timeout and can retry. The partition will likely need to roll a new segment or be reassigned to recover.

If a broker dies between append and WAL commit, the situation is similar: the log end offset advanced but confirmOffset didn't, meaning those records were never acknowledged. Producers will time out and retry, and the records, which existed only in the dead broker's memory, are gone. This is actually the same semantic as standard Kafka when acks=1 and a broker dies before replication completes, with one important difference: in AutoMQ, every write routes through the WAL, so the window of vulnerability is bounded by the WAL batching interval (up to 250ms) rather than by replication lag, which can be significantly longer.

The most interesting failure case is when a broker dies after the WAL commit but before the upload to permanent objects. Here, the WAL objects are already sitting in S3, so the data is durable, the producer received its acknowledgment and that acknowledgment was honest. During failover, the new broker runs WAL recovery, reads those WAL objects from S3, uploads them as proper permanent objects, and resets the WAL. No acknowledged data is lost, and the recovery is deterministic because the WAL lives in S3, not on a dead machine’s local disk.

The Payoff: Zero-Copy Failover

Classic Kafka failover has two phases that operate on very different timescales. The first phase is leader election, which is often fast at roughly 10–25 seconds including session timeout and client recovery. The second phase is the aftercare — under-replication recovery, manual reassignment, data movement, which can stretch into hours depending on partition sizes and cluster load.

AutoMQ aims to keep the first phase comparable to standard Kafka, and delete the second phase entirely.

How Failure Is Actually Detected

Failover isn’t triggered the instant a broker disappears — there’s a deliberate detection sequence designed to avoid false positives. The FailoverControlManager runs a scheduled task every 10 seconds that evaluates whether a failover should proceed by checking three conditions through NodeRuntimeMetadata.shouldFailover():

// NodeRuntimeMetadata.java
public boolean shouldFailover() {
    return isFenced() && hasOpeningStreams
        && System.currentTimeMillis() - epoch > DONT_FAILOVER_AFTER_NEW_EPOCH_MS;
}

All three conditions must hold simultaneously: the node must be fenced (meaning KRaft’s session timeout has expired, which takes about 9 seconds by default), the node must have opening streams (confirming it was actually serving partitions rather than sitting idle), and the node must have been down for longer than a grace period after its last restart (to avoid triggering failover if a broker is simply bouncing quickly). Only when all three conditions are met does the controller create a FailoverContext, assign a target broker via round-robin from the pool of healthy nodes, and record the failover decision in KRaft metadata.

The Metadata Event

Once the failover decision is recorded, the actual transfer of partition ownership is remarkably lightweight. The new broker begins serving immediately by attaching to the existing streams in shared storage, creating partition handles with zero data copying:

// core/.../streamaspect/ElasticReplicaManager.scala
case HostedPartition.None =>
    val partition = Partition(tp, time, this)     // Just a metadata wrapper
    allPartitions.put(tp, HostedPartition.Online(partition))
    // No segment copy. No index rebuild. Partition is immediately online.

Three lines of code and the partition is online — no segment copying, no index file rebuilding, no waiting for replicas to catch up. The bytes are already durable in shared storage and always have been; the only thing that needed to change was which broker holds the pointer.

What Happens to In-Flight Writes

When the old broker dies, any records that were appended but not yet WAL-committed are gone. They were never acknowledged, so producers with acks=all will time out and retry, which is the expected contract.

The interesting case is records that were WAL-committed and therefore acknowledged to the producer, but not yet uploaded as permanent objects. Those WAL objects are already sitting in S3, and the FailoverContext carries the dead broker's WAL configuration so the new broker knows where to find them. During recovery, the new broker reads those WAL objects, promotes any records that haven't already been committed as permanent objects, and resets the WAL. No acknowledged data is lost, and the recovery is fully deterministic because every piece of state it needs is already in shared storage.

Epoch Fencing: Preventing Zombie Writers

Zombie writers being fenced :)

There’s a subtle race condition worth considering: what if the old broker isn’t actually dead, but was just slow, and the network partition that triggered its fencing resolves before it realizes it’s been replaced? If both brokers attempt to write to the same stream, you have a split-brain problem.

S3Stream handles this through epoch-based fencing. When the new broker opens streams with a higher epoch, any subsequent append from the old broker’s lower epoch receives an EXPIRED_STREAM_EPOCH error.

The old broker’s stream is marked FENCED and all future operations on it fail immediately. There's no window for split-brain writes, and the epoch mechanism makes ownership transfer atomic from the stream's perspective, regardless of whether the old broker knows it's been replaced.
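Epoch fencing in miniature, as a sketch: the class and exception names are hypothetical stand-ins (the exception models the EXPIRED_STREAM_EPOCH error mentioned above), and the stream is just a counter, but the comparison is the whole mechanism.

```java
/** Hypothetical stream that accepts appends only from the current epoch's owner. */
class FencedStream {
    /** Stand-in for the EXPIRED_STREAM_EPOCH error described in the text. */
    static final class ExpiredEpochException extends RuntimeException {
        ExpiredEpochException(String message) { super(message); }
    }

    private long currentEpoch = 0;
    private long nextOffset = 0;

    /** A new owner opens the stream with a strictly higher epoch, fencing earlier owners. */
    synchronized void open(long epoch) {
        if (epoch <= currentEpoch) {
            throw new ExpiredEpochException("open with epoch " + epoch + " <= " + currentEpoch);
        }
        currentEpoch = epoch;
    }

    /** Appends carry the writer's epoch; anything older than the current epoch is rejected. */
    synchronized long append(long writerEpoch, byte[] record) {
        if (writerEpoch < currentEpoch) {
            throw new ExpiredEpochException("append with epoch " + writerEpoch + " < " + currentEpoch);
        }
        return nextOffset++; // offset assigned to the accepted record
    }
}
```

The zombie broker does not need to be told anything: its next append simply fails, because the check happens on the stream side rather than in the writer.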

The Cold Cache Penalty

There’s one thing that’s worth being honest about: the new broker’s first few seconds of read performance. When a partition comes online on a fresh broker, its LogCache is empty, which means every consumer fetch during this window hits S3 directly, 100–500ms per read instead of sub-millisecond.

This is a real, observable penalty, and the distinction matters: “partition is online” and “partition is performing at full speed” are not the same moment. The cache warms as new writes flow through and as consumers trigger reads, and within seconds to minutes depending on throughput the hot path is fully active again. But for latency-sensitive consumers, those first moments after failover will feel measurably different from steady state.

Compaction

AutoMQ compacts at multiple levels, from stream sets to large consolidated objects, and supports two strategies for doing so. Physical merges rewrite small objects into bigger sequential ones, targeting objects under 128MB. Composite merges create reference objects that point to the originals, trading some additional indirection for significantly cheaper write operations on larger data.

The goal isn’t just cost control, though that matters; it’s also about read behavior. Compaction is how you prevent “fetch a range of offsets” from degenerating into “issue 400 separate GET requests for tiny objects,” which would destroy both latency and your S3 bill.

MetaStream: Replacing Kafka’s Local Checkpoint Files

To understand what MetaStream does, you first need to understand what it replaces.

Standard Kafka keeps several checkpoint files on every broker’s local disk — all simple text files written atomically via temp-file-and-rename — covering recovery points, log start offsets, high watermarks, compaction progress, and leader epoch histories, plus serialized producer state snapshots that capture in-flight transactions at specific offsets. These files are written by periodic scheduled tasks rather than on every append, and they represent state that cannot be reconstructed from the message stream alone: messages don’t carry “I am a segment boundary” markers, epoch transitions aren’t embedded in payloads, and replaying producer state from offset zero on a 10TB partition could take hours.

Stateless brokers can’t have local files, so AutoMQ collapses all of this checkpoint state into four keys inside MetaStream, an append-only KV store that is itself persisted as an elastic stream in S3:

// core/.../streamaspect/MetaStream.java
public static final String LOG_META_KEY = "LOG";
public static final String PRODUCER_SNAPSHOTS_META_KEY = "PRODUCER_SNAPSHOTS";
public static final String PARTITION_META_KEY = "PARTITION";
public static final String LEADER_EPOCH_CHECKPOINT_KEY = "LEADER_EPOCH_CHECKPOINT";

Each key stores a specific piece of non-derivable state that a new broker needs to begin serving a partition without scanning the entire log. The LOG key is the segment inventory — a table of contents that maps each segment to its byte ranges within S3 streams, so the new broker knows where segments begin and end without scanning terabytes of data. The PARTITION key stores recovery checkpoints, most critically the recoverOffset that tells the new broker "only replay messages from here forward," turning a potentially hours-long full-log scan into a seconds-long tail scan. The PRODUCER_SNAPSHOTS key captures serialized producer state at specific offsets, avoiding the need to replay every historical message through the producer state machine. And the LEADER_EPOCH_CHECKPOINT key records which leader owned which offset range, information that followers need for divergence detection after a leader change and that simply doesn't exist inside the messages themselves.

When a new broker takes over a partition, the MetaStream replay is the very first thing that happens. The broker opens the partition’s MetaStream from S3, iterates every entry to cache the latest value per key, then uses those four keys to reconstruct the full partition state: segment inventory, recovery checkpoints, producer snapshots, and epoch history. Finally, it replays only the messages after recoverOffset to rebuild producer state for the unvalidated tail.

The whole thing takes seconds. It’s reading a small metadata stream — typically a few hundred kilobytes after compaction — not scanning terabytes of partition data. And because MetaStream self-compacts when 60% of its entries become obsolete, it stays compact over the partition’s entire lifetime.
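The replay-and-compact behaviour described here can be modelled with a tiny append-only key-value log. This is a sketch under assumptions: the class is hypothetical, values are plain strings, and "obsolete" is taken to mean "superseded by a later entry for the same key". Only the 60% threshold comes from the text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical append-only KV log with latest-wins replay and threshold-based compaction. */
class MetaLog {
    static final int COMPACT_PERCENT = 60; // compact once 60% of entries are obsolete

    private final List<String[]> entries = new ArrayList<>(); // each entry is {key, value}

    void put(String key, String value) {
        entries.add(new String[] {key, value});
        int obsolete = entries.size() - replay().size();
        if (obsolete * 100 >= entries.size() * COMPACT_PERCENT) {
            compact();
        }
    }

    /** Replay from the start; later entries overwrite earlier ones for the same key. */
    Map<String, String> replay() {
        Map<String, String> latest = new HashMap<>();
        for (String[] e : entries) latest.put(e[0], e[1]);
        return latest;
    }

    /** Rewrite the log so it holds exactly one entry per live key. */
    private void compact() {
        Map<String, String> latest = replay();
        entries.clear();
        latest.forEach((k, v) -> entries.add(new String[] {k, v}));
    }

    int entryCount() { return entries.size(); }
}
```

Because a takeover only needs the output of `replay()`, its cost scales with the size of the metadata log and not with the size of the partition.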

This is what makes brokers truly disposable in practice: the new broker doesn’t need yesterday’s local directory, doesn’t need any files from the dead broker, doesn’t need anything except access to shared storage. It rebuilds its entire world from the MetaStream and starts serving.

Tradeoffs

What you give up is real and shouldn’t be minimized. Cold reads are fundamentally slower because an S3 GET is not a page cache hit; that’s a 100–1000x difference in latency that no amount of wishful thinking can eliminate. You introduce cache tuning and compaction scheduling as first-class operational concerns that didn’t exist before. You trade the relative simplicity of file-system-backed storage for a multi-stage pipeline involving WAL batching, commit protocols, and cache coherence. And write latency increases from sub-millisecond to 5–50ms because the WAL batches writes for up to 250ms before issuing S3 PUTs.

What you stop suffering, though, is equally real. The 3x replication cost that was the non-negotiable baseline is gone. The assumption that “scaling implies migration,” and the hours of data movement that came with it, is gone. The failover aftercare that triggered prolonged under-replication windows while you waited for data to rebalance across surviving brokers is gone.

If you’re optimizing for the cloud reality where durability is cheap in object storage but moving bytes around inside your cluster is expensive in both time and money, AutoMQ’s take starts to look appealing.

Closing Notes

Kafka made brokers part of the durable identity of the system: each broker owned its partitions physically, and that ownership was load-bearing for both durability and availability.

AutoMQ removes that identity entirely. Data is permanent, brokers are not. And once you internalize that inversion, the entire design becomes consistent: ElasticLog keeps Kafka semantics intact without segment files, S3Stream makes object storage behave like a log without generating insane request costs, the metadata plane makes ownership transferable without copying a single byte, and MetaStream replaces local checkpoint files so that brokers can be truly, operationally stateless.

This is “bring compute to data,” applied to streaming logs. And if you’ve ever stared at a rebalance plan and thought why am I moving all these bytes just to add capacity, you already understand why this approach is attractive.

Acknowledgments

Thanks to the wonderful Nicoleta Lazar for reviewing drafts of this article and improving its clarity.

AutoMQ: Shared Storage Architecture Deep Dive was originally published in fresha-data-engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

What the Fuss with Fluss: Flink 2.2 Delta Force

Anton Borisov — Fri, 09 Jan 2026 13:23:04 GMT

Apache Flink 2.2 delivers three major enhancements to Delta Join that transform it from a promising optimisation into a production-ready solution for real-world streaming pipelines. If you’ve been waiting for Delta Join to handle CDC sources, support caching or allow filter pushdowns: the wait is over.

This article covers what’s new, why it matters, and how to put these capabilities into practice. If you’re unfamiliar with Delta Join fundamentals, my previous article on Flink 2.1’s Delta Join provides the architectural foundation.

Flink 2.2 addresses the most critical gaps in Delta Join’s 2.1 implementation:

CDC Upsert Support (FLINK-38511): Delta Join now accepts UPDATE_AFTER records, making it usable with MySQL CDC, PostgreSQL CDC, and Debezium connectors
Built-in Caching (FLINK-38495): An integrated LRU cache reduces external storage lookups by 80–90% for typical workloads
Filter and Projection Support (FLINK-38556): Deterministic filters and column projections can now be used by Delta Join, reducing data transfer and enabling new query patterns

Together, these changes enable Delta Join for the most common streaming use case.
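For readers who have not built one, an LRU cache of the kind mentioned above can be sketched in a few lines on top of `LinkedHashMap`. The class name is hypothetical and this is not Flink's implementation; it only illustrates the eviction policy.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded lookup cache that evicts the least recently used entry. */
class LruLookupCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruLookupCache(int capacity) {
        // accessOrder = true: iteration order follows recency of access, not insertion.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    /** Invoked by LinkedHashMap after each insertion; returning true evicts the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

A join operator would consult such a cache before issuing an external lookup, so repeated keys within a short window never leave the process.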

Quick Recap

The core problem Delta Join solves is state explosion. In a traditional streaming join, Flink keeps both sides of the join in state, every customer record, every order, everything needed to match late-arriving data. A 10-million-row customer table with 30-day retention easily consumes gigabytes of state per TaskManager. Checkpoints slow down, recovery takes forever, and your cloud bill climbs.

Delta Join flips this model. Instead of keeping the data in Flink’s state, it leaves it in an external indexed store such as Apache Fluss (Incubating). When a stream record arrives, Flink performs an async lookup against Fluss rather than querying local state. The result? State drops by 99%+ for large dimensions.

CDC Upsert Support: The Game Changer

FLINK-38511 makes Delta Join usable for real-world database replication. This single enhancement transforms Delta Join from a niche optimization for INSERT-only tables into a practical solution for the most common streaming use case: bidirectional enrichment from CDC sources.

CDC sources define four types of row operations:

+I (INSERT) — New records added to the table
-UB / +UA (UPDATE_BEFORE / UPDATE_AFTER) — Modifications to existing records
-D (DELETE) — Records removed from the table

Flink 2.1’s Delta Join could only handle INSERT operations from an indexed lookup store like Fluss. The code changes in Flink 2.2 are deceptively simple but architecturally significant. In StreamingDeltaJoinOperator.java, the row kind validation evolved:

// Flink 2.1: INSERT-only
Preconditions.checkArgument(
    RowKind.INSERT == element.getValue().getRowKind(),
    "Delta join only supports INSERT records");

// Flink 2.2: INSERT + UPDATE_AFTER
Preconditions.checkArgument(
    RowKind.INSERT == element.getValue().getRowKind() ||
    RowKind.UPDATE_AFTER == element.getValue().getRowKind(),
    "Currently, delta join only supports INSERT and UPDATE_AFTER records");

What Delta Join now supports:

✅ INSERT (+I) — New dimension records
✅ UPDATE_AFTER (+UA) — Updated dimension values
❌ UPDATE_BEFORE (-UB) — Automatically dropped
❌ DELETE (-D) — Not supported (by fundamental design)

Why UPDATE_BEFORE is Automatically Dropped

You might wonder: if Delta Join supports UPDATE_AFTER, why not UPDATE_BEFORE? The answer lies in the stateless design:

No Join State to Retract: Delta Join doesn’t maintain a table of join results in state. When an UPDATE arrives, there’s no previous join result to retract; the operator simply performs a fresh lookup.
UPDATE_AFTER is Sufficient: When a dimension record updates from {id=1, name="Alice"} to {id=1, name="Alice2"}, the CDC stream produces -UB(id=1, name="Alice") followed by +UA(id=1, name="Alice2"). The +UA implicitly replaces the old value in the dimension table's indexed storage.
Downstream Idempotency: The enriched results flow to an idempotent sink (typically with a primary key), which naturally handles the update by overwriting the previous row.

Delta Join changelog propagation path in Flink 2.2: the source emits update-before and update-after events, the planner drops update-before records, the Delta Join performs async lookups and emits only update-after joined rows and the upsert sink overwrites the previous key value.

The StreamPhysicalDropUpdateBefore node is automatically inserted during query optimization. You never see it in your SQL, but it's there in the execution plan, silently removing unnecessary retractions.
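The propagation path described above can be sketched as a toy model in Python. This is not Flink code: the `dimension` dict stands in for the indexed store, `drop_update_before` for the planner-inserted node, and the `sink` dict for an upsert sink keyed by primary key. All function and field names are hypothetical.

```python
# Toy model of the Delta Join changelog path (not Flink code).
# Row kinds follow the article's notation: +I, -UB, +UA, -D.

def drop_update_before(changelog):
    """Mimic the planner-inserted step: discard -UB records."""
    return [(kind, row) for kind, row in changelog if kind != "-UB"]

def delta_join(changelog, dimension):
    """Stateless enrichment: one fresh lookup per +I / +UA record."""
    out = []
    for kind, order in changelog:
        if kind not in ("+I", "+UA"):
            raise ValueError(f"unsupported row kind: {kind}")
        customer = dimension.get(order["customer_id"])  # stands in for the async lookup
        if customer is not None:  # INNER JOIN: unmatched rows are skipped
            out.append((kind, {**order, "customer_name": customer["name"]}))
    return out

def apply_to_upsert_sink(sink, joined):
    """Upsert sink: the latest row per primary key overwrites the previous one."""
    for _kind, row in joined:
        sink[row["order_id"]] = row
    return sink

dimension = {1: {"name": "Alice"}}
changelog = [
    ("+I", {"order_id": 100, "customer_id": 1, "amount": 10}),
    ("-UB", {"order_id": 100, "customer_id": 1, "amount": 10}),
    ("+UA", {"order_id": 100, "customer_id": 1, "amount": 25}),
]

sink = apply_to_upsert_sink({}, delta_join(drop_update_before(changelog), dimension))
print(sink)
```

The sink ends up with a single row for order 100 carrying the updated amount, even though nothing was ever retracted.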

Why DELETE Still Isn’t Supported

This isn’t a TODO item, it’s a fundamental architectural constraint.

No State for Retraction: To properly handle DELETE, Delta Join would need to remember what it previously emitted so it can retract the correct join result. But maintaining this state defeats the entire purpose of Delta Join, and we’d be back to a regular stateful join with all its memory overhead.

The code in DeltaJoinUtil.java makes this explicit:

private static boolean onlyProduceInsertOrUpdateAfter(StreamPhysicalRel node) {
    ChangelogMode changelogMode = getChangelogMode(node);
    Set<RowKind> allKinds = changelogMode.getContainedKinds();
    return !allKinds.contains(RowKind.UPDATE_BEFORE)
        && !allKinds.contains(RowKind.DELETE);
}

This validation isn’t commented as “TODO: support DELETE”; it’s a design invariant for now.

The standard workaround is soft deletes. When a customer is “deleted,” CDC produces +UA(id=123, deleted=true), which Delta Join handles perfectly. Downstream analytics can filter on the deleted flag, and you maintain a complete audit trail.

CDC support makes Delta Join usable. But Flink 2.2 also makes it fast. Two complementary enhancements, built-in caching and filter/projection support, combine to deliver what is, speculatively, a 2-5x performance improvement over the 2.1 baseline.

Built-in Caching Layer (FLINK-38495)

Every external lookup has cost: network round-trip, storage I/O, serialization overhead. In Flink 2.1, every stream record triggered a lookup, even if the same dimension key was looked up milliseconds earlier. For skewed workloads (hot customers, popular products), this meant repeatedly fetching the same data.

Flink 2.2 introduces an integrated LRU cache that sits between the Delta Join operator and external storage. The implementation in DeltaJoinCache.java uses Guava Cache with a two-level structure:

// Dual-sided caching (one for left lookups, one for right)
private final Cache<RowData, LinkedHashMap<RowData, Object>> leftCache;
private final Cache<RowData, LinkedHashMap<RowData, Object>> rightCache;

Why this structure?

Outer Cache: Maps join key → inner map (O(1) lookup by join key)
Inner LinkedHashMap: Maps upsert key → row data (handles multiple rows per join key)
LRU Eviction: Automatically manages memory when cache reaches configured size

The inner map is crucial for handling dimension updates. When customer_id=123 updates, the cache upserts the new value in-place for that upsert key, ensuring lookups always see the latest data until LRU eviction.
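To make the two-level shape concrete, here is a minimal Python sketch of the same idea: an LRU-ordered outer map from join key to an inner map from upsert key to row. It is an illustration only, not the Guava-based implementation, and the class and method names are assumptions.

```python
from collections import OrderedDict

class TwoLevelLruCache:
    """Join key -> {upsert key -> row}, with LRU eviction on join keys."""

    def __init__(self, max_join_keys):
        self.max_join_keys = max_join_keys
        self._data = OrderedDict()
        self.requests = 0
        self.hits = 0

    def get(self, join_key):
        self.requests += 1
        rows = self._data.get(join_key)
        if rows is None:
            return None
        self.hits += 1
        self._data.move_to_end(join_key)  # mark as most recently used
        return list(rows.values())

    def put(self, join_key, rows_by_upsert_key):
        self._data[join_key] = dict(rows_by_upsert_key)
        self._data.move_to_end(join_key)
        while len(self._data) > self.max_join_keys:
            self._data.popitem(last=False)  # evict the least recently used join key

    def upsert(self, join_key, upsert_key, row):
        """Refresh one row in place if its join key is already cached."""
        rows = self._data.get(join_key)
        if rows is not None:
            rows[upsert_key] = row

    @property
    def hit_rate(self):
        return self.hits / self.requests if self.requests else 0.0

cache = TwoLevelLruCache(max_join_keys=2)
cache.put("c1", {"row-a": {"name": "Alice"}})
cache.put("c2", {"row-b": {"name": "Bob"}})
cache.upsert("c1", "row-a", {"name": "Alice2"})  # dimension update, applied in place
first = cache.get("c1")                          # hit, sees the updated row
cache.put("c3", {"row-c": {"name": "Carol"}})    # evicts c2, the least recently used
second = cache.get("c2")                         # miss
print(first, second, cache.hit_rate)
```

The in-place upsert is what keeps a cached join key from serving a stale row between the update and its eventual eviction.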

Configuration:

table.exec.delta-join.cache-enabled: true      # Default
table.exec.delta-join.left.cache-size: 10000   # Max cached join keys
table.exec.delta-join.right.cache-size: 10000

Metrics to Monitor:

deltaJoin.leftCache.hitRate          # Target: >80%
deltaJoin.leftCache.requestCount     # Total lookups attempted
deltaJoin.leftCache.hitCount         # Successful cache hits
deltaJoin.leftCache.keySize          # Distinct join keys cached
deltaJoin.leftCache.totalNonEmptyValues  # Total cached rows

Watch the hit rate. If it’s below 70%, either increase cache size or investigate data skew (if lookups are uniformly distributed across millions of keys, even a large cache won’t help and that’s a sign Delta Join might not be the right fit).

Filter and Projection Support (FLINK-38556)

Projections genuinely reduce data transfer: Delta Join fetches only the columns you specify from Fluss. Filters work differently: they’re applied in Flink after the lookup returns data. The rows still get fetched from Fluss; they just get filtered before joining.

Projection (the real win)

SELECT o.*, c.name, c.tier
FROM orders o
JOIN (
  SELECT id, name, tier FROM customers  -- 3 columns instead of 20
) c ON o.customer_id = c.id;

The projectionOnTemporalTable field in DeltaJoinSpec specifies which columns to fetch. Fluss returns only the requested columns. For wide dimension tables, this cuts network I/O and cache memory by 50–80%.

Filters (applied post-lookup)

Flink 2.2 allows exactly one Calc node between the source scan and Delta Join. Filters extracted from this node are applied in Flink after the lookup via a generated FlatMapFunction in AsyncDeltaJoinRunner. Data still gets fetched from Fluss; filtering just reduces rows before joining.

For CDC sources, filters must reference only upsert key columns.

Why this restriction? Delta Join can emit duplicates because CDC updates on either side trigger fresh lookups against the opposite table and any in-flight async operations get replayed after checkpoint recovery. For upsert key columns (which determine record identity), values are stable, so filtering on them is safe. For non-upsert-key columns, values can change between duplicate emissions, leading to inconsistent results:

Figure: comparison of upsert-key filters vs non-upsert-key filters. Left: the filter uses the upsert key. A row update from US to UK still matches, so the UB and UA changelog events are emitted and the update stays visible. Right: the filter uses a non-upsert key (country = ‘US’). After the update, the new value no longer matches, so the events are not emitted and the update can disappear.

For INSERT-only sources, you can filter on any column: no updates means no inconsistency.
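The upsert-key restriction can be illustrated with a small Python toy. It assumes a stateless filter followed by an upsert sink keyed by `id`; the `run` helper, the predicates, and the sample rows are hypothetical and only mirror the US to UK update discussed above.

```python
def run(changelog, predicate):
    """Apply a stateless filter, then upsert surviving rows by primary key."""
    sink = {}
    for kind, row in changelog:
        if kind in ("+I", "+UA") and predicate(row):
            sink[row["id"]] = row
    return sink

changelog = [
    ("+I", {"id": 1, "country": "US"}),
    ("+UA", {"id": 1, "country": "UK"}),
]

# Filter on the upsert key: the identity never changes, so the update passes.
by_upsert_key = run(changelog, lambda r: r["id"] >= 1)

# Filter on a mutable column: the +UA no longer matches and is silently lost.
by_other_column = run(changelog, lambda r: r["country"] == "US")

print(by_upsert_key)    # {1: {'id': 1, 'country': 'UK'}}
print(by_other_column)  # {1: {'id': 1, 'country': 'US'}}
```

In the second case the sink keeps the stale US row, because a stateless operator has no way to emit a retraction for it.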

All filters must be deterministic. RAND(), NOW(), CURRENT_TIMESTAMP() are rejected because duplicate emissions would produce different results.

There’s a separate concept that can cause confusion: some connectors support FilterPushDownSpec, which actually pushes filters to the storage layer. This is orthogonal to FLINK-38556. When a connector like Fluss supports this capability, Delta Join validates that those pushed filters also follow the upsert key restriction for CDC sources.

Correctness and changelog requirements on upstream data sources go beyond Delta Join and open up a larger conversation about general join semantics. That’s outside our scope here, but FLINK-38579 provides a good entry point.

Caching reduces lookup frequency by 80–90%. Projection reduces data transfer by 50–80%. Filters reduce rows processed in Flink (but don’t reduce data fetched from Fluss unless the connector supports FilterPushDownSpec). Together, these make Delta Join significantly faster than 2.1.

Design Tradeoffs and Limitations

Delta Join’s performance comes with constraints. Understanding these limitations, and why they exist, is critical for deciding when to use this optimization.

Indexing

Delta Join will not apply without proper indexing. The validation logic in DeltaJoinUtil.java checks that the join keys completely cover at least one index, or a prefix of one, defined in Fluss.

Delta Join performs thousands of lookups per second against external storage. Without indexes, each lookup becomes a full table scan, turning an O(1) indexed lookup into an O(n) sequential scan. For a million-row dimension table, that’s the difference between 1ms and 1000ms per lookup.

Only INNER JOIN

This isn’t a limitation to be worked around; it’s an architectural invariant. LEFT, RIGHT, and FULL OUTER joins are fundamentally incompatible with Delta Join’s stateless design.

The problem with LEFT JOIN:

T0: Order for customer_id=999 arrives
    → Async lookup → customer 999 not found in dimension table
    → LEFT JOIN semantics: emit (order_data, NULL, NULL, NULL)

T5: Customer 999 is added to dimension table
    → Should we now emit (order_data, customer_name, customer_tier, ...)?
    → But how do we retract the previous (order_data, NULL, NULL, NULL)?
Fundamental issue: No state tracking what was emitted previously

Delta Join doesn’t maintain join results in state (that’s the whole point of avoiding state!). For INNER JOIN, unmatched rows are simply skipped, so there is no NULL emission and no retraction needed. For LEFT JOIN, unmatched rows must be emitted with NULL padding, and if that dimension record appears later, we’d need to retract the NULL emission and emit the matched result. This requires state.

From DeltaJoinUtil.java:

public static boolean isJoinTypeSupported(FlinkJoinType flinkJoinType) {
    return FlinkJoinType.INNER == flinkJoinType;
}

If you need LEFT JOIN semantics, use a regular stateful join; the state cost is unavoidable for correctness.

The Idempotency

Perhaps the most critical constraint: downstream operators must handle duplicate records idempotently. I mentioned this in an earlier section.

If downstream can’t handle seeing (order_id=100, customer_name='Alice') twice, the pipeline breaks.

Bad: Append-only sinks

CREATE TABLE kafka_results (
  order_id INT,
  customer_name STRING
) WITH ('connector' = 'kafka', 'format' = 'json');

-- Kafka consumers will see duplicate records!
INSERT INTO kafka_results
SELECT o.id, c.name FROM orders o
JOIN customers c ON o.customer_id = c.id;

Good: Upsert sinks

CREATE TABLE kafka_results (
  order_id INT PRIMARY KEY NOT ENFORCED,
  customer_name STRING
) WITH (
  'connector' = 'upsert-kafka',
  'key.format' = 'json',
  'value.format' = 'json'
);

-- Duplicates merged by primary key → idempotent
INSERT INTO kafka_results
SELECT o.id, c.name FROM orders o
JOIN customers c ON o.customer_id = c.id;

What qualifies as idempotent:

Upsert sinks (Fluss KV table, Kafka with upsert mode, JDBC with primary key, lake tables, StarRocks PK tables)
Stateless transformations (map, flatMap without side effects)
Any other sink that deduplicates by key

If your downstream processing is append-only or has side effects (incrementing counters, triggering alerts), Delta Join will cause correctness issues.
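The difference between the two sink types comes down to what happens when the same joined row arrives twice. A minimal Python sketch, with hypothetical helper names, of an append-only sink versus a keyed upsert sink:

```python
def append_sink(records):
    """Append-only sink: every emission is kept, duplicates included."""
    return list(records)

def upsert_sink(records, key):
    """Upsert sink: the last emission per key wins, so replays are harmless."""
    table = {}
    for record in records:
        table[record[key]] = record
    return table

# The same joined row emitted twice, for example after a replay on recovery.
emissions = [
    {"order_id": 100, "customer_name": "Alice"},
    {"order_id": 100, "customer_name": "Alice"},
]

appended = append_sink(emissions)
upserted = upsert_sink(emissions, key="order_id")
print(len(appended), len(upserted))  # 2 1
```

Any consumer counting rows in the append-only result would double-count order 100, while the keyed table is unaffected by the replay.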

Filters limitations

We covered this in the filter section, but it’s worth emphasizing: for CDC sources, filters must reference only upsert key columns, otherwise Flink falls back to a regular stateful join.

How to choose now?

Comparison matrix

Decision checklist for using Delta Join. If the dimension is indexed and large, hard deletes are not required, no outer join is needed, downstream is idempotent, and ultra-low latency is not mandatory, then Delta Join is a suitable option to consider.

Closing Thoughts

Apache Flink 2.2 transforms Delta Join from a promising optimization into a viable solution for real-world streaming pipelines.

Relaxed gotcha to start thinking about The Delta Join + Fluss

Delta Join isn’t universal. INNER JOIN only, no DELETE support, mandatory indexes and the downstream idempotency requirement aren’t limitations to be worked around. They’re fundamental architectural tradeoffs. The optimization trades flexibility (no outer joins, no hard deletes) for efficiency (99% state reduction, 10x throughput).

For organizations processing millions of streaming events against large reference data, Flink 2.2’s Delta Join can reduce state by 99%, cut infrastructure costs significantly, and improve throughput by an order of magnitude, all while handling real-world CDC patterns that were impossible in 2.1.

What the Fuss with Fluss: Flink 2.2 Delta Force was originally published in fresha-data-engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Adding Authentication to Elixir Native Kafka Client: Trust, but Verify

Anton Borisov — Mon, 15 Dec 2025 17:03:37 GMT

There was an open issue on the kafka_ex GitHub repository from 2017 asking for SASL authentication support. Eight years later, I closed it with my implementation.

At Fresha, we run kafka_ex in production. It does the job. But some managed Kafka providers require authentication and that missing feature had been sitting there for years. I got curious: what would it take to add SASL to an existing Kafka client? How does the protocol actually work? How hard could it be?

I decided to find out. Eight years is a long time for a feature request to sit open, so there was bound to be something to learn, in my favourite DIY way.

This is what I built. PLAIN and SCRAM mechanisms, the wire protocol details, the design decisions that make it extensible. Elixir might not be your language, but the code is straightforward and the concepts are universal.

Note: Medium code blocks don’t seem to support Elixir highlighting, so I use Ruby highlighting as a workaround. I hope it won’t be confusing.

Let’s start with what we’re actually implementing.

What Is SASL?

SASL (Simple Authentication and Security Layer) is a framework that separates authentication mechanisms from application protocols. Kafka adopted it to support multiple auth methods through one interface.

When a client connects to a SASL-enabled broker, authentication happens immediately, before any produce or consume operations:

Client opens TCP connection (usually with TLS)
Client queries broker’s API versions (optional but recommended)
Client sends SASL handshake, proposing a mechanism
Broker accepts or rejects
Client and broker exchange mechanism-specific messages
Authentication succeeds or fails
Normal Kafka operations begin

SASL flow in kafka_ex: TCP connect, optional TLS, ApiVersions negotiation, SASL handshake (pick mechanism), then one or more SaslAuthenticate request/response steps (e.g., SCRAM) before normal Kafka operations.

I implemented two mechanisms, with the architecture ready for more:

PLAIN is the simplest: send username and password in a single message. It must run over TLS, because sending credentials in cleartext is obviously a bad idea. Despite its simplicity, PLAIN is widely supported and often sufficient for internal services.

SCRAM (Salted Challenge Response Authentication Mechanism) is more sophisticated. Your password never crosses the wire. Instead, client and server exchange cryptographic proofs:

Client sends a random nonce
Server responds with its own nonce, a salt and iteration count
Client derives proofs from the password and sends them
Server validates and sends its own proof
Client validates server’s proof (mutual authentication)

Capturing the exchange doesn’t help an attacker replay it, because nonces are unique. Offline brute-force is possible, but PBKDF2 makes it expensive. Kafka supports SCRAM-SHA-256 and SCRAM-SHA-512.

SCRAM over Kafka SaslAuthenticate: client-first → server-first (nonce, salt, iterations) → client-final (proof) → server-final (server signature or error). On success, SASL completes and the connection can use normal Kafka APIs.

OAUTHBEARER (open PR) uses JWT tokens from an OAuth provider. You supply a token provider function, the library calls it on each connection.

MSK IAM (planned) would use AWS credentials for Amazon MSK. Same pattern: a different credential source, the same SASL framework.

I wanted to add authentication without rewriting existing code. More importantly, I wanted adding new mechanisms to be trivial.

Authentication flow: KafkaEx.Socket triggers SASL orchestrator, which picks a mechanism module and passes it a send_fun closure. The mechanism builds payloads without knowing about wire protocol, send_fun handles encoding via Codec and I/O operations underneath.

Every SASL mechanism does two things: announce its name and perform authentication. In Elixir, we express this contract with a behaviour:

defmodule KafkaEx.Auth.Mechanism do
  @type auth_opts :: KafkaEx.Auth.Config.t()
  @type send_fun :: (binary() -> {:ok, binary() | nil} | {:error, term()})
  @callback mechanism_name(auth_opts()) :: String.t()
  @callback authenticate(auth_opts(), send_fun()) :: :ok | {:error, term()}
end

The send_fun abstraction is key. Mechanism modules don't know about sockets or Kafka framing: they send bytes, receive responses. This makes testing straightforward and keeps responsibilities separate.
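To show why this seam makes testing easy, here is a Python rendering of the same contract with a fake `send_fun`. It is a sketch of the idea, not kafka_ex code: the tuple-based return values and the helper names are assumptions, and the payload layout is the one defined by RFC 4616 for PLAIN.

```python
from typing import Callable, Optional, Tuple

# Hypothetical Python rendering of the send_fun contract:
# bytes in, ("ok", response) or ("error", reason) out.
SendFun = Callable[[bytes], Tuple[str, Optional[bytes]]]

def plain_authenticate(username: str, password: str, send_fun: SendFun) -> str:
    """Build the RFC 4616 payload: [authzid] NUL authcid NUL passwd."""
    payload = b"\x00" + username.encode("utf-8") + b"\x00" + password.encode("utf-8")
    status, _response = send_fun(payload)
    return status

def make_fake_send_fun(expected_payload: bytes):
    """Test double standing in for the socket and the Kafka framing."""
    sent = []

    def send_fun(payload: bytes):
        sent.append(payload)
        if payload == expected_payload:
            return ("ok", b"")
        return ("error", b"authentication failed")

    return send_fun, sent

send_fun, sent = make_fake_send_fun(b"\x00test\x00secret")
result = plain_authenticate("test", "secret", send_fun)
print(result, sent)
```

The mechanism never touches a socket, so the test can assert on the exact bytes it produced.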

Adding a new mechanism means implementing these two functions and registering in a map:

@mechanisms %{
  plain: KafkaEx.Auth.SASL.Plain,
  scram: KafkaEx.Auth.SASL.Scram
  # oauthbearer: KafkaEx.Auth.SASL.OAuthBearer  <- open PR
  # msk_iam: KafkaEx.Auth.SASL.MskIam           <- future
}

Authentication config needs validation up front:

defmodule KafkaEx.Auth.Config do
  @enforce_keys [:mechanism, :username, :password]

  defstruct mechanism: :plain,
            username: nil,
            password: nil,
            mechanism_opts: %{}
  
  @type t :: %__MODULE__{
            mechanism: :plain | :scram,
            username: String.t(),
            password: String.t(),
            mechanism_opts: map()
          }
end

The mechanism_opts map handles mechanism-specific settings: algorithm choice for SCRAM, token provider for OAUTHBEARER. Validation happens at construction:

defp validate_config(%{mechanism: :plain} = cfg) do
  unless cfg[:username] && cfg[:password] do
    raise ArgumentError, "PLAIN requires username and password"
  end
  cfg
end

defp validate_config(%{mechanism: :scram} = cfg) do
  unless cfg[:username] && cfg[:password] do
    raise ArgumentError, "SCRAM requires username and password"
  end
  cfg
end

Kafka uses length-prefixed binary messages. Authentication needs three request types: API Versions, SASL Handshake and SASL Authenticate.

Each request has a header with API key, version, correlation ID and client ID:

defmodule KafkaEx.Auth.SASL.CodecBinary do
  @sasl_handshake_key 17
  @sasl_authenticate_key 36

  def handshake_request(mechanism, corr, ver, client_id \\ "kafka_ex") do
    <<
      @sasl_handshake_key::16,
      ver::16,
      corr::32,
      byte_size(client_id)::16,
      client_id::binary,
      byte_size(mechanism)::16,
      mechanism::binary
    >>
  end
  def authenticate_request(auth_bytes, corr, ver, client_id \\ "kafka_ex") do
    <<
      @sasl_authenticate_key::16,
      ver::16,
      corr::32,
      byte_size(client_id)::16,
      client_id::binary,
      byte_size(auth_bytes)::32,
      auth_bytes::binary
    >>
  end
end

Elixir’s binary syntax makes this readable: <<ver::16>> means "encode ver as a 16-bit integer". Everything is explicitly sized, so there is no parsing ambiguity.
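For readers who don't write Elixir, the same handshake frame can be sketched with Python's `struct` module. The field order mirrors the handshake request shown above (API key 17, version, correlation ID, then length-prefixed client ID and mechanism name); the Python function itself is only an illustration, not part of kafka_ex.

```python
import struct

SASL_HANDSHAKE_KEY = 17

def handshake_request(mechanism: str, corr: int, ver: int, client_id: str = "kafka_ex") -> bytes:
    client = client_id.encode("utf-8")
    mech = mechanism.encode("utf-8")
    return (
        # Big-endian: 16-bit API key, 16-bit version, 32-bit correlation ID.
        struct.pack(">hhi", SASL_HANDSHAKE_KEY, ver, corr)
        # 16-bit length prefix followed by the client ID bytes.
        + struct.pack(">h", len(client)) + client
        # 16-bit length prefix followed by the mechanism name bytes.
        + struct.pack(">h", len(mech)) + mech
    )

frame = handshake_request("PLAIN", corr=1, ver=1)
print(frame.hex())
```

Every field has a fixed width or an explicit length prefix, which is what lets the broker parse the frame without any delimiters.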

Responses need careful validation:

def parse_handshake_response(<<corr::32, rest::binary>>, expected_corr, mechanism, ver) do
  if corr != expected_corr do
    {:error, :correlation_mismatch}
  else
    parse_handshake_body(rest, mechanism, ver)
  end
end

defp parse_handshake_body(<<0::16, _rest::binary>>, _mechanism, 0), do: :ok
defp parse_handshake_body(<<0::16, rest::binary>>, mechanism, 1) do
  mechanisms = parse_mechanism_list(rest)
  if mechanism in mechanisms, do: :ok, else: {:error, {:unsupported_mechanism, mechanism}}
end
defp parse_handshake_body(<<err::16, _rest::binary>>, _, _) do
  {:error, {:handshake_failed, error_atom(err)}}
end

Correlation mismatch means request/response pairing broke. Error codes translate to atoms that mean something when debugging.

PLAIN

PLAIN is almost trivial:

defmodule KafkaEx.Auth.SASL.Plain do
  @behaviour KafkaEx.Auth.Mechanism

  @impl true
  def mechanism_name(_), do: "PLAIN"
  @impl true
  def authenticate(%Config{username: user, password: pass}, send_fun) do
    # RFC 4616: [authzid] NUL authcid NUL passwd
    payload = <<0, user::binary, 0, pass::binary>>
    
    case send_fun.(payload) do
      {:ok, _} -> :ok
      error -> error
    end
  end
end

One message, three fields separated by null bytes: authorization identity (empty), authentication identity (username), password. Send it, you’re done.

The simplicity is deceptive, though: PLAIN requires TLS. Without encryption, you’re sending credentials in cleartext. The version support module enforces this:

def validate_config(%Config{mechanism: :plain}, socket) do
  if socket.ssl do
    :ok
  else
    {:error, :plain_requires_tls}
  end
end

SCRAM

SCRAM is more involved. It’s a stateful protocol: each message depends on previous exchanges.

SCRAM flow simplified. Client sends nonce, server responds with salt and iterations. Client derives keys from password (never sent), builds proof via XOR trick, sends it. Server reverses XOR to verify client knows the password. Some nonce details omitted, see RFC 5802 for full spec.

So I modeled the required state as a struct that accumulates values through the exchange:

defmodule KafkaEx.Auth.ScramFlow.Internal do
  defstruct [
    :algorithm,
    :username,
    :password,
    :client_nonce,
    :client_first_bare,
    :server_first_raw,
    :server_nonce,
    :salt,
    :iterations,
    :auth_message,
    :server_signature
  ]
end

The flow threads state through each step:

def authenticate(username, password, algo, send_fun) do
  st0 = %Internal{
    algorithm: algo, 
    username: username, 
    password: password, 
    client_nonce: nonce()
  }
  
  {client_first, st1} = Internal.client_first(st0)

  with {:ok, server_first} <- send_fun.(client_first),
       {:ok, st2} <- Internal.handle_server_first(st1, server_first),
       {client_final, st3} <- Internal.client_final(st2),
       {:ok, server_final} <- send_fun.(client_final) do
    Internal.verify_server_final(st3, server_final)
  end
end

defp nonce(len \\ 24), do: len |> :crypto.strong_rand_bytes() |> Base.encode64(padding: false)

Pure functional composition. Each step succeeds with new state or fails with a reason.

We establish identity and randomness:

def client_first(%__MODULE__{} = s) do
  gs2 = "n,,"  # No channel binding
  cfb = "n=#{escape(s.username)},r=#{s.client_nonce}"
  {gs2 <> cfb, %{s | client_first_bare: cfb}}
end

defp escape(str) do
  str
  |> String.replace("=", "=3D")
  |> String.replace(",", "=2C")
end

Username escaping follows RFC 5802.
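The order of the two replacements matters, which a short Python equivalent makes visible. The function name is hypothetical; the escape sequences are the ones used in the code above.

```python
def escape_username(name: str) -> str:
    # '=' must be replaced first. If ',' were handled first, the '=' that
    # "=2C" introduces would itself be escaped again in the second pass.
    return name.replace("=", "=3D").replace(",", "=2C")

def escape_username_wrong_order(name: str) -> str:
    # Shown only to demonstrate the failure mode of the reversed order.
    return name.replace(",", "=2C").replace("=", "=3D")

print(escape_username("a=b,c"))              # a=3Db=2Cc
print(escape_username_wrong_order("a=b,c"))  # a=3Db=3D2Cc
```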

The server sends its nonce (appended to ours), a salt and iteration count:

def handle_server_first(%__MODULE__{} = s, server_first) do
  with %{"r" => nonce, "s" => salt_b64, "i" => iter_str} <- parse_kv(server_first),
       true <- String.starts_with?(nonce, s.client_nonce) or {:error, :invalid_server_nonce} do
    {:ok, %{s | 
      server_first_raw: server_first,
      server_nonce: nonce,
      salt: Base.decode64!(salt_b64),
      iterations: String.to_integer(iter_str)
    }}
  end
end

Nonce validation is critical: the server’s nonce must start with ours. This prevents replay attacks.
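The parsing and the nonce check can be sketched in Python as follows. The `parse_kv` helper is not shown in the article, so this version is an assumption about its behaviour: split on commas, then split each pair on the first `=` only, since base64 values may end in `=`. The sample nonce and salt are made up.

```python
import base64

def parse_kv(message: str) -> dict:
    """Split 'k=v,k=v' on commas, and each pair on the first '=' only."""
    return dict(part.split("=", 1) for part in message.split(","))

def handle_server_first(client_nonce: str, server_first: str):
    fields = parse_kv(server_first)
    nonce = fields["r"]
    # The combined nonce must extend the one this client generated.
    if not nonce.startswith(client_nonce):
        return ("error", "invalid_server_nonce")
    return ("ok", {
        "server_nonce": nonce,
        "salt": base64.b64decode(fields["s"]),
        "iterations": int(fields["i"]),
    })

good = handle_server_first("abc123", "r=abc123XYZ,s=c2FsdA==,i=4096")
bad = handle_server_first("abc123", "r=zzz999XYZ,s=c2FsdA==,i=4096")
print(good)
print(bad)
```

A response built for some other client nonce is rejected before any key derivation happens.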

Here’s where security happens:

def client_final(%__MODULE__{} = s) do
  salted = pbkdf2(s.algorithm, s.password, s.salt, s.iterations)
  
  client_key = hmac(s.algorithm, salted, "Client Key")
  server_key = hmac(s.algorithm, salted, "Server Key")
  stored_key = hash(s.algorithm, client_key)
  
  cfwp = "c=#{Base.encode64("n,,")},r=#{s.server_nonce}"
  auth_message = "#{s.client_first_bare},#{s.server_first_raw},#{cfwp}"
  
  client_sig = hmac(s.algorithm, stored_key, auth_message)
  proof = :crypto.exor(client_key, client_sig)
  server_sig = hmac(s.algorithm, server_key, auth_message)
  
  final_message = "#{cfwp},p=#{Base.encode64(proof)}"
  
  {final_message, %{s | auth_message: auth_message, server_signature: server_sig}}
end

We send proof that we know the password, not the password itself. The server verifies it against stored data without ever seeing plaintext credentials.
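The XOR step is easier to trust once both sides are on the page. The Python sketch below derives the proof the same way as the code above, then shows a server-side check that holds only the stored key, following the RFC 5802 construction. Function names, the sample password, salt, and auth message are all made up for illustration.

```python
import hashlib
import hmac

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def client_proof(password: bytes, salt: bytes, iterations: int,
                 auth_message: bytes, algo: str = "sha256"):
    salted = hashlib.pbkdf2_hmac(algo, password, salt, iterations)
    client_key = hmac.new(salted, b"Client Key", algo).digest()
    stored_key = hashlib.new(algo, client_key).digest()
    client_sig = hmac.new(stored_key, auth_message, algo).digest()
    return xor_bytes(client_key, client_sig), stored_key

def server_verifies(proof: bytes, stored_key: bytes,
                    auth_message: bytes, algo: str = "sha256") -> bool:
    """The server holds only stored_key. It recomputes the signature, undoes
    the XOR to recover client_key, and checks that its hash matches."""
    client_sig = hmac.new(stored_key, auth_message, algo).digest()
    recovered_client_key = xor_bytes(proof, client_sig)
    candidate = hashlib.new(algo, recovered_client_key).digest()
    return hmac.compare_digest(candidate, stored_key)

# Made-up exchange values, only to exercise the functions.
auth_message = b"n=user,r=abc,r=abcXYZ,s=c2FsdA==,i=4096,c=biws,r=abcXYZ"
proof, stored_key = client_proof(b"secret", b"salt", 4096, auth_message)
print(server_verifies(proof, stored_key, auth_message))
```

Because the signature is bound to the auth message, which contains both nonces, a proof captured from one session does not verify in another.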

SCRAM is mutual, so we verify the server too:

def verify_server_final(%__MODULE__{} = s, server_final) do
  case parse_kv(server_final) do
    %{"v" => sig_b64} ->
      if Base.decode64!(sig_b64) == s.server_signature do
        :ok
      else
        {:error, :server_signature_mismatch}
      end
    
    %{"e" => error} ->
      {:error, {:server_error, error}}
  end
end

This catches man-in-the-middle attacks where someone impersonates the broker.

The public interface is clean:

defmodule KafkaEx.Auth.SASL.Scram do
  @behaviour KafkaEx.Auth.Mechanism

  @impl true
  def mechanism_name(%Config{mechanism_opts: %{algo: :sha512}}), do: "SCRAM-SHA-512"
  def mechanism_name(%Config{}), do: "SCRAM-SHA-256"
  @impl true
  def authenticate(%Config{username: u, password: p, mechanism_opts: opts}, send_fun) do
    algo = if opts[:algo] == :sha512, do: :sha512, else: :sha256
    ScramFlow.authenticate(u, p, algo, send_fun)
  end
end

Same behaviour, same interface. The complexity stays isolated in ScramFlow.

The SASL module coordinates the pieces:

defmodule KafkaEx.Auth.SASL do
  @mechanisms %{
    plain: KafkaEx.Auth.SASL.Plain,
    scram: KafkaEx.Auth.SASL.Scram
  }

  def authenticate(socket, %Config{} = creds) do
    with {:ok, mech_mod} <- get_mechanism_module(creds),
         api_versions <- fetch_api_versions_if_needed(socket),
         handshake_v <- CodecBinary.pick_handshake_version(api_versions),
         auth_v <- api_versions |> CodecBinary.pick_authenticate_version() |> min(1),
         :ok <- perform_handshake(socket, mech_mod, handshake_v, creds),
         :ok <- mech_mod.authenticate(creds, &send_authenticate(socket, &1, auth_v)) do
      Logger.debug("SASL authentication successful")
      :ok
    else
      {:error, reason} = err ->
        Logger.error("SASL authentication failed: #{inspect(reason)}")
        err
    end
  end
end

The with chain reads like a checklist: get module, fetch versions, handshake, authenticate.

Testing Against Real Kafka

Unit tests aren’t enough for authentication. You need real brokers.

Docker Compose with Confluent’s Kafka image:

services:
  kafka-1:
    image: confluentinc/cp-kafka:7.0.4
    ports:
      - "9092:9092"   # No auth
      - "9192:9192"   # SASL/PLAIN
      - "9292:9292"   # SASL/SCRAM
    environment:
      KAFKA_LISTENERS: "INTERNAL://:29092,NOAUTH://:9092,PLAIN://:9192,SCRAM://:9292"
      KAFKA_ADVERTISED_LISTENERS: "INTERNAL://kafka-1:29092,NOAUTH://localhost:9092,PLAIN://localhost:9192,SCRAM://localhost:9292"
    env_file: docker-compose-kafka.env

The env file configures JAAS:

KAFKA_SASL_ENABLED_MECHANISMS=PLAIN,SCRAM-SHA-256,SCRAM-SHA-512
KAFKA_LISTENER_NAME_plain_PLAIN_SASL_JAAS_CONFIG=org.apache.kafka.common.security.plain.PlainLoginModule required user_test="secret";
KAFKA_LISTENER_NAME_scram_SCRAM-SHA-256_SASL_JAAS_CONFIG=org.apache.kafka.common.security.scram.ScramLoginModule required;
KAFKA_LISTENER_NAME_scram_SCRAM-SHA-512_SASL_JAAS_CONFIG=org.apache.kafka.common.security.scram.ScramLoginModule required;

SCRAM users must be created in ZooKeeper (this Docker setup uses ZooKeeper):

docker exec kafka-1 kafka-configs --zookeeper zookeeper:32181 \
  --alter --add-config 'SCRAM-SHA-256=[password=secret],SCRAM-SHA-512=[password=secret]' \
  --entity-type users --entity-name test

Tests verify actual authentication:

defmodule KafkaEx.Integration.SaslAuthenticationTest do
  use ExUnit.Case
  @moduletag :integration

  describe "SASL/PLAIN authentication" do
    @tag sasl: :plain
    test "connects and produces with PLAIN" do
      opts = [
        uris: [{"localhost", 9192}],
        use_ssl: true,
        ssl_options: [verify: :verify_none],
        auth: KafkaEx.Auth.Config.new(%{
          mechanism: :plain,
          username: "test",
          password: "secret"
        })
      ]
      {:ok, _pid} = KafkaEx.create_worker(:plain_worker, opts)
      assert :ok = KafkaEx.produce("test_topic", 0, "test", worker_name: :plain_worker)
    end
  end

  describe "SASL/SCRAM authentication" do
    for algo <- [:sha256, :sha512] do
      @tag sasl: :scram, algo: algo
      test "connects with SCRAM-#{algo}" do
        worker_name = :"scram_#{unquote(algo)}_worker"
        opts = [
          uris: [{"localhost", 9292}],
          use_ssl: true,
          ssl_options: [verify: :verify_none],
          auth: KafkaEx.Auth.Config.new(%{
            mechanism: :scram,
            username: "test",
            password: "secret",
            mechanism_opts: %{algo: unquote(algo)}
          })
        ]
        {:ok, _pid} = KafkaEx.create_worker(worker_name, opts)
        assert :ok = KafkaEx.produce("test_topic", 0, "test", worker_name: worker_name)
      end
    end
  end
end

We don’t just test that authentication succeeds; we verify that Kafka operations work afterward.

Configuration is straightforward:

# PLAIN
config :kafka_ex,
  brokers: [{"kafka.example.com", 9192}],
  use_ssl: true,
  sasl: %{
    mechanism: :plain,
    username: System.get_env("KAFKA_USERNAME"),
    password: System.get_env("KAFKA_PASSWORD")
  }

# SCRAM
config :kafka_ex,
  brokers: [{"kafka.example.com", 9292}],
  use_ssl: true,
  sasl: %{
    mechanism: :scram,
    username: System.get_env("KAFKA_USERNAME"),
    password: System.get_env("KAFKA_PASSWORD"),
    mechanism_opts: %{algo: :sha256}
  }

Then use kafka_ex normally; authentication is invisible.

What’s Next

The nice thing about this shape is that new mechanisms don’t sprawl through the client. OAUTHBEARER is basically the same contract with a different payload (a JWT you fetch from a provider). MSK IAM would land the same way too: different credential source, same SASL seam.

Adding one is boring on purpose:

implement mechanism_name/1 and authenticate/2
validate whatever it needs in Config
register the module in the mechanism map
wire up a broker in Docker so you can prove it works end-to-end

Two lessons from doing this in a real codebase:

kafka_ex already has its own idioms. If you follow them, your changes look like they were always there and reviewers don’t have to “learn your framework” to approve auth.

And tests matter more than you want them to. The Docker setup took longer than writing PLAIN, but it’s what caught the real mistakes, the kind you never see in unit tests.

Also, fail loudly. Auth failures are not the place for “connection closed”. Returning tagged errors like {:error, :server_signature_mismatch} saves hours when something is misconfigured.

For Fresha and Beyond

We needed SASL to improve our Kafka setup. So I added it to kafka_ex in a way that doesn’t leak sockets into mechanisms and doesn’t turn auth into a special case: a small contract (Mechanism + send_fun), strict validation and tagged failures you can debug from logs.

PLAIN is deliberately simple (and deliberately TLS-only). SCRAM is the opposite: stateful, proof-based, mutual verification and still fits behind the same seam.

If you need SASL in kafka_ex, it’s there. If you’re implementing it elsewhere, the takeaway is simple:

Design the clean seam, test against a real broker and make insecure configurations impossible.

Appreciation

Thanks to Piotr Rybarczyk for the review; it helped keep the scope tight and avoid unnecessary integration points.

Adding Authentication to Elixir Native Kafka Client: Trust, but Verify was originally published in fresha-data-engineering on Medium.

StarRocks Incremental MV: A Bridge Over Shifting Ice

Anton Borisov — Wed, 26 Nov 2025 17:16:39 GMT

Note: The Incremental View Maintenance framework Phase 1 discussed here has been merged into StarRocks but is not yet released. This article explores the architecture and concepts based on the merged PRs, discussing what’s coming and why it matters conceptually, not what you can run in production today yet.

Incremental view maintenance is a very simple idea: when base data changes, update the view without recomputing everything. Threading deltas through joins, aggregations, and state is what makes it one of the harder problems in database engineering.

I’ve written about some of the conceptual underpinnings before in my streaming computation article, where I argued that if we can rely on determinism, we can use simpler models to track the flow of changes without recomputing everything or synchronizing state globally. The amount of work becomes proportional to the amount of change. Though that article was about streaming engines, the problem seems more general. I also explored why we need more efficient CDC behavior from Iceberg and how snapshot boundaries and metadata overhead shape what’s possible for incremental pipelines.

Now it’s time to put the bigger picture together. While all of that was interesting, it felt speculative: talk about a future that wasn’t quite here. Visionary and poetic, perhaps, but not yet practical.

But let’s see how it can be made practical, interleaved with that vision.

Materialized views in StarRocks are powerful. The query rewrite engine transparently routes queries through fresh MVs and the refresh scheduler handles maintenance automatically. But before this work, they suffered a major drawback: refresh was partition-based. If 10 records touched 10 partitions, you’d recalculate 10 partitions. The granularity was too coarse, the cost unpredictable, sometimes dangerous.

IVM is basically the idea of making the amount of work proportional to the amount of change in the input and threading that change through the computation graph. Feldera is one of my favorite examples: its DBSP model enforces a synchronous clock where time behaves like database transactions, with determinism guaranteeing that the same inputs always produce the same outputs. StarRocks borrows conceptually from this model.

The determinism requirement isn’t incidental. As Jack Vanlightly recently detailed in the context of durable execution: deterministic control flow means recovery is replay from durable inputs, not restoration from checkpointed state. Streaming IVM relies on the same foundation conceptually: deterministic operators over versioned deltas.

I should also mention that Snowflake’s Dynamic Tables provide incremental maintenance too, using automatic refresh mechanisms that process only changed data when the query supports it. But as they are closed source, I won’t discuss them here and will leave the implementation details for interested readers to reason about.

The first part of this effort is append-only Iceberg-based MVs that can be maintained incrementally instead of refreshing whole partitions. Combined with StarRocks’ MV rewrite, this forms a powerful synergy: fast refreshes save compute costs with transparent fallback to full calculation if the MV isn’t fresh enough. The more efficient this maintenance becomes, the closer it can be pushed toward real-time recalculation.

But we need better Iceberg CDC. If every delta requires reading through a chain of manifests and metadata files, that overhead exists outside the engine: it’s a format-level limitation. That’s why Iceberg V4’s proposals matter: the Root Manifest concept would concentrate change detection at a single point, making delta discovery a metadata operation rather than a manifest-walking exercise. These foundational pieces in the format are the prerequisite for incremental computation that actually scales.

After reading the DBSP paper, I expressed the idea that this approach is so powerful that database engines should eventually support it natively. Now StarRocks is pushing in this direction, so to me it feels like a natural progression. And I’m very excited to explore it with you.

Previous Model: PCT

Before diving into the new architecture, it’s worth understanding what came before.

StarRocks’ existing approach is called PCT, short for Partition Change Tracking. The MaterializedView.RefreshMode enum captures the options:

public enum RefreshMode {
    AUTO,
    PCT,          // Partition-based refresh mode (partition-change-tracking)
    FULL,         // Full refresh mode
    INCREMENTAL;  // Incremental refresh mode - the new IVM approach
}

PCT tracks changes at the partition level of base tables to enable efficient incremental processing. When a base table’s partition is updated, StarRocks identifies which MV partitions are affected and refreshes only those via INSERT OVERWRITE. This is smarter than full refresh, but the granularity is still partitions.

The shortcomings compound in practice. If your base table isn’t partitioned, you can’t create a partitioned MV, so there’s nothing to track incrementally. MV partitions must align strictly with base table partitions: if your base uses date_trunc partitioning, your MV must follow the same expression. And when MVs contain many group-by keys over large datasets, partition-level refresh can OOM because you're still reprocessing entire partition contents.

PCT was a reasonable solution within its constraints. But those constraints reflect a partition-centric worldview: put simply, changes happen to partitions, not to rows. The new IVM framework inverts this: changes are row-level deltas that propagate through the computation graph.

PCT vs IVM refresh granularity: when two rows touch two partitions, PCT overwrites both entire partitions. IVM appends only the two changed rows, preserving existing data untouched.

Time-Varying Relations and Computational Algebra

The conceptual heart of the new framework is the Time-Varying Relation (TVR). Looking at the implementation, TVR represents a logical relation that evolves over time, supporting versioned and temporal queries.

Every base table tracked by an MV maintains a version range: where it was, where it is now. The delta between these versions defines exactly what changed. TvrTableSnapshot captures a point-in-time state, while TvrTableDelta represents the difference between two snapshots.

TVR version tracking conceptually: the MV records its last-refreshed snapshot (snap_2), compares against the base table’s current snapshot (snap_3), and computes a delta that identifies only the new files. The engine scans c.parquet, ignoring unchanged data.

The MVIVMBasedMVRefreshProcessor orchestrates this. When refresh triggers, it first collects change snapshots by comparing each base table's current version against what was recorded at the last refresh:

public ProcessExecPlan getProcessExecPlan(TaskRunContext taskRunContext) throws Exception {
    // ...
    // collect change snapshots
    try (Timer ignored = Tracers.watchScope("MVRefreshCheckChangedVersionRanges")) {
        final Map<BaseTableInfo, TvrVersionRange> mvTvrVersionRangeMap =
                mv.getRefreshScheme().getAsyncRefreshContext().getBaseTableInfoTvrVersionRangeMap();
        for (BaseTableSnapshotInfo snapshotInfo : snapshotBaseTables.values()) {
            TvrVersionRange changedVersionRange =
                    getBaseTableChangedVersionRange(snapshotInfo, mvTvrVersionRangeMap, currentRefreshMode);
            logger.info("Base table: {}, changed version range: {}",
                    snapshotInfo.getBaseTableInfo().getTableName(), changedVersionRange);
            // collect changed version range
            TvrTableSnapshotInfo tvrTableSnapshotInfo = (TvrTableSnapshotInfo) snapshotInfo;

            tempMvTvrVersionRangeMap.put(snapshotInfo.getBaseTableInfo(), changedVersionRange);
            // update the snapshot info with the changed version range
            tvrTableSnapshotInfo.setTvrSnapshot(changedVersionRange);
        }
    }
    boolean isTaskRunSkipped = snapshotBaseTables.values().stream()
            .map(snapshotInfo -> (TvrTableSnapshotInfo) snapshotInfo)
            .map(TvrTableSnapshotInfo::getTvrSnapshot)
            .allMatch(TvrVersionRange::isEmpty);
    if (isTaskRunSkipped) {
        logger.info("No base table has changed, skip the refresh for materialized view: {}",
                mv.getName());
        return new ProcessExecPlan(Constants.TaskRunState.SKIPPED, null, null);
    }
    // ... 
}

The getBaseTableChangedVersionRange method computes the delta by comparing the MV's recorded version against the table's current snapshot:

TvrVersionRange beforeTvrVersionRange = mvTvrVersionRangeMap.get(baseTableInfo);
TvrVersion beforeVersion = beforeTvrVersionRange.to;
if (beforeVersion.equals(currentVersion)) {
    // no change, so we can skip the refresh
    logger.info("Base table {} has not changed, skip to refresh", baseTableInfo.getTableName());
    return TvrTableDelta.of(beforeVersion, currentVersion);
}
return TvrTableDelta.of(beforeVersion, currentVersion);

The processor then validates that all deltas in the range are append-only — the current Phase 1 limitation:

for (TvrTableDeltaTrait deltaTrait : tableDeltaTraits) {
    if (!deltaTrait.isAppendOnly()) {
        throw new SemanticException("TvrTableDeltaTrait is not append-only for base table: %s.%s",
                baseTableInfo.getDbName(), baseTableInfo.getTableName(), deltaTrait);
    }
}

The Iceberg connector integration is where this gets concrete. The listTableDeltaTraits method walks the snapshot ancestry:

final Iterable<Snapshot> snapshots = SnapshotUtil.ancestorsBetween(
        toSnapshotIdInclusive, fromSnapshotIdExclusive, nativeTable::snapshot);
for (Snapshot snapshot : snapshots) {
    long currentSnapshotId = snapshot.snapshotId();
    TvrTableDelta delta = TvrTableDelta.of(currentSnapshotId, lastSnapshotId);
    TvrDeltaStats stats = TvrDeltaStats.of(snapshot.addedRows());
    if (snapshot.operation() != null && snapshot.operation().equals(DataOperations.APPEND)) {
        tvrDeltaTraits.add(TvrTableDeltaTrait.ofMonotonic(delta, stats));
    } else {
        tvrDeltaTraits.add(TvrTableDeltaTrait.ofRetractable(delta, stats));
    }
}

This distinguishes between monotonic changes (append-only) and retractable changes (updates/deletes). For Phase 1, only monotonic changes are supported — Iceberg’s IncrementalAppendScan API handles this cleanly:

if (tvrVersionRange.start() != null && tvrVersionRange.start().isPresent()) {
    IncrementalAppendScan incrementalAppendScan = nativeTbl.newIncrementalAppendScan();
    // Configure scan with version range
}

Instead of scanning full partitions, the engine reads only the files added between two snapshots. Iceberg’s metadata already tracks this — every snapshot records which data files were added. The incremental scan produces exactly those files, nothing more.
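To make the snapshot walk concrete, here is a minimal Python sketch of the idea, not the StarRocks or Iceberg code. The `Snapshot` dataclass, `ancestors_between` and `incremental_files` are hypothetical stand-ins that only mimic the shape of the logic: follow parent pointers from the current snapshot back to the last-refreshed one, refuse anything that is not an append, and return only the files added in between.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical, simplified stand-in for an Iceberg snapshot (not the real Iceberg API).
@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    operation: str            # "append", "overwrite", "delete", ...
    added_files: tuple = ()


def ancestors_between(snapshots, to_id_inclusive, from_id_exclusive):
    """Walk parent pointers from the newest snapshot back to (but excluding) the old one."""
    chain = []
    current = snapshots.get(to_id_inclusive)
    while current is not None and current.snapshot_id != from_id_exclusive:
        chain.append(current)
        current = snapshots.get(current.parent_id)
    return chain  # newest first


def incremental_files(snapshots, last_refreshed_id, current_id):
    """Return files added since the last refresh; raise if any change is not an append."""
    delta = ancestors_between(snapshots, current_id, last_refreshed_id)
    for snap in delta:
        if snap.operation != "append":
            raise ValueError(f"snapshot {snap.snapshot_id} is not append-only")
    # Reverse so files come back in commit order (oldest first).
    return [f for snap in reversed(delta) for f in snap.added_files]


history = {
    1: Snapshot(1, None, "append", ("a.parquet",)),
    2: Snapshot(2, 1, "append", ("b.parquet",)),
    3: Snapshot(3, 2, "append", ("c.parquet",)),
}
print(incremental_files(history, last_refreshed_id=2, current_id=3))
# ['c.parquet']
```

With the MV last refreshed at snapshot 2 and the table now at snapshot 3, only `c.parquet` is returned, which mirrors the snap_2 / snap_3 example earlier in this section.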

Aggregate Combinator Functions

The hardest part of IVM isn’t reading deltas; it’s computing through them. Aggregations are the canonical challenge: if you have SUM(amount) over a billion rows and 10 new rows arrive, you shouldn't rescan the billion.

StarRocks solves this with aggregate combinator functions. The comments in the header file show the pattern:

// agg_state_combine.h
struct AggStateCombineState
// This combinator is equivalent to calling `{agg_func}_union({agg_func}_state(arg_types))` in SQL,
// but with reduced function call overhead and memory allocation for better performance.
// eg:
// - SQL: sum_union(sum_state(col))
// - This combinator: sum_combine(col)
//
// DESC: intermediate_type {agg_func}_combine(arg types)
//  input type          : aggregate function's arg types
//  intermediate type   : aggregate function's intermediate_type
//  return type         : aggregate function's intermediate_type

The key insight: aggregates have intermediate states that can be merged. For SUM, the intermediate state is just the running total. For COUNT, it's a counter. Merging these states is associative and commutative, so you can combine partial results in any order and get the same answer.

The combinator function suite includes:

{agg}_state(args) — converts raw values to intermediate state
{agg}_union(state) — merges intermediate states
{agg}_merge(state) — produces final result from intermediate state
{agg}_combine(args) — shorthand for _union(_state(args))

For IVM, the existing MV result becomes the accumulated state. New deltas convert to intermediate state via _state, merge with existing state via _union, and the combined state replaces the old. Work is proportional to delta size, not total data size.

Aggregate combinator flow: new delta rows convert to intermediate state via sum_state, then merge with existing MV state via sum_union. Group A’s new values (50 + 25 = 75) fold into state(500) to produce state(575). Group C (75) is new and contributes its own state. Group B stays untouched: no delta, no work.
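The same flow can be sketched in a few lines of Python. These functions are illustrative analogues of the `_state` / `_union` / `_merge` roles, not StarRocks functions, and the value 300 for group B is an assumption made up for the example. The AVG variant is included because its state (sum, count) differs from its final result, which is exactly why the intermediate state has to be kept.

```python
# Illustrative Python analogues of the combinator roles; NOT the StarRocks functions.
def sum_state(value):
    """Raw value -> intermediate state (for SUM the state is a partial total)."""
    return value


def sum_union(state_a, state_b):
    """Merge two intermediate states."""
    return state_a + state_b


def sum_merge(state):
    """Intermediate state -> final result."""
    return state


# AVG needs a richer state: (running sum, row count).
def avg_state(value):
    return (value, 1)


def avg_union(a, b):
    return (a[0] + b[0], a[1] + b[1])


def avg_merge(state):
    return state[0] / state[1]


def refresh(mv_state, delta_rows):
    """Fold delta rows into per-group state, touching only groups present in the delta."""
    new_state = dict(mv_state)
    for group, amount in delta_rows:
        partial = sum_state(amount)
        if group in new_state:
            new_state[group] = sum_union(new_state[group], partial)
        else:
            new_state[group] = partial
    return new_state


mv_state = {"A": 500, "B": 300}               # B's value is assumed for illustration
delta = [("A", 50), ("A", 25), ("C", 75)]
refreshed = refresh(mv_state, delta)
print({g: sum_merge(s) for g, s in refreshed.items()})
# {'A': 575, 'B': 300, 'C': 75}
```

The loop runs once per delta row, so the cost depends on the three new rows rather than on however many rows originally produced state(500).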

The AggStateCombineCombinator in Java bridges this to the optimizer:

public static Optional<AggStateCombineCombinator> of(AggregateFunction aggFunc) {
    Type intermediateType = aggFunc.getIntermediateTypeOrReturnType().clone();
    FunctionName functionName = new FunctionName(
        AggStateUtils.aggStateCombineFunctionName(aggFunc.functionName()));
    // ... build combinator with proper type handling
}

This is classic incremental view maintenance made practical. The idea that aggregates with associative merge operations can be updated incrementally rather than recomputed has been known for decades. StarRocks implements it through aggregate state algebra: combinator functions expose intermediate representations that can be merged, so new contributions fold into existing results without touching the original data.

It’s worth noting that the framework is architected for full differential computation: TvrTableDeltaTrait already distinguishes monotonic from retractable changes, and the combinator functions provide the algebraic foundation for mergeable state. Phase 1 implements the append-only path, but the abstractions are designed to extend toward indexed state and retraction propagation. What DBSP formalizes with Z-sets, StarRocks is building toward through aggregate state algebra and TVR semantics.

Rewriting Queries for Incremental Execution

The optimizer transformations make this work end-to-end. The RuleSet adds TVR-specific rules:

import com.starrocks.sql.optimizer.rule.tvr.TvrAggregateRule;
import com.starrocks.sql.optimizer.rule.tvr.TvrFilterRule;
import com.starrocks.sql.optimizer.rule.tvr.TvrJoinRule;
import com.starrocks.sql.optimizer.rule.tvr.TvrProjectRule;

Each rule transforms standard operators to their incremental equivalents. TvrAggregateRule rewrites aggregations to use combinator functions. TvrJoinRule handles join semantics, but here's an important nuance: for append-only inputs, new rows from the left side must still join against the full right side snapshot (and vice versa) to produce correct results. The delta identifies what's new, but the join still needs the complete picture from the other side.

Incremental join semantics for Phase 1: new rows from the left delta (L1, L2) join against the full right snapshot (R1, R2, R3). Only the new matches (L1⋈R1, L2⋈R3) are appended to the MV — existing results stay untouched.
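For append-only inputs, the rows to append follow from a standard identity: Δ(L ⋈ R) = ΔL ⋈ (R_old + ΔR) + L_old ⋈ ΔR. The Python sketch below checks that identity on toy data. It is a generic illustration with made-up helper names (`join`, `incremental_join`), not the plan that TvrJoinRule actually produces.

```python
def join(left, right):
    """Inner equi-join on the first tuple field (the key)."""
    return [(lk, lv, rv) for lk, lv in left for rk, rv in right if lk == rk]


def incremental_join(left_old, left_delta, right_old, right_delta):
    """Rows to append to the MV when both inputs are append-only.

    delta(L JOIN R) = delta_L JOIN (R_old + delta_R)  +  L_old JOIN delta_R
    """
    right_full = right_old + right_delta
    return join(left_delta, right_full) + join(left_old, right_delta)


left_old = [(1, "l-a")]
right_old = [(1, "r-a"), (2, "r-b"), (3, "r-c")]
left_delta = [(1, "l-b"), (3, "l-c")]   # two new left rows
right_delta = []                        # right side unchanged in this refresh

mv = join(left_old, right_old)          # state after the initial full refresh
mv += incremental_join(left_old, left_delta, right_old, right_delta)

# The incrementally maintained MV matches a full recomputation.
full = join(left_old + left_delta, right_old + right_delta)
print(sorted(mv) == sorted(full))
```

Note that `right_full` is read in its entirety even though only two left rows are new. That is the cost the next paragraph contrasts with indexed arrangements.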

This is where systems like Feldera have an edge. Feldera maintains indexed arrangements: compact, time-indexed structures that support efficient point lookups. When a delta arrives, the join probes the index rather than scanning the full relation. StarRocks’ current approach is more straightforward: read the delta from one side, read the current snapshot from the other, join them. It’s correct and still far better than partition-level refresh, but there’s room for optimization as indexed state structures mature in the engine.

The optimizer entry point gates this on a session variable:

if (context.getSessionVariable().isEnableIVMRefresh()) {
    tree = logicalTvrRuleRewrite(tree, rootTaskContext, requiredColumns);
}

Once enabled, the refresh processor prepares context and triggers the rewrite:

ctx.getSessionVariable().setEnableIVMRefresh(true);
ctx.getSessionVariable().setTvrTargetMvid(GsonUtils.GSON.toJson(mv.getMvId()));

The MV definition query gets rewritten. Table scans become incremental scans bounded by version ranges. Aggregations become state merges. The resulting plan processes deltas and produces delta outputs that append to the existing MV.

Iceberg as the Causal Foundation

Why Iceberg specifically? Because its snapshot model provides exactly the causal ordering that IVM requires. There is also a PR to do the same with Paimon, for example.

Every Iceberg commit creates a snapshot with a unique ID and a parent pointer. Snapshots form a linear history — you can always ask “what changed between snapshot A and snapshot B” and get a deterministic answer. The SnapshotUtil.ancestorsBetween call walks this chain.

More critically, each snapshot records its operation type. The code checks:

if (snapshot.operation() != null && snapshot.operation().equals(DataOperations.APPEND)) {
    tvrDeltaTraits.add(TvrTableDeltaTrait.ofMonotonic(delta, stats));
} else {
    tvrDeltaTraits.add(TvrTableDeltaTrait.ofRetractable(delta, stats));
}

Append operations are monotonic: data only grows, never changes. This is the tractable case for Phase 1. The MV state is a pure function of all appended data and new appends extend it without invalidating existing results.

Retractable changes (updates, deletes) require the full differential machinery: tracking which rows were removed, propagating negative contributions through aggregations, and handling join semantics where old matches must be undone. This is Phase 2 territory, likely requiring Iceberg V3’s row lineage features to produce proper CDC streams.

Creating an Incremental MV

The proposed syntax is straightforward:

CREATE MATERIALIZED VIEW test_mv1 PARTITION BY dt 
REFRESH DEFERRED MANUAL 
properties
(
    "refresh_mode" = "incremental"
)
AS SELECT t1.dt, t1.col1, t2.col1, t1.col2, t2.col2
FROM iceberg_catalog.db.t1 
JOIN iceberg_catalog.db.t2 ON t1.dt = t2.dt;

The "refresh_mode" = "incremental" property triggers IVMBasedMVRefreshProcessor instead of MVPCTBasedRefreshProcessor. On refresh, the processor:

Collects current snapshots for all base tables
Compares against the last-refreshed snapshots stored in baseTableInfoTvrDeltaMap
Computes deltas for each base table that changed
Rewrites the MV query to process only those deltas
Executes the incremental plan, appending results to the MV
Updates the stored snapshot references
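The steps above can be sketched as a single refresh cycle. This is a simplified Python model under stated assumptions, not the processor's real code: the MV is a plain dict, `run_incremental_plan` is a hypothetical callable standing in for the rewritten query, and snapshot IDs are bare integers.

```python
def refresh_mv(mv, base_tables, run_incremental_plan):
    """One refresh cycle following the steps listed above.

    mv:                   dict with "last_versions" (table -> snapshot id) and "rows" (list)
    base_tables:          table -> current snapshot id
    run_incremental_plan: callable(deltas) -> rows to append (stands in for the rewritten query)
    """
    # Collect current snapshots, compare with the recorded ones, keep only real changes.
    deltas = {
        table: (mv["last_versions"].get(table), current)
        for table, current in base_tables.items()
        if mv["last_versions"].get(table) != current
    }
    if not deltas:
        return "SKIPPED"          # nothing changed, so no plan is built at all
    # Run the delta-only plan and append its output to the MV.
    mv["rows"].extend(run_incremental_plan(deltas))
    # Advance the recorded versions only after the append succeeded.
    mv["last_versions"].update(base_tables)
    return "SUCCESS"


mv = {"last_versions": {"t1": 2, "t2": 7}, "rows": []}
status = refresh_mv(mv, {"t1": 3, "t2": 7}, lambda deltas: [("from", deltas)])
print(status, mv["last_versions"])
```

Updating the stored versions last is a deliberate choice in this sketch: if the plan fails, the next refresh sees the same delta again instead of silently skipping it.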

The test cases demonstrate the flow:

REFRESH MATERIALIZED VIEW test_mv1 WITH SYNC MODE;
-- Initial population

insert into iceberg_table values ('new_row', 100, '2023-12-01');
-- Base table gets new data

REFRESH MATERIALIZED VIEW test_mv1 WITH SYNC MODE;
-- Only the new row is processed

You can switch modes dynamically:

ALTER MATERIALIZED VIEW test_mv1 SET("refresh_mode" = "auto");

AUTO mode lets StarRocks choose the best strategy based on delta characteristics, falling back to partition-based or full refresh when incremental doesn't make sense.

Limitations and the Road Ahead

This is Phase 1 of a larger effort, and it’s important to be clear about current boundaries.

The implementation supports:

Append-only Iceberg tables (no updates or deletes)
Select, Project, Filter, Join, Aggregate, UnionAll operators
Basic aggregate functions with proper combinator support

What’s explicitly not yet supported:

Retractable changes (updates, deletes) require Iceberg V3 CDC capabilities and more work
Window functions with complex framing
Non-deterministic expressions
Indexed state for efficient join lookups (joins still read full snapshots from one side)

The retractable change support is particularly interesting. Iceberg V3 introduces row lineage with durable row IDs and sequence numbers that survive compaction. Combined with deletion vectors, this enables precise CDC: you can ask “what rows were added, changed, or removed” and get exact answers without scanning. The TvrTableDeltaTrait.ofRetractable path is already stubbed out, waiting for that foundation.

Bridging the dots

What first seemed visionary (better Iceberg CDC, progress-based computation models, determinism as a correctness foundation) is now gradually being woven into actual database engine code.

You could always theoretically build this with Flink, for example: read CDC streams, maintain state, output to a sink. But the operational cost is immense. Flink requires dedicated infrastructure, careful state backend configuration, checkpoint tuning and job-specific operator graphs. For each materialized view, you’d be building a mini-streaming-application.

StarRocks is integrating this directly into the database engine. You write SQL, declare a refresh mode, and the optimizer handles the rest. The MV rewrite system serves queries transparently. The scheduler handles refresh timing. State lives in the MV itself, so there are no external systems to run and no extra operational overhead.

This is a significant conceptual shift worth discussing, even before it ships. We’ve spent years building external streaming infrastructure to handle what databases should have done natively. The approach behind the new papers and tools, deterministic computation over time-indexed deltas, is both simpler and more powerful than coordination-based checkpointing, and it’s finally being absorbed into some query engines.

The current implementation has clear boundaries: append-only tables, joins that still read full snapshots from one side. Feldera’s indexed arrangements are more sophisticated for stateful operations, and StarRocks is nowhere near that in the current phase. Iceberg V4’s Root Manifest proposal would make delta discovery and computations truly cheap, but that’s still in design. These are honest limitations that will take time to overcome.

But the architecture is right. TVR as the semantic model with relations that vary over time, with deltas as first-class citizens. Combinator functions that make differential aggregation algebraically clean. Optimizer rules that rewrite queries transparently. Iceberg snapshots as the causal foundation, with V3’s lineage and V4’s possibility to operate with compact deltas pointing toward a future where CDC is a metadata operation.

What makes this exciting isn’t that it’s perfect: it’s that a production database engine is building the infrastructure for incremental computation natively. The amount of work is finally becoming proportional to the amount of change.

The bridge over shifting ice is being built right now.

To put it simply :)

StarRocks Incremental MV: A Bridge Over Shifting Ice was originally published in fresha-data-engineering on Medium.

Freeze the River’s Flow: Evolution of Streaming Computation Models

Anton Borisov — Wed, 12 Nov 2025 15:51:54 GMT

This piece builds on my recent work with Apache Fluss and Flink 2.1’s DeltaJoin feature, zooming out to examine the broader philosophical divide emerging in streaming computation.

Enough Time to Memorise

Everyone wants to make stream processors intelligent. Agents that optimise jobs, copilots that rewrite SQL, pipelines that explain themselves in the language of prompts.
But the real frontier for streaming computations isn’t in reasoning about data, it’s in remembering it.

A stream never stops. That’s its beauty and its curse.
Data flows through a cluster of machines, each doing its part: transforming, joining, aggregating. And somewhere in the middle of all that motion, we expect the system to remember, to recover if a node dies and to start again without losing itself.
But how can something that never pauses learn to look back?

That question, “how does a distributed system remember?”, has haunted computer science for forty years. It began, surprisingly, not with Flink or Kafka or the cloud, but with two researchers sketching an idea on paper in 1985. K. Mani Chandy and Leslie Lamport asked: could you record a system’s entire global state without ever stopping it?

Their answer was both elegant and impossible-sounding: yes, if you redefine what “at the same time” means.

In their world, processes don’t share a single clock. They exchange messages. Each message defines a relationship: this happened before that.
So Chandy and Lamport proposed a simple trick: send a special message, a marker, through every communication channel. When a process receives that marker, it snapshots its local memory and forwards the marker onward. When all markers have circled the network, you can stitch those local pictures together into a single, consistent global cut: a moment in distributed time that never truly existed, yet represents the system perfectly.

The original Chandy–Lamport snapshot algorithm (1985) conceptually: one process initiates a snapshot by sending a marker to others. Each process, upon receiving the marker, records its own local state and any in-transit messages. Together, these local snapshots form a consistent global cut across the distributed system, a moment in time that never truly existed, yet represents the system perfectly.
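A toy Python simulation makes the marker rule tangible. It is a teaching sketch under simplifying assumptions (three processes that hold balances, one FIFO channel per ordered pair, a fixed delivery order); the `Process` and `Network` classes are invented for illustration. The invariant to watch is that recorded local states plus recorded in-flight messages add up to the system's true total.

```python
from collections import deque


class Process:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.snapshot = None   # recorded local state
        self.recording = {}    # incoming channel -> transfers seen since own snapshot
        self.closed = {}       # incoming channel -> final recorded in-flight transfers


class Network:
    def __init__(self, balances):
        self.procs = {n: Process(n, b) for n, b in balances.items()}
        names = list(balances)
        # One FIFO channel per ordered pair (src, dst).
        self.channels = {(s, d): deque() for s in names for d in names if s != d}

    def send(self, src, dst, amount):
        self.procs[src].balance -= amount
        self.channels[(src, dst)].append(("transfer", amount))

    def _record_and_broadcast(self, p):
        """Record local state, send markers on all outgoing channels, record all incoming."""
        p.snapshot = p.balance
        for (s, d), ch in self.channels.items():
            if s == p.name:
                ch.append(("marker", None))
            elif d == p.name:
                p.recording[(s, d)] = []

    def start_snapshot(self, name):
        self._record_and_broadcast(self.procs[name])

    def deliver(self, src, dst):
        kind, amount = self.channels[(src, dst)].popleft()
        p = self.procs[dst]
        if kind == "transfer":
            p.balance += amount
            if (src, dst) in p.recording:
                p.recording[(src, dst)].append(amount)   # in flight at the cut
        else:
            if p.snapshot is None:
                self._record_and_broadcast(p)
            # Whatever was recorded before the marker is this channel's state.
            p.closed[(src, dst)] = p.recording.pop((src, dst))

    def drain(self):
        while any(self.channels.values()):
            for (s, d), ch in self.channels.items():
                if ch:
                    self.deliver(s, d)

    def snapshot_total(self):
        local = sum(p.snapshot for p in self.procs.values())
        in_flight = sum(sum(v) for p in self.procs.values() for v in p.closed.values())
        return local + in_flight


net = Network({"A": 100, "B": 50, "C": 25})
net.send("A", "B", 10)
net.start_snapshot("A")      # A records its state and emits markers
net.send("B", "A", 5)        # sent before B has seen any marker
net.send("C", "B", 7)
net.drain()
print(net.snapshot_total())
```

No process ever pauses: transfers keep flowing while markers propagate, and the transfer from B to A is captured as channel state rather than lost.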

No halting. No central coordinator. Just a chorus of local snapshots harmonised by causality.

It was the first real solution to the paradox: how to make a running system remember itself without breaking the flow of time.

Every distributed database, stream processor and checkpoint mechanism today is, in some way, a descendant of that idea. When we talk about checkpoints in Flink, we are really talking about an automated, industrial-scale version of that thought experiment — a continuous act of collective memory.

And as we push further: toward disaggregated state, externalised logs and time-aware computation, that old illusion begins to flicker. The river still flows, but maybe it no longer needs to freeze for us to remember its shape.

Streams Go Stateful

When Chandy and Lamport wrote their paper, the idea of a “stream” was still abstract: a metaphor for distributed messages, not the backbone of real-time data infrastructure.
Decades later, Apache Flink turned that metaphor into machinery. What began as a theoretical idea became a living mechanism that could survive crashes, scale across thousands of nodes and still remember exactly where it had been.

At its heart, Flink’s approach is a direct descendant of the original algorithm.
What Chandy and Lamport called a marker, Flink turned into a barrier, an actual message that flows through the data stream.
Each source injects barriers at regular intervals, tagging a consistent moment in logical time. As those barriers travel downstream, every operator waits until it has received the same barrier from all its inputs, then takes a local snapshot of its internal state. Once every operator has reported completion for that barrier ID, the JobManager declares a checkpoint: a global, consistent image of the dataflow at that instant.

How Flink turns the Chandy–Lamport snapshot into a living mechanism: each operator takes a local snapshot when it receives a barrier, persists its state, forwards the barrier downstream. The coordinator later gathers all state handles to form one consistent global checkpoint.

The crux is that the job never stops and barriers weave through the same channels as the data, letting the system capture itself without halting. That’s how consistency emerges as a by-product of flow.
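Barrier alignment at a single operator can be sketched as follows. This is a minimal Python model, not Flink code: `AligningOperator` is a hypothetical class whose only state is a record count, and it assumes one checkpoint in flight at a time.

```python
class AligningOperator:
    """Counts records; snapshots the count once a barrier has arrived on every input."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.count = 0
        self.blocked = set()    # inputs that already delivered the current barrier
        self.buffered = []      # records that arrived on blocked inputs
        self.snapshots = {}     # checkpoint id -> state captured at the barrier

    def on_record(self, input_id, record):
        if input_id in self.blocked:
            # Post-barrier record on an already-aligned input: hold it back.
            self.buffered.append(record)
        else:
            self.count += 1

    def on_barrier(self, input_id, checkpoint_id):
        self.blocked.add(input_id)
        if len(self.blocked) == self.num_inputs:
            self.snapshots[checkpoint_id] = self.count   # local snapshot
            self.blocked.clear()
            pending, self.buffered = self.buffered, []
            self.count += len(pending)                   # process held-back records
            # A real operator would also forward the barrier downstream here.


op = AligningOperator(num_inputs=2)
op.on_record(0, "a")
op.on_barrier(0, checkpoint_id=1)   # input 0 is now blocked
op.on_record(0, "b")                # belongs to the next epoch, so it is buffered
op.on_record(1, "c")                # input 1 has not sent its barrier yet
op.on_barrier(1, checkpoint_id=1)   # alignment complete: snapshot, then unblock
print(op.snapshots, op.count)
# {1: 2} 3
```

The snapshot contains exactly the two pre-barrier records ("a" and "c"), while "b" is only counted afterwards. The buffering on the fast input is the waiting cost that later sections describe as the alignment bottleneck.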

But turning that elegant idea into production engineering took a decade of iteration. Real streams are messy: they backpressure, spill to disk and occasionally drown their operators in terabytes of state.
To cope, Flink learned new tricks and shifted the initial model.

Learning to Cope with Scale

As deployments grew from tens to thousands of tasks, the original checkpoint model started to bend. A single consistent cut meant propagating barriers across the graph, blocking input channels and serializing each operator’s RocksDB or heap state to remote storage. On multi-terabyte pipelines that could saturate I/O for minutes. The community didn’t abandon the model; it enriched it with ingenious ideas and clever tricks.

Asynchronous snapshots

Originally, the operator thread stalled while state was written. From 1.3.0, Flink moved serialization to a background I/O thread: the operator marks the cut at the barrier and keeps processing while the async phase persists bytes. This eliminated full-graph stalls during snapshotting.

Incremental checkpoints

With RocksDB, SST files are immutable. Flink began persisting only newly created SSTs plus a manifest of referenced files, so recovery reuses prior files and downloads only deltas. This collapsed network traffic by orders of magnitude on large jobs.
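Because SST files are immutable, deciding what to upload reduces to a set difference. The Python sketch below illustrates that idea only; the function name and file names are made up, and real Flink additionally tracks reference counts so that files no longer referenced can be deleted.

```python
def incremental_checkpoint(current_sst_files, already_uploaded):
    """Return (files_to_upload, manifest) for one checkpoint."""
    to_upload = sorted(set(current_sst_files) - set(already_uploaded))
    manifest = sorted(current_sst_files)   # every file this checkpoint references
    return to_upload, manifest


uploaded = set()

# Checkpoint 1: nothing is in remote storage yet, so both files are uploaded.
cp1_upload, cp1_manifest = incremental_checkpoint({"001.sst", "002.sst"}, uploaded)
uploaded.update(cp1_upload)

# Checkpoint 2: 001.sst was compacted away locally, 002.sst is unchanged, 003.sst is new.
cp2_upload, cp2_manifest = incremental_checkpoint({"002.sst", "003.sst"}, uploaded)
uploaded.update(cp2_upload)

print(cp2_upload, cp2_manifest)
# ['003.sst'] ['002.sst', '003.sst']
```

The second checkpoint references two files but transfers only one; recovery from it would fetch `002.sst` from the earlier upload and `003.sst` from the new one.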

Unaligned checkpoints

Barrier alignment became the next bottleneck: fast channels waited for slow ones. 1.11 introduced unaligned checkpoints, which snapshot in-flight buffers as operator state. Alignment is no longer a prerequisite: snapshots grow slightly, but checkpoint latency under backpressure drops dramatically.

State changelog backend

1.15 added a changelog layer that continuously logs state mutations (insert/update/delete) to remote storage. Periodic checkpoints reference changelog segments; recovery restores the base snapshot and replays the changelog up to the last acknowledged sequence number. This turns fault tolerance from a bursty event into an ongoing process.

Disaggregated state backend

In 2.0+, Flink introduced disaggregated state: the authoritative copy of state lives on remote storage, and local disks act as an optional cache. Compute and storage scale independently, and checkpoints mostly write metadata (offsets/pointers) rather than duplicating data files. Recovery remaps operators to existing remote state fragments instead of moving large snapshots.
It was paired with another feature, an important step toward a much more interesting conceptual model of execution.

Async execution model

Flink 2.0 shifts from potentially blocking barrier alignment to some form of continuous progress tracking.
The new asynchronous execution framework (FLIP-425) breaks a record’s life into futures and callbacks executed on the mailbox thread, preserving single-threaded semantics while overlapping computation and I/O.
A KeyAccountingUnit keeps per-key order, an EpochManager tracks when all records in a watermark epoch finish so time can advance without waiting for every operator.
Checkpoints drain only bounded in-flight work and persist metadata, that is why it’s essential for disaggregated state, where recovery remaps operators to remote fragments instead of replaying full snapshots.
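
To make the per-key ordering idea concrete, here is a toy gate that lets at most one record per key be in flight while other keys proceed. It is only a sketch of the concept: the class `KeyOrderGate` and its methods are hypothetical and do not mirror Flink's KeyAccountingUnit API.

```python
from collections import defaultdict, deque


class KeyOrderGate:
    """Allow one in-flight record per key; buffer the rest in arrival order."""

    def __init__(self):
        self._in_flight = set()
        self._blocked = defaultdict(deque)

    def submit(self, key, record):
        """Return the record if it may start now, else buffer it and return None."""
        if key in self._in_flight:
            self._blocked[key].append(record)
            return None
        self._in_flight.add(key)
        return record

    def complete(self, key):
        """Finish the in-flight record for key; return the next buffered one, if any."""
        queue = self._blocked.get(key)
        if queue:
            next_record = queue.popleft()
            if not queue:
                del self._blocked[key]
            return next_record  # the key stays occupied by this record
        self._in_flight.discard(key)
        return None


gate = KeyOrderGate()
first = gate.submit("k1", "a")      # starts immediately
second = gate.submit("k1", "b")     # same key is busy, so it is buffered
other = gate.submit("k2", "c")      # different key runs in parallel
released = gate.complete("k1")      # finishing "a" releases "b"
```

Records for the same key are processed strictly in order, while different keys overlap freely, which is the property the async framework needs to preserve.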

Building on this, Key-Ordered Async Lookup Join (FLIP-519) enforces sequential processing per key but parallelism across keys.
Together they partially replace coordinated pauses with incremental per-key movement, moving toward a runtime that doesn’t freeze for consistency but flows forward through it.

Flink’s decade of checkpoint engineering: from stop-the-world snapshots to remote-first, metadata-driven recovery. Each milestone lowers a different cost of remembering: latency, I/O, alignment or coordination.

Each innovation reduced the cost of remembering, therefore snapshots became faster, lighter and more continuous.
Yet the fundamental motion remained the same: barriers marking time, operators freezing briefly, memory being externalised just long enough to catch up with the flow.

Lookup for my state

Each optimisation pushed checkpointing toward the edges: async I/O moved it off the critical path, incremental snapshots reduced what needed copying, changelog backends made it continuous, disaggregated state turned it into metadata. But state itself still lived inside Flink’s execution graph, owned by operators, versioned by barriers, recovered through checkpoint restoration.

The next step wasn’t to make snapshots faster. It was to make them optional for part of the pipeline.

What if lookup state didn’t need checkpointing at all? What if one operator’s state could be another job’s durable source, say a shared table that multiple pipelines read without duplication, where recovery meant re-attaching rather than restoring?

That’s one of the ideas behind Fluss: externalizing lookup state into a versioned, shared log-structured table managed outside Flink’s lifecycle.

Classical stream-stream lookup example: Regular join vs DeltaJoin externalizing history into Fluss. Instead of hoarding state, each side emits a changelog and probes the index on demand.

Computational state, such as running aggregations and windowed buffers, transforms with every record. It must be operator-owned because it’s computation-specific. But lookup state is referential: dimension tables queried by multiple jobs, enrichment data that doesn’t compute, just matches.

With Fluss, lookup state externalizes entirely. It’s not operator-owned RocksDB on remote storage: it’s a shared, versioned table managed by a separate system. When a job performs a lookup join, it opens a Fluss snapshot ID, maintains a local cache and issues remote reads if needed.

The lookup state never enters Flink’s checkpoint model. Multiple jobs share the same table. Recovery means re-opening the table, not remapping operator state.

Barriers still coordinate exactly-once semantics across the pipeline. But lookup state no longer participates in checkpoint coordination — it lives outside the river entirely.

Time as Memory

If Flink taught streams how to remember, these newer systems teach them when to remember.
They stop synchronising work around barriers and start reasoning about progress — about what it means for time itself to move forward inside a distributed graph.

Where Flink builds global coordination on explicit messages (barriers), systems like Timely Dataflow and its descendants build implicit synchrony out of partial order and causality.

Timely Dataflow: progress as protocol

In Timely Dataflow, the basic unit of control isn’t a checkpoint: it’s a timestamp.
Every record carries a logical time: a structured timestamp describing its epoch and nested loop iteration.
Instead of global barriers, Timely’s runtime continuously exchanges small progress messages between operators.
Each message announces the creation or completion of events at a particular timestamp, effectively incrementing or decrementing a counter in a distributed ledger of time.

From these messages, every operator constructs a frontier: the minimal set of timestamps that may still produce output.
In formal terms, it’s an antichain: a set of times such that no element is less than another in the partial order.
When the frontier advances beyond time t, the operator knows that no future input ≤ t can arrive through any path in the dataflow.
At that instant, all computation for time t is sealed.
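
A small sketch may help with the antichain idea. It assumes timestamps are (epoch, iteration) pairs compared component-wise; the helper names are mine, and this is not Timely's Rust API, just the same partial-order reasoning in Python.

```python
def less_equal(a, b):
    """Product partial order: a <= b only if every coordinate is <=."""
    return all(x <= y for x, y in zip(a, b))


def frontier(pending):
    """Minimal elements of the pending timestamps (an antichain)."""
    unique = set(pending)
    return sorted(
        t for t in unique
        if not any(other != t and less_equal(other, t) for other in unique)
    )


def is_sealed(t, front):
    """Time t is complete when no frontier element could still lead to it."""
    return not any(less_equal(f, t) for f in front)


# Hypothetical pending work, as (epoch, iteration) timestamps.
pending = [(1, 2), (2, 0), (1, 3), (2, 1)]
front = frontier(pending)
```

With this input the frontier holds two incomparable times, (1, 2) and (2, 0). Time (1, 1) is sealed because neither frontier element is less than or equal to it, while (2, 2) is still open.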

Operators can then compact state, aggregate results or emit finalized outputs, all without a coordinator.
Each dataflow shard progresses at its own pace, yet global consistency emerges naturally from the intersection of local frontiers.
There are no checkpoints, no distributed barriers, no backpressure from alignment.
The only coordination is informational: “I’m done with everything before this point.”

Timely Dataflow conceptually: time itself becomes the coordinator. Operators exchange lightweight progress messages, constructing distributed frontiers that define which timestamps are complete. Once a frontier passes time t, all work for t is sealed: no checkpoints, no global barriers, only causality.

When operators are deterministic (pure functions of data and timestamp), recovery becomes a replay problem, not a restore problem.
Reinject the durable input streams and the same frontiers advance in the same order, producing identical results.
Flink externalizes state to guarantee safety; Timely internalizes causality to achieve it.

Differential Dataflow: memory as difference

Sitting atop Timely, Differential Dataflow redefines what persistence means.
Instead of materializing entire states, it represents data as a time-indexed collection of differences: insertions, retractions and updates annotated with timestamps.
Each record is stored as a triple (data, time, diff), where diff expresses the weight or multiplicity of that change (e.g., +1 for an addition, –1 for a deletion).

Operators don’t rebuild results when inputs change, they simply propagate new differences through the graph.
Every operator maintains one or more arrangements, effectively compact, partially ordered indexes of these triples.
As frontiers advance, old differences whose effects are no longer visible are consolidated or discarded.
The result is a continuously compacting view of the world: new deltas flow in, old ones collapse into summarized state and no explicit snapshot is ever taken.
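
Consolidation can be sketched as follows, assuming totally ordered integer times for simplicity (Differential itself works over partial orders). The function name and the sample data are illustrative only.

```python
from collections import defaultdict


def consolidate(updates, compact_upto):
    """Merge (data, time, diff) triples once the frontier has passed compact_upto.

    Times at or before compact_upto can no longer be told apart by any
    future reader, so they are advanced to compact_upto and their diffs
    summed. Entries whose diffs cancel to zero disappear.
    """
    totals = defaultdict(int)
    for data, time, diff in updates:
        totals[(data, max(time, compact_upto))] += diff
    return sorted((d, t, w) for (d, t), w in totals.items() if w != 0)


updates = [
    ("apple", 1, +1),
    ("apple", 2, -1),   # cancels the insertion above
    ("pear", 1, +1),
    ("pear", 3, +1),
    ("kiwi", 5, +1),    # newer than the frontier, left untouched
]
compacted = consolidate(updates, compact_upto=3)
```

Five triples shrink to two: the apple insertion and retraction cancel, the two pear updates merge into a single entry of weight 2, and the kiwi update keeps its own time.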

Given deterministic operators and a durable stream of (data, time, diff) events, recomputing the output is guaranteed to produce the same trace.
Persistence, when used, is purely an optimization, as the system can checkpoint its arrangements to disk to avoid full replay, but correctness doesn’t depend on it.

This is the inversion of checkpointing logic.
Instead of freezing a full image and resuming from it, Differential’s traces are self-stabilizing: their compactness grows as time advances.

Feldera: declarativity on top of determinism

Feldera reimagines streaming computation through DBSP: a synchronous model where time behaves like database transactions. Unlike Timely Dataflow’s timestamp-labeled events that can arrive out of order, DBSP enforces a single logical clock that partitions continuous time into discrete steps. Each step collects all input changes across every stream, computes incrementally, and emits one batch of output changes. There are no frontiers to track and no runtime out-of-order corrections to reconcile — just a strict alternation between input and output synchronized by causality.
Relations aren’t tables, but functions of logical time whose contents at step t depend only on inputs from steps 1…t.
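
As a rough illustration of the step model, here is a tiny incremental COUNT view: each step takes one batch of weighted input changes and emits one batch of weighted output changes. The class is hypothetical and far simpler than DBSP or Feldera's actual operators.

```python
from collections import defaultdict


class IncrementalCount:
    """Maintains count-per-key and emits output changes once per step."""

    def __init__(self):
        self.counts = {}

    def step(self, input_changes):
        """input_changes: list of (key, weight). Returns [((key, count), weight)]."""
        deltas = defaultdict(int)
        for key, weight in input_changes:
            deltas[key] += weight
        output = []
        for key, delta in deltas.items():
            if delta == 0:
                continue
            old = self.counts.get(key, 0)
            new = old + delta
            if old != 0:
                output.append(((key, old), -1))  # retract the old row
            if new != 0:
                output.append(((key, new), +1))  # insert the new row
                self.counts[key] = new
            else:
                self.counts.pop(key, None)
        return sorted(output)


view = IncrementalCount()
step1 = view.step([("a", +1), ("a", +1), ("b", +1)])
step2 = view.step([("a", -1)])
```

Because each step depends only on the inputs seen so far, replaying the same batches into a fresh instance yields the same output batches, which is the determinism the recovery story relies on.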

This synchronous guarantee means correctness doesn’t depend on distributed snapshots and determinism ensures the same state can always be replayed from inputs. When a node fails, recovery simply replays the exact sequence of input batches from the last checkpoint. Checkpoints still exist for speed, capturing intermediate operator state to avoid full replay, but determinism itself guarantees consistency.

The cost is architectural: DBSP trades Timely’s flexible concurrency for transactional simplicity, making every logical clock tick an atomic boundary where all changes settle before time advances.

The evolution of streaming memory: Each inward spiral represents a decade of optimization: from Chandy-Lamport’s markers to Flink’s decade of optimising coordination along with Timely’s frontiers curling tight. But what is in the Eye of the Storm?

Uncomfortable question

Streaming computations have evolved into two distinct paradigms and it would be blind not to notice that Flink’s decade of dominance faces a genuine challenge from a fundamentally cleaner computational model.

The operator-coordinated approach, mastered by Flink, has perfected the art of distributed coordination. Every innovation, from async snapshots to incremental writes to disaggregated state, has made checkpointing nearly invisible. This is remarkable engineering. Flink solved the impossible: making a distributed system remember itself without stopping, powering the world’s largest data infrastructures reliably, at scale.

But the progress-based model, pioneered by Timely and its descendants, asks a different question entirely. Instead of perfecting coordination, what if we eliminated it? No barriers, no synchronized snapshots, but operators tracking time frontiers, computing when causality permits. It’s not an optimization but a different computational philosophy.

Flink has invested deeply in its checkpoint protocol, embedded throughout the network stack, state backends, recovery logic. This isn’t a weakness, but the natural result of solving real production problems for a decade. Although it does mean adopting progress-based semantics would require rebuilding from scratch, not evolution.

Progress-based systems started with different foundations. They achieve exactly-once guarantees for incremental computation without barrier coordination: younger, yes, but unburdened by legacy decisions and by Flink’s requirement to handle non-deterministic operations.

The parallel to CISC versus RISC is tempting: complex instruction sets, perfected over decades, eventually challenged by simpler primitives that achieved the same results more elegantly. CISC processors weren’t wrong: they were exactly what we needed until we discovered different domains demanded different trade-offs. CISC persisted where compatibility mattered and RISC dominated where efficiency was critical. Streaming may follow the same path: checkpoint-based for general streaming, progress-based for incremental computation.

Let’s be crystal clear here: streaming systems are specializing, not replacing each other. Checkpoint-coordinated models excel at arbitrary stateful computation in messy production environments. Progress-tracked models excel at incremental computation with structured time. Both will coexist because they solve fundamentally different problems. But it would be silly to neglect the conceptual shift and elegance offered by newer models for workloads where Flink’s coordination machinery feels like overkill.

The river is choosing a new course and the question arises: did we spend decades making checkpoints invisible, only to discover we could have made them unnecessary for the most useful type of computations?

The answer might reshape not just how we build stream processors, but how we think about distributed time itself.

Freeze the River’s Flow: Evolution of Streaming Computation Models was originally published in fresha-data-engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Iceberg CDC: Stream a Little Dream of Me

Anton Borisov — Mon, 27 Oct 2025 15:05:41 GMT

Night freezes streams to a whisper: “I love you”

Another article with musical vibes. 😬

Ananth Packkildurai captured the mood perfectly in “When Dimensions Change Too Fast for Iceberg”, a track about what happens when real-time rhythm collides with immutable structure. Iceberg was written for steady tempos, not improvisation. This piece continues that melody: how v3 and v4 retune Iceberg’s metadata so streaming can finally play in time.

Streaming into Iceberg seems like it should be boring by now: a source emits inserts, updates, and deletes, then a sink writes them and queries see a fresh, consistent table. The friction starts the moment those updates are small, constant, and urgent.

Iceberg’s superpower is immutability with time travel aka snapshots over append-only files. That same superpower turns every change into “add something new and teach readers to ignore something old.” Perfect for studio batch albums, twitchy with real-time improvisations.

Most streaming writers don’t know where yesterday’s row physically lives. They take the easy contract with equality deletes that say “drop any row where key = K.” Throughput stays high, but readers inherit the tax: scan candidate files and re-apply predicates over and over. You can flip the trade-off with positional deletes or deletion vectors and let readers skip exact row offsets, but then writers need a reliable way to find those offsets in real time. Without help, they can’t. That’s why upserts on Iceberg so often devolve into “cheap to write, expensive to read.”

Two paths for deleting data in Iceberg-like tables:
Positional deletes remove rows by their physical location inside data files, while equality deletes filter rows by matching column values. Streaming sinks can only produce equality deletes, since positional deletes require stable file layouts and snapshot coordination.

There’s a second source of drag that’s quieter but just as important: identity and change detection. A clean change feed or a correct point-in-time join requires rows that remember who they are and a cheap way to say what changed between two moments. Iceberg V3 moved the ball with row lineage and standardised delete semantics, so that updates stopped looking like arbitrary rewrites. But metadata still fanned out across many files, equality deletes still piled up, and anything spanning multiple tables remained awkward because each table ticks on its own clock.

The V4 proposals push on the part that hurts most in streaming: metadata. A single Root Manifest replaces sprawling lists, manifest-level delete vectors mark removals without rewriting children, and pruning lifts from rigid partition tuples into column and expression stats. “What changed” may be expressed at the top of the tree and discovered with tiny I/O. None of this bans equality deletes or conjures a built-in key index, it just makes small commits cheap and change planning explicit, which is exactly what streaming systems need.

The last mile isn’t in the table format at all, but orchestration. To make incremental joins reliable and upserts feel precise, you want a catalog that can hand out a global commit order and, where it’s truly needed, an optional way to map a hot key to its last position so writers can emit positional deletes on contact. With V3’s lineage and V4’s compact deltas, those services may shift from ad-hoc fixes to maintainable design.

The rest of this article unpacks how we got here, what V3 already fixed, what V4 plans to change in practice, what a catalog can add without becoming a database, and why CDC on Iceberg is finally drifting from hard to merely careful.

IceV3rg

Iceberg v3 made CDC possible without icy cold sweat by tightening two foundations: how we mark rows as gone, and how rows keep their identity as they move.

First, deletes got a shape that readers can exploit. In v2, a table could accumulate many small delete files per data file, forcing engines to juggle a pile of predicates. v3 formalized deletion vectors (bitmaps over row ordinals) and clarified how a reader targets them to the right files. When a DV is present, a scan can skip exact offsets instead of re-reading and filtering whole files. For batch merges that already know “the old row is in file F at position P,” this is gold, reads become surgical, and compaction can fold DVs back into clean base files. In streaming, that same precision is available if the writer can find positions, otherwise you still fall back to equality deletes because they’re the only contract a firehose can always satisfy. The main benefit is indirect, but conceptual: by localizing deletes, DVs shorten compaction cycles and narrow the gap between batch maintenance and real-time freshness.

How Iceberg stores deletion vectors: deletes are encoded as Roaring Bitmaps, written into Puffin files, and linked through manifest entries so readers can filter Parquet rows without scanning separate delete files.
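
The reader-side effect of a deletion vector can be shown with a plain integer bitmap. This is a simplified sketch: real DVs are Roaring bitmaps stored in Puffin files, and the helper names here are hypothetical.

```python
def make_dv(deleted_positions):
    """Build a bitmap with one bit set per deleted row ordinal."""
    bitmap = 0
    for pos in deleted_positions:
        bitmap |= 1 << pos
    return bitmap


def live_rows(rows, dv):
    """Skip exactly the masked offsets; no predicate is re-evaluated."""
    return [row for pos, row in enumerate(rows) if not (dv >> pos) & 1]


rows = ["r0", "r1", "r2", "r3"]
dv = make_dv([1, 3])
visible = live_rows(rows, dv)
```

The scan never inspects row contents to decide what is deleted, which is the difference from equality deletes, where every candidate row has to be matched against a predicate.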

Second, v3 introduced row lineage so updates stop being lost in the churn of file-level rewrites. Each row gets a durable ID and a “last updated” sequence: when you rewrite a file for compaction, the IDs ride along unchanged, and when you truly change a row, its sequence ticks. That one idea removes two of CDC’s worst taxes: you no longer need an anti-join to deduplicate “final” versions, and you can tell an update from a rewrite without replaying history. It also lets a change feed show clean before/after pairs instead of ambiguous delete+insert shadows.

Before lineage, engines had to join two snapshot file lists by business ID and anti-join to filter duplicates.
After lineage, each row carries its own identity (_row_id) and sequence (_seq), allowing CDC queries to use simple filters over sequence ranges without additional joins.
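
In code, the "after lineage" query is just a range filter. The sketch below uses the `_row_id` and `_seq` names from the description above; the rows and the `status` column are made up for illustration.

```python
def changes_between(rows, after_seq, upto_seq):
    """Rows whose last-updated sequence falls in (after_seq, upto_seq]."""
    return [r for r in rows if after_seq < r["_seq"] <= upto_seq]


rows = [
    {"_row_id": 1, "_seq": 10, "status": "paid"},
    {"_row_id": 2, "_seq": 12, "status": "shipped"},
    {"_row_id": 3, "_seq": 15, "status": "new"},
]
changed = changes_between(rows, after_seq=10, upto_seq=12)
```

No join against the previous snapshot is needed: the row with `_row_id` 2 is the only one whose sequence moved inside the requested window.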

Together, DVs and lineage let engines build snapshot-diff and change-log views that are correct and reasonably fast on a single table. You can ask “what changed between S and T,” get a small set of file tasks, and reconstruct updates without guessing. But v3 didn’t change two structural realities. Equality deletes remain the path of least resistance for streaming writers, so readers still inherit predicate cost unless you compact aggressively or have a way to issue positional deletes on the hot keys. And while a single table has a perfect, linear history, relations do not, so v3 doesn’t give you a global clock. Incremental joins across tables are still a balancing act performed by the catalog or the engine, not something the format solves for you.

So v3’s promise is clear: identity is stable, and deletes are expressible as precise masks. That’s enough to make CDC workable today with some tweaks. What it doesn’t do is make every micro-commit tiny or every “what changed?” query cheap to plan. That’s the part v4 aims squarely at.

Iceberg V4

V4 plans to replace the manifest-list fan-out with one “Root Manifest” per snapshot. That file directly records the delta: new/removed files, data-file deletion vectors (DVs), equality deletes, and manifest DVs (DVs that act on leaf manifests). The aim is to make both writing and reading metadata proportional to the size of the change, which is exactly what a CDC reader needs when polling for “what changed since X.”

Because the Root Manifest can carry MDVs that mark adds/removes inside leaf manifests, a reader can do change detection at the root: take the previous root you cached, read the new root, and compute added/removed entries by interpreting MDVs (either as a replacement DV or as a smaller “diff DV,” both patterns are discussed with storage/plan-time trade-offs). Practically, that means a CDC plan can be built from a tiny set of byte ranges rather than re-listing and re-parsing many manifests.
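
A toy version of root-level change detection might look like this. Since V4 is still a proposal, everything here is an assumption: the dict layout, the entry names, and the idea that an MDV is a list of masked positions per leaf manifest are placeholders, not the spec.

```python
def diff_roots(prev_root, new_root):
    """Compute the delta between two root manifests without opening leaves."""
    prev_entries = set(prev_root["entries"])
    new_entries = set(new_root["entries"])
    newly_masked = {}
    for manifest, positions in new_root["mdv"].items():
        fresh = set(positions) - set(prev_root["mdv"].get(manifest, ()))
        if fresh:
            newly_masked[manifest] = sorted(fresh)
    return {
        "added": sorted(new_entries - prev_entries),
        "removed": sorted(prev_entries - new_entries),
        "newly_masked": newly_masked,
    }


# Hypothetical roots: the new commit adds data file D5 and masks entry 3 in M1.
prev_root = {"entries": ["M1", "D4"], "mdv": {"M1": [0]}}
new_root = {"entries": ["M1", "D4", "D5"], "mdv": {"M1": [0, 3]}}
delta = diff_roots(prev_root, new_root)
```

The work is proportional to the size of the two roots, not to the number of leaf manifests in the table.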

In Iceberg V4, each snapshot will have a single Root Manifest listing everything changed in that commit: new manifests, new data files, and any manifest-level delete vectors (MDVs). The Root Manifest points to both the new data file D5 and Manifest M1, while Manifest DV1 records row-level removals inside M1. To detect deltas, a CDC reader just diffs the previous and current Root Manifest and applies MDVs, with no need to re-scan all manifests.

In V4, a manifest row is a generic content entry that can be a data file, a DV (data or manifest), an equality-delete file, or a leaf (data/delete) manifest. The schema is cleaned up and stats are standardized, still equality deletes remain first-class (note the carried-over equality_ids). This is important: CDC doesn’t ban equality deletes, it just gains better metadata to read around them.

The proposed Iceberg V4 design simplifies manifest metadata by removing the explicit partition tuple and spec ID. Instead, each entry records column-level lower and upper bounds (and derived stats for transforms like bucket or hour). If bounds are exact, readers can reconstruct the partition, otherwise, they treat the file as range-spanning. This shrinks metadata and lets mixed partitioned files share one manifest.

A big design thread in V4 is to decouple manifests from partition specs. The proposal shows multiple approaches, with the lead one removing the materialized partition tuple and relying on lower/upper bounds (including stats for derived expressions like bucket() or hours()). For equality-delete matching, readers reconstruct the needed partition values from stats, but only if the bounds are exact and fail if lower != upper for a partitioned expression. Writers, in turn, must produce complete (un-truncated) stats for identity-partitioned fields. This shrinks metadata and lets differently partitioned files live in the same manifest, but shifts some vigilance to writers/readers.

A CDC reader can prune f1 and f3 precisely but must keep f2 because it spans a range.
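
The pruning decision reduces to a comparison against each file's bounds. The sketch below assumes integer bounds on a single column and invents the three-way labels; it only illustrates why exact bounds matter.

```python
def classify(bounds, value):
    """Decide what a reader can do with a file, given its (lower, upper) bounds."""
    lower, upper = bounds
    if value < lower or value > upper:
        return "prune"   # the value cannot be in this file
    if lower == upper:
        return "exact"   # bounds pin down a single partition value
    return "keep"        # the file spans a range, so it must be read


# Hypothetical files matching the figure: f1 and f3 are exact, f2 spans a range.
files = {"f1": (1, 1), "f2": (2, 7), "f3": (9, 9)}
decisions = {name: classify(b, value=5) for name, b in files.items()}
```

For a lookup of value 5, f1 and f3 are pruned outright, while f2 has to stay in the plan because its bounds only say the value might be there.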

If you’d rather keep the tuple, there are alternative designs (union-tuple schema or a hybrid where identity partitions carry an explicit partition_value alongside bounds), each with storage or complexity trade-offs. The point: V4’s column-stats direction gives CDC planners enough information to prune and to match deletes without pinning manifests to a single spec.

What this means for streaming CDC:

Lighter polls: Engines can cache the last Root Manifest, fetch the next one, and compute added/removed files and DV deltas without re-walking the tree. That takes a meaningful chunk out of commit-to-visibility latency for CDC consumers.
Better DV hygiene: The proposal details how follow-on deletes can either replace a prior DV or append a smaller “diff DV”: both patterns keep the root small and make change detection algebraic (subtract prior DV and apply delta). This is friendlier to incremental readers and aligns with CDC’s “tell me just the rows that flipped” planning.
Equality deletes are still here on purpose. They’re retained in V4 (see equality_ids) because many streaming writers still rely on them for simple upserts. V4 doesn’t magically turn them into positional deletes, but it does make equality-heavy snapshots cheaper to discover and plan. If you want positional efficiency, you can pair V3/V4 row lineage (immutable row IDs/sequence) with an index or periodic compaction to produce DVs, as V4’s metadata just lowers the overhead around those operations.

How MDVs are stored (inline vs Puffin), whether to declare optional data↔delete-manifest “affinity” to skip an extra reconciliation pass, and the exact rules for “change detection at the root” are explicitly called out as design levers. Their eventual choices will shape how much work a CDC client does per poll and how cache-friendly the path is.

Net-net: V4 doesn’t “solve CDC” by itself, deletes will still be equality-scoped, and writers may still choose COW vs MOR, but it collapses metadata amplification, makes incremental planning a root-level operation, and loosens partition-schema coupling so the write path can evolve without hurting incremental readers. That’s exactly the kind of groundwork CDC needs.

Catalogs with enough brains

Iceberg gives you row identity (v3) and plans to give you cheap deltas (v4).

The last mile is orchestration. Catalogs like Lakekeeper sit at the only place that sees every commit, across every table and branch. Two capabilities turn “possible” CDC into “pleasant” CDC: a global timeline and compact, byte-range deltas, plus an optional key locator for hot upserts. Here’s the shape of that without turning a catalog into a database.

A single table already has an ordered snapshot history, but multi-table relations don’t. Incremental joins, point-in-time pipelines, and multi-table change feeds all need one clock.

Per-branch sequence numbers. On each successful commit, the catalog stamps a strictly increasing catalog sequence for that branch (table name included in the record).
Branch-aware reads. Readers can ask, “give me everything after seq=N on branch main,” then plan in snapshot order across tables. This is essential for correct incremental joins and cross-table CDC.

GET /v1/branches/{branch}/changes?after_seq=N
→ [
  {seq, table_id, snapshot_id, root_uri, mdv_refs[], op, ts},
  ...
]

No re-listing of storage, no guessing, it’s the catalog that hands you the minimal set of things that changed.
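
An in-memory stand-in for that endpoint shows how little state the feed needs. The class, method names, and URIs are hypothetical; a real catalog would persist the log and stamp sequences atomically with the commit.

```python
class CatalogFeed:
    """Stamps a strictly increasing sequence per branch and serves changes."""

    def __init__(self):
        self._log = {}

    def commit(self, branch, table_id, snapshot_id, root_uri):
        entries = self._log.setdefault(branch, [])
        seq = len(entries) + 1
        entries.append({
            "seq": seq,
            "table_id": table_id,
            "snapshot_id": snapshot_id,
            "root_uri": root_uri,
        })
        return seq

    def changes(self, branch, after_seq):
        """Everything committed on the branch after the given sequence."""
        return [c for c in self._log.get(branch, []) if c["seq"] > after_seq]


feed = CatalogFeed()
feed.commit("main", "orders", 101, "s3://lake/orders/root-101")
feed.commit("main", "customers", 57, "s3://lake/customers/root-57")
feed.commit("dev", "orders", 102, "s3://lake/orders/root-102")
pending = feed.changes("main", after_seq=1)
```

A reader that last saw sequence 1 on main gets exactly one entry back, for the customers table, and never has to list storage to find it.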

V4 concentrates change detection in the Root Manifest and its manifest delete vectors. The best thing a catalog can do is expose exactly those byte ranges.

What to return: the new root’s URI (and byte range if the store supports it) plus any external MDV references.
Why it matters: engines can diff the previous root they cached with the new one and apply MDVs to identify added/removed leaf entries — without walking leaf manifests or touching data files. That keeps commit-to-visibility latency low for CDC consumers.

Readers thrive on stable IDs. Catalogs can make engines fast by standardizing cache keys.

Content-addressed roots/manifests. Use immutable IDs (e.g., content hash + ETag + length) for all returned artifacts so engines can deduplicate and persist safely.
Coherent eviction. When a compaction job lands, the change feed is the single point that invalidates caches, everything else remains cold.

The catalog exposes each root manifest with its URI, ETag, and length. The engine compares the new ETag to the cached one: if identical, it stays idle, if changed, it means a new commit. The engine fetches the new root, diffs it with the previous one, and emits CDC events: byte-precise change detection.

Optional concept: a PK locator for hot upserts

Not required for correctness, but invaluable where upserts churn. It also isn’t a catalog-only change: other parts of the stack would need to change as well, so for now it’s closer to wishful thinking about the future.

What it is: A tiny LSM/KV that tails Iceberg table metadata (snapshots/manifests/DVs) and your compactor’s lineage feed and maintains key → (file_uri, row_position, seq).
How it’s used: streaming sinks look up a key, then if found, they emit a positional delete (or DV) before appending the new row. Readers stay cheap even when keys flip often.
Why optional: most tables don’t need it, keeping it out of core catalog avoids heavy SLAs and lock-in. Run it as a sidecar or plugin.

Lookup sketch

POST /v1/tables/{table}/pk-locator/lookup
→ {hit, file_uri, row_position, seq}

Correctness under rewrites: when compaction rewrites files or MDVs replace older ones, it patches the mapping. Writers always have a safe fallback: emit an equality delete if a lookup misses or races compaction.

The catalog PK locator tails table metadata to maintain a small map of recent keys to their last known file and row position. Streaming sinks query it to emit positional deletes on contact, and compaction jobs patch it when files are rewritten. It stays lightweight: no separate store, no strict consistency, just advisory metadata for faster CDC.

Scope and privacy. A PK locator stores pointers, not values. If your compliance model is strict, keep it table-scoped or partition-scoped and encrypt.
Degradation path. If the locator is down, writers fall back to equality deletes. Readers remain correct.
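
The lookup-then-fallback flow can be sketched like this. Everything is hypothetical, including the `PkLocator` class, the `plan_delete` helper, and the file URI; the point is only that a miss or an unavailable locator degrades to an equality delete.

```python
class PkLocator:
    """Advisory map of key -> (file_uri, row_position, seq)."""

    def __init__(self):
        self._index = {}

    def record(self, key, file_uri, row_position, seq):
        current = self._index.get(key)
        if current is None or seq >= current[2]:  # keep the newest position
            self._index[key] = (file_uri, row_position, seq)

    def lookup(self, key):
        return self._index.get(key)


def plan_delete(locator, key):
    """Positional delete on a hit; equality delete on a miss or outage."""
    hit = locator.lookup(key) if locator is not None else None
    if hit is not None:
        file_uri, row_position, _seq = hit
        return {"type": "positional", "file_uri": file_uri,
                "row_position": row_position}
    return {"type": "equality", "key": key}


locator = PkLocator()
locator.record("user-42", "s3://lake/users/data-7.parquet", 1093, seq=18)
hit_plan = plan_delete(locator, "user-42")
miss_plan = plan_delete(locator, "user-99")
down_plan = plan_delete(None, "user-42")  # locator unavailable
```

Readers stay correct in all three cases; only the cost of the delete changes.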

This idea isn’t entirely hypothetical. Projects like Moonlink already prototype an external indexing layer that tails Iceberg metadata on the write path and maintains a live map of key → (file, row position). That lets writers emit positional deletes in real time instead of equality ones, shifting cost from readers to writers.
The catalog-side PK locator sketched here follows the same spirit, just as a lighter, cooperative service rather than a separate database, using a best-effort approach.

Tomorrow?

What this buys you tomorrow

Engines poll a tiny delta (root + MDVs), plan from byte ranges, and touch only the files that changed.
Streaming writers remain simple, hot paths can be precise via a lookup.
Incremental joins become routine because commits across tables share a monotonic order.

That’s the difference between “CDC possible” and “CDC pleasant.” The format gives you identity and compact deltas, the catalog supplies time and targeting, so just enough brains to keep the lake feeling live, without turning it into a database.

Disclaimer:

Some of the mechanics described here are simplified to keep the focus on the bigger picture. The goal here isn’t to capture every edge case, but to sketch how these ideas connect and why they matter for streaming into Iceberg.

I love bad jokes, so also about tomorrow :)

Iceberg CDC: Stream a Little Dream of Me was originally published in fresha-data-engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Good, The Bad and The AutoMQ

Anton Borisov — Thu, 16 Oct 2025 05:56:53 GMT

Diskless Kafka Western

The Wild West of Diskless Kafka

Three gunslingers ride into the same town at dusk, each promising cheaper streams and quieter nights on call.
The town is Apache Kafka and the fight is over where your bytes live and how much you pay to keep them moving.

Traditional Apache Kafka: each broker stores full local replicas of the partition log on its own disk.
Producers write to the leader, which replicates to followers across availability zones. Durability = replication.factor (typically 3). Reads come from page cache, not from shared storage.

For almost fifteen years, Kafka’s shared-nothing, disk-first design has delivered the goods: fast local writes, ISR replication for durability, and page-cache-powered reads. But the cloud changed the terrain. Object stores got faster (and cheaper), cross-AZ egress stayed pricey, and operators began asking a blunt question:

Why are we triple-mirroring every hot byte when the object store already handles durability?

The result is a new frontier: diskless Kafka, or more broadly, shared-storage Kafka. Depending on who you ask, it either trims replication costs without rocking the boat… or throws out the boat entirely and builds a railroad.

This story has three main characters:

The Good — Slack’s KIP-1176 (Fast Tiering): extends Tiered Storage to include the active write-ahead log. Leaders push fresh bytes to a fast local object store (S3 Express One Zone or EBS), followers pull from the store instead of across AZs. Page cache stays king, acks = 1 remains instant, and Slack claims about 40 percent cost savings by cutting cross-AZ traffic. It’s pragmatic, AWS-fluent, and close to merge-ready.
The Bad — Aiven’s KIP-1150 (Diskless Topics): “Bad” as in bold. Writes go straight to object storage, and brokers become compute only. The payoff is radical elasticity and up to 90 percent lower storage cost, but at the price of re-architecting replication, transactions, and caching. Aiven’s newer Diskless 2.0 merges this path with Tiered Storage, offering a single zero-copy format and reversible topic migration.
The AutoMQ — KIP-1183 (Unified Shared Storage): instead of choosing sides, AutoMQ proposes an abstraction layer: AbstractLog, AbstractLogSegment, and an optional Stream API for S3, HDFS, or NFS. It refactors Kafka’s core so local and shared architectures can coexist under one contract. Visionary and vendor-neutral, though based heavily on an already existing implementation, and workable only if the community can stomach another layer of indirection.

Each approach redraws Kafka’s durability contract in its own way.
The next sections follow their trails, first the good, then the bad, and finally the AutoMQ.
Saddle up.

The Good, Slack’s KIP-1176: Fast Tiering Done Right

If Aiven’s diskless dream wants to rebuild the town from scratch, Slack’s proposal is the cautious sheriff: keep the rules, clean up the streets, and save everyone a fortune on bullets.

The problem it rides in to solve

Tiered Storage (KIP-405) already let Kafka move old, closed segments to object storage.

But the expensive part of any Kafka bill isn’t cold history, it’s the hot traffic: three brokers in three AZs endlessly mirroring every new byte.
Slack’s internal metrics showed that cross-AZ replication alone could eat 50–60 % of cluster cost.

So KIP-1176 extends the tiered idea to the active log, the write-ahead stream still forming under producers’ hands.
Instead of sending replicas across zones, the leader uploads those fresh bytes to a fast object store such as S3 Express One Zone (S3E1Z) or EBS.
Followers don’t fetch from the leader anymore, they fetch from the same storage bucket.

Architecture

Think of it as inserting an object-store hop between the leader and its replicas:

Producer → Leader (append to page cache)
        → Fast object store (S3E1Z/EBS)
        → Followers read from object store

Kafka’s RemoteWalStorageManager handles these uploads and downloads in the background.
Every few milliseconds it batches 300-500 KB from multiple partitions, combines them into a single WAL object, and publishes metadata into __remote_wal_log_metadata.
Followers read that metadata, know which byte range to fetch, and stay in sync without ever touching another AZ.
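To make the batching step concrete, here is a minimal sketch of how bytes from several partitions could be packed into one WAL object, with per-partition byte ranges published as metadata. The names (WalSlice, combine_into_wal_object, read_slice) are hypothetical and are not taken from the KIP. The sketch leaves out timing, size thresholds, and the real metadata topic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WalSlice:
    """Metadata telling a follower where one partition's bytes live (hypothetical shape)."""
    topic_partition: str
    object_key: str
    byte_offset: int
    byte_length: int


def combine_into_wal_object(object_key, pending):
    """Concatenate pending per-partition bytes into one WAL object.

    `pending` maps a partition name to the bytes appended since the last upload.
    Returns the combined payload plus one metadata entry per non-empty partition.
    """
    payload = bytearray()
    slices = []
    for topic_partition, data in sorted(pending.items()):
        if not data:
            continue  # nothing new for this partition in this batch
        slices.append(WalSlice(topic_partition, object_key, len(payload), len(data)))
        payload.extend(data)
    return bytes(payload), slices


def read_slice(payload, wal_slice):
    """What a follower would do: read only its own partition's byte range."""
    start = wal_slice.byte_offset
    return payload[start:start + wal_slice.byte_length]
```

The point of the byte-range metadata is that a follower never has to parse the whole object: it looks up its slice and issues a ranged read against the store.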

Because the copy happens asynchronously, producers using acks = 1 see identical latency.
For acks = all, the extra hop through the object store adds only a few milliseconds.
Under failure, the data-loss window roughly doubles (from ≈ 10 ms to ≈ 20 ms), which is still acceptable for most real-time workloads.

The leader writes locally and offloads active segments to a fast object store (e.g., S3 Express). Followers fetch these slices from storage instead of cross-AZ replication, cutting latency and data-transfer costs while keeping standard ISR semantics.

Performance and cost

Slack’s benchmark claims the same throughput and almost identical latency for the acks = 1 path, while reducing broker footprint and AZ traffic by about 43 %.
The magic lies in using a “fast tier” store that’s internally replicated, so Kafka’s replication factor can safely drop from 3 to 2 or even 1 for certain topics.
Page cache remains the performance anchor, so Kafka’s read path doesn’t change.

Compared to the radical KIP-1150 “diskless” approach, KIP-1176 keeps brokers stateful: they still hold local segments and page cache, but that state becomes cheaper to maintain.

Failure handling

Slack’s authors were careful to model full-AZ outages.
If both the leader and its co-located S3E1Z bucket disappear, Kafka simply elects a new leader in another zone, followers truncate or catch up exactly as in the classic model.
Durability and semantics remain unchanged with only the transport layer shifted. In other words, this KIP doesn’t challenge Kafka’s architecture, it optimises its plumbing.

Why it’s Good (pun intended)

What makes 1176 “good” is its restraint. It reuses KIP-405 classes, appending “WAL” variants (RemoteWalStorageManager, RLMWalTask, RemoteWalCombinedLogSegmentMetadata) rather than inventing a parallel subsystem. The design keeps compatibility with existing tiered storage plugins, Iceberg integrations, and Kafka’s consumer expectations.

Operators who already use Tiered Storage can enable the new path with a single topic-level flag:

remote.wal.storage.enable=true

Slack’s proposal shows that Kafka can behave cloud-native without becoming alien to itself.
It doesn’t remove disks, it just stops paying them to gossip across zones.
No new coordinators, no double formats, no fresh abstractions, just a smarter way to move bytes.

It’s easy to underestimate how radical that modesty is.
While others chase perfect statelessness, 1176 cuts the biggest operational bill and leaves everything else working exactly as before.
If the Apache community wants a diskless-adjacent feature it can merge tomorrow, this is the one.

The Bad, Aiven’s KIP-1150: I Shot the Disk

If Slack’s sheriff cleaned up the town, Aiven’s gunslinger came to burn the jail. He doesn’t want leaner replication, he wants brokers with no durable disks at all.

The first shot, KIP-1150 “Diskless Kafka”

The original Aiven proposal imagined Kafka without local storage.
Brokers would act purely as compute and all data would live in object storage such as S3. Durability would come from the object store’s own replication, so Kafka’s replication factor (RF) could drop to 1.

To coordinate writes and reads, KIP-1150 introduced a Batch Coordinator that managed ordering, deduplication, and offsets across producers.
Producers wrote straight to a shared WAL in the object store, consumers fetched through a distributed cache. It replaced leaders and followers with a flat, leaderless design: any broker could serve any partition because there was no local state to own.

Simplified Diskless 1.0: Stateless brokers, coordinator on the hot path.
Producers write directly to object storage through any broker. The Batch Coordinator assigns offsets and tracks batches, serving lookups for consumer fetches. All reads and writes flow through shared object storage, making brokers effectively stateless and durability fully delegated to the cloud layer.

That architecture looked elegant on a whiteboard, yet catastrophic to anyone who’d run Kafka in anger. Core committers (Jun Rao, Colin McCabe) quickly pointed out what would break:

Semantics: transactions and idempotence assume a local leader’s log. You remove that and you rebuild consensus from scratch.
Latency: even S3 Standard means 100–200 ms PUTs, far above Kafka’s < 10 ms target.
Complexity: the coordinator added a new distributed layer inside another distributed system.
Migration: existing topics couldn’t be converted, they’d need to be recreated.

KIP-1150 wasn’t a patch, but a new product.

The Diskless 2.0 rewrite

Aiven didn’t abandon the idea, they rebuilt it.
The September 2025 update, Diskless 2.0 Unified Zero-Copy Kafka, folds Diskless into Tiered Storage (KIP-405) rather than replacing it.
The new principle: write once to shared WAL objects, reuse Tiered for reads.

Producer → Object-store WAL
         → Brokers rebuild per-partition local segments (Tiered format)
         → Consumers read Local → Tiered fallback

Now Kafka still elects leaders and tracks ISR, but only for metadata, not data. The record bytes live in object storage and the metadata log remains replicated in Kafka for coordination. Followers assemble ephemeral local segments from the shared WAL and serve reads out of the page cache; leadership merely decides which broker uploads those segments.

Simplified Diskless 2.0:
Producers append batches to a shared, object-backed write-ahead log (WAL). Brokers rebuild ephemeral per-partition segments on local disks and serve reads locally first, falling back to Tiered Storage when data rolls off cache. The Batch Coordinator only orders writes and keeps bounded metadata, while a rotating upload leader pushes rolled segments through the Tiered Storage plugin.

Key improvements over 1150:

Unified format with Tiered Storage → plugins (Iceberg, RLMM) just work.
Bounded metadata → only ~two segments per partition tracked in the coordinator.
Zero-copy migration → flip a topic type Classic ↔ Diskless without re-ingesting data.
Stable latency → Batch Coordinator off the hot path.

In essence, Diskless 2.0 restored leadership, ISR metadata, and transaction semantics while keeping data durability in the object store.
Kafka’s disks became caches, not sources of truth.

Still, the limits remain clear:

Object storage is the only durable copy (RF = 1 for data).
Cold starts rebuild caches from WAL objects.
Fast, single-AZ stores (e.g. S3 Express) are assumed.
The code lives today in Aiven’s Inkless fork, not Apache Kafka core. Aiven confirmed they’ll upstream “once Tiered and Diskless formats converge.”

The verdict

KIP-1150 began as an act of rebellion: a clean slate that broke every rule in Kafka’s book.
Diskless 2.0 rewrites that act as reform: it keeps the dream of stateless brokers but rebuilds on community foundations.
It’s no longer “no disks”, but “no redundant disks.”

Technically audacious, politically feasible, but still a long ride from a merged Apache feature.
For now, the Diskless outlaw roams its own fork, testing how far Kafka’s principles can bend before they break.

The AutoMQ, KIP-1183: Storage Unification by Refactor

AutoMQ doesn’t start at replication policy, it starts at the log engine boundary. KIP-1183 proposes a new storage substrate inside Kafka so both disk-backed and shared-storage engines can plug into the same core. Concretely: introduce AbstractLog/AbstractLogSegment as first-class base types, keep today’s UnifiedLog as the file-backed implementation, and add a SharedLog implementation for object/file/DFS backends. Managers that currently depend on UnifiedLog (log manager, cleaner, recovery) would instead target the abstract interfaces.

Because object storage isn’t a filesystem, the KIP sketches an optional Stream API (append/fetch/trim over stream slices) and uses four streams per partition family: Data, Time, Txn, and Meta. The MetaStream holds producer snapshots and mapping info; when leadership moves, a new broker reloads metadata from the MetaStream to resume. Kafka’s controller (KRaft) would gain a tiny KV client and three controller RPCs to map partitions to their MetaStream IDs.
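As a rough illustration of what an append/fetch/trim contract looks like, here is a toy in-memory stream. It is a sketch under my own assumptions: the class and method names are hypothetical, not the signatures from KIP-1183. A real backend would persist to shared storage rather than a Python list.

```python
class InMemoryStream:
    """Toy append/fetch/trim stream; a real backend would persist to shared storage."""

    def __init__(self):
        self._records = []      # records still retained
        self._start_offset = 0  # offset of the first retained record

    @property
    def start_offset(self):
        return self._start_offset

    @property
    def next_offset(self):
        return self._start_offset + len(self._records)

    def append(self, record):
        """Append one record and return the offset assigned to it."""
        offset = self.next_offset
        self._records.append(record)
        return offset

    def fetch(self, start, max_records):
        """Return up to max_records records beginning at offset `start`."""
        if start < self._start_offset or start > self.next_offset:
            raise IndexError(
                f"offset {start} outside [{self._start_offset}, {self.next_offset}]"
            )
        begin = start - self._start_offset
        return self._records[begin:begin + max_records]

    def trim(self, new_start):
        """Drop every record below `new_start`, clamped to the valid range."""
        new_start = max(self._start_offset, min(new_start, self.next_offset))
        del self._records[:new_start - self._start_offset]
        self._start_offset = new_start


def open_partition_streams():
    """One stream per role named above: Data, Time, Txn, Meta."""
    return {name: InMemoryStream() for name in ("data", "time", "txn", "meta")}
```

Offsets stay stable across trims, which is what lets a new leader resume from a position it read out of the Meta stream.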

What AutoMQ’s model really means

Durability & RF. Data durability lives entirely in the shared store, so Kafka typically runs RF=1 at the Kafka layer. Failover therefore requires the next broker to rebuild state from storage (MetaStream + streams), which AutoMQ reports as roughly 1–2 s in practice. Reviewers pressed on what this looks like with many idempotent producers and transactions.
Leader semantics. AutoMQ stays leader-based (in contrast with leaderless 1150). However, brokers don’t exchange rich state, they reload it from MetaStream/KV on leadership change, which explains the slower failovers versus hot in-memory ISR hand-off.
Scope. The KIP is intentionally about surfaces, not full implementation detail yet: it aims to “make storage pluggable” first, then let engines evolve behind the interface (Stream is recommended, but nominally optional).

Simplified AutoMQ arch: Producers write to the partition leader, which appends to shared-storage streams.
New leaders reopen those streams in read mode, fence the old writer, and serve consumers directly from shared storage. Durability lives entirely in the Stream API layer, brokers are stateless.
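“Fence the old writer” is easiest to picture with an epoch counter. The sketch below is my own simplification, not AutoMQ’s implementation: I am assuming fencing works by bumping an epoch when a new leader opens the stream. FencedStream and its methods are hypothetical names.

```python
class FencedStream:
    """Toy stream where only the most recent writer epoch may append."""

    def __init__(self):
        self.epoch = 0
        self.records = []

    def open_as_writer(self):
        """A new leader bumps the epoch; holders of older epochs are fenced."""
        self.epoch += 1
        return self.epoch

    def append(self, writer_epoch, record):
        """Append a record, rejecting writers that hold a stale epoch."""
        if writer_epoch != self.epoch:
            raise PermissionError(
                f"writer epoch {writer_epoch} is stale, current epoch is {self.epoch}"
            )
        self.records.append(record)
        return len(self.records) - 1
```

The old leader does not need to know it lost leadership: its next append simply fails, so a late write can never land behind the new leader’s back.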

How it differs from Aiven (1150 → Diskless 2.0)

Both AutoMQ and Aiven use a WAL-on-shared-storage idea, but they wire it into Kafka differently:

Engine vs pipeline. Aiven keeps Kafka’s current engine and unifies the pipeline: write once to shared WAL objects, then replicas re-materialize per-partition local segments in Tiered format for reads (“Tiered becomes the compactor”). AutoMQ changes the engine: a SharedLog built on Stream becomes a peer to UnifiedLog.
State exchange. Diskless 2.0 re-introduces leaders and ISR (for metadata) and has replicas assemble caches from WAL so fetches are local and failover stays within normal ISR choreography. AutoMQ’s brokers don’t exchange metadata beyond KRaft KV/MetaStream, new leaders re-open streams and reload state, trading simplicity for ~second-level recovery.
Where RF applies. In Diskless 2.0, Kafka’s RF applies to the metadata log, data is single-sourced in the object store (and Tiered is reused for the read-optimized view). In AutoMQ’s design, Kafka runs RF=1 and all durability comes from the shared backend, yet availability hinges on how fast a broker can reload MetaStream and reopen streams.

KRaft déjà vu

The proposal adds new core APIs (AbstractLog, SharedLog, Stream, KVClient, plus controller RPCs). That’s powerful but it’s also merge surface. Reviewers explicitly flag fragmentation risk: too thin and plugins reinvent core logic (fencing, compaction/merge, caches), too thick and Kafka’s core complexity balloons. It’s the classic trade-off, just as it was with KRaft: big architectural win, big consensus tax. Even AutoMQ acknowledges the risk and frames the refactor as a way to reduce long-term forking.

Bullet summary

What it is: a storage-engine refactor (AbstractLog + Stream + KV) so Kafka can run on shared storage without forking the codebase.
How it works: WAL-like streams per partition family, brokers reload MetaStream on failover, RF=1 in Kafka, durability in storage.
How it differs from Aiven: Aiven unifies the pipeline and keeps ISR for metadata; AutoMQ changes the engine and accepts slower failover in exchange for a cleaner “stateless” model.
Biggest risk: API fragmentation and second-order complexity in plugins vs core, following the infamous KRaft lesson.

The reality: AutoMQ already runs the future

WarpStream was the first to market the idea of a diskless, object-store Kafka service, but AutoMQ were the first to make it work at visible production scale and to document it openly.
They built it, ran it in production, and shared their design and findings openly, giving everyone else, Aiven included, a real implementation to learn from and improve.

Their openness stands in sharp contrast to cloud vendors’ closed models like MSK Express brokers, which are probably running diskless in some form, yet remain completely opaque.
With AutoMQ, we can trace how the system works, why certain choices were made, and where it can improve.
That transparency (messy, evolving, but real) is what keeps progress visible.

That’s why I have a soft spot for AutoMQ: they don’t just build the future, they open it up for everyone to build together, while continuing to push the frontier with new ideas and features.

The Frontier Shifts

Something irreversible has started.
Kafka is no longer a broker farm with disks and replication: it’s becoming a topology of storage, compute, and intent.
The question is not if it detaches from local state, but how that detachment will be expressed and standardized.

Slack and Aiven are defining the safe roads, pragmatic ways to make Kafka cheaper, lighter, and easier to operate.
AutoMQ has already crossed the river and is building on the other side.
Their model isn’t the destination for everyone, but it proves that Kafka can survive and even thrive when storage is no longer a substrate but a medium.

What happens next will blur the boundary between streaming and storage entirely.
Kafka may stop being a product you deploy and become a behavior you compose: elastic, stateless, and transparent in cost and durability.
The work done now by Slack, by Aiven, by AutoMQ is not about who arrives first, but about expanding what Kafka can mean.

Acknowledgment

Thanks to the teams at Slack, Aiven and AutoMQ for pushing the frontier forward.
You’re doing the hard work that keeps Kafka evolving and the rest of us curious.

Beyond the snapshots and the segments, something larger is forming: a log without the wall.

Pink Floyd, obviously. Like I’d settle for less. :P

The Good, The Bad and The AutoMQ was originally published in fresha-data-engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Switching me Softly

Anton Borisov — Mon, 29 Sep 2025 12:38:35 GMT

Zero-downtime 12 -> 17 PostgreSQL Upgrade at Fresha

Killing me softly with his bash song,
Zero-downtime all along…

Artwork by Bojan Jevtić (https://bojan-jevtic.pixels.com). Used with permission.

Introduction

Upgrading PostgreSQL in production is never just about running pg_upgrade.
At Fresha we run 200+ PostgreSQL databases, and by the end of 2024 around 20 of our most critical and heavily loaded ones were still stuck on Postgres 12. PostgreSQL 12 was heading for end-of-life in November 2024 (February 2025 on RDS), but the risk and pain of upgrades had been high enough that we had postponed them again and again.

This created a dangerous situation: business-critical data on a soon-to-be unsupported version, and no safe path forward. We had Debezium streaming change events into Kafka, outbox connectors producing ordered domain events, and fleets of replicas serving reads. Dropping into “maintenance mode” for even a few minutes was unacceptable.

For years, upgrades had been treated like staring into car headlights and everyone froze. I wanted to break that cycle. By combining database internals, streaming knowledge, and a healthy dose of bash automation, I showed how zero-downtime upgrades could work in practice. Once we proved it on the hardest PG12 databases, we scaled the method into a repeatable solution for dozens of teams and hundreds of databases.

Let’s step back for a moment: the obvious path was always “schedule maintenance windows” and take the hit. But that’s not realistic when your system is 24/7 and globally distributed. The real challenge wasn’t just moving the data, it was dealing with the entire ecosystem around it:

Debezium CDC connectors continuously streaming changes into Kafka.
Outbox pattern connectors generating ordered business events.
Replication slots and WAL management that can’t just be dropped and recreated without data loss.
Physical replicas serving production reads.
PgBouncer pools handling thousands of concurrent client connections.

Instacart’s zero-downtime cutover post describes an approach that is well known in the DBA world:

Take a consistent base backup (RDS snapshot or pg_basebackup).
Keep a replication slot open on the primary so WAL keeps accumulating.
Align the replica’s replication origin with the backup’s last known LSN.
Stream the changes from the slot until the replica catches up.
Promote the replica and cut traffic over.

Logical replication during PostgreSQL blue-green upgrade: WAL is accumulated while restoring the snapshot, then replication origin is advanced on the new cluster to catch up seamlessly.

That mechanism works, but it’s only the foundation. In practice it doesn’t address the messy parts we faced: Debezium connectors tied to those slots, outbox event ordering, sequence alignment, PgBouncer switchovers, and idempotency guarantees. Those were the hard problems we had to solve to make upgrades safe and repeatable at Fresha. And yes, we didn’t stop at just one major version. We did 12 -> 17.

Why Classic Approaches Don’t Work

The classic “restore from snapshot, upgrade, and cut over” is conceptually simple. But in reality:

Logical replication must continue through the upgrade process.

Every physical replica needs to remain consistent and available for read-only queries.

Kafka connectors tied to slots on the old cluster must transition cleanly, otherwise they either lose data or emit duplicates in unpredictable order.

Before building our own method, we looked at the obvious options:

RDS Blue/Green Deployments

Amazon RDS offers Blue/Green upgrades that spin up a parallel environment and promise minimal downtime. Sounds perfect until you realize it doesn’t integrate with the logical replication slots and Debezium connectors we rely on. Blue/Green can swap endpoints, but it doesn’t handle Kafka offsets, connector states, or differing downstream deduplication requirements. For us, that’s a non-starter.

Direct pg_upgrade on Production

In-place upgrades sound easy: just stop Postgres, run pg_upgrade, restart. But that implies minutes or hours of downtime on terabyte-scale databases. It also assumes you’re comfortable with no rollback option beyond restoring backups. We weren’t.

So we built something different:

Start target database as a logical replica restored from a snapshot.
Rebuild its physical replicas alongside, so the new cluster mirrors the old one.
Use PgBouncer as the traffic switch, with a scripted pause/resume to freeze connections during the switchover.
Design explicit Debezium handling modes for both CDC and outbox connectors.

Synchronizing slots and connectors: duplicate replication slots, create new Kafka Connect connectors, then switchover PgBouncer to move traffic to the upgraded database.

The rest of this article walks through how we turned that into an automated, developer-friendly upgrade framework that anyone on our team could run in ~5 minutes per database.

Winter Is Coming: Prewarm & Analyze

Catching up a replica is only half the story. Before you flip traffic, you need to make sure the new cluster can actually carry production.

Consistency checks. On staging we ran PG 17 with full app tests: migrations, queries, extensions. Don’t skip this. If the app isn’t green here, it won’t be green in prod.

Prewarm and analyze. A fresh RDS restore is cold. The first queries crawl while blocks are pulled in from object storage onto disk. We prewarmed critical tables and ran ANALYZE so stats were correct and the planner didn’t go wild on the first real query.

Two-phase switchover.

Read-only flip: move replicas first. All writes still hit Blue, but reads come from Green. This is the production smoke test — if queries fail here, you still have a rollback button.
Full flip: once RO is solid, pause PgBouncer pools, wait for replication to be fully caught up, then point writes at Green. Under the hood, “full” just means “RO first, then RW.”

PgBouncer routes traffic during switchover: replicas can be switched read-only first, followed by a full master switchover.

PgBouncer tricks. Running it on Kubernetes comes with its own set of quirks. One of the big ones: how do you safely reschedule pods without jolting the on-call awake at 3 a.m.? That rabbit hole probably deserves its own article.

In our case, the pain point was ConfigMaps. Redeploying them was just too slow. The fix? Pre-mount all configs on every pod, and when it’s time to switch over, skip the redeploy and just fire off a few admin commands.

SET conffile = '/etc/pgbouncer/pgbouncer_new_rw.ini';
RELOAD;

No pod restarts, no reconciliation delay.

Mind the Gap (setval +100k)

Advance early, outside the pause. Right before switchover we run the sync_sequences script, which iterates all sequences from pg_sequences on Blue, reads each sequence’s last_value, and then bumps the same sequence on Green to last_value + sync_sequences_gap (default 100000). This keeps the flip window short.
Exact mechanics. For each schemaname.sequencename:

-- On Green
SET transaction_read_only = OFF;
SELECT setval('<schemaname.sequencename>', <last_value> + <sync_sequences_gap>, true);

Why this is fine. We only need sequences to never collide after the flip. Pushing Green well ahead of Blue guarantees new inserts won’t reuse a value, and sequences are monotonic anyway — no need to “pull back” on rollback.
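For illustration, the per-sequence step could be generated like this. It is a sketch, not our actual sync_sequences script: build_setval_statements and quote_ident are hypothetical helpers, and the input rows are assumed to be (schemaname, sequencename, last_value) tuples read from pg_sequences on Blue.

```python
def quote_ident(name):
    """Double-quote an identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_setval_statements(blue_sequences, gap=100_000):
    """Build one setval statement per sequence, to be executed on Green.

    `blue_sequences` is an iterable of (schemaname, sequencename, last_value)
    rows taken from pg_sequences on Blue.
    """
    statements = []
    for schema, name, last_value in blue_sequences:
        if last_value is None:
            # pg_sequences reports NULL for a sequence that was never read from
            continue
        qualified = f"{quote_ident(schema)}.{quote_ident(name)}"
        literal = qualified.replace("'", "''")  # escape for the SQL string literal
        statements.append(f"SELECT setval('{literal}', {last_value + gap}, true);")
    return statements
```

Generating the statements up front also makes the step easy to review: you can print them and eyeball the targets before anything touches Green.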

Handling Debezium During Switchover

The problem: Debezium ties each connector to a logical replication slot on the source (Blue). When you bring up the target (Green) and want the same data to continue into the same topics, you may want to accept a short overlap window where both Blue and Green can emit the same changes.

CDC to Warehouse (easy mode: dedup)

For CDC → Debezium → Snowflake, we already deduplicate downstream. So we intentionally create a small overlap:

Managing WAL overlap: both Blue and Green CDC slots accumulate changes, with a controlled overlap to ensure continuity.

Keep Blue’s connector running.
Create Green’s connector with the same schema/topic config, start from the fresh slot on Green.
Allow a short duplication window (configurable) so both produce the same tail of WAL.
Stop Blue’s connector.

Result: Snowflake reconciles quickly and retains one copy of each row/version. This is reliable, fast, and zero-risk if dedup is in place.
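To show why the overlap is harmless, here is a small sketch of downstream deduplication. It is not our Snowflake logic. It assumes each change event carries a primary key and an application-level row version (the field names pk and version are hypothetical). Note that LSNs can’t serve as the identity here, because Blue and Green are separate clusters with their own WAL positions.

```python
def deduplicate(events):
    """Keep the first copy of each (primary key, row version), preserving order."""
    seen = set()
    unique = []
    for event in events:
        identity = (event["pk"], event["version"])
        if identity in seen:
            continue  # same row version already delivered by the other connector
        seen.add(identity)
        unique.append(event)
    return unique
```

With a rule like this in place, it doesn’t matter which connector delivered a given row version first: the second copy is dropped.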

Outbox/Event Streams (stricter)

Outbox topics care about ordering semantics. Our rule of thumb:

During the flip there should be no “funny ordering”, so the consumer either sees an event it has already processed, or it’s a brand-new event.

We guarantee that by:

Pausing writes on Blue right before the RW flip (brief), giving Debezium time to flush the tail (controlled in config).
Only then starting the Green connector.
Keeping the topic keying and partitioning identical, so any duplicates are the same key in the same partition.

Outbox connector cutover: old connector flushes WAL, new connector starts reading from the overlapped slot on the upgraded database.

If your consumers are idempotent (exactly-once at the business level), this pattern is safe: duplicates are replays of the same message, never re-ordered across the cut.

Slots & offsets: You don’t “move” a slot, you create a new slot on Green and start the connector there. Old connector keeps its Blue slot until you stop it. Kafka offsets continue per-topic/partition and duplicates are identical payloads, not new keys.

Overlap discipline: Keep the window short. Long overlaps create needless churn and noise in metrics.

Sensible checks: Only flip RW after:

Blue→Green replication lag under threshold.
Outbox writes paused and flush complete (we wait a bit to drain).
Green connector running and healthy.
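The three checks above can be expressed as a single gate. This is a sketch with hypothetical function names, not the code from our script. It assumes lag is measured as the byte distance between two LSNs, and that connector health is read as the state string returned by the Kafka Connect status endpoint.

```python
def lsn_to_int(lsn):
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(source_lsn, replayed_lsn):
    """Bytes of WAL the target still has to apply (never negative)."""
    return max(0, lsn_to_int(source_lsn) - lsn_to_int(replayed_lsn))


def ready_for_rw_flip(source_lsn, replayed_lsn, max_lag_bytes,
                      writes_paused, flush_complete, green_connector_state):
    """Return (ok, failed_checks) for the read-write flip."""
    checks = {
        "lag": replication_lag_bytes(source_lsn, replayed_lsn) <= max_lag_bytes,
        "outbox_drained": writes_paused and flush_complete,
        "green_connector": green_connector_state == "RUNNING",
    }
    failed = [name for name, passed in checks.items() if not passed]
    return not failed, failed
```

Returning the list of failed checks, rather than a bare boolean, makes the abort message useful when a flip is refused.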

Rollback: If anything looks off, stop the Green connector, resume Blue, unpause writes. Because we haven’t changed keys/ordering, consumers remain consistent.

Orchestration mode: strict ordering, zero confusion

When idempotency isn’t a sure bet for outbox consumers, we split the stream cleanly: old DB emits only “old” events, new DB emits only “new” events. No interleaving, no ambiguity.

How we mark events

We add a boolean version column to outbox tables, e.g. use_pg17 BOOLEAN DEFAULT NULL (present on all partitions).
Right before the flip, we change the default on the target (Green) to TRUE (and keep it NULL/FALSE on source/Blue). New writes on Green automatically carry use_pg17=true.

How we route events

We create a new outbox connector on Green with a JSR-223 Groovy filter that only passes rows where use_pg17 == true. The old connector keeps running on Blue but filters to “not true” (null/false). This cleanly divides the topic stream.
Green filter:
value.after.use_pg17 == true
Blue filter:
value.after.use_pg17 == null || value.after.use_pg17 == false
In the connector config we set:

transforms=filter,outbox
transforms.filter.type=io.debezium.transforms.Filter
transforms.filter.language=jsr223.groovy
transforms.filter.condition=<Green or Blue filter expression>
transforms.outbox.type=io.debezium.transforms.outbox.EventRouter
snapshot.mode=never

This is applied to old and new connectors.
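The property that matters is that the two filters are complementary: every outbox row passes exactly one of them. Here is the same pair of predicates mirrored in Python as a quick sanity check. The function names are mine, and `after` stands in for the row image the Groovy expression sees as value.after.

```python
def green_filter(after):
    """Mirror of the Green condition: use_pg17 == true."""
    return after.get("use_pg17") is True


def blue_filter(after):
    """Mirror of the Blue condition: use_pg17 is null or false."""
    flag = after.get("use_pg17")
    return flag is None or flag is False


def routed_by(after):
    """Name the single connector that would emit this row."""
    if green_filter(after) == blue_filter(after):
        raise ValueError("filters must be complementary")
    return "green" if green_filter(after) else "blue"
```

If someone later edits one condition without the other, a check like this is the cheapest way to catch rows that would be emitted twice or not at all.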

Switchover choreography

Prep: ensure use_pg17 exists on all outbox tables; the script checks this and fails fast if missing.
Flip default on Green: set use_pg17 DEFAULT TRUE so any new events after the write flip carry the tag.
Start Green outbox connector with the use_pg17==true filter, old Blue connector still runs with “not true” filter.
Pause writes on Blue, let Debezium flush so the old connector processes the remaining events, switch over to Green, then resume writes: Green now produces new (tagged) events and filters out duplicates from the old stream.
Retire Blue connector once the tail is drained; drop the filter on Green if desired (after Blue is off, everything is “new” anyway). Connector lifecycle and config mutation are automated via the Connect REST API in the script.

Outbox orchestration: old and new connectors overlap, using a version flag to filter duplicate events during switchover.

Why this works

Topics never receive mixed “old/new” events during the window. Consumers either see already-processed events (from Blue) or brand-new events (from Green). There’s no reordering across the boundary because we pause writes, drain, then resume.
Rollback is symmetric: flip the defaults back, adjust filters (the script handles forward/reverse conditions automatically).

The Greatest Teacher, Failure Is

No matter how carefully you plan, switchover days have a way of surfacing the unknowns. We built in multiple safety nets to make sure “wrong” never meant “irreversible.”

CDC connectors are the easy part.
If a Debezium CDC connector stumbles, our downstream (Snowflake) is already deduplicating on keys, so we can always re-consume the stream. Worst-case, we spin up a new connector with snapshot.mode=initial against a trimmed dataset, relying on table indices to cut over at a known safe point. Consumers will reconcile cleanly.

Outbox connectors are trickier.
Because they carry business events, we can’t just blindly replay snapshots. That’s why we leaned on idempotent consumers and, when possible, the Kafka Connect REST API to drop offsets. On older Connect versions, the fallback was heavier: create a fresh connector/slot pair, trim the table online to the events we care about, then re-bootstrap. You may use something like my previous article as a recipe.

If the new database turns out to be unusable.
Our orchestration doesn’t just go forward; it can also reverse. Once the switchover completes, we can run the same scripts in reverse: switch PgBouncer back, flip replication slots/publications, and re-attach connectors to the old cluster. In theory, you can loop this forward/backward multiple times until you’re satisfied.

Dry-run mode as insurance.
We even added a “test catchup” mode in switchover.sh: it pauses writes, lets logical replication and slots catch up, but never actually flips traffic to the new cluster. This gave us two superpowers:

measure how long the pause would last under real workload,
test whether the application tolerated read-only windows.

If it didn’t? We’d just resume writes on the source and treat it as a rehearsal, not a disaster.

Takeaway: A zero-downtime upgrade isn’t just about the happy path. It’s about building escape hatches: reset CDC, replay outbox carefully, reverse the switchover if needed, and dry-run until you trust the process.

Everything Everywhere All at Once, known as YAML

One of the biggest lessons we learned after the first few upgrades was: you don’t want hardcoded scripts. Each database had its quirks: different connector sets, replica layouts, outbox usage, so we pushed everything into a YAML config and a single entrypoint.

Yeah, you can do modules with Bash :)

Our switchover.sh script is really just a thin wrapper around sub modules in lib/, reading from switchover-config.yaml. That config defines:

Source and target clusters

source:
  internal_name: postgres_source
  host: localhost
  port: 5433
  replicas:
    - name: postgres_source_ro_1
    - name: postgres_source_ro_2
target:
  internal_name: postgres_target
  host: localhost
  port: 5434
  replicas:
    - name: postgres_target_ro_1
    - name: postgres_target_ro_2

This lets us model not just primaries, but physical replicas, so PgBouncer can safely switch read-only traffic separately from read-write.

Database-level details (slots, publications, subscriptions):

database:
  name: testdb
  user: testuser
  password: testpass
  logical_slot: dms_slot
  logical_publication: dms_pub
  logical_subscription: dms_sub
  transaction_wait_time: 3        # allow txs to finish before going RO on Blue

PgBouncer pools and config files for pre-delivered switchover configs:

pgbouncer:
  switchover_rw_config_file: pgbouncer_new_rw.ini
  switchover_ro_config_file: pgbouncer_new_ro.ini

Kafka connectors and how they map to slots:

kafka:
  connect_clusters:
    - name: cdc
      url: http://localhost:8083
    - name: outbox
      url: http://localhost:8084
debezium:
  outbox:
    version_column: use_pg17
    tables:
      - outbox_events_1
      - outbox_events_2

Replication tolerances (lag thresholds, sequence gaps, catchup timeout):

replication:
  max_lag_bytes: 20000        # What to consider no lag.
  catchup_timeout: 10         # How long to pause to catchup
  sync_sequences_gap: 100000  # Some safe gap for sequences, 
# we don't want to create sequences on 1k tables during pause :)
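
Those tolerances boil down to a simple gate: keep polling lag until it drops under max_lag_bytes, or give up once catchup_timeout expires. Here is a minimal Python sketch of that idea. Our real modules are Bash, so treat this as an illustration only: get_lag_bytes is a hypothetical callable, the timeout is assumed to be in seconds, and the sequence helper assumes the gap is simply added to the source's last value.

```python
import time


def wait_for_catchup(get_lag_bytes, max_lag_bytes=20000, catchup_timeout=10,
                     poll_interval=0.5, clock=time.monotonic, sleep=time.sleep):
    """Return True once replication lag is within tolerance, False on timeout.

    get_lag_bytes is a hypothetical callable returning the current lag in bytes.
    catchup_timeout is assumed to be in seconds.
    """
    deadline = clock() + catchup_timeout
    while True:
        if get_lag_bytes() <= max_lag_bytes:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)


def bumped_sequence_value(source_last_value, sync_sequences_gap=100000):
    """Start the target sequence safely ahead of the source's last value (assumed semantics)."""
    return source_last_value + sync_sequences_gap
```

The gap trades a hole in the ID space for not having to sync every sequence precisely while traffic is paused.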

Why expose so many options? Because upgrades are never uniform:

Sometimes you want duplication mode for Debezium (let duplicates flow, downstream deduplicates).
Sometimes you need orchestration mode (JSR filters + version columns to separate old vs new events).
Some teams only wanted to test read-only switchover, without touching writes.
On the biggest DBs, we needed to pause transactions, wait for catchup, then flip. On smaller ones, we could skip the wait entirely.

Instead of maintaining a dozen scripts, we ended up with a single command:

./run.sh switchover \
  --direction forward \
  --mode full \
  --debezium-mode orchestration \
  --transaction-mode wait \
  --test-catchup disabled \
  --config configs/switchover-config.yaml

Every option maps directly to one of those operational decisions. This made the process reproducible, reversible, and explainable.
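
To make the "one flag per decision" idea concrete, here is a rough Python sketch of the same interface using argparse. It is not our entrypoint (that one is Bash). Only the values shown in the command above come from the real tool; the alternative choices (reverse, read-only, duplication, skip, enabled) are assumptions inferred from the modes described earlier.

```python
import argparse


def build_parser():
    """Sketch of the switchover CLI; alternative choice values are assumptions."""
    parser = argparse.ArgumentParser(prog="run.sh switchover")
    parser.add_argument("--direction", choices=["forward", "reverse"], default="forward")
    parser.add_argument("--mode", choices=["full", "read-only"], default="full")
    parser.add_argument("--debezium-mode", choices=["duplication", "orchestration"],
                        required=True)
    parser.add_argument("--transaction-mode", choices=["wait", "skip"], default="wait")
    parser.add_argument("--test-catchup", choices=["enabled", "disabled"],
                        default="disabled")
    parser.add_argument("--config", required=True)
    return parser
```

Restricting each flag to a closed set of choices means a typo fails fast at parse time, not halfway through a switchover.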

Results

Between January and February, we upgraded 20+ PostgreSQL databases from PG12 to PG17 without a single minute of downtime. Even the largest and most sensitive database, the one where we couldn’t assume idempotent consumers, was migrated safely using the orchestration trick with version columns and JSR filters.

And yes, after the final switchover, we had a cake. 🎂

But more importantly, we ended up with a repeatable framework:

The same scripts and configs could be applied across clusters.
Every option (duplication vs orchestration, test catchup vs full switchover) was codified and reusable.
Rollback and dry-run modes gave us confidence under pressure.

End-to-end switchover flow: from prewarming and replication slot setup, through pausing PgBouncer and Debezium orchestration, to resuming traffic on Green.

Conclusion

Zero-downtime PostgreSQL upgrades are not just possible; they’re now a reality for our team. With careful orchestration of CDC connectors, outbox event streams, replication slots, and PgBouncer traffic, we moved our production fleet forward to PG17 without disruption.

That said, the switchover is only half the story. It is elegant in itself, but the supporting work can still feel tedious:

disabling migrations while replicas catch up
tweaking PgBouncer configs
changing Terraform definitions for connectors
creating PRs, waiting for approvals, and fixing the inevitable typo.

We’re not stopping here. The scripts and configs are good enough today that teams across Fresha can already run upgrades on their own by following the documented process. The next step is to take the human glue out of it:

templated configs
PR generation
orchestration with Step Functions on AWS
and finally a workflow where upgrading Postgres feels like a safe, boring routine.

We’ve proven this works at Fresha scale, and now we’re investing in making it a one-button operation. When PG18 lands, we don’t want anyone on-call to break a sweat.

Credits

Anton Borisov — rule-breaker and architect of the impossible, brought the idea nobody thought would fly, built the first script, solved the hardest DB hurdle, and set the pace as the project’s boldest innovator. Basically the one person you shouldn’t let near production… unless you want results.
Emiliano Mancuso — Head of infrastructure, believed in the process when others said it was impossible. Also doubles as our premier rubber duck.
Jan Zaremba — leader of Infrastructure, encyclopedic knowledge and steady hand that made every iteration of the implementation safer than the last.
Robert Pyciarz — doomsday architect, mapped out every failure path and tucked Kubernetes into all the right corners.
Paritosh Anand & Paweł Michna — deadline wrestlers, ran upgrades at crunch time and wrote down every step so the rest of us could sleep.
Blend Halilaj — snapshot tactician, turned the idea of blue-green RDS restores into a clean, repeatable workflow.

And a special encore for Rehan Ullah, Sam Stewart, and (again) Paweł Michna, the crew who is turning all of this into a one-button upgrade right now.

Switching me Softly was originally published in fresha-data-engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

What the Fuss with Fluss: Flink Delta Force

Anton Borisov — Mon, 22 Sep 2025 15:29:37 GMT

Released July 31, 2025, Flink 2.1 dropped a bombshell on the streaming world: DeltaJoin and MultiJoin. These aren’t your typical join operators, they’re surgical instruments designed to excise the “state tumor” that’s been metastasizing in production Flink clusters for years.

After messing with Flink/Fluss you’ll get addicted to otters

Every Flink team knows the dirty secret: join state inflates faster than a London rental listing. You start with gigabytes, graduate to terabytes, then watch helplessly as checkpoints stretch from seconds to minutes to “let’s just restart the job.” Traditional streaming joins hoard everything — every customer record, every order, every update, because they might need it someday. It’s digital hoarding at datacenter scale.

The numbers are sobering. For many teams, a modest e-commerce platform joining orders with customers accumulates tens of gigabytes of state per billion events. Add product catalogs, inventory, shipping — suddenly you’re managing terabytes of state across your topology. Checkpoints that started at 30 seconds stretch to 5 minutes, then 20. Recovery time follows the same trajectory. Your on-call rotation becomes a game of Russian roulette: whose turn is it when the next OOM brings down production?

Sure, you can fight back: state TTLs to expire old data, interval joins with time bounds, broadcasting small dimension tables, or even falling back to batch reprocessing. Some teams pre-partition data to reduce per-operator state or switch to temporal joins where possible. But these workarounds either limit functionality or add operational complexity. That’s why the industry is racing toward fundamental solutions: DeltaJoin externalizes state entirely, RisingWave uses cloud-native shared storage, and Feldera pre-materializes everything. Different philosophies, same recognition that traditional join state is broken.

DeltaJoin flips the script: why store when you can fetch? Instead of maintaining partner history in operator state, it queries an external indexed store at emit time. Flink’s DeltaJoin with eventual consistency always probes the latest counterpart at emit time (no snapshot pinning). The payoff is dramatic: minimal checkpoint overhead, lightning-fast recovery, and jobs that actually scale elastically instead of just claiming to.

Streaming is being reinvented in front of us.

Flink/Fluss, RisingWave, and Feldera are not rivals so much as different philosophies of state, and paying attention now is how we learn which ideas will define the next decade.

I’ll focus mainly on Flink and Fluss, but the story wouldn’t be complete without glancing at how RisingWave and Feldera tackle the same challenge from other angles.

Enter Fluss: The Storage Layer That Gets It

This lookup-based rebellion demands a cooperative storage layer. Apache Fluss (Incubating) isn’t just another streaming store, it’s purpose-built for this exact pattern. While other stores make you choose between streaming updates and efficient lookups, Fluss delivers both through its prefix lookup capability on primary-key tables.

The magic is in the prefix. Most lookup stores demand exact keys, meaning you need the complete primary key or you’re out of luck. Fluss lets you probe with just a prefix. Got a composite key (customer_id, order_id, item_id) but only know the customer? Traditional stores would force a full scan or, where supported, a secondary index. Fluss says "no problem": your prefix lookup becomes a tight RocksDB range scan on a single tablet. Paimon's PK-based lookup works great when your join keys match the primary key exactly; Fluss's prefix lookup makes composite-key joins practical when only a prefix is available, which is the typical case in real-world enrichment pipelines.

Prefix lookups make the difference: Paimon requires full primary key matches (or workarounds like nested aggregates), while Fluss supports probing with just a prefix — making composite-key joins practical in real pipelines.
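
If "prefix lookup becomes a range scan" sounds abstract, here is a toy Python sketch of the mechanics. It is not Fluss or RocksDB code: SortedKV is a made-up stand-in for an ordered keyspace, and composite keys are plain tuples. The point is only that in a sorted keyspace all keys sharing a prefix sit next to each other, so one seek plus a short forward walk is enough.

```python
from bisect import bisect_left


class SortedKV:
    """Toy stand-in for an ordered KV store (keys kept sorted, like an LSM keyspace)."""

    def __init__(self, rows):
        self._items = sorted(rows.items())
        self._keys = [k for k, _ in self._items]

    def get(self, key):
        """Exact-match lookup: needs the complete composite key."""
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._items[i][1]
        return None

    def prefix_lookup(self, prefix):
        """Seek to the first key >= prefix, then walk while the prefix still matches."""
        i = bisect_left(self._keys, prefix)
        out = []
        while i < len(self._keys) and self._keys[i][:len(prefix)] == prefix:
            out.append(self._items[i])
            i += 1
        return out
```

With keys shaped like (customer_id, order_id, item_id), prefix_lookup((42,)) returns every row for customer 42 without touching the rest of the keyspace, while get((42,)) finds nothing because it demands the full key.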

Under the hood, each Fluss table bucket maps to a KV tablet backed by RocksDB, plus a child log tablet for the changelog. When your lookup hits the tablet leader, it translates directly into RocksDB operations. The dual structure (a mutable KV store plus a changelog) gives you both point-in-time lookups and CDC streams. This isn’t accidental; it’s architected specifically for patterns like DeltaJoin.

Regular joins bloat operator state by holding both sides in memory, while DeltaJoin externalizes history into Fluss. Instead of hoarding state, each side emits a changelog and probes the index on demand.

DeltaJoin builds on two critical innovations. FLIP-486 provides the core operator StreamingDeltaJoinOperator with bilateral LRU caches and the planner intelligence to know when to use them. When a record arrives on either side, it checks its cache, then triggers an async probe on miss. Two AsyncDeltaJoinRunner instances handle the bilateral lookups, each maintaining its own cache to avoid constant external calls.
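
The "check cache, probe on miss" loop is easier to see in code. The following is a synchronous toy in Python, not the Flink operator: ToyDeltaJoin, its tables (plain dicts standing in for external indexed stores), and the invalidate-on-arrival rule are all assumptions made for illustration. The real operator probes asynchronously.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

    def discard(self, key):
        self._data.pop(key, None)


class ToyDeltaJoin:
    """Each arriving record probes the other side's table; no join history is kept."""

    def __init__(self, left_table, right_table, cache_size=2):
        # Hypothetical external indexes: join_key -> list of rows.
        self.tables = {"left": left_table, "right": right_table}
        self.caches = {"left": LRUCache(cache_size), "right": LRUCache(cache_size)}
        self.probes = 0  # number of external lookups (cache misses)

    def _lookup(self, side, key):
        cached = self.caches[side].get(key)
        if cached is not None:
            return cached
        self.probes += 1
        rows = self.tables[side].get(key, [])
        self.caches[side].put(key, rows)
        return rows

    def on_record(self, side, key, row):
        # Assumption: a new record means this side's table changed for the key,
        # so drop any cached copy of it.
        self.caches[side].discard(key)
        other = "right" if side == "left" else "left"
        matches = self._lookup(other, key)
        if side == "left":
            return [(row, m) for m in matches]
        return [(m, row) for m in matches]
```

The only thing that would need checkpointing here is the cache content, which is the whole point: the join history lives in the external tables, not in the operator.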

FLIP-519 solves the harder problem: async ordering chaos. The KeyedAsyncWaitOperator ensures updates for the same key process sequentially while different keys run concurrently. Without this, your upserts would corrupt faster than a politician's promises. It's the difference between eventual consistency and eventual catastrophe.

The KeyedAsyncWaitOperator serializes updates per key while letting different keys run concurrently—turning async chaos into ordered streams.
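
A small asyncio sketch shows the ordering contract: same key means sequential, different keys mean concurrent. This is plain Python, not Flink's implementation; process_in_key_order and handler are hypothetical names, and chaining each task on the previous task for the same key is just one simple way to get the behaviour.

```python
import asyncio


async def process_in_key_order(records, handler):
    """Run handler(key, value) concurrently across keys, sequentially within a key."""
    tails = {}  # key -> task for the most recently submitted record of that key
    tasks = []

    async def run_after(prev, key, value):
        if prev is not None:
            await prev  # wait for the previous record with the same key
        return await handler(key, value)

    for key, value in records:
        task = asyncio.create_task(run_after(tails.get(key), key, value))
        tails[key] = task
        tasks.append(task)
    return await asyncio.gather(*tasks)
```

A slow lookup for key "a" delays only later "a" records; records for "b" keep flowing, which is what keeps upserts correct without serializing the whole stream.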

Configuration: Where Dreams Meet Reality

Your setup starts simple:

SET 'table.optimizer.delta-join.strategy' = 'AUTO';
SET 'table.exec.async-lookup.key-ordered-enabled' = 'true';
SET 'table.exec.async-lookup.output-mode' = 'ALLOW_UNORDERED';
-- DeltaJoin caches are configurable per side; tune per workload

But the optimizer is picky. Between source and join, only stateless operators survive the rewrite: scans with pushdowns, key-preserving calcs, watermarks, exchanges. One stateful operator in the chain? Back to traditional joins and their state baggage. No cascaded joins yet — the planner stays conservative.

Changelog streams face tighter scrutiny. Streams heavy in UPDATE_BEFORE operations get rejected—DeltaJoin can't reconstruct "before" values from point lookups. Delete operations follow strict type-based rules: INNER joins tolerate deletes from one side only; LEFT/RIGHT joins only from the outer side. The planner stays conservative, choosing correctness over optimization every time.

Current limitations of Flink’s DeltaJoin operator across topology, semantics, and consistency.

Your Fluss table design determines performance:

CREATE TABLE fluss_customers (
    customer_id BIGINT,
    region_id INT,
    order_count BIGINT,
    customer_data STRING,
    PRIMARY KEY (customer_id, region_id) NOT ENFORCED
) WITH (
    'bucket.num' = '16',
    'bucket.key' = 'customer_id',  -- single-tablet path for lookups by customer
    'table.log.ttl' = '7d'          -- changelog retention (TTL)
);

PK tables default to hash bucketing by primary key (excluding partition keys) unless you set bucket.key. This configuration is mission-critical—set bucket.key = 'customer_id' and every probe hits a single tablet. Skip this step, and Fluss distributes your lookups across the cluster.
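
Why the bucket key matters can be shown with a few lines of Python. This is a sketch of hash bucketing in general, not Fluss's actual hash function: CRC32 is a stand-in, and bucket_for and buckets_touched are hypothetical helpers.

```python
import zlib


def bucket_for(values, num_buckets=16):
    """Map a tuple of bucket-key values to a bucket (CRC32 is a stand-in hash)."""
    encoded = "|".join(str(v) for v in values).encode()
    return zlib.crc32(encoded) % num_buckets


def buckets_touched(rows, bucket_key_cols, num_buckets=16):
    """Set of buckets holding the given rows under a particular bucket-key choice."""
    return {
        bucket_for(tuple(row[col] for col in bucket_key_cols), num_buckets)
        for row in rows
    }
```

Take 50 rows for one customer spread across regions: bucketing on customer_id alone puts them all in one bucket, so a lookup by customer is a single-tablet read, while bucketing on the full (customer_id, region_id) key scatters them and turns the same lookup into a fan-out.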

The Architecture Wars: Different Philosophies, Same Problem

Here’s where the streaming world splits into camps. RisingWave takes the “streaming database” approach: it’s PostgreSQL-wire-compatible and maintains all join state in shared storage with tiered caching. When you write a join in RisingWave, you’re essentially creating a materialized view that updates incrementally. The engine manages multi-version concurrency control (MVCC) internally, giving you snapshot consistency by default. Your joins always see a consistent view of both sides, but you pay for it with storage overhead and checkpoint coordination across actors.

RisingWave’s architecture is fascinating: compute nodes are stateless, with all state living in Hummock shared storage (typically S3-compatible object stores). Hot data stays in multi-level caches: block cache, meta cache, and optional disk cache (NVMe/EBS). When a join processes a record, it might hit memory, local SSD, or remote storage. The consistency protocol ensures all actors see the same epoch, preventing temporal anomalies. But this coordination has a cost: checkpoint barriers must traverse all actors, with latency bounded by the slowest operator and storage I/O.

Their Disk Cache shows impressive results, up to 94% fewer remote reads and ~75% fewer S3 GETs in their tests, drastically reducing both latency and cloud costs. Atome, a BNPL payments company, migrated parts of their Flink-based pipeline to RisingWave specifically for operational simplicity and consistency guarantees.

Feldera goes even further with incremental everything. Built on Differential Dataflow principles (DBSP), Feldera doesn’t just maintain state — it maintains the entire computation graph as incremental operators. Every join is a Z-set transformation that tracks insertions and deletions as weighted updates. When you query a join in Feldera, you’re reading from a pre-computed, constantly updating index. Feldera recently rounded this out with first-class backfill orchestration, labeled connectors and staged historical-then-realtime ingest — so bootstrapping big state is part of the product, not a runbook.

Feldera’s advantage comes from maintaining the entire computation graph as incremental state. When you join the result of Join A with Join B (multi-hop), Feldera doesn’t recompute Join A from scratch; it maintains Join A’s output as a differential index that updates incrementally. Each downstream join operates on pre-materialized, indexed results from upstream joins. For recursive queries like supply chain tracing or graph algorithms, Feldera’s pre-computation model can outperform lookup-based approaches, because each iteration reuses incrementally maintained indexes instead of issuing fresh lookups or recomputing upstream joins.
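
To get a feel for "joins as Z-set transformations", here is a tiny Python sketch. It is not Feldera code: Z-sets are modeled as plain dicts from (key, payload) records to integer weights, and the function names are made up. The incremental rule it uses is the standard bilinearity of join: the change in A join B equals dA join B, plus A join dB, plus dA join dB.

```python
from collections import defaultdict


def zset_add(*zsets):
    """Add Z-sets pointwise, dropping records whose weight cancels to zero."""
    out = defaultdict(int)
    for z in zsets:
        for record, weight in z.items():
            out[record] += weight
    return {r: w for r, w in out.items() if w != 0}


def zset_join(a, b):
    """Join two Z-sets of (key, payload) records on key; weights multiply."""
    by_key = defaultdict(list)
    for (key, payload_b), weight_b in b.items():
        by_key[key].append((payload_b, weight_b))
    out = defaultdict(int)
    for (key, payload_a), weight_a in a.items():
        for payload_b, weight_b in by_key.get(key, []):
            out[(key, payload_a, payload_b)] += weight_a * weight_b
    return {r: w for r, w in out.items() if w != 0}


def incremental_join_delta(a, da, b, db):
    """Change in (A join B) when A changes by da and B changes by db."""
    return zset_add(zset_join(da, b), zset_join(a, db), zset_join(da, db))
```

An address change is just a delta with weight -1 for the old row and +1 for the new one, and the join output changes by the matching retractions and insertions. A real engine keeps a and b as maintained indexes so each step costs work proportional to the delta, not to the tables.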

Different philosophies of state management: Flink + Fluss use minimal operator state with external KV + log, RisingWave relies on stateless compute actors over shared storage, while Feldera pre-materializes results using incremental indexes.

DeltaJoin + Fluss takes the opposite bet: externalize everything. No internal state management, no MVCC overhead, no incremental index maintenance. Just probes. The philosophical triangle is clear:

Flink+Fluss (DeltaJoin): “probe on demand, minimal job state, eventual consistency”
RisingWave: “MVCC + materialized views over shared storage for snapshot-consistent results”
Feldera: “pre-materialize and index every intermediate; strongest for multi-hop/recursive”

The Performance Reality Check

RisingWave checkpoints can reach 100–200MB/s throughput with consistent latency, but checkpoint size grows linearly with state. A billion-row dimension table means gigabytes of checkpoint data, even with compression. Recovery requires rebuilding consistent state across all actors, think minutes for large deployments. However, the shared storage model enables parallel recovery: new actors can start processing immediately while warming their caches asynchronously.

Feldera’s checkpoints are incremental diffs, often smaller than RisingWave’s, but the storage footprint is significant. That same billion-row table requires substantial storage to maintain efficient Z-set indexes, and that’s before considering join output cardinality. The payoff? Sub-millisecond query latency on pre-computed results. Feldera can process millions of updates per second on a single node for moderately complex queries, but each join multiplies storage requirements.

DeltaJoin checkpoints are tiny compared to stream-stream joins: they hold just cache entries and in-flight requests. Recovery means warming caches, not rebuilding state. The billion-row table lives in Fluss, accessed on-demand. The trade-off: uncached lookups add low single-digit milliseconds of latency, and cache misses during traffic spikes can cascade into backpressure. With proper cache configuration and hot-key patterns, high cache hit rates keep most lookups fast.

The implementation uses pooled result handlers and bounded in-flight async probing to manage resources efficiently. Failed lookups retry with exponential backoff, while partial caching helps balance memory usage against lookup frequency.
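
Bounded in-flight probing with backoff is a generic pattern, so here is a minimal asyncio sketch of it. This is not Flink's code: lookup is a hypothetical async callable, ConnectionError stands in for whatever transient failure the store raises, and the limits are arbitrary.

```python
import asyncio


async def probe_with_retry(lookup, key, retries=3, base_delay=0.01):
    """Call lookup(key), retrying transient failures with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            return await lookup(key)
        except ConnectionError:
            if attempt == retries:
                raise  # out of retries: surface the failure
            await asyncio.sleep(base_delay * (2 ** attempt))


async def probe_all(lookup, keys, max_in_flight=4):
    """Probe every key while never running more than max_in_flight lookups at once."""
    gate = asyncio.Semaphore(max_in_flight)

    async def guarded(key):
        async with gate:  # the slot stays held across retries
            return await probe_with_retry(lookup, key)

    return await asyncio.gather(*(guarded(k) for k in keys))
```

Holding the slot during backoff is deliberate in this sketch: when the store is struggling, retries should not be allowed to pile extra load on top of it. The cap is also what turns a lookup storm into backpressure.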

Performance trade-offs matrix comparing RisingWave, Feldera, and DeltaJoin+Fluss across checkpoint size, recovery time, latency, and memory/storage footprint.

Real-World Patterns and Anti-Patterns

Choose RisingWave when you need SQL compatibility and epoch-based snapshot consistency for analytics and dashboards. Financial dashboards, operational analytics, anywhere business users write SQL directly. The PostgreSQL compatibility means your BI tools just work. The MVCC consistency means your reports always balance. The trade-off is operational complexity: you’re running a distributed database, not just a stream processor.

Choose Feldera for complex, multi-hop incremental queries. Graph analytics, recursive CTEs, anything where you’re joining join results with other join results. Say a logistics company tracks packages through its network: packages join with trucks, trucks join with routes, routes join with weather, and everything joins with real-time GPS. In Feldera, this entire graph updates incrementally. In DeltaJoin, you’d need millions of lookups per second, each adding latency.

Choose DeltaJoin + Fluss for high-volume enrichment with bounded dimensions. User profile lookups, feature assembly, entity resolution, essentially the cases where one side is orders of magnitude larger than the other. Taobao enriching millions of events per second with user profiles and advertiser configs sees dramatic state reduction. Traditional join state would be massive. With DeltaJoin + Fluss? Just cache overhead, with dimension data living in Fluss’s optimized storage.

Guidelines for choosing between RisingWave, Feldera, and DeltaJoin+Fluss: best-fit use cases and situations to avoid.

But know the anti-patterns. DeltaJoin fails catastrophically with high-cardinality, high-mutation dimensions. If your dimension table has millions of rows changing every second, your cache becomes useless, every join becomes a lookup, and your network becomes the bottleneck. Imagine real-time inventory tracking: every sale updates inventory, invalidating caches instantly. It’s a lot better to stick to traditional joins with aggressive TTLs in this case.

Anti-pattern: using DeltaJoin with high-churn dimension tables causes cache misses, lookup storms, and network saturation.

The Uncomfortable Consistency Question

Here’s what the documentation confirms: DeltaJoin delivers eventual, not snapshot consistency by design. Each probe sees the latest row version at lookup time. Concurrent updates to both join sides create transient mismatches that eventually converge.

Consider this scenario: An order for customer C arrives at T1. DeltaJoin looks up customer C, gets version V1. At T2, customer C updates their address. At T3, another order arrives and gets version V2. Your output stream now has two orders for the same customer with different addresses, even though they were placed seconds apart. RisingWave and Feldera users might scoff — they get consistency guarantees built-in. But those guarantees aren’t free — they’re paid in checkpoint size, recovery time, and operational overhead.
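
The T1/T2/T3 scenario fits in a few lines of Python. This is a toy model of "probe the latest version at lookup time", with made-up names and integer timestamps, not anything from Flink.

```python
def enrich(orders, customer_versions):
    """Join each order with the customer row visible when the order arrives.

    orders: list of (arrival_time, order_id).
    customer_versions: list of (effective_time, row), sorted by time.
    """
    out = []
    for arrival_time, order_id in orders:
        visible = [row for ts, row in customer_versions if ts <= arrival_time]
        out.append((order_id, visible[-1] if visible else None))
    return out
```

With versions [(0, "V1"), (2, "V2")] and orders arriving at times 1 and 3, the first order is enriched with V1 and the second with V2. Nothing re-emits the first order after the address change, which is exactly the behaviour your sinks need to tolerate.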

Eventual vs. snapshot consistency: DeltaJoin emits orders with different customer versions until checkpoint convergence, while systems like Feldera align versions at snapshot boundaries.

Your downstream must be built for this reality. Upsert sinks that handle duplicates gracefully. Idempotent operations that survive replays. Business logic that tolerates temporary inconsistencies. Monitoring that tracks convergence time, not just throughput. For many real-time analytics use cases, this trade-off is brilliant: you exchange perfect consistency for operational sanity. For financial reconciliation requiring ACID guarantees? Keep walking.

The Future State

Data Future is bright and shiny with these new tools!

While Fluss 0.6 introduced prefix lookups, version 0.8 targets full Flink 2.1/DeltaJoin integration as a key milestone. The roadmap includes potential optimizations like automatic bucket key inference and adaptive cache sizing, though specifics remain subject to change.

RisingWave is doubling down on cloud-native architecture, with their Elastic Disk Cache showing significant S3 I/O reductions, and is investing heavily in streaming Iceberg. Feldera continues pushing incremental computation boundaries, with their backfill orchestration making it easier to bootstrap complex stateful pipelines.

The streaming world isn’t converging on a single solution — it’s diverging into specialized tools for different problems. Traditional approaches like Flink’s disaggregated state still treat state as job-private, offloaded but isolated. Materialize, Feldera, and RisingWave maintain rich indexed state within the engine, offering stronger consistency and incremental processing with smart intermediate caches at higher operational cost and a whole different set of tradeoffs.

DeltaJoin + Fluss isn’t trying to be RisingWave or Feldera. It’s solving a specific problem: making traditional Flink joins scale without the traditional state explosion. For teams already invested in Flink, facing the state management wall, this combination offers a pragmatic escape route without switching engines entirely.

The constraints are real: eventual consistency, strict topology requirements, and the need for cooperative storage. But for teams willing to embrace these trade-offs, the payoff is substantial: joins that actually scale, checkpoints measured in seconds not hours, and operators who can sleep through the night.

Shoutout to the Flink, Fluss, Feldera, and RisingWave teams for quietly revolutionizing how we think about data in motion.

While the rest of tech argues about AI, these folks are solving the unglamorous problems that actually keep the world’s data flowing and making it look elegant in the process.

What the Fuss with Fluss: Flink Delta Force was originally published in fresha-data-engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

The SELECT FOR UPDATE Trap Everyone Falls Into

Anton Borisov — Mon, 08 Sep 2025 08:57:03 GMT

You Probably Are Too

I wasn’t sure if I should write this article. After all, row-level locking in PostgreSQL isn’t exactly breaking news — these features have been around since 2013. But here’s the thing: I’ve been asked about SELECT FOR UPDATE performance issues three times this month alone. Different companies, different engineers, same fundamental misunderstanding.

I have a simple rule for deciding whether to write about something: Would this article have saved me hours of debugging 2–5 years ago? If yes, then it’s worth writing, even if it seems obvious to me now. The truth is, we don’t write technical articles because we just discovered something cool and can’t wait to share. We write them for that developer who’s currently staring at a deadlock graph at 2 AM, wondering why their perfectly reasonable-looking code is causing production issues.

So if you’ve worked with PostgreSQL in any production environment with concurrent transactions, you’ve likely reached for SELECT FOR UPDATE to prevent race conditions. It seems like the obvious choice — after all, if you're going to update a row, you should lock it first, right?

Here’s the uncomfortable truth: SELECT FOR UPDATE is probably the wrong tool for your use case, and it's silently destroying your application's concurrency. Let me show you why this seemingly logical choice is actually a historical artifact that can bring your database to its knees.

The Race Condition We’re All Trying to Solve

Let’s start with why developers reach for row-level locking in the first place. Picture this scenario: you’re building an e-commerce platform where multiple users might try to purchase the last item in stock simultaneously.

Without proper locking, here’s what happens at the default READ COMMITTED isolation level:

-- Transaction A reads inventory
SELECT quantity FROM inventory WHERE product_id = 123;
-- Returns: quantity = 1

-- Meanwhile, Transaction B also reads the same row
SELECT quantity FROM inventory WHERE product_id = 123;
-- Also returns: quantity = 1

-- Both transactions think there's stock available
-- Transaction A updates
UPDATE inventory SET quantity = 0 WHERE product_id = 123;

-- Transaction B updates (overwrites A's change!)
UPDATE inventory SET quantity = 0 WHERE product_id = 123;

Yep, exactly so

Both customers think they got the item. Your warehouse is confused. Customer service is having a bad day. This is the classic “lost update” problem.

The traditional solution looks elegant:

BEGIN;
SELECT quantity FROM inventory 
WHERE product_id = 123 
FOR UPDATE;  -- Lock the row!

-- Now safely check and update
UPDATE inventory SET quantity = quantity - 1 
WHERE product_id = 123 AND quantity > 0;
COMMIT;

Problem solved, right? Not quite. This solution works, but it’s using a sledgehammer where a scalpel would do.
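
One detail worth spelling out from the application side: the guarded UPDATE (quantity > 0 in the WHERE clause) tells you whether you actually got the item through its affected-row count. Here is a minimal Python sketch of that check. It uses the standard library's sqlite3 purely as a stand-in so it runs anywhere; SQLite has none of PostgreSQL's row-lock modes, and try_purchase and make_demo_db are hypothetical helpers.

```python
import sqlite3


def make_demo_db():
    """In-memory stand-in database with a single item left in stock."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inventory (product_id INTEGER PRIMARY KEY, quantity INTEGER)")
    conn.execute("INSERT INTO inventory VALUES (123, 1)")
    conn.commit()
    return conn


def try_purchase(conn, product_id):
    """Decrement stock only if some is left; True means a unit was actually reserved."""
    cur = conn.execute(
        "UPDATE inventory SET quantity = quantity - 1 "
        "WHERE product_id = ? AND quantity > 0",
        (product_id,),
    )
    conn.commit()
    return cur.rowcount == 1
```

The first call reserves the last unit and the second call reports failure, so the second customer gets an honest "sold out" and never a phantom order.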

The Four Horsemen of Row Locking

PostgreSQL row locks. Gloomy, gothic, yours…

Here’s where things get interesting. PostgreSQL doesn’t just have “locked” and “unlocked” states for rows. Since version 9.3, PostgreSQL actually has four distinct row-level lock modes:

FOR KEY SHARE: the gentlest lock, used by foreign key checks
FOR SHARE: prevents modifications but allows other shared locks
FOR NO KEY UPDATE: blocks other writers but still allows FOR KEY SHARE, so foreign key checks proceed
FOR UPDATE: the nuclear option, preventing almost everything

These aren’t arbitrary distinctions. They form a carefully designed hierarchy that enables PostgreSQL to maintain referential integrity while maximizing concurrency. The breakthrough insight was recognizing that not all updates are created equal.
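
The hierarchy is easiest to remember as a conflict table. Here it is as a small Python lookup, following the row-level lock conflict table in the PostgreSQL documentation; the blocks helper is just an illustrative name.

```python
# For each held lock mode, the set of requested modes that must wait on the same row.
CONFLICTS = {
    "FOR KEY SHARE": {"FOR UPDATE"},
    "FOR SHARE": {"FOR NO KEY UPDATE", "FOR UPDATE"},
    "FOR NO KEY UPDATE": {"FOR SHARE", "FOR NO KEY UPDATE", "FOR UPDATE"},
    "FOR UPDATE": {"FOR KEY SHARE", "FOR SHARE", "FOR NO KEY UPDATE", "FOR UPDATE"},
}


def blocks(held, requested):
    """True if a transaction holding `held` makes `requested` wait on the same row."""
    return requested in CONFLICTS[held]
```

The interesting cell is blocks("FOR NO KEY UPDATE", "FOR KEY SHARE"), which is False: a foreign key check can proceed while someone updates non-key columns. Swap the held lock for FOR UPDATE and the same check has to wait.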

Consider this example relationship:

CREATE TABLE accounts (
    account_id BIGINT PRIMARY KEY,
    balance DECIMAL(10,2),
    status VARCHAR(20)
);

CREATE TABLE transactions (
    transaction_id BIGINT PRIMARY KEY,
    account_id BIGINT REFERENCES accounts(account_id),
    amount DECIMAL(10,2)
);

When you insert a transaction, PostgreSQL needs to ensure the referenced account won’t disappear. But should that prevent you from updating the account’s balance? Of course not! The foreign key only cares that the account continues to exist with the same ID.

The Hidden Cost of FOR UPDATE

Here’s the critical insight that most developers miss: SELECT FOR UPDATE doesn't match what UPDATE actually does.

When you execute an UPDATE statement, PostgreSQL is smart about which lock it acquires:

If you’re changing a primary key or unique column: FOR UPDATE lock
If you’re changing any other column: FOR NO KEY UPDATE lock

But when you manually write SELECT FOR UPDATE, you're always getting the strongest lock, even when you don't need it.

Let me demonstrate the catastrophic effect this can have:

-- Session 1: Lock an account to update its balance
BEGIN;
SELECT * FROM accounts 
WHERE account_id = 1001 
FOR UPDATE;

-- Session 2: Try to insert a transaction
-- This will BLOCK even though we're not changing the key!
INSERT INTO transactions (transaction_id, account_id, amount) 
VALUES (5001, 1001, 100.00);
-- Waiting... waiting... waiting...

Session 2 is blocked not because there’s a logical conflict, but because we used the wrong lock level. The correct approach:

-- Session 1: Use the appropriate lock level
BEGIN;
SELECT * FROM accounts 
WHERE account_id = 1001 
FOR NO KEY UPDATE;  -- This is what UPDATE would use!

-- Session 2: Insert proceeds without blocking
INSERT INTO transactions (transaction_id, account_id, amount) 
VALUES (5001, 1001, 100.00);
-- Success! No waiting!

Real-World Impact: A Production Horror Story

A while back, I helped a friend at a fintech startup that was experiencing mysterious deadlocks and timeout errors during peak hours. Their payment processing system would grind to a halt whenever transaction volume spiked.

Financial = fintech support, good pun

The culprit? A seemingly innocent piece of code:

-- Their original code
BEGIN;
-- Lock user account for payment processing
SELECT * FROM user_accounts 
WHERE user_id = $1 
FOR UPDATE;

-- Process payment logic here
-- Update account balance
UPDATE user_accounts 
SET balance = balance - $2,
    last_transaction = NOW()
WHERE user_id = $1;
COMMIT;

Meanwhile, their audit system was trying to insert records:

INSERT INTO audit_log (user_id, action, timestamp)
VALUES ($1, 'payment_initiated', NOW());

The audit table had a foreign key to user_accounts. Every payment would block audit logging, which would block other systems, creating a cascade of locks throughout their database.

The fix was embarrassingly simple:

-- Fixed code
BEGIN;
SELECT * FROM user_accounts 
WHERE user_id = $1 
FOR NO KEY UPDATE;  -- Changed this one line

-- Rest of the code unchanged
UPDATE user_accounts 
SET balance = balance - $2,
    last_transaction = NOW()
WHERE user_id = $1;
COMMIT;

Result: 70% reduction in lock wait times, complete elimination of deadlocks, and their system could handle 3x the transaction volume.

The Decision Tree: Which Lock Do You Actually Need?

Here’s a practical guide for choosing the right lock level:

PostgreSQL row-lock decision tree: choose the weakest lock that still guarantees correctness — FOR UPDATE, FOR NO KEY UPDATE, FOR SHARE, or FOR KEY SHARE.

In practice, 90% of the time you want FOR NO KEY UPDATE. The name is admittedly terrible (a historical artifact from the PostgreSQL 9.3 upgrade that maintained backward compatibility), but it's almost always the right choice.
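
If you prefer code to diagrams, the same "weakest lock that is still correct" logic can be written as a small Python function. This is my own reading of the rules in this article, not a transcription of the figure, and the parameter names are made up.

```python
def choose_row_lock(will_delete=False, changes_key=False, will_update=False,
                    needs_row_unchanged=False, needs_key_stable=False):
    """Pick the weakest row lock that still protects the stated intent."""
    if will_delete or changes_key:
        return "FOR UPDATE"
    if will_update:
        return "FOR NO KEY UPDATE"
    if needs_row_unchanged:
        return "FOR SHARE"
    if needs_key_stable:
        return "FOR KEY SHARE"
    return None  # a plain SELECT is enough
```

For the payment example, choose_row_lock(will_update=True) lands on FOR NO KEY UPDATE, which is the one-line fix from the horror story above.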

Practical Recommendations

1. Default to FOR NO KEY UPDATE: Unless you're specifically deleting rows or modifying primary keys, this is your go-to lock.
2. Reserve FOR UPDATE for actual deletions: The only time you truly need FOR UPDATE is when you're about to DELETE or modify a column that could be referenced by a foreign key.
3. Monitor lock waits: Use pg_stat_activity and pg_locks to identify when lock contention is happening:

SELECT blocked.pid,
       blocked.query as blocked_query,
       blocking.query as blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking 
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';

4. Document your locking strategy: Make it clear in your code why you’re choosing a specific lock level. Your future self (and your teammates) will thank you.

The Takeaway

SELECT FOR UPDATE is a powerful tool that's almost always the wrong choice. It's like using a sledgehammer to hang a picture frame — it'll work, but you'll damage the wall and annoy the neighbors.

The next time you’re tempted to write SELECT FOR UPDATE, pause and ask yourself: "Am I actually going to delete this row or change its primary key?" If the answer is no, use FOR NO KEY UPDATE instead. Your database will handle more concurrent operations, your users will experience fewer timeouts, and your ops team will sleep better at night.

Remember: in PostgreSQL, the lock level should match your actual intent. Don’t let a historical naming convention trick you into over-locking your data. Your application’s performance depends on it.

UPDATE:
I just discovered Laurenz Albe’s SELECT FOR UPDATE considered harmful (PostgreSQL), which independently reaches the same conclusion.

For a crisp rationale/history, read his; for a deeper, production-oriented guide — decision-tree, FK blocking examples, monitoring queries, and explain-driven tuning — stay here.

Both: prefer FOR NO KEY UPDATE unless deleting or changing FK-referenced keys.

The SELECT FOR UPDATE Trap Everyone Falls Into was originally published in fresha-data-engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.