Stories by Giannis Polyzos on Medium

Streams, Tables, and The Illusion of Duality

Giannis Polyzos — Tue, 19 Aug 2025 06:28:10 GMT

What Is Stream/Table Duality?

At its core:

A stream is a never-ending log of changes 📜.
A table is the materialized view of those changes, the current state 🗄️.

👉 Every table can be expressed as a stream of updates.
👉 Every stream of updates can be materialized into a table.

For example:

A database emits INSERT, UPDATE, and DELETE events → stream.
Apply those events in order, and you reconstruct the table.
Conversely, capture every change to a table → you get the changelog stream.

It’s bidirectional. That’s the duality. 💡
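To make the round trip concrete, here is a minimal Python sketch. The function names (`materialize`, `diff`) and the sample rows are assumptions made for illustration; they do not come from any particular system.

```python
from typing import Dict, List, Optional, Tuple

# A change event: (operation, primary_key, row). The row is None for deletes.
Change = Tuple[str, int, Optional[dict]]


def materialize(changelog: List[Change]) -> Dict[int, dict]:
    """Stream -> table: apply the changes in order to rebuild the current state."""
    table: Dict[int, dict] = {}
    for op, key, row in changelog:
        if op in ("INSERT", "UPDATE"):
            table[key] = row
        elif op == "DELETE":
            table.pop(key, None)
    return table


def diff(before: Dict[int, dict], after: Dict[int, dict]) -> List[Change]:
    """Table -> stream: derive the changes that turn `before` into `after`."""
    changes: List[Change] = []
    for key in sorted(before.keys() | after.keys()):
        if key not in before:
            changes.append(("INSERT", key, after[key]))
        elif key not in after:
            changes.append(("DELETE", key, None))
        elif before[key] != after[key]:
            changes.append(("UPDATE", key, after[key]))
    return changes


changelog = [
    ("INSERT", 1, {"name": "alice", "city": "Athens"}),
    ("INSERT", 2, {"name": "bob", "city": "Berlin"}),
    ("UPDATE", 1, {"name": "alice", "city": "Paris"}),
    ("DELETE", 2, None),
]

table = materialize(changelog)
# Round trip: the changes captured from the table rebuild the same table.
rebuilt = materialize(diff({}, table))
```

Note that both directions rely on the primary key and the operation type; without them, neither function could be written.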

⚙️ Core Properties of Duality

To make this magic work, you need:

Changelog semantics: the stream must carry not just new rows, but also updates and deletes.
Primary keys: so the system knows which row to update.
Time as a first-class citizen: streams provide ordering, tables represent “state at time T.”
Materialization: a table is a materialized view of a stream.
Consistency: replaying the same stream always yields the same table.
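
The time and consistency properties can be sketched in a few lines of Python. This is an illustration under simplifying assumptions (unique integer timestamps, a hypothetical `table_as_of` helper), not how any specific engine implements it.

```python
from typing import Dict, List, Optional, Tuple

# (timestamp, operation, primary_key, value); the value is None for deletes.
Event = Tuple[int, str, str, Optional[int]]


def table_as_of(events: List[Event], t: int) -> Dict[str, int]:
    """Replay every change with timestamp <= t, in timestamp order."""
    state: Dict[str, int] = {}
    for ts, op, key, value in sorted(events, key=lambda e: e[0]):
        if ts > t:
            break
        if op == "DELETE":
            state.pop(key, None)
        else:  # INSERT or UPDATE
            state[key] = value
    return state


events = [
    (1, "INSERT", "a", 10),
    (2, "INSERT", "b", 20),
    (3, "UPDATE", "a", 15),
    (4, "DELETE", "b", None),
]

state_at_2 = table_as_of(events, 2)  # the table as it looked at time 2
state_at_4 = table_as_of(events, 4)  # the table after all changes
```

Replaying the same events always produces the same result, which is what makes the table a deterministic function of the stream.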

🛑 Where It Breaks Down

Here’s where things get interesting. A lot of work is going into integrating Apache Kafka with Apache Iceberg.

Some position this as a Stream/Table duality. Let’s break things down:

🧊 Apache Iceberg

Iceberg is a great open table format for managing append-only data.

But ❌ it does not enforce primary keys.
❌ It doesn’t natively support row-level updates or deletes in streaming mode.
That means you can’t get true stream/table duality; you only get snapshots of appended data.

📦 Apache Kafka

Kafka is an event log, not a changelog.
By default, it just stores raw events; it doesn’t have the notion of updates or deletes.
So ❌ Kafka alone is not a changelog stream.

But…

✅ With Debezium or Flink CDC

You capture real database changes (with keys + operation types).
Kafka topics then carry true changelog streams.
From there, Flink can materialize those streams into tables.

Kafka is an event log, not a changelog.

This is particularly important to highlight because this is where things get complex downstream.

Stream processing engines like Apache Flink build on the concept of the changelog.

Since Kafka doesn’t generate a changelog, Flink requires adding an expensive operator (ChangelogNormalize) as an intermediate step to normalize data from Kafka, resulting in heavy states and redundant storage.

Moreover, this changelog is not reused, which means that if you consume the same Kafka topic to perform different operations in different jobs, each job requires creating and storing that changelog in its state.
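
To see why this normalization step is stateful, consider the following simplified Python sketch. It is not Flink’s actual ChangelogNormalize implementation; the `normalize` function and the tombstone convention (a value of `None` means delete) are assumptions for illustration.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple


def normalize(
    upserts: Iterable[Tuple[str, Optional[int]]]
) -> Iterator[Tuple[str, str, int]]:
    """Turn (key, value) upserts into +I / -U / +U / -D changelog rows.

    A value of None is treated as a delete (tombstone).
    """
    last_seen: Dict[str, int] = {}  # per-key state the operator has to keep
    for key, value in upserts:
        previous = last_seen.get(key)
        if value is None:
            if previous is not None:
                yield ("-D", key, previous)
                del last_seen[key]
        elif previous is None:
            yield ("+I", key, value)
            last_seen[key] = value
        elif previous != value:
            yield ("-U", key, previous)  # retract the old value
            yield ("+U", key, value)     # emit the new value
            last_seen[key] = value


rows = list(normalize([("a", 1), ("a", 2), ("b", 7), ("a", None)]))
```

The `last_seen` dictionary is the point: to emit the retraction (`-U`) or the deleted row (`-D`), the operator must remember the previous value of every key, and every job that reads the topic has to keep its own copy.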

🤔 So, Do Streams and Tables Need to Be in the Same System?

This is something I have been debating.

There are lots of systems that have at their core the concept of the Stream/Table duality, like Postgres, MySQL, Apache Beam, but also systems I’m working heavily with, like Apache Flink, Apache Paimon, and these days with Apache Fluss. So I guess it’s only natural to advocate in favor of “it needs to be in the same system”.

Otherwise, I would argue it’s an integration between systems that can’t support both natively… However, since there is no formal definition, the answer should be: not necessarily.

Systems like Apache Flink, Kafka Streams, and others expose both APIs in one place, making duality seamless. But you can also have separate systems, I guess.

As long as the ordering, keys, and changelog semantics are preserved, duality holds.

Based on what was discussed above, though, I doubt it’s fair to consider an integration between Apache Kafka & Apache Iceberg as a Stream/Table duality.

🚀 The Takeaway

Streams = the story (all the events that ever happened).
Tables = the snapshot (what’s true right now).
Together, they form a powerful duality that underpins modern real-time analytics, stream processing, and data lakehouses.

But beware 👀:

Kafka ≠ changelog (unless CDC is involved).
Iceberg ≠ table duality (because no primary keys, only appends).

There is more to this, including different approaches and trends in the industry these days, but for now, that’s what I wanted to focus on.

PS: If you are looking for a solution that builds on all the above principles and gives you extra benefits, like the ability to actually query streams, built-in caches, and more, Apache Fluss might be of interest.

Keep on streaming 🌊 🤘

The Importance of Unified Batch and Streaming in Modern Data Processing.

Giannis Polyzos — Mon, 07 Oct 2024 07:10:54 GMT

In the past decade, data-driven companies have evolved from processing periodic batches of data to needing real-time insights from continuous streams. However, the journey hasn’t been seamless.

Historically, companies have been forced to use two technology stacks to handle batch and streaming workloads separately. This separation has resulted in various challenges, from increased operational complexity to delayed insights.

Today, we’re seeing a shift toward the unification of batch and streaming processing into a single stack. Unified data platforms offer the promise of managing both workloads seamlessly, reducing complexity and enabling real-time analytics at scale.

The following is a great presentation from LinkedIn, exploring batch and streaming unification.

https://medium.com/media/2f2ded95e42d4d693012030108922bb3/href

In this article, we’ll explore why unifying batch and streaming is critical, what properties a modern unified data platform must have, and how it affects various layers of the data ecosystem — from ingestion and compute to storage and development.

Separate Stacks for Batch and Streaming

Traditionally, organizations have employed two distinct architectures to process data, each tailored to specific needs:

Batch Processing:

Typically used for large-scale, periodic processing of data (e.g., nightly ETL jobs).
Systems like Apache Hadoop and later Apache Spark have been popular for running heavy batch workloads.
Latency is generally higher, ranging from hours to days, depending on how frequently the batch jobs are triggered.

Stream Processing:

Designed for real-time or near-real-time processing, continuously consuming data from event streams (e.g., Apache Kafka).
Technologies like Apache Flink are widely used for low-latency, real-time analytics.
Latency is typically in milliseconds or seconds, providing up-to-the-moment insights.

The problem is that companies end up managing two separate infrastructures for ingesting, storing, and processing data. Data duplication often occurs, where the same data must be ingested and stored differently for batch and streaming pipelines, while operational overhead increases as teams maintain two technology stacks with different tools, frameworks, and performance trade-offs.

The Importance of Unifying Batch and Streaming!

With the growing demand for real-time data insights alongside historical batch processing, many organizations now recognize the need for unified data processing platforms that can handle both batch and streaming workloads efficiently.

Key reasons why unifying these two paradigms is essential include:

Operational Efficiency: By converging batch and streaming, companies can reduce the complexity of managing separate infrastructures. This means fewer moving parts, less data duplication, and simplified operational management.

Real-Time and Historical Data Processing: Organizations can access and process both real-time and historical data without needing to switch between systems. This provides a complete picture, combining real-time analytics with historical trends for better decision-making.

Consistency Across Workloads: A unified platform ensures consistent processing models across both batch and streaming. This eliminates the discrepancies that arise from using different technologies with different data semantics, ensuring that data transformations applied to streams are also applied to historical data.

Scalability and Flexibility: Unified architectures are inherently more scalable because they are designed to handle varying workloads — whether it’s processing a large batch job or handling high-throughput streams of data. Organizations gain flexibility to adjust resource allocation dynamically based on current workloads.

Cost-Effectiveness: Maintaining two distinct systems leads to redundancy in terms of resources and infrastructure costs. Unifying batch and streaming on a single platform can significantly reduce infrastructure costs while optimizing resource usage.

Key Properties of a Unified Data Stack

To support both batch and streaming workloads efficiently, a unified data stack must have specific characteristics across all layers of the data pipeline, from ingestion to storage and compute.

Next, let’s see the necessary properties of a unified platform and how it addresses both batch and streaming needs.

Unified Data Ingestion Layer

The ingestion layer collects data from different sources — whether it’s event-driven data from Kafka, relational databases, or log files. In a unified architecture, this layer must:

Support both batch and streaming data sources: Whether the data arrives as a continuous stream (e.g., via Kafka or Pulsar) or as periodic batches (e.g., from flat files or databases), the ingestion framework should handle both seamlessly.
Seamless CDC Integration: Change Data Capture (CDC) technology should be supported natively to keep the data lake synchronized with source systems in both real-time and batch scenarios.
Latency Tolerance: The ingestion framework should accommodate low-latency streaming while also supporting high-throughput batch ingestion, with the ability to switch between or combine these modes as needed. We refer to this functionality as snapshot and incremental data reading.
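
Snapshot and incremental reading can be pictured with the short Python sketch below. The function name, the offsets, and the sample data are assumptions for illustration; real CDC frameworks also have to handle concurrency and consistency between the two phases, which this sketch ignores.

```python
from typing import Dict, Iterator, List, Tuple


def snapshot_then_incremental(
    snapshot: Dict[int, str],
    snapshot_offset: int,
    log: List[Tuple[int, int, str]],
) -> Iterator[Tuple[int, str]]:
    """Yield (key, value) records: the full snapshot first, then only the
    log entries written after the snapshot was taken."""
    # Phase 1: bulk-read the existing data (high throughput, batch style).
    for key, value in sorted(snapshot.items()):
        yield key, value
    # Phase 2: switch to the change log and skip what the snapshot covers.
    for offset, key, value in log:
        if offset > snapshot_offset:
            yield key, value


snapshot = {1: "alice", 2: "bob"}  # table state as of log offset 2
log = [(1, 1, "alice"), (2, 2, "bob"), (3, 2, "bobby"), (4, 3, "carol")]

records = list(snapshot_then_incremental(snapshot, 2, log))
```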

Unified Compute Layer

The compute layer is where data is processed, transformed, and enriched. A unified compute framework must:

Handle both real-time and batch processing: The compute engine must be capable of processing data in real-time streams and large-scale batch workloads using the same APIs and data models.
Stateful Processing: Real-time processing often requires maintaining state (e.g., tracking user sessions or windowed aggregations). The compute engine must manage the state efficiently in both streaming and batch contexts.
Event-Time Semantics: In streaming applications, the ability to process data based on the event time (when the event occurred) is crucial. Unified compute engines should maintain accurate event-time processing for both batch and streaming data.
Exactly-Once Semantics: To ensure data consistency, the compute engine should support exactly-once processing guarantees, which are critical for both real-time and batch pipelines.
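
As a rough illustration of event-time semantics, the Python sketch below assigns events to tumbling windows using the timestamp carried by each event rather than the arrival order. The `tumbling_counts` helper is hypothetical, and the sketch deliberately leaves out watermarks and late-data handling.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_counts(
    events: Iterable[Tuple[int, str]], size: int
) -> Dict[Tuple[int, str], int]:
    """Count events per (window_start, key), based on the event timestamp."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for event_time, key in events:
        window_start = event_time - (event_time % size)
        counts[(window_start, key)] += 1
    return dict(counts)


# The same events, once in order (as in a batch file) and once shuffled
# (as they might arrive on a stream).
in_order = [(1, "a"), (4, "a"), (11, "b"), (12, "a")]
out_of_order = [(12, "a"), (1, "a"), (11, "b"), (4, "a")]

batch_result = tumbling_counts(in_order, size=10)
stream_result = tumbling_counts(out_of_order, size=10)
```

Because the window is derived from the event time, both inputs produce the same result, which is the property that lets batch and streaming runs agree.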

Apache Flink, for example, also offers capabilities that support batch processing, such as a hybrid shuffle mode, adaptive and speculative execution, operator fusion codegen, and more.

Unified Storage Layer

The storage layer in a unified architecture needs to support data retention and retrieval in a way that accommodates both real-time and historical querying:

Transactional Support: Data storage systems should provide ACID transactions to ensure that both batch and streaming updates are consistent and atomic, even when different workloads write to the same data store.
Efficient Querying: The storage layer should support low-latency querying for real-time data while also being optimized for large-scale batch processing.
Schema Evolution and Flexibility: Since both batch and streaming data may undergo schema changes, the storage system must support schema evolution without breaking existing queries or applications.
Time-Travel and Versioning: To enable access to historical snapshots alongside real-time data, the storage layer should support time travel — the ability to query past versions of the data — while also maintaining efficient storage for real-time updates.
Streaming reads/writes: To enable real-time analytics, in addition to batch reads/writes, the storage needs to be able to support streaming read/writes. In addition, it also needs to support upserts efficiently as they are crucial for streaming data processing.

Unified Development and API Layer

Another important aspect of a unified batch and streaming platform is the development experience. Engineers should be able to use the same development tools, languages, and APIs for both streaming and batch workloads:

Single API for Batch and Streaming: A unified platform should offer a single API that allows developers to write code once and run it in both batch and streaming contexts without changes.
Unified Data Models: The data model for batch and streaming should be consistent, allowing data transformations, aggregations, and joins to work seamlessly across both paradigms.
Extensibility and Integration: The platform should integrate easily with the broader data ecosystem, supporting plugins for various sources, sinks, and connectors (e.g., JDBC, Kafka, S3).
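
The “write once, run in both contexts” idea can be hinted at with plain Python iterables. This is only an analogy, with hypothetical names (`running_totals`, `batch_source`, `stream_source`), and says nothing about how a real engine plans or executes jobs.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def running_totals(events: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """One piece of logic: emit the updated total for a key on every event."""
    totals: Dict[str, int] = {}
    for key, amount in events:
        totals[key] = totals.get(key, 0) + amount
        yield key, totals[key]


def batch_source() -> List[Tuple[str, int]]:
    # Bounded input: all the data is available up front.
    return [("a", 1), ("b", 2), ("a", 3)]


def stream_source() -> Iterator[Tuple[str, int]]:
    # Stand-in for an unbounded input: events arrive one at a time.
    for event in [("a", 1), ("b", 2), ("a", 3)]:
        yield event


# Batch view: only the final value per key matters.
batch_result = dict(running_totals(batch_source()))
# Streaming view: every intermediate update is observed.
stream_updates = list(running_totals(stream_source()))
```

The batch result is simply the last update per key of the streaming result, which is one way to read the claim that batch is a special case of streaming.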

The Holy Grail Of Unified Batch and Stream Processing

Apache Flink CDC, Apache Flink, and Apache Paimon are a powerful trio that enables organizations to unify batch and streaming architectures.

Apache Flink provides a unified API for both batch and streaming processing, which means users can run the same set of code whether it’s batch or stream.

Apache Flink CDC is a framework that provides snapshot and incremental data ingestion. It also provides a seamless experience for schema synchronization, full database synchronization, and new table discovery.

Apache Paimon is a modern lake storage layer that supports both streaming and batch reads/writes. It also provides ACID transactions, time travel, and schema evolution, making it ideal for unified architectures that handle both batch and streaming.

Moreover, Flink 1.20 introduced the Materialized Table to simplify the development of batch and streaming applications. The engine can intelligently decide whether to run a batch, incremental, or stream processing job.

Conclusion

As data needs continue to evolve, the separation between batch and streaming is becoming an unnecessary burden. Companies need both real-time insights and large-scale historical analysis in their data pipelines. By unifying batch and streaming into a single architecture, organizations can dramatically simplify their operations, reduce costs, and unlock new capabilities in real-time and historical analytics.

Apache Flink CDC, Apache Flink, and Apache Paimon combined with the Materialized Table help shape this unified future by providing the necessary infrastructure and capabilities to support both types of workloads.

Whether you’re processing massive batch ETL jobs or handling high-frequency event streams, a unified stack can offer a scalable, consistent, and efficient way to manage your data.

By consolidating batch and streaming into a cohesive stack, companies can gain the agility needed to adapt to modern data demands, providing real-time insights and large-scale historical analysis with minimal complexity.

The majesty of Apache Flink and Paimon

Giannis Polyzos — Sun, 07 Jul 2024 08:46:26 GMT


Why Apache Paimon is the go-to lakehouse storage for Apache Flink.

Roughly one year ago I discovered Apache Paimon — Flink Table Store at the time.

As a stream processing practitioner and enthusiast — focusing on Apache Flink — I was always intrigued by what’s next and what innovations can help users unlock more use cases and do “more with less”.

At the time I was looking to explore what would be the most appropriate table format to adopt along with Apache Flink. I had done a lot of work with Delta Lake along with Apache Spark back in the day, but for Flink, it was mainly between Apache Iceberg and Hudi.

Long story short, Apache Paimon came to be and I have never looked back since, although it took me a while to grasp the full potential of the project.

In this blog post, I aim to share my thoughts on why I think Apache Paimon is the best lake storage for Apache Flink with the hope of helping engineers and decision-makers better understand the WHY.

Setting the Stage ✅

If you look into the official Apache Paimon documentation, you will see that it is described as a lake format, along with all the rich functionality the project provides.

But to fully grasp the true value as an analytical solution, we need to “zoom out” and try and see the bigger picture with Paimon as part of an ecosystem along with Apache Flink and Flink CDC.

Here is a different way to look at Paimon. As I like to say: Apache Paimon is not yet another table format; it brings streaming data and analysis to the lake by leveraging Lakehouse primitives. You can think of it as a gen2 lake format that extends Apache Flink’s capabilities on the lake.

A “naive way” to think about it compared to other table formats and engines is the following:

A traditional data warehouse versus a more real-time one like ClickHouse, which allows connecting upstream and downstream tables that update in real time as new data arrives.

Streaming data and analysis have been perceived as complex and expensive. At the same time, there is no good “low-cost” solution for use cases that can tolerate more than a few seconds of latency (from 30 seconds to a few minutes).

This means users can’t easily make cost/latency trade-offs and also experiment with streaming data easily before investing more into it.

People might wonder: why yet another lake format for Apache Flink? Delta Lake, Apache Iceberg, and Apache Hudi are the three established table formats, used along with Apache Spark and other query engines. Apache Flink, though, as a streaming-first engine, has different requirements and can expand to more use cases than traditional batch-oriented query engines. This alone creates the need for a lake storage that can unlock the full power of such an engine.

These are the key points I hope to address in this blog post, along with how Apache Flink and Paimon can provide more for you.

But first, let’s start with some fun facts 😄.

Apache Paimon takes its name from a character in Genshin Impact, and the following illustration summarizes the relationship between Apache Flink and Paimon 😊

Note: Apache Paimon has a sailfish as its logo, an animal widely regarded as the fastest in the sea.

Apache Flink Heritage

Born under the Apache Flink umbrella, Paimon is a first-class citizen with Apache Flink as the compute engine. It leverages Flink’s sophisticated batch and stream capabilities to enable streaming data and analysis on the Lakehouse while keeping latencies and resource usage low.

It uses LSM (Log Structured Merge Trees) as the underlying data structure and uses the concept of the bucket for data separation. This allows a single table abstraction to achieve Message Queue functionality, Range Scan Queries, and Key Value Lookups.

It also provides partitioning that allows operating on slices of the data.

One SQL via Flink, for Ad-Hoc (OLAP), Batch, and Streaming queries.

In the context of Apache Flink, the term Unified is used to denote a single engine for both Batch and Streaming data.

We can think of Batch as a specialized case of Stream and similarly OLAP as a specialized case of Batch.

Similarly, Apache Paimon inherits the term Unified lakehouse storage, i.e. a single storage layer abstraction to handle batch, OLAP, and streaming data.

LSM (Log-Structured Merge Tree)

LSM is the go-to data structure for high-speed data ingestion, used by many popular data systems like RocksDB, Cassandra, ScyllaDB, ClickHouse, and more. Paimon combines the LSM structure with the lake format, bringing real-time streaming updates into the lakehouse architecture. It supports columnar file formats like Parquet and ORC, as well as row-based formats like Avro.
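
For readers new to LSM trees, here is a toy in-memory sketch of the idea in Python. The `TinyLSM` class is hypothetical and heavily simplified (no levels, no files, no deletes); it is not how Paimon or RocksDB are implemented, only an illustration of buffered writes, sorted runs, and compaction.

```python
from typing import Dict, List, Optional


class TinyLSM:
    """Writes go to a small in-memory buffer that is flushed as a sorted run;
    reads check the newest data first; compaction merges runs."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable: Dict[str, int] = {}
        self.runs: List[Dict[str, int]] = []  # sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: int) -> None:
        self.memtable[key] = value  # cheap write, no lookup of old data
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        if self.memtable:
            self.runs.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key: str) -> Optional[int]:
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:  # newest run wins
            if key in run:
                return run[key]
        return None

    def compact(self) -> None:
        merged: Dict[str, int] = {}
        for run in reversed(self.runs):  # oldest first, so newer values overwrite
            merged.update(run)
        self.runs = [dict(sorted(merged.items()))] if merged else []


db = TinyLSM(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.put(key, value)

runs_before = len(db.runs)   # two flushed runs, both containing key "a"
latest_a = db.get("a")       # the newest run shadows the older value
db.compact()
runs_after = len(db.runs)    # a single merged run
```

Updates are absorbed as plain appends, and the cost of reconciling old and new values is deferred to compaction, which is why the structure suits high-speed ingestion.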

When dealing with streaming data, we typically see two types of streams:

Changelog Streams: for dealing with upserts and CDC data, by specifying some primary key field.
Append-Only Streams: for dealing with append-only data and keeping complete log data.

Apache Paimon provides different table modes to account for different scenarios:

Primary Key Table: for handling updates and CDC data on the Lakehouse.
Append Table: for high-speed ingestion when strict ordering guarantees are not a requirement; this table mode is not bound by the parallelism.
Bucketed Append Table: for cases where strict ordering guarantees are a requirement. Each bucket acts like a partition of a Kafka topic, which allows implementing message queue functionality on the Lakehouse.
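
The bucket idea can be illustrated with a small Python sketch. The hash function (`zlib.crc32`) and the helper names are assumptions chosen for the example; Paimon’s actual bucket assignment is not shown here.

```python
import zlib
from typing import List, Tuple


def bucket_for(key: str, num_buckets: int) -> int:
    """Deterministically map a key to a bucket (illustrative hash only)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def append(buckets: List[List[Tuple[str, int]]], key: str, value: int) -> None:
    """Append a record to the bucket that owns its key."""
    buckets[bucket_for(key, len(buckets))].append((key, value))


buckets: List[List[Tuple[str, int]]] = [[] for _ in range(4)]
for sequence, key in enumerate(["user-1", "user-2", "user-1", "user-3", "user-1"]):
    append(buckets, key, sequence)

# All records of a key land in one bucket, in the order they were written.
home = buckets[bucket_for("user-1", 4)]
user1_values = [value for key, value in home if key == "user-1"]
```

As with a partition of a Kafka topic, ordering is guaranteed within a bucket, not across buckets.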

Strong Integration with Apache Flink CDC

As a “best buddy to Apache Flink”, Paimon integrates seamlessly with Apache Flink CDC and fully automates the CDC Lakehouse ingestion.

It can handle all the upstream snapshot and incremental data reading, schema table changes, and perform a full database synchronization.

Vendors like Ververica also provide more functionality, like new table discovery and source merge optimizations.

Currently, it supports operational data stores like PostgreSQL, MySQL, MongoDB, and Amazon RDS.

At the same time, it also supports streaming storage layers, like Apache Kafka, Apache Pulsar, and Redpanda. For such systems, it supports several CDC formats and can extract all the schema changes from the message payload.

Ingestion with Tags

When performing streaming CDC ingestion, data is synchronized in real time, but for historical analysis scenarios, it might be required to create daily views for analysis and time travel.

This is important for querying the state of your tables at different points in time or identifying what changed between different snapshots. This helps with use cases when you want to perform data backfills, identify data corruptions, or restore to previous versions.

For example, you might want to analyze business sales, across different quarters, identify data that has been corrupted or deleted, or reproduce business reports.

While doing CDC ingestion users can leverage Tags to achieve the above.

The important part here is the file reuse mechanism, made possible by the LSM structure, which avoids redundant storage costs.

Typically, with other data lake/lakehouse solutions, creating a tag means maintaining a full copy of the data for that tag. For example, in scenarios that require keeping data around for 100 days, 100 copies of that data need to be stored. Apache Paimon can leverage file reuse since it uses LSM.

What is file reuse? The LSM structure is as follows:

The minor compaction of the LSM means that incremental data will only merge the files in the first few levels (levels 0 and 1 for example).

This means that the lower levels (which also hold most of the data) will not participate in the compaction process (unless there is too much incremental data). These large files can be reused across multiple tags and will not cause redundant storage amplification.

In the above scenario, where data needs to be kept for 100 days, Paimon’s file reuse will result in storing roughly 1 or 2 copies of the data, instead of 100 copies.

File reuse happens automatically via Paimon’s Snapshot management; the end user doesn’t have to consider it.
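
A back-of-the-envelope Python sketch shows why file reuse matters. The file names and the three-day example are made up for illustration; the only point is that tags reference files, and a file referenced by many tags is stored once.

```python
from typing import Dict, Set


def stored_files(tags: Dict[str, Set[str]]) -> Set[str]:
    """Physical files that must be kept: the union of everything tags reference."""
    files: Set[str] = set()
    for referenced in tags.values():
        files |= referenced
    return files


# Large, rarely compacted files in the lower LSM levels (hypothetical names).
base = {"L5-big-0.parquet", "L5-big-1.parquet"}

# Each daily tag references the shared base files plus that day's small file.
tags = {f"day-{d}": base | {f"L0-day-{d}.parquet"} for d in range(1, 4)}

referenced = sum(len(files) for files in tags.values())  # what full copies would cost
physical = len(stored_files(tags))                       # what is actually stored
```

With full copies per tag, the large base files would be stored three times; with reuse they are stored once and only the small incremental files grow with the number of tags.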

Cost-efficient Streaming ETL

Apache Paimon provides a rich variety of merge engines that control how data updates are handled. This is important when dealing with data mutations coming from changelog streams.

All state that needs to be accumulated and stored, for example when computing aggregations, can now leverage object storage instead of Flink’s state.

This reduces local disk storage and managed memory cost requirements.

Supported merge engines include:

Deduplication and First Row: remove duplicates, keeping the latest or the first entry for each key respectively.
Aggregation: a rich variety of aggregation functions (with more being added continuously). The end user only needs to specify the aggregation function and Paimon will handle the rest.
Partial-Update: a more efficient way of dealing with table widening on the primary key. It allows for replacing expensive streaming joins and can be used along with the aggregation engine. The partial-update engine also provides built-in functionality via sequence groups that allow the user to control how updates are handled, for example when records arrive late.
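
The behavior of these merge engines can be approximated with small Python functions. This is a conceptual sketch with hypothetical names; it models each engine as a function that merges the existing row with a new row for the same primary key, and it ignores sequence groups and the actual Paimon configuration.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Row = Dict[str, Optional[int]]


def deduplicate(old: Row, new: Row) -> Row:
    return new  # keep the latest row


def first_row(old: Row, new: Row) -> Row:
    return old  # keep the first row


def aggregate_sum(old: Row, new: Row) -> Row:
    # Sum every column; a missing or None value counts as 0.
    return {c: (old.get(c) or 0) + (new.get(c) or 0) for c in sorted(set(old) | set(new))}


def partial_update(old: Row, new: Row) -> Row:
    merged = dict(old)
    for column, value in new.items():
        if value is not None:  # only non-null columns overwrite
            merged[column] = value
    return merged


def merge_all(
    records: Iterable[Tuple[int, Row]], engine: Callable[[Row, Row], Row]
) -> Dict[int, Row]:
    """Apply a merge engine to all records that share a primary key."""
    table: Dict[int, Row] = {}
    for key, row in records:
        table[key] = engine(table[key], row) if key in table else row
    return table


records = [
    (1, {"clicks": 1, "views": None}),
    (1, {"clicks": None, "views": 5}),
    (1, {"clicks": 2, "views": 1}),
]

latest = merge_all(records, deduplicate)
first = merge_all(records, first_row)
summed = merge_all(records, aggregate_sum)
widened = merge_all(records[:2], partial_update)
```

The partial-update case shows the table-widening use: two sources each fill in their own columns for the same key, and the merged row ends up complete without a streaming join.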

Compaction Computing Costs

Since Paimon uses LSM and compaction runs automatically, there might be concerns about computing costs. Running the compaction process alongside the job might require more resources, and it might seem better to have a scheduled compaction job that runs periodically.

The design goal of Paimon is to be low cost, and thus the compaction process is lightweight. For scenarios with a lot of incremental data, the writer might experience backpressure as it needs to wait for the compaction process to finish. In such scenarios, compaction can be configured to run completely asynchronously.

At the same time, users can always run dedicated compaction jobs if required, which also helps to separate compaction resources.

A dedicated compaction job is also recommended for scenarios where multiple jobs are writing to the same table concurrently.

Changelog Producers

Apache Paimon has the concept of the changelog-producers.

Database systems generate and keep track of changes. Similarly, the changelog-producers generate a complete changelog, i.e. they track and record the different operations that can take place (+I, -U, +U, -D), which is important for downstream consumers to always “see” correct results.

Changelog producers also make it possible to disable the Normalize operator (set `scan.remove-normalize` to true), an expensive operator that Flink uses to ensure result correctness.

Consumer Mechanism for Streaming Reads

To connect upstream and downstream tables via streaming, consumers need to leverage the changelog, and the storage needs to provide strong streaming read capabilities overall.

Table formats provide version management mechanisms that are relatively heavy. They generate snapshots that not only manage incremental files but also manage files during the compaction process.

In production, snapshots are continuously generated, but because too many snapshots will result in a large number of files and redundant data storage, a cleanup mechanism is needed. However, with other table formats, consumers are not aware of this.

If the snapshots expire and your consumers are reading “older” data, you are likely to run into FileNotFoundExceptions.

To address this Paimon has a consumer mechanism and leverages consumer-ids. The progress of the consumers is recorded in the file system.

Snapshot expiration will not delete the required snapshots and can ensure the safety of the stream reading and provide a stream reading progress similar to Kafka’s group-id.
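
The interaction between snapshot expiration and consumer progress can be sketched as follows in Python. The function, its parameters, and the convention that a consumer records the next snapshot id it will read are assumptions for illustration, not Paimon’s actual implementation.

```python
from typing import Dict, List


def surviving_snapshots(
    snapshots: List[int], retain_latest: int, consumers: Dict[str, int]
) -> List[int]:
    """Return the snapshot ids that survive expiration.

    `consumers` maps a consumer id to the next snapshot id it will read.
    """
    if not snapshots:
        return []
    # Without consumers, only the newest `retain_latest` snapshots are kept.
    latest_cutoff = max(snapshots) - retain_latest + 1
    # The slowest consumer pins every snapshot it has not read yet.
    consumer_cutoff = min(consumers.values(), default=latest_cutoff)
    cutoff = min(latest_cutoff, consumer_cutoff)
    return [s for s in snapshots if s >= cutoff]


snapshots = list(range(1, 11))  # snapshot ids 1..10

without_consumers = surviving_snapshots(snapshots, retain_latest=3, consumers={})
with_consumers = surviving_snapshots(
    snapshots, retain_latest=3, consumers={"job-a": 5, "job-b": 9}
)
```

In this sketch the slow consumer `job-a` keeps snapshots 5 to 7 alive, so it never reads a snapshot whose files have already been deleted.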

Separation of Changelog Lifecycle Management

The consumer mechanism safeguards against snapshot expiration and the unpleasant scenario where your data gets deleted while you are stream reading.

Still, older snapshots need to be kept around for longer, resulting in redundant storage amplification.

Paimon addresses the problem of keeping too many files around by separating the management of Snapshot and Changelog files.

When a snapshot needs to be deleted, its changelog files get stored separately for management, allowing the users to keep around only the relevant data.

For example, you can store the changelog files for 24 hours or a few days, to ensure your downstream consumers have the data available for reading. On the other hand, you can set the snapshots to expire after one or two hours.

This functionality provides an experience consistent with message queues, which you can also implement with Apache Paimon, but at much lower costs.

Ecosystem

When it comes to building a Lakehouse or a Streaming Lakehouse the ecosystem is an important part.

This requirement is set to provide openness and enable organizations to have a true “single source of truth” for their data; then any query engine can be integrated on top.

You always have the option to leverage Flink SQL for your batch and OLAP queries.

Side Note:

Although Apache Flink has been historically known for its stream processing capabilities, over the last releases (see 1.18+) it has caught up with industry standards such as Apache Spark in terms of batch processing.

It has added features like:

✅ Operator Fusion Codegen: to optimize the code generated by SQL planner.
✅ Adaptive local hash aggregate: to dynamically decide whether to use local aggregations.
✅ Runtime filters and dynamic data pruning: to optimize data processing efficiency.
✅ Adaptive Execution Plan (AQE): to perform automatic concurrency inference and dynamic load balancing.

Of course, there is no “one size fits all” solution, and other query engines might have richer ecosystems, for example Apache Spark in terms of data science and ML.

Apache Paimon supports all the usual suspects with good connector implementations, like Apache Spark, Trino, Presto, StarRocks, Apache Doris, and Apache Hive.

The Trino and StarRocks integrations allow for blazing-fast OLAP queries directly on the Lakehouse, especially when used in combination with Deletion Vectors.

Openness and integration are an important piece.

Although Apache Paimon supports all the usual suspects on the open-source side, there is always the need for integration with commercial systems. These systems can include Databricks (there is already a strong integration with Apache Spark), Starburst, Snowflake, and Amazon Athena.

At the same time, there are two kinds of users currently observed:

Those who have already implemented a Lakehouse and introduce Apache Paimon alongside Apache Iceberg tables to unlock more use cases.
Those who are getting started, invest in Apache Flink, and require integrations with commercial offerings like Snowflake and Amazon Athena.

To address the above, and since Apache Paimon shares a similar file layout to Apache Iceberg due to its simplicity, in the next release Apache Paimon will introduce an Apache Iceberg snapshot so that users can integrate with more engines.

Key Takeaways

Some key takeaways here include that when it comes to Apache Paimon:

It has a strong integration with Apache Flink and Flink CDC, providing a seamless analytical solution.
It allows efficient handling of changelog streams — data mutations — on the lake; required for analyzing streaming data.
It provides a single table abstraction for batch, OLAP, and streaming data.
It allows streaming data and real-time analysis on the lake and connecting upstream and downstream tables, by providing lots of safeguards for streaming reads.
It aims to keep resources and latencies low to enable users to make better trade-offs between costs and latencies.

To conclude, the current status quo includes real-time streaming via Apache Kafka and Flink and then offloading — typically append-only data — on a Lakehouse; then query engines take over, which is more batch-oriented.

If you want to do more, though, such as handling data mutations and leaving the processing to Flink’s excellent capabilities, creating real-time views ready to be queried by other engines, or replacing expensive streaming workloads that can afford some extra latency, then Apache Flink along with Paimon might be a good alternative for you.

Make sure to keep an eye on the project, give it a try and if you like it, don’t forget to give it some ❤️ via ⭐ on GitHub.

A Glimpse into Flink CDC 3.0

Giannis Polyzos — Tue, 21 May 2024 06:14:53 GMT

The Next-generation of Streaming Data Integration

Apache Flink CDC 3.1.0 was released last week, marking the first official release since the project was donated to the Apache Software Foundation as a sub-project of Apache Flink.

Disclaimer: This article was written with Leonard Xu, Apache Flink PMC member and Flink CDC project lead.

This blog post aims to provide some insight into the goals of Flink CDC 3.0. Flink CDC 2.0 set the bar high as a CDC solution and has been adopted in many large-scale production environments to address problems users would typically come across with alternative solutions.

The hope is to help engineers and architects better understand the project and why they should consider this solution regardless of scale.

Introduction

Flink CDC is a real-time data integration framework based on the Change Data Capture (CDC) technology of database changelogs.

It provides multiple advanced features, such as full and incremental data synchronization, lock-free reading, parallel reading, automatic synchronization of schema changes, and distributed architecture, on top of Apache Flink’s excellent processing capability and robust ecosystem.

Full and incremental data synchronization refers to the process of reading all the historical data within the database and then automatically switching to reading the incremental data.

Background

Flink CDC 2.0 gained a lot of traction. However, although it provided many benefits compared to existing CDC solutions, which resulted in strong adoption, users experienced the following pain points:

User experience: Flink CDC provides only source connectors and does not support end-to-end data integration, making it difficult to create jobs via SQL syntax or the DataStream API.
Frequent maintenance: Frequent table creation and deletion operations are necessary due to the frequent changes of schemas in source databases.
Scalability: Large amounts of resources are required to synchronize data from thousands of tables and ingest tens of thousands of tables into data lakes or data warehouses. In addition, scaling cannot be automatically performed to handle different resource requirements for the full synchronization and incremental synchronization stages.

To tackle the above challenges, Flink CDC 3.0 was introduced.

Flink CDC was donated to the Apache Software Foundation, aspiring to become a complete streaming data integration framework based on the following design principles:

End-to-end experience: As an end-to-end data integration framework, Flink CDC 3.0 provides high-level abstractions for setting up data movement pipelines easily.
Schema Synchronization: It can automatically synchronize schema changes from upstream to downstream systems and lets users add tables to existing jobs at any time.
Elasticity: Idle resources can be automatically reclaimed, and a single sink instance can write to multiple tables simultaneously.
Large data volume: Users' legacy databases can be large, commonly containing over 100 TB of data.
Real-time processing of incremental data: The business value of incremental data is higher than that of historical data but decreases over time, which leads to high requirements for data freshness; new incoming events need to be processed as soon as possible.
Data ordering: Support for global preservation of data ordering to ensure the consistency of processed data.

Design of Flink CDC 3.0

Architecture

The architecture of Flink CDC 3.0 is divided into four layers:

Flink CDC API: YAML-formatted API operations are provided to help end users configure data synchronization pipelines. Users can call the API operations in Flink CDC CLI.
Flink CDC Connect: Source and sink connectors are provided to interact with external systems. Flink CDC 3.0 encapsulates the source connectors of Apache Flink and Flink CDC to read and write data to external systems.
Flink CDC Composer: This layer translates data synchronization tasks into Flink DataStream jobs.
Flink CDC Runtime: Custom Flink operators are provided for different data synchronization scenarios to implement advanced features, such as schema changes, routing, and transformations.

User-Friendly API Design

Flink CDC 3.0 is tailored for seamless streaming data integration scenarios. Users do not need to worry about the implementation details of the framework.

They can easily create data synchronization pipelines with a YAML file that configures data sources, sinks, and intermediate transformations or routes.

The following figure shows a sample YAML file for synchronizing data from a MySQL database to Apache Kafka or Paimon.

Ingestion Pipeline from MySQL to Kafka or Paimon

Pipeline Connector API

To facilitate the integration of external systems into data synchronization pipelines, Flink CDC 3.0 introduced the Pipeline Connector API.

DataSource: it is used to collect change events from external systems and pass them to downstream operators. It is composed of the EventSourceProvider and MetadataAccessor. EventSourceProvider builds Flink sources, whereas MetadataAccessor accesses metadata.
DataSink: it is used to apply schema changes received from upstream operators and write the changed data to external systems. It is composed of EventSinkProvider and MetadataApplier. EventSinkProvider builds Flink sinks, whereas MetadataApplier applies metadata changes (such as table schema changes) to the destination system.

To ensure compatibility with the Flink ecosystem, the design of DataSource and DataSink follows the same logic as Apache Flink. Developers can easily integrate external systems with Flink CDC 3.0 by using Flink connectors.

Core Features of Flink CDC 3.0

To achieve high performance in scenarios such as schema changes, full database synchronization, and table merging, Flink CDC 3.0 integrates the capabilities of Apache Flink and provides multiple custom Flink operators to support various synchronization modes.

Schema Evolution

Schema evolution is a common but challenging feature of data synchronization frameworks. Flink CDC 3.0 introduces a SchemaRegistry to map jobs in topology and uses a SchemaOperator to manage schema changes in job topologies.

Here’s how Flink CDC 3.0 handles schema changes:

When a schema change is detected in a data source, SchemaRegistry issues a pause request to SchemaOperator. After receiving the request, SchemaOperator pauses the streaming ingestion and flushes the buffered data to maintain schema consistency.
Once the schema change is synchronized to the external system, SchemaRegistry issues a resume request to SchemaOperator. After receiving the request, SchemaOperator resumes the streaming ingestion.
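
The pause, flush, apply, and resume sequence above can be sketched in a few lines of Python. This is only an illustration of the ordering and not Flink CDC code: the class names and methods are hypothetical, and the roles of SchemaRegistry and SchemaOperator are collapsed into a single object for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Sink:
    """Stand-in for an external system: a table schema plus the rows written so far."""
    columns: list
    rows: list = field(default_factory=list)


class SchemaChangeCoordinatorSketch:
    """Hypothetical helper: buffers rows and flushes them before a new schema is applied."""

    def __init__(self, sink):
        self.sink = sink
        self.buffer = []
        self.paused = False
        self.log = []  # records the order of the steps, for illustration

    def on_row(self, row):
        if self.paused:
            raise RuntimeError("ingestion is paused during a schema change")
        self.buffer.append(row)

    def flush(self):
        # Write everything that was buffered under the current schema.
        self.sink.rows.extend(self.buffer)
        self.buffer.clear()

    def on_schema_change(self, new_columns):
        self.paused = True
        self.log.append("pause")
        self.flush()  # rows written so far still match the old schema
        self.log.append("flush")
        self.sink.columns = list(new_columns)  # apply the change downstream
        self.log.append("apply")
        self.paused = False
        self.log.append("resume")


sink = Sink(columns=["id", "name"])
op = SchemaChangeCoordinatorSketch(sink)
op.on_row({"id": 1, "name": "a"})
op.on_schema_change(["id", "name", "email"])
op.on_row({"id": 2, "name": "b", "email": "b@example.com"})
op.flush()
```

The point of the ordering is that no row written under the old schema is still in flight when the downstream table changes.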

Full Database Synchronization

Users can specify a multi-table or full database synchronization task by configuring the DataSource in the configuration file of Flink CDC 3.0.

The schema evolution feature enables automatic synchronization for the entire database. When new tables are detected, SchemaRegistry automatically creates replicas in the destination system.

Table Merging

Another common use case of Flink CDC 3.0 is merging multiple source tables into a single sink table. Flink CDC 3.0 employs a Route mechanism to implement table merging and synchronization. Users can define routing rules in the configuration file of Flink CDC 3.0 by using regular expressions to specify the source tables and the sink table.
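
To make the routing idea concrete, here is a minimal Python sketch of regex-based table merging. The rule list, table names, and the `route` function are assumptions made for illustration; they are not the Flink CDC configuration syntax or API.

```python
import re

# Hypothetical routing rules: (regular expression for source tables, sink table).
ROUTES = [
    (re.compile(r"app_db\.orders_\d+"), "ods_db.orders"),
    (re.compile(r"app_db\.users_\d+"), "ods_db.users"),
]


def route(source_table, routes=ROUTES):
    """Return the sink table for a source table; unmatched tables keep their own name."""
    for pattern, sink_table in routes:
        if pattern.fullmatch(source_table):
            return sink_table
    return source_table


# Group a few (made-up) sharded source tables by the sink table they are routed to.
merged = {}
for table in ["app_db.orders_01", "app_db.orders_02", "app_db.users_01", "app_db.products"]:
    merged.setdefault(route(table), []).append(table)
```

Here the two sharded `orders_*` tables end up in a single sink table, while `app_db.products` matches no rule and is passed through unchanged.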

High-performance Data Structure

To reduce serialization overhead during data transmission, Flink CDC 3.0 adopts a high-performance data structure.

Schemaless deserialization: Schemaless deserialization decouples schema information from changed data. Before sending changed data, DataSource sends the schema description, which is tracked by the framework. This way, schema information does not need to be bound to each changed record, and the serialization cost for wide tables is significantly reduced.
Binary storage format: Data is stored in a binary format during synchronization. Deserialization is performed only when the detailed data of a field is read (such as when the table is partitioned by the primary key) to reduce serialization costs.
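
The two ideas above can be illustrated with Python's standard library. This is a rough sketch and not Flink CDC's actual wire format: the schema layout, the field type codes, and the function names are all assumptions. The schema is sent once, records carry only packed values, and a single field is decoded on demand.

```python
import json
import struct

# Hypothetical schema: field names with struct type codes (q = int64, i = int32, d = float64).
SCHEMA = {
    "table": "orders",
    "fields": [["order_id", "q"], ["order_state", "i"], ["order_fee", "d"]],
}


def encode_schema(schema):
    """Sent once, ahead of the change records."""
    return json.dumps(schema).encode("utf-8")


def encode_record(schema, values):
    """Pack only the values; field names and types live in the schema message."""
    fmt = "<" + "".join(code for _, code in schema["fields"])
    return struct.pack(fmt, *values)


def decode_field(schema, payload, name):
    """Decode one field on demand, leaving the rest of the record in binary form."""
    offset = 0
    for field_name, code in schema["fields"]:
        if field_name == name:
            return struct.unpack_from("<" + code, payload, offset)[0]
        offset += struct.calcsize("<" + code)
    raise KeyError(name)


schema = json.loads(encode_schema(SCHEMA))  # what a downstream operator would keep track of
payload = encode_record(schema, (42, 1, 19.99))

# For comparison: the same record with the field names bound to every row.
self_describing = json.dumps(
    {"order_id": 42, "order_state": 1, "order_fee": 19.99}
).encode("utf-8")
```

The packed record is 20 bytes, noticeably smaller than the self-describing version, and the gap grows with the number of columns, which is why the saving matters most for wide tables.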

In addition to fundamental data synchronization capabilities, Flink CDC 3.0 provides multiple advanced features, such as automatic synchronization of schema changes, full database synchronization, and table merging and synchronization, to cater to complex data integration scenarios.

The automatic synchronization of schema changes frees users from manual intervention when schema changes occur in a data source, greatly reducing operational costs.

Moreover, only a few operations are needed to configure a multi-table or multi-database synchronization task, facilitating users’ development.

Conclusion

Apache Flink CDC 3.0 sets a new direction and the future looks quite promising.

It already supports a rich variety of connectors; as of version 3.1.0, the streaming data integration framework supports the following out of the box: MySQL, Apache Doris, StarRocks, Apache Kafka, and Apache Paimon.

Apache Flink CDC 3.0 is part of the unified stack.

It also powers Ververica’s streaming data movement framework.

Unified Ingestion refers to the process of being able to read all the historical data within the database (batch reads) and then without locking the database, automatically switch to reading the incremental data (streaming reads).

At the same time, the framework needs to be able to ensure data consistency, downscale resources, and make sure it doesn't put pressure on the source system.

Make sure to keep an eye on the project, give it a try and if you like it, don’t forget to give it some ❤️ via ⭐ on GitHub.

Apache Paimon: Introducing Deletion Vectors

Giannis Polyzos — Mon, 13 May 2024 06:36:44 GMT

Near real-time updates and extremely fast queries

Apache Paimon is now a Top-Level Project under the Apache Software Foundation.

It started as a project under the Apache Flink umbrella (originally called Flink Table Store) and quickly moved from an early-stage umbrella project to an Apache Incubator project. In less than 12 months, it has been deployed in many production environments and has grown a strong community.

Apache Paimon uses LSM trees (Log-structured merge trees), the go-to data structure for high-speed data ingestion, used by many popular data systems like RocksDB, Cassandra, ScyllaDB, ClickHouse, etc. It innovatively combines the LSM structure with the lake format, bringing real-time streaming updates into the lake architecture.

Apache Flink is a unified compute engine and Apache Paimon provides unified lakehouse storage. When integrated with Flink CDC which provides unified data ingestion, we get an end-to-end unified batch & streaming stack for real-time data analytics.

More information on Apache Flink’s roadmap and the vision for the project (along with Flink CDC and Apache Paimon) can be found here.

Unified Storage aims to provide a single table abstraction for batch, stream, and OLAP queries.

Background

Paimon 0.8 was released last week and marks the first official release after the project’s graduation.

This new release brings many important features and improvements, along with important doc improvements.

In this blog post, we will explore a new feature called deletion vectors.

We will explore why they are needed, and how they can help provide an even better balance between streaming writes and fast queries.

Disclaimer: This article contains contributions from Jingsong Li, PMC chair of Apache Paimon.

Business Use Case

Let’s take a business example: we have an orders table in our database and want to ingest it into the data lake.

CREATE TABLE orders (
    order_id          BIGINT,
    order_name        STRING,
    order_user_id     BIGINT,
    order_shop_id     BIGINT,
    order_product_id  BIGINT,
    order_fee         DECIMAL(20, 2),
    order_create_time TIMESTAMP(3),
    order_update_time TIMESTAMP(3),
    order_state       INT,
    PRIMARY KEY (order_id) NOT ENFORCED
);

Once the data is in the lake, you can schedule batch ETL jobs and run analytical queries on it. The general structure is as follows:

Generally speaking, Batch ETL doesn’t have high requirements for read performance and can be completed in minutes. Analytical queries on the other hand need to return within seconds; we need to provide analysts with a good user experience 😄

Next, let’s take a look at how Paimon’s underlying design satisfies the above architecture.

Primary Key Table

This table requires a primary key to be set. It can handle updates automatically in real time and can also be queried in real time.

The basic file structure of Paimon is as follows:

The table or partition contains multiple buckets, and each bucket is a separate LSM tree structure containing multiple files.

The writing process of the LSM is as follows: with every checkpoint, batches of data get stored on disk. L0 files are flushed to disk, and compaction is triggered automatically to merge the data and handle small files.

By default, Paimon doesn’t require you to make MoR or CoW tradeoffs — due to the LSM, but it can mimic similar behavior:

MoR (Merge On Read): The data merging process is semi-asynchronous by default; when there are too many L0 files (i.e., lots of incremental data), it might backpressure the writing. To avoid this, you can set the compaction to be fully asynchronous (no backpressure during writing).
CoW (Copy On Write): The merge can also be made synchronous; that is, every write triggers a full compaction that merges all the files.

Merge-On-Read

The MoR mode requires merging all files. Because the files are ordered, a multi-way merge is performed, which requires comparing primary keys.

There is a problem here.

A single LSM tree can only be read by a single thread, so the parallelism of reads is limited. If the amount of data within a bucket is too large, it can result in poor reading performance.

The recommended bucket size is between 200 MB and 1 GB, which keeps query times under 10 seconds.

On the other hand, if the Bucket is too small, there will be more small files being read and written, which will put pressure on the file system.

In addition, due to the merging process, filter-based data skipping can’t be performed on non-primary key columns; otherwise, the newest version of a row could be filtered out and a stale version returned instead.

This mode provides the best writing performance, as data does not need to be forcibly merged. However, when reading the LSM, there are performance issues due to multi-way merging:

Single LSM, single thread, limited concurrency.
Non-primary key columns cannot be used for filter push-down.
Multi-way merging itself adds processing overhead.
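
The following Python sketch illustrates the multi-way merge and why filters on non-primary key columns cannot be pushed below it. It is a simplified model and not Paimon's reader: each sorted run is a plain list of (primary key, value) pairs, runs are ordered from oldest to newest, and the function name is made up.

```python
import heapq


def merge_on_read(sorted_runs):
    """Merge key-sorted runs (oldest first) and yield the newest value per primary key."""
    # Tag each row with its run's sequence number so that, for equal keys,
    # rows from newer runs come out of the merge last.
    tagged = [
        [(key, seq, value) for key, value in run]
        for seq, run in enumerate(sorted_runs)
    ]
    has_current = False
    current_key = current_value = None
    for key, _seq, value in heapq.merge(*tagged):
        if has_current and key != current_key:
            yield current_key, current_value
        current_key, current_value, has_current = key, value, True
    if has_current:
        yield current_key, current_value


runs = [
    [(1, "created"), (2, "created"), (3, "created")],  # oldest file
    [(2, "paid")],
    [(2, "shipped"), (4, "created")],                  # newest file
]

merged = list(merge_on_read(runs))

# Correct: filter on the non-key column after merging.
filtered_after_merge = [row for row in merged if row[1] == "created"]

# Incorrect: pushing the same filter below the merge drops the newer versions
# of key 2, so its stale "created" row reappears in the result.
pushed_down = list(
    merge_on_read([[row for row in run if row[1] == "created"] for run in runs])
)
```

Order 2 is "shipped" in the merged view, so it should not match the filter; with the filter pushed down, it wrongly shows up as "created".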

Copy-On-Write

An intuitive idea is to merge the data directly during writing. We can set full-compaction.delta-commits to 1 to force a full compaction with every new snapshot. This is the most “forceful” mode, so you can experiment with different intervals.

ALTER TABLE orders SET ('full-compaction.delta-commits' = '1');

When reading the data now, there is no need to merge multiple files, so we can get really fast queries. However, since we trigger a full compaction with every write, we incur serious write amplification.

So basically MoR and CoW can provide two extreme alternatives; either really fast writes or really fast queries.

Although Paimon’s MoR (due to the LSM) is sufficient in most cases, such as Batch ETL, it falls short in some scenarios that require high-performance analytical queries.

Is there a mode in which we can achieve a better trade-off between reads and writes?

Deletion Vectors

Paimon 0.8 introduces Deletion Vectors.

The Deletion Vectors mode is designed to take into account both data reading and writing efficiency.

Vectors are generated during writing, representing which data in the file has been deleted. Unnecessary data can be directly filtered out when reading. This is equivalent to the completion of the merge during full compaction and doesn’t affect read performance.

A simple example is as follows:

Deleted rows are marked directly in the deletion file. An upsert updates data by first deleting the old row and then adding the new one.

Let’s look at this mode of reading and writing:

Read performance: during reading, data files can be read directly and filtered with their deletion vectors, avoiding additional merge costs between different files. Furthermore, read concurrency is no longer limited, and non-primary key columns can also be used for filter push-down.
Write performance: additional overhead (looking up the LSM tree and generating the corresponding deletion file) is introduced during writing. The writer needs to look up existing rows with the same primary key, mark them as deleted, and modify the deletion vectors of the historical files.

Generally speaking, in this mode, we can get a huge improvement in read performance without losing too much write performance.

Dealing with deletion files is also easy with Paimon, as it uses LSM and the most common application of LSM is point lookups.

This means we can use the point lookup capability of Paimon’s LSM to quickly find the files that contain the rows to be deleted, along with their row positions.

When data is written, Paimon performs a lookup on the LSM tree and produces the corresponding deletion file. The deleted data can then be filtered out directly when reading.

For those familiar with Paimon, this is the same underlying mechanism as the lookup changelog-producer. Each bucket produces a corresponding deletion file.
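
As a rough mental model of this write and read path, consider the Python sketch below. It is not Paimon's implementation: a dictionary stands in for the LSM point lookup, a Python set stands in for the compressed bitmap, and the class and method names are invented for illustration.

```python
class BucketSketch:
    """Toy model of one bucket: immutable data files plus per-file deletion vectors."""

    def __init__(self):
        self.files = []             # each file is a list of (key, value) rows
        self.index = {}             # key -> (file_no, row_pos); stands in for the LSM point lookup
        self.deletion_vectors = {}  # file_no -> set of deleted row positions

    def _mark_deleted(self, file_no, row_pos):
        self.deletion_vectors.setdefault(file_no, set()).add(row_pos)

    def write_file(self, rows):
        """Upsert: look up the previous row for each key, mark it deleted, then add the new row."""
        file_no = len(self.files)
        self.files.append(list(rows))
        for pos, (key, _value) in enumerate(rows):
            previous = self.index.get(key)
            if previous is not None:
                self._mark_deleted(*previous)
            self.index[key] = (file_no, pos)

    def delete(self, key):
        """Delete: only the deletion vector changes; the data file is left untouched."""
        self._mark_deleted(*self.index.pop(key))

    def read_file(self, file_no, predicate=lambda row: True):
        """Each file can be read on its own: skip deleted positions, then apply any filter."""
        deleted = self.deletion_vectors.get(file_no, set())
        return [
            row
            for pos, row in enumerate(self.files[file_no])
            if pos not in deleted and predicate(row)
        ]

    def scan(self, predicate=lambda row: True):
        return [row for n in range(len(self.files)) for row in self.read_file(n, predicate)]


bucket = BucketSketch()
bucket.write_file([(1, "created"), (2, "created")])
bucket.write_file([(2, "paid"), (3, "created")])  # upsert of key 2
bucket.delete(1)

live_rows = bucket.scan()
created_only = bucket.scan(lambda row: row[1] == "created")
```

Because every file is filtered independently, files can be read in parallel and a filter on a non-key column is safe: the stale "created" row for key 2 is already masked by the deletion vector.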

The structure of the file is as follows:

Each file stores its deletion vector as a bitmap. Keeping one deletion file per bucket minimizes the small-files problem that deletion files would otherwise cause.

Each Bitmap uses a RoaringBitmap structure. Apache Iceberg and Delta Lake already use this approach for query acceleration during batch deletion. The RoaringBitmap is a compressed bitmap that can significantly reduce the storage space amplification.

Testing the Effect of Deletion Vectors

Test Environment

We run Apache Flink 1.17 with Paimon 0.8, writing to Amazon S3. DV is disabled by default, so deletion-vectors.enabled needs to be set to true.

Then we will be querying the data using Spark 3.3.1 and Trino 422. The latest Paimon-Trino version has been optimized for ORC reading.

Data size

The table schema is the orders table defined at the beginning of this blog post. We generate 500 million records using the Flink Datagen connector with the primary key ranging from 1 to 1 billion; using bucket = 8.

After the writing is completed, a single bucket contains 40+ files with a total size of ~5 GB (yes, more than the recommended size, in order to test the effect).

Write performance

Without DV enabled: 455 seconds, single-writer throughput of ~135,000 records per second
With DV enabled: 937 seconds, single-writer throughput of ~66,000 records per second

Writes are about twice as slow with DV enabled, but we will continue to optimize this in subsequent versions.
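
As a quick sanity check of these numbers, the arithmetic can be reproduced in Python. One assumption is needed that the text does not state explicitly: that the job used 8 parallel writers, one per bucket, so that the reported figures are per writer.

```python
TOTAL_RECORDS = 500_000_000
WRITERS = 8  # assumption: one parallel writer per bucket (bucket = 8)


def per_writer_throughput(seconds):
    """Records per second handled by a single writer, given the total job duration."""
    return TOTAL_RECORDS / seconds / WRITERS


without_dv = per_writer_throughput(455)  # roughly 137,000 records/s
with_dv = per_writer_throughput(937)     # roughly 67,000 records/s
slowdown = 937 / 455                     # a bit over 2x
```

Under that assumption the results land close to the reported ~135,000 and ~66,000 records per second, and the ratio of the two durations confirms the roughly 2x write slowdown.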

Query performance

As you can see, without DV there is no difference between Trino and Spark, because both share the same reader implementation when files need to be merged.

When DV is turned on, Spark’s query performance is greatly improved, while Trino’s improvement is even greater.

Why? Because with DV no merging is needed, so Trino can use its own ORC reader and keep its columnar data structures. This allows native read performance.

Conclusion

Apache Paimon’s Deletion Vectors mode provides a good balance between reads and writes. We can sacrifice some writing performance, but in return, we can get way faster read queries.

In the future, we can expect optimized Deletion Vectors support for StarRocks. We should probably expect a powerful performance experience, as StarRocks can provide the best OLAP for Paimon.

Apache Paimon keeps enhancing the Streaming Lakehouse experience. As unified lakehouse storage, it supports all scenarios (batch, OLAP, and streaming) at minute-level latencies.

Make sure to keep an eye on the project PIPs and if you like it, don't forget to give it some ❤️ via ⭐ on GitHub.

Understanding Watermarks in Apache Flink

Giannis Polyzos — Tue, 07 Mar 2023 07:29:00 GMT

Stories In the land of Streams

In the realm of time

… one traveler once asked, what is the use of Watermarks?

1. Introduction

Consider the following event stream

and assume we want to perform different operations like sorting the events or performing some aggregates. We perform these operations in specified time intervals, i.e from t1 to t2, t2 to t3, and so on.

As depicted in the picture we can observe events arriving late and out of order. So we ask ourselves:

When do I consider my results complete?
How long should I wait to make sure events don’t arrive out of order?

A watermark helps us address these questions as it provides a way to keep track of the progress of time.

In a stream of events, if we see a watermark of time t, we know that the stream is complete up to that point in time, or at least we consider it complete.

2. How do watermarks work?

All the events flowing through Flink pipelines and being processed are considered StreamElements. These StreamElements can be either StreamRecords (i.e every event that is being processed) or a Watermark.

A watermark is nothing more than a special record injected into the stream that carries a timestamp (t).

It flows through the operators and tells each operator that no more elements with a timestamp older than or equal to the watermark timestamp (t) should arrive.

This way the operator knows that all the results up to this point in time can be considered complete and is ready to emit those results.

3. How are Watermarks generated in Apache Flink?

When working with event-time you need a way to actually tell Flink how to extract the timestamp from the incoming events and generate Watermarks.

Flink allows you to achieve this by using a WatermarkStrategy. A WatermarkStrategy informs Flink how to extract an event’s timestamp and assign watermarks.

The following snippet uses a WatermarkStrategy to extract the eventTime from a ClickEvent.

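import java.time.Duration
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner
import org.apache.flink.api.common.eventtime.WatermarkStrategy

// Tolerate events that arrive up to 5 seconds out of order
// and read the event time from each ClickEvent.
WatermarkStrategy
  .forBoundedOutOfOrderness<ClickEvent>(Duration.ofSeconds(5))
  .withTimestampAssigner(
      SerializableTimestampAssigner<ClickEvent> { event, _ -> event.eventTime }
  )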

The same can be achieved in the Flink SQL API using the following snippet

CREATE TABLE events (
  userid STRING,
  eventTime_ltz AS TO_TIMESTAMP_LTZ(eventTime, 3),
  ....
  userSession STRING,
  WATERMARK FOR eventTime_ltz AS eventTime_ltz - INTERVAL '5' SECONDS
) WITH (...)

Notice the BoundedOutOfOrderness in the first snippet and eventTime_ltz - INTERVAL '5' SECONDS in the second one.

A watermark is basically a heuristic and Flink provides a few built-in Watermark Generators with BoundedOutOfOrderness being one of them.

With it, the user sets a threshold for how long it’s acceptable to wait for late events, i.e., the maximum delay we are willing to tolerate, and Flink sets the watermark accordingly.

Let’s assume that in our event stream example we can afford to wait up to 5 seconds for out-of-order events, but no more. Within those 5 seconds, element 4 can arrive and be included in the computation, but element 9 might be too late (i.e., arrive more than 5 seconds later) and be excluded.
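
The logic of a bounded out-of-orderness generator is small enough to sketch in Python. This is a simplified model and not Flink's class: the names are made up, and Flink emits watermarks periodically rather than on demand. The watermark trails the highest timestamp seen so far by the allowed delay, minus one millisecond.

```python
class BoundedOutOfOrdernessSketch:
    """Tracks the highest event time seen and derives the watermark from it."""

    def __init__(self, max_out_of_orderness_ms):
        self.delay = max_out_of_orderness_ms
        self.max_timestamp = None

    def on_event(self, timestamp_ms):
        if self.max_timestamp is None or timestamp_ms > self.max_timestamp:
            self.max_timestamp = timestamp_ms

    def current_watermark(self):
        if self.max_timestamp is None:
            return None  # nothing seen yet, so no progress can be claimed
        return self.max_timestamp - self.delay - 1

    def is_late(self, timestamp_ms):
        """Simplification: an event at or below the current watermark counts as late."""
        watermark = self.current_watermark()
        return watermark is not None and timestamp_ms <= watermark


generator = BoundedOutOfOrdernessSketch(max_out_of_orderness_ms=5_000)
for ts in (1_000, 2_000, 12_000):
    generator.on_event(ts)

watermark = generator.current_watermark()  # 12_000 - 5_000 - 1
```

With a 5-second bound and a highest timestamp of 12,000 ms, an event stamped 9,000 ms is still accepted, while one stamped 4,000 ms is more than 5 seconds behind and is treated as late.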

Different applications have different needs and there is a trade-off between completeness and latency.

We can add some extra delay for late arriving events, but how much delay is enough?

Latency-sensitive applications may not be able to afford to wait, so:

they may provide incomplete results
or after providing initial (incomplete) early results, provide updated results as late data arrives.

Less timely applications can afford to wait longer for out-of-order events.

4. How are watermarks propagated in Apache Flink?

Each parallel instance of a source operates independently based on the events it processes. Assume the following graph processing two input Kafka topics with two partitions.

The current watermark for a task with multiple inputs is the minimum watermark across all of its inputs.

Look at the Window(1) operator for example. It receives 29 and 14 as watermark inputs and sets the current watermark to 14, the minimum of the two.

When the operator receives the next watermark, it updates its local version of the current watermark and forwards it downstream only after the task has processed it; along with that, the event-time clock also advances.

Note:

Operators that do not internally buffer elements can always forward the watermark that they receive.
Operators that buffer elements, such as window operators, must forward a watermark after the emission of elements that are triggered by the arriving watermark.
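
The minimum rule can be sketched as follows. This is an illustrative Python model with hypothetical names, not Flink's internal code: the operator remembers the latest watermark per input and only forwards a new watermark when the minimum across all inputs advances.

```python
class MultiInputWatermarkSketch:
    """Keeps the last watermark per input and exposes the minimum as the current watermark."""

    def __init__(self, num_inputs):
        self.inputs = [float("-inf")] * num_inputs
        self.current = float("-inf")

    def on_watermark(self, input_index, watermark):
        """Return the new watermark to forward downstream, or None if there is no progress."""
        self.inputs[input_index] = max(self.inputs[input_index], watermark)
        candidate = min(self.inputs)
        if candidate > self.current:
            self.current = candidate
            return candidate
        return None


window = MultiInputWatermarkSketch(num_inputs=2)
first = window.on_watermark(0, 29)   # the other input has reported nothing yet
second = window.on_watermark(1, 14)  # the minimum of 29 and 14
third = window.on_watermark(1, 33)   # input 0 is now the slowest one
```

Using the values from the Window(1) example, the operator sits at 14 until the slower input catches up, at which point it advances to 29.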

5. The curse of Idle Sources

Up to this point, we know that Watermarks help make progress and update the event-time clock.

But what happens when a source becomes idle, i.e produces no data?

With no events arriving, an idle source can cause the entire pipeline to stall, as it has no basis for advancing the current watermark. This is frequently the root cause when a window or join operation emits no results downstream.

Some potential workarounds to the idle source problem can be:

Set the watermark for a mostly idle stream to Watermark.MAX_WATERMARK, e.g., for streams that rarely change over time.
Use withIdleness, which marks streams as idle after some duration with no events. Idle streams do not hold back watermarks from active streams.
Do a rebalance() on the stream in order to mix idle and non-idle streams; this can cause expensive network shuffles though.
Use Watermark Alignment (currently still in beta).
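
To see why marking inputs as idle helps, here is a small Python sketch of the idea behind withIdleness. It models only the effect on the combined watermark; the function, its parameters, and the data layout are assumptions made for illustration and not Flink's API.

```python
def combined_watermark(inputs, now_ms, idle_timeout_ms=None):
    """Minimum watermark over the inputs that are not considered idle.

    inputs: list of dicts with 'watermark' and 'last_event_at' (both in ms).
    Returns None when every input is idle, i.e. nothing can advance the watermark.
    """
    if idle_timeout_ms is None:
        active = inputs  # no idleness detection: every input holds the watermark back
    else:
        active = [i for i in inputs if now_ms - i["last_event_at"] < idle_timeout_ms]
    if not active:
        return None
    return min(i["watermark"] for i in active)


inputs = [
    {"watermark": 45, "last_event_at": 10_000},  # active partition
    {"watermark": 14, "last_event_at": 1_000},   # partition that stopped producing
]

stalled = combined_watermark(inputs, now_ms=10_500)
with_idleness = combined_watermark(inputs, now_ms=10_500, idle_timeout_ms=5_000)
all_idle = combined_watermark(inputs, now_ms=60_000, idle_timeout_ms=5_000)
```

Without idleness detection the watermark stays stuck at 14; with a 5-second timeout the silent partition is ignored and the watermark moves to 45. Once every input is idle there is nothing left to drive progress.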

What if all sources become idle?

No events mean no watermarks and thus timers won’t fire — windows and joins won’t send results downstream.

If the idleness is not temporary, you can emit keep-alive events from your source(s), i.e., implement watermarking that detects idleness and manually advances the watermark.

What is a good rule of thumb for setting Watermarks with some extra delay?

This is a common question for people starting with Apache Flink and in order to be able to make the right choice, try and answer the following:

Is my application latency-sensitive and needs to emit results as soon as possible?
Can I afford to wait an extra time period — at the cost of adding unnecessary extra delay?
How much delay is good enough to ensure a good trade-off between latency and results completeness?

6. Conclusion

In this article, we took a closer look at watermarks and how they work inside Apache Flink.

Watermarks are one of the most important properties of stream processing and at the same time a source of confusion and the culprit for many unexpected behaviors.

Understanding how they work and testing them throughout your application’s lifecycle can save you a few headaches.

Apache Flink SQL: A Gentle Introduction

Giannis Polyzos — Mon, 06 Feb 2023 00:00:27 GMT

Flink SQL is a powerful high-level API for running queries on streaming (and batch) datasets. In this article we will see:

Why it’s powerful and how it helps democratize Stream Processing and Analytics
Understand basic concepts around Streaming and Flink SQL
Setup Kafka and Flink Clusters and get started with Flink SQL
Understand different kinds of Processing Operators and Functions
Different ways of running Flink SQL Queries

1. Streaming (and Batch) SQL

1.1 Unified Batch and Streaming

When we hear about SQL (referenced as batch SQL here) we think of the following tabular format you typically find in RDBMS, on which we operate and run computations — from simple projections (like SELECT and Filter) to Aggregations to Windowed Functions.

Batch SQL Queries operate on static data, i.e. on data stored on disk, already available and the results are considered complete.

How can Tables relate with Streams?

Let’s think of a data stream now

A stream is basically an unbounded dataset of incoming events, i.e., it has no end. At the heart of a stream is the append-only log, i.e., each incoming event can be considered a row that gets appended at the end of the log, similar to a database table.

In practice, if we follow this mental model we can think of a stream as a collection of snapshots of bounded datasets.

This is what enables Unified Batch and Streaming Architecture and allows the use of a single API — like Flink SQL — to handle both Batch and Streaming data; no underlying code changes are needed.

1.2 Streaming SQL Semantics

The rules are as follows:

1. The input tables are constantly changing and possibly unbounded

Append Only Streams: Keeps all the history in the stream. Every new event is an insert operation in the append-only log
Changelog Streams: Keeps the most recent value (for some key).

2. Query results are never final, continuously updated, and potentially unbounded

On the left side, we have our append-only log (or a collection of bounded datasets as we discussed above) and we run a Streaming SQL Query on that table.

As we keep ingesting new events, they get appended as new rows to the log. These events yield changes, which results in the output table being continuously updated.

This is called a Dynamic Table.
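
A continuously updated result can be modeled in a few lines of Python. The sketch below is a conceptual illustration, not Flink's runtime: it runs a per-user count over an append-only stream and emits a changelog of the result using Flink's row kind notation (+I for insert, -U and +U for the before and after images of an update). The function names and sample events are made up.

```python
from collections import Counter


def continuous_count(events):
    """Model of a grouped COUNT(*) per user over an append-only stream.

    Yields the changelog of the result table, one or two changes per input event.
    """
    counts = Counter()
    for user in events:
        previous = counts[user]
        counts[user] += 1
        if previous:
            yield ("-U", user, previous)  # retract the old result row
            yield ("+U", user, previous + 1)
        else:
            yield ("+I", user, 1)


def materialize(changelog):
    """Apply a changelog in order to rebuild the current state of the result table."""
    table = {}
    for op, key, value in changelog:
        if op in ("+I", "+U"):
            table[key] = value
        else:  # "-U" or "-D"
            table.pop(key, None)
    return table


changelog = list(continuous_count(["alice", "bob", "alice"]))
result_table = materialize(changelog)
```

The input only ever grows, yet the result table is updated in place: the second event for alice retracts the count of 1 and replaces it with 2.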

Flink uses catalogs to hold metadata for databases, tables, functions, and views.

A catalog can be non-persistent (the in-memory catalog) or persistent, backed by an external system, like the PostgresCatalog, the PulsarCatalog, and the HiveCatalog.

For in-memory catalogs, all metadata is available only for the lifetime of the session.

In contrast, catalogs like the PostgresCatalog let users connect the two systems; Flink then automatically references existing metadata by mapping it to its own corresponding metadata.

For example, Flink can map Postgres tables to its own tables automatically, so users don’t have to manually rewrite DDLs in Flink SQL.

Within the catalogs, you create databases and tables in these databases.

When creating a table, its fully qualified identifier is catalog.database.table; when a catalog and/or database is not specified, the default ones are used.
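
The name resolution rule can be sketched as a small Python function. It is an illustration only: the function is hypothetical, it uses the default_catalog and default_database names that the SQL client reports, and it ignores backtick quoting, which real identifiers containing dots would need.

```python
def resolve(name, current_catalog="default_catalog", current_database="default_database"):
    """Expand a table name into a (catalog, database, table) triple."""
    parts = name.split(".")
    if len(parts) == 1:
        return (current_catalog, current_database, parts[0])
    if len(parts) == 2:
        return (current_catalog, parts[0], parts[1])
    if len(parts) == 3:
        return tuple(parts)
    raise ValueError(f"too many parts in table name: {name!r}")


short_name = resolve("readings")
with_database = resolve("mydbl.readings")
fully_qualified = resolve("my_catalog.mydbl.readings")
```

Missing parts are filled in from the left, so a bare table name lands in the current catalog and database.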

3. Environment Setup

As a warmup exercise let’s start the Flink SQL CLI to run a few commands, but first, we need to have a Flink and a Kafka Cluster up and running.

Make sure you have docker-compose installed on your machine, as we will use the following docker-compose.yaml file to set up the required clusters.

version: "3.7"
services:
  zookeeper:
    image: bitnami/zookeeper:3.8.0
    ports:
      - "2181:2181"
    environment:
      ALLOW_ANONYMOUS_LOGIN: "yes"
  kafka:
    image: bitnami/kafka:3.3.1
    ports:
      - "9092:9092"
    environment:
      KAFKA_CFG_ZOOKEEPER_CONNECT: zookeeper:2181
      ALLOW_PLAINTEXT_LISTENER: "yes"
      KAFKA_ADVERTISED_PORT: 9092
      KAFKA_ADVERTISED_HOST_NAME: kafka
      KAFKA_LISTENERS: >-
        INTERNAL://:29092,EXTERNAL://:9092
      KAFKA_ADVERTISED_LISTENERS: >-
        INTERNAL://kafka:29092,EXTERNAL://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: >-
        INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: "INTERNAL"
    depends_on:
      - zookeeper
  jobmanager:
    build: .
    container_name: jobmanager
    ports:
      - "8081:8081"
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
  taskmanager:
    build: .
    depends_on:
      - jobmanager
    command: taskmanager
    deploy:
      replicas: 1
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 1

3.1 The Flink SQL Client

Run docker-compose up, wait for a few seconds and your clusters should be up and running.

Let’s start the Flink SQL CLI by running docker exec -it jobmanager ./bin/sql-client.sh and then execute the following commands as a warmup with the SQL client:

Flink SQL> SHOW CATALOGS;
+-----------------+
|    catalog name |
+-----------------+
| default_catalog |
+-----------------+
1 row in set

-- Create a new database
Flink SQL> CREATE DATABASE mydbl;

Flink SQL> SHOW DATABASES;
+------------------+
|    database name |
+------------------+
| default_database |
|            mydbl |
+------------------+
2 rows in set

Flink SQL> use mydbl;
[INFO] Execute statement succeed.

-- Currently we have no tables
Flink SQL> SHOW TABLES;
Empty set

-- Truncated output of some of the available functions
Flink SQL> SHOW FUNCTIONS;
+-------------------------------+
|                 function name |
+-------------------------------+
|             AGG_DECIMAL_MINUS |
|              AGG_DECIMAL_PLUS |
|                ARRAY_CONTAINS |
|                      COALESCE |
|             CURRENT_WATERMARK |
|                      GREATEST |
|                        IFNULL |
|                       IS_JSON |
|                    JSON_ARRAY |
|  JSON_ARRAYAGG_ABSENT_ON_NULL |
|    JSON_ARRAYAGG_NULL_ON_NULL |
|                   JSON_EXISTS |
|                   JSON_OBJECT |
| JSON_OBJECTAGG_ABSENT_ON_NULL |
|   JSON_OBJECTAGG_NULL_ON_NULL |
|                    JSON_QUERY |
|                   JSON_STRING |
|                    JSON_VALUE |
|                         LEAST |
|              SOURCE_WATERMARK |

It’s time now to get into some interesting stuff, but before that let’s also create the Kafka topics we will be using for our examples.

docker exec -it depths-of-flink-kafka-1 kafka-topics.sh --create \
  --topic sensor.info \
  --partitions 1 \
  --config cleanup.policy=compact \
  --bootstrap-server localhost:9092

docker exec -it depths-of-flink-kafka-1 kafka-topics.sh --create \
  --topic sensor.readings \
  --partitions 3 \
  --bootstrap-server localhost:9092
  
docker exec -it depths-of-flink-kafka-1 kafka-topics.sh --bootstrap-server localhost:9092 --describe

------- Output -------
Topic: sensor.info TopicId: zFY47WiRS721XIUik2nRBg PartitionCount: 1 ReplicationFactor: 1 Configs: cleanup.policy=compact
 Topic: sensor.info Partition: 0 Leader: 1001 Replicas: 1001 Isr: 1001
 
Topic: sensor.readings TopicId: HGvGHOeKQQCxG3cly2R7Lw PartitionCount: 3 ReplicationFactor: 1 Configs:
 Topic: sensor.readings Partition: 0 Leader: 1001 Replicas: 1001 Isr: 1001
 Topic: sensor.readings Partition: 1 Leader: 1001 Replicas: 1001 Isr: 1001
 Topic: sensor.readings Partition: 2 Leader: 1001 Replicas: 1001 Isr: 1001

3.2 Create Tables

Let’s go back to our Flink SQL CLI; the first thing we need is some tables to work with.

We will be using the Kafka Flink SQL connector to read sensor information and sensor readings from the two Kafka topics we already created. The following block shows how to create a table from a Kafka topic.

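Note: We will be using the default catalog as well as the default database. If you are still inside mydbl from the warmup, switch back first by running USE default_database; in the SQL client.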

CREATE TABLE readings (
  sensorId STRING,
  reading DOUBLE,
  eventTime_ltz AS TO_TIMESTAMP_LTZ(`timestamp`, 3),
  `ts` TIMESTAMP(3) METADATA FROM 'timestamp',
  `timestamp` BIGINT,
  WATERMARK FOR eventTime_ltz AS eventTime_ltz - INTERVAL '30' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'sensor.readings',
  'properties.bootstrap.servers' = 'kafka:29092',
  'properties.group.id' = 'group.sensor.readings',
  'format' = 'json',
  'scan.startup.mode' = 'earliest-offset',
  'json.timestamp-format.standard' = 'ISO-8601',
  'json.fail-on-missing-field' = 'false',
  'json.ignore-parse-errors' = 'true'
);
  
  
Flink SQL> DESCRIBE readings;
+---------------+----------------------------+------+-----+-------------------------------------+----------------------------------------+
|          name |                       type | null | key |                              extras |                              watermark |
+---------------+----------------------------+------+-----+-------------------------------------+----------------------------------------+
|      sensorId |                     STRING | TRUE |     |                                     |                                        |
|       reading |                     DOUBLE | TRUE |     |                                     |                                        |
| eventTime_ltz | TIMESTAMP_LTZ(3) *ROWTIME* | TRUE |     | AS TO_TIMESTAMP_LTZ(`timestamp`, 3) | `eventTime_ltz` - INTERVAL '30' SECOND |
|            ts |               TIMESTAMP(3) | TRUE |     |           METADATA FROM 'timestamp' |                                        |
|     timestamp |                     BIGINT | TRUE |     |                                     |                                        |
+---------------+----------------------------+------+-----+-------------------------------------+----------------------------------------+
5 rows in set

The CREATE TABLE syntax consists of column definitions, watermarks and connector properties (more details here).

We can observe the following column types in Flink SQL:

Physical (or regular) columns
Metadata columns: like the ts column in our statement that is basically Kafka metadata for accessing the timestamp from a Kafka Record.
Computed columns: virtual columns like the eventTime_ltz in our statement, which is a formatted timestamp derived from our timestamp BIGINT column. Virtual Columns can reference other columns, perform simple computations (like 5 * reading) or use built-in functions
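
To make the three column kinds concrete, here is a minimal sketch that derives a second table from readings using the LIKE clause. The table name readings_extended and the column names kafkaPartition, kafkaOffset and scaledReading are made up for illustration; 'partition' and 'offset' are metadata keys exposed by the Kafka connector. The sketch assumes the readings table from above already exists.

```sql
-- Hypothetical derived table: inherits columns, watermark and connector options from readings
CREATE TABLE readings_extended (
  -- Metadata columns: read-only Kafka record metadata (VIRTUAL = excluded when writing)
  kafkaPartition INT    METADATA FROM 'partition' VIRTUAL,
  kafkaOffset    BIGINT METADATA FROM 'offset' VIRTUAL,
  -- Computed column: evaluated on read from the physical column reading
  scaledReading AS 5 * reading
) LIKE readings;

SELECT sensorId, reading, scaledReading, kafkaPartition, kafkaOffset
FROM readings_extended;
```

Only sensorId, reading and timestamp are physical columns here; everything else is derived either from the record metadata or from an expression, so none of it has to be present in the JSON payload.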

Note: Specifying time attributes (here eventTime_ltz) and watermarks is what allows us to operate properly on event time and to set constraints on temporal operators, which we will see shortly.
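
A quick way to see the watermark at work is the built-in CURRENT_WATERMARK function (it appears in the SHOW FUNCTIONS output above). The sketch below assumes the readings table defined earlier; the alias names are arbitrary. With the 30-second bound declared on the table, the watermark should trail the largest event time seen so far by roughly 30 seconds, and it is NULL until the first watermark has been emitted.

```sql
SELECT
  sensorId,
  eventTime_ltz,
  CURRENT_WATERMARK(eventTime_ltz) AS currentWatermark,
  -- TRUE when the row is at or behind the watermark (late); NULL while no watermark exists yet
  eventTime_ltz <= CURRENT_WATERMARK(eventTime_ltz) AS isLate
FROM readings;
```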

Let’s also create a table for our sensor information topic.

CREATE TABLE sensors (
  sensorId STRING,
  latitude STRING,
  longitude STRING,
  sensorType STRING,
  generation STRING,
  deployed BIGINT
) WITH (
  'connector' = 'kafka',
  'topic' = 'sensor.info',
  'properties.bootstrap.servers' = 'kafka:29092',
  'properties.group.id' = 'group.sensor.info',
  'format' = 'json',
  'scan.startup.mode' = 'earliest-offset',
  'json.timestamp-format.standard' = 'ISO-8601',
  'json.fail-on-missing-field' = 'false',
  'json.ignore-parse-errors' = 'true'
);
  
Flink SQL> DESCRIBE sensors;
+------------+--------+------+-----+--------+-----------+
|       name |   type | null | key | extras | watermark |
+------------+--------+------+-----+--------+-----------+
|   sensorId | STRING | TRUE |     |        |           |
|   latitude | STRING | TRUE |     |        |           |
|  longitude | STRING | TRUE |     |        |           |
| sensorType | STRING | TRUE |     |        |           |
| generation | STRING | TRUE |     |        |           |
|   deployed | BIGINT | TRUE |     |        |           |
+------------+--------+------+-----+--------+-----------+

You might wonder why there is no time attribute or watermark specified on this table.

The sensor information is basically used to hold state, i.e., it’s a changelog stream backed by a compacted Kafka topic (as opposed to an append-only stream), because we are only interested in keeping the latest sensor information to perform event enrichment later (when we discuss joins).

At this point, we have two tables created within our default database.

Flink SQL> SHOW TABLES;
+------------+
| table name |
+------------+
|   readings |
|    sensors |
+------------+
2 rows in set

Next up, we need some data in our topics (and tables) to work with. I will use the Producer code that can be found here to generate information for 10 sensors and 10,000 readings. You can modify the code to ingest more data if you want.

Note: I skipped adding the implementation here since our focus is on Flink SQL.

3.3 Run our First Query

Let’s run the following query; the output should look similar to the sample below, with information for our 10 sensors.

SELECT * FROM sensors;

------- Sample Output -------

+----+--------------------------------+--------------------------------+--------------------------------+--------------------------------+--------------------------------+----------------------+
| op |                       sensorId |                       latitude |                      longitude |                     sensorType |                     generation |             deployed |
+----+--------------------------------+--------------------------------+--------------------------------+--------------------------------+--------------------------------+----------------------+
| +I |                              1 |                      83.964156 |                      47.567865 |                      PROXIMITY |                              0 |        1610920880369 |
| +I |                              2 |                      70.211600 |                      87.285699 |                      PROXIMITY |                              0 |        1669725385766 |
| +I |                              3 |                     -78.843922 |                     -159.70556 |                    TEMPERATURE |                              2 |        1645179854537 |
| +I |                              4 |                      77.304485 |                      102.32052 |                      PROXIMITY |                              2 |        1596841078647 |
| +I |                              5 |                     -11.876575 |                      57.576944 |                      PROXIMITY |                              3 |        1667359403631 |
| +I |                              6 |                      59.134005 |                     -159.71549 |                      PROXIMITY |                              0 |        1604501300175 |
| +I |                              7 |                     -16.478654 |                      141.49999 |                    TEMPERATURE |                              1 |        1614461145113 |
| +I |                              8 |                     -64.380075 |                      164.37186 |                      PROXIMITY |                              2 |        1673640554153 |
| +I |                              9 |                     -33.693995 |                     -2.4277239 |                    TEMPERATURE |                              3 |        1645551899832 |
| +I |                             10 |                     -88.115880 |                      11.500759 |                      PROXIMITY |                              2 |        1623336384463 |

4. Operators

4.1 Stateless Operators

Stateless Operators are the simplest and include common operations like Projections and Filters that require no state.

Query: Only sensor readings > 40

SELECT  sensorId, reading, eventTime_ltz
FROM    readings
WHERE   reading > 40

------- Sample Output -------

+----+--------------------------------+--------------------------------+-------------------------+
| op |                       sensorId |                        reading |           eventTime_ltz |
+----+--------------------------------+--------------------------------+-------------------------+
| +I |                              1 |                          40.18 | 2023-01-30 20:17:45.297 |
| +I |                              1 |                          41.87 | 2023-01-30 20:17:45.334 |
| +I |                              1 |                          41.72 | 2023-01-30 20:17:45.577 |
| +I |                              8 |                          40.91 | 2023-01-30 20:17:45.825 |
| +I |                              5 |                          40.94 | 2023-01-30 20:17:46.030 |
| +I |                              7 |                          40.73 | 2023-01-30 20:17:46.164 |
| +I |                              5 |                          40.13 | 2023-01-30 20:17:46.468 |
| +I |                              5 |                          40.22 | 2023-01-30 20:17:46.495 |
| +I |                              7 |                          40.02 | 2023-01-30 20:17:46.890 |
| +I |                              7 |                          40.92 | 2023-01-30 20:17:46.971 |

4.2 Materializing Operators

Materializing Operators perform computations that are not constrained by temporal conditions and thus never complete — the input/output records are constantly updated or deleted.

Consider a GROUP BY sensorId operation. The query needs to maintain the state for every sensorId, in order to update the results accordingly each time a new event for a sensor arrives.

This means the state is kept around forever: it is updated with every new sensor-generated event and grows with every new key.

Query: Total readings per Sensor

SELECT   sensorId, COUNT(reading) as totalReadings
FROM     readings
GROUP BY sensorId

------- Sample Output -------

+----+--------------------------------+----------------------+
| op |                       sensorId |        totalReadings |
+----+--------------------------------+----------------------+
| +I |                              4 |                    1 |
| -D |                              4 |                    1 |
| +I |                              6 |                    1 |
| -D |                              6 |                    1 |
| +I |                              4 |                    2 |
| -D |                              4 |                    2 |
| +I |                              6 |                    2 |
| -D |                              6 |                    2 |
| +I |                              4 |                    6 |
| -D |                              4 |                    6 |
| +I |                              6 |                    3 |
| -D |                              6 |                    3 |
| +I |                              4 |                    7 |
| -D |                              4 |                    7 |
| +I |                              6 |                    7 |
| -D |                              6 |                    7 |
| +I |                              4 |                    8 |
| -D |                              4 |                    8 |

Notice the op column: when we have an update for a given key, the previous row is deleted and a new row with the updated value is inserted. For example:

| +I |                              4 |                    1 |
| -D |                              4 |                    1 |
| +I |                              4 |                    2 |

Now consider a query that joins the sensor information and sensor readings tables.

Query: Enrich Sensor readings with Sensor Information (Regular Join)

SELECT 
  sensors.sensorId, 
  reading, 
  eventTime_ltz,
  latitude,
  longitude,
  sensorType
FROM readings
  JOIN sensors ON readings.sensorId = sensors.sensorId
  
------- Sample Output -------

+----+--------------------------------+--------------------------------+-------------------------+--------------------------------+--------------------------------+--------------------------------+
| op |                       sensorId |                        reading |           eventTime_ltz |                       latitude |                      longitude |                     sensorType |
+----+--------------------------------+--------------------------------+-------------------------+--------------------------------+--------------------------------+--------------------------------+
| +I |                              1 |                          40.18 | 2023-01-30 20:17:45.297 |                      83.964156 |                      47.567865 |                      PROXIMITY |
| +I |                              1 |                          38.95 | 2023-01-30 20:17:45.301 |                      83.964156 |                      47.567865 |                      PROXIMITY |
| +I |                              1 |                          41.87 | 2023-01-30 20:17:45.334 |                      83.964156 |                      47.567865 |                      PROXIMITY |
| +I |                              1 |                          39.92 | 2023-01-30 20:17:45.375 |                      83.964156 |                      47.567865 |                      PROXIMITY |
| +I |                              1 |                          39.28 | 2023-01-30 20:17:45.408 |                      83.964156 |                      47.567865 |                      PROXIMITY |
| +I |                              7 |                          39.99 | 2023-01-30 20:17:45.443 |                     -16.478654 |                      141.49999 |                    TEMPERATURE |
| +I |                              1 |                          38.27 | 2023-01-30 20:17:45.551 |                      83.964156 |                      47.567865 |                      PROXIMITY |
| +I |                              7 |                          38.46 | 2023-01-30 20:17:45.553 |                     -16.478654 |                      141.49999 |                    TEMPERATURE |
| +I |                              1 |                          41.72 | 2023-01-30 20:17:45.577 |                      83.964156 |                      47.567865 |                      PROXIMITY |
| +I |                              7 |                          40.73 | 2023-01-30 20:17:46.164 |                     -16.478654 |                      141.49999 |                    TEMPERATURE |

Note: For data enrichment between tables backed by Kafka topics, there is another approach that leverages the upsert-kafka connector and Temporal Joins, but I wanted to keep things simple in this blog post, as there is more theory involved.
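
For the curious, a rough sketch of that alternative could look like the following. It is not the setup used in this post: the table name sensors_versioned and the column updatedAt are hypothetical, and the 'raw' key format assumes the producer writes the sensorId as a plain string record key. The primary key plus the watermark turn the compacted topic into a versioned table that a temporal join can look up.

```sql
-- Hypothetical versioned table over the compacted sensor.info topic
CREATE TABLE sensors_versioned (
  sensorId STRING,
  latitude STRING,
  longitude STRING,
  sensorType STRING,
  generation STRING,
  deployed BIGINT,
  -- Kafka record timestamp, used as the version time of each sensor row
  updatedAt TIMESTAMP_LTZ(3) METADATA FROM 'timestamp' VIRTUAL,
  WATERMARK FOR updatedAt AS updatedAt,
  PRIMARY KEY (sensorId) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'sensor.info',
  'properties.bootstrap.servers' = 'kafka:29092',
  'key.format' = 'raw',   -- assumption: record keys are plain strings holding the sensorId
  'value.format' = 'json'
);

-- Event-time temporal join: each reading sees the sensor row that was valid at its event time
SELECT r.sensorId, r.reading, r.eventTime_ltz, s.latitude, s.longitude, s.sensorType
FROM readings AS r
JOIN sensors_versioned FOR SYSTEM_TIME AS OF r.eventTime_ltz AS s
  ON r.sensorId = s.sensorId;
```

Keep in mind that an event-time temporal join only emits results as the watermarks of both inputs advance, so a sensor topic that rarely receives updates can hold back the output.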

Both tables are kept in Flink state, which means the state will keep growing for both sides of the join, and thus it’s important to expire state by using a TTL.

You can achieve this using table.exec.state.ttl. Keep in mind though that there is a tradeoff between accuracy and state size, as expiring state too early might result in incomplete results.
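
In the SQL client this can be set per session before submitting the query. The one-hour value below is only an illustration; choose it based on how late a matching record can realistically arrive.

```sql
-- Expire join/aggregation state that has not been updated for one hour (illustrative value)
SET 'table.exec.state.ttl' = '1 h';
```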

If you need to keep a really large state around, you will need to configure Flink to use RocksDB as a state backend.

Unlike Materializing Operators, Temporal Operators (which we will see next) automatically expire state that is no longer useful by using time constraints, although you may still need to set an idle state retention interval.

For example, a time Window that is considered complete doesn’t need to be kept around in the state.

4.3 Temporal Operators

Temporal Operators are constrained by time. Records and computations are associated with a temporal condition, e.g., a time window of 30 seconds, and only accept new records; previously added records cannot be updated or deleted.

As we previously mentioned they hold records and/or results in state, but only until they are no longer required.
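
As a sketch of such a temporal condition, the query below uses a 30-second hopping window that advances every 10 seconds. It assumes the readings table from section 3.2; the window size and slide are arbitrary choices for illustration.

```sql
SELECT
  sensorId,
  window_start,
  window_end,
  ROUND(AVG(reading), 1) AS averageReading
-- HOP arguments: table, time attribute, slide (10 seconds), window size (30 seconds)
FROM TABLE(HOP(TABLE readings, DESCRIPTOR(eventTime_ltz), INTERVAL '10' SECOND, INTERVAL '30' SECOND))
GROUP BY sensorId, window_start, window_end;
```

Once the watermark passes window_end, the window's result is emitted and its state can be dropped, which is exactly what keeps the state of Temporal Operators bounded.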

Query: Find the 1 minute average value for each sensor

SELECT
  sensorId,
  window_start,
  window_end,
  COUNT(reading) AS totalReadings,
  LISTAGG(CAST(reading AS STRING)) AS readingsList,
  ROUND(AVG(reading),1) as averageReading
FROM TABLE(TUMBLE(TABLE readings, DESCRIPTOR(eventTime_ltz), INTERVAL '1' MINUTE))
GROUP BY sensorId, window_start, window_end

------- Sample Output -------

+----+--------------------------------+-------------------------+-------------------------+----------------------+--------------------------------+--------------------------------+
| op |                       sensorId |            window_start |              window_end |        totalReadings |                   readingsList |                 averageReading |
+----+--------------------------------+-------------------------+-------------------------+----------------------+--------------------------------+--------------------------------+
| +I |                              4 | 2023-01-30 16:44:00.000 | 2023-01-30 16:45:00.000 |                   17 | 40.59,40.17,39.98,39.66,40.... |                           40.1 |
| +I |                              1 | 2023-01-30 16:44:00.000 | 2023-01-30 16:45:00.000 |                   26 | 40.23,38.84,36.6,39.31,39.9... |                           39.6 |
| +I |                              4 | 2023-01-30 16:45:00.000 | 2023-01-30 16:46:00.000 |                  440 | 40.31,43.09,40.19,40.35,39.... |                           39.9 |
| +I |                              1 | 2023-01-30 16:45:00.000 | 2023-01-30 16:46:00.000 |                  469 | 41.03,40.12,40.7,38.88,40.8... |                           40.0 |
| +I |                              1 | 2023-01-30 16:46:00.000 | 2023-01-30 16:47:00.000 |                  469 | 39.49,39.42,40.09,40.66,38.... |                           39.9 |
| +I |                              4 | 2023-01-30 16:46:00.000 | 2023-01-30 16:47:00.000 |                  447 | 40.44,40.98,39.79,39.21,40.... |                           40.0 |
| +I |                              4 | 2023-01-30 16:47:00.000 | 2023-01-30 16:48:00.000 |                  459 | 36.82,40.19,39.66,39.83,42.... |                           40.0 |
| +I |                              1 | 2023-01-30 16:47:00.000 | 2023-01-30 16:48:00.000 |                  494 | 40.45,39.37,41.69,40.41,39.... |                           40.1 |
| +I |                              1 | 2023-01-30 16:48:00.000 | 2023-01-30 16:49:00.000 |                  494 | 40.35,39.02,41.26,37.56,41.... |                           40.0 |
| +I |                              4 | 2023-01-30 16:48:00.000 | 2023-01-30 16:49:00.000 |                  447 | 41.15,39.46,38.72,37.01,39.... |                           40.0 |
| +I |                              2 | 2023-01-30 16:44:00.000 | 2023-01-30 16:45:00.000 |                   20 | 39.12,41.12,41.68,38.75,39.... |                           40.3 |

Query: Find reading statistics (max, min, average and stddev) for all readings per sensorId over the previous minute.

SELECT 
  eventTime_ltz,
  sensorId, 
  reading,
  ROUND(AVG(reading) OVER minuteInterval, 1) AS minuteAvgTemp,
  MAX(reading) OVER minuteInterval AS minuteMaxTemp,
  MIN(reading) OVER minuteInterval AS minuteMinTemp,
  ROUND(STDDEV(reading) OVER minuteInterval, 5) AS minuteStdevTemp
FROM readings 
WINDOW minuteInterval AS (
  PARTITION BY sensorId
  ORDER BY eventTime_ltz
  RANGE BETWEEN INTERVAL '1' MINUTE PRECEDING AND CURRENT ROW 
);

5. (Temporary) Views

As we mentioned Flink SQL is quite rich and provides a lot of functions — so covering everything in this article is impossible. One more useful feature I want to mention is Temporary Views.

Similar to database views, a Temporary View gives a name to a query. A view is not physically materialized; instead, its query is run every time the view is referenced in another query. Temporary Views are very useful to structure and decompose more complicated queries or to reuse queries within other queries.

Once more let us better illustrate this with an example.

Following our previous query that calculates statistics, we can use a Temporary View to give that query a name and reuse the calculated statistics to filter readings; for example, find readings > average + 2 * standard deviation.

That’s a simple example that can be used to build more sophisticated outlier detection logic.

-- Create a temporary view; general syntax: CREATE [TEMPORARY] VIEW <name> AS <query>
CREATE TEMPORARY VIEW readings_stats AS
SELECT 
  eventTime_ltz,
  sensorId, 
  reading,
  ROUND(AVG(reading) OVER minuteInterval, 1) AS minuteAvgTemp,
  MAX(reading) OVER minuteInterval AS minuteMaxTemp,
  MIN(reading) OVER minuteInterval AS minuteMinTemp,
  ROUND(STDDEV(reading) OVER minuteInterval, 5) AS minuteStdevTemp
FROM readings 
WINDOW minuteInterval AS (
  PARTITION BY sensorId
  ORDER BY eventTime_ltz
  RANGE BETWEEN INTERVAL '1' MINUTE PRECEDING AND CURRENT ROW 
);

--- Run a filter query on the results to get the readings we want
SELECT 
  sensorId,
  reading,
  ROUND(minuteAvgTemp + 2 * minuteStdevTemp, 2) as threshold 
FROM readings_stats
WHERE reading > minuteAvgTemp + 2 * minuteStdevTemp

------- Sample Output -------
+----+--------------------------------+--------------------------------+--------------------------------+
| op |                       sensorId |                        reading |                      threshold |
+----+--------------------------------+--------------------------------+--------------------------------+
| +I |                              5 |                           41.6 |                          41.42 |
| +I |                              3 |                           42.1 |                          41.69 |
| +I |                              5 |                           41.6 |                          41.52 |
| +I |                              3 |                           42.6 |                          42.52 |
| +I |                              5 |                           41.7 |                          41.69 |
| +I |                              3 |                           41.5 |                          41.38 |
| +I |                              7 |                           41.1 |                          41.07 |
| +I |                              4 |                           42.5 |                          41.97 |
| +I |                              1 |                           41.3 |                           41.2 |
| +I |                              1 |                           41.8 |                          41.74 |

6. The TableEnvironment and SQL Queries

Up to this point, we have been using the Flink SQL CLI to submit SQL queries. For production cases, though, or if you are running in environments like Kubernetes (for example, using the Flink Operator), you might need other ways to achieve this.

Note 1: Flink 1.16 introduced Flink SQL Gateway that you can use to submit queries.

Next, we will see how we can use the TableEnvironment to run such queries through code.

Note 2: If you are running on Kubernetes using the Flink Operator, you might also want to check the examples here.

Note 3: The sample code below might look unusual, as I’m using Kotlin. It should work exactly the same whether you are using Java, Kotlin or Scala. I’m just using Kotlin these days, and because Java 17 is unfortunately not supported by Flink yet, I wanted to leverage Kotlin’s multiline strings to write my queries.

6.1 Running SQL Queries with Code

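import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.time.Time
import org.apache.flink.configuration.Configuration
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend
import org.apache.flink.streaming.api.environment.CheckpointConfig
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment

// Queries (referenced below) is the object that holds the SQL strings; it lives in the linked repository.
class EnrichmentStream {
    private val checkpointsDir  = "file://${System.getProperty("user.dir")}/checkpoints/"
    private val rocksDBStateDir = "file://${System.getProperty("user.dir")}/state/rocksdb/"

    companion object {
        @JvmStatic
        fun main(args: Array<String>) {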
            EnrichmentStream().runStream()
        }
    }

    fun runStream() {
        val environment = StreamExecutionEnvironment
            .createLocalEnvironmentWithWebUI(Configuration())

        environment.parallelism = 3

        // Checkpoint Configurations
        environment.enableCheckpointing(5000)
        environment.checkpointConfig.minPauseBetweenCheckpoints = 100
        environment.checkpointConfig.setCheckpointStorage(checkpointsDir)

        val stateBackend = EmbeddedRocksDBStateBackend()
        stateBackend.setDbStoragePath(rocksDBStateDir)
        environment.stateBackend = stateBackend

        environment.checkpointConfig.externalizedCheckpointCleanup =
            CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION

        // Configure Restart Strategy
        environment.restartStrategy = RestartStrategies.fixedDelayRestart(5, Time.seconds(5))

        val tableEnvironment = StreamTableEnvironment.create(environment)

        // Run some SQL queries to check the existing Catalogs, Databases and Tables
        tableEnvironment
            .executeSql("SHOW CATALOGS")
            .print()

        tableEnvironment
            .executeSql("SHOW DATABASES")
            .print()

        tableEnvironment
            .executeSql("SHOW TABLES")
            .print()

        tableEnvironment
            .executeSql(Queries.CREATE_SENSORS_TABLE)
            .print()

        tableEnvironment
            .executeSql(Queries.CREATE_READINGS_TABLE)
            .print()

        tableEnvironment
            .executeSql("SHOW TABLES")
            .print()

        tableEnvironment
            .executeSql(Queries.JOIN_SENSOR_READINGS_WITH_INFO_QUERY)
            .print()
    }
}

The TableEnvironment is the entrypoint for Table API and SQL integration and is responsible for:

Registering a Table in the internal catalog
Registering catalogs
Loading pluggable modules
Executing SQL queries
Registering a user-defined (scalar, table, or aggregation) function
Converting between DataStream and Table (in case of StreamTableEnvironment)

The code snippet illustrated above runs the Join Operations we saw before (you can find the queries here) and the output should be similar.

6.2 A Short Discussion: Checkpoints and State

As extras, you might notice two things:

Checkpointing is enabled.
RocksDB is enabled as the state backend.

We are not going into detail about these concepts here as it’s a story for another day. For those interested, if you run the code above you should see two output directories, checkpoints and state.

Since we are consuming from (at most) 3 Kafka partitions, I’m using a parallelism of 3, as you might have noticed in the code, so the stateful operator that runs the join will store state for each of the three tasks.

You can use a sample code file I have added here to see what gets written in the state.

Processing state of operator: job_848920cbd5178c2a525827b244d1e530_op_StreamingJoinOperator_8b481b930a189b6b1762a9d95a61ada1__2_3__uuid_97fb2e4b-e6f4-413e-98a5-0c4ad24ed20e}
 Column Family 'right-records' has 3 entries.
 Column Family 'left-records' has 670 entries.
Processing state of operator: job_848920cbd5178c2a525827b244d1e530_op_StreamingJoinOperator_8b481b930a189b6b1762a9d95a61ada1__3_3__uuid_46df8b7d-d015-4cc3-ad53-b86b7ecb0f1e}
 Column Family 'right-records' has 4 entries.
 Column Family 'left-records' has 879 entries.
Processing state of operator: job_848920cbd5178c2a525827b244d1e530_op_StreamingJoinOperator_8b481b930a189b6b1762a9d95a61ada1__1_3__uuid_6166550d-6e03-489f-94ab-4004ea7ec50e}
 Column Family 'right-records' has 3 entries.
 Column Family 'left-records' has 521 entries.

You can see that we have two column families: one for the left side of the join (sensor readings) and one for the right (sensor information). Notice right-records, for example: since we have 10 sensor IDs, these keys are distributed among the three tasks (3 + 4 + 3 = 10).

Also, as discussed previously, since there are no time constraints, you will notice the state growing indefinitely (unless table.exec.state.ttl is configured).

I hope I sparked some interest in those curious enough to dive deeper themselves. In this article, though, we will conclude with this high-level overview.

7. Wrap Up

Flink is a powerful Stateful Stream Processing engine, enabling Unified Batch and Streaming architectures.

Flink SQL is a high-level API that uses the well-known SQL syntax, making it easy for everyone, like scientists or non-JVM (or Python) engineers, to leverage the power of Stream Processing with Apache Flink.

Flink SQL is extremely rich and supports a wide variety of built-in operators and functions.

Unless there are some really sophisticated use cases (that cannot be expressed in SQL and need low-level DataStream API access), Flink SQL is the best candidate for Stream Processing.

Originally published at https://blog.rockthejvm.com on February 6, 2023.

In the Land of Streams — Kafka Part 4: My Cluster is Lost!! — Embracing Failure

Giannis Polyzos — Tue, 13 Dec 2022 14:24:05 GMT


A Kafka Streaming Ledger

https://www.vecteezy.com/free-vector/squirrel-cartoon

The blog post series consists of the following parts:
- Part 1: A Producer’s Message
- Part 2: The Rise of the Consumers
- Part 3: Offsets and how to handle them
- Part 4: My Cluster is Lost!! — Embracing Failure (this blog post)

In the previous parts of the series, we have looked at how things work mainly from the application’s perspective.

In this final part, we will focus a little bit on the infrastructure part.

Typically, enterprises use Kafka as the backbone of their whole data platform. This means it accommodates a wide range of business-critical workloads, and sooner or later, no matter how well-prepared you are, things will fail. This is why it is important to embrace failure when designing your overall architecture (yes, fallacies are real) and to think in terms of backups and disaster recovery.

In the previous parts, I mentioned (and used) Aiven for Kafka. Aiven is a complete future-proof data platform and what I really like is how easy it is to set up disaster recovery solutions whether it’s on one or multiple cloud providers.

Let’s assume you want to implement a Multi-Cloud Disaster Recovery Solution with Aiven for Kafka. There are two things to highlight here:

1. As Aiven comes with no cloud vendor lock-in, it allows you to easily deploy clusters across different clouds.

2. For the data replication behind Disaster Recovery solutions, it uses MirrorMaker 2.

The two main disaster recovery patterns are:

- Active / Passive Pattern: uses a secondary cluster that acts as a backup
- Active / Active Pattern: uses two clusters replicating data between them

Some data syncing/replication patterns can also account for disaster (since the data is replicated as well):

- Fan-out Pattern: data is replicated from one cluster to multiple clusters. Use case: replicate data to different clouds and/or regions.
- Aggregation Pattern: many edge clusters send data to a centralized cluster that acts as the aggregator. Use case: aggregate data in a centralized location, for example for data warehousing or operational purposes.
- Full-Mesh Pattern: many clusters send data to each other. Use case: different clusters operate in different regions/countries and the data needs to be available as a whole in every region/country, for example for operational requirements.

The most common pattern (at least based on my own experience) is Active/Passive, i.e. having one cluster active that syncs its data to a secondary one, which becomes active in case the currently active one becomes unavailable (failover).

Data Replication between multi cloud clusters

An important thing to note for the Aiven services is that the overall communication between the different clouds takes place using an IPSec Tunnel to provide secure networking.

Data syncing between the active and the failover clusters.

MirrorMaker 2 leverages the Connect framework to replicate data between Kafka clusters. You can deploy MirrorMaker2 either on AWS or GCP.

MM2 includes several new features, like:

- topic data and consumer group replication
- topic configuration and ACLs replication
- cross-cluster offsets synchronization
- partitioning is preserved

In the happy path, you have the producing and consuming applications operating on the Active cluster (on AWS in this case), and then boom the unhappy path reveals itself.

AWS cluster becomes unavailable

The whole cluster goes down and the applications become unavailable.

When working with distributed systems and planning for disaster, one crucial question you and your team need to answer is: what does disaster mean for us?

For example, your cluster might become unavailable due to some temporary networking issue, but after a few minutes everything is operational again. Is this kind of scenario acceptable for your business? Can you tolerate this, or does your application need to switch as soon as the cluster seems lost? Do you need to switch all your applications immediately?

Answering these kinds of questions will put you in a better position when designing your solution.

The last missing piece is how failover will actually look in practice.
One approach to that problem would be to wrap your application logic with some retry logic.
There are many good libraries out there that provide fault tolerance and resiliency like Resilience4j for Java, Arrow Fx for Kotlin, and ZIO for Scala (notice how all of them are functional libraries).

The goal here is the following: once the applications lose connectivity to the cluster, catch that, have the retry logic kick in and, if needed, eventually fall back to the healthy cluster.
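A minimal sketch of that retry-then-fall-back flow, written in plain Java with only the standard library rather than Resilience4j, Arrow Fx or ZIO. The class name, the callWithFailover signature and the addresses in the usage note are hypothetical and only there for illustration; in a real application the action would create or use a Kafka client against the given servers.

```java
import java.util.List;
import java.util.function.Function;

/** Retries an action per cluster and falls back to the next cluster when retries run out. */
final class FailoverRetry {

    /**
     * Tries `action` against each bootstrap address in order.
     * Every address gets up to `maxAttempts` tries before the next one is used.
     */
    static <T> T callWithFailover(List<String> bootstrapServers,
                                  int maxAttempts,
                                  long backoffMillis,
                                  Function<String, T> action) {
        RuntimeException lastFailure = null;
        for (String servers : bootstrapServers) {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return action.apply(servers);      // success: stop retrying
                } catch (RuntimeException e) {
                    lastFailure = e;                    // remember why this attempt failed
                    sleep(backoffMillis * attempt);     // simple linear backoff
                }
            }
        }
        throw new IllegalStateException("all clusters unavailable", lastFailure);
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }
}
```

Calling callWithFailover(List.of("primary:9092", "backup:9092"), 3, 500, servers -> ...) would try the first address three times with a growing pause and only then move on to the second one. How many attempts and how long a pause you allow is exactly the "what does disaster mean for us" decision.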

Note: You might already be running hundreds of apps in production or have many different teams each creating their own apps. Adding resiliency to your app logic may be hard, both in terms of implementation and coordination. Alternatives include adding a load balancer to handle the traffic or using Service Mesh technologies.

Wrapping Up

Failure is something that will happen sooner or later and you need to account for it. When designing disaster recovery solutions you might need to ask:

- What should be the definition of "my cluster is lost" within our business context?
- How fast do my apps need to switch to a healthy cluster and be operational?
- What is the best approach to implementing an application failover?

This is the end of the In the Land of Streams with Kafka series.

Stay around for more StreamingLedger stories 👋😁

In the Land of Streams — Kafka Part 3: Offsets and How to Handle Them

Giannis Polyzos — Tue, 13 Dec 2022 14:23:31 GMT

A Kafka Streaming Ledger

The blog post series consists of the following parts:
- Part 1: A Producer’s Message
- Part 2: The Rise of the Consumers
- Part 3: Offsets and how to handle them (this blog post)
- Part 4: My Cluster is Lost!! — Embracing Failure

This part aims to cover the following:

- What's the role of offsets in Kafka
- What are the caveats when working with offsets
- Different approaches for handling offsets

It’s a cycle — the message lifecycle

Up to this point, we have seen the whole message lifecycle in Kafka — PPC (Produce, Persist, Consume)

One really important thing, though, is committing offsets, especially when you need to trust that your system provides the best possible guarantees of processing each message exactly once.

Fetching messages from Kafka, processing them and marking them as processed while actually providing such guarantees has a few pitfalls and does not come out of the box.

This is what we will see next, i.e. what do I need to take into account to get the best possible exactly-once processing guarantees out of my applications?

Committing Offsets Scenarios

We will take a look at a few scenarios for committing offsets and what caveats each approach might have.

Scenario 1: Committing Offsets Automatically

This is the default behavior, with enable.auto.commit set to true. The caveat here is that messages are consumed and their offsets are committed periodically, BUT this doesn't mean a message has been successfully processed. If processing a message fails for some reason, its offset might already have been committed, and as far as Kafka is concerned that message has been processed successfully.

Scenario 2: Committing Offsets Manually

Setting enable.auto.commit to false takes Kafka consumers out of the "autopilot mode" and it's up to the application to commit the offsets. This can be achieved by using the commitSync() or commitAsync() methods on the consumer API.
When committing offsets manually, we can do so once the whole batch returned from the poll() method has finished processing, in which case all the offsets up to the highest one will be committed. Alternatively, we can commit after each individual message is done with its processing, for even stronger guarantees.

Commit/Message

Committing offsets per message

Commit/Batch

Committing offsets per batch

This gives us control over how message offsets are committed, and we can trust that the actual processing has finished before the offset is committed.
For those who want to account for every unhappy path (or at least try to), there is also the possibility that the commit itself fails. In this case, the message will be reprocessed.
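To make the difference between these scenarios concrete, here is a small in-memory model (no Kafka involved, and all names are made up for illustration). It consumes offsets 0 to logSize - 1, crashes once while handling one of them, restarts from the last committed offset and reports which offsets were actually processed. Committing before processing is a deliberate simplification of what can happen when offsets are committed independently of processing; committing after processing is the manual approach described above.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory model of one partition, used to compare commit orderings. */
final class CommitOrderingModel {

    /**
     * Consumes offsets [0, logSize), crashing once while handling `crashAt`,
     * then restarts from the last committed offset.
     * Returns the offsets that were successfully processed, in order.
     */
    static List<Integer> processedOffsets(int logSize, int crashAt, boolean commitBeforeProcessing) {
        List<Integer> processed = new ArrayList<>();
        int committed = 0;        // next offset to read after a restart
        boolean crashed = false;  // the crash happens only once

        for (int run = 0; run < 2; run++) {           // run 0 crashes, run 1 is the restart
            for (int offset = committed; offset < logSize; offset++) {
                if (commitBeforeProcessing) {
                    committed = offset + 1;            // offset marked done before any work
                }
                if (!crashed && offset == crashAt) {
                    crashed = true;                    // processing fails, consumer restarts
                    break;
                }
                processed.add(offset);                 // processing succeeded
                if (!commitBeforeProcessing) {
                    committed = offset + 1;            // commit only after the work is done
                }
            }
        }
        return processed;
    }
}
```

With processedOffsets(5, 2, true), offset 2 never shows up in the result, because it was already committed when the crash happened. With processedOffsets(5, 2, false), the consumer restarts at offset 2 and every message is processed.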

Scenario 3: Idempotency with External Storage

You can use an external data store (for example, Cassandra) and keep track of the offsets there.

Consuming messages and using something like a transaction for both processing the message and committing its offset guarantees that either both succeed or both fail, and thus idempotency is ensured.

One thing to note here is that offsets are now stored in an external datastore. When a new consumer starts or a rebalance takes place, you need to make sure your consumer fetches the offsets from the external datastore.

One way to achieve this is to add a ConsumerRebalanceListener: when onPartitionsRevoked is called, store (commit) the offsets in the external datastore, and when onPartitionsAssigned is called, retrieve them from it.
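A minimal sketch of the idea, using an in-memory class as a stand-in for the external datastore (the class, its methods and the running-total "processing" are hypothetical). The point it illustrates is that the result of processing and the offset are updated in one atomic step, so a redelivered message is recognized and skipped.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for an external store that keeps results and offsets together. */
final class TransactionalOffsetStore {
    private final Map<Integer, Long> nextOffsetByPartition = new HashMap<>();
    private final Map<String, Long> totalsByKey = new HashMap<>();

    /**
     * Applies a record's effect and advances the stored offset atomically.
     * A record whose offset was already applied is ignored, so redelivery is harmless.
     * Returns true when the record was applied.
     */
    synchronized boolean apply(int partition, long offset, String key, long amount) {
        long next = nextOffsetByPartition.getOrDefault(partition, 0L);
        if (offset < next) {
            return false;                               // duplicate delivery, already counted
        }
        totalsByKey.merge(key, amount, Long::sum);      // the "processing" part
        nextOffsetByPartition.put(partition, offset + 1); // the "commit" part
        return true;
    }

    /** Offset a consumer should resume from when the partition is (re)assigned. */
    synchronized long nextOffset(int partition) {
        return nextOffsetByPartition.getOrDefault(partition, 0L);
    }

    synchronized long total(String key) {
        return totalsByKey.getOrDefault(key, 0L);
    }
}
```

In a real consumer, the value returned by nextOffset would typically be passed to the consumer's seek() method for each partition handed over in onPartitionsAssigned, and a real datastore would replace the synchronized methods with its own transaction mechanism.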

Wrapping Up

In this post, we saw the importance of offsets and how consuming a message is different from actually processing it. As key takeaways:

- Consuming a message is different from actually processing it successfully
- Auto-committing offsets can have a negative impact on your application guarantees

We also reviewed a few different ways you might approach committing offsets back to Kafka and the caveats you might encounter with each approach.

Check Next: Part 4 My Cluster is Lost!! — Embracing Failure

In the Land of Streams — Kafka Part 2: The rise of the Consumers

Giannis Polyzos — Tue, 13 Dec 2022 14:22:54 GMT

A Kafka Streaming Ledger

The blog post series consists of the following parts:
- Part 1: A Producer’s Message
- Part 2: The Rise of the Consumers (this blog post)
- Part 3: Offsets and how to handle them
- Part 4: My Cluster is Lost!! — Embracing Failure

In the previous post, we discussed how the producing side works when we send messages. Now that data is stored inside the topic, let's zoom in on the consuming side.

This part aims to cover the following:

- How the consuming side works
- How scaling consumer groups works
- How scaling with the parallel consumer works
- Tuning to avoid slow consumers

Switching to the other side of the wall

You can find the relevant code samples on Github here.

A typical Kafka consumer loop should look similar to the following snippet

Consumer poll() Loop

We trigger the poll() method on the consumer, simulate a small amount of work, and finally show the records it processed.

Note: The show() method on the records comes from a helper extension function for printing the records in a nice and structured way:

So let’s try to better understand what happens here. The following diagram provides a more detailed explanation.

Kafka uses a pull-based model for data fetching. At the “heart of the consumer” sits the poll loop. The poll loop is important for two reasons:

- It is responsible for fetching data (providing ConsumerRecords) for the consumer to process.
- It sends heartbeats and coordinates the consumers, so the consumer group knows which consumers are available and whether a rebalance needs to take place.

The consuming applications maintain TCP connections with the brokers and send fetch requests to fetch data. The data is cached and periodically returned from the poll() method. When data is returned from the poll() method, the actual processing takes place, and once it's finished more data is requested, and so on.

What’s important to note here (and we will dive deeper into it in the next part) is committing message offsets. This is Kafka’s way of knowing that a message has been fetched and processed successfully. By default, offsets are committed automatically at regular intervals.

How much data is fetched, when more data needs to be requested and so on are dictated by configuration options like fetch.min.bytes, max.partition.fetch.bytes, fetch.max.bytes and fetch.max.wait.ms. You might think that the default options are ok for you, but it's important to test them out and think through your use case carefully.

To make this clearer, let's assume that you fetch 500 records from the poll() loop to process, but for some reason the processing takes too long for each message. max.poll.interval.ms dictates the maximum time that can pass between two calls to poll(), i.e. how long a consumer can go without fetching more records. If this threshold is reached, the consumer is considered lost and a rebalance is triggered, although our application was just slow at processing.

So decreasing the number of records the poll() loop returns (max.poll.records) and/or better tuning configurations like heartbeat.interval.ms and session.timeout.ms, which are used for consumer group coordination, might be reasonable in this case.
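A back-of-the-envelope check can tell you whether a batch fits into the poll interval. The sketch below assumes records are processed one by one between two poll() calls and uses the consumer defaults of 500 for max.poll.records and 300000 ms for max.poll.interval.ms; the class and method names are hypothetical.

```java
/** Rough check of a poll loop against max.poll.interval.ms. */
final class PollBudget {
    // Kafka consumer defaults for max.poll.records and max.poll.interval.ms
    static final int DEFAULT_MAX_POLL_RECORDS = 500;
    static final long DEFAULT_MAX_POLL_INTERVAL_MS = 300_000L;

    /** Worst-case time between two poll() calls when records are handled sequentially. */
    static long batchProcessingMillis(int recordsPerPoll, long millisPerRecord) {
        return recordsPerPoll * millisPerRecord;
    }

    /** True when processing one full batch takes longer than the allowed poll interval. */
    static boolean exceedsPollInterval(int recordsPerPoll, long millisPerRecord, long maxPollIntervalMs) {
        return batchProcessingMillis(recordsPerPoll, millisPerRecord) > maxPollIntervalMs;
    }

    /** Largest batch size that still fits the interval for a given per-record cost. */
    static long maxRecordsWithinInterval(long millisPerRecord, long maxPollIntervalMs) {
        return maxPollIntervalMs / millisPerRecord;
    }
}
```

With the defaults, 20 ms per record is no problem (500 records take 10 seconds), while 700 ms per record adds up to 350 seconds and exceeds the 300 second interval. In that case a batch size of at most 428 records would fit again.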

Running the Consumer

At this point, I will start one consuming instance on my ecommerce.events topic. Remember from part 1 that this topic consists of 5 partitions. I will run against my Aiven for Kafka cluster using the default consumer configuration options, and my goal is to see how long it takes for a consumer to read 10,000 messages from the topic, assuming a 20ms processing time per message. You can find the code here.

We can see that it takes a single consumer around 4 minutes for this kind of processing. So how can we do better?

Scaling the Consuming Side

Consumer Groups and the Parallel Consumer Pattern

Consumer Groups are Kafka's way of sharing the work between different consumers, and they also determine the level of parallelism. The highest level of parallelism you can achieve with Kafka is having one consumer consuming from each partition of a topic.

Scenario 1: #Partitions > #Consumers

In this scenario, the available partitions will be shared as evenly as possible among the available consumers of the group, and each consumer will have ownership of its partitions.

Partitions are shared among the available consumers

Scenario 2: #Partitions = #Consumers

When the partition number is equal to the available consumers each consumer will be reading from exactly one partition. In this scenario, we also reach the maximum parallelism we can achieve on a particular topic.

1:1 consumer-partition mapping

Scenario 3: #Partitions < #Consumers

This scenario is similar to the previous one, only now we will have extra consumers that are running but stay idle. On the one hand, this means we waste resources, but on the other hand we can use such a consumer as a failover in case another one in the group goes down.

#consumer > #partitions the extra consumers are idle

When a consumer goes down or similarly a new one joins the group, Kafka will have to trigger a rebalance. This means that partitions need to be revoked and reassigned to the available consumers in the group.
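The three scenarios can be sketched with a simplified round-robin assignment. This is only an illustration: Kafka's real assignors (range, round-robin, sticky, cooperative sticky) are more involved, and the class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified round-robin assignment of partitions to the consumers of one group. */
final class PartitionAssignmentSketch {

    /** Returns, for each consumer index, the list of partition numbers it owns. */
    static List<List<Integer>> assign(int partitions, int consumers) {
        if (consumers <= 0) {
            throw new IllegalArgumentException("a group needs at least one consumer");
        }
        List<List<Integer>> owned = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            owned.add(new ArrayList<>());              // every consumer starts with nothing
        }
        for (int p = 0; p < partitions; p++) {
            owned.get(p % consumers).add(p);           // hand out partitions in turn
        }
        return owned;
    }
}
```

For our 5-partition topic, assign(5, 2) gives [[0, 2, 4], [1, 3]], assign(5, 5) gives exactly one partition per consumer, and assign(5, 6) leaves the sixth consumer with an empty list, i.e. idle. A rebalance corresponds to calling assign again with the new number of consumers.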

Let's run our previous example again (consuming 10k messages), but this time with 5 consumers in our consumer group. I will be creating 5 consuming instances from within a single JVM (using Kotlin coroutines), but you can easily adjust the code (found here) and just start multiple JVMs.

As expected, we can see that the consumption time dropped to less than a minute.
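A quick sanity check on both measurements: ignoring fetch and network time, the processing alone takes the number of messages times the processing time per message, split across however many workers run in parallel. The helper below is a hypothetical back-of-the-envelope calculation, not a measurement.

```java
/** Lower bound on total consume time when only processing time is counted. */
final class ConsumeTimeEstimate {

    /** Pure processing time in seconds, assuming work is split evenly across workers. */
    static double lowerBoundSeconds(long messages, long millisPerMessage, int parallelWorkers) {
        return (messages * millisPerMessage) / (1000.0 * parallelWorkers);
    }
}
```

lowerBoundSeconds(10_000, 20, 1) is 200 seconds (about 3.3 minutes), which is in line with the roughly 4 minutes measured for a single consumer, and lowerBoundSeconds(10_000, 20, 5) is 40 seconds, in line with the result for 5 consumers. The gap between the estimate and the measurement is the fetching, network and coordination overhead.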

But if Kafka’s maximum level of parallelism is one consumer per partition, does this mean we hit the scaling limit? Let’s see how to tackle this next.

What about the parallel consumer pattern?

Up to this point, we might have two questions in mind:

- If #partitions = #consumers in the consumer group, how can I scale even further if needed? It's not always easy to calculate the number of partitions beforehand and/or I might need to account for sudden spikes.
- How can I minimize rebalancing time?

One solution to this can be the parallel consumer pattern. You can have consumers in your group consuming from one or more partitions of the topic, but then they propagate the actual processing to other threads.

One such implementation can be found here.

It provides three ordering guarantees: Unordered, Key and Partition.

- Unordered: provides no ordering guarantees.
- Key: guarantees ordering per key, BUT with the caveat that the keyspace needs to be quite large, otherwise you might not observe much performance improvement.
- Partition: only one message will be processed per partition at any time.
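To get a feeling for what Key ordering means, here is a minimal sketch in plain Java. It is not how the library mentioned above is implemented; it only illustrates the idea that records with the same key are always handled by the same single-threaded lane, while different keys can be processed in parallel. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Processes records in parallel while keeping the order of records that share a key. */
final class KeyOrderedProcessor {

    /** Returns the values seen for each key, in the order they were processed. */
    static Map<String, List<Integer>> process(List<Map.Entry<String, Integer>> records, int lanes) {
        List<ExecutorService> executors = new ArrayList<>();
        for (int i = 0; i < lanes; i++) {
            executors.add(Executors.newSingleThreadExecutor()); // one thread per lane
        }
        Map<String, List<Integer>> seen = new ConcurrentHashMap<>();
        for (Map.Entry<String, Integer> entry : records) {
            // Same key -> same lane -> same thread, so per-key order is preserved.
            int lane = Math.floorMod(entry.getKey().hashCode(), lanes);
            executors.get(lane).execute(() ->
                seen.computeIfAbsent(entry.getKey(), k -> new CopyOnWriteArrayList<>())
                    .add(entry.getValue()));
        }
        try {
            for (ExecutorService executor : executors) {
                executor.shutdown();                              // no new work, finish the queue
                executor.awaitTermination(10, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for lanes", e);
        }
        return seen;
    }
}
```

The sketch also shows why the keyspace matters: with only a handful of distinct keys, most lanes receive no work and the extra threads do not help.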

Along with that, it also provides different ways of committing offsets. This is a pretty nice library you might want to look at.

The Parallel Consumer Pattern

Going back once more to our example to answer the question: how can we break the scaling limit? We will be using the parallel consumer pattern; you can find the code here.

Using one parallel consumer instance on our 5-partition topic, specifying Key ordering and a parallelism of 100 threads, brings the consuming and processing time for 10k messages down to about 6 seconds.

1 parallel consuming instance

Notice in the screenshot how different batches are processed on different threads of the same consumer instance.

If we use 5 parallel consumer instances, we get that down to 3 seconds.

5 parallel consuming instances

Notice in the screenshot how different batches are processed on different threads on different consumer instances.

Wrapping Up

In this part, we saw how the consuming side of Kafka works. As takeaways when creating consuming applications:

- We need to take into account the number of partitions each topic has.
- We need to think about our processing requirements and try to account for slow consumers.
- We can scale both with consumer groups and with the parallel consumer pattern.
- Message ordering, the size of the keyspace and partition guarantees need to be taken into account to see which approach works best (or a combination of both).

Check Next: Part 3 Offsets and how to handle them.