[doc] improve quickstart: merge Iceberg/Paimon sections & adjust DML behavior #1924
79edc11 to 7575354
| To enable lakehouse functionality as a tiered storage solution for a table, you must create the table with the configuration option `table.datalake.enabled = true`. | ||
| Return to the `SQL client` and execute the following SQL statement to create a table with data lake integration enabled: | ||
| ```sql title="Flink SQL" | ||
| CREATE TABLE datalake_enriched_orders ( |
It's a pity that `CREATE TABLE datalake_enriched_orders` and `INSERT INTO datalake_enriched_orders` can't be shared by Paimon and Iceberg. Union read in batch mode is not supported for primary-key tables. In Iceberg, you can only use an append-only table.
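To make the difference concrete, here is a minimal sketch of the two table shapes under discussion. The table names `orders_pk` and `orders_append_only`, the trimmed column list, and the column types are hypothetical. Only `order_key`, `nation_name`, the `PRIMARY KEY ... NOT ENFORCED` clause, and the `table.datalake.enabled` option appear in this PR.

```sql
-- Hypothetical, trimmed-down tables for illustration only;
-- the table names and column types are assumptions.

-- Primary-key variant (the shape the original Paimon quickstart used).
CREATE TABLE orders_pk (
    `order_key` BIGINT,
    `nation_name` STRING,
    PRIMARY KEY (`order_key`) NOT ENFORCED
) WITH (
    'table.datalake.enabled' = 'true'
);

-- Append-only variant (no primary key), the shape the Iceberg quickstart is limited to.
CREATE TABLE orders_append_only (
    `order_key` BIGINT,
    `nation_name` STRING
) WITH (
    'table.datalake.enabled' = 'true'
);
```

Because the two statements differ only in the `PRIMARY KEY` line, the surrounding prose can still be shared even when the DDL itself cannot.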
Hi @luoyuxia, the table of contents doesn't work with tabs; see facebook/docusaurus#5343. The only solution we may have is to disable the table of contents in tabs: facebook/docusaurus#5343 (comment)
Hi @luoyuxia, I've updated the doc with the changes we discussed offline. Please take a look.
Pull Request Overview
This PR refactors the Flink quickstart documentation by consolidating duplicate content between the Paimon and Iceberg guides into shared files, and implementing a tabbed interface to switch between the two data lake formats.
- Extracted common sections (prerequisites, Flink SQL operations, cleanup) into reusable shared files
- Reorganized the main `flink.md` to use a tabbed interface with Paimon and Iceberg as separate tabs
- Created new MDX files (`_flink-paimon.mdx` and `_flink-iceberg.mdx`) that compose the shared content
- Removed the standalone `flink-iceberg.md` file
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `website/docs/quickstart/flink.md` | Converted to a tabbed interface that displays Paimon and Iceberg guides |
| `website/docs/quickstart/flink-iceberg.md` | Deleted - content moved to `_flink-iceberg.mdx` |
| `website/docs/quickstart/_flink-paimon.mdx` | New file containing Paimon-specific quickstart content with shared imports |
| `website/docs/quickstart/_flink-iceberg.mdx` | New file containing Iceberg-specific quickstart content with shared imports |
| `website/docs/quickstart/_shared-prerequisites.md` | Extracted common prerequisites section |
| `website/docs/quickstart/_shared-flink-sql.md` | Extracted common Flink SQL operations, table creation, queries, and update/delete operations |
| `website/docs/quickstart/_shared-streaming-into-fluss.md` | Extracted common datalake-enabled table creation and streaming data writing |
| `website/docs/quickstart/_shared-cleanup.md` | Extracted common cleanup and learn more sections |
| `nation_name` STRING, | ||
| PRIMARY KEY (`order_key`) NOT ENFORCED |
The PRIMARY KEY definition is missing from the datalake_enriched_orders table in the shared file. Looking at the original flink.md (Paimon version), the table had PRIMARY KEY (order_key) NOT ENFORCED on line 20, but the original flink-iceberg.md did not have this constraint. The shared file now includes the PRIMARY KEY, which changes the behavior for Iceberg users who previously created tables without primary keys. This could break existing workflows or cause unexpected behavior.
| `nation_name` STRING, | |
| PRIMARY KEY (`order_key`) NOT ENFORCED | |
| `nation_name` STRING |
| includes the [fluss-flink](engine-flink/getting-started.md), [iceberg-flink](https://iceberg.apache.org/docs/latest/flink/) and | ||
| [flink-connector-faker](https://flink-packages.org/packages/flink-faker) to simplify this guide. | ||
|
|
||
| 3. To start all containers, run: |
The step numbering is incorrect. Step 2 creates the lib directory and downloads the Hadoop jar (lines 29-34), but then step 3 appears twice: once for creating the docker-compose.yml file (line 49) and again for starting containers (line 138). The docker-compose.yml creation should be labeled as step 3, and starting containers should be step 4.
| 3. To start all containers, run: | |
| 4. To start all containers, run: |
4585132 to 7575354
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| 1. Create a working directory for this guide. | ||
|
|
||
| ```shell | ||
| mkdir fluss-quickstart-flink |
| mkdir fluss-quickstart-flink | |
| mkdir fluss-quickstart-flink-paimon |
| @@ -437,8 +447,103 @@ LEFT JOIN fluss_nation FOR SYSTEM_TIME AS OF o.ptime AS n | |||
| ON c.nation_key = n.nation_key; | |||
| ``` | |||
|
|
|||
| </TabItem> | |||
| </Tabs> | |||
|
|
|||
| ### Real-Time Analytics on Fluss datalake-enabled Tables | |||
There is not much difference in this step. Can these two tabs be merged into one, except for the part on viewing the files?
| ## Learn more | ||
| Now that you're up and running with Fluss and Flink with Iceberg, check out the [Apache Flink Engine](engine-flink/getting-started.md) docs to learn more features with Flink or [this guide](/maintenance/observability/quickstart.md) to learn how to set up an observability stack for Fluss and Flink. No newline at end of file | ||
| </TabItem> | ||
| </Tabs> No newline at end of file |
| ## Streaming into Fluss | ||
|
|
||
| First, run the following SQL to sync data from source tables to Fluss tables: | ||
| Next, perform streaming data writing into the **datalake-enabled** table, `datalake_enriched_orders`: |
From `Next, perform streaming data writing into the **datalake-enabled**` to `INSERT INTO datalake_enriched_orders`, we can reuse the same content, but remember to use:
INSERT INTO datalake_enriched_orders
SELECT o.order_key,
o.cust_key,
o.total_price,
o.order_date,
o.order_priority,
o.clerk,
c.name,
c.phone,
c.acctbal,
c.mktsegment,
n.name
FROM (
SELECT *, PROCTIME() as ptime
FROM `default_catalog`.`default_database`.source_order
) o
LEFT JOIN fluss_customer FOR SYSTEM_TIME AS OF o.ptime AS c
ON o.cust_key = c.cust_key
LEFT JOIN fluss_nation FOR SYSTEM_TIME AS OF o.ptime AS n
ON c.nation_key = n.nation_key;
This content is short. I think we can keep it as it is?
| the `fluss_orders` table with information from the `fluss_customer` and `fluss_nation` primary-key tables. | ||
| ```sql title="Flink SQL" | ||
| -- execute DML job asynchronously | ||
| SET 'table.dml-sync' = 'false'; |
We don't need this since it's `false` by default.
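For context, a minimal sketch of what that default means in the Flink SQL client: with `table.dml-sync` left at `false`, an `INSERT INTO` statement is submitted as a job and the client returns right away. The table names below are hypothetical and not part of this PR.

```sql
-- No SET needed: 'table.dml-sync' defaults to 'false',
-- so this INSERT is submitted asynchronously.
INSERT INTO target_table          -- hypothetical table name
SELECT * FROM source_table;       -- hypothetical table name

-- Set the option explicitly only when the client should block
-- until the DML job finishes.
SET 'table.dml-sync' = 'true';
```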
| -- switch to batch mode | ||
| SET 'execution.runtime-mode' = 'batch'; | ||
| ``` | ||
|
|
`SET 'sql-client.execution.result-mode' = 'tableau';` Also change this so the output looks good on screen.
| SET 'execution.runtime-mode' = 'batch'; | ||
| -- use tableau result mode | ||
| SET 'sql-client.execution.result-mode' = 'tableau'; | ||
| ``` |
I think setting batch mode is still required, right?
wuchong left a comment
Thanks @xx789633 for the great work!
Currently, we have three shared Markdown snippets:
- `_shared-cleanup.md`
- `_shared-create-table.md`
- `_shared-lake-analytics.md`

While content sharing can reduce duplication, it also harms readability and flow, especially in documentation meant for users. In this case:
- `_shared-lake-analytics.md` is only used by `flink-lake.md`, so there’s no real benefit to extracting it.
- `_shared-cleanup.md` contains just a few lines; keeping it inline improves clarity.
- `_shared-create-table.md` does reuse some content, but it disrupts the logical order of the `CREATE TABLE` statements, which feels unnatural and confusing from a user’s perspective.
The original intent of this issue was to maximize reuse of the “Real-Time Analytics with Flink” section across the Paimon and Iceberg quickstarts. But now that we’ve split them into two distinct guides (“Real-Time Analytics with Flink” and “Building a Streaming Lakehouse”) we’ve already addressed the core concern.
Therefore, I suggest we avoid content reuse altogether here. Our goal isn’t to eliminate all duplication at the cost of usability and readability.
Let me know what you think!
| import CreateTable from './_shared-create-table.md'; | ||
|
|
||
| For more information on working with Flink, refer to the [Apache Flink Engine](engine-flink/getting-started.md) section. | ||
| This guide will help you set up a basic streaming Lakehouse using Fluss with Paimon or Iceberg. |
| This guide will help you set up a basic streaming Lakehouse using Fluss with Paimon or Iceberg. | |
| This guide will help you set up a basic Streaming Lakehouse using Fluss with Paimon or Iceberg, and help you better understand the powerful feature of Union Read. |
|
|
||
| ```shell | ||
| mkdir fluss-quickstart-flink-paimon | ||
| cd fluss-quickstart-flink-paimon |
Can we just name this directory `fluss-quickstart-paimon`? In the future, we want to introduce Trino and other query engines into this quickstart, so binding the name to a specific engine is not feasible. The same applies to the Iceberg directory.
| @@ -285,104 +384,15 @@ SELECT o.order_key, | |||
| c.mktsegment, | |||
| n.name | |||
| FROM fluss_order o | |||
`fluss_order` is not defined on this page.
Besides, there are no records in `fluss_customer`, `fluss_order`, and `fluss_nation`, as there is no `INSERT INTO` job writing to these tables.
Also, this query is different from the one in the Iceberg tab. I think we need to keep them consistent?
| ├── LATEST | ||
| └── snapshot-1 | ||
| ``` | ||
| The files adhere to Paimon's standard format, enabling seamless querying with other engines such as [StarRocks](https://docs.starrocks.io/docs/data_source/catalog/paimon_catalog/). |
Keep this consistent with the Iceberg tab? Use "Trino" and "Spark" here?
| @@ -1,15 +1,16 @@ | |||
| --- | |||
| title: Real-Time Analytics with Flink (Iceberg) | |||
| title: Build Streaming Lakehouse | |||
- Rename the title from `Build Streaming Lakehouse` to `Building a Streaming Lakehouse`; this matches standard quickstart naming conventions.
- Consider renaming the file from `flink-lake.md` to `lakehouse.md` for better clarity and consistency, since the guide now focuses on the broader lakehouse architecture, not just Flink integration.
wuchong left a comment
I appended a commit to do some minor improvements.
I cherry-picked this PR to the release-0.8 branch: c9dcf80. @xx789633, could you help verify whether all the v0.8 quickstarts (Flink, Iceberg, Paimon) still work as expected? https://fluss.apache.org/docs/quickstart/flink/

I'll do this tomorrow.

This SQL in the Iceberg tab:
-- insert tuples into datalake_enriched_orders
INSERT INTO datalake_enriched_orders
SELECT o.order_key,
o.cust_key,
o.total_price,
o.order_date,
o.order_priority,
o.clerk,
c.name,
c.phone,
c.acctbal,
c.mktsegment,
n.name
FROM fluss_order o
LEFT JOIN fluss_customer FOR SYSTEM_TIME AS OF `o`.`ptime` AS `c`
ON o.cust_key = c.cust_key
LEFT JOIN fluss_nation FOR SYSTEM_TIME AS OF `o`.`ptime` AS `n`
ON c.nation_key = n.nation_key;
needs to be changed to:
-- insert tuples into datalake_enriched_orders
INSERT INTO datalake_enriched_orders
SELECT o.order_key,
o.cust_key,
o.total_price,
o.order_date,
o.order_priority,
o.clerk,
c.name,
c.phone,
c.acctbal,
c.mktsegment,
n.name
FROM (
SELECT *, PROCTIME() as ptime
FROM `default_catalog`.`default_database`.source_order
) o
LEFT JOIN fluss_customer FOR SYSTEM_TIME AS OF o.ptime AS c
ON o.cust_key = c.cust_key
LEFT JOIN fluss_nation FOR SYSTEM_TIME AS OF o.ptime AS n
ON c.nation_key = n.nation_key;
I have created a pull request to fix that: #1964. Other than that, everything works as expected.
…Real-Time Analytics with Flink" (apache#1924)
Purpose
Linked issue: close #1921
Brief change log
Tests
API and Format
Documentation