[doc] improve quickstart: merge Iceberg/Paimon sections & adjust DML behavior#1924

Merged: wuchong merged 17 commits into apache:main from xx789633:quick-improve on Nov 10, 2025

Contributor

@xx789633 commented Nov 3, 2025

Purpose

Linked issue: close #1921

Brief change log

Tests

API and Format

Documentation

Contributor

@luoyuxia left a comment

@xx789633 Thanks for the PR. I left a minor comment.
One more thing: the side navigation no longer works with tabs:

  • The same navigation item appears more than once.
  • Clicking a navigation item does not jump to its section.

To enable lakehouse functionality as a tiered storage solution for a table, you must create the table with the configuration option `table.datalake.enabled = true`.
Return to the `SQL client` and execute the following SQL statement to create a table with data lake integration enabled:
```sql title="Flink SQL"
CREATE TABLE datalake_enriched_orders (

Contributor

It's a pity that `CREATE TABLE datalake_enriched_orders` and `INSERT INTO datalake_enriched_orders` can't be shared by Paimon and Iceberg. Union read in batch mode is not supported for primary-key tables. With Iceberg, you can only use an append-only table.

Contributor Author

@xx789633 Nov 3, 2025

Hi @luoyuxia, the table of contents doesn't work with tabs; see facebook/docusaurus#5343. The only solution we may have is to disable the table of contents in tabs: facebook/docusaurus#5343 (comment)

Contributor Author

Hi @luoyuxia, I've updated the doc with the changes we discussed offline. Please take a look.

@luoyuxia requested a review from Copilot November 3, 2025 06:57

Contributor

Copilot AI left a comment

Pull Request Overview

This PR refactors the Flink quickstart documentation by consolidating duplicate content between the Paimon and Iceberg guides into shared files, and implementing a tabbed interface to switch between the two data lake formats.

  • Extracted common sections (prerequisites, Flink SQL operations, cleanup) into reusable shared files
  • Reorganized the main flink.md to use a tabbed interface with Paimon and Iceberg as separate tabs
  • Created new MDX files (_flink-paimon.mdx and _flink-iceberg.mdx) that compose the shared content
  • Removed the standalone flink-iceberg.md file

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `website/docs/quickstart/flink.md` | Converted to a tabbed interface that displays Paimon and Iceberg guides |
| `website/docs/quickstart/flink-iceberg.md` | Deleted - content moved to `_flink-iceberg.mdx` |
| `website/docs/quickstart/_flink-paimon.mdx` | New file containing Paimon-specific quickstart content with shared imports |
| `website/docs/quickstart/_flink-iceberg.mdx` | New file containing Iceberg-specific quickstart content with shared imports |
| `website/docs/quickstart/_shared-prerequisites.md` | Extracted common prerequisites section |
| `website/docs/quickstart/_shared-flink-sql.md` | Extracted common Flink SQL operations, table creation, queries, and update/delete operations |
| `website/docs/quickstart/_shared-streaming-into-fluss.md` | Extracted common datalake-enabled table creation and streaming data writing |
| `website/docs/quickstart/_shared-cleanup.md` | Extracted common cleanup and learn more sections |

Comment on lines +19 to +20
`nation_name` STRING,
PRIMARY KEY (`order_key`) NOT ENFORCED

Copilot AI Nov 3, 2025

The datalake_enriched_orders table in the shared file now defines a PRIMARY KEY for both formats. In the original flink.md (Paimon version), the table had PRIMARY KEY (order_key) NOT ENFORCED on line 20, but the original flink-iceberg.md did not have this constraint. Including it in the shared file changes the behavior for Iceberg users, who previously created the table without a primary key. This could break existing workflows or cause unexpected behavior.

Suggested change
- `nation_name` STRING,
- PRIMARY KEY (`order_key`) NOT ENFORCED
+ `nation_name` STRING

includes the [fluss-flink](engine-flink/getting-started.md), [iceberg-flink](https://iceberg.apache.org/docs/latest/flink/) and
[flink-connector-faker](https://flink-packages.org/packages/flink-faker) to simplify this guide.

3. To start all containers, run:

Copilot AI Nov 3, 2025

The step numbering is incorrect. Step 2 creates the lib directory and downloads the Hadoop jar (lines 29-34), but then step 3 appears twice: once for creating the docker-compose.yml file (line 49) and again for starting containers (line 138). The docker-compose.yml creation should be labeled as step 3, and starting containers should be step 4.

Suggested change
- 3. To start all containers, run:
+ 4. To start all containers, run:

@luoyuxia requested a review from Copilot November 5, 2025 06:55

Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread website/docs/quickstart/flink.md Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Contributor

@luoyuxia left a comment

@xx789633 Thanks for the PR. It should be ready to merge in the next iteration. After you modify the doc, please go through the quickstart to make sure it works.

Comment thread website/docs/quickstart/flink-lake.md Outdated
1. Create a working directory for this guide.

```shell
mkdir fluss-quickstart-flink

Contributor

Suggested change
- mkdir fluss-quickstart-flink
+ mkdir fluss-quickstart-flink-paimon

Contributor Author

addressed

@@ -437,8 +447,103 @@ LEFT JOIN fluss_nation FOR SYSTEM_TIME AS OF o.ptime AS n
ON c.nation_key = n.nation_key;
```

</TabItem>
</Tabs>

### Real-Time Analytics on Fluss datalake-enabled Tables

Contributor

There isn't much difference in this step. Can these two tabs be merged into one, except for the part that views the files?
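
For context, a minimal sketch of the kind of queries this step runs, which would be the same for both lake formats. The two `SET` statements are taken from the snippets quoted elsewhere in this thread. The `$lake` table suffix (read only the data already tiered to the lake) and the aggregate over `total_price` are assumptions based on the Fluss Flink documentation, not something this PR states.

```sql
-- Switch the SQL client to batch mode with a compact result layout
-- (both settings appear in the review snippets in this thread).
SET 'execution.runtime-mode' = 'batch';
SET 'sql-client.execution.result-mode' = 'tableau';

-- Assumption: the `$lake` suffix reads only data already tiered to Paimon/Iceberg.
SELECT SUM(total_price) AS sum_price FROM datalake_enriched_orders$lake;

-- Assumption: querying the table without the suffix performs a union read,
-- combining lake data with the fresher data still held in Fluss.
SELECT SUM(total_price) AS sum_price FROM datalake_enriched_orders;
```

Whether the union read works in batch mode depends on the table type and lake format, as noted earlier in the review. Under this reading, only the file-inspection part would need to stay format-specific.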

Contributor Author

done

Comment thread website/docs/quickstart/flink-lake.md Outdated
## Learn more
Now that you're up and running with Fluss and Flink with Iceberg, check out the [Apache Flink Engine](engine-flink/getting-started.md) docs to learn more features with Flink or [this guide](/maintenance/observability/quickstart.md) to learn how to set up an observability stack for Fluss and Flink. No newline at end of file
</TabItem>
</Tabs> No newline at end of file

Contributor

We still need the cleanup section.

Contributor Author

done

## Streaming into Fluss

First, run the following SQL to sync data from source tables to Fluss tables:
Next, perform streaming data writing into the **datalake-enabled** table, `datalake_enriched_orders`:

Contributor

From

Next, perform streaming data writing into the **datalake-enabled**

to

INSERT INTO datalake_enriched_orders

we can reuse the same content, but remember to use:

INSERT INTO datalake_enriched_orders
SELECT o.order_key,
       o.cust_key,
       o.total_price,
       o.order_date,
       o.order_priority,
       o.clerk,
       c.name,
       c.phone,
       c.acctbal,
       c.mktsegment,
       n.name
FROM (
    SELECT *, PROCTIME() as ptime
    FROM `default_catalog`.`default_database`.source_order
) o
LEFT JOIN fluss_customer FOR SYSTEM_TIME AS OF o.ptime AS c
    ON o.cust_key = c.cust_key
LEFT JOIN fluss_nation FOR SYSTEM_TIME AS OF o.ptime AS n
    ON c.nation_key = n.nation_key;

Contributor Author

This content is short. I think we can keep it as is?

Contributor

@luoyuxia left a comment

@xx789633 I left two more minor comments.

Comment thread website/docs/quickstart/flink-lake.md Outdated
the `fluss_orders` table with information from the `fluss_customer` and `fluss_nation` primary-key tables.
```sql title="Flink SQL"
-- execute DML job asynchronously
SET 'table.dml-sync' = 'false';

Contributor

We don't need this, since it's `false` by default.

-- switch to batch mode
SET 'execution.runtime-mode' = 'batch';
```

Contributor

SET 'sql-client.execution.result-mode' = 'tableau';

Also set this so that the output displays well on screen.

SET 'execution.runtime-mode' = 'batch';
-- use tableau result mode
SET 'sql-client.execution.result-mode' = 'tableau';
```

Contributor

I think setting batch mode is still required, right?

Member

@wuchong left a comment

Thanks @xx789633 for the great work!

Currently, we have three shared Markdown snippets:

  • _shared-cleanup.md
  • _shared-create-table.md
  • _shared-lake-analytics.md

While content sharing can reduce duplication, it also harms readability and flow—especially in documentation meant for users. In this case:

  • _shared-lake-analytics.md is only used by flink-lake.md, so there’s no real benefit to extracting it.
  • _shared-cleanup.md contains just a few lines; keeping it inline improves clarity.
  • _shared-create-table.md does reuse some content, but it disrupts the logical order of the CREATE TABLE statements, which feels unnatural and confusing from a user’s perspective.

The original intent of this issue was to maximize reuse of the “Real-Time Analytics with Flink” section across the Paimon and Iceberg quickstarts. But now that we’ve split them into two distinct guides (“Real-Time Analytics with Flink” and “Building a Streaming Lakehouse”) we’ve already addressed the core concern.

Therefore, I suggest we avoid content reuse altogether here. Our goal isn’t to eliminate all duplication at the cost of usability and readability.

Let me know what you think!

Comment thread website/docs/quickstart/flink-lake.md Outdated
import CreateTable from './_shared-create-table.md';

For more information on working with Flink, refer to the [Apache Flink Engine](engine-flink/getting-started.md) section.
This guide will help you set up a basic streaming Lakehouse using Fluss with Paimon or Iceberg.

Member

Suggested change
- This guide will help you set up a basic streaming Lakehouse using Fluss with Paimon or Iceberg.
+ This guide will help you set up a basic Streaming Lakehouse using Fluss with Paimon or Iceberg, and help you better understand the powerful feature of Union Read.

Comment thread website/docs/quickstart/flink-lake.md Outdated

```shell
mkdir fluss-quickstart-flink-paimon
cd fluss-quickstart-flink-paimon

Member

Can we just name this directory fluss-quickstart-paimon? In the future, we want to introduce Trino and other query engines into this quickstart, so binding the name to a specific engine is not feasible. The same applies to the Iceberg directory.

@@ -285,104 +384,15 @@ SELECT o.order_key,
c.mktsegment,
n.name
FROM fluss_order o

Member

fluss_order is not defined on this page.

Besides, there are no records in fluss_customer, fluss_order, and fluss_nation, because no INSERT INTO job writes to these tables.

Also, this query differs from the one in the Iceberg tab. I think we need to keep them consistent?

Comment thread website/docs/quickstart/flink-lake.md Outdated
├── LATEST
└── snapshot-1
```
The files adhere to Paimon's standard format, enabling seamless querying with other engines such as [StarRocks](https://docs.starrocks.io/docs/data_source/catalog/paimon_catalog/).

Member

Keep this consistent with the Iceberg tab? Use "Trino" and "Spark" here?

Comment thread website/docs/quickstart/flink-lake.md Outdated
@@ -1,15 +1,16 @@
---
title: Real-Time Analytics with Flink (Iceberg)
title: Build Streaming Lakehouse

Member

  1. Rename the title from Build Streaming Lakehouse to Building a Streaming Lakehouse; this matches standard quickstart naming conventions.
  2. Consider renaming the file from flink-lake.md to lakehouse.md for better clarity and consistency, since the guide now focuses on the broader lakehouse architecture, not just Flink integration.

Member

@wuchong left a comment

I appended a commit with some minor improvements.

@wuchong wuchong merged commit 8faff0e into apache:main Nov 10, 2025
2 checks passed
wuchong pushed a commit that referenced this pull request Nov 10, 2025
…Real-Time Analytics with Flink" (#1924)

(cherry picked from commit 8faff0e)
@wuchong
Member

wuchong commented Nov 10, 2025

I cherry-picked this PR to release-0.8 branch: c9dcf80

@xx789633 could you help verify whether all the v0.8 quickstarts (Flink, Iceberg, Paimon) still work as expected?

https://fluss.apache.org/docs/quickstart/flink/
https://fluss.apache.org/docs/quickstart/lakehouse/

@xx789633
Contributor Author

I'll do this tomorrow.

@xx789633
Contributor Author

This SQL in the Iceberg tab

-- insert tuples into datalake_enriched_orders
INSERT INTO datalake_enriched_orders
SELECT o.order_key,
       o.cust_key,
       o.total_price,
       o.order_date,
       o.order_priority,
       o.clerk,
       c.name,
       c.phone,
       c.acctbal,
       c.mktsegment,
       n.name
FROM fluss_order o
LEFT JOIN fluss_customer FOR SYSTEM_TIME AS OF `o`.`ptime` AS `c`
    ON o.cust_key = c.cust_key
LEFT JOIN fluss_nation FOR SYSTEM_TIME AS OF `o`.`ptime` AS `n`
    ON c.nation_key = n.nation_key;

needs to be changed to:

-- insert tuples into datalake_enriched_orders
INSERT INTO datalake_enriched_orders
SELECT o.order_key,
       o.cust_key,
       o.total_price,
       o.order_date,
       o.order_priority,
       o.clerk,
       c.name,
       c.phone,
       c.acctbal,
       c.mktsegment,
       n.name
FROM (
    SELECT *, PROCTIME() as ptime
    FROM `default_catalog`.`default_database`.source_order
) o
LEFT JOIN fluss_customer FOR SYSTEM_TIME AS OF o.ptime AS c
    ON o.cust_key = c.cust_key
LEFT JOIN fluss_nation FOR SYSTEM_TIME AS OF o.ptime AS n
    ON c.nation_key = n.nation_key;

I have created a pull request to fix that: #1964

Other than that, everything works as expected.

Ugbot pushed a commit to Ugbot/fluss that referenced this pull request Apr 26, 2026