[Flink] Support Array type in Flink connector #2040

XuQianJin-Stars · 2025-11-27T08:19:49Z

Purpose

Linked issue: close #1978

This PR adds comprehensive support for Array type in the Flink connector, enabling users to read and write array data between Flink and Fluss tables. This addresses a key limitation in the connector's type support system.

Brief change log

Core Array type support in common module:
- Enhanced Arrow format readers and writers (ArrowReader, ArrowWriter, ArrowArrayWriter, ArrowFieldWriter) to handle array serialization and deserialization
- Updated ArrowUtils to support Array type schema conversion
Flink connector integration:
- Implemented bidirectional type conversion in FlussRowToFlinkRowConverter and FlinkRowToFlussRowConverter to support Array types
- Added FlinkAsFlussArray wrapper class to convert Flink's ArrayData to Fluss's InternalArray
- Updated FlinkAsFlussRow to support array field access
Comprehensive test coverage:
- Added FlinkArrayTypeITCase as the base test class with comprehensive integration tests covering:
  - Arrays of primitive types (int, bigint, string, boolean, double)
  - Arrays with null elements
  - Nested arrays (array of arrays)
  - Arrays in primary key tables
  - Array operations and access
- Created version-specific test implementations for Flink 1.18, 1.19, 1.20, and 2.1
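
The wrapper approach in the change log above (FlinkAsFlussArray) is an adapter: one engine's array view is exposed through the other's interface by delegation, without copying elements. Below is a minimal sketch of that pattern only. `EngineArray`, `StoreArray`, `EngineAsStoreArray`, and `ListEngineArray` are hypothetical stand-ins invented for illustration; they are not Flink's `ArrayData` or Fluss's `InternalArray`, whose real interfaces cover many more types.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the source engine's array view (not Flink's ArrayData). */
interface EngineArray {
    int size();
    boolean isNullAt(int pos);
    int getInt(int pos);
}

/** Hypothetical stand-in for the storage-side array view (not Fluss's InternalArray). */
interface StoreArray {
    int size();
    boolean isNullAt(int pos);
    int getInt(int pos);
}

/** Adapter: delegates every call to the wrapped array, so no elements are copied. */
final class EngineAsStoreArray implements StoreArray {
    private final EngineArray delegate;

    EngineAsStoreArray(EngineArray delegate) {
        this.delegate = delegate;
    }

    @Override public int size() { return delegate.size(); }
    @Override public boolean isNullAt(int pos) { return delegate.isNullAt(pos); }
    @Override public int getInt(int pos) { return delegate.getInt(pos); }
}

/** Simple list-backed source array, used only to exercise the adapter. */
final class ListEngineArray implements EngineArray {
    private final List<Integer> values;

    ListEngineArray(Integer... values) {
        this.values = Arrays.asList(values);
    }

    @Override public int size() { return values.size(); }
    @Override public boolean isNullAt(int pos) { return values.get(pos) == null; }
    @Override public int getInt(int pos) { return values.get(pos); }
}
```

Wrapping `new ListEngineArray(1, null, 3)` in `EngineAsStoreArray` yields a three-element view whose second element reports as null, which is the kind of case the null-element tests listed above would exercise.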

Tests

Unit Tests:

Arrow array writer and reader tests for serialization/deserialization

Integration Tests:

Flink118ComplexTypeITCase: Array type support for Flink 1.18
Flink119ComplexTypeITCase: Array type support for Flink 1.19
Flink120ComplexTypeITCase: Array type support for Flink 1.20
Flink21ComplexTypeITCase: Array type support for Flink 2.1

Test scenarios include:

testArrayOfPrimitiveTypesInLogTable: Verifies arrays of int, bigint, string, boolean, and double types
testArrayWithNullElements: Ensures proper handling of null elements within arrays and null arrays
testNestedArrays: Tests multi-dimensional arrays (array of arrays)
testArrayInPrimaryKeyTable: Validates array support in tables with primary keys
testArrayWithAllDataTypes: Comprehensive test with all supported data types in arrays
testArrayAccessAndCardinality: Tests array operations and element access

API and Format

API Changes:

No breaking API changes
Array type now supported in Flink connector's type mapping

Storage Format:

Enhanced Arrow format to support Array type serialization
Backward compatible with existing storage formats

Documentation

Feature Introduction:

This change introduces Array type support as a new feature in the Flink connector
Users can now create tables with array columns and perform read/write operations through Flink SQL

Documentation Updates Needed:

Update website/docs/engine-flink/getting-started.md to reflect Array type support in the type mapping table (change from "Not supported" to "Supported")
Add examples of using Array types in Flink SQL with Fluss tables

leekeiabstraction · 2025-11-27T12:09:06Z

+                        + "bigint_array array<bigint>, "
+                        + "double_array array<double>, "
+                        + "boolean_array array<boolean>, "
+                        + "string_array array<string>"


Are the types here exhaustive? E.g. float, smallint, tinyint, timestamp etc. aren't tested.

https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/types/
https://fluss.apache.org/docs/next/table-design/data-types/

leekeiabstraction · 2025-11-27T12:14:46Z

Integration tests are failing with stack traces that seem related to the Arrow*Writer changes:

Error:  org.apache.fluss.server.kv.KvTabletTest.testPartialUpdateAndDelete  Time elapsed: 0.043 s  <<< ERROR!
java.lang.IndexOutOfBoundsException: index: 0, length: 1 (expected: range(0, 0))
	at org.apache.fluss.shaded.arrow.org.apache.arrow.memory.ArrowBuf.checkIndexD(ArrowBuf.java:319)
	at org.apache.fluss.shaded.arrow.org.apache.arrow.memory.ArrowBuf.chk(ArrowBuf.java:306)
	at org.apache.fluss.shaded.arrow.org.apache.arrow.memory.ArrowBuf.getByte(ArrowBuf.java:508)
	at org.apache.fluss.shaded.arrow.org.apache.arrow.vector.BitVectorHelper.setBit(BitVectorHelper.java:82)
	at org.apache.fluss.shaded.arrow.org.apache.arrow.vector.IntVector.set(IntVector.java:160)
	at org.apache.fluss.row.arrow.writers.ArrowIntWriter.doWrite(ArrowIntWriter.java:38)
	at org.apache.fluss.row.arrow.writers.ArrowFieldWriter.write(ArrowFieldWriter.java:59)
	at org.apache.fluss.row.arrow.ArrowWriter.writeRow(ArrowWriter.java:201)

XuQianJin-Stars · 2025-11-27T14:09:35Z

Hi @leekeiabstraction, please hold off on reviewing this PR until CI passes; I will continue to make revisions.

wuchong · 2025-11-30T07:00:46Z

+import static org.assertj.core.api.Assertions.assertThat;
+
+/** Integration tests for Array type support in Flink connector. */
+abstract class FlinkArrayTypeITCase extends AbstractTestBase {


IT cases are very heavy, so we should avoid adding too many tests for a single purpose. I suggest the following updates:

Rename FlinkArrayTypeITCase to FlinkComplexTypeITCase to cover future Map and Row types.

Add a test for a LOG TABLE that covers all array types (ARRAY, ARRAY, several ARRAY nested types). Write records (plus null elements if the element type is nullable) into the table, then read them back from the table.

Add a test for a PK TABLE with the same types as above, also writing to and reading from the table, but additionally testing updates and deletes. See org.apache.fluss.flink.source.FlinkTableSourceITCase#testReadKvTableWithScanStartupModeEqualsFull for an example of how to verify reading a KV table for both snapshot read and incremental read. Then test a lookup join against the table.

Add exception tests verifying that an Array type cannot be used as a primary key, bucket key, or partition key.
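
The last item asks that array columns be rejected as primary key, bucket key, or partition key. A minimal sketch of such a check follows, assuming a heavily simplified schema model. `ColumnKind`, `KeyValidator`, and the error messages are hypothetical and are not taken from the Fluss server code.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical simplified column kinds; a real type system is richer. */
enum ColumnKind { INT, STRING, ARRAY }

final class KeyValidator {
    private KeyValidator() {}

    /**
     * Throws IllegalArgumentException if any column named in keyColumns is
     * unknown or is an ARRAY. keyRole is only used in the message,
     * e.g. "primary key", "bucket key", or "partition key".
     */
    static void validateKey(
            Map<String, ColumnKind> schema, List<String> keyColumns, String keyRole) {
        for (String name : keyColumns) {
            ColumnKind kind = schema.get(name);
            if (kind == null) {
                throw new IllegalArgumentException(
                        "Unknown " + keyRole + " column: " + name);
            }
            if (kind == ColumnKind.ARRAY) {
                throw new IllegalArgumentException(
                        "ARRAY column '" + name + "' cannot be used as " + keyRole);
            }
        }
    }

    /** Convenience wrapper returning true when the key is acceptable. */
    static boolean isValidKey(
            Map<String, ColumnKind> schema, List<String> keyColumns, String keyRole) {
        try {
            validateKey(schema, keyColumns, keyRole);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

An exception test in this style would call `validateKey` once per key role with an array column and assert that each call throws.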

wuchong · 2025-11-30T07:01:12Z

+    }
+
+    @Test
+    void testSimpleLogTableWithSinkAPI() throws Exception {


This test is not about array type and is not needed.

wuchong · 2025-11-30T07:01:50Z

            reuseWriter.complete();

-            return reuseArray;
+            return reuseArray.copy();


We should avoid copy() here; it introduces a performance regression.
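
The trade-off behind this comment can be shown in isolation: returning a reused object avoids an allocation per call, but the caller sees its contents change on the next read, while copying makes retention safe at the cost of allocating every time. The sketch below uses a hypothetical `ReusingReader` over plain `int[]` buffers; it is not the writer code from the diff.

```java
import java.util.Arrays;

/** Hypothetical reader that fills one shared buffer per call instead of allocating. */
final class ReusingReader {
    private final int[] reuse = new int[3];
    private int next = 0;

    /** Returns the shared buffer; its contents are overwritten by the next call. */
    int[] readReused() {
        for (int i = 0; i < reuse.length; i++) {
            reuse[i] = next++;
        }
        return reuse;
    }

    /** Returns an independent snapshot; safe to retain, but allocates on every call. */
    int[] readCopied() {
        int[] shared = readReused();
        return Arrays.copyOf(shared, shared.length);
    }
}
```

Callers that fully consume each value before requesting the next can use the reused form; only callers that hold on to a value across reads need the copy.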

wuchong · 2025-11-30T07:01:54Z

+
        CompactedRow row = new CompactedRow(fieldDataTypes.length, compactedRowDeserializer);
-        row.pointTo(writer.segment(), 0, writer.position());
+        row.pointTo(org.apache.fluss.memory.MemorySegment.wrap(rowBytes), 0, rowSize);


We should avoid the deep copy; it introduces a performance regression.

wuchong · 2025-11-30T07:02:06Z

+
        CompactedRow compactedRow = new CompactedRow(fieldDataTypes.length, deserializer);
-        compactedRow.pointTo(segment, offset, sizeInBytes);
+        compactedRow.pointTo(MemorySegment.wrap(rowBytes), 0, sizeInBytes);


We should avoid the deep copy; it introduces a performance regression.

wuchong · 2025-11-30T07:03:11Z

+                        context instanceof LogRecordReadContext
+                                ? (LogRecordReadContext) context
+                                : null);


This forced cast is hacky and error-prone: if we introduce another ReadContext implementation in the future, this code will silently pass null.
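
The hazard described here is that a conditional cast silently yields null for every implementation other than the expected one. One common alternative is to expose the needed capability on the interface itself so each implementation answers explicitly. This is an assumption about a possible remedy, not a description of what the PR ended up doing, and every name below (`ReadContext`, `LogReadContext`, `KvReadContext`, `Decoder`, `schemaFor`) is hypothetical.

```java
/** Hypothetical read-context interface; names are illustrative only. */
interface ReadContext {
    /** Every implementation must state how it resolves a schema id. */
    String schemaFor(int schemaId);
}

final class LogReadContext implements ReadContext {
    @Override public String schemaFor(int schemaId) { return "log-schema-" + schemaId; }
}

final class KvReadContext implements ReadContext {
    @Override public String schemaFor(int schemaId) { return "kv-schema-" + schemaId; }
}

final class Decoder {
    private Decoder() {}

    /** Fragile: any context that is not a LogReadContext silently becomes null. */
    static String describeWithCast(ReadContext context, int schemaId) {
        LogReadContext log =
                context instanceof LogReadContext ? (LogReadContext) context : null;
        return log == null ? null : log.schemaFor(schemaId);
    }

    /** Sturdier: relies on the interface, so new implementations keep working. */
    static String describeViaInterface(ReadContext context, int schemaId) {
        return context.schemaFor(schemaId);
    }
}
```

With a `KvReadContext`, `describeWithCast` returns null while `describeViaInterface` returns a usable value, which is the failure mode the comment warns about.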

wuchong · 2025-11-30T07:03:36Z

        this.bufferAllocator = bufferAllocator;
        this.selectedFieldGetters = selectedFieldGetters;
        this.projectionPushDowned = projectionPushDowned;
+        this.batchRoots = Collections.synchronizedList(new ArrayList<>());


synchronizedList performs poorly; please avoid using it.

wuchong · 2025-11-30T07:08:49Z

+     * Fixes writerIndex for all buffers in all vectors after VectorLoader.load().
+     * VectorLoader.load() sets the capacity but not the writerIndex for buffers.
+     */
+    private static void fixVectorBuffers(VectorSchemaRoot schemaRoot) {


Why do we need this for array type? Why didn't we need it before?

wuchong · 2025-11-30T07:11:40Z

    public void open(InitializationContext context) throws Exception {
        this.converter = new FlinkAsFlussRow();
+        // For primary key tables (non-append-only), we need to encode the row immediately
+        // to avoid issues with Flink reusing RowData objects


I don't understand this. If it's for primary key tables, UpsertWriter already materializes/encodes the row into binary format, which avoids the object-reuse problem.

… type Flink IT cases (apache#2040)

…fix Arrow IndexOutOfBoundsException (apache#2040) This fixes exception: Caused by: java.lang.IndexOutOfBoundsException: index: 0, length: 1 (expected: range(0, 0)) at org.apache.fluss.shaded.arrow.org.apache.arrow.memory.ArrowBuf.checkIndexD(ArrowBuf.java:319)

wuchong

I appended 2 commits to improve the implementation. Please take a look.

XuQianJin-Stars · 2025-12-02T06:28:57Z

+     * VectorizedColumnBatch#getString(int, int)}. This can be removed once we supports object reuse
+     * for Arrow {@link ColumnarRow}, see {@code CompletedFetch#toScanRecord(LogRecord)}.
+     */
+    static FieldGetter createDeepFieldGetter(DataType fieldType, int fieldPos) {


… type Flink IT cases (#2040)

(cherry picked from commit fe7d0cb)

leekeiabstraction reviewed Nov 27, 2025

XuQianJin-Stars force-pushed the feature/issue-1978-array-flink-connector branch 2 times, most recently from 3be6ba1 to 1e40482 Compare November 29, 2025 12:08

wuchong requested changes Nov 30, 2025

XuQianJin-Stars force-pushed the feature/issue-1978-array-flink-connector branch 2 times, most recently from 2449ade to 87eebed Compare December 2, 2025 05:49

wuchong pushed a commit to XuQianJin-Stars/fluss that referenced this pull request Dec 2, 2025

[flink] Support Array type in Flink connector (apache#2040)

bba5e12

wuchong added a commit to XuQianJin-Stars/fluss that referenced this pull request Dec 2, 2025

[server][flink] Validate array type in server side and add more array…

f33d9d3

… type Flink IT cases (apache#2040)

wuchong force-pushed the feature/issue-1978-array-flink-connector branch from 87eebed to c993199 Compare December 2, 2025 06:17

wuchong approved these changes Dec 2, 2025

XuQianJin-Stars and others added 3 commits December 2, 2025 14:18

[flink] Support Array type in Flink connector (apache#2040)

310bc88

[server][flink] Validate array type in server side and add more array…

4493cbe

… type Flink IT cases (apache#2040)

wuchong force-pushed the feature/issue-1978-array-flink-connector branch from c993199 to 1e8123a Compare December 2, 2025 06:21

XuQianJin-Stars commented Dec 2, 2025

wuchong merged commit f1a75ca into apache:main Dec 2, 2025
5 checks passed

wuchong pushed a commit that referenced this pull request Dec 2, 2025

[flink] Support Array type in Flink connector (#2040)

fe7d0cb

wuchong added a commit that referenced this pull request Dec 2, 2025

[server][flink] Validate array type in server side and add more array…

eb8736b

… type Flink IT cases (#2040)

zcoo pushed a commit to zcoo/fluss that referenced this pull request Dec 3, 2025

[flink] Support Array type in Flink connector (apache#2040)

64b3f55

(cherry picked from commit fe7d0cb)

zcoo pushed a commit to zcoo/fluss that referenced this pull request Dec 4, 2025

[flink] Support Array type in Flink connector (apache#2040)

344b2ae

(cherry picked from commit fe7d0cb)

zcoo pushed a commit to zcoo/fluss that referenced this pull request Dec 4, 2025

[flink] Support Array type in Flink connector (apache#2040)

ba6d769

Ugbot pushed a commit to Ugbot/fluss that referenced this pull request Apr 26, 2026

[flink] Support Array type in Flink connector (apache#2040)

2f26a9f

Ugbot pushed a commit to Ugbot/fluss that referenced this pull request Apr 26, 2026

[server][flink] Validate array type in server side and add more array…

6a774b6

… type Flink IT cases (apache#2040)

3 participants
