Skip to content

feat: support binary and string types for concat UDFs#22244

Open
theirix wants to merge 15 commits into
apache:mainfrom
theirix:concat-binary-udf
Open

feat: support binary and string types for concat UDFs#22244
theirix wants to merge 15 commits into
apache:mainfrom
theirix:concat-binary-udf

Conversation

@theirix
Copy link
Copy Markdown
Contributor

@theirix theirix commented May 16, 2026

Which issue does this PR close?

Rationale for this change

While #21883 introduced binary argument support for the pipe operator, this PR targets three UDFs: concat, concat_ws, and Spark's concat to harmonise all their behaviour.

After the first attempt at banning mixed string+concat operations, I switched to allowing them for concat UDFs and the pipe operator - as seen in DuckDB, Spark, Databricks. Previously, mixed behaviour was allowed in #20787 (not released yet). Thus, no breaking API changes since the klast release

What changes are included in this PR?

  • Added support for binary types (four variants)
  • Allow mixed string/binary operations for UDFs and pipe operator
  • Fixed edge cases when binary->string type coercion overrode UDF rules, so mixed calls were allowed
  • Refactored the three UDFs by extracting duplicate code into shared helpers to keep the logic centralised
  • Refined concat_ws behavior for different separator types

The diff is quite large. Detailed code changes:

  • Added a trait ConcatBuilder to abstract string/binary/view/array operations
  • Abstracted StringArray/LargeStringArray into a generic ConcatGenericStringBuilder - less duplication
  • Introduced mirrored builders for binary types in a new file binaries.rs
  • Introduced more ColumnarValueRef variants to handle nullable and non-nullable binary types
  • Extracted the ColumnarValue -> ColumnarValueRef builder to from_columnar_value - simplified call sites. Scalar code path stays mostly the same
  • Switched from Signature::variadic to Signature::UserDefined to allow different argument types.Variadic required every argument (including binaries) to be coerced to the same string type, so the UDF cannot distinguish between binary and string inputs. It's a relatively uncommon - happy to discuss
  • Simplified Spark's concat significantly by reusing the concat implementation, so it handles only Spark-specific null-handling
  • Moved SLTs from Spark SLT (spark/concat.slt) to a generic SLT, so we test both generic and Spark-specific behaviour

Are these changes tested?

  • Added more unit tests for previously uncovered major code paths
  • Added more SLTs, especially for type coercion

Are there any user-facing changes?

No. Also, reverted a breaking api change for || operator

@github-actions github-actions Bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation spark labels May 16, 2026
@theirix theirix marked this pull request as ready for review May 16, 2026 10:01
@theirix
Copy link
Copy Markdown
Contributor Author

theirix commented May 22, 2026

@Jefffrey could you please have a look? The last concat binary improvement

Copy link
Copy Markdown
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try take a proper look at this, but might take some time as there seems to be quite a lot happening in this PR.

cc @neilconway as you worked on the concat builder code previously

Comment thread datafusion/expr/src/type_coercion/mod.rs Outdated
// Logic is matched with pipe operator in the following table.
// Support only string + string concatenation,
// or binary + binary concatenation.
// Mixed string + binary concatenation is rejected,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might constitute a breaking change 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try take a proper look at this, but might take some time as there seems to be quite a lot happening in this PR.
cc @neilconway as you worked on the concat builder code previously

Thank you! Also, I kindly ask @kosiew who reviewed the previous PR #20787

I think this might constitute a breaking change 🤔

The preliminary mixed binary support was introduced recently in #20787 (I guess it is not released yet - 53.1.0 or so), and now we're removing it to make it more consistent. I tried to describe the logic and rationale in the PR description

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have a release being planned soon:

So those changes (#20787) would be included in that, meaning we'd need this PR to also go into the release to avoid a breaking change across versions 🤔

I'll try prioritize this review

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it would still be a breaking change; before #20787 I think it would still coerce binary to string. So this restriction would be new, even without #20787

Copy link
Copy Markdown
Contributor Author

@theirix theirix May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a breaking change note. For the release, this note should replace the note for 20787

@github-actions github-actions Bot removed the logical-expr Logical plan and expressions label May 23, 2026
Copy link
Copy Markdown
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some initial comments

Comment thread datafusion/functions/src/string/concat.rs Outdated
Comment thread datafusion/functions/src/string/concat.rs Outdated
Comment thread datafusion/functions/src/string/concat.rs Outdated
Comment thread datafusion/functions/src/string/concat_ws.rs Outdated
Comment thread datafusion/functions/src/strings.rs Outdated
@Jefffrey
Copy link
Copy Markdown
Contributor

run benchmarks concat concat_ws

@adriangbot
Copy link
Copy Markdown

🤖 Criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4533379349-309-w49bv 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing concat-binary-udf (7438088) to 7ad8e2c (merge-base) diff
BENCH_NAME=concat
BENCH_COMMAND=cargo bench --features=parquet --bench concat
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot
Copy link
Copy Markdown

🤖 Criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4533379349-310-zcmr7 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing concat-binary-udf (7438088) to 7ad8e2c (merge-base) diff
BENCH_NAME=concat_ws
BENCH_COMMAND=cargo bench --features=parquet --bench concat_ws
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot
Copy link
Copy Markdown

🤖 Criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                                  concat-binary-udf                      main
-----                                  -----------------                      ----
concat_ws function/concat_ws/1024      1.00     18.6±0.04µs        ? ?/sec    1.13     20.9±0.42µs        ? ?/sec
concat_ws function/concat_ws/4096      1.00     72.9±0.12µs        ? ?/sec    1.13     82.7±1.18µs        ? ?/sec
concat_ws function/concat_ws/8192      1.00    145.2±0.29µs        ? ?/sec    1.13    163.7±3.35µs        ? ?/sec
concat_ws function/concat_ws/scalar    1.11  1077.9±20.15ns        ? ?/sec    1.00    968.3±3.34ns        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 45.0s
Peak memory 4.3 GiB
Avg memory 4.3 GiB
CPU user 52.2s
CPU sys 0.8s
Peak spill 0 B

branch

Metric Value
Wall time 45.0s
Peak memory 4.3 GiB
Avg memory 4.3 GiB
CPU user 50.9s
CPU sys 0.2s
Peak spill 0 B

File an issue against this benchmark runner

@adriangbot
Copy link
Copy Markdown

🤖 Criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                               concat-binary-udf                      main
-----                               -----------------                      ----
concat function/concat/1024         1.00     14.3±0.03µs        ? ?/sec    1.12     16.0±0.46µs        ? ?/sec
concat function/concat/4096         1.00     56.4±0.22µs        ? ?/sec    1.12     63.2±1.86µs        ? ?/sec
concat function/concat/8192         1.00    112.5±0.34µs        ? ?/sec    1.12    126.6±3.67µs        ? ?/sec
concat function/concat/scalar       1.05   1090.1±2.90ns        ? ?/sec    1.00   1036.8±5.12ns        ? ?/sec
concat function/concat_view/1024    1.00     30.7±0.06µs        ? ?/sec    1.26     38.8±0.12µs        ? ?/sec
concat function/concat_view/4096    1.00    137.8±0.36µs        ? ?/sec    1.39    191.4±0.43µs        ? ?/sec
concat function/concat_view/8192    1.00    300.0±0.56µs        ? ?/sec    1.37    410.7±1.72µs        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 75.0s
Peak memory 4.3 GiB
Avg memory 4.3 GiB
CPU user 86.6s
CPU sys 0.8s
Peak spill 0 B

branch

Metric Value
Wall time 70.0s
Peak memory 4.3 GiB
Avg memory 4.3 GiB
CPU user 86.4s
CPU sys 0.2s
Peak spill 0 B

File an issue against this benchmark runner

# Coercion rules from Binary to Utf8 do no apply compared to generic `concat`,
# so `concat` produces an explicit error
query error Error during planning: concat does not support mixed string and binary inputs
SELECT concat(arrow_cast('hello', 'Utf8'), arrow_cast(' world', 'Binary'));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I'm a little confused now since I was quickly testing with Spark and got this:

>>> spark.version
'4.1.2'
>>> spark.sql("select concat(x'1234', '1234')").show();
+---------------------+
|concat(X'1234', 1234)|
+---------------------+
|               41234|
+---------------------+

>>> spark.sql("select concat(x'1234', '1234')").explain(True);
== Parsed Logical Plan ==
'Project [unresolvedalias('concat(0x1234, 1234))]
+- OneRowRelation

== Analyzed Logical Plan ==
concat(X'1234', 1234): string
Project [concat(cast(0x1234 as string), 1234) AS concat(X'1234', 1234)#19]
+- OneRowRelation

== Optimized Logical Plan ==
Project [41234 AS concat(X'1234', 1234)#19]
+- OneRowRelation

== Physical Plan ==
*(1) Project [41234 AS concat(X'1234', 1234)#19]
+- *(1) Scan OneRowRelation[]

Seems it'll automatically cast binaries to string?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's interesting. It happened that Spark's concat didn't apply coercion rules via coerce_types - fixed, now it should

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still a little confused here since apparently this should not error?

I think it might be better not to do any changes to Spark concat in this PR to try cut down on scope a little

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tried to build and test an old build (a few commits before #20787 - e.g. 0383a88). If I introduce this simple select concat(x'1234', '1234') to Spark SLT, it fails with "Concat function does not support scalar type 1234". With this PR, it fails with "function does not support mixed string and binary inputs". So it's consistent. The only difference should be the null handling, if I'm not mistaken.

I am confused about why the Spark SLT is different from the spark.sql calls above. Can you try again?

Regarding the PR scope, I thought about it, but it's not feasible. Spark UDF reuses the majority of the general concat UDF. And we didn't have SLTs for generic concat before, that's why I moved and adjusted tests for both functions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jefffrey , now I can reproduce the Spark behaviour as in your example on Spark, databricks and duckdb. All of them actually cast binaries to text and produce a text output for mixed cases.

So I agree, it could break Spark, and let's avoid merging this PR. So far, the only breaking api change shipping is #21883 with a disallowed "binary || text", and "binary || binary" returning binary.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick update: I fixed this breaking change for all UDFs, the pipe operator (previous PR) for generic and the Spark dialect - binaries coerce to strings as before. Now we don't have any breaking changes for concats in the upcoming release. Sorry for the confusion!

# Coercion rules from Binary to Utf8 do no apply compared to generic `concat`,
# so `concat` produces an explicit error
query error Error during planning: concat does not support mixed string and binary inputs
SELECT concat(arrow_cast('hello', 'Utf8'), arrow_cast(' world', 'Binary'));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still a little confused here since apparently this should not error?

I think it might be better not to do any changes to Spark concat in this PR to try cut down on scope a little


# test variable length arguments
query TTTBI rowsort
select specific_name, data_type, parameter_mode, is_variadic, rid from information_schema.parameters where specific_name = 'concat';
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we keep this test case but replace it with a different function that is variadic?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it was the only Variadic function before. The count is VariadicAny, so it returns nothing.

// Logic is matched with pipe operator in the following table.
// Support only string + string concatenation,
// or binary + binary concatenation.
// Mixed string + binary concatenation is rejected,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it would still be a breaking change; before #20787 I think it would still coerce binary to string. So this restriction would be new, even without #20787

@github-actions github-actions Bot added the logical-expr Logical plan and expressions label May 29, 2026
@Jefffrey
Copy link
Copy Markdown
Contributor

Jefffrey commented Jun 3, 2026

I'm a bit lost on this PR now; it seems we're also changing the || operator in addition? Did we change what desired behaviour we want?

@theirix
Copy link
Copy Markdown
Contributor Author

theirix commented Jun 3, 2026

I'm a bit lost on this PR now; it seems we're also changing the || operator in addition? Did we change what desired behaviour we want?

Sorry, the logic changed after the initial submission. The PR description reflects the actual logic. Yes, UDFs and the pipe operator should be in sync. The previous PR #21883 disallows mixed string-binary concatenation, which is wrong, as you've shown above. I am enabling coercion for mixed concatenation of UDFs and the pipe operator - it is consistent now.

If it is merged, no breaking changes are included in the release - cc @alamb .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation logical-expr Logical plan and expressions spark sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Binary string (BYTEA, Binary) concatenation

3 participants