feat: support binary and string types for concat UDFs#22244
Conversation
|
@Jefffrey could you please have a look? The last |
Jefffrey
left a comment
There was a problem hiding this comment.
I'll try take a proper look at this, but might take some time as there seems to be quite a lot happening in this PR.
cc @neilconway as you worked on the concat builder code previously
| // Logic is matched with pipe operator in the following table. | ||
| // Support only string + string concatenation, | ||
| // or binary + binary concatenation. | ||
| // Mixed string + binary concatenation is rejected, |
There was a problem hiding this comment.
I think this might constitute a breaking change 🤔
There was a problem hiding this comment.
I'll try take a proper look at this, but might take some time as there seems to be quite a lot happening in this PR.
cc @neilconway as you worked on the concat builder code previously
Thank you! Also, I kindly ask @kosiew who reviewed the previous PR #20787
I think this might constitute a breaking change 🤔
The preliminary mixed binary support was introduced recently in #20787 (I guess it is not released yet - 53.1.0 or so), and now we're removing it to make it more consistent. I tried to describe the logic and rationale in the PR description
There was a problem hiding this comment.
We do have a release being planned soon:
So those changes (#20787) would be included in that, meaning we'd need this PR to also go into the release to avoid a breaking change across versions 🤔
I'll try prioritize this review
There was a problem hiding this comment.
Added a breaking change note. For the release, this note should replace the note for 20787
|
run benchmarks concat concat_ws |
|
🤖 Criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing concat-binary-udf (7438088) to 7ad8e2c (merge-base) diff File an issue against this benchmark runner |
|
🤖 Criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing concat-binary-udf (7438088) to 7ad8e2c (merge-base) diff File an issue against this benchmark runner |
|
🤖 Criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
| # Coercion rules from Binary to Utf8 do no apply compared to generic `concat`, | ||
| # so `concat` produces an explicit error | ||
| query error Error during planning: concat does not support mixed string and binary inputs | ||
| SELECT concat(arrow_cast('hello', 'Utf8'), arrow_cast(' world', 'Binary')); |
There was a problem hiding this comment.
Actually I'm a little confused now since I was quickly testing with Spark and got this:
>>> spark.version
'4.1.2'
>>> spark.sql("select concat(x'1234', '1234')").show();
+---------------------+
|concat(X'1234', 1234)|
+---------------------+
| 41234|
+---------------------+
>>> spark.sql("select concat(x'1234', '1234')").explain(True);
== Parsed Logical Plan ==
'Project [unresolvedalias('concat(0x1234, 1234))]
+- OneRowRelation
== Analyzed Logical Plan ==
concat(X'1234', 1234): string
Project [concat(cast(0x1234 as string), 1234) AS concat(X'1234', 1234)#19]
+- OneRowRelation
== Optimized Logical Plan ==
Project [41234 AS concat(X'1234', 1234)#19]
+- OneRowRelation
== Physical Plan ==
*(1) Project [41234 AS concat(X'1234', 1234)#19]
+- *(1) Scan OneRowRelation[]Seems it'll automatically cast binaries to string?
There was a problem hiding this comment.
That's interesting. It happened that Spark's concat didn't apply coercion rules via coerce_types - fixed, now it should
There was a problem hiding this comment.
I'm still a little confused here since apparently this should not error?
I think it might be better not to do any changes to Spark concat in this PR to try cut down on scope a little
There was a problem hiding this comment.
I just tried to build and test an old build (a few commits before #20787 - e.g. 0383a88). If I introduce this simple select concat(x'1234', '1234') to Spark SLT, it fails with "Concat function does not support scalar type 1234". With this PR, it fails with "function does not support mixed string and binary inputs". So it's consistent. The only difference should be the null handling, if I'm not mistaken.
I am confused about why the Spark SLT is different from the spark.sql calls above. Can you try again?
Regarding the PR scope, I thought about it, but it's not feasible. Spark UDF reuses the majority of the general concat UDF. And we didn't have SLTs for generic concat before, that's why I moved and adjusted tests for both functions.
There was a problem hiding this comment.
@Jefffrey , now I can reproduce the Spark behaviour as in your example on Spark, databricks and duckdb. All of them actually cast binaries to text and produce a text output for mixed cases.
So I agree, it could break Spark, and let's avoid merging this PR. So far, the only breaking api change shipping is #21883 with a disallowed "binary || text", and "binary || binary" returning binary.
There was a problem hiding this comment.
A quick update: I fixed this breaking change for all UDFs, the pipe operator (previous PR) for generic and the Spark dialect - binaries coerce to strings as before. Now we don't have any breaking changes for concats in the upcoming release. Sorry for the confusion!
| # Coercion rules from Binary to Utf8 do no apply compared to generic `concat`, | ||
| # so `concat` produces an explicit error | ||
| query error Error during planning: concat does not support mixed string and binary inputs | ||
| SELECT concat(arrow_cast('hello', 'Utf8'), arrow_cast(' world', 'Binary')); |
There was a problem hiding this comment.
I'm still a little confused here since apparently this should not error?
I think it might be better not to do any changes to Spark concat in this PR to try cut down on scope a little
|
|
||
| # test variable length arguments | ||
| query TTTBI rowsort | ||
| select specific_name, data_type, parameter_mode, is_variadic, rid from information_schema.parameters where specific_name = 'concat'; |
There was a problem hiding this comment.
nit: can we keep this test case but replace it with a different function that is variadic?
There was a problem hiding this comment.
Seems like it was the only Variadic function before. The count is VariadicAny, so it returns nothing.
| // Logic is matched with pipe operator in the following table. | ||
| // Support only string + string concatenation, | ||
| // or binary + binary concatenation. | ||
| // Mixed string + binary concatenation is rejected, |
|
I'm a bit lost on this PR now; it seems we're also changing the |
Sorry, the logic changed after the initial submission. The PR description reflects the actual logic. Yes, UDFs and the pipe operator should be in sync. The previous PR #21883 disallows mixed string-binary concatenation, which is wrong, as you've shown above. I am enabling coercion for mixed concatenation of UDFs and the pipe operator - it is consistent now. If it is merged, no breaking changes are included in the release - cc @alamb . |
Which issue does this PR close?
BYTEA,Binary) concatenation #12709.Rationale for this change
While #21883 introduced binary argument support for the pipe operator, this PR targets three UDFs:
concat,concat_ws, and Spark'sconcatto harmonise all their behaviour.After the first attempt at banning mixed string+concat operations, I switched to allowing them for concat UDFs and the pipe operator - as seen in DuckDB, Spark, Databricks. Previously, mixed behaviour was allowed in #20787 (not released yet). Thus, no breaking API changes since the klast release
What changes are included in this PR?
concat_wsbehavior for different separator typesThe diff is quite large. Detailed code changes:
ConcatBuilderto abstract string/binary/view/array operationsStringArray/LargeStringArrayinto a genericConcatGenericStringBuilder- less duplicationbinaries.rsColumnarValueRefvariants to handle nullable and non-nullable binary typesfrom_columnar_value- simplified call sites. Scalar code path stays mostly the sameSignature::variadictoSignature::UserDefinedto allow different argument types.Variadicrequired every argument (including binaries) to be coerced to the same string type, so the UDF cannot distinguish between binary and string inputs. It's a relatively uncommon - happy to discussconcatimplementation, so it handles only Spark-specific null-handlingspark/concat.slt) to a generic SLT, so we test both generic and Spark-specific behaviourAre these changes tested?
Are there any user-facing changes?
No. Also, reverted a breaking api change for
||operator