Is your feature request related to a problem or challenge? Please describe what you are trying to do.
For dictionary arrays, the code for the concat simply copies the data from multiple places into a single contigous block: https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/concat.rs#L55
For dictionary arrays, this results in potentially duplicating the dictionary values. For example:
Array1: "A, C, A, B"
Dictionary: {A, B, C}
Values: {0, 2, 0, 1}
Array2: "C, C, A"
Dictionary: {A, B, C}
Values: {2, 2, 0}
Note the dictionary is the same between Array 1 and Array 2. Concatenating them together results in
Array3: "A, C, A, B, C, C, A"
Dictionary: {A, B, C, A, B, C}
Values: {0, 2, 0, 1, 2, 2, 0}
Note the duplicated values in the dictionary as well as the fact the other values aren't even used 🤣
In IOx it turns out that we concatenate rather large numbers of such arrays leading to very large duplicated dictionaries (read all about the gory details here): https://github.com/influxdata/influxdb_iox/issues/1822
Describe the solution you'd like
When the dictionary has the exact same contents (e.g. same pointer) there is no need to copy the dictionary
Here is a reproducer showing the actual issue:
fn test_string_dictionary_repeated_dictionary() {
let array : DictionaryArray<Int8Type> = vec!["a", "a", "b", "c"].into_iter().collect();
let array_copy: DictionaryArray<Int8Type> = array.data().clone().into();
// dictionary is "a", "b", "c"
assert_eq!(array.values(), &(Arc::new(StringArray::from(vec!["a", "b", "c"])) as ArrayRef));
assert_eq!(array.keys(), &Int8Array::from(vec![0, 0, 1, 2]));
// concatenate it with itself
let combined = concat(&[&array_copy as _, &array as _])
.unwrap();
let combined = combined
.as_any()
.downcast_ref::<DictionaryArray<Int8Type>>()
.unwrap();
// Since dictionares are the same, no need a copy -- it could be the same "a", "b", "c"
assert_eq!(combined.values(), &(Arc::new(StringArray::from(vec!["a", "b", "c"])) as ArrayRef),
"Actual: {:#?}", combined);
assert_eq!(combined.keys(), &Int8Array::from(vec![0, 0, 1, 2, 0, 0, 1, 2]));
}
The actual output shows that the resulting dictionary is "a", "b", "c", "a", "b", "c" when it was literally the same pointer
---- compute::kernels::concat::tests::test_string_dictionary_repeated_dictionary stdout ----
thread 'compute::kernels::concat::tests::test_string_dictionary_repeated_dictionary' panicked at 'assertion failed: `(left == right)`
left: `StringArray
[
"a",
"b",
"c",
"a",
"b",
"c",
]`,
right: `StringArray
[
"a",
"b",
"c",
]`: Actual: DictionaryArray {keys: PrimitiveArray<Int8>
[
0,
0,
1,
2,
3,
3,
4,
5,
] values: StringArray
[
"a",
"b",
"c",
"a",
"b",
"c",
]}
', arrow/src/compute/kernels/concat.rs:528:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Describe alternatives you've considered
There are times when the dictionary may not be the exact same pointer, but contain mostly/all the same values which would also be useful to compact, but I will file that under a separate ticket as it will likely involve a tradeoff between CPU and memory
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
For dictionary arrays, the code for the
concatsimply copies the data from multiple places into a single contigous block: https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/concat.rs#L55For dictionary arrays, this results in potentially duplicating the dictionary values. For example:
Array1: "A, C, A, B"
Dictionary:
{A, B, C}Values:
{0, 2, 0, 1}Array2: "C, C, A"
Dictionary:
{A, B, C}Values:
{2, 2, 0}Note the dictionary is the same between Array 1 and Array 2. Concatenating them together results in
Array3: "A, C, A, B, C, C, A"
Dictionary:
{A, B, C, A, B, C}Values:
{0, 2, 0, 1, 2, 2, 0}Note the duplicated values in the dictionary as well as the fact the other values aren't even used 🤣
In IOx it turns out that we concatenate rather large numbers of such arrays leading to very large duplicated dictionaries (read all about the gory details here): https://github.com/influxdata/influxdb_iox/issues/1822
Describe the solution you'd like
When the dictionary has the exact same contents (e.g. same pointer) there is no need to copy the dictionary
Here is a reproducer showing the actual issue:
The actual output shows that the resulting dictionary is
"a", "b", "c", "a", "b", "c"when it was literally the same pointerDescribe alternatives you've considered
There are times when the dictionary may not be the exact same pointer, but contain mostly/all the same values which would also be useful to compact, but I will file that under a separate ticket as it will likely involve a tradeoff between CPU and memory