Skip to content

Do not copy dictionary values when they are the same in concat #504

@alamb

Description

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
For dictionary arrays, the code for the concat simply copies the data from multiple places into a single contigous block: https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/concat.rs#L55

For dictionary arrays, this results in potentially duplicating the dictionary values. For example:

Array1: "A, C, A, B"
Dictionary: {A, B, C}
Values: {0, 2, 0, 1}

Array2: "C, C, A"
Dictionary: {A, B, C}
Values: {2, 2, 0}

Note the dictionary is the same between Array 1 and Array 2. Concatenating them together results in
Array3: "A, C, A, B, C, C, A"
Dictionary: {A, B, C, A, B, C}
Values: {0, 2, 0, 1, 2, 2, 0}

Note the duplicated values in the dictionary as well as the fact the other values aren't even used 🤣

In IOx it turns out that we concatenate rather large numbers of such arrays leading to very large duplicated dictionaries (read all about the gory details here): https://github.com/influxdata/influxdb_iox/issues/1822

Describe the solution you'd like
When the dictionary has the exact same contents (e.g. same pointer) there is no need to copy the dictionary

Here is a reproducer showing the actual issue:

    fn test_string_dictionary_repeated_dictionary() {
        let array : DictionaryArray<Int8Type> = vec!["a", "a", "b", "c"].into_iter().collect();
        let array_copy: DictionaryArray<Int8Type> = array.data().clone().into();

        // dictionary is "a", "b", "c"
        assert_eq!(array.values(), &(Arc::new(StringArray::from(vec!["a", "b", "c"])) as ArrayRef));
        assert_eq!(array.keys(), &Int8Array::from(vec![0, 0, 1, 2]));

        // concatenate it with itself
        let combined = concat(&[&array_copy as _, &array as _])
            .unwrap();
        let combined = combined
            .as_any()
            .downcast_ref::<DictionaryArray<Int8Type>>()
            .unwrap();

        // Since dictionares are the same, no need a copy -- it could be the same "a", "b", "c"
        assert_eq!(combined.values(), &(Arc::new(StringArray::from(vec!["a", "b", "c"])) as ArrayRef),
                   "Actual: {:#?}", combined);
        assert_eq!(combined.keys(), &Int8Array::from(vec![0, 0, 1, 2, 0, 0, 1, 2]));
    }

The actual output shows that the resulting dictionary is "a", "b", "c", "a", "b", "c" when it was literally the same pointer

---- compute::kernels::concat::tests::test_string_dictionary_repeated_dictionary stdout ----
thread 'compute::kernels::concat::tests::test_string_dictionary_repeated_dictionary' panicked at 'assertion failed: `(left == right)`
  left: `StringArray
[
  "a",
  "b",
  "c",
  "a",
  "b",
  "c",
]`,
 right: `StringArray
[
  "a",
  "b",
  "c",
]`: Actual: DictionaryArray {keys: PrimitiveArray<Int8>
[
  0,
  0,
  1,
  2,
  3,
  3,
  4,
  5,
] values: StringArray
[
  "a",
  "b",
  "c",
  "a",
  "b",
  "c",
]}
', arrow/src/compute/kernels/concat.rs:528:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Describe alternatives you've considered
There are times when the dictionary may not be the exact same pointer, but contain mostly/all the same values which would also be useful to compact, but I will file that under a separate ticket as it will likely involve a tradeoff between CPU and memory

Metadata

Metadata

Assignees

No one assigned

    Labels

    arrowChanges to the arrow crateenhancementAny new improvement worthy of a entry in the changelogperformance

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions