Skip to content

Implement AnyRee#9959

Merged
Jefffrey merged 7 commits into
apache:mainfrom
Rich-T-kid:rich-T-kid/Ree-AnyArray
May 27, 2026
Merged

Implement AnyRee#9959
Jefffrey merged 7 commits into
apache:mainfrom
Rich-T-kid:rich-T-kid/Ree-AnyArray

Conversation

@Rich-T-kid
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

closes #9909.

Rationale for this change

makes the API simpler to work with & less code duplication

What changes are included in this PR?

Replace the per-key-type RunEndEncoded match arms in length/bit_length (arrow-string) and date_part (arrow-arith) with a single dispatch through the new AsArray::as_any_ree_opt/as_any_ree returning &dyn AnyRunEndArray, mirroring the existing dictionary handling. This removes the
now-unused ree_map! macro, leaving one trait-object code path for all Int16/Int32/Int64 run-end types.

Are these changes tested?

yes

Are there any user-facing changes?

no

@github-actions github-actions Bot added the arrow Changes to the arrow crate label May 11, 2026
Comment thread arrow-array/src/array/run_array.rs Outdated
Comment thread arrow-array/src/array/run_array.rs
Comment thread arrow-array/src/array/run_array.rs
@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/Ree-AnyArray branch from f398606 to d626ef8 Compare May 11, 2026 15:04
@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/Ree-AnyArray branch from d626ef8 to b7e7b73 Compare May 11, 2026 15:04
Comment thread arrow-arith/src/temporal.rs Outdated
Comment thread arrow-string/src/length.rs Outdated
Comment thread arrow-string/src/length.rs Outdated
Comment thread arrow-array/src/array/run_array.rs Outdated
Comment thread arrow-array/src/array/run_array.rs
@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/Ree-AnyArray branch from c352224 to f601fc0 Compare May 12, 2026 13:36
Comment thread arrow-array/src/array/run_array.rs Outdated
Comment thread arrow-array/src/array/run_array.rs Outdated
Comment thread arrow-array/src/array/run_array.rs Outdated
@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/Ree-AnyArray branch from bbe594c to 931003b Compare May 15, 2026 04:52
Copy link
Copy Markdown
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this is dragging out so long, came back with a fresh set of eyes. Just wanna ensure we're getting this correct from the start 😅

Comment thread arrow-array/src/array/run_array.rs Outdated
Comment on lines +797 to +806
/// Returns the run ends of this array as a primitive array.
fn run_ends(&self) -> ArrayRef;

/// Returns the values of this array.
fn values(&self) -> &Arc<dyn Array>;

/// Returns a new run-end encoded array with the given values, preserving the
/// existing run ends.
fn with_values(&self, values: ArrayRef) -> ArrayRef;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think of it, how do these APIs handle when the original RunArray is sliced? I think they return unsliced original data, and especially for run_ends it erases the logical slicing that may have been applied?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im a little confused by what you mean. Do you have an example of this case?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if theres a case like this:

    #[test]
    fn test123() {
        let run = Int32Array::from(vec![3, 6, 9]);
        let values = Int32Array::from(vec![1, 2, 3]);
        // [1, 1, 1, 2, 2, 2, 3, 3, 3]
        let array = RunArray::try_new(&run, &values).unwrap();

        // [2, 2, 2]
        // But still references same underlying arrays as above
        // Essentially:
        // {1, 1, 1, [2, 2, 2], 3, 3, 3}
        let array_sliced = array.slice(3, 3);

        let new_values = Int32Array::from(vec![7, 8, 9]);
        // [8, 8, 8]
        let new_array_sliced = array_sliced.with_values(Arc::new(new_values));
    }

When we slice the array we still hold on to the original values array; so array_sliced.values() would return [1, 2, 3] even though array_sliced only has value of 2 repeating. This isn't too big an issue, as I think at most its just a performance thing. For example:

let values = date_part(array.values(), part)?;
let new_array = array.with_values(values);

If array is sliced, then we could be doing extra work since we're calculating date_part for the values which aren't in the slice.

But for run_ends() method its harder to consider since we don't have a use case for it currently; is it a potential footgun that calling run_ends() on a sliced run array ignores any slicing 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at the example you provided, shouldn't let array_sliced = array.slice(3, 3); ideally be represented as run_ends:[3],values:[2] instead of
{run_ends: [3, 6, 9], values: PrimitiveArray<Int32> [ 1, 2, 3, ]}
which is currently what array.slice() produces. due to REE.slice() copying the values arrays.

I think the best approach for now would be to remove the run_ends() function and add it in later if there is a need.
Im not sure how we would avoid the extra work for with_values() without updating the .slice() method. But this would require looking at all the CallSites for slice first to make sure this doesn't cause breaking changes. In theory this shouldn't causes regressions though it should be an implementation detail.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at the example you provided, shouldn't let array_sliced = array.slice(3, 3); ideally be represented as run_ends:[3],values:[2] instead of
{run_ends: [3, 6, 9], values: PrimitiveArray<Int32> [ 1, 2, 3, ]}
which is currently what array.slice() produces. due to REE.slice() copying the values arrays.

Technically yes, but its just a result of how slicing is currently implemented on run arrays (can be a little confusing)

I think the best approach for now would be to remove the run_ends() function and add it in later if there is a need.
Im not sure how we would avoid the extra work for with_values() without updating the .slice() method. But this would require looking at all the CallSites for slice first to make sure this doesn't cause breaking changes. In theory this shouldn't causes regressions though it should be an implementation detail.

This sounds good, thanks

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tracking this in #10017, ill work on it when I get a chance.

Comment thread arrow-array/src/array/run_array.rs Outdated
@Rich-T-kid Rich-T-kid requested review from Jefffrey and alamb May 24, 2026 20:03
@Jefffrey Jefffrey merged commit 2eeb805 into apache:main May 27, 2026
26 checks passed
@Jefffrey
Copy link
Copy Markdown
Contributor

Thanks @Rich-T-kid & @alamb

@rluvaton
Copy link
Copy Markdown
Member

Thanks for that! I wanted to do that but did not find the time

Rich-T-kid added a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
# Which issue does this PR close?
closes apache#9909.
<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Closes apache#9909.

# Rationale for this change
makes the API simpler to work with & less code duplication
<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

# What changes are included in this PR?
Replace the per-key-type RunEndEncoded match arms in length/bit_length
(arrow-string) and date_part (arrow-arith) with a single dispatch
through the new `AsArray::as_any_ree_opt/as_any_ree` returning &dyn
AnyRunEndArray, mirroring the existing dictionary handling. This removes
the
now-unused `ree_map!` macro, leaving one trait-object code path for all
Int16/Int32/Int64 run-end types.
<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

# Are these changes tested?
yes

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

If this PR claims a performance improvement, please include evidence
such as benchmark results.
-->

# Are there any user-facing changes?
no
<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->
@alamb alamb mentioned this pull request Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AnyRunArray trait

4 participants