feat(parquet): generalize value encoder inputs#9955
Conversation
Changes `byte_array` encoder methods (`FallbackEncoder::encode`, `DictEncoder::encode`, etc) and all `get_*_array_slice` functions from `&[usize]` to `impl ExactSizeIterator<Item = usize>`. Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing generalize_value_enc_inputs (bba585a) to 7abb225 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
@alamb @etseidl This is ready for review. I am not sure why parquet_derive / Test (pull_request) timed out. It passes locally. This is the second-to-last PR in my series #9653. We're almost there! 💪 |
|
run benchmark arrow_writer |
1 similar comment
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing generalize_value_enc_inputs (bba585a) to 7abb225 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing generalize_value_enc_inputs (bba585a) to 7abb225 (merge-base) diff File an issue against this benchmark runner |
alamb
left a comment
There was a problem hiding this comment.
Thanks @HippoBaro -- makes sense to me
|
|
||
| /// Encode `values` to the in-progress page | ||
| fn encode<T>(&mut self, values: T, indices: &[usize]) | ||
| fn encode<T>(&mut self, values: T, indices: impl ExactSizeIterator<Item = usize>) |
There was a problem hiding this comment.
I was missing the bigger picture here -- specifically I couldn't figure out why you wanted to make this generic (and when would it be better than &[usize])? It looked to me like all the callsites are passing in &[usize] anyways -- so it is not clear to me what benefit the extra templating achieves
However, after asking Claude, it seems the answer is that making this PR generic avoids collecting a Vec in the next PR in the series -- specifically avoiding the non_null_indices: Vec<usize> in
There was a problem hiding this comment.
Yes apologies for being lazy with my PR descriptions... That is accurate. The next PR in the series will use this so that we avoid materializing intermediary [usize] in the cases where that's possible.
|
Onward! |
FWIW I restarted this test and it seems to have passed on re-run |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
# Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - Spawn off from apache#9653 - Contributes to apache#9731 # Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> See apache#9731 # What changes are included in this PR? Changes `byte_array` encoder methods (`FallbackEncoder::encode`, `DictEncoder::encode`, etc) and all `get_*_array_slice` functions from `&[usize]` to `impl ExactSizeIterator<Item = usize>`. # Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? If this PR claims a performance improvement, please include evidence such as benchmark results. --> All tests passing. # Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. --> None. Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Which issue does this PR close?
Rationale for this change
See #9731
What changes are included in this PR?
Changes
byte_arrayencoder methods (FallbackEncoder::encode,DictEncoder::encode, etc) and allget_*_array_slicefunctions from&[usize]toimpl ExactSizeIterator<Item = usize>.Are these changes tested?
All tests passing.
Are there any user-facing changes?
None.