feat(parquet): fuse level encoding with counting and histogram updates#9795
Conversation
etseidl
left a comment
There was a problem hiding this comment.
Looks nice, thanks @HippoBaro.
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing fuse_lvl_encoding_hist_counting (849685c) to b93240a (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
alamb
left a comment
There was a problem hiding this comment.
Thanks @HippoBaro -- I think this looks really close. Thank you @etseidl for the review
| self.rep_levels_encoder | ||
| .put_with_observer(levels, |level, count| { | ||
| new_rows += (count as u32) * (level == 0) as u32; | ||
| if let Some(ref mut h) = self.page_metrics.repetition_level_histogram { |
There was a problem hiding this comment.
You might be able to move this check out of the loop (so call put if self.page_metrics.repetition_level_histogram is none and and call with_with_observer if it s some
Maybe could improve the inner loop even more
There was a problem hiding this comment.
I did a quick test of that idea earlier and it didn't seem worth the added complexity. But it probably deserves a second look on better hardware 😄.
There was a problem hiding this comment.
Just did the experiment locally with this:
match self.page_metrics.definition_level_histogram.as_mut() {
Some(histogram) => encoder.put_with_observer(levels, |level, count| {
values_to_write += count * (level == max_def) as usize;
histogram.increment_by(level, count as i64);
}),
None => encoder.put_with_observer(levels, |level, count| {
values_to_write += count * (level == max_def) as usize;
}),
};Benchmarks show a 2-3% improvements on average for list_primitive.
There was a problem hiding this comment.
I've pushed the above change!
849685c to
8c4bd85
Compare
Add `put_with_observer()` to `LevelEncoder` so callers can piggyback row/null counting and histogram updates onto level encoding without extra passes over the level buffers. Update `write_mini_batch()` to encode definition and repetition levels while collecting the associated metrics in the same pass, and hoist the histogram-enabled branch out of the inner loop. Add `LevelHistogram::increment_by()` for counted updates, keep `update_from_levels()` as a deprecated compatibility wrapper, and remove the now-unnecessary PageMetrics histogram update helpers. Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
8c4bd85 to
28f25c0
Compare
|
run benchmark arrow_writer |
|
Hi @HippoBaro, thanks for the request (#9795 (comment)). Only whitelisted users can trigger benchmarks. Allowed users: Dandandan, Fokko, Jefffrey, Omega359, adriangb, alamb, asubiotto, brunal, buraksenn, cetra3, codephage2020, comphead, erenavsarogullari, etseidl, friendlymatthew, gabotechs, geoffreyclaude, grtlr, haohuaijin, jonathanc-n, kevinjqliu, klion26, kosiew, kumarUjjawal, kunalsinghdadhwal, liamzwbao, mbutrovich, mkleen, mzabaluev, neilconway, rluvaton, sdf-jkl, timsaucer, xudong963, zhuqi-lucas. File an issue against this benchmark runner |
I thought so 😅 Was worth a try! |
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing fuse_lvl_encoding_hist_counting (28f25c0) to b93240a (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
alamb
left a comment
There was a problem hiding this comment.
I think it looks very nice persoanlly. The only last reamining question in my mind is if we should be super safe and restore put with a deprecated note.
I think if we want to include it in the next arrow (58.2.0, I am hoping to cut the release in the next day or two) we should include the deprecation
If we are ok with waiting until arrow 59 (in a month or so) we can probably not worry about put
|
Thanks @etseidl and @HippoBaro |
|
@etseidl do you have any concerns about this PR or would you like to review it again before we merge? |
Just a fix to the deprecation message then I think we're good to go 🚀 |
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
|
🚀 |
apache#9795) # Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - Spawn off from apache#9653 - Contributes to apache#9731 # Rationale for this change See apache#9731 # What changes are included in this PR? Add `put_with_observer()` to `LevelEncoder` that calls an `FnMut(i16, usize)` observer for each value during encoding. This allows callers to piggyback counting and histogram updates into the encoding pass without extra iterations over the level buffer. Previously, `write_mini_batch()` made 3 separate passes over each level array: one to count non-null values or row boundaries, one to update the level histogram, and one to RLE-encode. Now all three operations happen in a single pass via the observer closure. Replace `LevelHistogram::update_from_levels()` with a new `LevelHistogram::increment_by()` that accepts a count, and remove the now-unnecessary `update_definition_level_histogram()` and `update_repetition_level_histogram()` methods from PageMetrics. # Are these changes tested? All tests passing; existing tests give 100% coverage. # Are there any user-facing changes? None --------- Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com> Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Which issue does this PR close?
Rationale for this change
See #9731
What changes are included in this PR?
Add
put_with_observer()toLevelEncoderthat calls anFnMut(i16, usize)observer for each value during encoding. This allows callers to piggyback counting and histogram updates into the encoding pass without extra iterations over the level buffer.Previously,
write_mini_batch()made 3 separate passes over each level array: one to count non-null values or row boundaries, one to update the level histogram, and one to RLE-encode. Now all three operations happen in a single pass via the observer closure.Replace
LevelHistogram::update_from_levels()with a newLevelHistogram::increment_by()that accepts a count, and remove the now-unnecessaryupdate_definition_level_histogram()andupdate_repetition_level_histogram()methods from PageMetrics.Are these changes tested?
All tests passing; existing tests give 100% coverage.
Are there any user-facing changes?
None