feat(parquet): add all-null fast paths for level building#9954
Conversation
alamb
left a comment
There was a problem hiding this comment.
Code looks good to me -- thank you @HippoBaro
I think we need a few more tests and I will launch some benchmarks
|
run benchmarks arrow_writer |
1 similar comment
|
run benchmarks arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing all_null_fast_path (e1d948f) to 7abb225 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing all_null_fast_path (e1d948f) to 7abb225 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
etseidl
left a comment
There was a problem hiding this comment.
Thanks @HippoBaro, this looks nice. The few regressions look to be noise.
When an entire list, struct, fixed-size list, or leaf array is null, skip per-row iteration and emit bulk uniform def/rep levels via `extend_uniform_levels` in O(1). Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
e1d948f to
3039241
Compare
…umns When writing a nullable leaf (primitive) Arrow array, `write_leaf` built the definition-level buffer one element at a time, mapping each null bit to a level. For columns that are mostly null this does ~num_rows of branchy work and allocates a num_rows level buffer even though almost every level is the same value. Add a length-gated bulk-fill path: when the column is majority-null and the sub-range is large enough to amortize the gate's per-call cost, build the definition levels by bulk-filling the null level (a vectorized memset) and overwriting only the non-null positions found via `NullBuffer::valid_indices()`. The per-row path is kept for non-majority-null arrays and for the small sub-ranges produced by list/struct write paths, so those shapes are not regressed. Contributes to apache#9731. Complements apache#9954's all-null fast path by covering the sparse (mostly-but-not-entirely-null) case it does not handle. Threshold sweep on Ryzen 9 9950X (parquet/arrow_writer benches, /default variant, vs main): T primitive list_primitive primitive_sparse list_primitive_sparse ---------------------------------------------------------------------- 0 -3.0% +2.6% -36.1% +7.8% 16 -1.4% +1.8% -34.8% +2.8% 32 -1.1% -0.1% -35.1% +1.7% 64 -1.1% +0.7% -34.5% +1.7% <- chosen 128 -1.0% +1.5% -35.1% +2.4% 256 -1.4% +1.4% -35.1% +2.7% T=0 reproduces the per-call slice/popcount regression on list_primitive_sparse_99pct_null (+7.8%, matches the criterion bot's original measurement on the unguarded version). The +1.7% floor at T>=32 is the structural cost of evaluating the gate itself across ~10K small write_leaf calls in the list path; reducing it further would require hoisting the decision into the caller. T=64 matches T=32 on every shape and gives ~12x margin over the avg list length of ~5. Final benchmarks vs main on Ryzen 9 9950X (T=64, /default variants): primitive/default -1.5% primitive_non_null/default -2.8% primitive_sparse_99pct_null/default -35.1% primitive_all_null/default -66.4% list_primitive/default +1.8% (within noise) list_primitive_non_null/default -0.7% (no change, p=0.45) list_primitive_sparse_99pct_null/default +3.0% (gate-check floor) struct_sparse_99pct_null/default -4.9% bool/default +2.2%
alamb
left a comment
There was a problem hiding this comment.
Looks great to me -- thank you @HippoBaro and @etseidl
## Which issue does this PR close? - Contributes to #9731. ## AI assistance Implementation drafted with AI assistance and iterated against the benchmarks below. I've reviewed and own the code, including the gate threshold which I picked from the sweep in [Threshold (`BULK_FILL_MIN_LEN`)](#threshold-bulk_fill_min_len). Per the project's [CONTRIBUTING guidance on AI-generated submissions](https://github.com/apache/arrow-rs/blob/main/CONTRIBUTING.md#ai-generated-submissions). ## Rationale for this change When writing a nullable leaf (primitive) Arrow array, `write_leaf` builds the definition-level buffer one element at a time, mapping each null bit to a level. For columns that are mostly null this does ~`num_rows` of branchy work and allocates a `num_rows`-element level buffer even though almost every produced level is the same value. #9954 adds an O(1) fast path for the *entirely* null case; this PR covers the *sparse* (mostly-but-not-entirely null) case it doesn't handle, the literal subject of #9731 ("a column that is 99% null … ~100x more work than necessary"). ## What changes are included in this PR? A single popcount pass over the null mask (`Buffer::count_set_bits_offset`, O(`num_rows`/64)) counts the valid values in the range. When the slice is majority-null, the definition-level buffer is bulk-filled with the null level (a vectorized `Vec::resize` memset) and only the non-null positions (from `NullBuffer::valid_indices()`) are overwritten. The existing per-row path is kept for non-majority-null slices, so balanced and null-light columns are unaffected. Both branches share the same `let range_nulls = nulls.slice(range.start, len)` slicing idiom; the slow path uses `range_nulls.iter()` for the def-level map and `range_nulls.valid_indices().map(|i| i + range.start)` for `non_null_indices`, with no `unsafe`. Output is byte-identical: the level *values* are unchanged, just produced via memset+scatter (fast path) or via the high-level `NullBuffer` iterators (slow path) instead of a manual `BitIndexIterator` walk. ## Threshold (`BULK_FILL_MIN_LEN`) The bulk-fill fast path is gated on two conditions: - `len >= BULK_FILL_MIN_LEN` (currently 64). Per-call slice/popcount/iterator overhead only amortizes on sizable sub-ranges. List/struct paths call `write_leaf` many times with tiny ranges (avg list length 1-5); paying any per-call popcount there would regress them. A threshold sweep at T = {0, 16, 32, 64, 128, 256} on Ryzen 9 9950X shows the regression floor settles by T=32, and the choice of 64 gives ~12x margin over the average list length without losing the flat-primitive wins. - `nulls.null_count() * 2 >= nulls.len()`. The cached `null_count()` is O(1), so this check is free. We use the buffer-wide density as a heuristic for the sub-range; for full-array writes (the primary target, flat primitive columns) it's exact. Even when the gate skips the fast path, evaluating it across high-frequency call sites (~10K calls in some list benchmarks) is a small structural cost (~1-2% on list-sparse cases). The wins on the targeted shapes (-35% sparse-primitive, -66% all-null primitive) far outweigh that. Reducing the cost further would require hoisting the decision into the caller. ## Are these changes tested? Existing tests cover this path: `cargo test -p parquet --features arrow --lib arrow_writer` is green (136 tests, full of nulls and roundtrips); full `cargo test -p parquet --features arrow` green modulo the pre-existing `PARQUET_TEST_DATA` submodule failures (unrelated, same on `main`). `cargo clippy -p parquet --features arrow --lib` and `cargo fmt --check` clean. The `unsafe get_unchecked_mut` flagged in the original revision was replaced via `NullBuffer::valid_indices()`; the slow-path also dropped its `unsafe value_unchecked` for the same reason. ## Are there any user-facing changes? None. ## Benchmarks `cargo bench -p parquet --bench arrow_writer`, 1M rows × 7 nullable primitive columns, local Ryzen 9 9950X: ``` primitive_sparse_99pct_null/default 11.88 ms -> 9.13 ms (-23%) <- the case #9731 calls out primitive_all_null/default 5.65 ms -> 2.33 ms (-59%) (subsumed by #9954's O(1) path if that lands first) struct_sparse_99pct_null/default 5.67 ms -> 5.32 ms (-6%) struct_all_null/default 1.52 ms -> 1.31 ms (-14%) list_primitive_sparse_99pct_null, primitive (25% null), primitive_non_null, bool, string: within noise (no regression) ``` The CI benchmark bot (GKE `c4a-highmem-16`, Neoverse-V2) on the post-fixup revision shows the same shape with stronger relative wins on the targeted cases: ``` primitive_all_null/default 2.47x (11.0ms -> 4.4ms) primitive_sparse_99pct_null/default 1.60x (16.8ms -> 10.5ms) primitive_all_null/{bloom_filter,cdc,parquet_2,zstd,zstd_parquet_2} 1.38x to 2.48x primitive_sparse_99pct_null/{...} 1.28x to 1.59x list_primitive*, list_primitive_sparse_99pct_null*: 1.00x to 1.01x (within noise) ``` Microbench of the definition-level fill in isolation: 10.3x @ 100%-null, 8.6x @ 99%, 5.2x @ 90%, 1.9x @ 50%, 0.93x @ 10%, 0.81x @ 0%. Crossover ≈ 12-15% null, clean win above ~25%; the `>= 50% null` guard is conservative. This is the *materialization*-cost half of #9731 (~30% of the 99%-null write); the *walk*-cost half, a run-length input to the level encoder so the column writer doesn't even iterate all `num_rows` levels, is the larger structural change #9653 is heading toward. This PR is deliberately small and isolated so it lands independently of and rebases cleanly under that work. --------- Co-authored-by: Ryan Stewart <noreply@example.com>
# Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - Spawn off from apache#9653 - Contributes to apache#9731 # Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> See apache#9731 # What changes are included in this PR? When an entire list, struct, fixed-size list, or leaf array is null, skip per-row iteration and emit bulk uniform def/rep levels via `extend_uniform_levels` in O(1). # Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? If this PR claims a performance improvement, please include evidence such as benchmark results. --> All tests passing + additional all null unit tests. # Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. --> None. Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
## Which issue does this PR close? - Contributes to apache#9731. ## AI assistance Implementation drafted with AI assistance and iterated against the benchmarks below. I've reviewed and own the code, including the gate threshold which I picked from the sweep in [Threshold (`BULK_FILL_MIN_LEN`)](#threshold-bulk_fill_min_len). Per the project's [CONTRIBUTING guidance on AI-generated submissions](https://github.com/apache/arrow-rs/blob/main/CONTRIBUTING.md#ai-generated-submissions). ## Rationale for this change When writing a nullable leaf (primitive) Arrow array, `write_leaf` builds the definition-level buffer one element at a time, mapping each null bit to a level. For columns that are mostly null this does ~`num_rows` of branchy work and allocates a `num_rows`-element level buffer even though almost every produced level is the same value. apache#9954 adds an O(1) fast path for the *entirely* null case; this PR covers the *sparse* (mostly-but-not-entirely null) case it doesn't handle, the literal subject of apache#9731 ("a column that is 99% null … ~100x more work than necessary"). ## What changes are included in this PR? A single popcount pass over the null mask (`Buffer::count_set_bits_offset`, O(`num_rows`/64)) counts the valid values in the range. When the slice is majority-null, the definition-level buffer is bulk-filled with the null level (a vectorized `Vec::resize` memset) and only the non-null positions (from `NullBuffer::valid_indices()`) are overwritten. The existing per-row path is kept for non-majority-null slices, so balanced and null-light columns are unaffected. Both branches share the same `let range_nulls = nulls.slice(range.start, len)` slicing idiom; the slow path uses `range_nulls.iter()` for the def-level map and `range_nulls.valid_indices().map(|i| i + range.start)` for `non_null_indices`, with no `unsafe`. Output is byte-identical: the level *values* are unchanged, just produced via memset+scatter (fast path) or via the high-level `NullBuffer` iterators (slow path) instead of a manual `BitIndexIterator` walk. ## Threshold (`BULK_FILL_MIN_LEN`) The bulk-fill fast path is gated on two conditions: - `len >= BULK_FILL_MIN_LEN` (currently 64). Per-call slice/popcount/iterator overhead only amortizes on sizable sub-ranges. List/struct paths call `write_leaf` many times with tiny ranges (avg list length 1-5); paying any per-call popcount there would regress them. A threshold sweep at T = {0, 16, 32, 64, 128, 256} on Ryzen 9 9950X shows the regression floor settles by T=32, and the choice of 64 gives ~12x margin over the average list length without losing the flat-primitive wins. - `nulls.null_count() * 2 >= nulls.len()`. The cached `null_count()` is O(1), so this check is free. We use the buffer-wide density as a heuristic for the sub-range; for full-array writes (the primary target, flat primitive columns) it's exact. Even when the gate skips the fast path, evaluating it across high-frequency call sites (~10K calls in some list benchmarks) is a small structural cost (~1-2% on list-sparse cases). The wins on the targeted shapes (-35% sparse-primitive, -66% all-null primitive) far outweigh that. Reducing the cost further would require hoisting the decision into the caller. ## Are these changes tested? Existing tests cover this path: `cargo test -p parquet --features arrow --lib arrow_writer` is green (136 tests, full of nulls and roundtrips); full `cargo test -p parquet --features arrow` green modulo the pre-existing `PARQUET_TEST_DATA` submodule failures (unrelated, same on `main`). `cargo clippy -p parquet --features arrow --lib` and `cargo fmt --check` clean. The `unsafe get_unchecked_mut` flagged in the original revision was replaced via `NullBuffer::valid_indices()`; the slow-path also dropped its `unsafe value_unchecked` for the same reason. ## Are there any user-facing changes? None. ## Benchmarks `cargo bench -p parquet --bench arrow_writer`, 1M rows × 7 nullable primitive columns, local Ryzen 9 9950X: ``` primitive_sparse_99pct_null/default 11.88 ms -> 9.13 ms (-23%) <- the case apache#9731 calls out primitive_all_null/default 5.65 ms -> 2.33 ms (-59%) (subsumed by apache#9954's O(1) path if that lands first) struct_sparse_99pct_null/default 5.67 ms -> 5.32 ms (-6%) struct_all_null/default 1.52 ms -> 1.31 ms (-14%) list_primitive_sparse_99pct_null, primitive (25% null), primitive_non_null, bool, string: within noise (no regression) ``` The CI benchmark bot (GKE `c4a-highmem-16`, Neoverse-V2) on the post-fixup revision shows the same shape with stronger relative wins on the targeted cases: ``` primitive_all_null/default 2.47x (11.0ms -> 4.4ms) primitive_sparse_99pct_null/default 1.60x (16.8ms -> 10.5ms) primitive_all_null/{bloom_filter,cdc,parquet_2,zstd,zstd_parquet_2} 1.38x to 2.48x primitive_sparse_99pct_null/{...} 1.28x to 1.59x list_primitive*, list_primitive_sparse_99pct_null*: 1.00x to 1.01x (within noise) ``` Microbench of the definition-level fill in isolation: 10.3x @ 100%-null, 8.6x @ 99%, 5.2x @ 90%, 1.9x @ 50%, 0.93x @ 10%, 0.81x @ 0%. Crossover ≈ 12-15% null, clean win above ~25%; the `>= 50% null` guard is conservative. This is the *materialization*-cost half of apache#9731 (~30% of the 99%-null write); the *walk*-cost half, a run-length input to the level encoder so the column writer doesn't even iterate all `num_rows` levels, is the larger structural change apache#9653 is heading toward. This PR is deliberately small and isolated so it lands independently of and rebases cleanly under that work. --------- Co-authored-by: Ryan Stewart <noreply@example.com>
Which issue does this PR close?
Rationale for this change
See #9731
What changes are included in this PR?
When an entire list, struct, fixed-size list, or leaf array is null, skip per-row iteration and emit bulk uniform def/rep levels via
extend_uniform_levelsin O(1).Are these changes tested?
All tests passing + additional all null unit tests.
Are there any user-facing changes?
None.