
feat(parquet): add all-null fast paths for level building #9954

Merged
alamb merged 1 commit into apache:main from HippoBaro:all_null_fast_path
May 14, 2026

Conversation

@HippoBaro

Contributor

Which issue does this PR close?

Rationale for this change

See #9731

What changes are included in this PR?

When an entire list, struct, fixed-size list, or leaf array is null, the level builder skips per-row iteration and emits uniform definition and repetition levels in bulk via extend_uniform_levels, making the all-null case O(1).
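
The fast path described above can be illustrated with a minimal sketch. It is not the code from this PR: `LevelRuns`, `extend_uniform`, and `build_def_levels` are hypothetical names, and the sketch assumes levels are stored as run-length pairs so that appending a uniform run costs one update regardless of row count. The null count is passed in because Arrow arrays track it, so no scan is needed to detect the all-null case.

```rust
/// Hypothetical run-length level buffer (not the parquet crate's actual type).
#[derive(Debug, Default, PartialEq)]
struct LevelRuns {
    /// Pairs of (level value, number of consecutive rows with that level).
    runs: Vec<(i16, usize)>,
    /// Total number of levels represented.
    len: usize,
}

impl LevelRuns {
    /// Append `count` copies of `level` by updating or adding a single run.
    fn extend_uniform(&mut self, level: i16, count: usize) {
        if count == 0 {
            return;
        }
        self.len += count;
        if let Some((last, n)) = self.runs.last_mut() {
            if *last == level {
                *n += count;
                return;
            }
        }
        self.runs.push((level, count));
    }
}

/// Build definition levels for a nullable leaf column.
/// `validity[i]` is true when row `i` is non-null, `null_count` is the
/// precomputed number of nulls, and `max_def` is the level of a present value.
fn build_def_levels(validity: &[bool], null_count: usize, max_def: i16) -> LevelRuns {
    let mut out = LevelRuns::default();
    if null_count == validity.len() {
        // All-null fast path: one uniform run, no per-row iteration.
        out.extend_uniform(max_def - 1, validity.len());
        return out;
    }
    // Slow path: visit every row.
    for &valid in validity {
        out.extend_uniform(if valid { max_def } else { max_def - 1 }, 1);
    }
    out
}

fn main() {
    let levels = build_def_levels(&[false; 4], 4, 1);
    println!("{:?}", levels);
}
```

In this sketch the slow path still produces the same run-length output, so both paths can be compared directly in a unit test.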

Are these changes tested?

All existing tests pass, and this PR adds unit tests for the all-null cases.

Are there any user-facing changes?

None.

@github-actions bot added the parquet label (Changes to the parquet crate) on May 10, 2026
@HippoBaro

Contributor Author

@alamb @etseidl This one is short and sweet and makes the all-null write case an O(1) operation.

@alamb alamb left a comment

Contributor


Code looks good to me -- thank you @HippoBaro

I think we need a few more tests and I will launch some benchmarks

Comment thread parquet/src/arrow/arrow_writer/levels.rs
Comment thread parquet/src/arrow/arrow_writer/levels.rs
@alamb

alamb commented May 12, 2026

Contributor

run benchmarks arrow_writer

1 similar comment
@alamb

alamb commented May 12, 2026

Contributor

run benchmarks arrow_writer

@adriangbot


🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4434256987-20-cl8cw 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing all_null_fast_path (e1d948f) to 7abb225 (merge-base) diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=
Results will be posted here when complete



@adriangbot


🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4434257436-21-zwvn6 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux


Comparing all_null_fast_path (e1d948f) to 7abb225 (merge-base) diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=
Results will be posted here when complete



@adriangbot


🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                              all_null_fast_path                     main
-----                                              ------------------                     ----
bool/bloom_filter                                  1.00     13.0±0.04ms    19.2 MB/sec    1.01     13.1±0.11ms    19.1 MB/sec
bool/cdc                                           1.01     15.9±0.06ms    15.7 MB/sec    1.00     15.7±0.09ms    15.9 MB/sec
bool/default                                       1.00     10.9±0.03ms    22.9 MB/sec    1.01     11.0±0.11ms    22.7 MB/sec
bool/parquet_2                                     1.00     14.7±0.05ms    17.0 MB/sec    1.01     14.8±0.12ms    16.9 MB/sec
bool/zstd                                          1.00     11.4±0.04ms    21.9 MB/sec    1.01     11.5±0.10ms    21.7 MB/sec
bool/zstd_parquet_2                                1.00     15.1±0.04ms    16.6 MB/sec    1.01     15.2±0.12ms    16.5 MB/sec
bool_non_null/bloom_filter                         1.00      7.1±0.03ms    17.7 MB/sec    1.00      7.1±0.02ms    17.6 MB/sec
bool_non_null/cdc                                  1.00      6.9±0.03ms    18.1 MB/sec    1.00      6.9±0.03ms    18.2 MB/sec
bool_non_null/default                              1.00      4.3±0.02ms    28.9 MB/sec    1.00      4.3±0.02ms    28.9 MB/sec
bool_non_null/parquet_2                            1.00      9.1±0.04ms    13.8 MB/sec    1.00      9.1±0.03ms    13.7 MB/sec
bool_non_null/zstd                                 1.00      4.7±0.02ms    26.7 MB/sec    1.00      4.7±0.02ms    26.7 MB/sec
bool_non_null/zstd_parquet_2                       1.00      9.5±0.04ms    13.2 MB/sec    1.00      9.5±0.04ms    13.1 MB/sec
float_with_nans/bloom_filter                       1.00     93.6±0.36ms   149.5 MB/sec    1.00     93.9±0.41ms   149.1 MB/sec
float_with_nans/cdc                                1.00     82.2±0.30ms   170.3 MB/sec    1.00     82.1±0.45ms   170.6 MB/sec
float_with_nans/default                            1.00     74.5±0.25ms   188.0 MB/sec    1.01     75.0±1.17ms   186.6 MB/sec
float_with_nans/parquet_2                          1.00     95.2±0.37ms   147.1 MB/sec    1.00     95.0±0.37ms   147.3 MB/sec
float_with_nans/zstd                               1.00    112.2±0.23ms   124.8 MB/sec    1.00    112.4±0.22ms   124.5 MB/sec
float_with_nans/zstd_parquet_2                     1.00    132.6±0.64ms   105.6 MB/sec    1.00    132.3±0.36ms   105.8 MB/sec
list_primitive/bloom_filter                        1.02    337.8±0.79ms  1614.6 MB/sec    1.00    330.1±1.76ms  1652.1 MB/sec
list_primitive/cdc                                 1.02    365.9±0.98ms  1490.5 MB/sec    1.00    359.9±0.87ms  1515.5 MB/sec
list_primitive/default                             1.02    255.8±0.65ms     2.1 GB/sec    1.00    249.9±1.22ms     2.1 GB/sec
list_primitive/parquet_2                           1.02    274.8±0.60ms  1984.7 MB/sec    1.00    268.3±0.48ms  2032.5 MB/sec
list_primitive/zstd                                1.01    506.8±0.74ms  1076.2 MB/sec    1.00    500.4±0.86ms  1089.9 MB/sec
list_primitive/zstd_parquet_2                      1.01    498.0±0.52ms  1095.0 MB/sec    1.00    492.4±0.48ms  1107.5 MB/sec
list_primitive_non_null/bloom_filter               1.00    405.2±3.38ms  1343.1 MB/sec    1.07    435.6±4.14ms  1249.5 MB/sec
list_primitive_non_null/cdc                        1.00    442.4±7.06ms  1230.2 MB/sec    1.00    442.2±8.87ms  1230.6 MB/sec
list_primitive_non_null/default                    1.00    267.9±3.17ms  2031.2 MB/sec    1.10    294.7±3.08ms  1846.8 MB/sec
list_primitive_non_null/parquet_2                  1.00    292.3±3.82ms  1862.1 MB/sec    1.07   312.0±13.71ms  1744.5 MB/sec
list_primitive_non_null/zstd                       1.00    684.7±3.43ms   794.9 MB/sec    1.05    717.1±7.09ms   759.0 MB/sec
list_primitive_non_null/zstd_parquet_2             1.00    670.7±0.33ms   811.5 MB/sec    1.00    671.3±0.48ms   810.8 MB/sec
list_primitive_sparse_99pct_null/bloom_filter      1.03     11.6±0.05ms     3.2 GB/sec    1.00     11.2±0.06ms     3.2 GB/sec
list_primitive_sparse_99pct_null/cdc               1.02     23.1±0.08ms  1616.0 MB/sec    1.00     22.7±0.07ms  1647.4 MB/sec
list_primitive_sparse_99pct_null/default           1.03     11.2±0.05ms     3.3 GB/sec    1.00     10.9±0.05ms     3.3 GB/sec
list_primitive_sparse_99pct_null/parquet_2         1.03     11.2±0.06ms     3.2 GB/sec    1.00     10.9±0.05ms     3.3 GB/sec
list_primitive_sparse_99pct_null/zstd              1.03     13.1±0.07ms     2.8 GB/sec    1.00     12.7±0.05ms     2.9 GB/sec
list_primitive_sparse_99pct_null/zstd_parquet_2    1.02     11.4±0.06ms     3.2 GB/sec    1.00     11.1±0.04ms     3.3 GB/sec
primitive/bloom_filter                             1.01    151.0±0.53ms   297.2 MB/sec    1.00    150.2±0.65ms   298.8 MB/sec
primitive/cdc                                      1.00    159.2±0.44ms   281.9 MB/sec    1.00    159.9±0.90ms   280.6 MB/sec
primitive/default                                  1.01    119.0±0.59ms   377.1 MB/sec    1.00    118.4±0.59ms   379.1 MB/sec
primitive/parquet_2                                1.00    133.1±0.26ms   337.2 MB/sec    1.00    133.3±0.57ms   336.6 MB/sec
primitive/zstd                                     1.00    147.7±0.20ms   303.9 MB/sec    1.00    147.8±0.57ms   303.7 MB/sec
primitive/zstd_parquet_2                           1.00    166.4±0.30ms   269.6 MB/sec    1.00    166.3±0.66ms   269.9 MB/sec
primitive_all_null/bloom_filter                    1.00    905.0±1.81µs    48.4 GB/sec    12.79    11.6±0.19ms     3.8 GB/sec
primitive_all_null/cdc                             1.00     19.4±0.54ms     2.3 GB/sec    1.58     30.7±0.41ms  1459.8 MB/sec
primitive_all_null/default                         1.00    273.0±0.84µs   160.6 GB/sec    39.95    10.9±0.15ms     4.0 GB/sec
primitive_all_null/parquet_2                       1.00    280.0±1.70µs   156.5 GB/sec    39.04    10.9±0.19ms     4.0 GB/sec
primitive_all_null/zstd                            1.00    387.7±1.12µs   113.0 GB/sec    28.45    11.0±0.15ms     4.0 GB/sec
primitive_all_null/zstd_parquet_2                  1.00    355.4±1.85µs   123.3 GB/sec    31.12    11.1±0.21ms     4.0 GB/sec
primitive_non_null/bloom_filter                    1.00    107.9±0.38ms   407.7 MB/sec    1.05    113.4±1.29ms   387.9 MB/sec
primitive_non_null/cdc                             1.00     90.4±0.26ms   486.8 MB/sec    1.00     90.2±0.55ms   487.8 MB/sec
primitive_non_null/default                         1.00     67.8±0.20ms   649.1 MB/sec    1.00     67.6±0.21ms   650.8 MB/sec
primitive_non_null/parquet_2                       1.00     89.6±0.26ms   490.9 MB/sec    1.00     89.5±0.25ms   491.7 MB/sec
primitive_non_null/zstd                            1.00     98.6±0.19ms   446.1 MB/sec    1.06    105.0±0.22ms   418.9 MB/sec
primitive_non_null/zstd_parquet_2                  1.00    123.4±0.19ms   356.5 MB/sec    1.05    130.1±1.76ms   338.3 MB/sec
primitive_sparse_99pct_null/bloom_filter           1.00     18.6±0.14ms     2.4 GB/sec    1.00     18.6±0.12ms     2.4 GB/sec
primitive_sparse_99pct_null/cdc                    1.00     36.9±0.41ms  1217.4 MB/sec    1.02     37.5±0.34ms  1196.9 MB/sec
primitive_sparse_99pct_null/default                1.00     16.9±0.06ms     2.6 GB/sec    1.00     16.9±0.05ms     2.6 GB/sec
primitive_sparse_99pct_null/parquet_2              1.00     17.0±0.07ms     2.6 GB/sec    1.00     16.9±0.09ms     2.6 GB/sec
primitive_sparse_99pct_null/zstd                   1.00     20.3±0.07ms     2.2 GB/sec    1.00     20.2±0.06ms     2.2 GB/sec
primitive_sparse_99pct_null/zstd_parquet_2         1.01     19.0±0.15ms     2.3 GB/sec    1.00     18.8±0.06ms     2.3 GB/sec
string/bloom_filter                                1.00    205.5±6.07ms     2.5 GB/sec    1.12   230.0±26.26ms     2.2 GB/sec
string/cdc                                         1.00    220.4±3.05ms     2.3 GB/sec    1.00    221.3±6.04ms     2.3 GB/sec
string/default                                     1.00    118.6±4.89ms     4.3 GB/sec    1.23   145.4±26.31ms     3.5 GB/sec
string/parquet_2                                   1.00    111.0±4.77ms     4.6 GB/sec    1.14    126.0±0.72ms     4.1 GB/sec
string/zstd                                        1.00    420.4±0.91ms  1247.1 MB/sec    1.01    426.0±3.05ms  1230.7 MB/sec
string/zstd_parquet_2                              1.00    394.7±0.51ms  1328.4 MB/sec    1.00    394.8±0.47ms  1327.7 MB/sec
string_and_binary_view/bloom_filter                1.00     64.3±0.26ms   501.9 MB/sec    1.01     64.8±0.29ms   497.5 MB/sec
string_and_binary_view/cdc                         1.01     59.1±0.13ms   545.4 MB/sec    1.00     58.6±0.27ms   549.9 MB/sec
string_and_binary_view/default                     1.00     48.1±0.14ms   670.3 MB/sec    1.00     48.0±0.21ms   671.6 MB/sec
string_and_binary_view/parquet_2                   1.00     59.0±0.13ms   546.8 MB/sec    1.00     58.9±0.29ms   547.5 MB/sec
string_and_binary_view/zstd                        1.00     84.6±0.12ms   381.4 MB/sec    1.00     84.4±0.23ms   382.1 MB/sec
string_and_binary_view/zstd_parquet_2              1.00     72.8±0.11ms   442.9 MB/sec    1.00     72.7±0.28ms   443.5 MB/sec
string_dictionary/bloom_filter                     1.03     92.9±0.49ms     2.8 GB/sec    1.00     89.7±0.77ms     2.9 GB/sec
string_dictionary/cdc                              1.00     54.3±0.68ms     4.7 GB/sec    1.59     86.2±0.71ms     3.0 GB/sec
string_dictionary/default                          1.02     49.7±0.84ms     5.2 GB/sec    1.00     48.8±0.36ms     5.3 GB/sec
string_dictionary/parquet_2                        1.00     53.7±0.25ms     4.8 GB/sec    1.01     54.3±0.24ms     4.8 GB/sec
string_dictionary/zstd                             1.00    210.7±0.58ms  1253.4 MB/sec    1.00    210.3±0.76ms  1255.7 MB/sec
string_dictionary/zstd_parquet_2                   1.00    199.9±0.86ms  1321.1 MB/sec    1.00    199.3±0.22ms  1325.1 MB/sec
string_non_null/bloom_filter                       1.00   256.4±11.03ms  2043.5 MB/sec    1.02   260.4±16.27ms  2012.5 MB/sec
string_non_null/cdc                                1.00    271.6±9.03ms  1929.1 MB/sec    1.00    271.0±9.60ms  1933.6 MB/sec
string_non_null/default                            1.04   135.7±11.22ms     3.8 GB/sec    1.00   130.3±13.67ms     3.9 GB/sec
string_non_null/parquet_2                          1.00    131.2±2.76ms     3.9 GB/sec    1.08   142.3±12.13ms     3.6 GB/sec
string_non_null/zstd                               1.05    557.8±6.54ms   939.3 MB/sec    1.00    533.7±1.53ms   981.9 MB/sec
string_non_null/zstd_parquet_2                     1.00    509.8±4.05ms  1027.8 MB/sec    1.00    508.1±2.26ms  1031.2 MB/sec
struct_all_null/bloom_filter                       1.00    378.0±1.04µs    41.7 GB/sec    6.69      2.5±0.00ms     6.2 GB/sec
struct_all_null/cdc                                1.00      7.9±0.16ms  2040.6 MB/sec    1.25      9.9±0.22ms  1630.4 MB/sec
struct_all_null/default                            1.00    118.7±0.38µs   132.7 GB/sec    18.96     2.3±0.00ms     7.0 GB/sec
struct_all_null/parquet_2                          1.00    120.5±0.52µs   130.7 GB/sec    18.69     2.3±0.00ms     7.0 GB/sec
struct_all_null/zstd                               1.00    166.7±0.84µs    94.5 GB/sec    13.80     2.3±0.00ms     6.8 GB/sec
struct_all_null/zstd_parquet_2                     1.00    153.2±0.60µs   102.8 GB/sec    14.92     2.3±0.00ms     6.9 GB/sec
struct_non_null/bloom_filter                       1.00     46.2±0.10ms   346.3 MB/sec    1.05     48.7±0.13ms   328.4 MB/sec
struct_non_null/cdc                                1.00     45.5±0.14ms   351.7 MB/sec    1.01     45.9±0.15ms   348.3 MB/sec
struct_non_null/default                            1.00     32.2±0.16ms   496.5 MB/sec    1.01     32.6±0.11ms   491.0 MB/sec
struct_non_null/parquet_2                          1.00     40.7±0.12ms   392.9 MB/sec    1.01     41.2±0.12ms   388.3 MB/sec
struct_non_null/zstd                               1.00     40.7±0.08ms   393.0 MB/sec    1.01     41.3±0.10ms   387.5 MB/sec
struct_non_null/zstd_parquet_2                     1.00     54.7±0.11ms   292.5 MB/sec    1.01     55.3±0.11ms   289.2 MB/sec
struct_sparse_99pct_null/bloom_filter              1.00      7.5±0.03ms     2.1 GB/sec    1.01      7.6±0.03ms     2.1 GB/sec
struct_sparse_99pct_null/cdc                       1.00     14.5±0.10ms  1109.5 MB/sec    1.07     15.5±0.11ms  1040.5 MB/sec
struct_sparse_99pct_null/default                   1.00      6.9±0.02ms     2.3 GB/sec    1.01      7.0±0.03ms     2.3 GB/sec
struct_sparse_99pct_null/parquet_2                 1.00      7.0±0.03ms     2.3 GB/sec    1.00      7.0±0.02ms     2.3 GB/sec
struct_sparse_99pct_null/zstd                      1.00      8.3±0.02ms  1941.8 MB/sec    1.00      8.3±0.02ms  1938.0 MB/sec
struct_sparse_99pct_null/zstd_parquet_2            1.00      7.7±0.02ms     2.0 GB/sec    1.00      7.7±0.03ms     2.0 GB/sec
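
As a reading aid for the table above, assuming the usual convention that the faster side of each row is shown as 1.00 and the other side as a multiple of it, the ratio can be approximated from the printed times. The helper `speedup` below is a hypothetical name for illustration, and the inputs are the rounded values from the primitive_all_null/default row, so the result only approximates the 39.95 reported there.

```rust
/// Ratio of baseline time to candidate time; both must use the same unit.
fn speedup(baseline_us: f64, candidate_us: f64) -> f64 {
    baseline_us / candidate_us
}

fn main() {
    // primitive_all_null/default: main takes 10.9 ms, the branch takes 273.0 µs.
    let ratio = speedup(10.9 * 1000.0, 273.0);
    println!("speedup is roughly {:.1}x", ratio);
}
```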

Resource Usage

base (merge-base)

Metric        Value
------        -----
Wall time     1945.4s
Peak memory   6.6 GiB
Avg memory    6.4 GiB
CPU user      1887.7s
CPU sys       57.1s
Peak spill    0 B

branch

Metric        Value
------        -----
Wall time     1920.4s
Peak memory   6.6 GiB
Avg memory    6.4 GiB
CPU user      1890.6s
CPU sys       29.4s
Peak spill    0 B


@adriangbot


🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                              all_null_fast_path                     main
-----                                              ------------------                     ----
bool/bloom_filter                                  1.01     13.1±0.08ms    19.1 MB/sec    1.00     13.0±0.03ms    19.2 MB/sec
bool/cdc                                           1.03     16.0±0.11ms    15.6 MB/sec    1.00     15.6±0.04ms    16.1 MB/sec
bool/default                                       1.01     11.0±0.08ms    22.7 MB/sec    1.00     10.9±0.03ms    22.9 MB/sec
bool/parquet_2                                     1.00     14.8±0.09ms    16.9 MB/sec    1.00     14.7±0.05ms    17.0 MB/sec
bool/zstd                                          1.01     11.5±0.09ms    21.7 MB/sec    1.00     11.4±0.03ms    21.9 MB/sec
bool/zstd_parquet_2                                1.00     15.1±0.08ms    16.5 MB/sec    1.00     15.2±0.03ms    16.5 MB/sec
bool_non_null/bloom_filter                         1.00      7.1±0.03ms    17.7 MB/sec    1.00      7.1±0.03ms    17.7 MB/sec
bool_non_null/cdc                                  1.00      6.9±0.03ms    18.1 MB/sec    1.00      6.9±0.04ms    18.2 MB/sec
bool_non_null/default                              1.00      4.3±0.02ms    28.9 MB/sec    1.01      4.4±0.03ms    28.7 MB/sec
bool_non_null/parquet_2                            1.00      9.1±0.04ms    13.8 MB/sec    1.01      9.1±0.04ms    13.7 MB/sec
bool_non_null/zstd                                 1.00      4.7±0.02ms    26.7 MB/sec    1.00      4.7±0.03ms    26.6 MB/sec
bool_non_null/zstd_parquet_2                       1.00      9.5±0.04ms    13.2 MB/sec    1.01      9.5±0.03ms    13.1 MB/sec
float_with_nans/bloom_filter                       1.00     92.0±0.37ms   152.2 MB/sec    1.00     92.3±0.37ms   151.7 MB/sec
float_with_nans/cdc                                1.00     81.5±0.18ms   171.8 MB/sec    1.00     81.2±0.19ms   172.5 MB/sec
float_with_nans/default                            1.00     73.9±0.23ms   189.6 MB/sec    1.00     73.9±0.24ms   189.5 MB/sec
float_with_nans/parquet_2                          1.00     93.8±0.41ms   149.3 MB/sec    1.00     94.0±0.39ms   148.9 MB/sec
float_with_nans/zstd                               1.00    111.6±0.24ms   125.5 MB/sec    1.00    111.6±0.22ms   125.4 MB/sec
float_with_nans/zstd_parquet_2                     1.00    131.0±0.39ms   106.9 MB/sec    1.00    131.2±0.40ms   106.7 MB/sec
list_primitive/bloom_filter                        1.02    327.9±1.16ms  1663.3 MB/sec    1.00    322.8±0.38ms  1689.3 MB/sec
list_primitive/cdc                                 1.01    362.3±1.91ms  1505.4 MB/sec    1.00    357.0±1.94ms  1527.6 MB/sec
list_primitive/default                             1.02    249.3±1.68ms     2.1 GB/sec    1.00    245.5±0.34ms     2.2 GB/sec
list_primitive/parquet_2                           1.02    273.1±0.68ms  1996.7 MB/sec    1.00    267.4±0.43ms  2039.7 MB/sec
list_primitive/zstd                                1.01    499.2±2.69ms  1092.4 MB/sec    1.00    494.9±0.45ms  1102.0 MB/sec
list_primitive/zstd_parquet_2                      1.01    497.4±0.79ms  1096.5 MB/sec    1.00    490.1±0.38ms  1112.7 MB/sec
list_primitive_non_null/bloom_filter               1.03    429.3±4.23ms  1267.8 MB/sec    1.00    417.5±5.77ms  1303.6 MB/sec
list_primitive_non_null/cdc                        1.00    436.2±7.99ms  1247.7 MB/sec    1.01    438.7±7.81ms  1240.7 MB/sec
list_primitive_non_null/default                    1.01    290.4±4.14ms  1874.1 MB/sec    1.00    286.7±2.78ms  1898.6 MB/sec
list_primitive_non_null/parquet_2                  1.04    323.7±2.71ms  1681.1 MB/sec    1.00   310.3±13.54ms  1754.0 MB/sec
list_primitive_non_null/zstd                       1.00    705.6±7.92ms   771.4 MB/sec    1.01    712.6±4.39ms   763.8 MB/sec
list_primitive_non_null/zstd_parquet_2             1.00    686.5±1.70ms   792.8 MB/sec    1.00    684.5±0.46ms   795.1 MB/sec
list_primitive_sparse_99pct_null/bloom_filter      1.02     11.4±0.02ms     3.2 GB/sec    1.00     11.1±0.22ms     3.3 GB/sec
list_primitive_sparse_99pct_null/cdc               1.01     22.7±0.07ms  1643.0 MB/sec    1.00     22.4±0.05ms  1667.1 MB/sec
list_primitive_sparse_99pct_null/default           1.03     11.1±0.07ms     3.3 GB/sec    1.00     10.8±0.02ms     3.4 GB/sec
list_primitive_sparse_99pct_null/parquet_2         1.03     11.1±0.02ms     3.3 GB/sec    1.00     10.8±0.02ms     3.4 GB/sec
list_primitive_sparse_99pct_null/zstd              1.02     12.9±0.02ms     2.8 GB/sec    1.00     12.6±0.02ms     2.9 GB/sec
list_primitive_sparse_99pct_null/zstd_parquet_2    1.03     11.2±0.03ms     3.3 GB/sec    1.00     10.9±0.03ms     3.3 GB/sec
primitive/bloom_filter                             1.01    150.8±0.54ms   297.7 MB/sec    1.00    149.2±0.42ms   300.9 MB/sec
primitive/cdc                                      1.00    159.1±0.65ms   282.0 MB/sec    1.00    158.6±0.52ms   282.9 MB/sec
primitive/default                                  1.01    118.5±0.50ms   378.7 MB/sec    1.00    117.2±0.18ms   382.8 MB/sec
primitive/parquet_2                                1.01    133.2±0.50ms   336.8 MB/sec    1.00    132.1±0.17ms   339.7 MB/sec
primitive/zstd                                     1.01    147.6±0.48ms   303.9 MB/sec    1.00    146.8±0.17ms   305.7 MB/sec
primitive/zstd_parquet_2                           1.00    166.4±0.57ms   269.7 MB/sec    1.00    165.6±0.41ms   270.9 MB/sec
primitive_all_null/bloom_filter                    1.00    893.9±2.41µs    49.0 GB/sec    12.93    11.6±0.24ms     3.8 GB/sec
primitive_all_null/cdc                             1.00     19.5±0.56ms     2.2 GB/sec    1.57     30.6±0.49ms  1465.1 MB/sec
primitive_all_null/default                         1.00    272.6±0.80µs   160.8 GB/sec    40.11    10.9±0.21ms     4.0 GB/sec
primitive_all_null/parquet_2                       1.00    277.8±1.24µs   157.8 GB/sec    39.26    10.9±0.15ms     4.0 GB/sec
primitive_all_null/zstd                            1.00    384.6±0.98µs   114.0 GB/sec    28.68    11.0±0.18ms     4.0 GB/sec
primitive_all_null/zstd_parquet_2                  1.00    354.0±1.24µs   123.8 GB/sec    31.10    11.0±0.18ms     4.0 GB/sec
primitive_non_null/bloom_filter                    1.00    106.4±0.30ms   413.5 MB/sec    1.06    112.8±1.33ms   390.2 MB/sec
primitive_non_null/cdc                             1.00     89.8±0.25ms   490.0 MB/sec    1.00     89.9±0.53ms   489.2 MB/sec
primitive_non_null/default                         1.00     67.1±0.17ms   655.4 MB/sec    1.00     67.4±0.14ms   652.7 MB/sec
primitive_non_null/parquet_2                       1.00     88.9±0.23ms   495.1 MB/sec    1.00     89.2±0.21ms   493.2 MB/sec
primitive_non_null/zstd                            1.00     98.2±0.36ms   448.3 MB/sec    1.07    104.8±0.19ms   420.0 MB/sec
primitive_non_null/zstd_parquet_2                  1.00    122.7±0.35ms   358.7 MB/sec    1.06    129.8±1.74ms   338.9 MB/sec
primitive_sparse_99pct_null/bloom_filter           1.00     18.0±0.07ms     2.4 GB/sec    1.00     18.0±0.08ms     2.4 GB/sec
primitive_sparse_99pct_null/cdc                    1.00     36.0±0.37ms  1247.9 MB/sec    1.03     36.9±0.34ms  1216.2 MB/sec
primitive_sparse_99pct_null/default                1.00     16.7±0.05ms     2.6 GB/sec    1.00     16.7±0.03ms     2.6 GB/sec
primitive_sparse_99pct_null/parquet_2              1.00     16.7±0.04ms     2.6 GB/sec    1.00     16.7±0.04ms     2.6 GB/sec
primitive_sparse_99pct_null/zstd                   1.00     20.0±0.05ms     2.2 GB/sec    1.00     20.0±0.04ms     2.2 GB/sec
primitive_sparse_99pct_null/zstd_parquet_2         1.00     18.6±0.05ms     2.4 GB/sec    1.00     18.6±0.05ms     2.4 GB/sec
string/bloom_filter                                1.00    201.4±5.93ms     2.5 GB/sec    1.16   234.4±25.88ms     2.2 GB/sec
string/cdc                                         1.00    219.3±3.10ms     2.3 GB/sec    1.02    222.7±5.87ms     2.3 GB/sec
string/default                                     1.00    116.4±4.74ms     4.4 GB/sec    1.25   145.2±25.94ms     3.5 GB/sec
string/parquet_2                                   1.00    110.1±4.86ms     4.6 GB/sec    1.16    127.9±0.21ms     4.0 GB/sec
string/zstd                                        1.00    416.0±0.62ms  1260.1 MB/sec    1.02    425.8±2.49ms  1231.4 MB/sec
string/zstd_parquet_2                              1.00    393.9±0.54ms  1330.9 MB/sec    1.01    396.2±0.29ms  1323.0 MB/sec
string_and_binary_view/bloom_filter                1.00     64.3±0.30ms   501.7 MB/sec    1.00     64.2±0.23ms   502.3 MB/sec
string_and_binary_view/cdc                         1.01     59.4±0.26ms   542.9 MB/sec    1.00     58.9±0.11ms   547.4 MB/sec
string_and_binary_view/default                     1.00     48.1±0.28ms   670.3 MB/sec    1.00     48.2±0.09ms   668.6 MB/sec
string_and_binary_view/parquet_2                   1.00     59.1±0.26ms   545.6 MB/sec    1.00     59.2±0.11ms   545.2 MB/sec
string_and_binary_view/zstd                        1.00     85.0±0.28ms   379.6 MB/sec    1.00     84.8±0.12ms   380.5 MB/sec
string_and_binary_view/zstd_parquet_2              1.00     73.1±0.29ms   441.3 MB/sec    1.01     73.5±0.10ms   438.7 MB/sec
string_dictionary/bloom_filter                     1.01     93.1±0.45ms     2.8 GB/sec    1.00     91.8±0.70ms     2.8 GB/sec
string_dictionary/cdc                              1.00     54.0±0.40ms     4.8 GB/sec    1.59     86.0±0.62ms     3.0 GB/sec
string_dictionary/default                          1.01     49.5±0.20ms     5.2 GB/sec    1.00     49.0±0.31ms     5.3 GB/sec
string_dictionary/parquet_2                        1.00     54.0±0.34ms     4.8 GB/sec    1.00     54.1±0.09ms     4.8 GB/sec
string_dictionary/zstd                             1.00    209.4±0.45ms  1261.4 MB/sec    1.00    209.5±0.57ms  1260.7 MB/sec
string_dictionary/zstd_parquet_2                   1.00    199.4±0.37ms  1324.4 MB/sec    1.00    198.8±0.09ms  1328.8 MB/sec
string_non_null/bloom_filter                       1.00   258.5±10.24ms  2027.0 MB/sec    1.01   260.8±15.68ms  2009.1 MB/sec
string_non_null/cdc                                1.01    271.5±9.42ms  1930.0 MB/sec    1.00    269.3±9.52ms  1945.9 MB/sec
string_non_null/default                            1.09   140.3±10.40ms     3.6 GB/sec    1.00   129.0±13.04ms     4.0 GB/sec
string_non_null/parquet_2                          1.08    153.8±0.92ms     3.3 GB/sec    1.00   142.7±12.07ms     3.6 GB/sec
string_non_null/zstd                               1.05    554.7±4.83ms   944.6 MB/sec    1.00    529.3±1.71ms   989.9 MB/sec
string_non_null/zstd_parquet_2                     1.02    515.9±3.44ms  1015.7 MB/sec    1.00    505.9±2.31ms  1035.8 MB/sec
struct_all_null/bloom_filter                       1.00    376.6±2.15µs    41.8 GB/sec    6.70      2.5±0.00ms     6.2 GB/sec
struct_all_null/cdc                                1.00      7.9±0.18ms  2032.5 MB/sec    1.24      9.8±0.11ms  1638.4 MB/sec
struct_all_null/default                            1.00    118.3±0.26µs   133.1 GB/sec    19.03     2.3±0.00ms     7.0 GB/sec
struct_all_null/parquet_2                          1.00    120.4±0.49µs   130.8 GB/sec    18.69     2.3±0.00ms     7.0 GB/sec
struct_all_null/zstd                               1.00    165.7±0.42µs    95.0 GB/sec    13.88     2.3±0.00ms     6.8 GB/sec
struct_all_null/zstd_parquet_2                     1.00    152.4±0.56µs   103.3 GB/sec    14.99     2.3±0.00ms     6.9 GB/sec
struct_non_null/bloom_filter                       1.00     45.9±0.12ms   348.8 MB/sec    1.01     46.3±0.14ms   345.3 MB/sec
struct_non_null/cdc                                1.00     45.3±0.16ms   353.2 MB/sec    1.00     45.5±0.20ms   351.6 MB/sec
struct_non_null/default                            1.00     32.0±0.17ms   500.8 MB/sec    1.00     31.8±0.11ms   502.8 MB/sec
struct_non_null/parquet_2                          1.00     40.4±0.11ms   396.4 MB/sec    1.01     40.7±0.49ms   393.1 MB/sec
struct_non_null/zstd                               1.00     40.5±0.09ms   395.2 MB/sec    1.00     40.6±0.09ms   393.9 MB/sec
struct_non_null/zstd_parquet_2                     1.00     54.4±0.12ms   294.0 MB/sec    1.00     54.7±0.12ms   292.7 MB/sec
struct_sparse_99pct_null/bloom_filter              1.00      7.4±0.02ms     2.1 GB/sec    1.00      7.4±0.02ms     2.1 GB/sec
struct_sparse_99pct_null/cdc                       1.00     14.4±0.09ms  1118.9 MB/sec    1.07     15.4±0.09ms  1049.4 MB/sec
struct_sparse_99pct_null/default                   1.00      6.9±0.02ms     2.3 GB/sec    1.00      6.9±0.01ms     2.3 GB/sec
struct_sparse_99pct_null/parquet_2                 1.00      6.9±0.01ms     2.3 GB/sec    1.00      6.9±0.01ms     2.3 GB/sec
struct_sparse_99pct_null/zstd                      1.00      8.3±0.02ms  1953.3 MB/sec    1.00      8.3±0.01ms  1953.3 MB/sec
struct_sparse_99pct_null/zstd_parquet_2            1.00      7.7±0.02ms     2.1 GB/sec    1.00      7.7±0.01ms     2.1 GB/sec

Resource Usage

base (merge-base)

| Metric | Value |
| --- | --- |
| Wall time | 1945.4s |
| Peak memory | 6.6 GiB |
| Avg memory | 6.4 GiB |
| CPU user | 1882.1s |
| CPU sys | 57.7s |
| Peak spill | 0 B |

branch

| Metric | Value |
| --- | --- |
| Wall time | 1930.4s |
| Peak memory | 6.6 GiB |
| Avg memory | 6.4 GiB |
| CPU user | 1878.3s |
| CPU sys | 49.6s |
| Peak spill | 0 B |

@etseidl etseidl left a comment

Thanks @HippoBaro, this looks nice. The few regressions look to be noise.

Comment thread parquet/src/arrow/arrow_writer/levels.rs Outdated
When an entire list, struct, fixed-size list, or leaf array is null,
skip per-row iteration and emit bulk uniform def/rep levels via
`extend_uniform_levels` in O(1).

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
@HippoBaro HippoBaro force-pushed the all_null_fast_path branch from e1d948f to 3039241 Compare May 13, 2026 03:27
@HippoBaro

Thank you @alamb @etseidl 🙇 The branch is updated with your feedback.

RyanJamesStewart pushed a commit to RyanJamesStewart/arrow-rs that referenced this pull request May 13, 2026
…umns

When writing a nullable leaf (primitive) Arrow array, `write_leaf` built the
definition-level buffer one element at a time, mapping each null bit to a
level. For columns that are mostly null this does ~num_rows of branchy work
and allocates a num_rows level buffer even though almost every level is the
same value.

Add a length-gated bulk-fill path: when the column is majority-null and the
sub-range is large enough to amortize the gate's per-call cost, build the
definition levels by bulk-filling the null level (a vectorized memset) and
overwriting only the non-null positions found via `NullBuffer::valid_indices()`.
The per-row path is kept for non-majority-null arrays and for the small
sub-ranges produced by list/struct write paths, so those shapes are not
regressed.

Contributes to apache#9731. Complements apache#9954's all-null fast path by covering the
sparse (mostly-but-not-entirely-null) case it does not handle.

Threshold sweep on Ryzen 9 9950X (parquet/arrow_writer benches, /default
variant, vs main):

  T   primitive  list_primitive  primitive_sparse  list_primitive_sparse
  ----------------------------------------------------------------------
  0    -3.0%      +2.6%           -36.1%            +7.8%
  16   -1.4%      +1.8%           -34.8%            +2.8%
  32   -1.1%      -0.1%           -35.1%            +1.7%
  64   -1.1%      +0.7%           -34.5%            +1.7%   <- chosen
  128  -1.0%      +1.5%           -35.1%            +2.4%
  256  -1.4%      +1.4%           -35.1%            +2.7%

T=0 reproduces the per-call slice/popcount regression on
list_primitive_sparse_99pct_null (+7.8%, matches the criterion bot's
original measurement on the unguarded version). The +1.7% floor at T>=32
is the structural cost of evaluating the gate itself across ~10K small
write_leaf calls in the list path; reducing it further would require
hoisting the decision into the caller. T=64 matches T=32 on every shape
and gives ~12x margin over the avg list length of ~5.

Final benchmarks vs main on Ryzen 9 9950X (T=64, /default variants):

  primitive/default                            -1.5%
  primitive_non_null/default                   -2.8%
  primitive_sparse_99pct_null/default         -35.1%
  primitive_all_null/default                  -66.4%
  list_primitive/default                       +1.8%  (within noise)
  list_primitive_non_null/default              -0.7%  (no change, p=0.45)
  list_primitive_sparse_99pct_null/default     +3.0%  (gate-check floor)
  struct_sparse_99pct_null/default             -4.9%
  bool/default                                 +2.2%

@alamb alamb left a comment

Looks great to me -- thank you @HippoBaro and @etseidl

@alamb alamb merged commit 2108f20 into apache:main May 14, 2026
16 checks passed
alamb pushed a commit that referenced this pull request May 20, 2026
## Which issue does this PR close?

- Contributes to #9731.

## AI assistance

Implementation drafted with AI assistance and iterated against the
benchmarks below. I've reviewed and own the code, including the gate
threshold, which I picked from the sweep in [Threshold
(`BULK_FILL_MIN_LEN`)](#threshold-bulk_fill_min_len). This disclosure
follows the project's [CONTRIBUTING guidance on AI-generated
submissions](https://github.com/apache/arrow-rs/blob/main/CONTRIBUTING.md#ai-generated-submissions).

## Rationale for this change

When writing a nullable leaf (primitive) Arrow array, `write_leaf`
builds the definition-level buffer one element at a time, mapping each
null bit to a level. For columns that are mostly null this does
~`num_rows` of branchy work and allocates a `num_rows`-element level
buffer even though almost every produced level is the same value. #9954
adds an O(1) fast path for the *entirely* null case; this PR covers the
*sparse* (mostly-but-not-entirely null) case it doesn't handle, the
literal subject of #9731 ("a column that is 99% null … ~100x more work
than necessary").

## What changes are included in this PR?

A single popcount pass over the null mask
(`Buffer::count_set_bits_offset`, O(`num_rows`/64)) counts the valid
values in the range. When the slice is majority-null, the
definition-level buffer is bulk-filled with the null level (a vectorized
`Vec::resize` memset) and only the non-null positions (from
`NullBuffer::valid_indices()`) are overwritten. The existing per-row
path is kept for non-majority-null slices, so balanced and null-light
columns are unaffected. Both branches share the same `let range_nulls =
nulls.slice(range.start, len)` slicing idiom; the slow path uses
`range_nulls.iter()` for the def-level map and
`range_nulls.valid_indices().map(|i| i + range.start)` for
`non_null_indices`, with no `unsafe`. Output is byte-identical: the
level *values* are unchanged, just produced via memset+scatter (fast
path) or via the high-level `NullBuffer` iterators (slow path) instead
of a manual `BitIndexIterator` walk.
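
The memset+scatter idea can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code in `write_leaf`: the validity mask is modeled as plain `u64` words (bit set = valid), and `valid_indices`, `def_levels_bulk_fill`, and `def_levels_per_row` are hypothetical helpers written for this example rather than arrow-rs APIs.

```rust
/// Collect the positions of set bits, visiting one 64-bit word at a time.
/// Words that are entirely zero (all null) cost a single comparison.
fn valid_indices(words: &[u64], len: usize) -> Vec<usize> {
    let mut out = Vec::new();
    for (w, &word) in words.iter().enumerate() {
        let mut bits = word;
        while bits != 0 {
            let idx = w * 64 + bits.trailing_zeros() as usize;
            if idx < len {
                out.push(idx);
            }
            bits &= bits - 1; // clear the lowest set bit
        }
    }
    out
}

/// Bulk-fill path: fill every slot with the null level, then overwrite
/// only the valid positions.
fn def_levels_bulk_fill(
    words: &[u64],
    len: usize,
    null_level: i16,
    valid_level: i16,
) -> (Vec<i16>, Vec<usize>) {
    let mut levels = vec![null_level; len];
    let non_null_indices = valid_indices(words, len);
    for &i in &non_null_indices {
        levels[i] = valid_level;
    }
    (levels, non_null_indices)
}

/// Per-row path: test every bit and map it to a level.
fn def_levels_per_row(words: &[u64], len: usize, null_level: i16, valid_level: i16) -> Vec<i16> {
    (0..len)
        .map(|i| {
            if (words[i / 64] >> (i % 64)) & 1 == 1 {
                valid_level
            } else {
                null_level
            }
        })
        .collect()
}
```

Both functions produce the same levels for any mask; the bulk-fill variant simply does work proportional to the number of valid values plus one fill, which is why it pays off only when most values are null.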

## Threshold (`BULK_FILL_MIN_LEN`)

The bulk-fill fast path is gated on two conditions:

- `len >= BULK_FILL_MIN_LEN` (currently 64). Per-call
slice/popcount/iterator overhead only amortizes on sizable sub-ranges.
List/struct paths call `write_leaf` many times with tiny ranges (avg
list length 1-5); paying any per-call popcount there would regress them.
A threshold sweep at T = {0, 16, 32, 64, 128, 256} on Ryzen 9 9950X
shows the regression floor settles by T=32, and the choice of 64 gives
~12x margin over the average list length without losing the
flat-primitive wins.
- `nulls.null_count() * 2 >= nulls.len()`. The cached `null_count()` is
O(1), so this check is free. We use the buffer-wide density as a
heuristic for the sub-range; for full-array writes (the primary target,
flat primitive columns) it's exact.
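
The two conditions above amount to a cheap predicate. The following is a sketch only: `use_bulk_fill` and its parameter names are hypothetical, and the constant simply mirrors the value of 64 quoted in this description.

```rust
/// Minimum sub-range length for the bulk-fill path (value taken from the
/// description above; the real constant lives in the parquet crate).
const BULK_FILL_MIN_LEN: usize = 64;

/// Hypothetical gate: take the bulk-fill path only for sizable ranges
/// of a buffer that is at least half null.
fn use_bulk_fill(range_len: usize, buffer_null_count: usize, buffer_len: usize) -> bool {
    range_len >= BULK_FILL_MIN_LEN && buffer_null_count * 2 >= buffer_len
}
```

For a flat 1M-row column that is 99% null, `use_bulk_fill(1_000_000, 990_000, 1_000_000)` is true, while a 5-element list sub-range of the same buffer is rejected by the length check alone.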

Even when the gate skips the fast path, evaluating it across
high-frequency call sites (~10K calls in some list benchmarks) is a
small structural cost (~1-2% on list-sparse cases). The wins on the
targeted shapes (-35% sparse-primitive, -66% all-null primitive) far
outweigh that. Reducing the cost further would require hoisting the
decision into the caller.

## Are these changes tested?

Existing tests cover this path. `cargo test -p parquet --features arrow
--lib arrow_writer` passes (136 tests, many exercising nulls and
roundtrips), and the full `cargo test -p parquet --features arrow` passes
apart from the pre-existing `PARQUET_TEST_DATA` submodule failures
(unrelated, and identical on `main`). `cargo clippy -p parquet --features
arrow --lib` and `cargo fmt --check` are clean. The `unsafe
get_unchecked_mut` flagged in the original revision was replaced with
`NullBuffer::valid_indices()`; the slow path also dropped its `unsafe
value_unchecked` for the same reason.

## Are there any user-facing changes?

None.

## Benchmarks

`cargo bench -p parquet --bench arrow_writer`, 1M rows × 7 nullable
primitive columns, local Ryzen 9 9950X:

```
primitive_sparse_99pct_null/default   11.88 ms -> 9.13 ms   (-23%)   <- the case #9731 calls out
primitive_all_null/default             5.65 ms -> 2.33 ms   (-59%)   (subsumed by #9954's O(1) path if that lands first)
struct_sparse_99pct_null/default       5.67 ms -> 5.32 ms   (-6%)
struct_all_null/default                1.52 ms -> 1.31 ms   (-14%)
list_primitive_sparse_99pct_null, primitive (25% null), primitive_non_null, bool, string:  within noise (no regression)
```

The CI benchmark bot (GKE `c4a-highmem-16`, Neoverse-V2) on the
post-fixup revision shows the same shape with stronger relative wins on
the targeted cases:

```
primitive_all_null/default              2.47x (11.0ms -> 4.4ms)
primitive_sparse_99pct_null/default     1.60x (16.8ms -> 10.5ms)
primitive_all_null/{bloom_filter,cdc,parquet_2,zstd,zstd_parquet_2}    1.38x to 2.48x
primitive_sparse_99pct_null/{...}        1.28x to 1.59x
list_primitive*, list_primitive_sparse_99pct_null*:                    1.00x to 1.01x (within noise)
```

Microbench of the definition-level fill in isolation: 10.3x @ 100%-null,
8.6x @ 99%, 5.2x @ 90%, 1.9x @ 50%, 0.93x @ 10%, 0.81x @ 0%. Crossover ≈
12-15% null, clean win above ~25%; the `>= 50% null` guard is
conservative.
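
As a rough sanity check on the crossover estimate, one can interpolate between the two microbench points that bracket a speedup of 1.0x (0.93x at 10% null and 1.9x at 50% null). The sketch below assumes the speedup varies linearly between those points, which the measurements do not establish; `crossover` is a hypothetical helper.

```rust
/// Linearly interpolate the x value at which the speedup reaches 1.0,
/// given two measured points (x0, y0) and (x1, y1) with y0 < 1.0 < y1.
fn crossover(x0: f64, y0: f64, x1: f64, y1: f64) -> f64 {
    x0 + (1.0 - y0) / (y1 - y0) * (x1 - x0)
}
```

With the numbers above, `crossover(10.0, 0.93, 50.0, 1.9)` evaluates to roughly 12.9% null, which falls inside the quoted 12-15% range.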

This is the *materialization*-cost half of #9731 (~30% of the 99%-null
write); the *walk*-cost half, a run-length input to the level encoder so
the column writer doesn't even iterate all `num_rows` levels, is the
larger structural change #9653 is heading toward. This PR is
deliberately small and isolated so it lands independently of and rebases
cleanly under that work.

---------

Co-authored-by: Ryan Stewart <noreply@example.com>
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
# Which issue does this PR close?

- Spun off from apache#9653
- Contributes to apache#9731

# Rationale for this change

See apache#9731

# What changes are included in this PR?

When an entire list, struct, fixed-size list, or leaf array is null,
skip per-row iteration and emit bulk uniform def/rep levels via
`extend_uniform_levels` in O(1).
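
To illustrate why an all-null array can be handled without per-row work, here is a minimal sketch. It is not the crate's code: `LevelBuf` and `push_run` are hypothetical names invented for this example, and the signature of the real `extend_uniform_levels` helper is not shown in this thread.

```rust
/// Hypothetical level buffer: either one repeated value or explicit levels.
#[derive(Debug, Clone, PartialEq)]
enum LevelBuf {
    /// `count` copies of `value`, stored without one slot per row.
    Uniform { value: i16, count: usize },
    /// One explicit level per row.
    Materialized(Vec<i16>),
}

impl LevelBuf {
    fn new() -> Self {
        LevelBuf::Uniform { value: 0, count: 0 }
    }

    fn len(&self) -> usize {
        match self {
            LevelBuf::Uniform { count, .. } => *count,
            LevelBuf::Materialized(levels) => levels.len(),
        }
    }

    /// Append `count` copies of `value`. While the buffer stays uniform
    /// this only bumps a counter, independent of `count`.
    fn push_run(&mut self, value: i16, count: usize) {
        if count == 0 {
            return;
        }
        match self {
            LevelBuf::Uniform { value: v, count: c } => {
                if *c == 0 || *v == value {
                    *v = value;
                    *c += count;
                } else {
                    // A different value arrived: fall back to explicit levels.
                    let mut levels = vec![*v; *c];
                    levels.resize(*c + count, value);
                    *self = LevelBuf::Materialized(levels);
                }
            }
            LevelBuf::Materialized(levels) => {
                let new_len = levels.len() + count;
                levels.resize(new_len, value);
            }
        }
    }
}
```

In this model, writing a fully null 1M-row column is a single `push_run(null_level, 1_000_000)` call that never touches individual rows; only a later run with a different level forces materialization.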

# Are these changes tested?

All tests passing + additional all null unit tests.

# Are there any user-facing changes?

None.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026