feat(parquet): add all-null fast paths for level building#9954

HippoBaro · 2026-05-10T02:39:01Z

Which issue does this PR close?

Spun off from feat(parquet): fuse level encoding passes and compact level representation #9653
Contributes to Column performance: run-proportional read/write cost #9731

Rationale for this change

What changes are included in this PR?

When an entire list, struct, fixed-size list, or leaf array is null, skip per-row iteration and emit uniform def/rep levels in bulk via extend_uniform_levels, which runs in O(1).
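To illustrate the idea, here is a minimal sketch of an all-null fast path for a nullable leaf column. Only the name `extend_uniform_levels` comes from this PR; its signature here, along with `LevelRuns`, `LeafColumn`, and `write_leaf_def_levels`, is hypothetical and does not mirror the actual arrow-rs code. The sketch assumes levels are stored as runs, so appending a uniform run costs the same regardless of its length, and that the null count is already known (as it is for Arrow arrays).

```rust
/// Hypothetical compact level buffer: each entry is (level, repeat count).
#[derive(Debug, Default, PartialEq)]
struct LevelRuns {
    runs: Vec<(i16, usize)>,
}

impl LevelRuns {
    /// Append `count` copies of `level` as a single run.
    /// Cost is amortized O(1), independent of `count`.
    fn extend_uniform_levels(&mut self, level: i16, count: usize) {
        if count == 0 {
            return;
        }
        // Merge with the previous run when the level matches.
        if let Some((last, n)) = self.runs.last_mut() {
            if *last == level {
                *n += count;
                return;
            }
        }
        self.runs.push((level, count));
    }

    /// Expand the runs into one level per row (for inspection only).
    fn expand(&self) -> Vec<i16> {
        self.runs
            .iter()
            .flat_map(|&(level, n)| std::iter::repeat(level).take(n))
            .collect()
    }
}

/// Hypothetical stand-in for a nullable leaf array with a cached null count.
struct LeafColumn {
    len: usize,
    null_count: usize,
    validity: Vec<bool>, // true = value present
}

/// Build definition levels for a nullable leaf whose ancestors are all present.
/// A present value gets `max_def`; a null gets `max_def - 1`.
fn write_leaf_def_levels(col: &LeafColumn, max_def: i16, out: &mut LevelRuns) {
    // Fast path: every row is null, so one uniform run covers the array.
    if col.null_count == col.len {
        out.extend_uniform_levels(max_def - 1, col.len);
        return;
    }
    // General path: visit each row.
    for &valid in &col.validity {
        let level = if valid { max_def } else { max_def - 1 };
        out.extend_uniform_levels(level, 1);
    }
}
```

With a four-row all-null column and `max_def = 1`, the fast path leaves `out.runs` holding the single entry `(0, 4)` without touching `validity` at all.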

Are these changes tested?

All existing tests pass, and this PR adds all-null unit tests.

Are there any user-facing changes?

None.

HippoBaro · 2026-05-10T03:37:15Z

@alamb @etseidl This one is short and sweet and makes the all-null write case an O(1) operation.

alamb

Code looks good to me -- thank you @HippoBaro

I think we need a few more tests and I will launch some benchmarks

alamb · 2026-05-12T19:55:51Z

run benchmarks arrow_writer

alamb · 2026-05-12T19:55:54Z

run benchmarks arrow_writer

adriangbot · 2026-05-12T19:59:22Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4434256987-20-cl8cw 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing all_null_fast_path (e1d948f) to 7abb225 (merge-base) diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-05-12T19:59:32Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4434257436-21-zwvn6 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

Comparing all_null_fast_path (e1d948f) to 7abb225 (merge-base) diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-05-12T21:04:45Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                              all_null_fast_path                     main
-----                                              ------------------                     ----
bool/bloom_filter                                  1.00     13.0±0.04ms    19.2 MB/sec    1.01     13.1±0.11ms    19.1 MB/sec
bool/cdc                                           1.01     15.9±0.06ms    15.7 MB/sec    1.00     15.7±0.09ms    15.9 MB/sec
bool/default                                       1.00     10.9±0.03ms    22.9 MB/sec    1.01     11.0±0.11ms    22.7 MB/sec
bool/parquet_2                                     1.00     14.7±0.05ms    17.0 MB/sec    1.01     14.8±0.12ms    16.9 MB/sec
bool/zstd                                          1.00     11.4±0.04ms    21.9 MB/sec    1.01     11.5±0.10ms    21.7 MB/sec
bool/zstd_parquet_2                                1.00     15.1±0.04ms    16.6 MB/sec    1.01     15.2±0.12ms    16.5 MB/sec
bool_non_null/bloom_filter                         1.00      7.1±0.03ms    17.7 MB/sec    1.00      7.1±0.02ms    17.6 MB/sec
bool_non_null/cdc                                  1.00      6.9±0.03ms    18.1 MB/sec    1.00      6.9±0.03ms    18.2 MB/sec
bool_non_null/default                              1.00      4.3±0.02ms    28.9 MB/sec    1.00      4.3±0.02ms    28.9 MB/sec
bool_non_null/parquet_2                            1.00      9.1±0.04ms    13.8 MB/sec    1.00      9.1±0.03ms    13.7 MB/sec
bool_non_null/zstd                                 1.00      4.7±0.02ms    26.7 MB/sec    1.00      4.7±0.02ms    26.7 MB/sec
bool_non_null/zstd_parquet_2                       1.00      9.5±0.04ms    13.2 MB/sec    1.00      9.5±0.04ms    13.1 MB/sec
float_with_nans/bloom_filter                       1.00     93.6±0.36ms   149.5 MB/sec    1.00     93.9±0.41ms   149.1 MB/sec
float_with_nans/cdc                                1.00     82.2±0.30ms   170.3 MB/sec    1.00     82.1±0.45ms   170.6 MB/sec
float_with_nans/default                            1.00     74.5±0.25ms   188.0 MB/sec    1.01     75.0±1.17ms   186.6 MB/sec
float_with_nans/parquet_2                          1.00     95.2±0.37ms   147.1 MB/sec    1.00     95.0±0.37ms   147.3 MB/sec
float_with_nans/zstd                               1.00    112.2±0.23ms   124.8 MB/sec    1.00    112.4±0.22ms   124.5 MB/sec
float_with_nans/zstd_parquet_2                     1.00    132.6±0.64ms   105.6 MB/sec    1.00    132.3±0.36ms   105.8 MB/sec
list_primitive/bloom_filter                        1.02    337.8±0.79ms  1614.6 MB/sec    1.00    330.1±1.76ms  1652.1 MB/sec
list_primitive/cdc                                 1.02    365.9±0.98ms  1490.5 MB/sec    1.00    359.9±0.87ms  1515.5 MB/sec
list_primitive/default                             1.02    255.8±0.65ms     2.1 GB/sec    1.00    249.9±1.22ms     2.1 GB/sec
list_primitive/parquet_2                           1.02    274.8±0.60ms  1984.7 MB/sec    1.00    268.3±0.48ms  2032.5 MB/sec
list_primitive/zstd                                1.01    506.8±0.74ms  1076.2 MB/sec    1.00    500.4±0.86ms  1089.9 MB/sec
list_primitive/zstd_parquet_2                      1.01    498.0±0.52ms  1095.0 MB/sec    1.00    492.4±0.48ms  1107.5 MB/sec
list_primitive_non_null/bloom_filter               1.00    405.2±3.38ms  1343.1 MB/sec    1.07    435.6±4.14ms  1249.5 MB/sec
list_primitive_non_null/cdc                        1.00    442.4±7.06ms  1230.2 MB/sec    1.00    442.2±8.87ms  1230.6 MB/sec
list_primitive_non_null/default                    1.00    267.9±3.17ms  2031.2 MB/sec    1.10    294.7±3.08ms  1846.8 MB/sec
list_primitive_non_null/parquet_2                  1.00    292.3±3.82ms  1862.1 MB/sec    1.07   312.0±13.71ms  1744.5 MB/sec
list_primitive_non_null/zstd                       1.00    684.7±3.43ms   794.9 MB/sec    1.05    717.1±7.09ms   759.0 MB/sec
list_primitive_non_null/zstd_parquet_2             1.00    670.7±0.33ms   811.5 MB/sec    1.00    671.3±0.48ms   810.8 MB/sec
list_primitive_sparse_99pct_null/bloom_filter      1.03     11.6±0.05ms     3.2 GB/sec    1.00     11.2±0.06ms     3.2 GB/sec
list_primitive_sparse_99pct_null/cdc               1.02     23.1±0.08ms  1616.0 MB/sec    1.00     22.7±0.07ms  1647.4 MB/sec
list_primitive_sparse_99pct_null/default           1.03     11.2±0.05ms     3.3 GB/sec    1.00     10.9±0.05ms     3.3 GB/sec
list_primitive_sparse_99pct_null/parquet_2         1.03     11.2±0.06ms     3.2 GB/sec    1.00     10.9±0.05ms     3.3 GB/sec
list_primitive_sparse_99pct_null/zstd              1.03     13.1±0.07ms     2.8 GB/sec    1.00     12.7±0.05ms     2.9 GB/sec
list_primitive_sparse_99pct_null/zstd_parquet_2    1.02     11.4±0.06ms     3.2 GB/sec    1.00     11.1±0.04ms     3.3 GB/sec
primitive/bloom_filter                             1.01    151.0±0.53ms   297.2 MB/sec    1.00    150.2±0.65ms   298.8 MB/sec
primitive/cdc                                      1.00    159.2±0.44ms   281.9 MB/sec    1.00    159.9±0.90ms   280.6 MB/sec
primitive/default                                  1.01    119.0±0.59ms   377.1 MB/sec    1.00    118.4±0.59ms   379.1 MB/sec
primitive/parquet_2                                1.00    133.1±0.26ms   337.2 MB/sec    1.00    133.3±0.57ms   336.6 MB/sec
primitive/zstd                                     1.00    147.7±0.20ms   303.9 MB/sec    1.00    147.8±0.57ms   303.7 MB/sec
primitive/zstd_parquet_2                           1.00    166.4±0.30ms   269.6 MB/sec    1.00    166.3±0.66ms   269.9 MB/sec
primitive_all_null/bloom_filter                    1.00    905.0±1.81µs    48.4 GB/sec    12.79    11.6±0.19ms     3.8 GB/sec
primitive_all_null/cdc                             1.00     19.4±0.54ms     2.3 GB/sec    1.58     30.7±0.41ms  1459.8 MB/sec
primitive_all_null/default                         1.00    273.0±0.84µs   160.6 GB/sec    39.95    10.9±0.15ms     4.0 GB/sec
primitive_all_null/parquet_2                       1.00    280.0±1.70µs   156.5 GB/sec    39.04    10.9±0.19ms     4.0 GB/sec
primitive_all_null/zstd                            1.00    387.7±1.12µs   113.0 GB/sec    28.45    11.0±0.15ms     4.0 GB/sec
primitive_all_null/zstd_parquet_2                  1.00    355.4±1.85µs   123.3 GB/sec    31.12    11.1±0.21ms     4.0 GB/sec
primitive_non_null/bloom_filter                    1.00    107.9±0.38ms   407.7 MB/sec    1.05    113.4±1.29ms   387.9 MB/sec
primitive_non_null/cdc                             1.00     90.4±0.26ms   486.8 MB/sec    1.00     90.2±0.55ms   487.8 MB/sec
primitive_non_null/default                         1.00     67.8±0.20ms   649.1 MB/sec    1.00     67.6±0.21ms   650.8 MB/sec
primitive_non_null/parquet_2                       1.00     89.6±0.26ms   490.9 MB/sec    1.00     89.5±0.25ms   491.7 MB/sec
primitive_non_null/zstd                            1.00     98.6±0.19ms   446.1 MB/sec    1.06    105.0±0.22ms   418.9 MB/sec
primitive_non_null/zstd_parquet_2                  1.00    123.4±0.19ms   356.5 MB/sec    1.05    130.1±1.76ms   338.3 MB/sec
primitive_sparse_99pct_null/bloom_filter           1.00     18.6±0.14ms     2.4 GB/sec    1.00     18.6±0.12ms     2.4 GB/sec
primitive_sparse_99pct_null/cdc                    1.00     36.9±0.41ms  1217.4 MB/sec    1.02     37.5±0.34ms  1196.9 MB/sec
primitive_sparse_99pct_null/default                1.00     16.9±0.06ms     2.6 GB/sec    1.00     16.9±0.05ms     2.6 GB/sec
primitive_sparse_99pct_null/parquet_2              1.00     17.0±0.07ms     2.6 GB/sec    1.00     16.9±0.09ms     2.6 GB/sec
primitive_sparse_99pct_null/zstd                   1.00     20.3±0.07ms     2.2 GB/sec    1.00     20.2±0.06ms     2.2 GB/sec
primitive_sparse_99pct_null/zstd_parquet_2         1.01     19.0±0.15ms     2.3 GB/sec    1.00     18.8±0.06ms     2.3 GB/sec
string/bloom_filter                                1.00    205.5±6.07ms     2.5 GB/sec    1.12   230.0±26.26ms     2.2 GB/sec
string/cdc                                         1.00    220.4±3.05ms     2.3 GB/sec    1.00    221.3±6.04ms     2.3 GB/sec
string/default                                     1.00    118.6±4.89ms     4.3 GB/sec    1.23   145.4±26.31ms     3.5 GB/sec
string/parquet_2                                   1.00    111.0±4.77ms     4.6 GB/sec    1.14    126.0±0.72ms     4.1 GB/sec
string/zstd                                        1.00    420.4±0.91ms  1247.1 MB/sec    1.01    426.0±3.05ms  1230.7 MB/sec
string/zstd_parquet_2                              1.00    394.7±0.51ms  1328.4 MB/sec    1.00    394.8±0.47ms  1327.7 MB/sec
string_and_binary_view/bloom_filter                1.00     64.3±0.26ms   501.9 MB/sec    1.01     64.8±0.29ms   497.5 MB/sec
string_and_binary_view/cdc                         1.01     59.1±0.13ms   545.4 MB/sec    1.00     58.6±0.27ms   549.9 MB/sec
string_and_binary_view/default                     1.00     48.1±0.14ms   670.3 MB/sec    1.00     48.0±0.21ms   671.6 MB/sec
string_and_binary_view/parquet_2                   1.00     59.0±0.13ms   546.8 MB/sec    1.00     58.9±0.29ms   547.5 MB/sec
string_and_binary_view/zstd                        1.00     84.6±0.12ms   381.4 MB/sec    1.00     84.4±0.23ms   382.1 MB/sec
string_and_binary_view/zstd_parquet_2              1.00     72.8±0.11ms   442.9 MB/sec    1.00     72.7±0.28ms   443.5 MB/sec
string_dictionary/bloom_filter                     1.03     92.9±0.49ms     2.8 GB/sec    1.00     89.7±0.77ms     2.9 GB/sec
string_dictionary/cdc                              1.00     54.3±0.68ms     4.7 GB/sec    1.59     86.2±0.71ms     3.0 GB/sec
string_dictionary/default                          1.02     49.7±0.84ms     5.2 GB/sec    1.00     48.8±0.36ms     5.3 GB/sec
string_dictionary/parquet_2                        1.00     53.7±0.25ms     4.8 GB/sec    1.01     54.3±0.24ms     4.8 GB/sec
string_dictionary/zstd                             1.00    210.7±0.58ms  1253.4 MB/sec    1.00    210.3±0.76ms  1255.7 MB/sec
string_dictionary/zstd_parquet_2                   1.00    199.9±0.86ms  1321.1 MB/sec    1.00    199.3±0.22ms  1325.1 MB/sec
string_non_null/bloom_filter                       1.00   256.4±11.03ms  2043.5 MB/sec    1.02   260.4±16.27ms  2012.5 MB/sec
string_non_null/cdc                                1.00    271.6±9.03ms  1929.1 MB/sec    1.00    271.0±9.60ms  1933.6 MB/sec
string_non_null/default                            1.04   135.7±11.22ms     3.8 GB/sec    1.00   130.3±13.67ms     3.9 GB/sec
string_non_null/parquet_2                          1.00    131.2±2.76ms     3.9 GB/sec    1.08   142.3±12.13ms     3.6 GB/sec
string_non_null/zstd                               1.05    557.8±6.54ms   939.3 MB/sec    1.00    533.7±1.53ms   981.9 MB/sec
string_non_null/zstd_parquet_2                     1.00    509.8±4.05ms  1027.8 MB/sec    1.00    508.1±2.26ms  1031.2 MB/sec
struct_all_null/bloom_filter                       1.00    378.0±1.04µs    41.7 GB/sec    6.69      2.5±0.00ms     6.2 GB/sec
struct_all_null/cdc                                1.00      7.9±0.16ms  2040.6 MB/sec    1.25      9.9±0.22ms  1630.4 MB/sec
struct_all_null/default                            1.00    118.7±0.38µs   132.7 GB/sec    18.96     2.3±0.00ms     7.0 GB/sec
struct_all_null/parquet_2                          1.00    120.5±0.52µs   130.7 GB/sec    18.69     2.3±0.00ms     7.0 GB/sec
struct_all_null/zstd                               1.00    166.7±0.84µs    94.5 GB/sec    13.80     2.3±0.00ms     6.8 GB/sec
struct_all_null/zstd_parquet_2                     1.00    153.2±0.60µs   102.8 GB/sec    14.92     2.3±0.00ms     6.9 GB/sec
struct_non_null/bloom_filter                       1.00     46.2±0.10ms   346.3 MB/sec    1.05     48.7±0.13ms   328.4 MB/sec
struct_non_null/cdc                                1.00     45.5±0.14ms   351.7 MB/sec    1.01     45.9±0.15ms   348.3 MB/sec
struct_non_null/default                            1.00     32.2±0.16ms   496.5 MB/sec    1.01     32.6±0.11ms   491.0 MB/sec
struct_non_null/parquet_2                          1.00     40.7±0.12ms   392.9 MB/sec    1.01     41.2±0.12ms   388.3 MB/sec
struct_non_null/zstd                               1.00     40.7±0.08ms   393.0 MB/sec    1.01     41.3±0.10ms   387.5 MB/sec
struct_non_null/zstd_parquet_2                     1.00     54.7±0.11ms   292.5 MB/sec    1.01     55.3±0.11ms   289.2 MB/sec
struct_sparse_99pct_null/bloom_filter              1.00      7.5±0.03ms     2.1 GB/sec    1.01      7.6±0.03ms     2.1 GB/sec
struct_sparse_99pct_null/cdc                       1.00     14.5±0.10ms  1109.5 MB/sec    1.07     15.5±0.11ms  1040.5 MB/sec
struct_sparse_99pct_null/default                   1.00      6.9±0.02ms     2.3 GB/sec    1.01      7.0±0.03ms     2.3 GB/sec
struct_sparse_99pct_null/parquet_2                 1.00      7.0±0.03ms     2.3 GB/sec    1.00      7.0±0.02ms     2.3 GB/sec
struct_sparse_99pct_null/zstd                      1.00      8.3±0.02ms  1941.8 MB/sec    1.00      8.3±0.02ms  1938.0 MB/sec
struct_sparse_99pct_null/zstd_parquet_2            1.00      7.7±0.02ms     2.0 GB/sec    1.00      7.7±0.03ms     2.0 GB/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	1945.4s
Peak memory	6.6 GiB
Avg memory	6.4 GiB
CPU user	1887.7s
CPU sys	57.1s
Peak spill	0 B

branch

Metric	Value
Wall time	1920.4s
Peak memory	6.6 GiB
Avg memory	6.4 GiB
CPU user	1890.6s
CPU sys	29.4s
Peak spill	0 B

adriangbot · 2026-05-12T21:04:52Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                              all_null_fast_path                     main
-----                                              ------------------                     ----
bool/bloom_filter                                  1.01     13.1±0.08ms    19.1 MB/sec    1.00     13.0±0.03ms    19.2 MB/sec
bool/cdc                                           1.03     16.0±0.11ms    15.6 MB/sec    1.00     15.6±0.04ms    16.1 MB/sec
bool/default                                       1.01     11.0±0.08ms    22.7 MB/sec    1.00     10.9±0.03ms    22.9 MB/sec
bool/parquet_2                                     1.00     14.8±0.09ms    16.9 MB/sec    1.00     14.7±0.05ms    17.0 MB/sec
bool/zstd                                          1.01     11.5±0.09ms    21.7 MB/sec    1.00     11.4±0.03ms    21.9 MB/sec
bool/zstd_parquet_2                                1.00     15.1±0.08ms    16.5 MB/sec    1.00     15.2±0.03ms    16.5 MB/sec
bool_non_null/bloom_filter                         1.00      7.1±0.03ms    17.7 MB/sec    1.00      7.1±0.03ms    17.7 MB/sec
bool_non_null/cdc                                  1.00      6.9±0.03ms    18.1 MB/sec    1.00      6.9±0.04ms    18.2 MB/sec
bool_non_null/default                              1.00      4.3±0.02ms    28.9 MB/sec    1.01      4.4±0.03ms    28.7 MB/sec
bool_non_null/parquet_2                            1.00      9.1±0.04ms    13.8 MB/sec    1.01      9.1±0.04ms    13.7 MB/sec
bool_non_null/zstd                                 1.00      4.7±0.02ms    26.7 MB/sec    1.00      4.7±0.03ms    26.6 MB/sec
bool_non_null/zstd_parquet_2                       1.00      9.5±0.04ms    13.2 MB/sec    1.01      9.5±0.03ms    13.1 MB/sec
float_with_nans/bloom_filter                       1.00     92.0±0.37ms   152.2 MB/sec    1.00     92.3±0.37ms   151.7 MB/sec
float_with_nans/cdc                                1.00     81.5±0.18ms   171.8 MB/sec    1.00     81.2±0.19ms   172.5 MB/sec
float_with_nans/default                            1.00     73.9±0.23ms   189.6 MB/sec    1.00     73.9±0.24ms   189.5 MB/sec
float_with_nans/parquet_2                          1.00     93.8±0.41ms   149.3 MB/sec    1.00     94.0±0.39ms   148.9 MB/sec
float_with_nans/zstd                               1.00    111.6±0.24ms   125.5 MB/sec    1.00    111.6±0.22ms   125.4 MB/sec
float_with_nans/zstd_parquet_2                     1.00    131.0±0.39ms   106.9 MB/sec    1.00    131.2±0.40ms   106.7 MB/sec
list_primitive/bloom_filter                        1.02    327.9±1.16ms  1663.3 MB/sec    1.00    322.8±0.38ms  1689.3 MB/sec
list_primitive/cdc                                 1.01    362.3±1.91ms  1505.4 MB/sec    1.00    357.0±1.94ms  1527.6 MB/sec
list_primitive/default                             1.02    249.3±1.68ms     2.1 GB/sec    1.00    245.5±0.34ms     2.2 GB/sec
list_primitive/parquet_2                           1.02    273.1±0.68ms  1996.7 MB/sec    1.00    267.4±0.43ms  2039.7 MB/sec
list_primitive/zstd                                1.01    499.2±2.69ms  1092.4 MB/sec    1.00    494.9±0.45ms  1102.0 MB/sec
list_primitive/zstd_parquet_2                      1.01    497.4±0.79ms  1096.5 MB/sec    1.00    490.1±0.38ms  1112.7 MB/sec
list_primitive_non_null/bloom_filter               1.03    429.3±4.23ms  1267.8 MB/sec    1.00    417.5±5.77ms  1303.6 MB/sec
list_primitive_non_null/cdc                        1.00    436.2±7.99ms  1247.7 MB/sec    1.01    438.7±7.81ms  1240.7 MB/sec
list_primitive_non_null/default                    1.01    290.4±4.14ms  1874.1 MB/sec    1.00    286.7±2.78ms  1898.6 MB/sec
list_primitive_non_null/parquet_2                  1.04    323.7±2.71ms  1681.1 MB/sec    1.00   310.3±13.54ms  1754.0 MB/sec
list_primitive_non_null/zstd                       1.00    705.6±7.92ms   771.4 MB/sec    1.01    712.6±4.39ms   763.8 MB/sec
list_primitive_non_null/zstd_parquet_2             1.00    686.5±1.70ms   792.8 MB/sec    1.00    684.5±0.46ms   795.1 MB/sec
list_primitive_sparse_99pct_null/bloom_filter      1.02     11.4±0.02ms     3.2 GB/sec    1.00     11.1±0.22ms     3.3 GB/sec
list_primitive_sparse_99pct_null/cdc               1.01     22.7±0.07ms  1643.0 MB/sec    1.00     22.4±0.05ms  1667.1 MB/sec
list_primitive_sparse_99pct_null/default           1.03     11.1±0.07ms     3.3 GB/sec    1.00     10.8±0.02ms     3.4 GB/sec
list_primitive_sparse_99pct_null/parquet_2         1.03     11.1±0.02ms     3.3 GB/sec    1.00     10.8±0.02ms     3.4 GB/sec
list_primitive_sparse_99pct_null/zstd              1.02     12.9±0.02ms     2.8 GB/sec    1.00     12.6±0.02ms     2.9 GB/sec
list_primitive_sparse_99pct_null/zstd_parquet_2    1.03     11.2±0.03ms     3.3 GB/sec    1.00     10.9±0.03ms     3.3 GB/sec
primitive/bloom_filter                             1.01    150.8±0.54ms   297.7 MB/sec    1.00    149.2±0.42ms   300.9 MB/sec
primitive/cdc                                      1.00    159.1±0.65ms   282.0 MB/sec    1.00    158.6±0.52ms   282.9 MB/sec
primitive/default                                  1.01    118.5±0.50ms   378.7 MB/sec    1.00    117.2±0.18ms   382.8 MB/sec
primitive/parquet_2                                1.01    133.2±0.50ms   336.8 MB/sec    1.00    132.1±0.17ms   339.7 MB/sec
primitive/zstd                                     1.01    147.6±0.48ms   303.9 MB/sec    1.00    146.8±0.17ms   305.7 MB/sec
primitive/zstd_parquet_2                           1.00    166.4±0.57ms   269.7 MB/sec    1.00    165.6±0.41ms   270.9 MB/sec
primitive_all_null/bloom_filter                    1.00    893.9±2.41µs    49.0 GB/sec    12.93    11.6±0.24ms     3.8 GB/sec
primitive_all_null/cdc                             1.00     19.5±0.56ms     2.2 GB/sec    1.57     30.6±0.49ms  1465.1 MB/sec
primitive_all_null/default                         1.00    272.6±0.80µs   160.8 GB/sec    40.11    10.9±0.21ms     4.0 GB/sec
primitive_all_null/parquet_2                       1.00    277.8±1.24µs   157.8 GB/sec    39.26    10.9±0.15ms     4.0 GB/sec
primitive_all_null/zstd                            1.00    384.6±0.98µs   114.0 GB/sec    28.68    11.0±0.18ms     4.0 GB/sec
primitive_all_null/zstd_parquet_2                  1.00    354.0±1.24µs   123.8 GB/sec    31.10    11.0±0.18ms     4.0 GB/sec
primitive_non_null/bloom_filter                    1.00    106.4±0.30ms   413.5 MB/sec    1.06    112.8±1.33ms   390.2 MB/sec
primitive_non_null/cdc                             1.00     89.8±0.25ms   490.0 MB/sec    1.00     89.9±0.53ms   489.2 MB/sec
primitive_non_null/default                         1.00     67.1±0.17ms   655.4 MB/sec    1.00     67.4±0.14ms   652.7 MB/sec
primitive_non_null/parquet_2                       1.00     88.9±0.23ms   495.1 MB/sec    1.00     89.2±0.21ms   493.2 MB/sec
primitive_non_null/zstd                            1.00     98.2±0.36ms   448.3 MB/sec    1.07    104.8±0.19ms   420.0 MB/sec
primitive_non_null/zstd_parquet_2                  1.00    122.7±0.35ms   358.7 MB/sec    1.06    129.8±1.74ms   338.9 MB/sec
primitive_sparse_99pct_null/bloom_filter           1.00     18.0±0.07ms     2.4 GB/sec    1.00     18.0±0.08ms     2.4 GB/sec
primitive_sparse_99pct_null/cdc                    1.00     36.0±0.37ms  1247.9 MB/sec    1.03     36.9±0.34ms  1216.2 MB/sec
primitive_sparse_99pct_null/default                1.00     16.7±0.05ms     2.6 GB/sec    1.00     16.7±0.03ms     2.6 GB/sec
primitive_sparse_99pct_null/parquet_2              1.00     16.7±0.04ms     2.6 GB/sec    1.00     16.7±0.04ms     2.6 GB/sec
primitive_sparse_99pct_null/zstd                   1.00     20.0±0.05ms     2.2 GB/sec    1.00     20.0±0.04ms     2.2 GB/sec
primitive_sparse_99pct_null/zstd_parquet_2         1.00     18.6±0.05ms     2.4 GB/sec    1.00     18.6±0.05ms     2.4 GB/sec
string/bloom_filter                                1.00    201.4±5.93ms     2.5 GB/sec    1.16   234.4±25.88ms     2.2 GB/sec
string/cdc                                         1.00    219.3±3.10ms     2.3 GB/sec    1.02    222.7±5.87ms     2.3 GB/sec
string/default                                     1.00    116.4±4.74ms     4.4 GB/sec    1.25   145.2±25.94ms     3.5 GB/sec
string/parquet_2                                   1.00    110.1±4.86ms     4.6 GB/sec    1.16    127.9±0.21ms     4.0 GB/sec
string/zstd                                        1.00    416.0±0.62ms  1260.1 MB/sec    1.02    425.8±2.49ms  1231.4 MB/sec
string/zstd_parquet_2                              1.00    393.9±0.54ms  1330.9 MB/sec    1.01    396.2±0.29ms  1323.0 MB/sec
string_and_binary_view/bloom_filter                1.00     64.3±0.30ms   501.7 MB/sec    1.00     64.2±0.23ms   502.3 MB/sec
string_and_binary_view/cdc                         1.01     59.4±0.26ms   542.9 MB/sec    1.00     58.9±0.11ms   547.4 MB/sec
string_and_binary_view/default                     1.00     48.1±0.28ms   670.3 MB/sec    1.00     48.2±0.09ms   668.6 MB/sec
string_and_binary_view/parquet_2                   1.00     59.1±0.26ms   545.6 MB/sec    1.00     59.2±0.11ms   545.2 MB/sec
string_and_binary_view/zstd                        1.00     85.0±0.28ms   379.6 MB/sec    1.00     84.8±0.12ms   380.5 MB/sec
string_and_binary_view/zstd_parquet_2              1.00     73.1±0.29ms   441.3 MB/sec    1.01     73.5±0.10ms   438.7 MB/sec
string_dictionary/bloom_filter                     1.01     93.1±0.45ms     2.8 GB/sec    1.00     91.8±0.70ms     2.8 GB/sec
string_dictionary/cdc                              1.00     54.0±0.40ms     4.8 GB/sec    1.59     86.0±0.62ms     3.0 GB/sec
string_dictionary/default                          1.01     49.5±0.20ms     5.2 GB/sec    1.00     49.0±0.31ms     5.3 GB/sec
string_dictionary/parquet_2                        1.00     54.0±0.34ms     4.8 GB/sec    1.00     54.1±0.09ms     4.8 GB/sec
string_dictionary/zstd                             1.00    209.4±0.45ms  1261.4 MB/sec    1.00    209.5±0.57ms  1260.7 MB/sec
string_dictionary/zstd_parquet_2                   1.00    199.4±0.37ms  1324.4 MB/sec    1.00    198.8±0.09ms  1328.8 MB/sec
string_non_null/bloom_filter                       1.00   258.5±10.24ms  2027.0 MB/sec    1.01   260.8±15.68ms  2009.1 MB/sec
string_non_null/cdc                                1.01    271.5±9.42ms  1930.0 MB/sec    1.00    269.3±9.52ms  1945.9 MB/sec
string_non_null/default                            1.09   140.3±10.40ms     3.6 GB/sec    1.00   129.0±13.04ms     4.0 GB/sec
string_non_null/parquet_2                          1.08    153.8±0.92ms     3.3 GB/sec    1.00   142.7±12.07ms     3.6 GB/sec
string_non_null/zstd                               1.05    554.7±4.83ms   944.6 MB/sec    1.00    529.3±1.71ms   989.9 MB/sec
string_non_null/zstd_parquet_2                     1.02    515.9±3.44ms  1015.7 MB/sec    1.00    505.9±2.31ms  1035.8 MB/sec
struct_all_null/bloom_filter                       1.00    376.6±2.15µs    41.8 GB/sec    6.70      2.5±0.00ms     6.2 GB/sec
struct_all_null/cdc                                1.00      7.9±0.18ms  2032.5 MB/sec    1.24      9.8±0.11ms  1638.4 MB/sec
struct_all_null/default                            1.00    118.3±0.26µs   133.1 GB/sec    19.03     2.3±0.00ms     7.0 GB/sec
struct_all_null/parquet_2                          1.00    120.4±0.49µs   130.8 GB/sec    18.69     2.3±0.00ms     7.0 GB/sec
struct_all_null/zstd                               1.00    165.7±0.42µs    95.0 GB/sec    13.88     2.3±0.00ms     6.8 GB/sec
struct_all_null/zstd_parquet_2                     1.00    152.4±0.56µs   103.3 GB/sec    14.99     2.3±0.00ms     6.9 GB/sec
struct_non_null/bloom_filter                       1.00     45.9±0.12ms   348.8 MB/sec    1.01     46.3±0.14ms   345.3 MB/sec
struct_non_null/cdc                                1.00     45.3±0.16ms   353.2 MB/sec    1.00     45.5±0.20ms   351.6 MB/sec
struct_non_null/default                            1.00     32.0±0.17ms   500.8 MB/sec    1.00     31.8±0.11ms   502.8 MB/sec
struct_non_null/parquet_2                          1.00     40.4±0.11ms   396.4 MB/sec    1.01     40.7±0.49ms   393.1 MB/sec
struct_non_null/zstd                               1.00     40.5±0.09ms   395.2 MB/sec    1.00     40.6±0.09ms   393.9 MB/sec
struct_non_null/zstd_parquet_2                     1.00     54.4±0.12ms   294.0 MB/sec    1.00     54.7±0.12ms   292.7 MB/sec
struct_sparse_99pct_null/bloom_filter              1.00      7.4±0.02ms     2.1 GB/sec    1.00      7.4±0.02ms     2.1 GB/sec
struct_sparse_99pct_null/cdc                       1.00     14.4±0.09ms  1118.9 MB/sec    1.07     15.4±0.09ms  1049.4 MB/sec
struct_sparse_99pct_null/default                   1.00      6.9±0.02ms     2.3 GB/sec    1.00      6.9±0.01ms     2.3 GB/sec
struct_sparse_99pct_null/parquet_2                 1.00      6.9±0.01ms     2.3 GB/sec    1.00      6.9±0.01ms     2.3 GB/sec
struct_sparse_99pct_null/zstd                      1.00      8.3±0.02ms  1953.3 MB/sec    1.00      8.3±0.01ms  1953.3 MB/sec
struct_sparse_99pct_null/zstd_parquet_2            1.00      7.7±0.02ms     2.1 GB/sec    1.00      7.7±0.01ms     2.1 GB/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	1945.4s
Peak memory	6.6 GiB
Avg memory	6.4 GiB
CPU user	1882.1s
CPU sys	57.7s
Peak spill	0 B

branch

Metric	Value
Wall time	1930.4s
Peak memory	6.6 GiB
Avg memory	6.4 GiB
CPU user	1878.3s
CPU sys	49.6s
Peak spill	0 B

etseidl

Thanks @HippoBaro, this looks nice. The few regressions look to be noise.

When an entire list, struct, fixed-size list, or leaf array is null, skip per-row iteration and emit bulk uniform def/rep levels via `extend_uniform_levels` in O(1).

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
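A minimal sketch of why the all-null case can be O(1), assuming levels are stored as runs rather than one value per row. `RunLevels`, `extend_uniform`, and both helper functions are hypothetical names for illustration; they are not the `extend_uniform_levels` implementation from this PR, and the null/valid level values are passed in rather than derived from a schema.

```rust
/// Hypothetical compact level buffer: a list of (level, run_length) pairs.
#[derive(Debug, Default, PartialEq)]
struct RunLevels {
    runs: Vec<(i16, usize)>,
}

impl RunLevels {
    /// Append `count` copies of `level` in O(1), merging with the previous
    /// run when it carries the same level.
    fn extend_uniform(&mut self, level: i16, count: usize) {
        if count == 0 {
            return;
        }
        match self.runs.last_mut() {
            Some((last, n)) if *last == level => *n += count,
            _ => self.runs.push((level, count)),
        }
    }

    /// Total number of levels represented.
    fn len(&self) -> usize {
        self.runs.iter().map(|&(_, n)| n).sum()
    }

    /// Expand the runs into one level per row (for comparison only).
    fn materialize(&self) -> Vec<i16> {
        self.runs
            .iter()
            .flat_map(|&(level, n)| std::iter::repeat(level).take(n))
            .collect()
    }
}

/// Per-row path: one call per element, O(num_rows).
fn def_levels_per_row(validity: &[bool], null_level: i16, valid_level: i16, out: &mut RunLevels) {
    for &is_valid in validity {
        out.extend_uniform(if is_valid { valid_level } else { null_level }, 1);
    }
}

/// All-null fast path: a single uniform run, independent of num_rows.
fn def_levels_all_null(num_rows: usize, null_level: i16, out: &mut RunLevels) {
    out.extend_uniform(null_level, num_rows);
}
```

Both paths produce the same levels for an all-null input; the fast path just skips the loop.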

HippoBaro · 2026-05-13T03:33:21Z

Thank you @alamb @etseidl 🙇 The branch is updated with your feedback.

…umns

When writing a nullable leaf (primitive) Arrow array, `write_leaf` built the definition-level buffer one element at a time, mapping each null bit to a level. For columns that are mostly null this does ~num_rows of branchy work and allocates a num_rows level buffer even though almost every level is the same value.

Add a length-gated bulk-fill path: when the column is majority-null and the sub-range is large enough to amortize the gate's per-call cost, build the definition levels by bulk-filling the null level (a vectorized memset) and overwriting only the non-null positions found via `NullBuffer::valid_indices()`. The per-row path is kept for non-majority-null arrays and for the small sub-ranges produced by list/struct write paths, so those shapes are not regressed.

Contributes to apache#9731. Complements apache#9954's all-null fast path by covering the sparse (mostly-but-not-entirely-null) case it does not handle.

Threshold sweep on Ryzen 9 9950X (parquet/arrow_writer benches, /default variant, vs main):

  T     primitive   list_primitive   primitive_sparse   list_primitive_sparse
  ----------------------------------------------------------------------
  0     -3.0%       +2.6%            -36.1%             +7.8%
  16    -1.4%       +1.8%            -34.8%             +2.8%
  32    -1.1%       -0.1%            -35.1%             +1.7%
  64    -1.1%       +0.7%            -34.5%             +1.7%   <- chosen
  128   -1.0%       +1.5%            -35.1%             +2.4%
  256   -1.4%       +1.4%            -35.1%             +2.7%

T=0 reproduces the per-call slice/popcount regression on list_primitive_sparse_99pct_null (+7.8%, matches the criterion bot's original measurement on the unguarded version). The +1.7% floor at T>=32 is the structural cost of evaluating the gate itself across ~10K small write_leaf calls in the list path; reducing it further would require hoisting the decision into the caller. T=64 matches T=32 on every shape and gives ~12x margin over the avg list length of ~5.

Final benchmarks vs main on Ryzen 9 9950X (T=64, /default variants):

  primitive/default                          -1.5%
  primitive_non_null/default                 -2.8%
  primitive_sparse_99pct_null/default        -35.1%
  primitive_all_null/default                 -66.4%
  list_primitive/default                     +1.8%  (within noise)
  list_primitive_non_null/default            -0.7%  (no change, p=0.45)
  list_primitive_sparse_99pct_null/default   +3.0%  (gate-check floor)
  struct_sparse_99pct_null/default           -4.9%
  bool/default                               +2.2%
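The gated bulk-fill described in this commit message can be sketched with the standard library alone. This is an illustration under assumptions, not the code from the commit: `Validity` is a hypothetical stand-in for arrow's `NullBuffer`, its `null_count` recomputes a popcount instead of reading a cached value, and only the constant name `BULK_FILL_MIN_LEN` and its value 64 come from the text above.

```rust
/// Minimum slice length before the bulk-fill path is considered (value from the sweep above).
const BULK_FILL_MIN_LEN: usize = 64;

/// Hypothetical validity bitmap: bit `i % 64` of `words[i / 64]` is set when row `i` is non-null.
struct Validity {
    words: Vec<u64>,
    len: usize,
}

impl Validity {
    fn from_bools(bits: &[bool]) -> Self {
        let mut words = vec![0u64; (bits.len() + 63) / 64];
        for (i, &valid) in bits.iter().enumerate() {
            if valid {
                words[i / 64] |= 1u64 << (i % 64);
            }
        }
        Validity { words, len: bits.len() }
    }

    fn is_valid(&self, i: usize) -> bool {
        ((self.words[i / 64] >> (i % 64)) & 1) == 1
    }

    /// Popcount over the mask, one word (64 rows) at a time.
    fn null_count(&self) -> usize {
        let valid: usize = self.words.iter().map(|w| w.count_ones() as usize).sum();
        self.len - valid
    }

    /// Indices of set bits; all-zero words yield nothing, so sparse masks are cheap to walk.
    fn valid_indices(&self) -> impl Iterator<Item = usize> + '_ {
        self.words.iter().enumerate().flat_map(|(w, &word)| {
            let mut bits = word;
            std::iter::from_fn(move || {
                if bits == 0 {
                    return None;
                }
                let offset = bits.trailing_zeros() as usize;
                bits &= bits - 1; // clear the lowest set bit
                Some(w * 64 + offset)
            })
        })
    }
}

/// Per-row path: map every validity bit to a level.
fn per_row(v: &Validity, null_level: i16, valid_level: i16) -> Vec<i16> {
    (0..v.len)
        .map(|i| if v.is_valid(i) { valid_level } else { null_level })
        .collect()
}

/// Bulk-fill path: fill with the null level, then overwrite only the valid rows.
fn bulk_fill(v: &Validity, null_level: i16, valid_level: i16) -> Vec<i16> {
    let mut levels = vec![null_level; v.len];
    for i in v.valid_indices() {
        levels[i] = valid_level;
    }
    levels
}

/// Gate: use the bulk-fill path only for long, majority-null slices.
fn def_levels(v: &Validity, null_level: i16, valid_level: i16) -> Vec<i16> {
    if v.len >= BULK_FILL_MIN_LEN && v.null_count() * 2 >= v.len {
        bulk_fill(v, null_level, valid_level)
    } else {
        per_row(v, null_level, valid_level)
    }
}
```

Both paths return identical levels for any input; the gate only decides which one is cheaper to run.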

alamb

Looks great to me -- thank you @HippoBaro and @etseidl

## Which issue does this PR close?

- Contributes to #9731.

## AI assistance

Implementation drafted with AI assistance and iterated against the benchmarks below. I've reviewed and own the code, including the gate threshold which I picked from the sweep in [Threshold (`BULK_FILL_MIN_LEN`)](#threshold-bulk_fill_min_len). Per the project's [CONTRIBUTING guidance on AI-generated submissions](https://github.com/apache/arrow-rs/blob/main/CONTRIBUTING.md#ai-generated-submissions).

## Rationale for this change

When writing a nullable leaf (primitive) Arrow array, `write_leaf` builds the definition-level buffer one element at a time, mapping each null bit to a level. For columns that are mostly null this does ~`num_rows` of branchy work and allocates a `num_rows`-element level buffer even though almost every produced level is the same value.

#9954 adds an O(1) fast path for the *entirely* null case; this PR covers the *sparse* (mostly-but-not-entirely null) case it doesn't handle, the literal subject of #9731 ("a column that is 99% null … ~100x more work than necessary").

## What changes are included in this PR?

A single popcount pass over the null mask (`Buffer::count_set_bits_offset`, O(`num_rows`/64)) counts the valid values in the range. When the slice is majority-null, the definition-level buffer is bulk-filled with the null level (a vectorized `Vec::resize` memset) and only the non-null positions (from `NullBuffer::valid_indices()`) are overwritten. The existing per-row path is kept for non-majority-null slices, so balanced and null-light columns are unaffected.

Both branches share the same `let range_nulls = nulls.slice(range.start, len)` slicing idiom; the slow path uses `range_nulls.iter()` for the def-level map and `range_nulls.valid_indices().map(|i| i + range.start)` for `non_null_indices`, with no `unsafe`.

Output is byte-identical: the level *values* are unchanged, just produced via memset+scatter (fast path) or via the high-level `NullBuffer` iterators (slow path) instead of a manual `BitIndexIterator` walk.

## Threshold (`BULK_FILL_MIN_LEN`)

The bulk-fill fast path is gated on two conditions:

- `len >= BULK_FILL_MIN_LEN` (currently 64). Per-call slice/popcount/iterator overhead only amortizes on sizable sub-ranges. List/struct paths call `write_leaf` many times with tiny ranges (avg list length 1-5); paying any per-call popcount there would regress them. A threshold sweep at T = {0, 16, 32, 64, 128, 256} on Ryzen 9 9950X shows the regression floor settles by T=32, and the choice of 64 gives ~12x margin over the average list length without losing the flat-primitive wins.
- `nulls.null_count() * 2 >= nulls.len()`. The cached `null_count()` is O(1), so this check is free. We use the buffer-wide density as a heuristic for the sub-range; for full-array writes (the primary target, flat primitive columns) it's exact.

Even when the gate skips the fast path, evaluating it across high-frequency call sites (~10K calls in some list benchmarks) is a small structural cost (~1-2% on list-sparse cases). The wins on the targeted shapes (-35% sparse-primitive, -66% all-null primitive) far outweigh that. Reducing the cost further would require hoisting the decision into the caller.

## Are these changes tested?

Existing tests cover this path: `cargo test -p parquet --features arrow --lib arrow_writer` is green (136 tests, full of nulls and roundtrips); full `cargo test -p parquet --features arrow` green modulo the pre-existing `PARQUET_TEST_DATA` submodule failures (unrelated, same on `main`). `cargo clippy -p parquet --features arrow --lib` and `cargo fmt --check` clean.

The `unsafe get_unchecked_mut` flagged in the original revision was replaced via `NullBuffer::valid_indices()`; the slow-path also dropped its `unsafe value_unchecked` for the same reason.

## Are there any user-facing changes?

None.

## Benchmarks

`cargo bench -p parquet --bench arrow_writer`, 1M rows × 7 nullable primitive columns, local Ryzen 9 9950X:

```
primitive_sparse_99pct_null/default   11.88 ms -> 9.13 ms  (-23%)   <- the case #9731 calls out
primitive_all_null/default             5.65 ms -> 2.33 ms  (-59%)   (subsumed by #9954's O(1) path if that lands first)
struct_sparse_99pct_null/default       5.67 ms -> 5.32 ms  (-6%)
struct_all_null/default                1.52 ms -> 1.31 ms  (-14%)
list_primitive_sparse_99pct_null, primitive (25% null), primitive_non_null, bool, string: within noise (no regression)
```

The CI benchmark bot (GKE `c4a-highmem-16`, Neoverse-V2) on the post-fixup revision shows the same shape with stronger relative wins on the targeted cases:

```
primitive_all_null/default                                            2.47x  (11.0ms -> 4.4ms)
primitive_sparse_99pct_null/default                                   1.60x  (16.8ms -> 10.5ms)
primitive_all_null/{bloom_filter,cdc,parquet_2,zstd,zstd_parquet_2}   1.38x to 2.48x
primitive_sparse_99pct_null/{...}                                     1.28x to 1.59x
list_primitive*, list_primitive_sparse_99pct_null*: 1.00x to 1.01x (within noise)
```

Microbench of the definition-level fill in isolation: 10.3x @ 100%-null, 8.6x @ 99%, 5.2x @ 90%, 1.9x @ 50%, 0.93x @ 10%, 0.81x @ 0%. Crossover ≈ 12-15% null, clean win above ~25%; the `>= 50% null` guard is conservative.

This is the *materialization*-cost half of #9731 (~30% of the 99%-null write); the *walk*-cost half, a run-length input to the level encoder so the column writer doesn't even iterate all `num_rows` levels, is the larger structural change #9653 is heading toward. This PR is deliberately small and isolated so it lands independently of and rebases cleanly under that work.

---------

Co-authored-by: Ryan Stewart <noreply@example.com>
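The two benchmark listings in the description above use different conventions: the local run reports a signed percent change, while the CI bot reports a speedup factor. As a small aid for comparing them, the following sketch converts a before/after pair either way. The function names are hypothetical, and because the inputs are the rounded timings printed above, results can differ slightly from figures computed on unrounded measurements.

```rust
/// Relative change in percent; negative means the new time is faster.
fn percent_change(old_ms: f64, new_ms: f64) -> f64 {
    (new_ms - old_ms) / old_ms * 100.0
}

/// Speedup factor; values above 1.0 mean the new time is faster.
fn speedup(old_ms: f64, new_ms: f64) -> f64 {
    old_ms / new_ms
}
```

For example, `percent_change(11.88, 9.13)` rounds to -23, and `speedup(16.8, 10.5)` is about 1.6, matching the two listings.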

# Which issue does this PR close?

- Spawn off from apache#9653
- Contributes to apache#9731

# Rationale for this change

See apache#9731

# What changes are included in this PR?

When an entire list, struct, fixed-size list, or leaf array is null, skip per-row iteration and emit bulk uniform def/rep levels via `extend_uniform_levels` in O(1).

# Are these changes tested?

All tests passing + additional all null unit tests.

# Are there any user-facing changes?

None.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>

github-actions Bot added the parquet Changes to the parquet crate label May 10, 2026

alamb reviewed May 12, 2026

Comment thread parquet/src/arrow/arrow_writer/levels.rs

Comment thread parquet/src/arrow/arrow_writer/levels.rs

etseidl reviewed May 12, 2026

Comment thread parquet/src/arrow/arrow_writer/levels.rs Outdated

RyanJamesStewart mentioned this pull request May 13, 2026

Bulk-fill definition levels for majority-null leaf columns #9967

Merged

HippoBaro force-pushed the all_null_fast_path branch from e1d948f to 3039241 Compare May 13, 2026 03:27

alamb approved these changes May 14, 2026

alamb merged commit 2108f20 into apache:main May 14, 2026
16 checks passed
