perf(frequency): parallel tree-reduce of partial FTables (~1.3x speedup)#3728

jqnatividad · 2026-04-22T10:35:38Z

Summary

Replace the serial merge_all(recv.iter()) of partial frequency tables with a rayon::reduce tree-reduce over the collected partials. Two new helpers (merge_ftables, merge_weighted_ftables) merge two partial tables with per-column parallelism, since each column's Frequencies / HashMap is independent.

Why

Samply profiling on a 1M-row × 41-column CSV showed only ~3.7 of 16 cores busy on average for qsv frequency — workers all finished their chunks at roughly the same wall time, then parked on the threadpool's condvar while one main thread serially folded N partial HashMaps via merge_all. With nchunks == njobs == 16 and a bounded(nchunks) channel that never back-pressures, the merge dominated the long tail.

Tree-reduce turns the O(N) serial fold into O(log N) rounds that the rayon pool runs in parallel, and the per-column merge inside each round adds a second axis of parallelism.

Measured

hyperfine, 10 runs, warmup 3, M4 Max 16 cores, 1M-row × 41-col NYC 311 CSV:

	mean	min	max	user CPU	avg cores busy
baseline (master)	924.0 ms ± 7.9	913.0	938.6	3187 ms	~3.5×
new (parallel-merge)	716.7 ms ± 22.4	677.7	763.0	3782 ms	~5.3×

~1.29× wall-time speedup; user-CPU rose because workers now contribute during the merge phase.

Correctness

Output is byte-identical to master (diff -q clean on the benchmark CSV). Frequencies::merge is qsv-stats's Commute trait — commutative + associative — so any reduce order yields the same result.
All 160 tests in cargo test frequency -F all_features pass, including the existing prop_frequency_seq_vs_parallel proptest that directly compares parallel vs sequential output.
No public API change. Only src/cmd/frequency.rs touched (+57 / -21).

Test plan

cargo build --locked --bin qsv -F all_features — clean
cargo clippy --locked --bin qsv -F all_features — clean
cargo test --locked frequency -F all_features — 160/160 pass
cargo test --locked stats -F all_features — 706/706 pass (no regression)
Byte-equal output check on 1M-row CSV
hyperfine wall-time comparison
samply re-profile confirms higher CPU utilization (~3.5× → ~5.3× cores)

🤖 Generated with Claude Code

Replace serial `merge_all(recv.iter())` with `rayon::reduce` over the collected partial FTables. New helpers `merge_ftables` and `merge_weighted_ftables` merge two partial tables with per-column parallelism (each column's `Frequencies` / HashMap is independent). Why: samply profiling on a 1M-row x 41-col CSV showed only ~3.7 of 16 cores busy on average — workers all finished their chunks then parked on the threadpool's condvar while one main thread folded N partial HashMaps. Tree-reduce turns the O(N) serial fold into O(log N) rounds that the rayon pool runs in parallel, and the per-column merge inside each round adds a second axis of parallelism. Measured (hyperfine, 10 runs, M4 Max 16 cores, 1M rows x 41 cols): baseline (master): 924.0 ms +/- 7.9 user 3187 ms (~3.5x cores) new (parallel-merge): 716.7 ms +/- 22.4 user 3782 ms (~5.3x cores) -> 1.29x faster Output is byte-identical to master (`Frequencies::merge` is the `Commute` trait's commutative + associative merge, so any reduce order yields the same result). All 160 frequency tests pass, including the existing `prop_frequency_seq_vs_parallel` proptest.

codacy-production · 2026-04-22T10:36:15Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes. Give us feedback}

Copilot

Pull request overview

This PR improves qsv frequency performance by replacing the serial merge of per-chunk partial frequency tables with a Rayon-powered parallel tree-reduce, including parallel per-column merges during each reduce step.

Changes:

Replaced serial folding of partial unweighted FTables with a Rayon reduce tree-reduce.
Replaced serial folding of partial WeightedFTables with a Rayon reduce tree-reduce.
Added pairwise merge helpers to merge two partial tables with per-column parallelism.

…for streaming reduce - Promote `debug_assert_eq!` -> `assert_eq!` in `merge_ftables` and `merge_weighted_ftables`. Without this the subsequent `zip(...)` would silently truncate to the shorter Vec on a length mismatch and drop columns in release builds. Cost is one cmp per merge call (~16 in default path). - Switch both reduce call sites to `recv.into_iter().par_bridge().reduce(...)`. The previous `recv.iter().collect::<Vec<_>>()` materialized all `nchunks` partial tables before reducing — a real concern when memory-aware chunking picks small chunk sizes (nchunks can grow into the hundreds). par_bridge lets the reducer consume chunks as they arrive, so peak memory is bounded by rayon's working set rather than `nchunks`. 160 frequency tests still pass.

jqnatividad requested a review from Copilot April 22, 2026 10:37

Copilot started reviewing on behalf of jqnatividad April 22, 2026 10:37 View session

Copilot AI reviewed Apr 22, 2026

View reviewed changes

Comment thread src/cmd/frequency.rs Outdated

Comment thread src/cmd/frequency.rs Outdated

Comment thread src/cmd/frequency.rs Outdated

Comment thread src/cmd/frequency.rs Outdated

Comment thread src/cmd/frequency.rs Outdated

jqnatividad merged commit 624d0a9 into master Apr 22, 2026
17 checks passed

jqnatividad deleted the stats-freq-parallel-merge-opt branch April 22, 2026 12:03

BrewTestBot mentioned this pull request May 3, 2026

qsv 20.0.0 Homebrew/homebrew-core#280687

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(frequency): parallel tree-reduce of partial FTables (~1.3x speedup)#3728