Skip to content

perf(frequency): parallel tree-reduce of partial FTables (~1.3x speedup)#3728

Merged
jqnatividad merged 2 commits into
masterfrom
stats-freq-parallel-merge-opt
Apr 22, 2026
Merged

perf(frequency): parallel tree-reduce of partial FTables (~1.3x speedup)#3728
jqnatividad merged 2 commits into
masterfrom
stats-freq-parallel-merge-opt

Conversation

@jqnatividad

Copy link
Copy Markdown
Collaborator

Summary

Replace the serial merge_all(recv.iter()) of partial frequency tables with a rayon::reduce tree-reduce over the collected partials. Two new helpers (merge_ftables, merge_weighted_ftables) merge two partial tables with per-column parallelism, since each column's Frequencies / HashMap is independent.

Why

Samply profiling on a 1M-row × 41-column CSV showed only ~3.7 of 16 cores busy on average for qsv frequency — workers all finished their chunks at roughly the same wall time, then parked on the threadpool's condvar while one main thread serially folded N partial HashMaps via merge_all. With nchunks == njobs == 16 and a bounded(nchunks) channel that never back-pressures, the merge dominated the long tail.

Tree-reduce turns the O(N) serial fold into O(log N) rounds that the rayon pool runs in parallel, and the per-column merge inside each round adds a second axis of parallelism.

Measured

hyperfine, 10 runs, warmup 3, M4 Max 16 cores, 1M-row × 41-col NYC 311 CSV:

mean min max user CPU avg cores busy
baseline (master) 924.0 ms ± 7.9 913.0 938.6 3187 ms ~3.5×
new (parallel-merge) 716.7 ms ± 22.4 677.7 763.0 3782 ms ~5.3×

~1.29× wall-time speedup; user-CPU rose because workers now contribute during the merge phase.

Correctness

  • Output is byte-identical to master (diff -q clean on the benchmark CSV). Frequencies::merge is qsv-stats's Commute trait — commutative + associative — so any reduce order yields the same result.
  • All 160 tests in cargo test frequency -F all_features pass, including the existing prop_frequency_seq_vs_parallel proptest that directly compares parallel vs sequential output.
  • No public API change. Only src/cmd/frequency.rs touched (+57 / -21).

Test plan

  • cargo build --locked --bin qsv -F all_features — clean
  • cargo clippy --locked --bin qsv -F all_features — clean
  • cargo test --locked frequency -F all_features — 160/160 pass
  • cargo test --locked stats -F all_features — 706/706 pass (no regression)
  • Byte-equal output check on 1M-row CSV
  • hyperfine wall-time comparison
  • samply re-profile confirms higher CPU utilization (~3.5× → ~5.3× cores)

🤖 Generated with Claude Code

Replace serial `merge_all(recv.iter())` with `rayon::reduce` over the
collected partial FTables. New helpers `merge_ftables` and
`merge_weighted_ftables` merge two partial tables with per-column
parallelism (each column's `Frequencies` / HashMap is independent).

Why: samply profiling on a 1M-row x 41-col CSV showed only ~3.7 of 16
cores busy on average — workers all finished their chunks then parked
on the threadpool's condvar while one main thread folded N partial
HashMaps. Tree-reduce turns the O(N) serial fold into O(log N) rounds
that the rayon pool runs in parallel, and the per-column merge inside
each round adds a second axis of parallelism.

Measured (hyperfine, 10 runs, M4 Max 16 cores, 1M rows x 41 cols):

  baseline (master):    924.0 ms +/- 7.9   user 3187 ms (~3.5x cores)
  new (parallel-merge): 716.7 ms +/- 22.4  user 3782 ms (~5.3x cores)
  -> 1.29x faster

Output is byte-identical to master (`Frequencies::merge` is the
`Commute` trait's commutative + associative merge, so any reduce order
yields the same result). All 160 frequency tests pass, including the
existing `prop_frequency_seq_vs_parallel` proptest.
@codacy-production

codacy-production Bot commented Apr 22, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes. Give us feedback

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves qsv frequency performance by replacing the serial merge of per-chunk partial frequency tables with a Rayon-powered parallel tree-reduce, including parallel per-column merges during each reduce step.

Changes:

  • Replaced serial folding of partial unweighted FTables with a Rayon reduce tree-reduce.
  • Replaced serial folding of partial WeightedFTables with a Rayon reduce tree-reduce.
  • Added pairwise merge helpers to merge two partial tables with per-column parallelism.

Comment thread src/cmd/frequency.rs Outdated
Comment thread src/cmd/frequency.rs Outdated
Comment thread src/cmd/frequency.rs Outdated
Comment thread src/cmd/frequency.rs Outdated
Comment thread src/cmd/frequency.rs Outdated
…for streaming reduce

- Promote `debug_assert_eq!` -> `assert_eq!` in `merge_ftables` and
  `merge_weighted_ftables`. Without this the subsequent `zip(...)` would
  silently truncate to the shorter Vec on a length mismatch and drop columns
  in release builds. Cost is one cmp per merge call (~16 in default path).

- Switch both reduce call sites to `recv.into_iter().par_bridge().reduce(...)`.
  The previous `recv.iter().collect::<Vec<_>>()` materialized all `nchunks`
  partial tables before reducing — a real concern when memory-aware chunking
  picks small chunk sizes (nchunks can grow into the hundreds). par_bridge
  lets the reducer consume chunks as they arrive, so peak memory is bounded
  by rayon's working set rather than `nchunks`.

160 frequency tests still pass.
@jqnatividad jqnatividad merged commit 624d0a9 into master Apr 22, 2026
17 checks passed
@jqnatividad jqnatividad deleted the stats-freq-parallel-merge-opt branch April 22, 2026 12:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants