refactor(moarstats): safety fixes, perf cleanup, and unit tests#3718

jqnatividad · 2026-04-19T18:56:44Z

Summary

Follow-up to the recent `moarstats` work (#3654 and subsequent). Splits a code review of `src/cmd/moarstats.rs` into three reviewable commits covering safety, perf, and test-coverage wins. No behavior change in the happy path; non-finite inputs are now filtered instead of silently poisoning correlation state.

1. `refactor(moarstats): safety fixes + allocation/const cleanup` (`1eab485`)

Replace `panic!("Field pair not found...")` at two sites with `CliError` propagation. The parallel rayon closure now returns `CliResult` and is `.collect()`'d into `CliResult<HashMap<,>>`. The `debug_assert!`s that guard the invariant stay in place.
Clamp Kendall's tau factors with `.max(0.0)` before `sqrt` so tie-heavy data can't produce `sqrt(NaN)` from rounding.
NMI: `denominator == 0.0` → `denominator < f64::EPSILON`, matching the epsilon-comparison pattern used elsewhere in the file.
`temp_stats_path.to_str().unwrap()` → `ok_or_else(CliError::Other(...))`.
Swap `HashMap<String, String>` string interner (key == value) for `HashSet`, saving one clone per cache miss on the mutual-information hot path.
Named constants `OUTLIER_EXTREME_LOWER..OUTLIER_TOTAL` + `OUTLIER_COUNTS_LEN` replace ~25 `stats.counts[0..=5]` magic-index sites.
Promote the duplicated `10_000` parallel threshold to `PARALLEL_THRESHOLD` const.
Drop commented-out `// mean` / `// nullcount` initializers from `BivariateFieldInfo` constructors.

2. `perf(moarstats): NaN guard + Arc-share field maps across chunks` (`91cbf04`)

`parse_float_opt` / `parse_float_opt_from_bytes` now filter non-finite values. Previously a cell that `fast_float2` parsed as `NaN` or `±Infinity` (it accepts the "nan"/"inf" tokens) would poison Welford correlation state and propagate silently through all downstream statistics.
Parallel outlier and bivariate paths `Arc`-wrap the read-only `fields_to_count` / `field_pairs` `HashMap`s instead of deep-cloning them into every worker — one clone + (nchunks - 1) cheap `Arc` clones instead of `nchunks` full-map clones.

3. `test(moarstats): unit tests for pure helpers` (`ef26335`)

New `#[cfg(test)] mod tests` in `moarstats.rs` with 7 unit tests pinning the invariants established above: NaN/Inf rejection in both float parsers, zero-stddev Pearson skew, all-ties Kendall returning `None` (not `NaN`), perfect-concordance tau, NMI's zero-entropy and epsilon-denominator guards, and `OUTLIER_*` index-bounds.

Deferred (follow-up PRs)

Welford-mean for univariate sums — changes output in the last few ULPs; needs a baseline-comparison run.
Parallel/sequential generic core refactor — `count_all_outliers` / `..._from_reader` and `compute_all_bivariatestats` / `..._sequential` collapse to an iterator-generic shared core (~600 lines of duplication). Worth its own PR to agree on the abstraction shape.
Submodule split of `moarstats.rs` — best after the dedup above.

Test plan

`cargo build --locked --bin qsv -F all_features`
`cargo test -F all_features test_moarstats` — all 73 integration tests pass
`cargo test --bin qsv -F all_features cmd::moarstats::tests` — all 7 new unit tests pass
`cargo clippy --locked --bin qsv -F all_features -- -D warnings` — clean
Sanity check on a real dataset with the `--bivariate` path before merge

🤖 Generated with Claude Code

Safety: - Replace panic!("Field pair not found...") at two sites with CliError propagation. The parallel rayon closure now returns CliResult and is .collect()'d into CliResult<HashMap<_,_>>. - to_str().unwrap() on temp stats path -> ok_or_else(CliError). - Clamp Kendall's tau factors to >= 0 before sqrt so tie-heavy data cannot produce sqrt(NaN) from rounding. - NMI: denominator == 0.0 -> denominator < f64::EPSILON, matching the epsilon-comparison pattern used elsewhere in the file. Performance / cleanup: - String interner: HashMap<String, String> (key == value) -> HashSet<String>, saving one clone per cache miss on the MI hot path. - Named constants for outlier count indices (OUTLIER_EXTREME_LOWER..OUTLIER_TOTAL, OUTLIER_COUNTS_LEN) replace the ~25 stats.counts[0..=5] magic-index sites. - Promote duplicated 10_000 parallel threshold to PARALLEL_THRESHOLD const with doc comment. - Drop commented-out BivariateFieldInfo field initializers (// mean, // nullcount) from constructors. All 73 test_moarstats integration tests pass; clippy clean.

- parse_float_opt / parse_float_opt_from_bytes now filter out non-finite values. Previously a cell parsed as NaN or ±Infinity (fast_float2 accepts these tokens) would poison Welford correlation state and propagate silently through all downstream statistics for the column. - Parallel outlier and bivariate paths Arc-wrap the read-only fields_to_count / field_pairs HashMaps instead of deep-cloning them into every worker. One clone + (nchunks - 1) cheap Arc clones instead of nchunks full-map clones.

Add an in-file #[cfg(test)] module exercising the numeric helpers that recent review fixes touched: - parse_float_opt / parse_float_opt_from_bytes: NaN, ±Infinity, and 'inf'/'nan' string tokens are rejected. - compute_pearson_skewness: zero-stddev and None-input edges. - compute_kendall_tau: all-ties input returns None (not NaN) and perfect concordance returns 1.0. - compute_normalized_mutual_information: zero-entropy early exit and epsilon guard on a subnormal denominator. - Outlier index constants stay within OUTLIER_COUNTS_LEN. These target the code paths changed by the recent safety fixes so the invariants are pinned by tests, not just by the commit message.

codacy-production · 2026-04-19T18:57:21Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

_{TIP This summary will be updated as you push new changes. Give us feedback}

- Pass fields_to_count / field_pairs into count_all_outliers and compute_all_bivariatestats by value instead of by reference. The parallel paths now move the map directly into Arc — zero map clones, down from one. The sequential fallbacks still borrow via & so the underlying from_reader helpers keep their shared-reference signatures. - Document in the usage text that non-finite numeric tokens ("NaN", "Infinity", "-Infinity", and case variants) are treated as missing values and excluded from all statistical computations. Makes the NaN/Inf filtering added in 91cbf04 discoverable without reading the source. Review also flagged the absence of unit tests for the NaN/Inf filter; those landed in ef26335 (same branch) as parse_float_opt_filters_nan_and_infinity and parse_float_opt_from_bytes_filters_nan_and_infinity.

Copilot

Pull request overview

Refactors moarstats to improve robustness and performance while adding unit tests for key helper functions; notably, non-finite numeric tokens are now filtered out to prevent poisoning bivariate/correlation calculations.

Changes:

Filter NaN/±Infinity in float parsers and harden Kendall’s tau / NMI edge cases.
Reduce parallel-path allocations by moving read-only field maps into Arc (and add shared constants for outlier count indices / parallel threshold).
Add unit tests for parsing and statistical helper invariants.

- Reword the USAGE note about non-finite tokens: previously claimed they are counted toward null totals reported by the underlying `stats` command, but `stats` treats "NaN" / "Infinity" as valid TFloat observations via fast_float2::parse. The filter only affects moarstats' own derived computations, so the doc now scopes the promise accordingly. - Delete the `input_path.to_str().unwrap_or("").to_string()` lines in the outlier and bivariate parallel paths. The outer scope already built a UTF-8-validated `input_path_string` via ok_or_else; the shadowing copy re-did the conversion and silently fell back to the empty string on invalid UTF-8.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

- compute_normalized_mutual_information doc now mentions the f64::EPSILON guard on the denominator, matching the implementation (was still saying "denominator is 0"). - Rewrote the comments around the string interner in compute_chunk_bivariate and compute_all_bivariatestats_sequential to be honest: this is not true interning — each cache hit still clones the full String. The benefit over the prior HashMap<String, String> form is just one fewer clone on cache miss. Noted Arc<str>/SmolStr as the path to real interning, for a follow-up PR.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

If rconfig_chunk.indexed() failed inside a parallel worker, the code path sent Ok(HashMap::new()) through the channel and returned. The aggregator then merged an empty result silently, which under- counted outliers (or produced incorrect bivariate stats) without any indication to the user. Changed both the outlier and bivariate workers to send an Err(CliError::Other) instead, distinguishing the two failure modes (index missing vs. I/O error) and including the chunk index and path in the message. The sibling seek-failure branch already used this pattern — now the index-open branch is consistent with it.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

jqnatividad added 3 commits April 19, 2026 14:46

jqnatividad requested a review from Copilot April 19, 2026 19:40

Copilot started reviewing on behalf of jqnatividad April 19, 2026 19:41 View session

Copilot AI reviewed Apr 19, 2026

View reviewed changes

Comment thread src/cmd/moarstats.rs Outdated

Comment thread src/cmd/moarstats.rs Outdated

Comment thread src/cmd/moarstats.rs Outdated

jqnatividad requested a review from Copilot April 19, 2026 19:53

Copilot started reviewing on behalf of jqnatividad April 19, 2026 19:54 View session

Copilot AI reviewed Apr 19, 2026

View reviewed changes

Comment thread src/cmd/moarstats.rs Outdated

Comment thread src/cmd/moarstats.rs

jqnatividad requested a review from Copilot April 19, 2026 20:12

Copilot started reviewing on behalf of jqnatividad April 19, 2026 20:13 View session

Copilot AI reviewed Apr 19, 2026

View reviewed changes

Comment thread src/cmd/moarstats.rs

Comment thread src/cmd/moarstats.rs

jqnatividad requested a review from Copilot April 20, 2026 00:23

Copilot started reviewing on behalf of jqnatividad April 20, 2026 00:23 View session

Copilot AI reviewed Apr 20, 2026

View reviewed changes

jqnatividad merged commit 65a60e7 into master Apr 20, 2026
20 of 21 checks passed

jqnatividad deleted the refactor-moarstats branch April 20, 2026 00:27

BrewTestBot mentioned this pull request May 3, 2026

qsv 20.0.0 Homebrew/homebrew-core#280687

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(moarstats): safety fixes, perf cleanup, and unit tests#3718