Skip to content

refactor(moarstats): safety fixes, perf cleanup, and unit tests#3718

Merged
jqnatividad merged 7 commits into
masterfrom
refactor-moarstats
Apr 20, 2026
Merged

refactor(moarstats): safety fixes, perf cleanup, and unit tests#3718
jqnatividad merged 7 commits into
masterfrom
refactor-moarstats

Conversation

@jqnatividad

Copy link
Copy Markdown
Collaborator

Summary

Follow-up to the recent `moarstats` work (#3654 and subsequent). Splits a code review of `src/cmd/moarstats.rs` into three reviewable commits covering safety, perf, and test-coverage wins. No behavior change in the happy path; non-finite inputs are now filtered instead of silently poisoning correlation state.

1. `refactor(moarstats): safety fixes + allocation/const cleanup` (1eab485)

  • Replace `panic!("Field pair not found...")` at two sites with `CliError` propagation. The parallel rayon closure now returns `CliResult` and is `.collect()`'d into `CliResult<HashMap<,>>`. The `debug_assert!`s that guard the invariant stay in place.
  • Clamp Kendall's tau factors with `.max(0.0)` before `sqrt` so tie-heavy data can't produce `sqrt(NaN)` from rounding.
  • NMI: `denominator == 0.0` → `denominator < f64::EPSILON`, matching the epsilon-comparison pattern used elsewhere in the file.
  • `temp_stats_path.to_str().unwrap()` → `ok_or_else(CliError::Other(...))`.
  • Swap `HashMap<String, String>` string interner (key == value) for `HashSet`, saving one clone per cache miss on the mutual-information hot path.
  • Named constants `OUTLIER_EXTREME_LOWER..OUTLIER_TOTAL` + `OUTLIER_COUNTS_LEN` replace ~25 `stats.counts[0..=5]` magic-index sites.
  • Promote the duplicated `10_000` parallel threshold to `PARALLEL_THRESHOLD` const.
  • Drop commented-out `// mean` / `// nullcount` initializers from `BivariateFieldInfo` constructors.

2. `perf(moarstats): NaN guard + Arc-share field maps across chunks` (91cbf04)

  • `parse_float_opt` / `parse_float_opt_from_bytes` now filter non-finite values. Previously a cell that `fast_float2` parsed as `NaN` or `±Infinity` (it accepts the "nan"/"inf" tokens) would poison Welford correlation state and propagate silently through all downstream statistics.
  • Parallel outlier and bivariate paths `Arc`-wrap the read-only `fields_to_count` / `field_pairs` `HashMap`s instead of deep-cloning them into every worker — one clone + (nchunks - 1) cheap `Arc` clones instead of `nchunks` full-map clones.

3. `test(moarstats): unit tests for pure helpers` (ef26335)

  • New `#[cfg(test)] mod tests` in `moarstats.rs` with 7 unit tests pinning the invariants established above: NaN/Inf rejection in both float parsers, zero-stddev Pearson skew, all-ties Kendall returning `None` (not `NaN`), perfect-concordance tau, NMI's zero-entropy and epsilon-denominator guards, and `OUTLIER_*` index-bounds.

Deferred (follow-up PRs)

  • Welford-mean for univariate sums — changes output in the last few ULPs; needs a baseline-comparison run.
  • Parallel/sequential generic core refactor — `count_all_outliers` / `..._from_reader` and `compute_all_bivariatestats` / `..._sequential` collapse to an iterator-generic shared core (~600 lines of duplication). Worth its own PR to agree on the abstraction shape.
  • Submodule split of `moarstats.rs` — best after the dedup above.

Test plan

  • `cargo build --locked --bin qsv -F all_features`
  • `cargo test -F all_features test_moarstats` — all 73 integration tests pass
  • `cargo test --bin qsv -F all_features cmd::moarstats::tests` — all 7 new unit tests pass
  • `cargo clippy --locked --bin qsv -F all_features -- -D warnings` — clean
  • Sanity check on a real dataset with the `--bivariate` path before merge

🤖 Generated with Claude Code

Safety:
- Replace panic!("Field pair not found...") at two sites with
  CliError propagation. The parallel rayon closure now returns
  CliResult and is .collect()'d into CliResult<HashMap<_,_>>.
- to_str().unwrap() on temp stats path -> ok_or_else(CliError).
- Clamp Kendall's tau factors to >= 0 before sqrt so tie-heavy
  data cannot produce sqrt(NaN) from rounding.
- NMI: denominator == 0.0 -> denominator < f64::EPSILON, matching
  the epsilon-comparison pattern used elsewhere in the file.

Performance / cleanup:
- String interner: HashMap<String, String> (key == value) ->
  HashSet<String>, saving one clone per cache miss on the MI hot
  path.
- Named constants for outlier count indices
  (OUTLIER_EXTREME_LOWER..OUTLIER_TOTAL, OUTLIER_COUNTS_LEN)
  replace the ~25 stats.counts[0..=5] magic-index sites.
- Promote duplicated 10_000 parallel threshold to
  PARALLEL_THRESHOLD const with doc comment.
- Drop commented-out BivariateFieldInfo field initializers
  (// mean, // nullcount) from constructors.

All 73 test_moarstats integration tests pass; clippy clean.
- parse_float_opt / parse_float_opt_from_bytes now filter out
  non-finite values. Previously a cell parsed as NaN or ±Infinity
  (fast_float2 accepts these tokens) would poison Welford
  correlation state and propagate silently through all downstream
  statistics for the column.
- Parallel outlier and bivariate paths Arc-wrap the read-only
  fields_to_count / field_pairs HashMaps instead of deep-cloning
  them into every worker. One clone + (nchunks - 1) cheap Arc
  clones instead of nchunks full-map clones.
Add an in-file #[cfg(test)] module exercising the numeric helpers
that recent review fixes touched:

- parse_float_opt / parse_float_opt_from_bytes: NaN, ±Infinity,
  and 'inf'/'nan' string tokens are rejected.
- compute_pearson_skewness: zero-stddev and None-input edges.
- compute_kendall_tau: all-ties input returns None (not NaN) and
  perfect concordance returns 1.0.
- compute_normalized_mutual_information: zero-entropy early exit
  and epsilon guard on a subnormal denominator.
- Outlier index constants stay within OUTLIER_COUNTS_LEN.

These target the code paths changed by the recent safety fixes
so the invariants are pinned by tests, not just by the commit
message.
@codacy-production

codacy-production Bot commented Apr 19, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

TIP This summary will be updated as you push new changes. Give us feedback

- Pass fields_to_count / field_pairs into count_all_outliers and
  compute_all_bivariatestats by value instead of by reference.
  The parallel paths now move the map directly into Arc — zero
  map clones, down from one. The sequential fallbacks still
  borrow via & so the underlying from_reader helpers keep their
  shared-reference signatures.
- Document in the usage text that non-finite numeric tokens
  ("NaN", "Infinity", "-Infinity", and case variants) are treated
  as missing values and excluded from all statistical computations.
  Makes the NaN/Inf filtering added in 91cbf04 discoverable without
  reading the source.

Review also flagged the absence of unit tests for the NaN/Inf
filter; those landed in ef26335 (same branch) as
parse_float_opt_filters_nan_and_infinity and
parse_float_opt_from_bytes_filters_nan_and_infinity.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors moarstats to improve robustness and performance while adding unit tests for key helper functions; notably, non-finite numeric tokens are now filtered out to prevent poisoning bivariate/correlation calculations.

Changes:

  • Filter NaN/±Infinity in float parsers and harden Kendall’s tau / NMI edge cases.
  • Reduce parallel-path allocations by moving read-only field maps into Arc (and add shared constants for outlier count indices / parallel threshold).
  • Add unit tests for parsing and statistical helper invariants.

Comment thread src/cmd/moarstats.rs Outdated
Comment thread src/cmd/moarstats.rs Outdated
Comment thread src/cmd/moarstats.rs Outdated
- Reword the USAGE note about non-finite tokens: previously claimed
  they are counted toward null totals reported by the underlying
  `stats` command, but `stats` treats "NaN" / "Infinity" as valid
  TFloat observations via fast_float2::parse. The filter only
  affects moarstats' own derived computations, so the doc now scopes
  the promise accordingly.
- Delete the `input_path.to_str().unwrap_or("").to_string()` lines
  in the outlier and bivariate parallel paths. The outer scope
  already built a UTF-8-validated `input_path_string` via ok_or_else;
  the shadowing copy re-did the conversion and silently fell back to
  the empty string on invalid UTF-8.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread src/cmd/moarstats.rs Outdated
Comment thread src/cmd/moarstats.rs
- compute_normalized_mutual_information doc now mentions the
  f64::EPSILON guard on the denominator, matching the
  implementation (was still saying "denominator is 0").
- Rewrote the comments around the string interner in
  compute_chunk_bivariate and compute_all_bivariatestats_sequential
  to be honest: this is not true interning — each cache hit still
  clones the full String. The benefit over the prior
  HashMap<String, String> form is just one fewer clone on cache
  miss. Noted Arc<str>/SmolStr as the path to real interning, for
  a follow-up PR.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread src/cmd/moarstats.rs
Comment thread src/cmd/moarstats.rs
If rconfig_chunk.indexed() failed inside a parallel worker, the
code path sent Ok(HashMap::new()) through the channel and returned.
The aggregator then merged an empty result silently, which under-
counted outliers (or produced incorrect bivariate stats) without
any indication to the user.

Changed both the outlier and bivariate workers to send an
Err(CliError::Other) instead, distinguishing the two failure modes
(index missing vs. I/O error) and including the chunk index and
path in the message. The sibling seek-failure branch already used
this pattern — now the index-open branch is consistent with it.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

@jqnatividad jqnatividad merged commit 65a60e7 into master Apr 20, 2026
20 of 21 checks passed
@jqnatividad jqnatividad deleted the refactor-moarstats branch April 20, 2026 00:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants