refactor(moarstats): safety fixes, perf cleanup, and unit tests#3718
Merged
Conversation
Safety:
- Replace panic!("Field pair not found...") at two sites with
CliError propagation. The parallel rayon closure now returns
CliResult and is .collect()'d into CliResult<HashMap<_,_>>.
- to_str().unwrap() on temp stats path -> ok_or_else(CliError).
- Clamp Kendall's tau factors to >= 0 before sqrt so tie-heavy
data cannot produce sqrt(NaN) from rounding.
- NMI: denominator == 0.0 -> denominator < f64::EPSILON, matching
the epsilon-comparison pattern used elsewhere in the file.
Performance / cleanup:
- String interner: HashMap<String, String> (key == value) ->
HashSet<String>, saving one clone per cache miss on the MI hot
path.
- Named constants for outlier count indices
(OUTLIER_EXTREME_LOWER..OUTLIER_TOTAL, OUTLIER_COUNTS_LEN)
replace the ~25 stats.counts[0..=5] magic-index sites.
- Promote duplicated 10_000 parallel threshold to
PARALLEL_THRESHOLD const with doc comment.
- Drop commented-out BivariateFieldInfo field initializers
(// mean, // nullcount) from constructors.
All 73 test_moarstats integration tests pass; clippy clean.
- parse_float_opt / parse_float_opt_from_bytes now filter out non-finite values. Previously a cell parsed as NaN or ±Infinity (fast_float2 accepts these tokens) would poison Welford correlation state and propagate silently through all downstream statistics for the column. - Parallel outlier and bivariate paths Arc-wrap the read-only fields_to_count / field_pairs HashMaps instead of deep-cloning them into every worker. One clone + (nchunks - 1) cheap Arc clones instead of nchunks full-map clones.
Add an in-file #[cfg(test)] module exercising the numeric helpers that recent review fixes touched: - parse_float_opt / parse_float_opt_from_bytes: NaN, ±Infinity, and 'inf'/'nan' string tokens are rejected. - compute_pearson_skewness: zero-stddev and None-input edges. - compute_kendall_tau: all-ties input returns None (not NaN) and perfect concordance returns 1.0. - compute_normalized_mutual_information: zero-entropy early exit and epsilon guard on a subnormal denominator. - Outlier index constants stay within OUTLIER_COUNTS_LEN. These target the code paths changed by the recent safety fixes so the invariants are pinned by tests, not just by the commit message.
Up to standards ✅🟢 Issues
|
- Pass fields_to_count / field_pairs into count_all_outliers and
compute_all_bivariatestats by value instead of by reference.
The parallel paths now move the map directly into Arc — zero
map clones, down from one. The sequential fallbacks still
borrow via & so the underlying from_reader helpers keep their
shared-reference signatures.
- Document in the usage text that non-finite numeric tokens
("NaN", "Infinity", "-Infinity", and case variants) are treated
as missing values and excluded from all statistical computations.
Makes the NaN/Inf filtering added in 91cbf04 discoverable without
reading the source.
Review also flagged the absence of unit tests for the NaN/Inf
filter; those landed in ef26335 (same branch) as
parse_float_opt_filters_nan_and_infinity and
parse_float_opt_from_bytes_filters_nan_and_infinity.
Contributor
There was a problem hiding this comment.
Pull request overview
Refactors moarstats to improve robustness and performance while adding unit tests for key helper functions; notably, non-finite numeric tokens are now filtered out to prevent poisoning bivariate/correlation calculations.
Changes:
- Filter
NaN/±Infinityin float parsers and harden Kendall’s tau / NMI edge cases. - Reduce parallel-path allocations by moving read-only field maps into
Arc(and add shared constants for outlier count indices / parallel threshold). - Add unit tests for parsing and statistical helper invariants.
- Reword the USAGE note about non-finite tokens: previously claimed
they are counted toward null totals reported by the underlying
`stats` command, but `stats` treats "NaN" / "Infinity" as valid
TFloat observations via fast_float2::parse. The filter only
affects moarstats' own derived computations, so the doc now scopes
the promise accordingly.
- Delete the `input_path.to_str().unwrap_or("").to_string()` lines
in the outlier and bivariate parallel paths. The outer scope
already built a UTF-8-validated `input_path_string` via ok_or_else;
the shadowing copy re-did the conversion and silently fell back to
the empty string on invalid UTF-8.
- compute_normalized_mutual_information doc now mentions the f64::EPSILON guard on the denominator, matching the implementation (was still saying "denominator is 0"). - Rewrote the comments around the string interner in compute_chunk_bivariate and compute_all_bivariatestats_sequential to be honest: this is not true interning — each cache hit still clones the full String. The benefit over the prior HashMap<String, String> form is just one fewer clone on cache miss. Noted Arc<str>/SmolStr as the path to real interning, for a follow-up PR.
If rconfig_chunk.indexed() failed inside a parallel worker, the code path sent Ok(HashMap::new()) through the channel and returned. The aggregator then merged an empty result silently, which under- counted outliers (or produced incorrect bivariate stats) without any indication to the user. Changed both the outlier and bivariate workers to send an Err(CliError::Other) instead, distinguishing the two failure modes (index missing vs. I/O error) and including the chunk index and path in the message. The sibling seek-failure branch already used this pattern — now the index-open branch is consistent with it.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to the recent `moarstats` work (#3654 and subsequent). Splits a code review of `src/cmd/moarstats.rs` into three reviewable commits covering safety, perf, and test-coverage wins. No behavior change in the happy path; non-finite inputs are now filtered instead of silently poisoning correlation state.
1. `refactor(moarstats): safety fixes + allocation/const cleanup` (1eab485)
2. `perf(moarstats): NaN guard + Arc-share field maps across chunks` (91cbf04)
3. `test(moarstats): unit tests for pure helpers` (ef26335)
Deferred (follow-up PRs)
Test plan
🤖 Generated with Claude Code