Refactor(moarstats): collapse duplicated outlier bivariate scan code#3719

jqnatividad · 2026-04-20T01:10:52Z

No description provided.

…hunk fn count_chunk_outliers is already generic over any Iterator<Item = csv::Result<csv::ByteRecord>>, so the dedicated *_from_reader helper was a verbatim duplicate (~135 lines). Both call sites in count_all_outliers now feed rdr.byte_records() directly into count_chunk_outliers. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The per-pair Pearson/Spearman/Kendall/covariance/MI/NMI computation (zero-variance early exit, cardinality-threshold gating, MI-then-NMI ordering) was duplicated between the parallel and sequential bivariate paths — ~200 lines of nearly identical logic, with the cardinality-skip log message and field-name lookup repeated four times. Extract a single finalize_bivariate_pair_stats(pair_key, chunk_stats, field_pairs, field_names, cardinality_threshold, stats_config) helper that owns all of it. Wire the parallel path's into_par_iter().map(...) to call it in one line. Computing exceeds_cardinality and the log_skip closure once collapses the four-way MI/NMI branching into one shape. Marginal-frequency computation (x_counts/y_counts from xy_counts) stays in the caller — the parallel path can keep its single mutating pass over all pairs before the par_iter, and a future caller could share marginals across pairs. The sequential path still has its own inline finalization in this commit; the next commit collapses it to call this helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

compute_all_bivariatestats_sequential previously reimplemented the entire per-record scan that compute_chunk_bivariate already does, just with a StringRecord-shaped tuple instead of BivariateChunkStats — and then duplicated the per-pair finalization too. Together that was ~365 lines that diverged in subtle ways (e.g. the sequential path errored on whole-record UTF-8 issues via rdr.records() while the chunk path silently skips invalid cells per-cell). Treat the whole file as a single chunk: feed rdr.byte_records() into compute_chunk_bivariate and call finalize_bivariate_pair_stats per pair. The new sequential body is ~70 lines. To preserve the per-1000-record progress ticks without coupling the chunk function to a ProgressBar parameter, wrap byte_records() with .inspect() — the closure increments a counter and ticks the bar without the chunk fn needing to know about progress. Progress-bar style setup is hoisted out of the per-record loop into a one-shot before iteration. Behavior shift worth noting: the sequential path now matches the chunk path's UTF-8 tolerance (skips invalid cells instead of erroring out the whole run). This removes a divergence from the parallel path that used to run on the same inputs depending on file size / index presence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codacy-production · 2026-04-20T01:11:30Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

_{TIP This summary will be updated as you push new changes. Give us feedback}

Copilot

Pull request overview

Refactors moarstats bivariate-statistics computation to centralize per-pair finalization logic and reuse the same scan/finalize path across parallel and sequential execution, while also simplifying sequential outlier counting to reuse the existing chunk outlier scanner.

Changes:

Replaced sequential outlier counting helper with count_chunk_outliers(..., rdr.byte_records()) to reuse the chunk-based scanner.
Introduced finalize_bivariate_pair_stats(...) and used it in both parallel and sequential bivariate finalization to remove duplicated logic.
Reworked sequential bivariate processing to scan via compute_chunk_bivariate(...), then compute marginals (when needed) and finalize via the shared helper.

…bivariate compute_chunk_bivariate previously allocated an owned String for every needed column on every record (via record_strings.insert(col, s.to_string())), even when only Pearson/covariance/Spearman/Kendall on non-date columns was requested — i.e. the strings were allocated, used as a non-empty sentinel, then discarded. This was a pre-existing inefficiency that became visible in the sequential bivariate path after it started routing through the chunk function instead of the prior StringRecord-based scanner. Drop the record_strings HashMap and the col_indices_needed scan. Inside the inner loop: - Read bytes directly from the ByteRecord; non-empty checks stay byte-level. - Only call from_utf8 lazily for the two paths that actually need a &str: date parsing and frequency-count interning. - Numeric (non-date) parsing reads bytes via parse_float_opt_from_bytes, so the common case allocates nothing per record. Per-cell from_utf8 is a memchr-class byte scan; replacing it for the allocation it used to require is a clear win on large inputs. Both the parallel and sequential paths benefit. Behavior is preserved for valid CSV inputs. The only edge case shift: a cell with truly invalid UTF-8 in a column used for purely numeric Pearson is no longer skipped (it's parsed as bytes), but parse_float_opt_from_bytes will reject it the same way it rejects any non-numeric input — only genuinely-numeric ASCII parses, so the resulting stats match. Addresses Copilot review feedback on PR #3719. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jqnatividad and others added 3 commits April 19, 2026 21:05

jqnatividad requested a review from Copilot April 20, 2026 01:10

Copilot started reviewing on behalf of jqnatividad April 20, 2026 01:11 View session

Copilot AI reviewed Apr 20, 2026

View reviewed changes

Comment thread src/cmd/moarstats.rs

jqnatividad merged commit 480815a into master Apr 20, 2026
15 checks passed

jqnatividad deleted the refactor(moarstats)-collapse-duplicated-outlier-bivariate-scan-code branch April 20, 2026 01:35

BrewTestBot mentioned this pull request May 3, 2026

qsv 20.0.0 Homebrew/homebrew-core#280687

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor(moarstats): collapse duplicated outlier bivariate scan code#3719

Refactor(moarstats): collapse duplicated outlier bivariate scan code#3719
jqnatividad merged 4 commits into
masterfrom
refactor(moarstats)-collapse-duplicated-outlier-bivariate-scan-code

jqnatividad commented Apr 20, 2026

Uh oh!

codacy-production Bot commented Apr 20, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jqnatividad commented Apr 20, 2026

Uh oh!

codacy-production Bot commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Up to standards ✅

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codacy-production Bot commented Apr 20, 2026 •

edited

Loading