Skip to content

Refactor(moarstats): collapse duplicated outlier bivariate scan code#3719

Merged
jqnatividad merged 4 commits into
masterfrom
refactor(moarstats)-collapse-duplicated-outlier-bivariate-scan-code
Apr 20, 2026
Merged

Refactor(moarstats): collapse duplicated outlier bivariate scan code#3719
jqnatividad merged 4 commits into
masterfrom
refactor(moarstats)-collapse-duplicated-outlier-bivariate-scan-code

Conversation

@jqnatividad

Copy link
Copy Markdown
Collaborator

No description provided.

jqnatividad and others added 3 commits April 19, 2026 21:05
…hunk fn

count_chunk_outliers is already generic over any
Iterator<Item = csv::Result<csv::ByteRecord>>, so the dedicated
*_from_reader helper was a verbatim duplicate (~135 lines). Both call
sites in count_all_outliers now feed rdr.byte_records() directly into
count_chunk_outliers. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The per-pair Pearson/Spearman/Kendall/covariance/MI/NMI computation
(zero-variance early exit, cardinality-threshold gating, MI-then-NMI
ordering) was duplicated between the parallel and sequential bivariate
paths — ~200 lines of nearly identical logic, with the cardinality-skip
log message and field-name lookup repeated four times.

Extract a single finalize_bivariate_pair_stats(pair_key, chunk_stats,
field_pairs, field_names, cardinality_threshold, stats_config) helper
that owns all of it. Wire the parallel path's into_par_iter().map(...)
to call it in one line. Computing exceeds_cardinality and the log_skip
closure once collapses the four-way MI/NMI branching into one shape.

Marginal-frequency computation (x_counts/y_counts from xy_counts) stays
in the caller — the parallel path can keep its single mutating pass over
all pairs before the par_iter, and a future caller could share marginals
across pairs.

The sequential path still has its own inline finalization in this commit;
the next commit collapses it to call this helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
compute_all_bivariatestats_sequential previously reimplemented the entire
per-record scan that compute_chunk_bivariate already does, just with a
StringRecord-shaped tuple instead of BivariateChunkStats — and then
duplicated the per-pair finalization too. Together that was ~365 lines
that diverged in subtle ways (e.g. the sequential path errored on
whole-record UTF-8 issues via rdr.records() while the chunk path
silently skips invalid cells per-cell).

Treat the whole file as a single chunk: feed rdr.byte_records() into
compute_chunk_bivariate and call finalize_bivariate_pair_stats per pair.
The new sequential body is ~70 lines.

To preserve the per-1000-record progress ticks without coupling the
chunk function to a ProgressBar parameter, wrap byte_records() with
.inspect() — the closure increments a counter and ticks the bar without
the chunk fn needing to know about progress. Progress-bar style setup is
hoisted out of the per-record loop into a one-shot before iteration.

Behavior shift worth noting: the sequential path now matches the chunk
path's UTF-8 tolerance (skips invalid cells instead of erroring out the
whole run). This removes a divergence from the parallel path that used to
run on the same inputs depending on file size / index presence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codacy-production

codacy-production Bot commented Apr 20, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

TIP This summary will be updated as you push new changes. Give us feedback

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors moarstats bivariate-statistics computation to centralize per-pair finalization logic and reuse the same scan/finalize path across parallel and sequential execution, while also simplifying sequential outlier counting to reuse the existing chunk outlier scanner.

Changes:

  • Replaced sequential outlier counting helper with count_chunk_outliers(..., rdr.byte_records()) to reuse the chunk-based scanner.
  • Introduced finalize_bivariate_pair_stats(...) and used it in both parallel and sequential bivariate finalization to remove duplicated logic.
  • Reworked sequential bivariate processing to scan via compute_chunk_bivariate(...), then compute marginals (when needed) and finalize via the shared helper.

Comment thread src/cmd/moarstats.rs
…bivariate

compute_chunk_bivariate previously allocated an owned String for every
needed column on every record (via record_strings.insert(col, s.to_string())),
even when only Pearson/covariance/Spearman/Kendall on non-date columns was
requested — i.e. the strings were allocated, used as a non-empty sentinel,
then discarded. This was a pre-existing inefficiency that became visible in
the sequential bivariate path after it started routing through the chunk
function instead of the prior StringRecord-based scanner.

Drop the record_strings HashMap and the col_indices_needed scan. Inside the
inner loop:
- Read bytes directly from the ByteRecord; non-empty checks stay byte-level.
- Only call from_utf8 lazily for the two paths that actually need a &str:
  date parsing and frequency-count interning.
- Numeric (non-date) parsing reads bytes via parse_float_opt_from_bytes,
  so the common case allocates nothing per record.

Per-cell from_utf8 is a memchr-class byte scan; replacing it for the
allocation it used to require is a clear win on large inputs. Both the
parallel and sequential paths benefit.

Behavior is preserved for valid CSV inputs. The only edge case shift: a
cell with truly invalid UTF-8 in a column used for purely numeric Pearson
is no longer skipped (it's parsed as bytes), but parse_float_opt_from_bytes
will reject it the same way it rejects any non-numeric input — only
genuinely-numeric ASCII parses, so the resulting stats match.

Addresses Copilot review feedback on PR #3719.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 480815a into master Apr 20, 2026
15 checks passed
@jqnatividad jqnatividad deleted the refactor(moarstats)-collapse-duplicated-outlier-bivariate-scan-code branch April 20, 2026 01:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants