Refactor(moarstats): collapse duplicated outlier bivariate scan code#3719
Merged
jqnatividad merged 4 commits intoApr 20, 2026
Merged
Conversation
…hunk fn count_chunk_outliers is already generic over any Iterator<Item = csv::Result<csv::ByteRecord>>, so the dedicated *_from_reader helper was a verbatim duplicate (~135 lines). Both call sites in count_all_outliers now feed rdr.byte_records() directly into count_chunk_outliers. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The per-pair Pearson/Spearman/Kendall/covariance/MI/NMI computation (zero-variance early exit, cardinality-threshold gating, MI-then-NMI ordering) was duplicated between the parallel and sequential bivariate paths — ~200 lines of nearly identical logic, with the cardinality-skip log message and field-name lookup repeated four times. Extract a single finalize_bivariate_pair_stats(pair_key, chunk_stats, field_pairs, field_names, cardinality_threshold, stats_config) helper that owns all of it. Wire the parallel path's into_par_iter().map(...) to call it in one line. Computing exceeds_cardinality and the log_skip closure once collapses the four-way MI/NMI branching into one shape. Marginal-frequency computation (x_counts/y_counts from xy_counts) stays in the caller — the parallel path can keep its single mutating pass over all pairs before the par_iter, and a future caller could share marginals across pairs. The sequential path still has its own inline finalization in this commit; the next commit collapses it to call this helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
compute_all_bivariatestats_sequential previously reimplemented the entire per-record scan that compute_chunk_bivariate already does, just with a StringRecord-shaped tuple instead of BivariateChunkStats — and then duplicated the per-pair finalization too. Together that was ~365 lines that diverged in subtle ways (e.g. the sequential path errored on whole-record UTF-8 issues via rdr.records() while the chunk path silently skips invalid cells per-cell). Treat the whole file as a single chunk: feed rdr.byte_records() into compute_chunk_bivariate and call finalize_bivariate_pair_stats per pair. The new sequential body is ~70 lines. To preserve the per-1000-record progress ticks without coupling the chunk function to a ProgressBar parameter, wrap byte_records() with .inspect() — the closure increments a counter and ticks the bar without the chunk fn needing to know about progress. Progress-bar style setup is hoisted out of the per-record loop into a one-shot before iteration. Behavior shift worth noting: the sequential path now matches the chunk path's UTF-8 tolerance (skips invalid cells instead of erroring out the whole run). This removes a divergence from the parallel path that used to run on the same inputs depending on file size / index presence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
|
Contributor
There was a problem hiding this comment.
Pull request overview
Refactors moarstats bivariate-statistics computation to centralize per-pair finalization logic and reuse the same scan/finalize path across parallel and sequential execution, while also simplifying sequential outlier counting to reuse the existing chunk outlier scanner.
Changes:
- Replaced sequential outlier counting helper with
count_chunk_outliers(..., rdr.byte_records())to reuse the chunk-based scanner. - Introduced
finalize_bivariate_pair_stats(...)and used it in both parallel and sequential bivariate finalization to remove duplicated logic. - Reworked sequential bivariate processing to scan via
compute_chunk_bivariate(...), then compute marginals (when needed) and finalize via the shared helper.
…bivariate compute_chunk_bivariate previously allocated an owned String for every needed column on every record (via record_strings.insert(col, s.to_string())), even when only Pearson/covariance/Spearman/Kendall on non-date columns was requested — i.e. the strings were allocated, used as a non-empty sentinel, then discarded. This was a pre-existing inefficiency that became visible in the sequential bivariate path after it started routing through the chunk function instead of the prior StringRecord-based scanner. Drop the record_strings HashMap and the col_indices_needed scan. Inside the inner loop: - Read bytes directly from the ByteRecord; non-empty checks stay byte-level. - Only call from_utf8 lazily for the two paths that actually need a &str: date parsing and frequency-count interning. - Numeric (non-date) parsing reads bytes via parse_float_opt_from_bytes, so the common case allocates nothing per record. Per-cell from_utf8 is a memchr-class byte scan; replacing it for the allocation it used to require is a clear win on large inputs. Both the parallel and sequential paths benefit. Behavior is preserved for valid CSV inputs. The only edge case shift: a cell with truly invalid UTF-8 in a column used for purely numeric Pearson is no longer skipped (it's parsed as bytes), but parse_float_opt_from_bytes will reject it the same way it rejects any non-numeric input — only genuinely-numeric ASCII parses, so the resulting stats match. Addresses Copilot review feedback on PR #3719. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.