feat(pivotp): expand smart aggregation with 7 more statistics#3699
Merged
Conversation
Leverage additional StatsData fields for smarter --agg smart decisions: Numeric (in suggest_numeric_after_bimodality): - mad_stddev_ratio < 0.5: outliers inflate stddev beyond MAD → Median - median_mean_ratio outside 0.7–1.3: mean/median diverge → Median - quartile_coefficient_dispersion > 0.5: robust variability → Median - n_negative + n_positive both > 20%: mixed-sign cancellation → Mean Numeric (in suggest_agg_function): - mode_count > 10: complex multimodal distribution → Len Date/DateTime: - sparsity > 0.5: sparse date column → Len - sort_order == Ascending: temporal recency → Last Add 7 corresponding tests covering each new signal path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 50 |
TIP This summary will be updated as you push new changes. Give us feedback
- Date/DateTime sparsity message now includes the type
(e.g. "sparse Date column") to distinguish from numeric sparsity
- Fix misleading comment in mixed-sign test: mean ≈ 0 makes CV NaN,
not "< 1"
- Use vec![].join("\n") for date test CSV strings to prevent
formatter from corrupting line continuations
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Expands pivotp --agg smart to use additional statistics (including new moarstats-derived signals) to choose more appropriate aggregations for numeric and temporal value columns, and adds tests covering the new decision paths.
Changes:
- Enhanced smart aggregation heuristics for numeric columns (multimodality, robust-spread divergence, mean/median divergence, quartile dispersion, mixed-sign guard).
- Enhanced smart aggregation heuristics for Date/DateTime columns (sparsity and ascending sort order handling).
- Added 7 new integration tests and updated
--agg smarthelp text to document the signals.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
src/cmd/pivotp.rs |
Updates --agg smart help text and adds new smart-aggregation decision branches for numeric and Date/DateTime columns. |
tests/test_pivotp.rs |
Adds 7 tests to exercise each newly added smart-aggregation signal path. |
- Move mixed-sign check before high-cardinality Sum branch so it can't be bypassed; compute is_mixed_sign once, use in both paths - Replace u64→f64 casts with integer arithmetic (n*5 > total ≡ >20%), remove #[allow(clippy::cast_precision_loss)] from the function - Extract log_mixed_sign() helper to avoid duplicating the message - Clarify help text: skewness/mode_count/sort_order require a prior `stats --everything` or `stats --mode --quartiles` run; streaming stats (type, cardinality, sparsity, CV, n_negative/n_positive) are always available - Fix multimodal test comment: mode_count is from base stats, not moarstats; moarstats is needed to prevent cache overwrite Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move sort_order to streaming stats list in help text (it is always
present in the 31-column base output, not gated behind --quartiles)
- Use starts_with("Ascending") instead of == for robustness,
matching the pattern used in stats.rs itself
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
src/cmd/pivotp.rs:549
- In the Date/DateTime path, the
high_cardinality_pivot || high_cardinality_indexbranch returns before thesort_order.starts_with("Ascending")check below. This means a value column that is globally ascending can still selectFirst(when the pivot/index aren’t both ordered), which doesn’t match the PR description’s “ascending sort order → Last” signal. Consider checking the value column’s ascendingsort_orderbefore the high-cardinality branch, or incorporating the ascending check into that branch as well soLastis preferred whenever the temporal values are ascending.
} else if high_cardinality_pivot || high_cardinality_index {
if ordered_pivot && ordered_index {
if !quiet {
eprintln!(
"Info: Ordered temporal data with high cardinality, using Last"
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
--agg smartto consult 7 additional statistics fromStatsDatafor better aggregation decisionsmad_stddev_ratio,median_mean_ratio,quartile_coefficient_dispersion→ Median;n_negative/n_positivemixed-sign guard → Mean;mode_count > 10multimodal → Len--agg smarthelp text to document all signalsTest plan
cargo test pivotp -F all_features— all 56 tests pass (49 existing + 7 new)cargo clippy --bin qsv -F all_features -- -W clippy::pedantic— zero warnings🤖 Generated with Claude Code