Skip to content

feat(pivotp): expand smart aggregation with 7 more statistics#3699

Merged
jqnatividad merged 5 commits into
masterfrom
pivotp-even-smarter-aggregation
Apr 12, 2026
Merged

feat(pivotp): expand smart aggregation with 7 more statistics#3699
jqnatividad merged 5 commits into
masterfrom
pivotp-even-smarter-aggregation

Conversation

@jqnatividad

Copy link
Copy Markdown
Collaborator

Summary

  • Expand --agg smart to consult 7 additional statistics from StatsData for better aggregation decisions
  • Numeric: mad_stddev_ratio, median_mean_ratio, quartile_coefficient_dispersion → Median; n_negative/n_positive mixed-sign guard → Mean; mode_count > 10 multimodal → Len
  • Date/DateTime: sparsity > 50% → Len; ascending sort order → Last
  • Add 7 tests covering each new signal path
  • Update --agg smart help text to document all signals

Test plan

  • cargo test pivotp -F all_features — all 56 tests pass (49 existing + 7 new)
  • cargo clippy --bin qsv -F all_features -- -W clippy::pedantic — zero warnings
  • Manual verification with crafted CSVs for each new signal (MAD/stddev divergence, median/mean divergence, QCD, mixed-sign, multimodal, sparse dates, ascending dates)

🤖 Generated with Claude Code

Leverage additional StatsData fields for smarter --agg smart decisions:

Numeric (in suggest_numeric_after_bimodality):
- mad_stddev_ratio < 0.5: outliers inflate stddev beyond MAD → Median
- median_mean_ratio outside 0.7–1.3: mean/median diverge → Median
- quartile_coefficient_dispersion > 0.5: robust variability → Median
- n_negative + n_positive both > 20%: mixed-sign cancellation → Mean

Numeric (in suggest_agg_function):
- mode_count > 10: complex multimodal distribution → Len

Date/DateTime:
- sparsity > 0.5: sparse date column → Len
- sort_order == Ascending: temporal recency → Last

Add 7 corresponding tests covering each new signal path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codacy-production

codacy-production Bot commented Apr 12, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 50 complexity

Metric Results
Complexity 50

View in Codacy

TIP This summary will be updated as you push new changes. Give us feedback

- Date/DateTime sparsity message now includes the type
  (e.g. "sparse Date column") to distinguish from numeric sparsity
- Fix misleading comment in mixed-sign test: mean ≈ 0 makes CV NaN,
  not "< 1"
- Use vec![].join("\n") for date test CSV strings to prevent
  formatter from corrupting line continuations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Expands pivotp --agg smart to use additional statistics (including new moarstats-derived signals) to choose more appropriate aggregations for numeric and temporal value columns, and adds tests covering the new decision paths.

Changes:

  • Enhanced smart aggregation heuristics for numeric columns (multimodality, robust-spread divergence, mean/median divergence, quartile dispersion, mixed-sign guard).
  • Enhanced smart aggregation heuristics for Date/DateTime columns (sparsity and ascending sort order handling).
  • Added 7 new integration tests and updated --agg smart help text to document the signals.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
src/cmd/pivotp.rs Updates --agg smart help text and adds new smart-aggregation decision branches for numeric and Date/DateTime columns.
tests/test_pivotp.rs Adds 7 tests to exercise each newly added smart-aggregation signal path.

Comment thread src/cmd/pivotp.rs Outdated
Comment thread src/cmd/pivotp.rs Outdated
Comment thread src/cmd/pivotp.rs Outdated
Comment thread tests/test_pivotp.rs
jqnatividad and others added 2 commits April 11, 2026 23:50
- Move mixed-sign check before high-cardinality Sum branch so it
  can't be bypassed; compute is_mixed_sign once, use in both paths
- Replace u64→f64 casts with integer arithmetic (n*5 > total ≡ >20%),
  remove #[allow(clippy::cast_precision_loss)] from the function
- Extract log_mixed_sign() helper to avoid duplicating the message
- Clarify help text: skewness/mode_count/sort_order require a prior
  `stats --everything` or `stats --mode --quartiles` run; streaming
  stats (type, cardinality, sparsity, CV, n_negative/n_positive)
  are always available
- Fix multimodal test comment: mode_count is from base stats, not
  moarstats; moarstats is needed to prevent cache overwrite

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread src/cmd/pivotp.rs Outdated
Comment thread src/cmd/pivotp.rs Outdated
- Move sort_order to streaming stats list in help text (it is always
  present in the 31-column base output, not gated behind --quartiles)
- Use starts_with("Ascending") instead of == for robustness,
  matching the pattern used in stats.rs itself

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

src/cmd/pivotp.rs:549

  • In the Date/DateTime path, the high_cardinality_pivot || high_cardinality_index branch returns before the sort_order.starts_with("Ascending") check below. This means a value column that is globally ascending can still select First (when the pivot/index aren’t both ordered), which doesn’t match the PR description’s “ascending sort order → Last” signal. Consider checking the value column’s ascending sort_order before the high-cardinality branch, or incorporating the ascending check into that branch as well so Last is preferred whenever the temporal values are ascending.
                } else if high_cardinality_pivot || high_cardinality_index {
                    if ordered_pivot && ordered_index {
                        if !quiet {
                            eprintln!(
                                "Info: Ordered temporal data with high cardinality, using Last"

Comment thread tests/test_pivotp.rs
Comment thread tests/test_pivotp.rs
Comment thread tests/test_pivotp.rs
@jqnatividad jqnatividad merged commit 4e9e2c4 into master Apr 12, 2026
21 checks passed
@jqnatividad jqnatividad deleted the pivotp-even-smarter-aggregation branch April 12, 2026 10:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants