Skip to content

fix(stats): preserve length & lex stats when column type widens to String#3856

Merged
jqnatividad merged 4 commits into
masterfrom
fix-stats-3855
May 15, 2026
Merged

fix(stats): preserve length & lex stats when column type widens to String#3856
jqnatividad merged 4 commits into
masterfrom
fix-stats-3855

Conversation

@jqnatividad

Copy link
Copy Markdown
Collaborator

Summary

  • When a column's inferred type widens to TString mid-stream (Date→String under --infer-dates, Integer→String, Float→String, DateTime→String, etc.), TypedSum::stotlen, TypedMinMax::{str_len, strings}, and Stats::online_len previously only recorded samples that arrived after the widen — so sum_length/avg_length/min_length/max_length/lex min/max/sort_order/sortiness/stddev_length/variance_length/cv_length all reflected only the post-widen tail. The same bug also dropped pre-merge TNull samples from online_len on pure-String columns.
  • Fix: track these fields unconditionally on every (non-empty) sample regardless of dispatched type. show() already gates length-stat emission on the final typ == TString, so columns that never widen are unaffected in output. Commute::merge impls for TypedSum, TypedMinMax, and OnlineStats already accumulate the affected fields correctly, so parallel-chunk merging is unchanged.
  • Reproducer from issue now matches the expected output:
    $ qsv stats --infer-dates issue3855.csv | grep date2
    date2,String,true,,2026-05-13,NA,,Ascending,1,2,10,12,6,4,16,0.6667,...
    

Changes

  • src/cmd/stats.rs:
    • TypedSum::add_with_parsed — track stotlen unconditionally
    • TypedMinMax::add_with_parsed — track str_len + strings (lex min/max) unconditionally on non-empty samples
    • Stats::add — lift online_len.add out of the t == TString block; use if let Some for defense-in-depth
  • tests/test_stats.rs — 5 new regression tests (stats_widen_*_issue3855) covering each widening path
  • tests/test_moarstats.rs, tests/test_stats.rs, resources/test/*-stats.csv — fixture updates for the equal-or-more-correct post-fix values; diffs are confined exclusively to stddev_length/variance_length/cv_length on String columns with interleaved NULLs (where the pre-fix code skipped NULLs that arrived before the column type stabilized to TString)

Fixes #3855.

Test plan

  • cargo build --locked --bin qsv -F all_features
  • cargo test --test tests stats_widen -F all_features — 5 new tests pass
  • cargo test --test tests stats -F all_features — 749 stats tests, 0 failed
  • cargo test --test tests moarstats -F all_features — 88 tests, 0 failed
  • cargo test -F all_features — 2821 tests, 0 failed
  • cargo +nightly fmt --check clean
  • cargo clippy -F all_features clean
  • Reproducer from the issue produces expected output

🤖 Generated with Claude Code

…ring

When a column's inferred type widens to TString mid-stream (Date→String
under --infer-dates, Integer→String, Float→String, DateTime→String,
etc.), TypedSum::stotlen, TypedMinMax::{str_len, strings}, and
Stats::online_len previously only recorded samples that arrived *after*
the widen. Earlier rows — processed under a more specific type —
never contributed, so sum_length / avg_length / min_length /
max_length / lex min / lex max / sort_order / sortiness /
stddev_length / variance_length / cv_length all reflected only the
post-widen tail. The same bug also dropped pre-merge TNull samples
from online_len for pure-String columns, skewing stddev_length /
variance_length / cv_length for any String column with leading
empties.

Track these fields unconditionally on every (non-empty) sample
regardless of the dispatched type. show() already gates length-stat
emission on the final typ == TString, so columns that never widen are
unaffected in output and incur only a small per-sample tracking cost.
Commute::merge impls for TypedSum, TypedMinMax, and OnlineStats
already accumulate the affected fields correctly, so parallel-chunk
merging is unchanged.

Tests: 5 new regression tests covering Date→String (the issue's
reproducer), Integer→String, Float→String, DateTime→String widening,
plus a pure-Integer non-regression sanity check.

Fixture updates (post-fix values are equal-or-more-correct, never
regressed): boston311-10-boolean-1or0-stats.csv,
boston311-10-boolean-tf-stats.csv,
boston311-100-antimodes-len500-stats.csv, and
boston311-100-everything-date-stats-variance-stddev.csv updated for
stddev_length / variance_length / cv_length on String columns with
interleaved NULLs. Inline expectations in test_moarstats.rs (basic,
advanced, advanced_atkinson_epsilon) and test_stats.rs
(stats_percentiles_mixed_types) updated for the same reason.

Fixes #3855.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codacy-production

codacy-production Bot commented May 15, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes stats length and lexicographic string statistics when inferred column types widen to String, ensuring earlier typed samples are included in final String outputs.

Changes:

  • Tracks string length totals, min/max length, lexical min/max, and online length stats across all samples before final type stabilization.
  • Adds regression tests for Date/Integer/Float/DateTime-to-String widening.
  • Updates expected stats fixtures for corrected String length variance/stddev/CV values.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/cmd/stats.rs Moves String length/lex tracking outside String-only branches.
tests/test_stats.rs Updates expected mixed-type output and adds widening regression tests.
tests/test_moarstats.rs Updates expected fixture output for corrected String length stats.
resources/test/boston311-100-everything-date-stats-variance-stddev.csv Updates corrected stats fixture values.
resources/test/boston311-100-antimodes-len500-stats.csv Updates corrected stats fixture values.
resources/test/boston311-10-boolean-tf-stats.csv Updates corrected stats fixture values.
resources/test/boston311-10-boolean-1or0-stats.csv Updates corrected stats fixture values.

Comment thread src/cmd/stats.rs
jqnatividad and others added 3 commits May 15, 2026 10:28
- Stats::add: expand comment to document the intentional behavior — null/
  empty samples contribute a zero-length entry to online_len, matching
  pre-fix behavior for nulls that arrived after the column widened to
  TString. Also add debug_assert!(online_len.is_some()) to catch
  invariant violations in debug builds (the typesonly early-returns
  guarantee Some here).
- TypedSum::add_with_parsed: add caller-invariant comment — Stats::add
  guards the call with `if b"" != sample`, and FieldType::from_sample
  returns TNull only for empty samples, so the function is never
  invoked with typ==TNull and a non-empty sample.
- stats_widen_date_to_string_issue3855: add a comment explaining that
  sort_order is lexical, not chronological ("2026-05-13" < "NA" because
  '2' < 'N').

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add stddev_length / variance_length / cv_length assertions to the four
widening regression tests so a regression that re-gates online_len.add
under `t == TString` is caught directly on the widening paths
(Date/Integer/Float/DateTime → String), not only via fixture diffs on
existing null-heavy String columns. The pure-Integer non-regression
test also asserts these fields are empty, confirming `show()` still
gates emission on the final typ == TString.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ubuntu-latest runner has crashed mid-build with "No space left on
device" while compiling qsv with `-F all_features` (pyo3, datasketches,
polars, etc.), e.g. run 25923503208 on PR #3856. Add the
jlumbroso/free-disk-space action (pinned to v1.3.1 SHA) right after
the checkouts to remove ~20+ GB of pre-installed tooling we don't
use: Android SDK, .NET, Haskell, and swap. Keep tool-cache and
large-packages disabled — the former may hold useful runtimes, the
latter takes ~3 min of apt-get cleanup we don't need.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 33d57a5 into master May 15, 2026
17 checks passed
@jqnatividad jqnatividad deleted the fix-stats-3855 branch May 15, 2026 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: stats reports wrong max_length when string looks like a date

2 participants