fix(stats): preserve length & lex stats when column type widens to String#3856
Merged
Conversation
…ring
When a column's inferred type widens to TString mid-stream (Date→String
under --infer-dates, Integer→String, Float→String, DateTime→String,
etc.), TypedSum::stotlen, TypedMinMax::{str_len, strings}, and
Stats::online_len previously only recorded samples that arrived *after*
the widen. Earlier rows — processed under a more specific type —
never contributed, so sum_length / avg_length / min_length /
max_length / lex min / lex max / sort_order / sortiness /
stddev_length / variance_length / cv_length all reflected only the
post-widen tail. The same bug also dropped pre-merge TNull samples
from online_len for pure-String columns, skewing stddev_length /
variance_length / cv_length for any String column with leading
empties.
Track these fields unconditionally on every (non-empty) sample
regardless of the dispatched type. show() already gates length-stat
emission on the final typ == TString, so columns that never widen are
unaffected in output and incur only a small per-sample tracking cost.
Commute::merge impls for TypedSum, TypedMinMax, and OnlineStats
already accumulate the affected fields correctly, so parallel-chunk
merging is unchanged.
Tests: 5 new regression tests covering Date→String (the issue's
reproducer), Integer→String, Float→String, DateTime→String widening,
plus a pure-Integer non-regression sanity check.
Fixture updates (post-fix values are equal-or-more-correct, never
regressed): boston311-10-boolean-1or0-stats.csv,
boston311-10-boolean-tf-stats.csv,
boston311-100-antimodes-len500-stats.csv, and
boston311-100-everything-date-stats-variance-stddev.csv updated for
stddev_length / variance_length / cv_length on String columns with
interleaved NULLs. Inline expectations in test_moarstats.rs (basic,
advanced, advanced_atkinson_epsilon) and test_stats.rs
(stats_percentiles_mixed_types) updated for the same reason.
Fixes #3855.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
|
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes stats length and lexicographic string statistics when inferred column types widen to String, ensuring earlier typed samples are included in final String outputs.
Changes:
- Tracks string length totals, min/max length, lexical min/max, and online length stats across all samples before final type stabilization.
- Adds regression tests for Date/Integer/Float/DateTime-to-String widening.
- Updates expected stats fixtures for corrected String length variance/stddev/CV values.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
src/cmd/stats.rs |
Moves String length/lex tracking outside String-only branches. |
tests/test_stats.rs |
Updates expected mixed-type output and adds widening regression tests. |
tests/test_moarstats.rs |
Updates expected fixture output for corrected String length stats. |
resources/test/boston311-100-everything-date-stats-variance-stddev.csv |
Updates corrected stats fixture values. |
resources/test/boston311-100-antimodes-len500-stats.csv |
Updates corrected stats fixture values. |
resources/test/boston311-10-boolean-tf-stats.csv |
Updates corrected stats fixture values. |
resources/test/boston311-10-boolean-1or0-stats.csv |
Updates corrected stats fixture values. |
- Stats::add: expand comment to document the intentional behavior — null/
empty samples contribute a zero-length entry to online_len, matching
pre-fix behavior for nulls that arrived after the column widened to
TString. Also add debug_assert!(online_len.is_some()) to catch
invariant violations in debug builds (the typesonly early-returns
guarantee Some here).
- TypedSum::add_with_parsed: add caller-invariant comment — Stats::add
guards the call with `if b"" != sample`, and FieldType::from_sample
returns TNull only for empty samples, so the function is never
invoked with typ==TNull and a non-empty sample.
- stats_widen_date_to_string_issue3855: add a comment explaining that
sort_order is lexical, not chronological ("2026-05-13" < "NA" because
'2' < 'N').
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add stddev_length / variance_length / cv_length assertions to the four widening regression tests so a regression that re-gates online_len.add under `t == TString` is caught directly on the widening paths (Date/Integer/Float/DateTime → String), not only via fixture diffs on existing null-heavy String columns. The pure-Integer non-regression test also asserts these fields are empty, confirming `show()` still gates emission on the final typ == TString. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ubuntu-latest runner has crashed mid-build with "No space left on device" while compiling qsv with `-F all_features` (pyo3, datasketches, polars, etc.), e.g. run 25923503208 on PR #3856. Add the jlumbroso/free-disk-space action (pinned to v1.3.1 SHA) right after the checkouts to remove ~20+ GB of pre-installed tooling we don't use: Android SDK, .NET, Haskell, and swap. Keep tool-cache and large-packages disabled — the former may hold useful runtimes, the latter takes ~3 min of apt-get cleanup we don't need. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
TStringmid-stream (Date→String under--infer-dates, Integer→String, Float→String, DateTime→String, etc.),TypedSum::stotlen,TypedMinMax::{str_len, strings}, andStats::online_lenpreviously only recorded samples that arrived after the widen — sosum_length/avg_length/min_length/max_length/lexmin/max/sort_order/sortiness/stddev_length/variance_length/cv_lengthall reflected only the post-widen tail. The same bug also dropped pre-mergeTNullsamples fromonline_lenon pure-String columns.show()already gates length-stat emission on the finaltyp == TString, so columns that never widen are unaffected in output.Commute::mergeimpls forTypedSum,TypedMinMax, andOnlineStatsalready accumulate the affected fields correctly, so parallel-chunk merging is unchanged.Changes
src/cmd/stats.rs:TypedSum::add_with_parsed— trackstotlenunconditionallyTypedMinMax::add_with_parsed— trackstr_len+strings(lex min/max) unconditionally on non-empty samplesStats::add— liftonline_len.addout of thet == TStringblock; useif let Somefor defense-in-depthtests/test_stats.rs— 5 new regression tests (stats_widen_*_issue3855) covering each widening pathtests/test_moarstats.rs,tests/test_stats.rs,resources/test/*-stats.csv— fixture updates for the equal-or-more-correct post-fix values; diffs are confined exclusively tostddev_length/variance_length/cv_lengthon String columns with interleaved NULLs (where the pre-fix code skipped NULLs that arrived before the column type stabilized toTString)Fixes #3855.
Test plan
cargo build --locked --bin qsv -F all_featurescargo test --test tests stats_widen -F all_features— 5 new tests passcargo test --test tests stats -F all_features— 749 stats tests, 0 failedcargo test --test tests moarstats -F all_features— 88 tests, 0 failedcargo test -F all_features— 2821 tests, 0 failedcargo +nightly fmt --checkcleancargo clippy -F all_featuresclean🤖 Generated with Claude Code