frequency: review-driven correctness, perf and refactor cleanup#3745
Merged
Conversation
Correctness: - Ordinal ranking no longer advances current_rank for sentinel-suppressed null entries (unweighted + weighted), matching the other strategies. - group_by_count/group_by_weight: drop redundant emptiness guard and replace misleading unwrap() safety comment with proper if-let. - counts() now suppresses misleading "Other (0)" rows via the same other_unique_count > 0 guard the weighted path already had. - Replace three unsafe Vec→Selection transmutes with the existing Selection::from_indices constructor. - add_stat: emit JsonValue::Null for NaN/Inf instead of silently coercing to 0. - try_output_from_cache: warn and fall back to recomputation when the cache contains duplicate column names (last-wins lookup is ambiguous). Performance: - move_null_to_end_if_needed uses Vec::extract_if instead of an O(n²) Vec::remove shift loop. - merge_ftables/merge_weighted_ftables fall back to sequential zip below 8 columns to avoid rayon overhead. Refactor: - FrequencyField::stats gets #[serde(skip_serializing_if)]; delete the brittle JSON regex post-processing and the manual TOON tree-walk for empty stats arrays. - evaluate_stats_filter: collapse eight near-identical macros into a single bind! macro driven by mlua's IntoLua (Option<T> → Nil). Docs / USAGE: - Clarify QSV_FREQ_CHUNK_MEMORY_MB falls through to CPU-based chunking on any non-u64 value. - --unq-limit notes it is unweighted-only. - --null-text / --other-text "<NONE>" sentinel is documented as case-sensitive exact match. - --force description clarified. - Reword --force continuation so docopt does not re-parse it as an option definition (caught during testing — broke 137 tests). - Remove duplicate comment in parallel_ftables. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ache Both write_frequency_jsonl and read_frequency_cache built the cache path via `path.with_extension(...)` directly off the user-supplied input path. Running `qsv frequency data.csv` and then `qsv frequency ./data.csv` produced different cache filenames (or missed an existing cache through a symlink), defeating the cache silently. Add a small Self::cache_path_for helper that canonicalizes via fs::canonicalize before computing the extension, with a debug-level fallback to the uncanonicalized path on error so behavior is no worse than before. Both call sites now route through it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both apply_ranking_strategy_unweighted and apply_ranking_strategy_weighted had five match arms (Dense / Min / Max / Ordinal / Average) that were ~90% identical: same null vs non-null branching, same pct_nulls handling, same count_sum / pct_sum accumulation. The only real differences were when current_rank is updated and what rank value each row emits. Extract the per-row emit (the null/non-null branching plus the count_sum / pct_sum accumulation) into a closure within each function. Each strategy now just computes its own rank value and calls the closure, returning a `suppressed` bool so Ordinal can preserve its existing carve-out where sentinel-suppressed nulls do not consume a rank slot. All 16 frequency_rank_ties tests pass with byte-identical output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
run() and try_output_from_cache() each had a near-identical inner loop that built header_vec, called process_frequencies, then for each ProcessedFrequency formatted the rank, optionally visualized whitespace, and wrote one CSV row. Move that into Args::emit_unweighted_csv_rows, owning its own scratch buffers (itoa/zmij/rank/value_str/processed_frequencies). Both callers now write the column-header record, call the helper, and flush. The weighted branch in run() stays inline because its data shape differs (weighted column map by reference vs FTable by value). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
|
Contributor
There was a problem hiding this comment.
Pull request overview
This PR refines the frequency command implementation (src/cmd/frequency.rs) with review-driven correctness fixes, targeted performance improvements, and refactors that remove duplicated logic while preserving byte-identical output across the existing frequency test suite.
Changes:
- Improves correctness/robustness around ranking, “Other” emission, NaN/Inf JSON stat handling, and cache validation fallbacks.
- Optimizes hot paths (e.g., NULL-to-end movement and merge parallelism thresholds) and reduces duplicated CSV emit / ranking logic.
- Simplifies JSON/TOON output generation by relying on serde to omit empty
statsarrays (removing regex/tree-walk post-processing).
…lice cmp Three Copilot suggestions on PR #3745: 1. Selection-signature validation. In `--no-headers` mode, the cache stores entries keyed by position-within-selection ("1", "2", ...), so a cache built with `--select 1,2` would silently match a run with `--select 2,1` and return the wrong columns. Add a `selection_signature` field to `FrequencyCacheMetadata` (the selected-header bytes joined by ASCII Unit Separator), populate it in `write_frequency_jsonl`, and reject mismatched/empty signatures in `try_output_from_cache`. `#[serde(default)]` keeps old caches readable; they get invalidated on first read post-upgrade and regenerated cleanly. 2. Avoid per-field String allocations in `evaluate_stats_filter`. The previous `bind!` macro called `.clone().into_lua(lua)` on every field, which allocates a fresh `String` for `String` / `Option<String>` fields (`field`, `r#type`, `min`, `max`, `sort_order`, `mode`, `antimode`) on every column evaluation. Add `bind_str!` and `bind_opt_str!` variants that pass `&str` / `Option<&str>` so Lua copies directly without a Rust-side clone. `bind!` keeps its role for Copy fields (numerics, bool, Option<num>). 3. Use `.as_slice()` comparison in `move_null_to_end_if_needed` so the predicate reads as a clear borrow-only slice compare rather than relying on deref/autoref behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
End-to-end code review of
src/cmd/frequency.rsplus three follow-up refactors. Net −645 / +458 lines (-187 total) across four focused commits — no behavior changes for any user-visible feature, byte-identical output for all 160 frequency tests.Commits
465dc34c0—frequency: review-driven correctness, perf and docs fixesRankStrategy::Ordinalno longer advancescurrent_rankfor sentinel-suppressed nulls (matches the other 4 strategies);group_by_count/group_by_weightcleanup;counts()suppresses misleadingOther (0)rows; threeunsafe { mem::transmute }calls replaced with the existingSelection::from_indicesconstructor;add_statemitsJsonValue::Nullfor NaN/Inf instead of silently coercing to 0;try_output_from_cachewarns and falls back when the cache contains duplicate column names.move_null_to_end_if_needednow usesVec::extract_ifinstead of an O(n²)Vec::removeshift loop;merge_ftables/merge_weighted_ftablesfall back to sequential zip below 8 columns.FrequencyField::statsgets#[serde(skip_serializing_if = "Vec::is_empty")]— deletes the brittle JSON regex post-processing and the manual TOON tree-walk;evaluate_stats_filtercollapses 8 near-identical macros into a singlebind!macro driven by mlua'sIntoLua(Option<T>→ Nil).QSV_FREQ_CHUNK_MEMORY_MBfalls through to CPU-based on any non-u64value;--unq-limitis unweighted-only;--null-text/--other-text<NONE>documented as case-sensitive exact match;--forceclarified.26a73afc1—frequency: canonicalize cache path so symlinks/relative paths share cacheArgs::cache_path_forhelper that callsfs::canonicalizebefore computing the cache filename, sodata.csv,./data.csv, and symlinks all hit the same cache. Debug-level fallback to the uncanonicalized path on error.f49a9f490—frequency: dedup rank-emit logic into shared closureapply_ranking_strategy_unweightedandapply_ranking_strategy_weightedhad 5 match arms (Dense / Min / Max / Ordinal / Average) that were ~90% identical. Extracted the per-row null/non-null branching pluscount_sum/pct_sumaccumulation into a closure within each function. Each strategy now just computes its own rank and calls the closure, returning asuppressedbool so Ordinal preserves its existing carve-out.frequency_rank_ties_*tests.e875024c7—frequency: extract shared unweighted CSV emit looprun()andtry_output_from_cache()had near-identical inner loops. Extracted intoArgs::emit_unweighted_csv_rowswith its own scratch buffers (itoa/zmij/rank/value_str/processed_frequencies). Weighted branch inrun()stays inline because its data shape differs.Test plan
cargo build --locked --bin qsv -F all_features— cleancargo test -F all_features— 2620 passed; 0 failedcargo clippy -F all_features --bin qsv -- -W clippy::perf— no warningstest_frequency::*tests pass byte-identical output before/aftertest_frequency::frequency_rank_ties_*tests pass after the rank-emit deduptest_frequency::frequency_jsonl_*cache tests pass after the canonicalize change🤖 Generated with Claude Code