frequency: review-driven correctness, perf and refactor cleanup#3745

jqnatividad · 2026-04-26T00:27:32Z

Summary

End-to-end code review of src/cmd/frequency.rs plus three follow-up refactors. Net −645 / +458 lines (-187 total) across four focused commits — no behavior changes for any user-visible feature, byte-identical output for all 160 frequency tests.

Commits

465dc34c0 — frequency: review-driven correctness, perf and docs fixes
- Correctness: RankStrategy::Ordinal no longer advances current_rank for sentinel-suppressed nulls (matches the other 4 strategies); group_by_count/group_by_weight cleanup; counts() suppresses misleading Other (0) rows; three unsafe { mem::transmute } calls replaced with the existing Selection::from_indices constructor; add_stat emits JsonValue::Null for NaN/Inf instead of silently coercing to 0; try_output_from_cache warns and falls back when the cache contains duplicate column names.
- Performance: move_null_to_end_if_needed now uses Vec::extract_if instead of an O(n²) Vec::remove shift loop; merge_ftables/merge_weighted_ftables fall back to sequential zip below 8 columns.
- Refactor: FrequencyField::stats gets #[serde(skip_serializing_if = "Vec::is_empty")] — deletes the brittle JSON regex post-processing and the manual TOON tree-walk; evaluate_stats_filter collapses 8 near-identical macros into a single bind! macro driven by mlua's IntoLua (Option<T> → Nil).
- Docs: QSV_FREQ_CHUNK_MEMORY_MB falls through to CPU-based on any non-u64 value; --unq-limit is unweighted-only; --null-text/--other-text <NONE> documented as case-sensitive exact match; --force clarified.
26a73afc1 — frequency: canonicalize cache path so symlinks/relative paths share cache
- Adds Args::cache_path_for helper that calls fs::canonicalize before computing the cache filename, so data.csv, ./data.csv, and symlinks all hit the same cache. Debug-level fallback to the uncanonicalized path on error.
f49a9f490 — frequency: dedup rank-emit logic into shared closure
- Both apply_ranking_strategy_unweighted and apply_ranking_strategy_weighted had 5 match arms (Dense / Min / Max / Ordinal / Average) that were ~90% identical. Extracted the per-row null/non-null branching plus count_sum/pct_sum accumulation into a closure within each function. Each strategy now just computes its own rank and calls the closure, returning a suppressed bool so Ordinal preserves its existing carve-out.
- Net −139 lines with byte-identical output for all 16 frequency_rank_ties_* tests.
e875024c7 — frequency: extract shared unweighted CSV emit loop
- run() and try_output_from_cache() had near-identical inner loops. Extracted into Args::emit_unweighted_csv_rows with its own scratch buffers (itoa/zmij/rank/value_str/processed_frequencies). Weighted branch in run() stays inline because its data shape differs.

Test plan

cargo build --locked --bin qsv -F all_features — clean
cargo test -F all_features — 2620 passed; 0 failed
cargo clippy -F all_features --bin qsv -- -W clippy::perf — no warnings
All 160 test_frequency::* tests pass byte-identical output before/after
All 16 test_frequency::frequency_rank_ties_* tests pass after the rank-emit dedup
All 30 test_frequency::frequency_jsonl_* cache tests pass after the canonicalize change

🤖 Generated with Claude Code

Correctness: - Ordinal ranking no longer advances current_rank for sentinel-suppressed null entries (unweighted + weighted), matching the other strategies. - group_by_count/group_by_weight: drop redundant emptiness guard and replace misleading unwrap() safety comment with proper if-let. - counts() now suppresses misleading "Other (0)" rows via the same other_unique_count > 0 guard the weighted path already had. - Replace three unsafe Vec→Selection transmutes with the existing Selection::from_indices constructor. - add_stat: emit JsonValue::Null for NaN/Inf instead of silently coercing to 0. - try_output_from_cache: warn and fall back to recomputation when the cache contains duplicate column names (last-wins lookup is ambiguous). Performance: - move_null_to_end_if_needed uses Vec::extract_if instead of an O(n²) Vec::remove shift loop. - merge_ftables/merge_weighted_ftables fall back to sequential zip below 8 columns to avoid rayon overhead. Refactor: - FrequencyField::stats gets #[serde(skip_serializing_if)]; delete the brittle JSON regex post-processing and the manual TOON tree-walk for empty stats arrays. - evaluate_stats_filter: collapse eight near-identical macros into a single bind! macro driven by mlua's IntoLua (Option<T> → Nil). Docs / USAGE: - Clarify QSV_FREQ_CHUNK_MEMORY_MB falls through to CPU-based chunking on any non-u64 value. - --unq-limit notes it is unweighted-only. - --null-text / --other-text "<NONE>" sentinel is documented as case-sensitive exact match. - --force description clarified. - Reword --force continuation so docopt does not re-parse it as an option definition (caught during testing — broke 137 tests). - Remove duplicate comment in parallel_ftables. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ache Both write_frequency_jsonl and read_frequency_cache built the cache path via `path.with_extension(...)` directly off the user-supplied input path. Running `qsv frequency data.csv` and then `qsv frequency ./data.csv` produced different cache filenames (or missed an existing cache through a symlink), defeating the cache silently. Add a small Self::cache_path_for helper that canonicalizes via fs::canonicalize before computing the extension, with a debug-level fallback to the uncanonicalized path on error so behavior is no worse than before. Both call sites now route through it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both apply_ranking_strategy_unweighted and apply_ranking_strategy_weighted had five match arms (Dense / Min / Max / Ordinal / Average) that were ~90% identical: same null vs non-null branching, same pct_nulls handling, same count_sum / pct_sum accumulation. The only real differences were when current_rank is updated and what rank value each row emits. Extract the per-row emit (the null/non-null branching plus the count_sum / pct_sum accumulation) into a closure within each function. Each strategy now just computes its own rank value and calls the closure, returning a `suppressed` bool so Ordinal can preserve its existing carve-out where sentinel-suppressed nulls do not consume a rank slot. All 16 frequency_rank_ties tests pass with byte-identical output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

run() and try_output_from_cache() each had a near-identical inner loop that built header_vec, called process_frequencies, then for each ProcessedFrequency formatted the rank, optionally visualized whitespace, and wrote one CSV row. Move that into Args::emit_unweighted_csv_rows, owning its own scratch buffers (itoa/zmij/rank/value_str/processed_frequencies). Both callers now write the column-header record, call the helper, and flush. The weighted branch in run() stays inline because its data shape differs (weighted column map by reference vs FTable by value). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codacy-production · 2026-04-26T00:28:01Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

Copilot

Pull request overview

This PR refines the frequency command implementation (src/cmd/frequency.rs) with review-driven correctness fixes, targeted performance improvements, and refactors that remove duplicated logic while preserving byte-identical output across the existing frequency test suite.

Changes:

Improves correctness/robustness around ranking, “Other” emission, NaN/Inf JSON stat handling, and cache validation fallbacks.
Optimizes hot paths (e.g., NULL-to-end movement and merge parallelism thresholds) and reduces duplicated CSV emit / ranking logic.
Simplifies JSON/TOON output generation by relying on serde to omit empty stats arrays (removing regex/tree-walk post-processing).

…lice cmp Three Copilot suggestions on PR #3745: 1. Selection-signature validation. In `--no-headers` mode, the cache stores entries keyed by position-within-selection ("1", "2", ...), so a cache built with `--select 1,2` would silently match a run with `--select 2,1` and return the wrong columns. Add a `selection_signature` field to `FrequencyCacheMetadata` (the selected-header bytes joined by ASCII Unit Separator), populate it in `write_frequency_jsonl`, and reject mismatched/empty signatures in `try_output_from_cache`. `#[serde(default)]` keeps old caches readable; they get invalidated on first read post-upgrade and regenerated cleanly. 2. Avoid per-field String allocations in `evaluate_stats_filter`. The previous `bind!` macro called `.clone().into_lua(lua)` on every field, which allocates a fresh `String` for `String` / `Option<String>` fields (`field`, `r#type`, `min`, `max`, `sort_order`, `mode`, `antimode`) on every column evaluation. Add `bind_str!` and `bind_opt_str!` variants that pass `&str` / `Option<&str>` so Lua copies directly without a Rust-side clone. `bind!` keeps its role for Copy fields (numerics, bool, Option<num>). 3. Use `.as_slice()` comparison in `move_null_to_end_if_needed` so the predicate reads as a clear borrow-only slice compare rather than relying on deref/autoref behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jqnatividad and others added 4 commits April 25, 2026 20:06

jqnatividad requested a review from Copilot April 26, 2026 00:30

Copilot started reviewing on behalf of jqnatividad April 26, 2026 00:30 View session

Copilot AI reviewed Apr 26, 2026

View reviewed changes

Comment thread src/cmd/frequency.rs

Comment thread src/cmd/frequency.rs

Comment thread src/cmd/frequency.rs Outdated

jqnatividad merged commit 0e07c98 into master Apr 26, 2026
17 checks passed

jqnatividad deleted the frequency-review-202604 branch April 26, 2026 01:51

jqnatividad mentioned this pull request Apr 26, 2026

exclude: review-driven cleanup, stdin support and memcheck #3749

Merged

5 tasks

BrewTestBot mentioned this pull request May 3, 2026

qsv 20.0.0 Homebrew/homebrew-core#280687

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

frequency: review-driven correctness, perf and refactor cleanup#3745