Skip to content

frequency: review-driven correctness, perf and refactor cleanup#3745

Merged
jqnatividad merged 5 commits into
masterfrom
frequency-review-202604
Apr 26, 2026
Merged

frequency: review-driven correctness, perf and refactor cleanup#3745
jqnatividad merged 5 commits into
masterfrom
frequency-review-202604

Conversation

@jqnatividad

Copy link
Copy Markdown
Collaborator

Summary

End-to-end code review of src/cmd/frequency.rs plus three follow-up refactors. Net −645 / +458 lines (-187 total) across four focused commits — no behavior changes for any user-visible feature, byte-identical output for all 160 frequency tests.

Commits

  1. 465dc34c0frequency: review-driven correctness, perf and docs fixes

    • Correctness: RankStrategy::Ordinal no longer advances current_rank for sentinel-suppressed nulls (matches the other 4 strategies); group_by_count/group_by_weight cleanup; counts() suppresses misleading Other (0) rows; three unsafe { mem::transmute } calls replaced with the existing Selection::from_indices constructor; add_stat emits JsonValue::Null for NaN/Inf instead of silently coercing to 0; try_output_from_cache warns and falls back when the cache contains duplicate column names.
    • Performance: move_null_to_end_if_needed now uses Vec::extract_if instead of an O(n²) Vec::remove shift loop; merge_ftables/merge_weighted_ftables fall back to sequential zip below 8 columns.
    • Refactor: FrequencyField::stats gets #[serde(skip_serializing_if = "Vec::is_empty")] — deletes the brittle JSON regex post-processing and the manual TOON tree-walk; evaluate_stats_filter collapses 8 near-identical macros into a single bind! macro driven by mlua's IntoLua (Option<T> → Nil).
    • Docs: QSV_FREQ_CHUNK_MEMORY_MB falls through to CPU-based on any non-u64 value; --unq-limit is unweighted-only; --null-text/--other-text <NONE> documented as case-sensitive exact match; --force clarified.
  2. 26a73afc1frequency: canonicalize cache path so symlinks/relative paths share cache

    • Adds Args::cache_path_for helper that calls fs::canonicalize before computing the cache filename, so data.csv, ./data.csv, and symlinks all hit the same cache. Debug-level fallback to the uncanonicalized path on error.
  3. f49a9f490frequency: dedup rank-emit logic into shared closure

    • Both apply_ranking_strategy_unweighted and apply_ranking_strategy_weighted had 5 match arms (Dense / Min / Max / Ordinal / Average) that were ~90% identical. Extracted the per-row null/non-null branching plus count_sum/pct_sum accumulation into a closure within each function. Each strategy now just computes its own rank and calls the closure, returning a suppressed bool so Ordinal preserves its existing carve-out.
    • Net −139 lines with byte-identical output for all 16 frequency_rank_ties_* tests.
  4. e875024c7frequency: extract shared unweighted CSV emit loop

    • run() and try_output_from_cache() had near-identical inner loops. Extracted into Args::emit_unweighted_csv_rows with its own scratch buffers (itoa/zmij/rank/value_str/processed_frequencies). Weighted branch in run() stays inline because its data shape differs.

Test plan

  • cargo build --locked --bin qsv -F all_features — clean
  • cargo test -F all_features2620 passed; 0 failed
  • cargo clippy -F all_features --bin qsv -- -W clippy::perf — no warnings
  • All 160 test_frequency::* tests pass byte-identical output before/after
  • All 16 test_frequency::frequency_rank_ties_* tests pass after the rank-emit dedup
  • All 30 test_frequency::frequency_jsonl_* cache tests pass after the canonicalize change

🤖 Generated with Claude Code

jqnatividad and others added 4 commits April 25, 2026 20:06
Correctness:
- Ordinal ranking no longer advances current_rank for sentinel-suppressed
  null entries (unweighted + weighted), matching the other strategies.
- group_by_count/group_by_weight: drop redundant emptiness guard and
  replace misleading unwrap() safety comment with proper if-let.
- counts() now suppresses misleading "Other (0)" rows via the same
  other_unique_count > 0 guard the weighted path already had.
- Replace three unsafe Vec→Selection transmutes with the existing
  Selection::from_indices constructor.
- add_stat: emit JsonValue::Null for NaN/Inf instead of silently
  coercing to 0.
- try_output_from_cache: warn and fall back to recomputation when the
  cache contains duplicate column names (last-wins lookup is ambiguous).

Performance:
- move_null_to_end_if_needed uses Vec::extract_if instead of an
  O(n²) Vec::remove shift loop.
- merge_ftables/merge_weighted_ftables fall back to sequential zip
  below 8 columns to avoid rayon overhead.

Refactor:
- FrequencyField::stats gets #[serde(skip_serializing_if)]; delete the
  brittle JSON regex post-processing and the manual TOON tree-walk for
  empty stats arrays.
- evaluate_stats_filter: collapse eight near-identical macros into a
  single bind! macro driven by mlua's IntoLua (Option<T> → Nil).

Docs / USAGE:
- Clarify QSV_FREQ_CHUNK_MEMORY_MB falls through to CPU-based chunking
  on any non-u64 value.
- --unq-limit notes it is unweighted-only.
- --null-text / --other-text "<NONE>" sentinel is documented as
  case-sensitive exact match.
- --force description clarified.
- Reword --force continuation so docopt does not re-parse it as an
  option definition (caught during testing — broke 137 tests).
- Remove duplicate comment in parallel_ftables.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ache

Both write_frequency_jsonl and read_frequency_cache built the cache path
via `path.with_extension(...)` directly off the user-supplied input path.
Running `qsv frequency data.csv` and then `qsv frequency ./data.csv`
produced different cache filenames (or missed an existing cache through
a symlink), defeating the cache silently.

Add a small Self::cache_path_for helper that canonicalizes via
fs::canonicalize before computing the extension, with a debug-level
fallback to the uncanonicalized path on error so behavior is no worse
than before. Both call sites now route through it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both apply_ranking_strategy_unweighted and apply_ranking_strategy_weighted
had five match arms (Dense / Min / Max / Ordinal / Average) that were
~90% identical: same null vs non-null branching, same pct_nulls handling,
same count_sum / pct_sum accumulation. The only real differences were
when current_rank is updated and what rank value each row emits.

Extract the per-row emit (the null/non-null branching plus the
count_sum / pct_sum accumulation) into a closure within each function.
Each strategy now just computes its own rank value and calls the
closure, returning a `suppressed` bool so Ordinal can preserve its
existing carve-out where sentinel-suppressed nulls do not consume a
rank slot. All 16 frequency_rank_ties tests pass with byte-identical
output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
run() and try_output_from_cache() each had a near-identical inner loop
that built header_vec, called process_frequencies, then for each
ProcessedFrequency formatted the rank, optionally visualized whitespace,
and wrote one CSV row.

Move that into Args::emit_unweighted_csv_rows, owning its own scratch
buffers (itoa/zmij/rank/value_str/processed_frequencies). Both callers
now write the column-header record, call the helper, and flush. The
weighted branch in run() stays inline because its data shape differs
(weighted column map by reference vs FTable by value).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codacy-production

codacy-production Bot commented Apr 26, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refines the frequency command implementation (src/cmd/frequency.rs) with review-driven correctness fixes, targeted performance improvements, and refactors that remove duplicated logic while preserving byte-identical output across the existing frequency test suite.

Changes:

  • Improves correctness/robustness around ranking, “Other” emission, NaN/Inf JSON stat handling, and cache validation fallbacks.
  • Optimizes hot paths (e.g., NULL-to-end movement and merge parallelism thresholds) and reduces duplicated CSV emit / ranking logic.
  • Simplifies JSON/TOON output generation by relying on serde to omit empty stats arrays (removing regex/tree-walk post-processing).

Comment thread src/cmd/frequency.rs
Comment thread src/cmd/frequency.rs
Comment thread src/cmd/frequency.rs Outdated
…lice cmp

Three Copilot suggestions on PR #3745:

1. Selection-signature validation. In `--no-headers` mode, the cache
   stores entries keyed by position-within-selection ("1", "2", ...),
   so a cache built with `--select 1,2` would silently match a run
   with `--select 2,1` and return the wrong columns. Add a
   `selection_signature` field to `FrequencyCacheMetadata` (the
   selected-header bytes joined by ASCII Unit Separator), populate it
   in `write_frequency_jsonl`, and reject mismatched/empty signatures
   in `try_output_from_cache`. `#[serde(default)]` keeps old caches
   readable; they get invalidated on first read post-upgrade and
   regenerated cleanly.

2. Avoid per-field String allocations in `evaluate_stats_filter`.
   The previous `bind!` macro called `.clone().into_lua(lua)` on every
   field, which allocates a fresh `String` for `String` /
   `Option<String>` fields (`field`, `r#type`, `min`, `max`,
   `sort_order`, `mode`, `antimode`) on every column evaluation. Add
   `bind_str!` and `bind_opt_str!` variants that pass `&str` /
   `Option<&str>` so Lua copies directly without a Rust-side clone.
   `bind!` keeps its role for Copy fields (numerics, bool, Option<num>).

3. Use `.as_slice()` comparison in `move_null_to_end_if_needed` so
   the predicate reads as a clear borrow-only slice compare rather
   than relying on deref/autoref behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 0e07c98 into master Apr 26, 2026
17 checks passed
@jqnatividad jqnatividad deleted the frequency-review-202604 branch April 26, 2026 01:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants