dedup: review-driven cleanup, allocation-free ignore-case, edge-case fixes#3754

jqnatividad · 2026-04-26T22:06:51Z

Summary

Review-driven cleanup of src/cmd/dedup.rs (continues the recent count / util / datefmt series).

Empty-input bug: --sorted on a header-only file no longer emits a stray empty data row (still prints the dupe count to stderr to keep parity with the in-memory path).
dupewtr is now Option<Writer>: only constructed and flushed when --dupes-output is set; no phantom stdout writer otherwise.
ASCII fast path in iter_cmp_ignore_case: pure-ASCII cells now compare via a zero-allocation byte-wise lowercase fold; non-ASCII keeps the allocating fallback; invalid UTF-8 falls back to raw byte order (deterministic) instead of being treated as a missing cell.
Hoisted comparison dispatch in the in-memory scan loop: picks the comparison function once per run, then runs a single tight monomorphized loop, mirroring the existing par_sort_by structure.
std::mem::swap instead of clone_from in the streaming sorted loop.
use std::cmp::Ordering; throughout.
USAGE: note --memcheck has no effect under --sorted.

Reviewed by roborev (#1707, #1708) — both findings addressed in follow-up commits in this branch.

Test plan

cargo test -F all_features dedup — 21/21 pass (3 new: dedup_sorted_empty, dedup_sorted_empty_dupes_output, dedup_dupes_output_run_of_three)
cargo test -F all_features sort — 78/78 pass (consumes iter_cmp_ignore_case)
cargo +nightly clippy --bin qsv -F all_features -- -W clippy::perf — clean
cargo +nightly fmt --check — clean

🤖 Generated with Claude Code

…fixes - Fix --sorted on empty input emitting a stray empty data row. - Make dupewtr Option<Writer>; no phantom stdout writer when --dupes-output is absent. - ASCII fast path in iter_cmp_ignore_case: zero-allocation byte-wise lowercase compare for the common case; non-ASCII keeps the allocating fallback; invalid UTF-8 now falls back to raw byte order (deterministic) instead of being treated as a missing cell. - Hoist comparison dispatch out of the in-memory scan loop; one tight monomorphized loop per ComparisonMode, mirroring the par_sort_by structure. - std::mem::swap instead of clone_from in the streaming sorted loop. - use std::cmp::Ordering for readability. - Doc: note --memcheck is a no-op under --sorted. - Test: dedup_sorted_empty regression test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…overage Roborev review 1707 raised two Low findings: - The new empty-input early-return in the --sorted path bypassed the "0" duplicate-count message that the in-memory path still prints. Wrap the streaming loop in `if rdr.read_byte_record(...)? { ... }` and let control fall through to the existing flush + count-print block. Behavior on empty input is now identical between the two paths. - Test coverage for the rewritten code was thin. Added: * dedup_sorted_empty_dupes_output — empty input + --sorted --dupes-output, asserts the dupes file contains only the header. * dedup_dupes_output_run_of_three — in-memory mode with a run of three identical rows, exercises the scan_dedup! macro's prev-write-on-equal branch (2 dupes written, 1 survivor). * dedup_sorted_empty also now asserts the stderr count message is emitted on empty input. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ed_empty Roborev review 1708 raised one Low finding: stderr.contains("0") in dedup_sorted_empty is too loose — any unrelated "0" digit on stderr (timestamp, elapsed time, etc.) would satisfy it. dedup actually emits only the number via `eprintln!("{dupe_count}")`, so the tight form is exact equality on the trimmed output, not the literal "0 duplicates found" string the reviewer suggested (that prose doesn't exist in the code). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codacy-production · 2026-04-26T22:07:58Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 5 complexity

Metric Results

Complexity 5

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

Copilot

Pull request overview

This PR refines the dedup command’s behavior and performance, with focused fixes for --sorted edge cases, --dupes-output handling, and faster allocation-free ASCII case-insensitive comparisons used by dedup/sort.

Changes:

Fix --sorted empty-input behavior to avoid emitting a stray empty data row while preserving duplicate-count stderr output.
Make --dupes-output truly optional by only constructing/flushing a duplicates writer when requested.
Improve ignore-case comparisons with an ASCII zero-allocation fast path and more deterministic handling of invalid UTF-8.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
`src/cmd/dedup.rs`	Adjusts streaming (`--sorted`) and in-memory dedup loops, makes dupes writer optional, and rewrites ignore-case comparison for performance/edge cases.
`tests/test_dedup.rs`	Adds regression tests for `--sorted` empty input and for `--dupes-output` behavior.

Three review findings, all valid: 1. Streaming --sorted Equal branch wrote &record (the survivor) to the dupes file. With --select on a run of length >2 (e.g. key=1 across three rows that differ on val), this produced the survivor row N-1 times in dupes and dropped the actually-removed rows. Now writes &next_record (the row being dropped on this iteration), which matches the in-memory path's symmetric behavior (main + dupes = original). Regression: dedup_sorted_select_dupes_output_run_of_three. 2. Dupes writer was seeded with rdr.byte_headers() unconditionally; under --no-headers, byte_headers() returns the first DATA row, so the dupes file started with that row as a phantom "header". Replaced with rconfig.write_headers(&mut rdr, &mut w), which already guards on the --no-headers flag and the empty-record case (config.rs:478-482). Regression: dedup_no_headers_dupes_output. 3. dedup_sorted_empty ran the command twice (read_stdout + output_stderr). Refactored to a single wrk.output(&mut cmd) and assert on both streams from the one run; faster and verifies stdout/stderr from the same invocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-free streaming loop (#3756) * sortcheck: review-driven cleanup, add --numeric/--natural, allocation-free streaming loop Bring sortcheck in line with the recent review-driven cleanups for sort (#3755), dedup (#3754), datefmt (#3753), and util (#3752), and close a real feature gap: sortcheck previously only supported lex / --ignore-case, so users could not pre-verify a CSV before piping into `sort --numeric`, `sort --natural`, or `dedup --numeric --sorted`. Changes - Add --numeric (-N) and --natural flags, reusing iter_cmp_num, iter_cmp_natural, and iter_cmp_natural_ignore_case already exported from cmd/sort.rs (no new comparators). - Resolve flags into a ComparisonMode enum once before the loop, with the same precedence as sort/dedup: --natural > --numeric > --ignore-case. Dispatch is a single match per row, monomorphizing to one comparator. - Replace record.clone_from(&next_record) with std::mem::swap on both Less and Greater arms — zero-copy buffer rotation, matching dedup #3754. - args.flag_json | args.flag_pretty_json -> || (idiomatic logical OR). - dupe_count as i64 -> i64::try_from(...).unwrap_or(i64::MAX) with a comment on why -1 signals "not applicable". - Drop stale #[allow(dead_code)] on Args; add a comment on the do_json-forces-full-scan invariant that makes scan_ctr equivalent to record_count in the JSON path. - USAGE: documented the new flags + their precedence, added an Examples block, tightened the extsort paragraph (sort itself is multi-threaded). - Regenerated docs/help/sortcheck.md via `qsv --generate-help-md`. Tests - 10 new tests in tests/test_sortcheck.rs covering --ignore-case (including a real lex-vs-ignore-case discriminator), empty input, single-row input, --numeric sorted/unsorted, --natural sorted/unsorted, --natural + --ignore-case composition, and a precedence test confirming --natural overrides --numeric. - 19/19 sortcheck tests pass; `cargo clippy -F all_features -- -D warnings` clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address review: empty-input scan_ctr off-by-one, docopt-safe USAGE, ComparisonMode Copy, test rename - Fix empty-input bug: a header-only CSV used to report `record_count: 1` because the loop incremented scan_ctr once even when the initial read_byte_record returned false. Guard the loop and simplify the JSON path to always use scan_ctr (do_json forces a full scan, so it's authoritative). - Reword the --ignore-case USAGE block so no continuation line begins with `--natural` (docopt scans for option-like prefixes; lines starting with `--` could be misinterpreted as a second option definition). - Also clarify the wording: --ignore-case is only ignored under PURE numeric comparison (--numeric without --natural), since the documented precedence allows --natural --numeric --ignore-case where ignore-case does apply. - Derive Clone/Copy on ComparisonMode to mirror SortMode in cmd/sort.rs. Functionally a no-op (matching unit variants doesn't move), but defensive and consistent with sort.rs. - Rename test sortcheck_ignore_case_sorted -> sortcheck_ignore_case_notsorted (the test asserts an error; the prior name was misleading — the truly-sorted-under-ignore-case case is covered by sortcheck_ignore_case_actually_sorted). - Update sortcheck_empty test to expect record_count: 0 (was 1, codifying the off-by-one). - Regenerated docs/help/sortcheck.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Four review findings, three correctness, one test hygiene — all valid: 1-3. ExtDedupCache::insert only checked the in-memory `memo` set, but dump_to_disk() drains memo when the memory limit is hit. Once spilled, items that exist only on disk would look "new" to a memo-only check and be re-emitted as duplicates. The previous extdedup code papered over this by calling contains() (which DOES check both) before insert(); folding to a single insert() call in this PR exposed the latent bug. Fix in odhtcache: - insert() now consults both memo and disk before declaring a new insertion, matching its docstring contract. - Extracted contains_on_disk() helper used by both contains() and insert(), removing duplicated mmap/HashTable plumbing. - Updated docs on insert() and contains() to reflect the new semantics. Caller behavior in extdedup is unchanged (insert() still returns true exactly when the row is genuinely new), so no extdedup edits are needed and the single-touch pattern from the original cleanup is preserved. 4. extdedup_csvmode_key_collision invoked the command twice — once via wrk.output() for stderr, then again via wrk.read_stdout() for stdout — comparing streams from two different runs. Refactored to a single wrk.output() and assert stdout/stderr from the same Output (matches the dedup_sorted_empty fix in PR #3754). Verification: all 8 extdedup tests pass, including extdedup_large_memory_test (2.5M unique + 2.5M dupes under --memory-limit 1, exercises spill-to-disk). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ssues (#3759) * extdedup: review-driven cleanup, fix key-collision and dupes-writer issues Correctness: - CSV mode: insert a US (\x1F) separator between selected fields when building the dedup key. Previously fields were concatenated raw, so rows ("ab","cd") and ("a","bcd") both produced key "abcd" and the second row was silently treated as a duplicate. - CSV mode: honor --no-output (was silently ignored — only line mode respected it). USAGE updated to note both modes apply. - CSV mode: dupewtr is now Option<csv::Writer>; the previous code unconditionally constructed Config::new(None).writer() which targeted stdout when --dupes-output was absent and then flushed it, contending with the main writer. Mirrors the dedup cleanup in 08f2815. - Line mode: dupes_writer is now Option<BufWriter>, dropping the /dev/null / "nul" placeholder file handle and the cfg(target_family) split. Efficiency / cleanup: - Single hash-table touch per row using insert()'s bool return value instead of contains() + insert(). - Drop dead utf8_string intermediate buffer in the UTF-8 fallback path; String::from_utf8_lossy returns Cow<str> which can be pushed directly. - Drop unnecessary clone_from in both modes (byte_records yields owned ByteRecords; line? already owns the String). - let-else replaces flag_select.unwrap() in dedup_csv. - Key buffer capacity bumped from 20 to 256 (more realistic default). Docs: - USAGE clarifies --dupes-output format per mode (CSV: valid CSV with leading dupe_rowno column; LINE: tab-prefixed, not a CSV). - Regenerated docs/help/extdedup.md. odhtcache: ExtDedupCache::contains is now unused by bin targets but kept as part of the cache's public API with #[allow(dead_code)] and a doc note pointing at the new insert-return-value pattern. Tests: - extdedup_csvmode_key_collision — regression for the field-separator bug; ("ab","cd") and ("a","bcd") must both survive. - extdedup_csvmode_no_output — --no-output produces empty stdout but still reports the duplicate count on stderr. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address copilot review on PR #3759 Four review findings, three correctness, one test hygiene — all valid: 1-3. ExtDedupCache::insert only checked the in-memory `memo` set, but dump_to_disk() drains memo when the memory limit is hit. Once spilled, items that exist only on disk would look "new" to a memo-only check and be re-emitted as duplicates. The previous extdedup code papered over this by calling contains() (which DOES check both) before insert(); folding to a single insert() call in this PR exposed the latent bug. Fix in odhtcache: - insert() now consults both memo and disk before declaring a new insertion, matching its docstring contract. - Extracted contains_on_disk() helper used by both contains() and insert(), removing duplicated mmap/HashTable plumbing. - Updated docs on insert() and contains() to reflect the new semantics. Caller behavior in extdedup is unchanged (insert() still returns true exactly when the row is genuinely new), so no extdedup edits are needed and the single-touch pattern from the original cleanup is preserved. 4. extdedup_csvmode_key_collision invoked the command twice — once via wrk.output() for stderr, then again via wrk.read_stdout() for stdout — comparing streams from two different runs. Refactored to a single wrk.output() and assert stdout/stderr from the same Output (matches the dedup_sorted_empty fix in PR #3754). Verification: all 8 extdedup tests pass, including extdedup_large_memory_test (2.5M unique + 2.5M dupes under --memory-limit 1, exercises spill-to-disk). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address roborev #1725 — make key-collision test CRLF-safe One Low/Medium finding: extdedup_csvmode_key_collision asserted `assert_eq!(stdout.trim_end_matches(['\r', '\n']), "a,b\nab,cd\na,bcd")`, which trims only TRAILING line endings. On Windows runners csv::Writer may emit `\r\n` between records, so internal separators in stdout would remain `\r\n` and break the equality check. The previous form used `read_stdout` (CSV-parsed, line-ending agnostic) but was rejected by PR #3759 review for running the command twice. Fix: normalize CRLF→LF via newline_converter::dos2unix before comparing. Matches the pattern already used by extdedup_linemode and extdedup_linemode_dupesoutput. Single-execution Output is preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jqnatividad and others added 3 commits April 26, 2026 17:04

jqnatividad mentioned this pull request Apr 26, 2026

sort: review-driven cleanup, fix --numeric --natural --unique inconsistency #3755

Merged

5 tasks

jqnatividad requested a review from Copilot April 26, 2026 22:20

Copilot started reviewing on behalf of jqnatividad April 26, 2026 22:21 View session

Copilot AI reviewed Apr 26, 2026

View reviewed changes

Comment thread src/cmd/dedup.rs

Comment thread src/cmd/dedup.rs

Comment thread tests/test_dedup.rs Outdated

jqnatividad and others added 2 commits April 26, 2026 19:11

chore: rust fmt

8c23560

jqnatividad merged commit 08f2815 into master Apr 26, 2026
14 of 15 checks passed

jqnatividad deleted the dedup-codereview-202604 branch April 26, 2026 23:56

jqnatividad mentioned this pull request Apr 27, 2026

sortcheck: review-driven cleanup, add --numeric/--natural, allocation-free streaming loop #3756

Merged

5 tasks

This was referenced Apr 27, 2026

py: review-driven cleanup, hoist Python module setup, fix jagged-row panic #3758

Merged

extdedup: review-driven cleanup, fix key-collision and dupes-writer issues #3759

Merged

BrewTestBot mentioned this pull request May 3, 2026

qsv 20.0.0 Homebrew/homebrew-core#280687

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dedup: review-driven cleanup, allocation-free ignore-case, edge-case fixes#3754