dedup: review-driven cleanup, allocation-free ignore-case, edge-case fixes#3754
Merged
Conversation
…fixes - Fix --sorted on empty input emitting a stray empty data row. - Make dupewtr Option<Writer>; no phantom stdout writer when --dupes-output is absent. - ASCII fast path in iter_cmp_ignore_case: zero-allocation byte-wise lowercase compare for the common case; non-ASCII keeps the allocating fallback; invalid UTF-8 now falls back to raw byte order (deterministic) instead of being treated as a missing cell. - Hoist comparison dispatch out of the in-memory scan loop; one tight monomorphized loop per ComparisonMode, mirroring the par_sort_by structure. - std::mem::swap instead of clone_from in the streaming sorted loop. - use std::cmp::Ordering for readability. - Doc: note --memcheck is a no-op under --sorted. - Test: dedup_sorted_empty regression test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…overage
Roborev review 1707 raised two Low findings:
- The new empty-input early-return in the --sorted path bypassed the
"0" duplicate-count message that the in-memory path still prints.
Wrap the streaming loop in `if rdr.read_byte_record(...)? { ... }`
and let control fall through to the existing flush + count-print
block. Behavior on empty input is now identical between the two
paths.
- Test coverage for the rewritten code was thin. Added:
* dedup_sorted_empty_dupes_output — empty input + --sorted
--dupes-output, asserts the dupes file contains only the header.
* dedup_dupes_output_run_of_three — in-memory mode with a run of
three identical rows, exercises the scan_dedup! macro's
prev-write-on-equal branch (2 dupes written, 1 survivor).
* dedup_sorted_empty also now asserts the stderr count message
is emitted on empty input.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed_empty
Roborev review 1708 raised one Low finding: stderr.contains("0") in
dedup_sorted_empty is too loose — any unrelated "0" digit on stderr
(timestamp, elapsed time, etc.) would satisfy it. dedup actually
emits only the number via `eprintln!("{dupe_count}")`, so the tight
form is exact equality on the trimmed output, not the literal "0
duplicates found" string the reviewer suggested (that prose doesn't
exist in the code).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 tasks
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 5 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR refines the dedup command’s behavior and performance, with focused fixes for --sorted edge cases, --dupes-output handling, and faster allocation-free ASCII case-insensitive comparisons used by dedup/sort.
Changes:
- Fix
--sortedempty-input behavior to avoid emitting a stray empty data row while preserving duplicate-count stderr output. - Make
--dupes-outputtruly optional by only constructing/flushing a duplicates writer when requested. - Improve ignore-case comparisons with an ASCII zero-allocation fast path and more deterministic handling of invalid UTF-8.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/cmd/dedup.rs |
Adjusts streaming (--sorted) and in-memory dedup loops, makes dupes writer optional, and rewrites ignore-case comparison for performance/edge cases. |
tests/test_dedup.rs |
Adds regression tests for --sorted empty input and for --dupes-output behavior. |
Three review findings, all valid: 1. Streaming --sorted Equal branch wrote &record (the survivor) to the dupes file. With --select on a run of length >2 (e.g. key=1 across three rows that differ on val), this produced the survivor row N-1 times in dupes and dropped the actually-removed rows. Now writes &next_record (the row being dropped on this iteration), which matches the in-memory path's symmetric behavior (main + dupes = original). Regression: dedup_sorted_select_dupes_output_run_of_three. 2. Dupes writer was seeded with rdr.byte_headers() unconditionally; under --no-headers, byte_headers() returns the first DATA row, so the dupes file started with that row as a phantom "header". Replaced with rconfig.write_headers(&mut rdr, &mut w), which already guards on the --no-headers flag and the empty-record case (config.rs:478-482). Regression: dedup_no_headers_dupes_output. 3. dedup_sorted_empty ran the command twice (read_stdout + output_stderr). Refactored to a single wrk.output(&mut cmd) and assert on both streams from the one run; faster and verifies stdout/stderr from the same invocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merged
5 tasks
jqnatividad
added a commit
that referenced
this pull request
Apr 27, 2026
…-free streaming loop (#3756) * sortcheck: review-driven cleanup, add --numeric/--natural, allocation-free streaming loop Bring sortcheck in line with the recent review-driven cleanups for sort (#3755), dedup (#3754), datefmt (#3753), and util (#3752), and close a real feature gap: sortcheck previously only supported lex / --ignore-case, so users could not pre-verify a CSV before piping into `sort --numeric`, `sort --natural`, or `dedup --numeric --sorted`. Changes - Add --numeric (-N) and --natural flags, reusing iter_cmp_num, iter_cmp_natural, and iter_cmp_natural_ignore_case already exported from cmd/sort.rs (no new comparators). - Resolve flags into a ComparisonMode enum once before the loop, with the same precedence as sort/dedup: --natural > --numeric > --ignore-case. Dispatch is a single match per row, monomorphizing to one comparator. - Replace record.clone_from(&next_record) with std::mem::swap on both Less and Greater arms — zero-copy buffer rotation, matching dedup #3754. - args.flag_json | args.flag_pretty_json -> || (idiomatic logical OR). - dupe_count as i64 -> i64::try_from(...).unwrap_or(i64::MAX) with a comment on why -1 signals "not applicable". - Drop stale #[allow(dead_code)] on Args; add a comment on the do_json-forces-full-scan invariant that makes scan_ctr equivalent to record_count in the JSON path. - USAGE: documented the new flags + their precedence, added an Examples block, tightened the extsort paragraph (sort itself is multi-threaded). - Regenerated docs/help/sortcheck.md via `qsv --generate-help-md`. Tests - 10 new tests in tests/test_sortcheck.rs covering --ignore-case (including a real lex-vs-ignore-case discriminator), empty input, single-row input, --numeric sorted/unsorted, --natural sorted/unsorted, --natural + --ignore-case composition, and a precedence test confirming --natural overrides --numeric. - 19/19 sortcheck tests pass; `cargo clippy -F all_features -- -D warnings` clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address review: empty-input scan_ctr off-by-one, docopt-safe USAGE, ComparisonMode Copy, test rename - Fix empty-input bug: a header-only CSV used to report `record_count: 1` because the loop incremented scan_ctr once even when the initial read_byte_record returned false. Guard the loop and simplify the JSON path to always use scan_ctr (do_json forces a full scan, so it's authoritative). - Reword the --ignore-case USAGE block so no continuation line begins with `--natural` (docopt scans for option-like prefixes; lines starting with `--` could be misinterpreted as a second option definition). - Also clarify the wording: --ignore-case is only ignored under PURE numeric comparison (--numeric without --natural), since the documented precedence allows --natural --numeric --ignore-case where ignore-case does apply. - Derive Clone/Copy on ComparisonMode to mirror SortMode in cmd/sort.rs. Functionally a no-op (matching unit variants doesn't move), but defensive and consistent with sort.rs. - Rename test sortcheck_ignore_case_sorted -> sortcheck_ignore_case_notsorted (the test asserts an error; the prior name was misleading — the truly-sorted-under-ignore-case case is covered by sortcheck_ignore_case_actually_sorted). - Update sortcheck_empty test to expect record_count: 0 (was 1, codifying the off-by-one). - Regenerated docs/help/sortcheck.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 27, 2026
jqnatividad
added a commit
that referenced
this pull request
Apr 27, 2026
Four review findings, three correctness, one test hygiene — all valid:
1-3. ExtDedupCache::insert only checked the in-memory `memo` set, but
dump_to_disk() drains memo when the memory limit is hit. Once
spilled, items that exist only on disk would look "new" to a
memo-only check and be re-emitted as duplicates. The previous
extdedup code papered over this by calling contains() (which DOES
check both) before insert(); folding to a single insert() call in
this PR exposed the latent bug.
Fix in odhtcache:
- insert() now consults both memo and disk before declaring a new
insertion, matching its docstring contract.
- Extracted contains_on_disk() helper used by both contains() and
insert(), removing duplicated mmap/HashTable plumbing.
- Updated docs on insert() and contains() to reflect the new
semantics.
Caller behavior in extdedup is unchanged (insert() still returns
true exactly when the row is genuinely new), so no extdedup edits
are needed and the single-touch pattern from the original cleanup
is preserved.
4. extdedup_csvmode_key_collision invoked the command twice — once via
wrk.output() for stderr, then again via wrk.read_stdout() for stdout
— comparing streams from two different runs. Refactored to a single
wrk.output() and assert stdout/stderr from the same Output (matches
the dedup_sorted_empty fix in PR #3754).
Verification: all 8 extdedup tests pass, including extdedup_large_memory_test
(2.5M unique + 2.5M dupes under --memory-limit 1, exercises spill-to-disk).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jqnatividad
added a commit
that referenced
this pull request
Apr 27, 2026
…ssues (#3759) * extdedup: review-driven cleanup, fix key-collision and dupes-writer issues Correctness: - CSV mode: insert a US (\x1F) separator between selected fields when building the dedup key. Previously fields were concatenated raw, so rows ("ab","cd") and ("a","bcd") both produced key "abcd" and the second row was silently treated as a duplicate. - CSV mode: honor --no-output (was silently ignored — only line mode respected it). USAGE updated to note both modes apply. - CSV mode: dupewtr is now Option<csv::Writer>; the previous code unconditionally constructed Config::new(None).writer() which targeted stdout when --dupes-output was absent and then flushed it, contending with the main writer. Mirrors the dedup cleanup in 08f2815. - Line mode: dupes_writer is now Option<BufWriter>, dropping the /dev/null / "nul" placeholder file handle and the cfg(target_family) split. Efficiency / cleanup: - Single hash-table touch per row using insert()'s bool return value instead of contains() + insert(). - Drop dead utf8_string intermediate buffer in the UTF-8 fallback path; String::from_utf8_lossy returns Cow<str> which can be pushed directly. - Drop unnecessary clone_from in both modes (byte_records yields owned ByteRecords; line? already owns the String). - let-else replaces flag_select.unwrap() in dedup_csv. - Key buffer capacity bumped from 20 to 256 (more realistic default). Docs: - USAGE clarifies --dupes-output format per mode (CSV: valid CSV with leading dupe_rowno column; LINE: tab-prefixed, not a CSV). - Regenerated docs/help/extdedup.md. odhtcache: ExtDedupCache::contains is now unused by bin targets but kept as part of the cache's public API with #[allow(dead_code)] and a doc note pointing at the new insert-return-value pattern. Tests: - extdedup_csvmode_key_collision — regression for the field-separator bug; ("ab","cd") and ("a","bcd") must both survive. - extdedup_csvmode_no_output — --no-output produces empty stdout but still reports the duplicate count on stderr. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address copilot review on PR #3759 Four review findings, three correctness, one test hygiene — all valid: 1-3. ExtDedupCache::insert only checked the in-memory `memo` set, but dump_to_disk() drains memo when the memory limit is hit. Once spilled, items that exist only on disk would look "new" to a memo-only check and be re-emitted as duplicates. The previous extdedup code papered over this by calling contains() (which DOES check both) before insert(); folding to a single insert() call in this PR exposed the latent bug. Fix in odhtcache: - insert() now consults both memo and disk before declaring a new insertion, matching its docstring contract. - Extracted contains_on_disk() helper used by both contains() and insert(), removing duplicated mmap/HashTable plumbing. - Updated docs on insert() and contains() to reflect the new semantics. Caller behavior in extdedup is unchanged (insert() still returns true exactly when the row is genuinely new), so no extdedup edits are needed and the single-touch pattern from the original cleanup is preserved. 4. extdedup_csvmode_key_collision invoked the command twice — once via wrk.output() for stderr, then again via wrk.read_stdout() for stdout — comparing streams from two different runs. Refactored to a single wrk.output() and assert stdout/stderr from the same Output (matches the dedup_sorted_empty fix in PR #3754). Verification: all 8 extdedup tests pass, including extdedup_large_memory_test (2.5M unique + 2.5M dupes under --memory-limit 1, exercises spill-to-disk). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address roborev #1725 — make key-collision test CRLF-safe One Low/Medium finding: extdedup_csvmode_key_collision asserted `assert_eq!(stdout.trim_end_matches(['\r', '\n']), "a,b\nab,cd\na,bcd")`, which trims only TRAILING line endings. On Windows runners csv::Writer may emit `\r\n` between records, so internal separators in stdout would remain `\r\n` and break the equality check. The previous form used `read_stdout` (CSV-parsed, line-ending agnostic) but was rejected by PR #3759 review for running the command twice. Fix: normalize CRLF→LF via newline_converter::dos2unix before comparing. Matches the pattern already used by extdedup_linemode and extdedup_linemode_dupesoutput. Single-execution Output is preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Review-driven cleanup of
src/cmd/dedup.rs(continues the recentcount/util/datefmtseries).--sortedon a header-only file no longer emits a stray empty data row (still prints the dupe count to stderr to keep parity with the in-memory path).dupewtris nowOption<Writer>: only constructed and flushed when--dupes-outputis set; no phantom stdout writer otherwise.iter_cmp_ignore_case: pure-ASCII cells now compare via a zero-allocation byte-wise lowercase fold; non-ASCII keeps the allocating fallback; invalid UTF-8 falls back to raw byte order (deterministic) instead of being treated as a missing cell.par_sort_bystructure.std::mem::swapinstead ofclone_fromin the streaming sorted loop.use std::cmp::Ordering;throughout.--memcheckhas no effect under--sorted.Reviewed by roborev (#1707, #1708) — both findings addressed in follow-up commits in this branch.
Test plan
cargo test -F all_features dedup— 21/21 pass (3 new:dedup_sorted_empty,dedup_sorted_empty_dupes_output,dedup_dupes_output_run_of_three)cargo test -F all_features sort— 78/78 pass (consumesiter_cmp_ignore_case)cargo +nightly clippy --bin qsv -F all_features -- -W clippy::perf— cleancargo +nightly fmt --check— clean🤖 Generated with Claude Code