Skip to content

dedup: review-driven cleanup, allocation-free ignore-case, edge-case fixes#3754

Merged
jqnatividad merged 5 commits into
masterfrom
dedup-codereview-202604
Apr 26, 2026
Merged

dedup: review-driven cleanup, allocation-free ignore-case, edge-case fixes#3754
jqnatividad merged 5 commits into
masterfrom
dedup-codereview-202604

Conversation

@jqnatividad

Copy link
Copy Markdown
Collaborator

Summary

Review-driven cleanup of src/cmd/dedup.rs (continues the recent count / util / datefmt series).

  • Empty-input bug: --sorted on a header-only file no longer emits a stray empty data row (still prints the dupe count to stderr to keep parity with the in-memory path).
  • dupewtr is now Option<Writer>: only constructed and flushed when --dupes-output is set; no phantom stdout writer otherwise.
  • ASCII fast path in iter_cmp_ignore_case: pure-ASCII cells now compare via a zero-allocation byte-wise lowercase fold; non-ASCII keeps the allocating fallback; invalid UTF-8 falls back to raw byte order (deterministic) instead of being treated as a missing cell.
  • Hoisted comparison dispatch in the in-memory scan loop: picks the comparison function once per run, then runs a single tight monomorphized loop, mirroring the existing par_sort_by structure.
  • std::mem::swap instead of clone_from in the streaming sorted loop.
  • use std::cmp::Ordering; throughout.
  • USAGE: note --memcheck has no effect under --sorted.

Reviewed by roborev (#1707, #1708) — both findings addressed in follow-up commits in this branch.

Test plan

  • cargo test -F all_features dedup — 21/21 pass (3 new: dedup_sorted_empty, dedup_sorted_empty_dupes_output, dedup_dupes_output_run_of_three)
  • cargo test -F all_features sort — 78/78 pass (consumes iter_cmp_ignore_case)
  • cargo +nightly clippy --bin qsv -F all_features -- -W clippy::perf — clean
  • cargo +nightly fmt --check — clean

🤖 Generated with Claude Code

jqnatividad and others added 3 commits April 26, 2026 17:04
…fixes

- Fix --sorted on empty input emitting a stray empty data row.
- Make dupewtr Option<Writer>; no phantom stdout writer when
  --dupes-output is absent.
- ASCII fast path in iter_cmp_ignore_case: zero-allocation byte-wise
  lowercase compare for the common case; non-ASCII keeps the
  allocating fallback; invalid UTF-8 now falls back to raw byte order
  (deterministic) instead of being treated as a missing cell.
- Hoist comparison dispatch out of the in-memory scan loop; one tight
  monomorphized loop per ComparisonMode, mirroring the par_sort_by
  structure.
- std::mem::swap instead of clone_from in the streaming sorted loop.
- use std::cmp::Ordering for readability.
- Doc: note --memcheck is a no-op under --sorted.
- Test: dedup_sorted_empty regression test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…overage

Roborev review 1707 raised two Low findings:

- The new empty-input early-return in the --sorted path bypassed the
  "0" duplicate-count message that the in-memory path still prints.
  Wrap the streaming loop in `if rdr.read_byte_record(...)? { ... }`
  and let control fall through to the existing flush + count-print
  block. Behavior on empty input is now identical between the two
  paths.

- Test coverage for the rewritten code was thin. Added:
  * dedup_sorted_empty_dupes_output — empty input + --sorted
    --dupes-output, asserts the dupes file contains only the header.
  * dedup_dupes_output_run_of_three — in-memory mode with a run of
    three identical rows, exercises the scan_dedup! macro's
    prev-write-on-equal branch (2 dupes written, 1 survivor).
  * dedup_sorted_empty also now asserts the stderr count message
    is emitted on empty input.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed_empty

Roborev review 1708 raised one Low finding: stderr.contains("0") in
dedup_sorted_empty is too loose — any unrelated "0" digit on stderr
(timestamp, elapsed time, etc.) would satisfy it. dedup actually
emits only the number via `eprintln!("{dupe_count}")`, so the tight
form is exact equality on the trimmed output, not the literal "0
duplicates found" string the reviewer suggested (that prose doesn't
exist in the code).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codacy-production

codacy-production Bot commented Apr 26, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 5 complexity

Metric Results
Complexity 5

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refines the dedup command’s behavior and performance, with focused fixes for --sorted edge cases, --dupes-output handling, and faster allocation-free ASCII case-insensitive comparisons used by dedup/sort.

Changes:

  • Fix --sorted empty-input behavior to avoid emitting a stray empty data row while preserving duplicate-count stderr output.
  • Make --dupes-output truly optional by only constructing/flushing a duplicates writer when requested.
  • Improve ignore-case comparisons with an ASCII zero-allocation fast path and more deterministic handling of invalid UTF-8.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
src/cmd/dedup.rs Adjusts streaming (--sorted) and in-memory dedup loops, makes dupes writer optional, and rewrites ignore-case comparison for performance/edge cases.
tests/test_dedup.rs Adds regression tests for --sorted empty input and for --dupes-output behavior.

Comment thread src/cmd/dedup.rs
Comment thread src/cmd/dedup.rs
Comment thread tests/test_dedup.rs Outdated
jqnatividad and others added 2 commits April 26, 2026 19:11
Three review findings, all valid:

1. Streaming --sorted Equal branch wrote &record (the survivor) to the
   dupes file. With --select on a run of length >2 (e.g. key=1 across
   three rows that differ on val), this produced the survivor row N-1
   times in dupes and dropped the actually-removed rows. Now writes
   &next_record (the row being dropped on this iteration), which matches
   the in-memory path's symmetric behavior (main + dupes = original).
   Regression: dedup_sorted_select_dupes_output_run_of_three.

2. Dupes writer was seeded with rdr.byte_headers() unconditionally; under
   --no-headers, byte_headers() returns the first DATA row, so the dupes
   file started with that row as a phantom "header". Replaced with
   rconfig.write_headers(&mut rdr, &mut w), which already guards on the
   --no-headers flag and the empty-record case (config.rs:478-482).
   Regression: dedup_no_headers_dupes_output.

3. dedup_sorted_empty ran the command twice (read_stdout + output_stderr).
   Refactored to a single wrk.output(&mut cmd) and assert on both streams
   from the one run; faster and verifies stdout/stderr from the same
   invocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 08f2815 into master Apr 26, 2026
14 of 15 checks passed
@jqnatividad jqnatividad deleted the dedup-codereview-202604 branch April 26, 2026 23:56
jqnatividad added a commit that referenced this pull request Apr 27, 2026
…-free streaming loop (#3756)

* sortcheck: review-driven cleanup, add --numeric/--natural, allocation-free streaming loop

Bring sortcheck in line with the recent review-driven cleanups for sort
(#3755), dedup (#3754), datefmt (#3753), and util (#3752), and close a
real feature gap: sortcheck previously only supported lex / --ignore-case,
so users could not pre-verify a CSV before piping into `sort --numeric`,
`sort --natural`, or `dedup --numeric --sorted`.

Changes
- Add --numeric (-N) and --natural flags, reusing iter_cmp_num,
  iter_cmp_natural, and iter_cmp_natural_ignore_case already exported
  from cmd/sort.rs (no new comparators).
- Resolve flags into a ComparisonMode enum once before the loop, with
  the same precedence as sort/dedup: --natural > --numeric > --ignore-case.
  Dispatch is a single match per row, monomorphizing to one comparator.
- Replace record.clone_from(&next_record) with std::mem::swap on both
  Less and Greater arms — zero-copy buffer rotation, matching dedup #3754.
- args.flag_json | args.flag_pretty_json -> || (idiomatic logical OR).
- dupe_count as i64 -> i64::try_from(...).unwrap_or(i64::MAX) with a
  comment on why -1 signals "not applicable".
- Drop stale #[allow(dead_code)] on Args; add a comment on the
  do_json-forces-full-scan invariant that makes scan_ctr equivalent to
  record_count in the JSON path.
- USAGE: documented the new flags + their precedence, added an Examples
  block, tightened the extsort paragraph (sort itself is multi-threaded).
- Regenerated docs/help/sortcheck.md via `qsv --generate-help-md`.

Tests
- 10 new tests in tests/test_sortcheck.rs covering --ignore-case
  (including a real lex-vs-ignore-case discriminator), empty input,
  single-row input, --numeric sorted/unsorted, --natural sorted/unsorted,
  --natural + --ignore-case composition, and a precedence test
  confirming --natural overrides --numeric.
- 19/19 sortcheck tests pass; `cargo clippy -F all_features -- -D warnings` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* address review: empty-input scan_ctr off-by-one, docopt-safe USAGE, ComparisonMode Copy, test rename

- Fix empty-input bug: a header-only CSV used to report
  `record_count: 1` because the loop incremented scan_ctr once even when
  the initial read_byte_record returned false. Guard the loop and
  simplify the JSON path to always use scan_ctr (do_json forces a full
  scan, so it's authoritative).
- Reword the --ignore-case USAGE block so no continuation line begins
  with `--natural` (docopt scans for option-like prefixes; lines starting
  with `--` could be misinterpreted as a second option definition).
- Also clarify the wording: --ignore-case is only ignored under PURE
  numeric comparison (--numeric without --natural), since the documented
  precedence allows --natural --numeric --ignore-case where ignore-case
  does apply.
- Derive Clone/Copy on ComparisonMode to mirror SortMode in cmd/sort.rs.
  Functionally a no-op (matching unit variants doesn't move), but
  defensive and consistent with sort.rs.
- Rename test sortcheck_ignore_case_sorted -> sortcheck_ignore_case_notsorted
  (the test asserts an error; the prior name was misleading — the
  truly-sorted-under-ignore-case case is covered by
  sortcheck_ignore_case_actually_sorted).
- Update sortcheck_empty test to expect record_count: 0 (was 1, codifying
  the off-by-one).
- Regenerated docs/help/sortcheck.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jqnatividad added a commit that referenced this pull request Apr 27, 2026
Four review findings, three correctness, one test hygiene — all valid:

1-3. ExtDedupCache::insert only checked the in-memory `memo` set, but
     dump_to_disk() drains memo when the memory limit is hit. Once
     spilled, items that exist only on disk would look "new" to a
     memo-only check and be re-emitted as duplicates. The previous
     extdedup code papered over this by calling contains() (which DOES
     check both) before insert(); folding to a single insert() call in
     this PR exposed the latent bug.

     Fix in odhtcache:
     - insert() now consults both memo and disk before declaring a new
       insertion, matching its docstring contract.
     - Extracted contains_on_disk() helper used by both contains() and
       insert(), removing duplicated mmap/HashTable plumbing.
     - Updated docs on insert() and contains() to reflect the new
       semantics.

     Caller behavior in extdedup is unchanged (insert() still returns
     true exactly when the row is genuinely new), so no extdedup edits
     are needed and the single-touch pattern from the original cleanup
     is preserved.

4. extdedup_csvmode_key_collision invoked the command twice — once via
   wrk.output() for stderr, then again via wrk.read_stdout() for stdout
   — comparing streams from two different runs. Refactored to a single
   wrk.output() and assert stdout/stderr from the same Output (matches
   the dedup_sorted_empty fix in PR #3754).

Verification: all 8 extdedup tests pass, including extdedup_large_memory_test
(2.5M unique + 2.5M dupes under --memory-limit 1, exercises spill-to-disk).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jqnatividad added a commit that referenced this pull request Apr 27, 2026
…ssues (#3759)

* extdedup: review-driven cleanup, fix key-collision and dupes-writer issues

Correctness:
- CSV mode: insert a US (\x1F) separator between selected fields when
  building the dedup key. Previously fields were concatenated raw, so
  rows ("ab","cd") and ("a","bcd") both produced key "abcd" and the
  second row was silently treated as a duplicate.
- CSV mode: honor --no-output (was silently ignored — only line mode
  respected it). USAGE updated to note both modes apply.
- CSV mode: dupewtr is now Option<csv::Writer>; the previous code
  unconditionally constructed Config::new(None).writer() which targeted
  stdout when --dupes-output was absent and then flushed it, contending
  with the main writer. Mirrors the dedup cleanup in 08f2815.
- Line mode: dupes_writer is now Option<BufWriter>, dropping the
  /dev/null / "nul" placeholder file handle and the cfg(target_family)
  split.

Efficiency / cleanup:
- Single hash-table touch per row using insert()'s bool return value
  instead of contains() + insert().
- Drop dead utf8_string intermediate buffer in the UTF-8 fallback path;
  String::from_utf8_lossy returns Cow<str> which can be pushed directly.
- Drop unnecessary clone_from in both modes (byte_records yields owned
  ByteRecords; line? already owns the String).
- let-else replaces flag_select.unwrap() in dedup_csv.
- Key buffer capacity bumped from 20 to 256 (more realistic default).

Docs:
- USAGE clarifies --dupes-output format per mode (CSV: valid CSV with
  leading dupe_rowno column; LINE: tab-prefixed, not a CSV).
- Regenerated docs/help/extdedup.md.

odhtcache: ExtDedupCache::contains is now unused by bin targets but
kept as part of the cache's public API with #[allow(dead_code)] and
a doc note pointing at the new insert-return-value pattern.

Tests:
- extdedup_csvmode_key_collision — regression for the field-separator
  bug; ("ab","cd") and ("a","bcd") must both survive.
- extdedup_csvmode_no_output — --no-output produces empty stdout but
  still reports the duplicate count on stderr.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* address copilot review on PR #3759

Four review findings, three correctness, one test hygiene — all valid:

1-3. ExtDedupCache::insert only checked the in-memory `memo` set, but
     dump_to_disk() drains memo when the memory limit is hit. Once
     spilled, items that exist only on disk would look "new" to a
     memo-only check and be re-emitted as duplicates. The previous
     extdedup code papered over this by calling contains() (which DOES
     check both) before insert(); folding to a single insert() call in
     this PR exposed the latent bug.

     Fix in odhtcache:
     - insert() now consults both memo and disk before declaring a new
       insertion, matching its docstring contract.
     - Extracted contains_on_disk() helper used by both contains() and
       insert(), removing duplicated mmap/HashTable plumbing.
     - Updated docs on insert() and contains() to reflect the new
       semantics.

     Caller behavior in extdedup is unchanged (insert() still returns
     true exactly when the row is genuinely new), so no extdedup edits
     are needed and the single-touch pattern from the original cleanup
     is preserved.

4. extdedup_csvmode_key_collision invoked the command twice — once via
   wrk.output() for stderr, then again via wrk.read_stdout() for stdout
   — comparing streams from two different runs. Refactored to a single
   wrk.output() and assert stdout/stderr from the same Output (matches
   the dedup_sorted_empty fix in PR #3754).

Verification: all 8 extdedup tests pass, including extdedup_large_memory_test
(2.5M unique + 2.5M dupes under --memory-limit 1, exercises spill-to-disk).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* address roborev #1725 — make key-collision test CRLF-safe

One Low/Medium finding: extdedup_csvmode_key_collision asserted
`assert_eq!(stdout.trim_end_matches(['\r', '\n']), "a,b\nab,cd\na,bcd")`,
which trims only TRAILING line endings. On Windows runners csv::Writer
may emit `\r\n` between records, so internal separators in stdout would
remain `\r\n` and break the equality check. The previous form used
`read_stdout` (CSV-parsed, line-ending agnostic) but was rejected by
PR #3759 review for running the command twice.

Fix: normalize CRLF→LF via newline_converter::dos2unix before
comparing. Matches the pattern already used by extdedup_linemode and
extdedup_linemode_dupesoutput. Single-execution Output is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants