Skip to content

template: review-driven cleanup (subdir bug, lookup perf, render-error visibility, helper extraction)#3740

Merged
jqnatividad merged 12 commits into
masterfrom
template-review-202604
Apr 25, 2026
Merged

template: review-driven cleanup (subdir bug, lookup perf, render-error visibility, helper extraction)#3740
jqnatividad merged 12 commits into
masterfrom
template-review-202604

Conversation

@jqnatividad

Copy link
Copy Markdown
Collaborator

Summary

End-to-end review of qsv template (src/cmd/template.rs) plus follow-up
fixes for everything I found, two roborev review passes, and a structural
refactor of run().

Bugs fixed

  • Subdirectory layout broken across batches. The <outdir>/<subdir>/...
    bucketing used a batch-local index, so on inputs spanning more than one
    batch (e.g. --batch 50000 + 60k rows) every batch wrote into the same
    subdir range and the documented "outsubdir-size files per subdir"
    contract was violated. Fixed by deriving the bucket from a global row
    number and persisting current_subdir across batches.
  • -j short flag collision. -j was declared for both --globals-json
    and --jobs in docopt; --jobs silently won. Dropped the duplicate;
    --globals-json now uses -J.
  • Latent panic on --outsubdir-size 0. Reached an integer
    divide-by-zero inside the per-row bucket calc. Now rejected up front
    with a clear "incorrect usage" error.

Behavior changes

  • Render errors now surface a stderr summary. Per-row template render
    failures used to be silently written into the output and the command
    exited 0. Still exits 0; still writes the per-row marker, but now also
    prints qsv template: N row(s) failed to render to stderr at end-of-run.
  • Lookup-key normalization unified across both sides. A CSV key like
    " 42 " or "100.0" and a template-side integer 42 / string "42" now
    hash to the same key. Whole-number floats collapse to integer form.

Performance

  • Case-insensitive lookups: O(N) → O(1). Builds a lowercased→original
    key index at register_lookup time so |lookup(..., false) no longer
    linear-scans the table per row.
  • lookup_filter no longer allocates a String per call for the
    common case (already-canonical string lookup): normalize_lookup_key
    returns Cow<'_, str> and the filter borrows from stack-resident
    itoa/zmij buffers.
  • Per-row file write went from a payload-sized BufWriter (no actual
    buffering benefit) to a direct fs::write.

Refactor

  • Extracted four self-contained chunks of run() into named helpers above
    it: register_qsv_extensions, open_bulk_writer, pre_register_lookups,
    init_lookup_globals. run() is down from 462 to ~370 lines.

Other

  • create_dircreate_dir_all (idempotent across batch-boundary subdirs).
  • Subdir name padding now sized to subdir count, not row count
    (00..11 instead of 00000..00011).
  • Various comment/diagnostic cleanups, an expect() message on a
    producer-side invariant, byte-vs-char doc note on substr, comment on
    the register_lookup regex pre-scan and its trade-off.
  • Help docs (docs/help/template.md) regenerated.

Tests

  • 45 template tests pass (was 41 before the review).
  • New regressions cover: multi-batch subdir layout, --outsubdir-size 0
    rejection, render-error stderr summary, lookup-key normalization
    (whitespace, 100.0100).

Test plan

  • cargo test -F all_features template — 45 / 45 pass
  • cargo clippy -F all_features --bin qsv -- -D warnings — clean
  • Manual reproducer: 60k rows / --batch 50000 / --jobs 4 /
    --outsubdir-size 5000 produces 12 contiguous subdirs of 5,000 files
  • Reviewer to spot-check the M2/M3 behavior changes vs. their own
    templates (render-error stderr line; lookup keys with whitespace /
    42.0 form)

🤖 Generated with Claude Code

jqnatividad and others added 11 commits April 25, 2026 06:38
B1: --outdir subdirectory bucketing used a batch-local index, so for
inputs spanning multiple batches the same subdir names were reused
across batches and the documented "outsubdir-size files per subdir"
contract broke. Compute the bucket from a global row number instead,
and persist current_subdir across batches. Switch the per-batch
create_dir to create_dir_all so a bucket that straddles a batch
boundary doesn't trip the existence check. Add a regression test
covering 60k rows / 4 jobs / 5k subdir-size.

B2: -j was declared as the short flag for both --globals-json and
--jobs in the docopt USAGE; --jobs silently won, leaving the
documented short for --globals-json non-functional. Drop the -j short
from --globals-json. Also fix a missing space in the --no-headers
help text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
M1: case-insensitive lookups linear-scanned the lookup table and
re-lowercased every key on every filter call (O(N) per call, O(N*M)
across an input). Wrap each table in a LookupTable struct that holds
the row map plus a pre-built lowercased-key -> original-key index;
case-insensitive lookup is now a single hash hit.

Also:
* Add a comment on the register_lookup pre-scan regex documenting that
  it is intentionally a textual scan, not a parser, and what the
  failure modes are.
* Inline the unused intermediate bindings in to_bool.
* Replace .unwrap() on the QSV_ROWNO atoi_simd parse with an .expect
  documenting the producer-side invariant.
* Document that substr operates on byte offsets, not chars.
* Replace the per-row sized-to-payload BufWriter (no real buffering)
  with a direct fs::write.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
M2: per-row template render failures used to be written into the output
file (or stdout) but produced no aggregate signal — a CI job that only
inspects the exit code or the last few lines of stderr would miss the
failure entirely. Track failures in a static AtomicU64 from inside the
parallel render closure and emit a single "N row(s) failed to render"
summary on stderr at end-of-run. The exit code is still 0; the per-row
"RENDERING ERROR (rowno): ..." marker is still in the output, so this
is purely additive. Add a regression test.

M3: register_lookup canonicalized CSV keys (trim, then i64- or f64-
parse, then itoa or zmij) but the lookup filter only applied the
canonicalization to numeric Values — a string value with whitespace
or a different numeric form would silently miss. Factor the
canonicalization into normalize_lookup_key(), use it on both sides,
and collapse whole-number floats ("100.0") into the i64 form so they
share a key with integer 100. Add a regression test covering
whitespace, "100.0" vs "100", and trailing-space string keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* lookup_filter "not found" diagnostic now reports the user-supplied
  input instead of the post-normalization form, so a template that
  passed " 42 " sees that string in the error rather than the
  canonicalized "42".
* normalize_lookup_key now returns Cow<'_, str> and lookup_filter
  stringifies its input as a borrow against stack-resident itoa/zmij
  buffers — the common case (already-canonical string lookup) no
  longer allocates a String per filter call.
* RENDER_ERROR_COUNT is now zeroed at the top of run() so a second
  in-process invocation (tests, embedding, future REPL/MCP path)
  reports its own count instead of inheriting the previous run's.
* normalize_lookup_key f64 path now uses an exclusive upper bound
  (num < i64::MAX as f64) so the boundary value 2^63 — which rounds
  up from i64::MAX during the f64 cast and would saturate back to
  i64::MAX on the i64 cast — falls through to zmij instead of
  producing a misleading canonical key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* --outsubdir-size 0 used to reach an integer divide-by-zero inside the
  per-row bucket calculation `(global_row - 1) / outsubdir_numfiles` and
  panic the moment we tried to place a file in a subdirectory. The
  failure was silent at parse time because docopt accepts any u16. Reject
  the value up front in run() with an "incorrect usage" error and add a
  regression test.

* Document why the register_lookup pre-scan eagerly registers every
  syntactic register_lookup() call in the template, including ones
  guarded by conditional logic that the per-row render would skip:
  doing the work up front trades wasted-on-unused-lookups effort for
  fail-fast on a malformed URL or unreachable lookup at startup, instead
  of failing at the first row that triggers the conditional.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up the short-flag addition from 919f613.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The subdir-name width was reusing the row-count width, so a 60k-row run
with --outsubdir-size 5000 produced subdirs named "00000".."00011"
(5 digits) when "00".."11" suffices. Compute a separate subdir_width
from (rowcount - 1) / outsubdir_numfiles for the known-rowcount path,
and fall back to the existing width for stdin (where rowcount is
unknown until the stream is consumed). File names continue to pad to
the row-count width so output sorts correctly.

Update the multibatch regression test to assert the new (decoupled)
file-vs-subdir widths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
run() was a single 462-line function. Pull four self-contained chunks
into named helpers above it so the body of run() reads as the actual
top-level flow:

* register_qsv_extensions(env) — minijinja_contrib + the qsv-specific
  function and filter set. Documents why the filename-template env
  intentionally does NOT call this.
* open_bulk_writer(output) — the BufWriter-wrapping match for stdout
  vs. file output.
* pre_register_lookups(env, template) — the regex pre-scan that
  registers every register_lookup() call before the batch loop.
* init_lookup_globals(...) — DELIMITER / QSV_CACHE_DIR / CKAN_API /
  CKAN_TOKEN OnceLock initialization, with QSV_CKAN_* env vars taking
  precedence over the corresponding CLI flags.

Pure refactor — no behavior change. run() is now 376 lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous extraction commit inserted four helper functions between
the existing doc comment for normalize_lookup_key and the function
itself, leaving the doc visually attached to init_lookup_globals.
Move the comment back down adjacent to its function.

Addresses roborev review job 1658.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codacy-production

codacy-production Bot commented Apr 25, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 73 complexity

Metric Results
Complexity 73

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

End-to-end cleanup and correctness/performance improvements for qsv template, focused on output directory bucketing, lookup behavior/performance, render-error visibility, and run() refactoring.

Changes:

  • Fix multi-batch output subdirectory bucketing (global row-based bucketing; create_dir_all; improved padding logic).
  • Improve lookup behavior and performance (shared key normalization, O(1) case-insensitive lookup via precomputed index, reduced allocations).
  • Add end-of-run stderr summary for per-row render failures; refactor run() by extracting helper functions; regenerate help docs and add regressions.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
src/cmd/template.rs Core fixes + perf improvements (lookup), render-error summary, and run() helper extraction/refactor.
tests/test_template.rs Adds regression coverage for multibatch subdir layout, --outsubdir-size 0, render-error stderr summary, and key normalization.
docs/help/template.md Regenerated help reflecting CLI flag changes and wording tweaks.

Comment thread src/cmd/template.rs Outdated
The earlier validator rejects --outsubdir-size 0 only when output_to_dir
is true, but the subdir_width computation ran unconditionally and would
panic with "attempt to divide by zero" when --outsubdir-size 0 was
passed without an <outdir>. Add the output_to_dir guard to the width
calculation; the flag is meaningless without an outdir anyway.

Reproducer:
  $ printf "a\n1\n" | qsv template --template '{{a}}' --outsubdir-size 0
  thread 'main' panicked at ... attempt to divide by zero

Addresses Copilot review on PR #3740.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 5e2cc8e into master Apr 25, 2026
15 checks passed
@jqnatividad jqnatividad deleted the template-review-202604 branch April 25, 2026 12:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants