template: review-driven cleanup (subdir bug, lookup perf, render-error visibility, helper extraction)#3740

jqnatividad · 2026-04-25T12:03:46Z

Summary

End-to-end review of qsv template (src/cmd/template.rs) plus follow-up
fixes for everything I found, two roborev review passes, and a structural
refactor of run().

Bugs fixed

Subdirectory layout broken across batches. The <outdir>/<subdir>/...
bucketing used a batch-local index, so on inputs spanning more than one
batch (e.g. --batch 50000 + 60k rows) every batch wrote into the same
subdir range and the documented "outsubdir-size files per subdir"
contract was violated. Fixed by deriving the bucket from a global row
number and persisting current_subdir across batches.
-j short flag collision. -j was declared for both --globals-json
and --jobs in docopt; --jobs silently won. Dropped the duplicate;
--globals-json now uses -J.
Latent panic on --outsubdir-size 0. Reached an integer
divide-by-zero inside the per-row bucket calc. Now rejected up front
with a clear "incorrect usage" error.

Behavior changes

Render errors now surface a stderr summary. Per-row template render
failures used to be silently written into the output and the command
exited 0. Still exits 0; still writes the per-row marker, but now also
prints qsv template: N row(s) failed to render to stderr at end-of-run.
Lookup-key normalization unified across both sides. A CSV key like
" 42 " or "100.0" and a template-side integer 42 / string "42" now
hash to the same key. Whole-number floats collapse to integer form.

Performance

Case-insensitive lookups: O(N) → O(1). Builds a lowercased→original
key index at register_lookup time so |lookup(..., false) no longer
linear-scans the table per row.
lookup_filter no longer allocates a String per call for the
common case (already-canonical string lookup): normalize_lookup_key
returns Cow<'_, str> and the filter borrows from stack-resident
itoa/zmij buffers.
Per-row file write went from a payload-sized BufWriter (no actual
buffering benefit) to a direct fs::write.

Refactor

Extracted four self-contained chunks of run() into named helpers above
it: register_qsv_extensions, open_bulk_writer, pre_register_lookups,
init_lookup_globals. run() is down from 462 to ~370 lines.

Other

create_dir → create_dir_all (idempotent across batch-boundary subdirs).
Subdir name padding now sized to subdir count, not row count
(00..11 instead of 00000..00011).
Various comment/diagnostic cleanups, an expect() message on a
producer-side invariant, byte-vs-char doc note on substr, comment on
the register_lookup regex pre-scan and its trade-off.
Help docs (docs/help/template.md) regenerated.

Tests

45 template tests pass (was 41 before the review).
New regressions cover: multi-batch subdir layout, --outsubdir-size 0
rejection, render-error stderr summary, lookup-key normalization
(whitespace, 100.0 ≡ 100).

Test plan

cargo test -F all_features template — 45 / 45 pass
cargo clippy -F all_features --bin qsv -- -D warnings — clean
Manual reproducer: 60k rows / --batch 50000 / --jobs 4 /
--outsubdir-size 5000 produces 12 contiguous subdirs of 5,000 files
Reviewer to spot-check the M2/M3 behavior changes vs. their own
templates (render-error stderr line; lookup keys with whitespace /
42.0 form)

🤖 Generated with Claude Code

B1: --outdir subdirectory bucketing used a batch-local index, so for inputs spanning multiple batches the same subdir names were reused across batches and the documented "outsubdir-size files per subdir" contract broke. Compute the bucket from a global row number instead, and persist current_subdir across batches. Switch the per-batch create_dir to create_dir_all so a bucket that straddles a batch boundary doesn't trip the existence check. Add a regression test covering 60k rows / 4 jobs / 5k subdir-size. B2: -j was declared as the short flag for both --globals-json and --jobs in the docopt USAGE; --jobs silently won, leaving the documented short for --globals-json non-functional. Drop the -j short from --globals-json. Also fix a missing space in the --no-headers help text. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

M1: case-insensitive lookups linear-scanned the lookup table and re-lowercased every key on every filter call (O(N) per call, O(N*M) across an input). Wrap each table in a LookupTable struct that holds the row map plus a pre-built lowercased-key -> original-key index; case-insensitive lookup is now a single hash hit. Also: * Add a comment on the register_lookup pre-scan regex documenting that it is intentionally a textual scan, not a parser, and what the failure modes are. * Inline the unused intermediate bindings in to_bool. * Replace .unwrap() on the QSV_ROWNO atoi_simd parse with an .expect documenting the producer-side invariant. * Document that substr operates on byte offsets, not chars. * Replace the per-row sized-to-payload BufWriter (no real buffering) with a direct fs::write. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

M2: per-row template render failures used to be written into the output file (or stdout) but produced no aggregate signal — a CI job that only inspects the exit code or the last few lines of stderr would miss the failure entirely. Track failures in a static AtomicU64 from inside the parallel render closure and emit a single "N row(s) failed to render" summary on stderr at end-of-run. The exit code is still 0; the per-row "RENDERING ERROR (rowno): ..." marker is still in the output, so this is purely additive. Add a regression test. M3: register_lookup canonicalized CSV keys (trim, then i64- or f64- parse, then itoa or zmij) but the lookup filter only applied the canonicalization to numeric Values — a string value with whitespace or a different numeric form would silently miss. Factor the canonicalization into normalize_lookup_key(), use it on both sides, and collapse whole-number floats ("100.0") into the i64 form so they share a key with integer 100. Add a regression test covering whitespace, "100.0" vs "100", and trailing-space string keys. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* lookup_filter "not found" diagnostic now reports the user-supplied input instead of the post-normalization form, so a template that passed " 42 " sees that string in the error rather than the canonicalized "42". * normalize_lookup_key now returns Cow<'_, str> and lookup_filter stringifies its input as a borrow against stack-resident itoa/zmij buffers — the common case (already-canonical string lookup) no longer allocates a String per filter call. * RENDER_ERROR_COUNT is now zeroed at the top of run() so a second in-process invocation (tests, embedding, future REPL/MCP path) reports its own count instead of inheriting the previous run's. * normalize_lookup_key f64 path now uses an exclusive upper bound (num < i64::MAX as f64) so the boundary value 2^63 — which rounds up from i64::MAX during the f64 cast and would saturate back to i64::MAX on the i64 cast — falls through to zmij instead of producing a misleading canonical key. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* --outsubdir-size 0 used to reach an integer divide-by-zero inside the per-row bucket calculation `(global_row - 1) / outsubdir_numfiles` and panic the moment we tried to place a file in a subdirectory. The failure was silent at parse time because docopt accepts any u16. Reject the value up front in run() with an "incorrect usage" error and add a regression test. * Document why the register_lookup pre-scan eagerly registers every syntactic register_lookup() call in the template, including ones guarded by conditional logic that the per-row render would skip: doing the work up front trades wasted-on-unused-lookups effort for fail-fast on a malformed URL or unreachable lookup at startup, instead of failing at the first row that triggers the conditional. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Picks up the short-flag addition from 919f613. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The subdir-name width was reusing the row-count width, so a 60k-row run with --outsubdir-size 5000 produced subdirs named "00000".."00011" (5 digits) when "00".."11" suffices. Compute a separate subdir_width from (rowcount - 1) / outsubdir_numfiles for the known-rowcount path, and fall back to the existing width for stdin (where rowcount is unknown until the stream is consumed). File names continue to pad to the row-count width so output sorts correctly. Update the multibatch regression test to assert the new (decoupled) file-vs-subdir widths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

run() was a single 462-line function. Pull four self-contained chunks into named helpers above it so the body of run() reads as the actual top-level flow: * register_qsv_extensions(env) — minijinja_contrib + the qsv-specific function and filter set. Documents why the filename-template env intentionally does NOT call this. * open_bulk_writer(output) — the BufWriter-wrapping match for stdout vs. file output. * pre_register_lookups(env, template) — the regex pre-scan that registers every register_lookup() call before the batch loop. * init_lookup_globals(...) — DELIMITER / QSV_CACHE_DIR / CKAN_API / CKAN_TOKEN OnceLock initialization, with QSV_CKAN_* env vars taking precedence over the corresponding CLI flags. Pure refactor — no behavior change. run() is now 376 lines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous extraction commit inserted four helper functions between the existing doc comment for normalize_lookup_key and the function itself, leaving the doc visually attached to init_lookup_globals. Move the comment back down adjacent to its function. Addresses roborev review job 1658. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codacy-production · 2026-04-25T12:07:45Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 73 complexity

Metric Results

Complexity 73

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

Copilot

Pull request overview

End-to-end cleanup and correctness/performance improvements for qsv template, focused on output directory bucketing, lookup behavior/performance, render-error visibility, and run() refactoring.

Changes:

Fix multi-batch output subdirectory bucketing (global row-based bucketing; create_dir_all; improved padding logic).
Improve lookup behavior and performance (shared key normalization, O(1) case-insensitive lookup via precomputed index, reduced allocations).
Add end-of-run stderr summary for per-row render failures; refactor run() by extracting helper functions; regenerate help docs and add regressions.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
`src/cmd/template.rs`	Core fixes + perf improvements (lookup), render-error summary, and `run()` helper extraction/refactor.
`tests/test_template.rs`	Adds regression coverage for multibatch subdir layout, `--outsubdir-size 0`, render-error stderr summary, and key normalization.
`docs/help/template.md`	Regenerated help reflecting CLI flag changes and wording tweaks.

The earlier validator rejects --outsubdir-size 0 only when output_to_dir is true, but the subdir_width computation ran unconditionally and would panic with "attempt to divide by zero" when --outsubdir-size 0 was passed without an <outdir>. Add the output_to_dir guard to the width calculation; the flag is meaningless without an outdir anyway. Reproducer: $ printf "a\n1\n" | qsv template --template '{{a}}' --outsubdir-size 0 thread 'main' panicked at ... attempt to divide by zero Addresses Copilot review on PR #3740. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jqnatividad and others added 11 commits April 25, 2026 06:38

chore(template): add -J shortcut for --global-json option

919f613

docs: regenerate template help for -J short on --globals-json

c0c72c8

Picks up the short-flag addition from 919f613. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(template): mute clippy::cast_precision_loss lint

552769b

jqnatividad requested a review from Copilot April 25, 2026 12:04

Copilot started reviewing on behalf of jqnatividad April 25, 2026 12:04 View session

Copilot AI reviewed Apr 25, 2026

View reviewed changes

Comment thread src/cmd/template.rs Outdated

jqnatividad merged commit 5e2cc8e into master Apr 25, 2026
15 checks passed

jqnatividad deleted the template-review-202604 branch April 25, 2026 12:13

BrewTestBot mentioned this pull request May 3, 2026

qsv 20.0.0 Homebrew/homebrew-core#280687

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

template: review-driven cleanup (subdir bug, lookup perf, render-error visibility, helper extraction)#3740