template: review-driven cleanup (subdir bug, lookup perf, render-error visibility, helper extraction)#3740
Merged
Merged
Conversation
B1: --outdir subdirectory bucketing used a batch-local index, so for inputs spanning multiple batches the same subdir names were reused across batches and the documented "outsubdir-size files per subdir" contract broke. Compute the bucket from a global row number instead, and persist current_subdir across batches. Switch the per-batch create_dir to create_dir_all so a bucket that straddles a batch boundary doesn't trip the existence check. Add a regression test covering 60k rows / 4 jobs / 5k subdir-size. B2: -j was declared as the short flag for both --globals-json and --jobs in the docopt USAGE; --jobs silently won, leaving the documented short for --globals-json non-functional. Drop the -j short from --globals-json. Also fix a missing space in the --no-headers help text. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
M1: case-insensitive lookups linear-scanned the lookup table and re-lowercased every key on every filter call (O(N) per call, O(N*M) across an input). Wrap each table in a LookupTable struct that holds the row map plus a pre-built lowercased-key -> original-key index; case-insensitive lookup is now a single hash hit. Also: * Add a comment on the register_lookup pre-scan regex documenting that it is intentionally a textual scan, not a parser, and what the failure modes are. * Inline the unused intermediate bindings in to_bool. * Replace .unwrap() on the QSV_ROWNO atoi_simd parse with an .expect documenting the producer-side invariant. * Document that substr operates on byte offsets, not chars. * Replace the per-row sized-to-payload BufWriter (no real buffering) with a direct fs::write. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
M2: per-row template render failures used to be written into the output
file (or stdout) but produced no aggregate signal — a CI job that only
inspects the exit code or the last few lines of stderr would miss the
failure entirely. Track failures in a static AtomicU64 from inside the
parallel render closure and emit a single "N row(s) failed to render"
summary on stderr at end-of-run. The exit code is still 0; the per-row
"RENDERING ERROR (rowno): ..." marker is still in the output, so this
is purely additive. Add a regression test.
M3: register_lookup canonicalized CSV keys (trim, then i64- or f64-
parse, then itoa or zmij) but the lookup filter only applied the
canonicalization to numeric Values — a string value with whitespace
or a different numeric form would silently miss. Factor the
canonicalization into normalize_lookup_key(), use it on both sides,
and collapse whole-number floats ("100.0") into the i64 form so they
share a key with integer 100. Add a regression test covering
whitespace, "100.0" vs "100", and trailing-space string keys.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* lookup_filter "not found" diagnostic now reports the user-supplied input instead of the post-normalization form, so a template that passed " 42 " sees that string in the error rather than the canonicalized "42". * normalize_lookup_key now returns Cow<'_, str> and lookup_filter stringifies its input as a borrow against stack-resident itoa/zmij buffers — the common case (already-canonical string lookup) no longer allocates a String per filter call. * RENDER_ERROR_COUNT is now zeroed at the top of run() so a second in-process invocation (tests, embedding, future REPL/MCP path) reports its own count instead of inheriting the previous run's. * normalize_lookup_key f64 path now uses an exclusive upper bound (num < i64::MAX as f64) so the boundary value 2^63 — which rounds up from i64::MAX during the f64 cast and would saturate back to i64::MAX on the i64 cast — falls through to zmij instead of producing a misleading canonical key. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* --outsubdir-size 0 used to reach an integer divide-by-zero inside the per-row bucket calculation `(global_row - 1) / outsubdir_numfiles` and panic the moment we tried to place a file in a subdirectory. The failure was silent at parse time because docopt accepts any u16. Reject the value up front in run() with an "incorrect usage" error and add a regression test. * Document why the register_lookup pre-scan eagerly registers every syntactic register_lookup() call in the template, including ones guarded by conditional logic that the per-row render would skip: doing the work up front trades wasted-on-unused-lookups effort for fail-fast on a malformed URL or unreachable lookup at startup, instead of failing at the first row that triggers the conditional. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up the short-flag addition from 919f613. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The subdir-name width was reusing the row-count width, so a 60k-row run with --outsubdir-size 5000 produced subdirs named "00000".."00011" (5 digits) when "00".."11" suffices. Compute a separate subdir_width from (rowcount - 1) / outsubdir_numfiles for the known-rowcount path, and fall back to the existing width for stdin (where rowcount is unknown until the stream is consumed). File names continue to pad to the row-count width so output sorts correctly. Update the multibatch regression test to assert the new (decoupled) file-vs-subdir widths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
run() was a single 462-line function. Pull four self-contained chunks into named helpers above it so the body of run() reads as the actual top-level flow: * register_qsv_extensions(env) — minijinja_contrib + the qsv-specific function and filter set. Documents why the filename-template env intentionally does NOT call this. * open_bulk_writer(output) — the BufWriter-wrapping match for stdout vs. file output. * pre_register_lookups(env, template) — the regex pre-scan that registers every register_lookup() call before the batch loop. * init_lookup_globals(...) — DELIMITER / QSV_CACHE_DIR / CKAN_API / CKAN_TOKEN OnceLock initialization, with QSV_CKAN_* env vars taking precedence over the corresponding CLI flags. Pure refactor — no behavior change. run() is now 376 lines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous extraction commit inserted four helper functions between the existing doc comment for normalize_lookup_key and the function itself, leaving the doc visually attached to init_lookup_globals. Move the comment back down adjacent to its function. Addresses roborev review job 1658. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 73 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Contributor
There was a problem hiding this comment.
Pull request overview
End-to-end cleanup and correctness/performance improvements for qsv template, focused on output directory bucketing, lookup behavior/performance, render-error visibility, and run() refactoring.
Changes:
- Fix multi-batch output subdirectory bucketing (global row-based bucketing;
create_dir_all; improved padding logic). - Improve lookup behavior and performance (shared key normalization, O(1) case-insensitive lookup via precomputed index, reduced allocations).
- Add end-of-run stderr summary for per-row render failures; refactor
run()by extracting helper functions; regenerate help docs and add regressions.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/cmd/template.rs |
Core fixes + perf improvements (lookup), render-error summary, and run() helper extraction/refactor. |
tests/test_template.rs |
Adds regression coverage for multibatch subdir layout, --outsubdir-size 0, render-error stderr summary, and key normalization. |
docs/help/template.md |
Regenerated help reflecting CLI flag changes and wording tweaks. |
The earlier validator rejects --outsubdir-size 0 only when output_to_dir
is true, but the subdir_width computation ran unconditionally and would
panic with "attempt to divide by zero" when --outsubdir-size 0 was
passed without an <outdir>. Add the output_to_dir guard to the width
calculation; the flag is meaningless without an outdir anyway.
Reproducer:
$ printf "a\n1\n" | qsv template --template '{{a}}' --outsubdir-size 0
thread 'main' panicked at ... attempt to divide by zero
Addresses Copilot review on PR #3740.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
End-to-end review of
qsv template(src/cmd/template.rs) plus follow-upfixes for everything I found, two roborev review passes, and a structural
refactor of
run().Bugs fixed
<outdir>/<subdir>/...bucketing used a batch-local index, so on inputs spanning more than one
batch (e.g.
--batch 50000+ 60k rows) every batch wrote into the samesubdir range and the documented "outsubdir-size files per subdir"
contract was violated. Fixed by deriving the bucket from a global row
number and persisting
current_subdiracross batches.-jshort flag collision.-jwas declared for both--globals-jsonand
--jobsin docopt;--jobssilently won. Dropped the duplicate;--globals-jsonnow uses-J.--outsubdir-size 0. Reached an integerdivide-by-zero inside the per-row bucket calc. Now rejected up front
with a clear "incorrect usage" error.
Behavior changes
failures used to be silently written into the output and the command
exited 0. Still exits 0; still writes the per-row marker, but now also
prints
qsv template: N row(s) failed to renderto stderr at end-of-run." 42 "or"100.0"and a template-side integer42/ string"42"nowhash to the same key. Whole-number floats collapse to integer form.
Performance
key index at
register_lookuptime so|lookup(..., false)no longerlinear-scans the table per row.
lookup_filterno longer allocates aStringper call for thecommon case (already-canonical string lookup):
normalize_lookup_keyreturns
Cow<'_, str>and the filter borrows from stack-residentitoa/zmijbuffers.BufWriter(no actualbuffering benefit) to a direct
fs::write.Refactor
run()into named helpers aboveit:
register_qsv_extensions,open_bulk_writer,pre_register_lookups,init_lookup_globals.run()is down from 462 to ~370 lines.Other
create_dir→create_dir_all(idempotent across batch-boundary subdirs).(
00..11instead of00000..00011).expect()message on aproducer-side invariant, byte-vs-char doc note on
substr, comment onthe
register_lookupregex pre-scan and its trade-off.docs/help/template.md) regenerated.Tests
--outsubdir-size 0rejection, render-error stderr summary, lookup-key normalization
(whitespace,
100.0≡100).Test plan
cargo test -F all_features template— 45 / 45 passcargo clippy -F all_features --bin qsv -- -D warnings— clean--batch 50000/--jobs 4/--outsubdir-size 5000produces 12 contiguous subdirs of 5,000 filestemplates (render-error stderr line; lookup keys with whitespace /
42.0form)🤖 Generated with Claude Code