Skip to content

feat(describegpt): add --markdown-template for customizable Markdown output#3834

Merged
jqnatividad merged 7 commits into
masterfrom
describegpt-markdown-template
May 8, 2026
Merged

feat(describegpt): add --markdown-template for customizable Markdown output#3834
jqnatividad merged 7 commits into
masterfrom
describegpt-markdown-template

Conversation

@jqnatividad

@jqnatividad jqnatividad commented May 7, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Adds --markdown-template <file> to describegpt so users can customize Markdown output without recompiling, with an embedded resources/describegpt_md_defaults.toml default that reproduces the legacy format!() wrapper byte-for-byte.
  • Refactors the dictionary table from formatters::format_dictionary_markdown into a dictionary_md_body_template, exposing structured entries + addl_col_names plus custom Mini Jinja filters (pipe_escape, br_replace, human_count, dict_cell, humanize_examples) so the per-field layout is also TOML-driven.
  • Threads a single SharedRenderCtx (one attribution + one timestamp) through both render helpers per phase, caches the Mini Jinja Environment in a LazyLock, and locks in byte-identity to the legacy output via three embedded tests.
  • Drive-by: fixes a pre-existing OnceLock::set().unwrap() race in get_prompt_file that could panic the loser when two threads raced to initialize the prompt file (latent until two unit tests started exercising it concurrently).

Test plan

  • cargo build --locked --bin qsv -F all_features
  • cargo test describegpt -F all_features — 64 integration tests pass
  • cargo test cmd::describegpt::tests -F all_features — 16 embedded module tests pass (including 3 new byte-identity tests)
  • QSV_TEST_DESCRIBEGPT=1 cargo test describegpt -F all_features — 64 CI tests pass against live LM Studio
  • cargo clippy --bin qsv -F all_features --no-deps — clean
  • qsv --generate-help-mddocs/help/describegpt.md regenerated
  • roborev review sniff: refactor to just use the csv crate and drop the qsv-sniffer crate #1976 — addressed all 5 Low findings, closed

For #3735

🤖 Generated with Claude Code

jqnatividad and others added 3 commits May 7, 2026 16:22
…output

Adds a `--markdown-template <file>` flag and a default
`resources/describegpt_md_defaults.toml` so users can customize the
Markdown layout without recompiling. Per-inference-type templates
(`dictionary_md_template`, `description_md_template`, `tags_md_template`,
`custom_prompt_md_template`) drive the H1/REASONING/TOKEN-USAGE wrapper,
and a separate `dictionary_md_body_template` drives the per-field table
that fills `{{ llm_response }}` for the Dictionary kind. The wrapper
templates and the body template share one Mini Jinja Environment with
custom filters: `pipe_escape`, `br_replace`, `human_count`, `dict_cell`,
`humanize_examples`. Defaults reproduce the legacy hardcoded `format!()`
output byte-for-byte; two embedded tests lock that in.

Also fixes a pre-existing race in `get_prompt_file` where two threads
racing to initialize `PROMPT_FILE` would panic the loser via
`OnceLock::set().unwrap()`; now the loser silently uses the winner's
value (same pattern as the new `get_md_template_file`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Thread a single SharedRenderCtx (attribution + timestamp) through
  render_dictionary_md_body and render_markdown_template so the body
  footer's attribution and the wrapper's `{{ timestamp }}` /
  `{{ generated_by_signature }}` are byte-identical in the same phase.
- Cache the Mini Jinja Environment in a LazyLock so filters
  (pipe_escape, br_replace, human_count, dict_cell, humanize_examples)
  are registered once instead of on every render call.
- Add `default_wrapper_templates_are_byte_identical` test plus a TOML
  comment locking the four wrapper templates in sync — drift fails
  loudly instead of silently diverging.
- Reuse the shared timestamp from the SharedRenderCtx (no second
  `chrono::Utc::now()` call per dictionary phase).
- Add a doc comment on PROMPT_FILE / MD_TEMPLATE_FILE explaining the
  test-binary singleton constraint and what tests touching custom
  templates would need.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/cmd/describegpt.rs Dismissed
Comment thread src/cmd/describegpt.rs Dismissed
@codacy-production

codacy-production Bot commented May 7, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics -11 complexity

Metric Results
Complexity -11

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a TOML-driven MiniJinja templating layer for qsv describegpt Markdown output so users can customize the rendered Markdown (including the dictionary table body) without recompiling, while preserving byte-identical legacy output via default templates and tests.

Changes:

  • Introduces --markdown-template <file> and a loader that parses/caches a Markdown template TOML (or uses an embedded default).
  • Refactors dictionary Markdown rendering into MiniJinja templates + custom filters, using a shared per-phase render context (single attribution + timestamp).
  • Adds resources/describegpt_md_defaults.toml, updates generated help docs, and adds tests asserting byte-identity with the legacy Markdown output.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
src/cmd/describegpt/formatters.rs Removes the legacy hardcoded dictionary Markdown formatter (dictionary Markdown now template-driven).
src/cmd/describegpt.rs Adds --markdown-template, template loading/rendering helpers, shared render context, and byte-identity tests; also fixes a OnceLock::set().unwrap() race in prompt loading.
resources/describegpt_md_defaults.toml New embedded default Markdown template TOML (wrapper + dictionary-body templates, with documented variables/filters).
docs/help/describegpt.md Regenerated help docs to include --markdown-template.

Comment thread src/cmd/describegpt.rs Outdated
Comment thread docs/help/describegpt.md Outdated
Comment thread src/cmd/describegpt.rs Outdated
Comment thread src/cmd/describegpt.rs Outdated
jqnatividad and others added 4 commits May 7, 2026 16:44
- USAGE help and generated docs now mention dictionary_md_body_template
  and the available custom Mini Jinja filters, pointing users at the
  inline variable/filter docs in the default TOML.
- Markdown template TOML parse error now includes the source path (or
  "<default embedded TOML>") so users can tell which file failed.
- Metadata fields (name/description/author/version) on
  MarkdownTemplateFile are now `#[serde(default)]` so users can supply
  a minimal --markdown-template file without copying the metadata
  header.
- Replaced struct-level `#[allow(dead_code)]` with per-field allows
  on only the genuinely-unused metadata fields, so future genuinely
  unused additions get caught.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Added MarkdownTemplateOverride struct with per-field Option<String>
  fields and an `apply_to` overlay that fills any omitted field from
  the embedded defaults. Now a user can drop in a --markdown-template
  TOML containing only the templates they want to change instead of
  copying the entire default file. USAGE updated to mention this.
- New `markdown_template_override_falls_back_per_field` test locks in
  that omitted templates fall back to embedded defaults.
- Inlined `source_label` inside the toml::from_str map_err closure
  using `args.flag_markdown_template.as_deref().unwrap_or("...")`,
  removing the cross-branch tuple binding that was fragile under
  future refactors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI uses `cargo +nightly fmt --check`, which respects the unstable
`struct_field_align_threshold = 20` in rustfmt.toml. My local stable
rustfmt silently ignores that option, so a few struct field declarations
landed with hand-aligned colons that nightly rejects. Run nightly fmt
to bring the file in line with what CI expects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…indows CI)

The markdown_default_template_byte_identical_to_legacy and
dictionary_body_default_template_byte_identical_to_legacy tests fail on
Windows runners because:
  - `* text=auto` in .gitattributes lets git check out *.toml with CRLF
    on Windows; `include_str!` then bundles the embedded default
    `describegpt_md_defaults.toml` with CRLF in its template strings.
  - Mini Jinja preserves the source line endings verbatim, so the
    rendered Markdown carries CRLF on Windows.
  - The legacy `format!()` baseline the tests compare against is LF only.

Two-pronged fix:
  1. Code-level: normalize CRLF -> LF in `get_md_template_file` before
     handing content to `toml::from_str`, so both the embedded default
     and any user-supplied --markdown-template TOML render with LF
     regardless of the platform git checked them out on.
  2. .gitattributes: add `*.toml text eol=lf` so the TOML files are
     committed AND checked out as LF on every platform.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit a8d5cc5 into master May 8, 2026
15 of 17 checks passed
@jqnatividad jqnatividad deleted the describegpt-markdown-template branch May 8, 2026 08:55
jqnatividad added a commit that referenced this pull request May 9, 2026
…arent dir (#3838)

* fix(moarstats): retry on stats coverage mismatch + fsync joined CSV parent dir

Two conservative guards for the Linux-only flake on
test_moarstats::moarstats_join_type_full_runs_and_writes_bivariate
(CI run 25545197594, surfaced via PR #3834).

The macOS fix (#3830) and Windows fix (#3831) hardened the parent-side
fsync. Linux still surfaces the same symptom — bivariate output missing
secondary columns — at a much lower frequency. The most plausible
remaining gap is the `qsv stats`-on-joined-CSV subprocess handoff: the
spawned child has been observed to produce stats records for fewer
columns than the joined CSV actually has.

Two guards:

1. Retry once on stats-coverage mismatch. After `qsv stats` runs on the
   joined CSV, cross-check that every column in the joined CSV's header
   appears in the stats output's `field` column. On mismatch, re-fsync
   the joined CSV and re-run `qsv stats` once. Only fail loud if the
   mismatch persists. Self-heals the rare race; surfaces a precise error
   if a different root cause ever appears.

2. Fsync the joined CSV's parent directory after `join_datasets_internal`
   returns. `fsync(file)` on Linux doesn't flush parent-dir metadata, and
   rare FS/timing edge cases (overlayfs, fuse, network mounts) can affect
   read-after-write visibility for the follow-up subprocess. New helper
   `util::sync_directory()` is a no-op on Windows, where dir-handle
   `FlushFileBuffers` would error.

Also includes `tools/repro_moarstats_linux_flake.sh` — a parallel-worker
harness that replicates the failing test as a shell loop, optionally
adds page-cache pressure (DD_LOAD=1), and classifies failures into
"retry caught it" / "retry didn't catch it" / "different bug" buckets
to confirm root cause on a Linux VM.

Verified on macOS: `cargo build --locked --bin qsv -F all_features` and
`cargo clippy --locked --bin qsv -F all_features -- -D warnings` clean;
`cargo t -F all_features --test tests test_moarstats::` 80/80 pass at
--test-threads=8; failing test in a 30x stress loop all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* address copilot review: best-effort dir fsync, error propagation, script hardening

All six Copilot comments on PR #3838 applied:

src/util.rs
- sync_directory: errors no longer propagated. Directory fsync is not
  uniformly supported across Linux filesystems (FAT, some FUSE/overlay/
  network mounts return EINVAL/EOPNOTSUPP); a defensive guard layered on
  top of file-level fsync must NOT fail the caller. Errors are now logged
  at debug level and swallowed. Signature changed from CliResult<()> to
  () to make the contract explicit.
- New unit test test_sync_directory_smoke covering both the existing-dir
  and missing-dir paths (Unix-only).

src/cmd/moarstats.rs
- Call site updated for the new sync_directory signature.
- Stats-output reader now propagates csv parse errors instead of
  silently dropping rows via filter_map(Result::ok). A malformed/
  truncated stats row now surfaces as "Failed to parse joined-stats
  row N" rather than as a misleading downstream "missing columns"
  assertion — which is exactly the diagnostic mode this fix is trying
  to eliminate.

tools/repro_moarstats_linux_flake.sh
- FIFO is now created inside a real mktemp -d directory, replacing the
  inherently racy mktemp -u.
- cleanup trap now removes LOAD_DIR (when DD_LOAD=1) and FIFO_DIR so
  temp resources are not leaked on interrupts/early exits.
- bivariate awk check now distinguishes "header malformed" (exit 2 ->
  FAIL_OTHER) from "coverage failed" (exit 1 -> FAIL_OLD_BIVARIATE), so
  a truncated bivariate header is not silently misclassified as a
  column-drop bug.

Verified on macOS: cargo build --locked clean, cargo clippy -- -D warnings
clean, all 80 test_moarstats tests pass at --test-threads=8, the new
test_sync_directory_smoke unit test passes, smoke-run of repro script
green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants