Skip to content

feat(describegpt): --two-pass cross-field Data Dictionary refinement#3863

Merged
jqnatividad merged 3 commits into
masterfrom
describegpt-two-pass
May 16, 2026
Merged

feat(describegpt): --two-pass cross-field Data Dictionary refinement#3863
jqnatividad merged 3 commits into
masterfrom
describegpt-two-pass

Conversation

@jqnatividad

Copy link
Copy Markdown
Collaborator

Summary

  • Adds an opt-in --two-pass flag to qsv describegpt that runs a second LLM call after the first dictionary pass, feeding the entire first-pass Data Dictionary back to the LLM as {{ first_pass_dictionary }} context so it can refine Label, Description and (when --infer-content-type is set) Content Type using cross-field awareness.
  • The LLM can now recognize groups of related fields (street_no + street_name + city + state + zip → mailing address; first_name + last_name → person name; lat + lng → coordinate pair; date + time → split timestamp) and label/describe them with that context, rather than field-by-field in isolation.
  • Roughly doubles dictionary LLM cost and latency, so it is opt-in.

How it works

  • New PromptType::DictionaryRefine variant gives the refine pass its own prompt template (dictionary_refine_prompt in resources/describegpt_defaults.toml) and its own cache entry. Cache key for the refine entry is derived from a BLAKE3 fingerprint of the first-pass dictionary, so a different first-pass output misses the cache.
  • New FIRST_PASS_DICT_JSON: RwLock<Option<String>> global surfaces the first-pass dictionary to the refine prompt's {{ first_pass_dictionary }} template variable.
  • New combine_dictionary_entries_with_baseline() merges baseline + refine LLM responses so that any field the refine pass omits inherits the first-pass Label/Description rather than reverting to code-derived defaults. The deterministic unique_id stamp and the LLM-unique_id-rejection defenses are preserved across both passes.
  • The refined dictionary becomes the emitted output AND the {{ dictionary }} context for downstream --description, --tags and --prompt phases.

Compatibility and validation

  • Mutually exclusive with --prepare-context and --process-response: MCP sampling is single-turn per phase and the refine prompt depends on the first-pass response, so combining them errors out. Documented in USAGE.
  • Warns (non-fatal) when combined with --no-cache — every invocation pays both LLM calls in that mode.
  • Backward compat: dictionary_refine_prompt is #[serde(default = "default_dictionary_refine_prompt")] on PromptFile, so existing user --prompt-file TOMLs continue to parse and get the built-in template as a fallback. A test enforces byte-identity between the TOML block and the const fallback.
  • Cascade cache invalidation: --forget --dictionary (and --forget with no inference flag) now also purges the DictionaryRefine cache entry.

Tests added

  • combine_with_baseline_preserves_baseline_when_refine_omits_field — the core correctness invariant.
  • combine_with_baseline_rejects_refine_supplied_unique_id — deterministic stamp survives refine; LLM unique_id can't be smuggled in for non-ALL_UNIQUE fields.
  • combine_with_baseline_refine_overrides_valid_baseline_content_type — the point of the feature: refine can upgrade free_textaddress_street once cross-field context is visible.
  • combine_with_baseline_matches_single_pass_when_refine_empty — sanity check that the new merge function agrees with the single-pass combine when refine is empty.
  • dictionary_refine_cache_key_differs_from_dictionary_cache_key, two_pass_flag_does_not_change_first_pass_cache_key, dictionary_refine_cache_key_reflects_first_pass_dict_json — cache isolation between passes and dependency on first-pass content.
  • dictionary_refine_prompt_renders_first_pass_dictionary_var — end-to-end prompt rendering of {{ first_pass_dictionary }}.
  • prompt_file_without_dictionary_refine_prompt_field_still_parses — backward compat for pre-existing user prompt files.
  • default_dictionary_refine_prompt_matches_resource — keeps the #[serde(default)] const in sync with the TOML default.

Test plan

  • cargo build --locked --bin qsv -F all_features
  • cargo t describegpt -F all_features — 64 passed, 0 failed
  • cargo clippy -F all_features — clean
  • cargo +nightly fmt
  • ./target/debug/qsv --generate-help-mddocs/help/describegpt.md regenerated with the new --two-pass row
  • End-to-end smoke against a real LLM with a CSV exercising cross-field relationships (e.g. address columns + name columns) — verify the refined Labels/Descriptions reference composite concepts and that the Markdown output emits one Dictionary table (not two)
  • qsv describegpt sample.csv --dictionary --two-pass --prepare-context — confirm clear error
  • qsv describegpt sample.csv --dictionary --two-pass --no-cache — confirm warning prints
  • Re-run with --fresh and confirm both pass-1 and pass-2 cache entries refresh in ~/.qsv-cache/describegpt
  • Re-run with --prompt-file custom.toml where custom.toml lacks dictionary_refine_prompt and confirm the serde(default) fallback works

🤖 Generated with Claude Code

Adds an opt-in `--two-pass` flag that runs a second LLM call after the
first dictionary pass, feeding the entire first-pass dictionary back to
the LLM via a new `{{ first_pass_dictionary }}` template variable.
Cross-field context lets the LLM recognize groups of related fields
(street_no + street_name + city + state + zip describing one mailing
address; first_name + last_name naming one person; lat + lng forming a
coordinate pair) and refine Label, Description and Content Type
accordingly. The refined dictionary becomes the emitted output and is
what downstream description / tags / prompt phases see as dictionary
context.

Key invariants:
- baseline-preserving merge: if the refine LLM omits a field, the
  first-pass Label/Description survive (the alternative would silently
  wipe them back to code-derived defaults)
- deterministic `unique_id` stamp survives the refine pass
- LLM-supplied `unique_id` rejected in both baseline and refine passes
- distinct cache entry under `PromptType::DictionaryRefine`, keyed off
  the first-pass dictionary fingerprint; cascade invalidation from
  `Dictionary` -> `DictionaryRefine`
- mutually exclusive with `--prepare-context` / `--process-response`
  (MCP sampling is single-turn per phase)
- backward compat: `dictionary_refine_prompt` is `#[serde(default)]` so
  existing user `--prompt-file` TOMLs keep parsing

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/cmd/describegpt.rs Dismissed
@codacy-production

codacy-production Bot commented May 16, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 17 complexity

Metric Results
Complexity 17

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an opt-in --two-pass mode to qsv describegpt that runs a second “refine” LLM call using the full first-pass Data Dictionary as context, then merges baseline + refine results so omitted fields don’t regress to code-derived defaults.

Changes:

  • Introduces PromptType::DictionaryRefine plus a new dictionary_refine_prompt template (with #[serde(default)] fallback for older user prompt TOMLs).
  • Implements a two-pass dictionary merge (combine_dictionary_entries_with_baseline) and wires a two-pass flow in run_dictionary_phase (including separate caching for the refine pass).
  • Updates defaults/docs to document --two-pass and ship the refine prompt template.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
src/cmd/describegpt/dictionary.rs Adds baseline-preserving merge function for two-pass output + unit tests for invariants.
src/cmd/describegpt.rs Adds --two-pass, DictionaryRefine prompt type, first-pass dictionary context plumbing, refine caching logic, and two-pass execution path.
resources/describegpt_defaults.toml Adds dictionary_refine_prompt default template and documents {{ first_pass_dictionary }}.
docs/help/describegpt.md Regenerates help to include the new --two-pass flag.

Comment thread src/cmd/describegpt.rs Outdated
Comment thread src/cmd/describegpt.rs Outdated
jqnatividad and others added 2 commits May 16, 2026 16:21
Addresses two roborev review findings on the --two-pass change:

- Wrap FIRST_PASS_DICT_JSON seeding in a `FirstPassDictGuard` RAII type
  so the slot is cleared on all exit paths from `run_dictionary_phase`,
  including `?`-propagated errors from the refine LLM call. Replaces the
  manual seed/clear pair which would leave stale state on the error path
  (harmless in CLI one-shot, but a leak in the test-harness /
  repeated-invocation case the static's doc comment calls out).
- Make the `get_prompt` and `get_cache_key_with_flag` reads `.expect()`
  on poison to match the write side's behavior. Previously the reads
  silently coerced a poisoned RwLock to an empty `first_pass_dictionary`
  / empty fingerprint, which would waste the refine LLM call (no
  cross-field context) with no signal to the user. Now a poisoned lock
  fails loud, matching `FirstPassDictGuard::seed()`'s `.expect()`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… refine cache

Copilot review on PR #3863 caught that `build_first_pass_dictionary_json_string`
was expanding `{GENERATED_BY_SIGNATURE}` via `replace_attribution_placeholder`,
which injects a live `chrono::Utc::now()` timestamp and the full process
`command_line`. That made `FIRST_PASS_DICT_JSON` change on every invocation
even for byte-identical first-pass dictionary content, busting the
`DictionaryRefine` cache key (which is BLAKE3-fingerprinted from that string)
and forcing a fresh second LLM call on every run.

Fix: drop the attribution field outright when building the first-pass JSON.
The attribution is for the user-facing emit path; the LLM doesn't need it and
keeping it in the cache-fingerprint input only adds noise. The unused
`model`/`base_url` parameters are removed from the function signature as a
result.

Adds `first_pass_dictionary_json_is_stable_across_invocations` as a regression
guard — calls the builder twice with a 10ms sleep between calls and asserts
byte-identical output plus absence of both `attribution` and the literal
`{GENERATED_BY_SIGNATURE}` placeholder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit b4278d6 into master May 16, 2026
16 of 17 checks passed
@jqnatividad jqnatividad deleted the describegpt-two-pass branch May 16, 2026 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants