feat(describegpt): `--two-pass` cross-field Data Dictionary refinement#3863

jqnatividad · 2026-05-16T19:59:28Z

Summary

Adds an opt-in --two-pass flag to qsv describegpt that runs a second LLM call after the first dictionary pass, feeding the entire first-pass Data Dictionary back to the LLM as {{ first_pass_dictionary }} context so it can refine Label, Description and (when --infer-content-type is set) Content Type using cross-field awareness.
The LLM can now recognize groups of related fields (street_no + street_name + city + state + zip → mailing address; first_name + last_name → person name; lat + lng → coordinate pair; date + time → split timestamp) and label/describe them with that context, rather than field-by-field in isolation.
Roughly doubles dictionary LLM cost and latency, so it is opt-in.

How it works

New PromptType::DictionaryRefine variant gives the refine pass its own prompt template (dictionary_refine_prompt in resources/describegpt_defaults.toml) and its own cache entry. Cache key for the refine entry is derived from a BLAKE3 fingerprint of the first-pass dictionary, so a different first-pass output misses the cache.
New FIRST_PASS_DICT_JSON: RwLock<Option<String>> global surfaces the first-pass dictionary to the refine prompt's {{ first_pass_dictionary }} template variable.
New combine_dictionary_entries_with_baseline() merges baseline + refine LLM responses so that any field the refine pass omits inherits the first-pass Label/Description rather than reverting to code-derived defaults. The deterministic unique_id stamp and the LLM-unique_id-rejection defenses are preserved across both passes.
The refined dictionary becomes the emitted output AND the {{ dictionary }} context for downstream --description, --tags and --prompt phases.

Compatibility and validation

Mutually exclusive with --prepare-context and --process-response: MCP sampling is single-turn per phase and the refine prompt depends on the first-pass response, so combining them errors out. Documented in USAGE.
Warns (non-fatal) when combined with --no-cache — every invocation pays both LLM calls in that mode.
Backward compat: dictionary_refine_prompt is #[serde(default = "default_dictionary_refine_prompt")] on PromptFile, so existing user --prompt-file TOMLs continue to parse and get the built-in template as a fallback. A test enforces byte-identity between the TOML block and the const fallback.
Cascade cache invalidation: --forget --dictionary (and --forget with no inference flag) now also purges the DictionaryRefine cache entry.

Tests added

combine_with_baseline_preserves_baseline_when_refine_omits_field — the core correctness invariant.
combine_with_baseline_rejects_refine_supplied_unique_id — deterministic stamp survives refine; LLM unique_id can't be smuggled in for non-ALL_UNIQUE fields.
combine_with_baseline_refine_overrides_valid_baseline_content_type — the point of the feature: refine can upgrade free_text → address_street once cross-field context is visible.
combine_with_baseline_matches_single_pass_when_refine_empty — sanity check that the new merge function agrees with the single-pass combine when refine is empty.
dictionary_refine_cache_key_differs_from_dictionary_cache_key, two_pass_flag_does_not_change_first_pass_cache_key, dictionary_refine_cache_key_reflects_first_pass_dict_json — cache isolation between passes and dependency on first-pass content.
dictionary_refine_prompt_renders_first_pass_dictionary_var — end-to-end prompt rendering of {{ first_pass_dictionary }}.
prompt_file_without_dictionary_refine_prompt_field_still_parses — backward compat for pre-existing user prompt files.
default_dictionary_refine_prompt_matches_resource — keeps the #[serde(default)] const in sync with the TOML default.

Test plan

🤖 Generated with Claude Code

Adds an opt-in `--two-pass` flag that runs a second LLM call after the first dictionary pass, feeding the entire first-pass dictionary back to the LLM via a new `{{ first_pass_dictionary }}` template variable. Cross-field context lets the LLM recognize groups of related fields (street_no + street_name + city + state + zip describing one mailing address; first_name + last_name naming one person; lat + lng forming a coordinate pair) and refine Label, Description and Content Type accordingly. The refined dictionary becomes the emitted output and is what downstream description / tags / prompt phases see as dictionary context. Key invariants: - baseline-preserving merge: if the refine LLM omits a field, the first-pass Label/Description survive (the alternative would silently wipe them back to code-derived defaults) - deterministic `unique_id` stamp survives the refine pass - LLM-supplied `unique_id` rejected in both baseline and refine passes - distinct cache entry under `PromptType::DictionaryRefine`, keyed off the first-pass dictionary fingerprint; cascade invalidation from `Dictionary` -> `DictionaryRefine` - mutually exclusive with `--prepare-context` / `--process-response` (MCP sampling is single-turn per phase) - backward compat: `dictionary_refine_prompt` is `#[serde(default)]` so existing user `--prompt-file` TOMLs keep parsing Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codacy-production · 2026-05-16T20:00:32Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 17 complexity

Metric Results

Complexity 17

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

Copilot

Pull request overview

This PR adds an opt-in --two-pass mode to qsv describegpt that runs a second “refine” LLM call using the full first-pass Data Dictionary as context, then merges baseline + refine results so omitted fields don’t regress to code-derived defaults.

Changes:

Introduces PromptType::DictionaryRefine plus a new dictionary_refine_prompt template (with #[serde(default)] fallback for older user prompt TOMLs).
Implements a two-pass dictionary merge (combine_dictionary_entries_with_baseline) and wires a two-pass flow in run_dictionary_phase (including separate caching for the refine pass).
Updates defaults/docs to document --two-pass and ship the refine prompt template.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
src/cmd/describegpt/dictionary.rs	Adds baseline-preserving merge function for two-pass output + unit tests for invariants.
src/cmd/describegpt.rs	Adds `--two-pass`, `DictionaryRefine` prompt type, first-pass dictionary context plumbing, refine caching logic, and two-pass execution path.
resources/describegpt_defaults.toml	Adds `dictionary_refine_prompt` default template and documents `{{ first_pass_dictionary }}`.
docs/help/describegpt.md	Regenerates help to include the new `--two-pass` flag.

Addresses two roborev review findings on the --two-pass change: - Wrap FIRST_PASS_DICT_JSON seeding in a `FirstPassDictGuard` RAII type so the slot is cleared on all exit paths from `run_dictionary_phase`, including `?`-propagated errors from the refine LLM call. Replaces the manual seed/clear pair which would leave stale state on the error path (harmless in CLI one-shot, but a leak in the test-harness / repeated-invocation case the static's doc comment calls out). - Make the `get_prompt` and `get_cache_key_with_flag` reads `.expect()` on poison to match the write side's behavior. Previously the reads silently coerced a poisoned RwLock to an empty `first_pass_dictionary` / empty fingerprint, which would waste the refine LLM call (no cross-field context) with no signal to the user. Now a poisoned lock fails loud, matching `FirstPassDictGuard::seed()`'s `.expect()`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… refine cache Copilot review on PR #3863 caught that `build_first_pass_dictionary_json_string` was expanding `{GENERATED_BY_SIGNATURE}` via `replace_attribution_placeholder`, which injects a live `chrono::Utc::now()` timestamp and the full process `command_line`. That made `FIRST_PASS_DICT_JSON` change on every invocation even for byte-identical first-pass dictionary content, busting the `DictionaryRefine` cache key (which is BLAKE3-fingerprinted from that string) and forcing a fresh second LLM call on every run. Fix: drop the attribution field outright when building the first-pass JSON. The attribution is for the user-facing emit path; the LLM doesn't need it and keeping it in the cache-fingerprint input only adds noise. The unused `model`/`base_url` parameters are removed from the function signature as a result. Adds `first_pass_dictionary_json_is_stable_across_invocations` as a regression guard — calls the builder twice with a 10ms sleep between calls and asserts byte-identical output plus absence of both `attribution` and the literal `{GENERATED_BY_SIGNATURE}` placeholder. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-advanced-security AI found potential problems May 16, 2026

View reviewed changes

Comment thread src/cmd/describegpt.rs Dismissed

jqnatividad requested a review from Copilot May 16, 2026 20:01

Copilot started reviewing on behalf of jqnatividad May 16, 2026 20:02 View session

Copilot AI reviewed May 16, 2026

View reviewed changes

Comment thread src/cmd/describegpt.rs Outdated

Comment thread src/cmd/describegpt.rs Outdated

jqnatividad and others added 2 commits May 16, 2026 16:21

jqnatividad merged commit b4278d6 into master May 16, 2026
16 of 17 checks passed

jqnatividad deleted the describegpt-two-pass branch May 16, 2026 20:29

BrewTestBot mentioned this pull request May 18, 2026

qsv 20.1.0 Homebrew/homebrew-core#283447

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(describegpt): `--two-pass` cross-field Data Dictionary refinement#3863