merge Transpose PR#3

jqnatividad · 2020-12-27T17:58:55Z

transpose command to transpose rows/columns of CSV data. PR BurntSushi#137

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

All 160 frequency tests pass. The diagnostics about `qsv_bin` are unrelated to my changes.

Changes:
- Fixed misleading comment: removed "non-positive" from the safety comment about weight filtering since `debug_assert!(is_finite())` only checks for NaN/Inf, not non-positive values
- Added clarifying comments on descending `partition_point` calls to document that null is placed after entries with equal weight/count (acknowledging the behavior change from the original linear search)

Address review findings (job 259)

All 160 frequency tests pass. Here's a summary of the changes:

Changes:
- Added `#[inline]` back to `weighted_add` to enable inlining into function pointer call sites, mitigating the indirect-call overhead noted in finding #3
- Added comment on `debug_assert!` in `weighted_add` clarifying that upstream validation already filters invalid weights (finding #2)
- Added CHANGELOG entry under `[Unreleased]` documenting the `partition_point` optimization and the **BREAKING** behavioral change in null tie-breaking order for `frequency --null-sorted` (finding #1)

Address review findings (job 257)

All 160 frequency tests pass.

Changes:
- Added `debug_assert` in `weighted_add` to verify weight is finite and positive, making the invariant self-documenting and consistent with the safety comment at line 2244 (the upstream filter at line 2739 already correctly skips non-positive weights before calling `weighted_add`)

Address review findings (job 256)

All 4 tests pass. Build succeeds.

Changes:
- Removed "matching original behavior" from comments on null insertion with ties (it's actually a behavior change from the `position()` to `partition_point()` switch)
- Fixed safety comment to mention non-positive weights are also filtered during accumulation (not just NaN/Inf)
- Removed `#[inline]` from `weighted_add` since it's called through a function pointer and the annotation is misleading

Address review findings (job 252)

All 4 tests pass with exact position assertions.

Changes:
- Replace range assertions with exact position assertions in all 4 tie-breaking tests to match the deterministic `partition_point` tie-breaking policy
- Weighted desc/asc tests now assert null_pos == 2 (exact)
- Unweighted desc test asserts null_pos == 3 (null placed after all tied entries)
- Unweighted asc test asserts null_pos == 1 (null placed before all tied entries), differentiating it from the desc test

Address review findings (job 251)

All 160 frequency tests pass.

Changes:
- Reverted `#[inline(always)]` to `#[inline]` on `weighted_add` since inlining through function pointers isn't guaranteed anyway
- Added 4 tests for null insertion with ties: weighted desc, weighted asc, unweighted desc, unweighted asc, covering the `partition_point` edge case where null has the same count/weight as other entries

Address review findings (job 249)

All 156 frequency tests pass. The diagnostics about `qsv_bin` are pre-existing issues unrelated to this change.

Changes:
- Fix ascending `partition_point` predicate to use `<` instead of `<=` for both weighted and unweighted null insertion, restoring the original tie-breaking behavior where null is placed before entries with equal weight/count

Address review findings (job 247)

All 156 frequency tests pass. The diagnostic warnings about `qsv_bin` are pre-existing and unrelated to our changes.

Changes:
- Add `debug_assert!` to verify sort-order invariant before `partition_point` in weighted null-insertion path
- Add `debug_assert!` to verify sort-order invariant before `partition_point` in unweighted null-insertion path

Address review findings (job 245)

All 156 frequency tests pass.

Changes:
- Add `debug_assert!` checking all weights in `counts_final` are finite, not just the null weight, for `partition_point` correctness (finding #3)
- Add comments noting `field_buffer` borrows are transient and safe to reuse across iterations in both weighted and unweighted ignore-case paths (finding #1)

Address review findings (job 243)

No clippy warnings for frequency.rs. All changes are clean.

Changes:
- Add `debug_assert!(null_weight_val.is_finite())` before weighted `partition_point` calls to guard against NaN float values breaking binary search
- Add safety comment for unweighted `partition_point` noting u64 counts are always finite
- Change `weighted_add` from `#[inline]` to `#[inline(always)]` to ensure inlining in the hot path through function pointers
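
Illustration of the tie-breaking policy described in the entries above. This is a minimal sketch, not the code in frequency.rs: the function names `null_pos_desc`, `null_pos_asc`, and `null_pos_weighted_desc` are hypothetical, and the slices stand in for the already-sorted counts or weights. It assumes the final policy the log arrives at: on a descending list null lands after entries with an equal count, on an ascending list it lands before them.

```rust
/// Insertion index for a null entry in `counts`, which must already be sorted
/// in descending order. The predicate stays true for every count that is
/// greater than or equal to the null count, so null lands after all ties.
fn null_pos_desc(counts: &[u64], null_count: u64) -> usize {
    counts.partition_point(|&c| c >= null_count)
}

/// Insertion index for a null entry in `counts`, which must already be sorted
/// in ascending order. Using `<` (not `<=`) stops at the first tie, so null
/// lands before entries with an equal count.
fn null_pos_asc(counts: &[u64], null_count: u64) -> usize {
    counts.partition_point(|&c| c < null_count)
}

/// Weighted variant for a descending list. Binary search over floats is only
/// meaningful when no value is NaN or infinite, so the invariant is checked
/// in debug builds before searching.
fn null_pos_weighted_desc(weights: &[f64], null_weight: f64) -> usize {
    debug_assert!(null_weight.is_finite());
    debug_assert!(weights.iter().all(|w| w.is_finite()));
    weights.partition_point(|&w| w >= null_weight)
}
```

With counts `[5, 3, 3, 1]` and a null count of 3, the descending helper returns 3; with `[1, 3, 3, 5]` the ascending helper returns 1.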

Syntax check passed. All changes are complete.

Changes:
- Replace macOS `read -t` fallback with `head -c 65536 | jq` pipeline to fix silent failure when `timeout` is unavailable (finding 1)
- Move version-change note from header preamble to Tool Discovery section where it's more contextually relevant (finding 4)

Address review findings (job 436)

Script passes syntax check. This is a shell script, not a Rust source file, so the standard `cargo build`/`cargo test` commands aren't relevant here. The change is minimal and self-contained.

Changes:
- Added `-n 65536` to `read` builtin in the macOS fallback branch to enforce the same 64KB size limit as the `timeout` branch, addressing the size guard inconsistency

Address review findings (job 435)

Syntax check passed. This is a shell script only; no Rust build or tests needed for these changes.

Changes:
- Use bash `read -t 5` fallback instead of bare `head` on systems without `timeout` (macOS) to prevent indefinite stdin blocking
- Emit diagnostic `additionalContext` message when `CLAUDE_PLUGIN_ROOT` is unset, aiding troubleshooting
- Replace instruction-to-AI phrasing ("Inform the user...") with neutral factual messages in JSON output

Address review findings (job 433)

Script syntax is valid. All three review findings are addressed.

Changes:
- Add `command -v timeout` check with fallback to plain `head` for macOS compatibility (issues #1/#3)
- Guard against empty/unset `CLAUDE_PLUGIN_ROOT` before `cd` to prevent unexpected `$HOME` resolution (issue #2)

Address review findings (job 432)

Script syntax is valid (no output = no errors).

Changes:
- Add 5-second timeout to stdin read to prevent indefinite blocking if no input is provided
- Guard against deploying CLAUDE.md into the plugin's own directory tree
- Replace hardcoded version "v16.1" with version-agnostic wording in cowork template
- Add `stats` (with extended stats) to the memory-intensive commands list in cowork template

Address review findings (job 430)

Script syntax is valid.

Changes:
- Guard `jq` parse of stdin against truncated JSON: redirect stderr and fall back to empty `CWD` on failure, so `set -e` won't abort on malformed input
- Replace `realpath` with POSIX-portable `cd "$CWD" && pwd -P` for symlink resolution, ensuring compatibility on minimal macOS and CI images without GNU coreutils

Address review findings (job 429)

All changes look correct. Since this is a shell script and markdown file (no Rust code changes), there's no build or test to run.

Changes:
- Limit stdin read to 64KB (`head -c 65536`) to prevent hangs on malformed/endless input
- Resolve CWD with `realpath` to prevent path traversal via symlinks
- Add `QSV_NO_COWORK_SETUP=1` env var opt-out mechanism
- Wrap `cp` in error handling to produce a friendly JSON message instead of failing with `set -e`
- Add version note and opt-out instructions to the cowork-CLAUDE.md template header

Address review findings (job 427)

Script syntax is valid (no output means no errors).

Changes:
- Redirect `jq`-missing diagnostic from stderr to stdout so the hook framework can surface it as `additionalContext` to the agent

Address review findings (job 426)

Both files validate correctly. The diagnostic errors about `qsv_bin` are pre-existing and unrelated to these changes.

Changes:
- Use here-string (`<<<`) instead of `echo | jq` to avoid escape sequence mangling in JSON input
- Add `jq` availability check with friendly message instead of cryptic hook error
- Remove misleading `"matcher": "startup"` from SessionStart hook config
- Use `jq -n` to construct output JSON safely, preventing malformed JSON from paths with special characters
- Remove unverified `QSV_MCP_OPERATION_TIMEOUT_MS` / `qsv_config` references from cowork-CLAUDE.md template

All 421 tests pass, 0 failures. The change is correct. Changes: - Fix Unicode truncation fast-path to use UTF-16 length as a cheap guard (strings shorter in UTF-16 are guaranteed shorter in codepoints), only performing expensive `Array.from()` codepoint conversion when the string exceeds the limit Address review findings (job 606) All 72 tests pass (0 failures), including the 3 new tests for missing params. Changes: - Check for `null`/`undefined` params explicitly before string coercion in `handleLogCall`, returning clear "is required" error messages (finding #1) - Trim and strip newlines from log messages before writing, preventing multi-line log entries and inconsistent whitespace (findings #2, #3) - Added tests for missing `entry_type`, missing `message`, and entirely empty params (finding #5) Address review findings (job 607) All 418 tests pass, including the new one. Changes: - Add test for newline-only message (`'\n\n'`) confirming it's rejected as non-empty string Address review findings (job 609) All 74 tests pass (0 failures), including all the new and existing `handleLogCall` tests. Changes: - Log `catch` block now writes error details to stderr via `console.error` instead of silently swallowing - Added `--` separator before the message argument in `qsv log` CLI call to prevent messages starting with `-` from being misinterpreted as flags - Documented newline collapsing behavior in the tool description ("Newlines are collapsed to spaces") - Added test for non-string type coercion (`{ entry_type: 123, message: true }`) confirming `String()` coercion behavior Address review findings (job 610) All 420 tests pass. Changes: - Include truncated error message in the success result returned to the agent (not just stderr), so the agent has actionable context when `qsv_log` write fails - Add test for non-string message coercion with valid `entry_type` to verify `String()` coercion works for the message path Address review findings (job 611) All 420 tests pass. 
The Rust diagnostics are pre-existing and unrelated to this change. Changes: - Added `assert.ok(!result.isError)` to the `handleLogCall` non-string message coercion test to explicitly verify the result is not an error, making the test intent clearer Address review findings (job 613) All 420 tests pass, 0 failures. The changes are verified. Changes: - Added comment on `--` separator in `handleLogCall` args explaining it guards against messages starting with `-` being parsed as flags (addresses medium finding) - Added `config.qsvValidation.valid` skip guard to `handleLogCall coerces non-string message` test so it properly tests the success path instead of passing accidentally via error swallowing (addresses low finding #4) - Added assertion that success response doesn't contain "warning" to confirm actual success vs swallowed error Address review findings (job 615) No CLAUDE.md changes needed for the `--` removal. All changes are complete and tests pass. Changes: - Remove unnecessary `--` end-of-options sentinel from `qsv log` args — `qsv log` uses docopt variadic `[<message>...]` which handles this correctly, and messages always start with `[entry_type]` so they can never be misinterpreted as flags - Fix Unicode-safe truncation using `Array.from()` instead of `String.slice()` to avoid splitting surrogate pairs in non-ASCII messages - Add throttling guidance to server instructions ("Avoid excessive logging — for simple interactions, a single user_prompt + result_summary pair is enough") - Add test for the `handleLogCall` error-swallowing catch path using a non-existent working directory Address review findings (job 616) The change looks correct. The length check and truncation now both operate on codepoints consistently. 
Changes: - Fix Unicode truncation length mismatch: use codepoint count (`Array.from(sanitized).length`) for both the gate condition and the truncation, avoiding inconsistency between UTF-16 `.length` and codepoint-aware `Array.from().slice()` Address review findings (job 618) All 421 tests pass, 0 failures. All `handleLogCall` tests pass including the updated write-failure test. Changes: - Reworded catch-path message from misleading `"Logged ... (warning: write failed: ...)"` to clearer `"Log write failed (non-fatal): ... Workflow continues."` (issue 1) - Added fast-path optimization for Unicode truncation: only call `Array.from()` when `sanitized.length > MAX_LOG_MESSAGE_LEN`, avoiding unnecessary codepoint conversion on short messages (issue 3) - Updated test assertions to match the new error message wording

* feat(mcp): add qsv_log core tool for agent-initiated reproducibility logging Enable agents to write structured entries (user_prompt, agent_reasoning, agent_action, result_summary, note) to the qsv audit log (qsvmcp.log) with u- prefixed UUIDs, distinct from automatic s-/e- audit entries. Automatic audit logging is skipped for qsv_log calls to avoid recursion. Messages are truncated at 4096 chars and logging failures never break the workflow. Server instructions updated to guide agents on when/how to log for third-party reproducibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Address review findings (job 619) All 421 tests pass, 0 failures. The change is correct. Changes: - Fix Unicode truncation fast-path to use UTF-16 length as a cheap guard (strings shorter in UTF-16 are guaranteed shorter in codepoints), only performing expensive `Array.from()` codepoint conversion when the string exceeds the limit Address review findings (job 606) All 72 tests pass (0 failures), including the 3 new tests for missing params. Changes: - Check for `null`/`undefined` params explicitly before string coercion in `handleLogCall`, returning clear "is required" error messages (finding #1) - Trim and strip newlines from log messages before writing, preventing multi-line log entries and inconsistent whitespace (findings #2, #3) - Added tests for missing `entry_type`, missing `message`, and entirely empty params (finding #5) Address review findings (job 607) All 418 tests pass, including the new one. Changes: - Add test for newline-only message (`'\n\n'`) confirming it's rejected as non-empty string Address review findings (job 609) All 74 tests pass (0 failures), including all the new and existing `handleLogCall` tests. 
Changes: - Log `catch` block now writes error details to stderr via `console.error` instead of silently swallowing - Added `--` separator before the message argument in `qsv log` CLI call to prevent messages starting with `-` from being misinterpreted as flags - Documented newline collapsing behavior in the tool description ("Newlines are collapsed to spaces") - Added test for non-string type coercion (`{ entry_type: 123, message: true }`) confirming `String()` coercion behavior Address review findings (job 610) All 420 tests pass. Changes: - Include truncated error message in the success result returned to the agent (not just stderr), so the agent has actionable context when `qsv_log` write fails - Add test for non-string message coercion with valid `entry_type` to verify `String()` coercion works for the message path Address review findings (job 611) All 420 tests pass. The Rust diagnostics are pre-existing and unrelated to this change. Changes: - Added `assert.ok(!result.isError)` to the `handleLogCall` non-string message coercion test to explicitly verify the result is not an error, making the test intent clearer Address review findings (job 613) All 420 tests pass, 0 failures. The changes are verified. Changes: - Added comment on `--` separator in `handleLogCall` args explaining it guards against messages starting with `-` being parsed as flags (addresses medium finding) - Added `config.qsvValidation.valid` skip guard to `handleLogCall coerces non-string message` test so it properly tests the success path instead of passing accidentally via error swallowing (addresses low finding #4) - Added assertion that success response doesn't contain "warning" to confirm actual success vs swallowed error Address review findings (job 615) No CLAUDE.md changes needed for the `--` removal. All changes are complete and tests pass. 
Changes: - Remove unnecessary `--` end-of-options sentinel from `qsv log` args — `qsv log` uses docopt variadic `[<message>...]` which handles this correctly, and messages always start with `[entry_type]` so they can never be misinterpreted as flags - Fix Unicode-safe truncation using `Array.from()` instead of `String.slice()` to avoid splitting surrogate pairs in non-ASCII messages - Add throttling guidance to server instructions ("Avoid excessive logging — for simple interactions, a single user_prompt + result_summary pair is enough") - Add test for the `handleLogCall` error-swallowing catch path using a non-existent working directory Address review findings (job 616) The change looks correct. The length check and truncation now both operate on codepoints consistently. Changes: - Fix Unicode truncation length mismatch: use codepoint count (`Array.from(sanitized).length`) for both the gate condition and the truncation, avoiding inconsistency between UTF-16 `.length` and codepoint-aware `Array.from().slice()` Address review findings (job 618) All 421 tests pass, 0 failures. All `handleLogCall` tests pass including the updated write-failure test. Changes: - Reworded catch-path message from misleading `"Logged ... (warning: write failed: ...)"` to clearer `"Log write failed (non-fatal): ... 
Workflow continues."` (issue 1) - Added fast-path optimization for Unicode truncation: only call `Array.from()` when `sanitized.length > MAX_LOG_MESSAGE_LEN`, avoiding unnecessary codepoint conversion on short messages (issue 3) - Updated test assertions to match the new error message wording * fix(mcp): address Copilot review findings for qsv_log - Move skipAuditLog from "Key Constants" to a behavior note in CLAUDE.md (it's a local variable, not a module-level constant) - Reorder enum and LOG_ENTRY_TYPES Set to match description order (reasoning before action) - Add unique temp dir + cleanup to coercion test to prevent log file accumulation in OS temp root Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

…ls (#3734) * feat(generators): detect required options from Usage: line Both help_markdown_gen.rs and mcp_skills_gen.rs now identify options shown outside [options]/[...] groups in the USAGE's `Usage:` section (e.g. `qsv implode [options] -k <keys> -v <value>`) and mark them accordingly. - `docs/help/*.md`: required options get ` **(required)**` appended to their description column in the options table. - `.claude/skills/qsv/*.json`: option entries gain `"required": true` when the flag is required. Optional options continue to emit nothing (the field is skipped when absent). A small wrinkle worth noting: qsv-docopt's Parser does not always emit Atom::Short entries paired with the Long atom for the `-k, --keys` declaration style, so we can't rely on its pairing to expand short↔long forms. Both generators do their own pairing pass by scanning the options sections for `-X, --xxx` declarations. Closes roborev review #1618 findings #3 and #4 at a project-wide (generator) level. Findings #1 (empty positional arg descriptions) and #2 (unfenced CSV examples) remain as separate work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: regenerate help markdown and MCP skills for required-option markers Commands with required options in their Usage: line now show them: - applydp: --new-column, --replacement, --formatstr - apply, describegpt, fetchpost, implode, joinp, luau, py, split: various required options previously unmarked All regenerated via `qsv --generate-help-md` and `qsv --update-mcp-skills`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(generators): only mark an option required when it's required in all Usage variants The previous heuristic produced seven false positives flagged by roborev review #1624. The detector now: 1. Computes a per-Usage-variant required set (tokens outside `[...]` AND outside any `(A | B)` alternative group), then takes the intersection across all non-`--help` Usage lines. 
An option must be required in every variant to be marked globally required. 2. Handles `(A | B | C)` alternative groups by masking them out entirely — inside alternatives no individual token is required. 3. Expands short→long aliases per Usage line, before intersection, so `-n`'s use as a positional-style flag on one Usage variant doesn't leak into `--no-headers` as globally required. Fixes false-positive required markers on: - split: `--size` / `--chunks` / `--kb-size` (alternative group) - joinp: `--cross` / `--non-equi` (separate Usage lines) - apply / applydp: `--new-column` / `--replacement` / `--formatstr` (subcommand-scoped, not global) - describegpt: `--prepare-context` / `--process-response` (separate Usage lines) - fetchpost: `--payload-tpl` (alternative inside `(A | B)`) - luau / py: `--no-headers` (the `-n <main-script>` Usage role only appears in one variant; intersection excludes it) - py: `--helper` (only required on one Usage variant) Implode's `--keys` / `--value` markers are preserved (genuinely required in the single Usage variant). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(generators): share required-options detection in a common module Addresses roborev review #1625. Extracts the previously-duplicated required-option detection (and its helpers) out of both generators and into a new crate::generators_common module. Both mcp_skills_gen and help_markdown_gen now delegate to it, keeping their detection semantics in lockstep. Fixes additional review findings while consolidating: - Bidirectional short↔long expansion via a new FlagPairs type, so Usage lines that mention only the long form also surface the short form in the required set (and vice versa). - Bracket-depth is now u32, and uses u32::saturating_sub so an unbalanced `]` cannot underflow the counter (on i32 it would have saturated at i32::MIN, silently dropping later required tokens). 
- The pair regex now matches long-first (`--keys, -k`) declarations in addition to short-first (`-k, --keys`). - The pair regex scans only the options-declaration portion of the USAGE string (after the Usage: block), so a future quirk in the Usage: block can't introduce a bogus pair. Adds 12 unit tests covering: single-variant required expansion, alternative groups `(A|B|C)`, multi-variant intersection, subcommand- scoped options, short-role collision (luau/py), plain `(X)` grouping without a pipe, nested optional inside an alt group, long-first declarations, long-only Usage mentions, no Usage block, unbalanced brackets, and Usage-block scoping of the pair regex. Regenerator outputs (docs/help and .claude/skills) are unchanged by this refactor (confirmed with --generate-help-md / --update-mcp-skills producing no diffs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(generators_common): narrow options-section scope, cache regexes, handle continuations Addresses roborev review #1626: 1. Options-section scope: replace the blank-line-after-Usage heuristic with a proper line-anchored, case-insensitive `options:` / `Options:` header regex. The previous logic landed on description paragraphs (most qsv USAGEs have a description between the Usage block and the options section), so the pair scan was broader than intended. 2. Fallback for minimal fixtures: if no options header is found, fall back to scanning the whole USAGE so short↔long pairs declared in non-standard layouts (or small test fixtures) still register. 3. Compiled regexes (short-first pair, long-first pair, flag scanner, options header) are now cached via `std::sync::OnceLock`, removing per-call recompilation overhead. 4. 
Usage-block collection now terminates only on a blank line and merges docopt continuation lines (indented lines within the block that do not begin with `qsv`) into their parent variant, preventing a wrapped Usage line from showing up as a standalone variant and silently narrowing the intersection. Adds three more unit tests covering: continuation-line joining (`continuation_line_does_not_truncate_usage_block`), scope narrowing (`pair_regex_scans_only_the_options_section_not_description`), and the whole-text fallback (`fallback_to_whole_text_when_no_options_section`). 15/15 unit tests pass; regenerator output is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(generators_common): tighten options-header regex and indentation-aware continuation Addresses roborev review #1627: 1. options_section regex: the leading class is now `[ \t\w-]` instead of `[\s\w-]`, so it can't straddle a newline. Trailing `[ \t]*` matches only the same-line whitespace, keeping the "line-anchored" claim in the doc comment literally true. 2. collect_usage_lines: drop the dead `if trimmed.is_empty()` branch — the `take_while(!blank)` already filters blank lines out. 3. collect_usage_lines: continuation detection now prefers indentation depth (leading-whitespace count strictly greater than the parent variant's) with the `qsv`-prefix rule as a tiebreaker. A continuation line whose positional begins with `qsv` (e.g. `qsv-input`) would previously have been treated as a new variant. 4. collect_usage_lines: inline `Usage: qsv foo ...` header variants are now retained. Previously the `Usage:` line was unconditionally dropped, silently losing the only variant for that style of help text. Three additional unit tests lock this in: `Common options:` / `map options:` prefix-word headers both matched; a tab-indented `\toptions:` header; and an inline `Usage: qsv foo ...` variant. 18/18 unit tests pass; full `cargo test` passes; generator outputs unchanged. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(generators_common): baseline-indent continuation detection + cleanups Addresses roborev review #1628: 1. Indentation comparison now uses a single *baseline indent* (the leading whitespace of the first non-blank line in the Usage block) instead of comparing each line against the previous variant's indent. This fixes the inline-vs-non-inline asymmetry: an inline `Usage: qsv foo --bar` header was stored trimmed (leading_ws=0), so a following standard- column variant like ` qsv foo --baz` (leading_ws=7) was wrongly merged as a continuation. Inline variants are now synthesized at the baseline indent for consistent comparison. 2. With the baseline-indent rule, indentation genuinely *outranks* any prefix test: continuation == leading_ws > baseline. No more confused comment claiming "tiebreaker" while the code actually OR'd. The `qsv`-prefix check is gone — indentation is the sole signal. 3. `if let Some(last) = variants.last_mut()` replaces the `match + later unwrap()` pattern, removing the SAFETY comment. Three new unit tests lock this in: - `indented_wrap_line_merges_into_parent_variant` - `continuation_starting_with_qsv_prefix_is_still_a_continuation` (a deeper-indented `qsv-foo` continuation must fold, regression for the indentation-outranks-prefix intent) - `inline_usage_plus_indented_second_variant_stays_separate` (regression for the storage asymmetry from finding #1) 21/21 unit tests pass; generator outputs unchanged; clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(generators_common): derive baseline indent from min leading-ws Addresses roborev review #1629: 1. Baseline-indent is now the *minimum* leading whitespace across all non-blank lines in the Usage block (not the leading_ws of raw[0]). 
In well-formed docopt, continuation lines are always indented deeper than their variants, so min reliably picks the variant column — including when raw[0] happens to be a wrapped continuation of an inline `Usage:` variant (previously baseline would have drifted to the continuation's indent, causing real variants at the standard column to be misclassified). 2. The continuation branch now fails loudly (debug_assert) when hit with an empty variants vec rather than silently promoting the line, so a future refactor of the baseline derivation can't regress unnoticed. 3. Replaced `.map().next().unwrap_or(0)` with `.iter().map(...).min()` — both a cleaner expression and the fix for the baseline-drift bug. 4. Added `inline_usage_with_wrapped_continuation_and_second_variant` to lock in the exact corner case: inline `Usage: qsv foo --bar`, an indented wrap line, and a second variant at the standard column — baseline must land on the standard column, the wrap line must merge into variant 1, and the second variant must stay separate. 5. Doc-comment reworded to describe the actual invariant: "at or below the baseline indent start new variants; strictly deeper lines are merged." 22/22 unit tests pass; generator outputs unchanged; clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(skills): add optional Option.required to TypeScript schema The Rust generator emits `"required": true` on required options in the MCP skill JSON (via the newly-added Option_.required field in mcp_skills_gen.rs). The TypeScript Option interface didn't declare it, so consumers using the typed view wouldn't see the field even though the JSON carries it. Adds `required?: boolean` with a doc comment matching the generator's semantics: emitted only when true, omitted for optional options. Addresses Copilot review on PR #3734. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
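
Illustration of the baseline-indent rule from the generators_common entries above. This is a simplified sketch under stated assumptions, not the code in the generators_common module: `collect_usage_variants` and `leading_ws` are hypothetical names, the input is assumed to be the lines that follow a bare `Usage:` header, and the inline `Usage: qsv foo ...` case is left out. Lines at the minimum indent start a new variant; strictly deeper lines are merged into the preceding one.

```rust
/// Count of leading spaces and tabs on a line.
fn leading_ws(line: &str) -> usize {
    line.chars().take_while(|c| *c == ' ' || *c == '\t').count()
}

/// Split the body of a Usage block into variants. The block ends at the first
/// blank line. The baseline is the minimum indent across the block, so a
/// wrapped first line cannot drag the baseline deeper than the variant column.
fn collect_usage_variants(block: &str) -> Vec<String> {
    let lines: Vec<&str> = block
        .lines()
        .take_while(|l| !l.trim().is_empty())
        .collect();
    let Some(baseline) = lines.iter().copied().map(leading_ws).min() else {
        return Vec::new();
    };

    let mut variants: Vec<String> = Vec::new();
    for line in lines {
        if leading_ws(line) > baseline {
            // Strictly deeper than the baseline: a continuation of the
            // previous variant, whatever text it starts with.
            match variants.last_mut() {
                Some(last) => {
                    last.push(' ');
                    last.push_str(line.trim());
                }
                None => debug_assert!(false, "continuation line before any variant"),
            }
        } else {
            // At the baseline indent: a new variant.
            variants.push(line.trim().to_string());
        }
    }
    variants
}
```

Because indentation is the only signal here, a deeper-indented line that happens to begin with `qsv-` is still folded into its parent variant.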

- (#1 #2) Replace SAMPLE_TEST_PORT + SAMPLE_TEST_HOST (which duplicated the port and could drift) with a single SAMPLE_TEST_PORT + SAMPLE_TEST_BIND_HOST literal. URL-builder and bind() both derive from the same source, so there is no more brittle .split(':').next().unwrap() that would also panic on IPv6 hosts.
- (#3) Wrap the ServerHandle in a SampleWebServer RAII guard. The server now stops in Drop, so a panic inside read_stdout / stdout doesn't leak the port and cascade into "Address already in use" on the next #[serial] test.
- (#4) Call wrk.assert_success(&mut cmd) before reading stdout in the success-path tests, so a regression that makes qsv exit non-zero surfaces qsv's stderr instead of a generic CSV-parse error.

77/77 sample tests pass; clippy --bin qsv clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
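
Illustration of the RAII-guard idea in (#3). This is a std-only sketch, not the SampleWebServer from the test suite: `ServerGuard` is a hypothetical type, and a background thread with a stop flag stands in for the actix-web server and its ServerHandle. The point it shows is that cleanup placed in `Drop` still runs when the owning scope unwinds from a panic.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Owns a background worker and shuts it down when dropped.
struct ServerGuard {
    stop: Arc<AtomicBool>,
    worker: Option<JoinHandle<()>>,
}

impl ServerGuard {
    /// Spawn the stand-in "server": a thread that idles until told to stop.
    fn start() -> Self {
        let stop = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&stop);
        let worker = thread::spawn(move || {
            while !flag.load(Ordering::Acquire) {
                thread::sleep(Duration::from_millis(5));
            }
        });
        ServerGuard { stop, worker: Some(worker) }
    }
}

impl Drop for ServerGuard {
    /// Runs on normal scope exit and during panic unwinding alike, so the
    /// worker (and anything it holds, such as a bound port) is released.
    fn drop(&mut self) {
        self.stop.store(true, Ordering::Release);
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}
```

A test would bind the guard to a local (`let _server = ServerGuard::start();`) so that it lives until the end of the test body, including the failure path.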

…3775)

* test/sample: add integration tests for streaming Bernoulli URL path

Closes the test-coverage gap flagged in PR #3774. Stands up a local actix-web fixture (port 8082, distinct from test_fetch's 8081) and exercises the boundary detection and validation guards added there:

- sample_bernoulli_url_quoted_newline_header: header field 0 contains a literal `\n` inside an RFC-4180 quote. Asserts the header arrives intact (3 fields, embedded newline preserved) and that every emitted data row also has 3 fields. Old code would have split on the raw byte and corrupted every following record.
- sample_bernoulli_url_max_size_truncation: serves a ~1.2 MiB CSV with fixed 100-byte records so `--max-size 1` cuts deterministically inside record 10486. Asserts max id <= 10485 (no half-record at the cap) and that every emitted row is well-formed.
- sample_bernoulli_url_404_fails_fast: hits an unmapped path on the fixture server. Asserts qsv exits with error instead of feeding the HTML 404 body into the csv parser (regression for the missing `error_for_status()`).
- sample_bernoulli_url_custom_delimiter: serves a TSV and passes `--delimiter '\t'`. Reads raw stdout and splits on tab (the writer also honors --delimiter, so read_stdout's comma parser would collapse rows). Asserts header and data rows split into 3 fields.

Tests use #[serial] so they don't race on the port.

77/77 sample tests pass; clippy --bin qsv clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* typo: mis-split->split incorrectly

* test/sample: address review feedback on streaming Bernoulli tests

- (#1 #2) Replace SAMPLE_TEST_PORT + SAMPLE_TEST_HOST (which duplicated the port and could drift) with a single SAMPLE_TEST_PORT + SAMPLE_TEST_BIND_HOST literal. URL-builder and bind() both derive from the same source, so there is no more brittle .split(':').next().unwrap() that would also panic on IPv6 hosts.
- (#3) Wrap the ServerHandle in a SampleWebServer RAII guard. The server now stops in Drop, so a panic inside read_stdout / stdout doesn't leak the port and cascade into "Address already in use" on the next #[serial] test.
- (#4) Call wrk.assert_success(&mut cmd) before reading stdout in the success-path tests, so a regression that makes qsv exit non-zero surfaces qsv's stderr instead of a generic CSV-parse error.

77/77 sample tests pass; clippy --bin qsv clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test/sample: address Copilot review on PR #3775: single-run cmd, server start timeout

- Replace the assert_success-then-read_stdout double-run pattern with a single capture-and-parse helper. The previous shape ran qsv twice per test, doubling fixture-server requests (and the ~1.2 MiB max-size download) and meaning the parsed stdout came from a different execution than the one whose status was asserted.
  - Added run_and_assert_success(): runs once, asserts status, returns Output (with stderr surfaced on failure).
  - Added parse_csv_stdout(): mirrors wrk.read_stdout's Vec<Vec<String>> shape but reads from a captured buffer.
  - All three success-path tests (quoted newline header, max-size truncation, custom delimiter) now use these helpers.
- Switch the SampleWebServer startup channel to send Result<ServerHandle, String> and use recv_timeout(10s) instead of recv(). A failed bind (e.g., port already in use) used to leave start() blocked forever; it now panics fast with the bind error surfaced from the server thread.

77/77 sample tests pass; clippy --bin qsv clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
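
The startup-channel change in the last commit above can be sketched roughly as follows. This is a std-only approximation, not the fixture's code: `start_with_timeout` and its closure parameter are hypothetical, and the generic `T` stands in for the ServerHandle that the real channel carries.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `startup` on a background thread and waits at most `timeout` for it
/// to report success (`Ok(T)`) or a startup error (`Err(String)`).
fn start_with_timeout<T, F>(startup: F, timeout: Duration) -> Result<T, String>
where
    T: Send + 'static,
    F: FnOnce() -> Result<T, String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel::<Result<T, String>>();
    thread::spawn(move || {
        // Ignore send errors: the receiver may already have timed out.
        let _ = tx.send(startup());
    });
    match rx.recv_timeout(timeout) {
        // The server thread reported; pass its Ok/Err through unchanged.
        Ok(result) => result,
        Err(mpsc::RecvTimeoutError::Timeout) => {
            Err(format!("startup did not report within {timeout:?}"))
        }
        Err(mpsc::RecvTimeoutError::Disconnected) => {
            Err("startup thread exited without reporting".to_string())
        }
    }
}
```

A caller that wants the fail-fast behavior described above would `expect()` or `panic!` on the returned `Err`, so a bind failure shows up with its message instead of a blocked `recv()`.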

- Dedupe build_large_oom_csv into tests/workdir.rs so test_stats and test_frequency share one source of truth (Low #1).
- Document the pre-indexed + OOM → sketch fallback path in --memcheck USAGE text, CHANGELOG, and docs/STATS_DEFINITIONS.md (Low #2).
- Drop the dead flag_sketch_method='frequent_items' assignment before run_frequent_items; confirmed run_frequent_items does not consult flag_sketch_method (Low #3).
- Tighten the stats and frequency OOM wwarn messages to "Re-run with explicit ... exact to disable the auto-enable", which matches the established frequency wording and removes the misleading "override" phrasing (Low #4).

Verified Low #5 separately: which_stats() already gates mad on !approx_quantiles regardless of flag_everything/flag_mad, so the auto-disable promised by the wwarn is honored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

)

* feat(stats,frequency): auto-enable DataSketches estimators on OOM

When --memcheck is set and util::mem_file_check returns OutOfMemory, stats and frequency now auto-enable their DataSketches-backed estimators (t-digest + HyperLogLog for stats; Misra-Gries Frequent Items for frequency) in addition to the existing auto-index fallback. Conflict guards mirror the explicit-validation rejections so the auto-enable only flips methods that would have passed validation if set by hand. A wwarn! lists the auto-enabled estimators; the original OOM is only propagated when neither fallback engages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(stats,frequency): address roborev #2028 findings

- Dedupe build_large_oom_csv into tests/workdir.rs so test_stats and test_frequency share one source of truth (Low #1).
- Document the pre-indexed + OOM → sketch fallback path in --memcheck USAGE text, CHANGELOG, and docs/STATS_DEFINITIONS.md (Low #2).
- Drop the dead flag_sketch_method='frequent_items' assignment before run_frequent_items; confirmed run_frequent_items does not consult flag_sketch_method (Low #3).
- Tighten the stats and frequency OOM wwarn messages to "Re-run with explicit ... exact to disable the auto-enable", which matches the established frequency wording and removes the misleading "override" phrasing (Low #4).

Verified Low #5 separately: which_stats() already gates mad on !approx_quantiles regardless of flag_everything/flag_mad, so the auto-disable promised by the wwarn is honored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* address Copilot review: stream OOM fixture, drop drift-prone line refs, add -- prefix

- tests/workdir.rs: rewrite build_large_oom_csv to stream rows directly to a csv::Writer instead of building a 10M-row Vec in memory first. Avoids ~1.5 GB of String allocation that would OOM the test harness itself on memory-constrained hosts, defeating the purpose of the ignored OOM tests.
- src/cmd/stats.rs, src/cmd/frequency.rs: replace hard-coded intra-file line-number references in the try_enable_approx_sketches and can_enable_frequent_items doc-comments with descriptive references to the validator/dispatch blocks they mirror.
- src/cmd/frequency.rs: add the missing -- prefix to "sketch-method frequent_items" in the --memcheck USAGE help text; regenerate docs/help/frequency.md and .claude/skills/qsv/qsv-frequency.json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(stats,frequency): clarify OOM fallback fires in both NORMAL and CONSERVATIVE mode

The OOM auto-fallback to DataSketches estimators is gated on the result of util::mem_file_check, NOT on whether --memcheck is set. The in-memory load check runs unconditionally on the non-parallel path; --memcheck only switches the check from NORMAL mode (vs. total memory) to the stricter CONSERVATIVE mode (vs. available + swap × platform factor). The fallback can therefore trigger without --memcheck too, just much less often, since NORMAL mode only trips when the file is larger than ~80% of total RAM.

Rewrote --memcheck USAGE in stats.rs and frequency.rs to:

- Lead with what --memcheck actually does (CONSERVATIVE vs. NORMAL).
- Reference QSV_MEMORY_CHECK as the env-var equivalent.
- Describe the OOM fallback as a behavior of the load check itself, not of --memcheck specifically.

Updated CHANGELOG.md and docs/STATS_DEFINITIONS.md to match. Regenerated docs/help/{stats,frequency}.md and the corresponding MCP skill JSONs from the new USAGE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* address Copilot review: real opt-out for OOM auto-enable + assert command success in tests

Two issues raised by Copilot on the OOM auto-fallback:

(1) The wwarn promised users could re-run with explicit --quantile-method exact / --cardinality-method exact / --sketch-method exact to disable the auto-enable, but the code only checked the parsed flag value. docopt fills in the default "exact" either way, so an explicit --foo-method exact was indistinguishable from omitting the flag, making the documented opt-out a no-op.

Fix: scan argv for the literal flag names ("--foo-method" or "--foo-method=...") to detect explicit user intent. Thread that through try_enable_approx_sketches / can_enable_frequent_items via new user_set_* parameters; the auto-enable is suppressed when the user explicitly provided the flag (regardless of value). Documented in STATS_DEFINITIONS.md.

(2) The new OOM tests used wrk.output_stderr, which returns stderr regardless of exit status, so a command that errored out after printing the auto-enable wwarn would still pass the test.

Fix: add a wrk.stderr_on_success helper that asserts status.success() before returning stderr (with the same diagnostic-rich panic format as assert_success). Migrate the 6 new stats/frequency OOM tests to use it. Other call sites of output_stderr are left untouched; they test failure paths where non-success is intentional.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(frequency): restore can_enable_frequent_items doc comment

Address roborev #2032: in the previous commit (d9fe03e), `argv_has_flag` and its doc comment were inserted *between* `can_enable_frequent_items`'s doc comment and the function body. Because Rust doc comments attach to the next item, the entire docstring block (including the `user_set_sketch_method` paragraph and the trailing "Returns false if any conflicting flag is set..." line) bound to `argv_has_flag`, leaving `can_enable_frequent_items` with no doc comment at all.

Move `argv_has_flag` (with its own 4-line doc) to live *after* `can_enable_frequent_items`, mirroring the layout in stats.rs where the ordering was correct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
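
The argv scan from fix (1) can be sketched as below. The name `argv_has_flag` comes from the commit message, but the signature and body here are assumptions, and `should_auto_enable` is a hypothetical helper that only illustrates how the `user_set_*` value gates the fallback.

```rust
/// Returns true when `flag` (e.g. "--sketch-method") appears in `argv`,
/// either as its own token or in the `--flag=value` form.
/// Assumed signature; the real helper in qsv may differ.
fn argv_has_flag(argv: &[&str], flag: &str) -> bool {
    argv.iter().any(|arg| {
        *arg == flag
            || arg
                .strip_prefix(flag)
                // Require '=' right after the name so that a longer flag
                // sharing the same prefix does not match.
                .map_or(false, |rest| rest.starts_with('='))
    })
}

/// Hypothetical gate: auto-enable only when the parsed value is still the
/// default "exact" AND the user did not pass the flag explicitly.
fn should_auto_enable(parsed_value: &str, user_set: bool) -> bool {
    parsed_value == "exact" && !user_set
}
```

With this shape, `--sketch-method exact` and an omitted flag both parse to "exact", but only the omitted case leaves `user_set` false, so only that case is eligible for the auto-enable.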

@context

…, Croissant (#3908) * feat(profile): comprehensive DCAT-US v3 support (Catalog, GSA bundle, force semantics) Closes the five gaps that kept `qsv profile` from being an agency-grade DCAT-US v3 reference tool: - Vendor the full GSA JSON Schema bundle (26 definitions + 2 qsv overlays + MANIFEST.json + refresh README) under resources/dcat-us-v3/, pinned to upstream commit cf8789002. `--validate-dcat` now runs against the full bundle via `referencing::Registry`, dispatching the Dataset or Catalog overlay by the emitted `@type`. A `curie::strip_curies` pre-pass bridges qsv's JSON-LD-compact output to GSA's unprefixed schema keys without touching the emitted JSON on disk. - Add `--catalog` flag that wraps the Dataset inside a `dcat:Catalog` envelope (`Catalog{dataset:[...]}`) for federation harvesters. - Emit nine new optional v3 fields with natural data sources: Dataset-level `dct:created`, `dcat:version`, `dcat:versionNotes`; Distribution-level `dcat:checksum` (SHA-256 via sha2), `dcat:compressFormat`, `dcat:packageFormat`, `dcat:spatialResolutionInMeters`, `dct:language`, `dct:conformsTo`. Widen `dct:conformsTo` to array per v3 cardinality; emit `dct:license` as string and `dcat:byteSize` as string to match the GSA schemas' declared shapes. - Implement full `force: true` override semantics across all three --initial-context subtrees. `context::collect_forced_paths` now walks package/resource entries through a 47-entry `ckan_to_dcat` mapping table; `apply_force_overrides` in `run()` applies forced leaves LAST so they beat both inferred and discovered metadata. Pipeline precedence (low → high): inferred → discovered → dataset_info pointers → forced leaves → schema validation. Bumps profile feature: adds `sha2 = "0.10"` as a direct dep. Test counts: 143 unit (was 96, +47) and 29 integration (was 18, +11) all passing, plus a new bundle pin guard test that re-hashes every vendored schema against MANIFEST.json on each run. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(profile): scaffold YAML-driven projection engine (Stage 1) Lay the foundation for the YAML-driven multi-profile projection engine described in plan §1-§2. New modules are wired into profile.rs but the orchestrator still calls the legacy dcat.rs path — zero behavior change shipped here. Subsequent stages (§3-§8) populate the profile YAMLs, swap the orchestrator, and delete the legacy hardcoded modules. New modules: * src/cmd/profile/profile_spec.rs — ProfileSpec serde types, embedded- first load() with case-insensitive name resolution, file-path fallback, 6 unit tests. * src/cmd/profile/projection.rs — generic project() engine with ProjectionMode { Dataset, Catalog }, ProjectionWarning { Severity }, wrap_as_catalog, for_each_column RecordSet expansion, profile-aware lookup/field_mapping closures, dry_compile validator, 9 unit tests. * src/cmd/profile/discovery_merge.rs — merge() with fill-if-absent, overlay-array, never strategies; never_overwrite + forced_paths protection; 5 unit tests. Helper additions (formula_helpers.rs): * Filters: only_if_absolute_iri, basename, file_stem, sanitize_iso_8601_interval, format_mailto. * Globals: sha256_of (streaming), blake3_of (mmap+rayon), file_size_of, compress_format, package_format, build_csvw_schema. Helpers needing profile state (lookup, field_mapping) live in projection.rs::register_profile_helpers as closures over the ProfileSpec; they unwrap_or(UNDEFINED) so | default chains work. USAGE additions: * --profile <name|path>: embedded names (dcat-us-v3, dcat-ap-v3, croissant) resolved first; falls back to file path. Not yet consumed in run() — wired up in Stage 4. Placeholder YAMLs under resources/profiles/ exist so include_str! resolves during Stage 1 builds; they will be replaced with real content in Stages 3 (DCAT-US v3), 6 (DCAT-AP v3), 7 (Croissant). 
Verification: * cargo build --bin qsv -F profile,feature_capable — clean (23 expected dead-code warnings for the unused scaffold). * cargo test cmd::profile:: — 163 passed (+20 new tests). * cargo test --test tests test_profile:: — 29 passed (no regression). * cargo +nightly fmt — applied. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(profile): capture goldens from legacy engine before YAML swap (Stage 2) Lock the byte-equivalent output of the current hardcoded dcat.rs engine against three regression fixtures so Stage 3's YAML-driven projection can be asserted to produce identical Dataset + Catalog blocks. Goldens captured by running today's qsv profile against each fixture with the canonical --initial-context template, then normalizing via jq to strip the only path-dependent field (qsv:sourcePath inside dcat:distribution). Everything else in the .dcat block — including dcat:byteSize, dcat:checksum, dct:modified, csvw:tableSchema — is deterministic for fixed input and is captured verbatim. Fixtures (under tests/resources/profile/golden/): * nyc-311-subset.csv (10 rows) — geocoded urban service requests: lat/lon present, mixed Open/Closed status, multi-agency. * usda-soil-subset.csv (10 rows) — scientific numeric data: pH, organic_carbon_pct, nitrogen_pct, clay/sand/silt percentages. * wprdc-311-subset.csv (10 rows) — Pittsburgh 311 records: capitalized headers, X/Y geo, council districts + wards. Goldens per fixture: * <fixture>.dataset.expected.json — the .dcat block from Dataset mode. * <fixture>.catalog.expected.json — the .dcat block from --catalog mode. .gitignore whitelists tests/resources/profile/golden/*.{csv,expected.json} so the *.json + *.csv blanket-ignores don't strip them. These goldens will drive Stage 3's dcat_us_v3_golden_parity_dataset and dcat_us_v3_golden_parity_catalog tests; CI hard-fails on drift. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(profile): ship dcat-us-v3.yaml profile (Stage 3, partial) Author resources/profiles/dcat-us-v3.yaml — the full DCAT-US v3 projection definition that will replace the hardcoded dcat.rs engine in Stage 4. The YAML mirrors the legacy add_* functions field-for-field in declaration order so serde_json::Map insertion preserves wire-shape parity (verified against the Stage-2 goldens at swap time). Profile content: * 4 vocabularies (license_iri, accrual_periodicity, iso_639_1, csvw_datatype) — each migrated verbatim from the legacy Rust constants. The EU vocab IRIs retain http:// scheme per their canonical published identifiers; DevSkim DS137138 suppressed per line. * 53 field_mappings — same CKAN→DCAT pointer table the legacy ckan_to_dcat::CKAN_TO_DCAT held, in identical declaration order so alias-resolution precedence is preserved. * dataset.fields[] — 23 entries covering core identity, provenance, contact point (required), classification, coverage, US codes (recommended), governance, and extended metadata. emit_when guards match the legacy `if let Some(...)` shapes. * distribution.fields[] — 22 entries covering title, description, download URL, format/license/restrictions, language/conformance, file-derived facts (byteSize, checksum, compress/package format), spatial resolution, and csvw:tableSchema. * catalog block reproduces wrap_as_catalog's envelope (Catalog of <title>, dct:conformsTo, dct:publisher inheritance). * discovery_merge: enabled, never_overwrite=[@context,@type, dcat:distribution], fill-if-absent strategy. * validation: enabled against the vendored GSA bundle under resources/dcat-us-v3/ with the same 11 strippable CURIE prefixes. dry_compile verification: A new unit test (embedded_dcat_us_v3_parses_and_dry_compiles) parses the embedded YAML and runs projection::dry_compile() against it — exercising every template's minijinja compile path. All templates compile clean. 
The actual byte-equivalent parity test (running each Stage-2 fixture through projection::project() and asserting against goldens) lands in Stage 4 alongside the orchestrator swap — at that point the engine actually consumes the YAML. The reference cross-checked sources for content: https://github.com/GSA/dcat-us/ https://resources.data.gov/resources/dcat-us3/ the vendored GSA bundle under resources/dcat-us-v3/ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(profile): handoff #3 — YAML projection engine, Stages 1-3 landed Captures the current state after the YAML-driven projection migration's first three commits. Documents what's wired (scaffold + helpers + flag + goldens + DCAT-US v3 YAML), what's still on the legacy path (dcat.rs drives output), and a 9-sub-step Stage 4 plan for the orchestrator swap. Supersedes profile2-handoff.md for post-PR-#3901 work. Key gotchas distilled into §5: lookup helpers must return Value::UNDEFINED, goldens only normalize qsv:sourcePath, field-mapping count is 53 not 47. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(profile): wire YAML projection engine into orchestrator (Stage 4a) profile.rs::run now routes through projection::project() with the loaded ProfileSpec (default: dcat-us-v3). The YAML engine produces byte-equivalent output to the legacy dcat.rs path on all 6 golden fixtures (3 inputs × dataset/catalog modes), verified by new parity integration tests. Orchestrator changes: * Load profile via profile_spec::load(args.flag_profile | "dcat-us-v3") at the top of run(), then projection::dry_compile() to fail fast on malformed embedded YAML. * ContextArgs gains a `profile: &ProfileSpec` field; context::build threads it to load_initial_context → collect_forced_paths so the CKAN→target pointer translation uses profile.field_mappings instead of importing ckan_to_dcat. 
* Replace dcat::build() call with projection::project(&profile, &projection_ctx, mode) — the projection_ctx carries pkg, res, stats, dpp, source_label, local_path matching the YAML's template names. * Replace merge_discovered() with discovery_merge::merge(&profile, inferred, discovered, forced_dcat_paths) — same /dcat/<key> forced- path semantics, now driven by profile.discovery_merge. * Catalog wrap baked into project() via ProjectionMode::Catalog (chosen upfront based on flag_catalog); orchestrator no longer calls catalog::wrap_as_catalog at the warning-filter step. * Stash key renamed __pending_dcat_warnings → __pending_projection_warnings. * DcatWarning → ProjectionWarning conversion bridges dcat_validate and run_profile_validation outputs (Stage 5 will refactor those modules to return ProjectionWarning directly). Engine improvements: * projection::project sets UndefinedBehavior::Chainable so `pkg.dpp_suggestions.spatial_extent.value | default("")` walks missing intermediates gracefully (matches legacy dcat.rs semantics where absent keys silently fall through). * New file-aware helpers in formula_helpers.rs: - bbox_from_dpps(dpp, stats) — lat/lon column → POLYGON-WKT `dct:Location` array, mirroring legacy dcat::bbox_from_dpps. - temporal_from_dpps(dpp, stats) — date columns → array of `dct:PeriodOfTime`, one per inferred date column. - build_csvw_schema(stats) — column-name → stats-blob map walked, emitting `{columns: [...]}` with name, titles, datatype, qsv:cardinality / nullcount / min / max. - csvw_datatype_legacy helper mirrors the legacy mapping (Float → double, Integer → integer, Date → date, etc.). dcat-us-v3.yaml updates: * dct:spatial / dct:temporal fields call bbox_from_dpps / temporal_from_dpps as fallbacks behind the formula-derived WKT suggestion. 
* dct:license emits a plain string (legacy license_value shape) via `{{ lookup("license_iri", raw) | default(raw) }}`, not the previous `{"@id": ...}` object form (GSA Distribution.json declares license as anyOf:[null,string]). Tests: * 2 new integration tests (dcat_us_v3_golden_parity_dataset / _catalog) iterate the 3 fixtures and assert byte-equivalent .dcat output against the goldens. * discovery_merge test: forced-path form switched from "/dct:title" to "/dcat/dct:title" so it matches the legacy dataset_info pointer shape; +1 new test for nested-path force blocking top-level merge. * All 6 goldens refreshed to current legacy output (the original Stage-2 capture had alphabetical stats-cache state). * Full test sweep: 165 unit + 31 integration tests pass, 0 failures. The legacy dcat.rs / catalog.rs / ckan_to_dcat.rs / curie.rs modules are still in tree (their tests still run via cmd::profile::*) but no longer participate in the engine path. Stage 4b deletes them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(profile): delete legacy hardcoded engine + refactor validator (Stages 4b + 5) The YAML-driven projection engine is now the only path. Stage 4a wired projection::project() into run() with byte-equivalent output against the goldens; this commit cleans up by deleting the legacy modules and refactoring dcat_validate to consume the active ProfileSpec. Deletions (~2400 LOC): * src/cmd/profile/dcat.rs (1738 LOC) — the 9 add_* helpers, bbox_from_dpps, temporal_from_dpps, csvw_datatype, license_value, accrual_periodicity_iri, normalize_iso_639_1. The minijinja-side equivalents live in formula_helpers.rs + dcat-us-v3.yaml. * src/cmd/profile/catalog.rs (154 LOC) — wrap_as_catalog moved into projection::wrap_as_catalog. * src/cmd/profile/ckan_to_dcat.rs (271 LOC) — CKAN_TO_DCAT table moved verbatim into dcat-us-v3.yaml's field_mappings:; the lookup is now ProfileSpec::translate_ckan_ptr. 
* src/cmd/profile/curie.rs (225 LOC) — strip_curies is now an inline helper in dcat_validate.rs driven by profile.validation.strippable_curie_prefixes. * mod declarations for the deleted modules in profile.rs. dcat_validate.rs refactor (Stage 5): * New public API: validate(profile: &ProfileSpec, block: &Value) -> Vec<ProjectionWarning>. When profile.validation.enabled == false (DCAT-AP v3, Croissant), returns vec![] without touching the schema. * Inline strip_curies / strip_curie_key replace the deleted curie module; the prefix list comes from profile.validation.strippable_curie_prefixes (still byte-identical to the legacy list for DCAT-US v3). * classify_severity now returns projection::Severity instead of dcat::Severity. * Test functions migrate to the new (profile, block) signature by loading the embedded dcat-us-v3 profile via profile_spec::load. profile.rs cleanup: * dcat_validate::validate_dataset_or_catalog() call → validate(). * run_profile_validation now returns Vec<ProjectionWarning> directly; the .into_iter().map(From::from) bridge is gone. projection.rs cleanup: * impl From<DcatWarning> for ProjectionWarning removed (no longer needed — all warning producers return ProjectionWarning). Verification: * cargo build --bin qsv -F profile,feature_capable — clean. * All 4 binaries build clean: qsv (-F all_features), qsvmcp (-F qsvmcp), qsvlite (-F lite), qsvdp (-F datapusher_plus). * cargo test cmd::profile:: → 127 unit tests pass (down from 165; the deleted legacy modules carried 38 tests now obsoleted by the YAML+goldens parity coverage). * cargo test --test tests test_profile:: → 31 integration tests pass (29 original + 2 new dcat_us_v3_golden_parity_* tests). Net Rust LOC delta this commit: −2388 deleted, +60 added (inline strip_curies + tests) = −2328 LOC. Cumulative since Stage 1: −2328 + 1525 + 546 = −257 LOC vs the pre-YAML-engine state, AND all engine knowledge now lives in resources/profiles/dcat-us-v3.yaml where it's editable without recompiling. 
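
As an illustration of the inline strip_curies / strip_curie_key helpers mentioned above, here is a minimal std-only sketch. The `Json` enum is a hypothetical stand-in for serde_json::Value, and the sketch assumes only object keys are rewritten (values such as `"dcat:Dataset"` are left alone); the real helper may behave differently.

```rust
use std::collections::BTreeMap;

/// Hypothetical, reduced JSON model; the real code works on serde_json values.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Str(String),
    Arr(Vec<Json>),
    Obj(BTreeMap<String, Json>),
}

/// Drops a leading `prefix:` from `key` when `prefix` is in the strippable
/// list; any other key is returned unchanged.
fn strip_curie_key<'a>(key: &'a str, prefixes: &[&str]) -> &'a str {
    match key.split_once(':') {
        Some((prefix, local)) if prefixes.contains(&prefix) => local,
        _ => key,
    }
}

/// Recursively rewrites object keys and returns a new value; the input,
/// like the emitted JSON on disk, is not modified.
fn strip_curies(value: &Json, prefixes: &[&str]) -> Json {
    match value {
        Json::Str(s) => Json::Str(s.clone()),
        Json::Arr(items) => {
            Json::Arr(items.iter().map(|v| strip_curies(v, prefixes)).collect())
        }
        Json::Obj(map) => Json::Obj(
            map.iter()
                .map(|(k, v)| {
                    (
                        strip_curie_key(k, prefixes).to_string(),
                        strip_curies(v, prefixes),
                    )
                })
                .collect(),
        ),
    }
}
```

Passing the prefix list as a parameter mirrors the idea that it now comes from profile.validation.strippable_curie_prefixes rather than a hardcoded module.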
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(profile): ship dcat-ap-v3 profile + 4 smoke tests (Stage 6) DCAT-AP v3 (semiceu.github.io/DCAT-AP/releases/3.0.0/) is now an embedded profile selectable via --profile dcat-ap-v3. The shape is a DCAT-US v3 subset, with: * JSON Schema validation disabled (DCAT-AP ships SHACL upstream; a SHACL backend is a future enhancement). * No dcat-us:* extensions (bureauCode, programCode, accessLevel, purpose, liabilityStatement) — those are US-specific. * New `eu_theme` vocabulary mapping CKAN group slugs to EU publications-office authority IRIs (http://publications.europa.eu/resource/authority/data-theme/...). * dcat:accessURL required on Distribution per the v3 spec (Mandatory cardinality 1..*). * dct:conformsTo points at the SEMIC v3 release URL. * Smaller field_mappings (29 entries vs the 53 in dcat-us-v3) since many DCAT-US extensions don't apply. The same minijinja templates and helpers power both profiles; the only Rust-side change in this commit is the YAML profile + tests. Smoke tests (tests/test_profile.rs): * dcat_ap_v3_emits_no_dcat_us_extensions — verifies the projection carries zero dcat-us:* keys even with the full initial-context. * dcat_ap_v3_distribution_carries_access_url — confirms the Distribution-mandatory dcat:accessURL is populated. * dcat_ap_v3_conforms_to_targets_spec_url — confirms downstream consumers can detect the profile via dct:conformsTo. * dcat_ap_v3_validation_is_disabled_noop — confirms --validate-dcat with this profile produces no JSON Schema warnings (the validator short-circuits when profile.validation.enabled == false). 
Source: https://github.com/SEMICeu/DCAT-AP Cardinality reference: https://semiceu.github.io/DCAT-AP/releases/3.0.0/ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(profile): ship croissant 1.0 profile + 5 smoke tests (Stage 7) Croissant ML metadata format (mlcommons.org/croissant) is now an embedded profile selectable via --profile croissant. The output is schema.org-rooted JSON-LD conforming to Croissant 1.0: * @context inlines the canonical Croissant map: @language=en, @vocab=https://schema.org/, plus cr:/dct: prefix shorthands. Per the Croissant spec at https://github.com/mlcommons/croissant/blob/main/docs/croissant-spec.md. * @type=sc:Dataset; field paths use schema.org bare keys (name/description/url/license/creator/publisher/keywords/etc.) rather than dcat:/dct: prefixes. * conformsTo target IRI: http://mlcommons.org/croissant/1.0. * Distribution emitted under bare `distribution` (schema.org @vocab resolves it) with @type=sc:FileObject. * Per-column cr:RecordSet/cr:Field expansion via the new build_croissant_fields helper — one Field per CSV column with schema.org dataType (sc:Text / sc:Integer / sc:Float / sc:Boolean / sc:Date / sc:DateTime). * BLAKE3 hash via cr:fileFingerprint (qsv-native mmap+rayon, markedly faster than SHA-256 on multi-GB ML training data; Croissant has no SPDX-mandated algorithm so the choice is free). * validation.enabled: false (Croissant uses a Python validator, mlcroissant, not JSON Schema). * discovery_merge.enabled: false (Croissant doesn't live in CKAN-style data portals). Engine extensions: * DatasetBlock.context now accepts a `Value` (string or object) so the inline Croissant @context map round-trips verbatim. DCAT-US / DCAT-AP profiles still use a string URI — backwards-compatible. * DistributionBlock.path lets profiles override the Distribution wrapper key. Croissant emits `distribution`; DCAT defaults remain `dcat:distribution`. 
* New formula helper build_croissant_fields(stats) walks the per- column stats map and emits a flat cr:Field array with schema.org dataType IRIs. Smoke tests (5 in tests/test_profile.rs): * croissant_uses_schema_org_context_and_sc_dataset_type * croissant_conforms_to_targets_mlcommons_spec * croissant_emits_recordset_with_one_field_per_csv_column * croissant_uses_bare_distribution_key_not_dcat_namespaced * croissant_distribution_uses_file_object_type Verification: cargo test cmd::profile:: → 127 unit, test_profile:: → 40 integration tests pass (29 original + 2 parity + 4 DCAT-AP + 5 Croissant). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(profile): regenerate help + finalize handoff (Stage 8) * docs/help/profile.md regenerated via --generate-help-md to surface the --profile flag added in Stage 1. * profile3-handoff.md updated to reflect all 8 stages landed, full file map post-deletion, verification commands, captured design decisions, and queued follow-ups. * src/cmd/profile.rs: drop the now-useless DcatWarning → ProjectionWarning conversion in the --validate-dcat code path (Stage 5 already refactored validate() to return ProjectionWarning directly). Verification: * python3 scripts/docs-drift-check.py → no drift detected. * All 4 binaries build clean (qsv, qsvmcp, qsvlite, qsvdp). * cargo test cmd::profile:: → 127 unit tests pass. * cargo test --test tests test_profile:: → 40 integration tests pass. * cargo clippy --bin qsv -F profile,feature_capable → no new findings in the YAML-engine code path. This closes the YAML-driven projection engine migration. The shipped binary always goes through projection::project(); the legacy dcat.rs / catalog.rs / ckan_to_dcat.rs / curie.rs modules are deleted. DCAT-US v3 / DCAT-AP v3 / Croissant projection knowledge lives entirely in resources/profiles/*.yaml — editable without recompiling. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(profile): address roborev #2490 findings (catalog/discovery/force/validate) 7 findings from the YAML-engine branch review at job 2490. Each fix ships with a regression guard in tests/test_profile.rs. Medium severity (6): 1. Catalog mode + discovery merge target (src/cmd/profile.rs:398). Discovery was merging into the Catalog envelope top-level instead of the nested Dataset. Fix: project Dataset always, apply discovery_merge::merge, THEN conditionally wrap in Catalog via the new projection::wrap_in_catalog_envelope helper. Guard: catalog_mode_merges_discovered_into_inner_dataset_not_envelope. 2. Catalog envelope missing @context (src/cmd/profile/projection.rs:296). The envelope carried CURIE keys (dct:title, dct:conformsTo, dcat:dataset) without a top-level @context, leaving it invalid as JSON-LD. Fix: wrap_as_catalog now copies profile.dataset.context into the envelope; inner Dataset keeps its own context for self-containment. Guard: catalog_envelope_carries_top_level_context. 3. dct:spatial emits string "null" when no bbox (resources/profiles/dcat-us-v3.yaml + dcat-ap-v3.yaml). bbox_from_dpps returning UNDEFINED rendered as `"null"` via `| tojson` because coerce_json_or_string left the literal alone. Fix: emit_when guard gates the field on WKT-or-bbox availability. Guard: spatial_field_suppressed_when_no_lat_lon_columns. 4. --dcat-legacy-license parsed but never wired (src/cmd/profile.rs:380). Flag was documented + collected into Args but never reached the YAML engine. Fix: thread the flag into projection_ctx as `legacy_license`, add a conditional Dataset-level dct:license field in dcat-us-v3.yaml gated on that variable. Guards: dcat_legacy_license_emits_dataset_level_license, dcat_legacy_license_off_keeps_license_distribution_only. 5. Forced package/resource values bypass profile shaping (src/cmd/profile/context.rs:388). 
collect_forced_paths was writing raw CKAN values to target pointers via apply_force_overrides, producing string-where-Agent-expected shapes (e.g. forced package.publisher → "Name" instead of {"@type":"foaf:Agent","foaf:name":"Name"}). Fix: CKAN-side forces now only contribute to `forced_paths` (discovery-merge protection); the value lives in merged package/resource via normalize_value_force and flows through the profile's templates for proper shaping. dataset_info forces still take the raw-write path (that's the documented escape hatch). Guard: forced_package_publisher_flows_through_profile_template. 6. validate() ignores profile.validation paths (src/cmd/profile/dcat_validate.rs:250). When validation.enabled was true, the function always used the embedded GSA bundle regardless of profile.validation.schema_dir. Fix: when the profile's schema_dir matches the embedded `resources/dcat-us-v3/` path (the only bundle qsv ships today), use the embedded validators; any other schema_dir produces a single Recommended-severity warning explaining that custom-bundle validation is a queued follow-up. The embedded DCAT-US v3 profile's behavior is unchanged. Low severity (1): 7. DiscoveryMerge::default() disabled merging (src/cmd/profile/profile_spec.rs:273). #[derive(Default)] gave `enabled: false`, contradicting the documented "fill-if-absent enabled by default" semantics — the `#[serde(default = "default_true")]` annotation only fires during deserialization. Fix: hand-rolled Default impl with enabled: true, the never_overwrite list (@context, @type, dcat:distribution), and fill-if-absent strategy. Golden refresh: * Catalog goldens (nyc-311, usda-soil, wprdc-311) pick up the new envelope @context entry — finding #2 fix. * usda-soil dataset golden loses the spurious `"dct:spatial": "null"` entry — finding #3 fix. Verification: * cargo test cmd::profile:: → 127 unit tests pass. 
* cargo test --test tests test_profile:: → 46 integration tests pass (40 prior + 6 new regression guards).
* All 4 binaries build clean (qsv, qsvmcp, qsvlite, qsvdp).
* cargo +nightly fmt applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(profile): drop auto-generated stats caches from golden dir

The previous commit accidentally committed three *.stats.csv files (qsv stats cache, auto-regenerated on every profile run). They slipped past .gitignore because the golden-directory *.csv whitelist also matches the stats.csv suffix.

Fix: add a re-ignore rule for `tests/resources/profile/golden/*.stats.csv` and the JSONL variant, then `git rm` the committed files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(profile): preserve CKAN-side force against spec formulas (roborev #2491)

Regression introduced by the #2490 fix #5: when CKAN-side `force: true` values stopped being raw-written via apply_force_overrides, they became vulnerable to overwrite by spec formulas. A formula targeting `package.publisher` would replace the forced value in merge_formula_results' pass-1 (before projection), violating the documented "force beats inferred" guarantee.

Fix: track the CKAN-side forced field-name sets through the pipeline so merge_formula_results can skip them.

* context.rs: collect_forced_paths now returns a 4-tuple including `forced_package_fields` and `forced_resource_fields` (HashSet<String> of CKAN-side field names marked force:true). load_initial_context returns the matching 6-tuple; AnalysisContext carries both sets.
* profile.rs: merge_formula_results takes the two sets and skips pass-1 inserts on matching field names. Suggestion-formula output (pass 2) lives in dpp_suggestions and is unaffected.

The forced value still flows through the profile templates for proper shaping (so dct:publisher gets its foaf:Agent wrapper, etc.); the shaping fix from #2490 #5 is preserved.
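
The pass-1 skip described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in profile.rs: the function name `merge_formula_pass1`, the flat `HashMap<String, String>` shape for the package, and the `notes` field used in the usage note are all hypothetical. Only the idea comes from the commit message: formula results are not inserted for field names present in `forced_package_fields`.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the pass-1 merge: copy formula results into the
/// package map, but never overwrite a field that was marked `force: true`.
fn merge_formula_pass1(
    package: &mut HashMap<String, String>,
    formula_results: HashMap<String, String>,
    forced_package_fields: &HashSet<String>,
) {
    for (field, value) in formula_results {
        // Force beats formula: leave the forced value untouched.
        if forced_package_fields.contains(&field) {
            continue;
        }
        package.insert(field, value);
    }
}
```

With `title` forced to "Forced Title" and a formula producing "Formula Wins" for `title`, the package keeps "Forced Title", while a formula result for an unforced field such as `notes` is still merged.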

Regression guard: forced_package_field_survives_formula_overwrite (tests/test_profile.rs). Constructs a spec with a `title` formula that would set "Formula Wins", combined with `package.title: {value: "Forced Title", force: true}`. The output must carry "Forced Title", confirming force beats formula.

Verification:
* cargo test cmd::profile:: → 127 unit tests pass.
* cargo test --test tests test_profile:: → 47 integration tests pass (46 prior + 1 new regression guard).
* cargo +nightly fmt applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(profile): expand forced CKAN fields through alias mappings (roborev #2493)

Follow-up regression to #2491: the force-skip in merge_formula_results only checked the exact CKAN field name. Aliases that project to the same target pointer (e.g. `package.author` and `package.publisher` both → `/dcat/dct:publisher`) bypassed the check: a formula writing `publisher` could still overwrite a forced `author` value.

Fix: after the first pass collects forced (ckan_ptr, target_ptr) pairs, walk profile.field_mappings and add every CKAN field whose target appears in the forced target set to the forced_pkg / forced_res field-name set. So forcing `package.author` now also locks `package.publisher` (and any other alias keys for the same target).
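
The alias expansion can be pictured with the minimal sketch below. It assumes, purely for illustration, that the field mappings reduce to a flat map from CKAN field name to target pointer; the function name `expand_forced_fields` and that map shape are hypothetical and do not mirror the real `profile.field_mappings` type.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical sketch: widen a set of forced CKAN field names so that every
/// field mapping to an already-forced target pointer is forced as well.
fn expand_forced_fields(
    forced: &HashSet<String>,
    field_mappings: &HashMap<String, String>, // CKAN field -> target pointer (assumed shape)
) -> HashSet<String> {
    // Pass 1: collect the target pointers of the explicitly forced fields.
    let forced_targets: HashSet<&str> = forced
        .iter()
        .filter_map(|field| field_mappings.get(field).map(String::as_str))
        .collect();

    // Pass 2: any field whose target is in that set is an alias and gets locked too.
    let mut expanded = forced.clone();
    for (field, target) in field_mappings {
        if forced_targets.contains(target.as_str()) {
            expanded.insert(field.clone());
        }
    }
    expanded
}
```

Given mappings where `author` and `publisher` both point at `/dcat/dct:publisher`, forcing only `author` yields a set containing both names, while a field with a different target stays unlocked.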

Alias pairs covered by this fix in DCAT-US v3:
* author / publisher → dct:publisher
* landing_page / url → dcat:landingPage
* data_dictionary / describedBy → dcat:describedBy
* accrualPeriodicity / frequency / update_frequency → dct:accrualPeriodicity
* dcat-us:accessLevel / access_level → dcat-us:accessLevel
* accessRights / access_rights → dct:accessRights
* scopeNote / scope_note → skos:scopeNote
* liabilityStatement / liability_statement → dcat-us:liabilityStatement
* inSeries / in_series → dcat:inSeries
* versionNotes / version_notes → dcat:versionNotes
* license / license_id → distribution.dct:license
* modified / last_modified → distribution.dct:modified

Regression guards (tests/test_profile.rs):
* forced_author_locks_publisher_alias: forces package.author, formula targets `publisher`, asserts foaf:name is "Forced Author".
* forced_license_id_locks_license_alias: forces resource.license_id to cc-by, formula targets `license` with cc-by-sa, asserts the CC-BY 4.0 IRI (not CC-BY-SA) lands on Distribution.

Verification:
* cargo test cmd::profile:: → 127 unit tests pass.
* cargo test --test tests test_profile:: → 49 integration tests pass (47 prior + 2 new alias guards).
* cargo +nightly fmt applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* address review: 9 Copilot suggestions on PR #3908

Apply all 9 unresolved inline review comments. Each was verified against the current code before action.

1. docs/help/profile.md (truncated --initial-context help)
   Reformatted the USAGE block in src/cmd/profile.rs so the description survives markdown-table generation: flattened the nested bullet list into a single paragraph and added a pointer to dcat-init-context.README.md for the full example.

2. tests/resources/profile/dcat-init-context.README.md
   Updated the "How package / resource force flags route to DCAT" section to reference the active profile's `field_mappings:` table + `ProfileSpec::translate_ckan_ptr` instead of the deleted src/cmd/profile/ckan_to_dcat.rs module.

3. src/cmd/profile/profile_spec.rs (load-time validation claim)
   Moved `projection::dry_compile` inside `load()` so the doc claim on `EMBEDDED` is now accurate: every template parses through minijinja at profile-load time, surfacing typos before stats/frequency/formulas run. Dropped the redundant dry_compile call from profile.rs::run.

4. profile3-handoff.md (hardcoded absolute path)
   Removed the `/Users/joelnatividad/.claude/plans/...` reference to the original plan file; the handoff now describes the engine without pointing at a path that doesn't exist for other contributors.

5. resources/profiles/croissant.yaml (misplaced key)
   Removed the no-op `strippable_curie_prefixes: []` from the `discovery_merge:` block. That key lives under `validation:` per the schema; keeping it here was misleading.

6. src/cmd/profile.rs (dead `merge_discovered` + tests)
   Deleted the orphaned legacy `merge_discovered` function (the orchestrator now uses `discovery_merge::merge` exclusively) and the 9 in-file tests that exercised it. Coverage is preserved by the unit tests in src/cmd/profile/discovery_merge.rs and the new integration tests in tests/test_profile.rs (e.g. `catalog_mode_merges_discovered_into_inner_dataset_not_envelope`). Net −168 LOC.

7-8. src/cmd/profile.rs (stale `ckan_to_dcat` doc comments)
   Updated two doc comments (`apply_force_overrides` doc + the force-collection comment in `run()`) so future readers find `field_mappings:` + `ProfileSpec::translate_ckan_ptr` instead of being pointed at the deleted module.

9. resources/dcat-us-v3/README.md (wrong test path)
   The pin-guard test lives at tests/test_profile.rs::dcat_us_v3_bundle_pin_manifest_matches_files, not the non-existent tests/test_dcat_us_bundle_pin.rs. Updated both the prose reference and the `cargo test` invocation.

Verification:
* cargo build --bin qsv,qsvmcp,qsvlite,qsvdp: all 4 clean.
* cargo test cmd::profile:: → 117 unit tests pass (was 127; the 10 deleted merge_discovered tests are obsolete).
* cargo test --test tests test_profile:: → 49 integration tests pass (unchanged).
* cargo +nightly fmt applied.
* docs/help/profile.md regenerated via --generate-help-md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* address roborev #2495: extend dry_compile + restore IRI escape coverage

Two findings from the post-fix re-review of d78d34c.

Medium (src/cmd/profile/projection.rs:dry_compile):
The previous load-time validation only checked emit_when guards on dataset fields, leaving distribution and catalog field guards vulnerable. A typo in a distribution emit_when would compile-pass load() but silently render-fail at projection time (render_truthy treats the error as false, dropping the field).
Fix: extend dry_compile to syntax-check emit_when in both distribution and catalog field loops.
New guards:
* dry_compile_rejects_malformed_distribution_emit_when
* dry_compile_rejects_malformed_catalog_emit_when

Low (src/cmd/profile/discovery_merge.rs):
The removed merge_discovered tests carried regression coverage for forced discovered keys containing `/` or `~` (full-IRI JSON-LD properties like http://purl.org/dc/terms/title). Restore that coverage on discovery_merge's internal escape_token path.
New tests:
* forced_full_iri_key_blocks_matching_discovered_key: forced path with each `/` escaped to `~1` must block the matching discovered IRI key.
* forced_full_iri_key_does_not_block_unrelated_discovered_key: escaping must not over-match; unrelated discovered keys (e.g. dct:identifier) still flow through.
* escape_token_handles_rfc6901_round_trip: direct check of the `~`-before-`/` escape order on plain, slash, tilde, mixed, and full-IRI inputs.

Verification:
* cargo test cmd::profile:: → 122 unit tests pass (117 prior + 5 new).
* cargo test --test tests test_profile:: → 49 integration tests pass.
* cargo +nightly fmt applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
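
For readers unfamiliar with the `~`-before-`/` escape order mentioned in the last commit, here is a minimal sketch of RFC 6901 reference-token escaping. The name `escape_token` is borrowed from the commit message, but this body is an assumption about what such a helper does, not the code in discovery_merge.rs; `unescape_token` is a hypothetical companion added only to show the round trip.

```rust
/// Escape one JSON Pointer reference token per RFC 6901.
/// `~` is rewritten first: if `/` were escaped first, the `~` inside the
/// resulting `~1` would be escaped again and produce `~01`.
fn escape_token(token: &str) -> String {
    token.replace('~', "~0").replace('/', "~1")
}

/// Hypothetical inverse: decode `~1` before `~0`, so that the escaped form
/// `~01` comes back as the literal text `~1` rather than `/`.
fn unescape_token(token: &str) -> String {
    token.replace("~1", "/").replace("~0", "~")
}
```

Under this sketch a full-IRI key such as http://purl.org/dc/terms/title becomes the single token `http:~1~1purl.org~1dc~1terms~1title`, and unescaping it returns the original key.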

mintyplanet added 2 commits September 23, 2018 16:40

xsv: add transpose command

ef04e65

transpose command to transpose rows/columns of CSV data. PR BurntSushi#137

Add transpose in command list

9c4bf45

jqnatividad merged commit 3fd2edd into dathere:master Dec 27, 2020

peterjc mentioned this pull request Oct 14, 2021

Feature Request: deduplicate columns/extract unique columns #84

Closed

jqnatividad added a commit that referenced this pull request Dec 27, 2025

docs: frequency add doc string per GH Copilot suggestion #3

369864c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

jqnatividad added a commit that referenced this pull request Dec 27, 2025

docs: accept GH Copilot review #3

bf9781f

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

jqnatividad added a commit that referenced this pull request Jan 1, 2026

rename test to reflect new xsd_gdata_scan mode names #3

9dbcc45

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

jqnatividad mentioned this pull request May 27, 2026

feat(profile): YAML-driven projection engine — DCAT-US v3, DCAT-AP v3, Croissant #3908

Merged

10 tasks

jqnatividad merged 2 commits into dathere:master from mintyplanet:transpose