Add a new rename command to change the headers#5
Merged
Conversation
jqnatividad
added a commit
that referenced
this pull request
Jan 1, 2026
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
jqnatividad
added a commit
that referenced
this pull request
Feb 16, 2026
All 24 individual tests pass (all checkmarks). The "FAIL" is a pre-existing vitest configuration issue ("No test suite found") unrelated to my changes — all actual test cases pass.
Changes:
- In `translateSql`, only the first standalone `_t_1` occurrence gets the `AS _t_1` alias; subsequent standalone occurrences get the bare `readExpr` to avoid duplicate alias issues in UNION/subquery contexts
- Added test case for multiple standalone `_t_1` references verifying alias-on-first-only behavior
Address review findings (job 115)
All 274 tests pass. The changes compile cleanly and tests are green.
Changes:
- Include Parquet conversion warning in the error path (not just success), so users see the full chain of failures when both DuckDB and sqlp fail
- Log when Parquet warning cannot be prepended due to unexpected content structure
Address review findings (job 117)
All changes are complete. Build passes and all 274 tests pass.
Changes:
- Fix SQL injection via `compression`/codec parameter: whitelist valid codecs (`zstd`, `snappy`, `gzip`, `lz4`, `uncompressed`) and reject anything else
- Fix SQL injection via `delimiter` parameter: validate exactly 1 character and escape single quotes
- Fix race condition in `ensureParquet()`: add in-memory lock map to deduplicate concurrent conversions for the same file
- Add error checking after each qsv subprocess call in `ensureParquet()`: wrap each step in try/catch with descriptive error messages identifying which step failed
- Verify Parquet output file exists after conversion before returning success
Address review findings (job 118)
No tests reference the codec list. The change is minimal and the build succeeds.
Changes:
- Add `brotli` and `lzo` to `VALID_PARQUET_CODECS` set to match DuckDB's supported compression codecs
Address review findings (job 120)
All changes complete. Build succeeds and all 276 tests pass.
Changes:
- Remove unused `runQsvSimple` import from `duckdb.ts` (finding #1)
- Add trust assumption comment documenting that `sql` parameter comes from trusted MCP agent in `tryDuckDbExecution` (finding #2)
- Clean up partial `.parquet` file on conversion failure in `doParquetConversion` to prevent stale corrupted files (finding #4)
- Detect multi-table SQL queries (`_t_2`, `_t_3`, etc.) and fall back to sqlp preemptively instead of sending untranslated references to DuckDB (finding #5)
- Add tests for multi-table reference detection regex and `_t_2+` non-translation behavior
Address review findings (job 121)
Everything looks clean. Here's a summary:
Changes:
- Fix multi-table sqlp fallback: stop setting `SKIP_INPUT` for multi-table queries so sqlp receives the original input file and can resolve `_t_2+` references natively
- Extract `MULTI_TABLE_PATTERN` regex to a shared constant exported from `duckdb.ts`, eliminating duplication between `mcp-tools.ts` and `duckdb.test.ts`
- Clean up asymmetric brace style in the DuckDB/sqlp interception block into a clean if/else
Address review findings (job 122)
All 276 tests pass. The changes are minimal and focused:
Changes:
- Remove case-insensitive `/i` flag from `MULTI_TABLE_PATTERN` regex since sqlp table references (`_t_N`) are case-sensitive
- Add test cases verifying uppercase `_T_2` and `_T_10` do not match the multi-table pattern
Address review findings (job 124)
All changes applied, build and tests pass (276/276).
Changes:
- Add `.sz` compressed file variants to `isCsvLikeFile()` so Snappy-compressed CSV/TSV/TAB/SSV files are recognized and converted to Parquet (finding 1)
- Move `ensureParquet()` call after `MULTI_TABLE_PATTERN` check to avoid wasted Parquet conversion on multi-table queries that fall through to sqlp (finding 2)
- Restore case-insensitive flag on `MULTI_TABLE_PATTERN` regex for consistency with case-insensitive `_t_1` replacement in `translateSql` (finding 3)
- Update test assertions to expect uppercase `_T_2`/`_T_10` to match the now case-insensitive pattern
Address review findings (job 125)
All 280 tests pass. Here's the summary:
Changes:
- Reverted `MULTI_TABLE_PATTERN` regex to case-sensitive (removed `/i` flag) since sqlp table references (`_t_N`) are case-sensitive and uppercase `_T_2` would cause query errors
- Reverted duckdb tests to assert uppercase `_T_N` does NOT match the pattern
- Exported `isCsvLikeFile` function from `mcp-tools.ts` to enable testing
- Added unit tests for `isCsvLikeFile` covering standard extensions, `.sz` Snappy-compressed variants, case-insensitivity, and non-CSV rejection
Address review findings (job 127)
Changes:
- `getParquetPath` now uses `basename()` for extension matching, preventing false matches when directory paths contain CSV-like extensions (e.g., `/data/CSV_FILES/test.csv`)
- `MULTI_TABLE_PATTERN` made case-insensitive (`/i` flag) to be consistent with the `_t_1` replacement pattern, preventing confusing DuckDB errors when agents send uppercase `_T_2`
- `translateSql` now logs a warning for unknown file extensions (e.g., `.xlsx`) that are treated as CSV, helping diagnose cryptic DuckDB errors
- Secondary SIGKILL timer in `executeDuckDbQuery` is now properly cleared in both `close` and `error` handlers, preventing unnecessary SIGKILL on already-exited processes
- Updated `duckdb.test.ts` to match the case-insensitive `MULTI_TABLE_PATTERN` behavior
Address review findings (job 128)
All changes are complete. TypeScript compiles cleanly and all 286 tests pass.
Changes:
- Fix `MULTI_TABLE_PATTERN` flip-flop: make regex case-sensitive again, add `normalizeTableRefs()` to lowercase `_T_N` → `_t_N` before routing, so sqlp receives consistent lowercase refs
- Extract `CSV_LIKE_EXTENSIONS` shared constant in `duckdb.ts`, reuse in both `isCsvLikeFile()` and `translateSql()` to eliminate duplicated extension lists
- Export `getParquetPath()` and add regression tests for directory-path false-match bug (e.g., `/data/csv_files/test.json`)
- Add tests for `normalizeTableRefs()` and its interaction with `MULTI_TABLE_PATTERN`
Address review findings (job 129)
All 286 tests pass. The diagnostics about `qsv_bin` are pre-existing Rust issues unrelated to these changes.
Changes:
- Added comment explaining `.txt`/`.dat` are warning-only suppressions, not in `CSV_LIKE_EXTENSIONS` (no Parquet conversion)
- Simplified `params.sql` assignment by removing conditional check — always assign normalized SQL
Address review findings (job 131)
All 286 tests pass. Here's the summary:
Changes:
- Remove `params.output` fallback in `tryDuckDbExecution` to use only the pre-validated `outputFile` parameter, closing a potential path validation bypass
- Add clarifying comment to `getParquetPath` explaining why the mixed-case logic (lowercase matching + original-path slicing) is correct
- Add comment to `translateSql` documenting that `_t_1` replacement also applies inside SQL string literals (acceptable limitation for MCP agent usage)
Address review findings (job 133)
All 287 tests pass, 0 failures.
Changes:
- Consolidated `ChildProcess` type import with `spawn` into a single `child_process` import in `duckdb.ts`
- Extracted inline `maxStderrSize` magic number to named constant `MAX_STDERR_SIZE` alongside `DUCKDB_VALIDATION_TIMEOUT_MS`
- Exported `ensureParquet` and added tests for its early-return paths (non-CSV file passthrough)
Address review findings (job 135)
All 287 tests pass.
Changes:
- Made `MULTI_TABLE_PATTERN` case-insensitive (`/i` flag) for defense-in-depth, so uppercase `_T_N` references are caught even without `normalizeTableRefs` being called first
- Fixed TOCTOU race in `ensureParquet` by checking the lock map before any async `stat()` calls
- Added zero-byte file validation after Parquet conversion to catch corrupted/empty output files
- Updated test assertions to match the case-insensitive `MULTI_TABLE_PATTERN` behavior
Address review findings (job 136)
All 287 tests pass. The diagnostics about `qsv_bin` and `lesson_*.rs` are pre-existing issues unrelated to this change.
Changes:
- Restructured zero-byte file validation in `doParquetConversion` to use separate try/catch for `stat()` and a plain `if` for the size check, eliminating brittle string-matching error re-throw logic
Address review findings (job 138)
All changes are complete. Build passes and all 289 tests pass.
Changes:
- Fix `translateSql` to skip `_t_1` replacements inside single-quoted SQL string literals using a quote-aware regex
- Fix `normalizeTableRefs` regex to handle both `_T_N` and `_t_N` variants (was only matching uppercase `_T_`)
- Prefix delimiter validation error with `[DuckDB]` for clearer error attribution
- Document `getParquetPath` behavior for non-CSV files (appends `.parquet` suffix)
- Add tests for string literal protection (`SELECT '_t_1'` preserved) and escaped quotes
- Add test for `normalizeTableRefs` idempotency on already-lowercase refs
Address review findings (job 140)
All 290 tests pass. The changes are minimal and focused on the review findings.
Changes:
- Added comment documenting that multi-table queries (`_t_2+`) don't benefit from Parquet auto-conversion, directing users to manually convert with `qsv_to_parquet`
- Added comment clarifying delimiter validation: actual tab character (`\t`, length 1) works correctly; only the 2-char string literal `\\t` would be rejected
- Added test verifying multi-character delimiter strings are rejected by `translateSql`
Address review findings (job 142)
All 290 tests pass, build succeeds. Here's a summary:
Changes:
- Fix `translateSql` to use `(read_expr) AS _t_1` alias so qualified column refs like `_t_1.column` remain valid (only standalone `_t_1` not followed by `.` is replaced)
- Reuse `translateSql()` in sqlp fallback path instead of duplicating the `_t_1` → `read_parquet()` regex logic inline
- Add early `inputFile` existence check in `doParquetConversion` with clear "Input file not found" error message
- Simplify the stats/schema freshness check by removing unnecessary nested try/catch (input file is now validated upfront)
- Update all test assertions to match the new aliased read expression format
jqnatividad
added a commit
that referenced
this pull request
Feb 24, 2026
The file is a JS script (not Rust), so no build/test step applies here. Changes: - Guard stderr parse-failure warning with `input.trim()` check so empty stdin doesn't produce log noise Address review findings (job 439) The fix looks correct. The original bash script `cowork-setup.sh` was already converted to `cowork-setup.js` in a prior commit, so the review finding about the bash script is addressed by adding the equivalent 5-second timeout protection to the Node.js version. Changes: - Add 5-second timeout protection to `cowork-setup.js` stdin reading to prevent indefinite hangs if stdin isn't closed promptly, restoring the safety net that was present in the original bash implementation Address review findings (job 442) JS syntax check passed (no errors from `node --check`). The diagnostics shown are pre-existing Rust issues unrelated to these changes. Changes: - Add `process.stdin.destroy()` after 64KB size limit break to ensure immediate stdin cleanup instead of waiting for timeout - Strip trailing separators from `normalizedRoot` before path prefix comparison to prevent Windows edge case with trailing backslash - Use `resolvedPluginRoot` (resolved real path) instead of raw `pluginRoot` for template path construction, consistent with security pattern - Clarify in cowork-CLAUDE.md that `QSV_NO_COWORK_SETUP=1` should be set in the shell environment before launching Claude Code - Finding #5 (ESM without `"type": "module"`) confirmed already addressed — package.json has `"type": "module"` Address review findings (job 443) The syntax check passed (no output means no errors). This is a standalone JS script (not part of the Rust build), so no further build/test steps are needed. Changes: - Apply trailing separator strip to `cwdForCompare` for symmetry with `rootForCompare`, ensuring directory equality checks work when cwd has a trailing separator Address review findings (job 445) Syntax check passes. All changes are complete. Changes: - Documented the 5-second stdin timeout as a known constraint in cowork-setup.js (finding 1) - Used `normalize()` on the path prefix string for the containment check to handle mixed separators on Windows (finding 2) - Qualified the SQL auto-conversion statement in cowork-CLAUDE.md to scope it to `qsv_sqlp` specifically (finding 3) - Verified `package.json` already has `"type": "module"`, so ESM imports in `.js` files work correctly (finding 4 — no change needed) Address review findings (job 446) `normalize` is still used on lines 71-72, so the import stays. The fix is complete — it simply reverts to the original correct logic. Changes: - Revert `normalize(rootForCompare + separator)` back to `rootForCompare + separator` in the path containment check to prevent `normalize()` from stripping the trailing separator, which would cause false prefix matches (e.g., `/home/user/qsv` matching `/home/user/qsv-other`) Address review findings (job 448) Changes: - Replace `setTimeout`+`process.exit(0)` stdin timeout with `process.stdin.destroy()` approach that cleanly breaks the `for await` loop instead of abruptly terminating - Make path comparison case-insensitive on macOS (`darwin`) in addition to Windows, since macOS uses case-insensitive HFS+ by default - Add stderr log messages when the 64KB stdin limit is hit or the 5s timeout fires, to aid debugging - Enrich the Operation Timeout section in `cowork-CLAUDE.md` with the 10-minute default and `QSV_MCP_OPERATION_TIMEOUT_MS` env var details Address review findings (job 449) Syntax is valid. No build or test issues. Changes: - Wrapped `for await` stdin loop in `try/catch` to handle errors from `process.stdin.destroy()` on timeout, preventing unhandled exceptions and noisy stack traces Address review findings (job 451) No syntax errors. The changes only touch the JS file, so no Rust build/test is needed. Changes: - Document macOS case-sensitive APFS volume limitation in path prefix check comment (finding 1) - Log unexpected stdin errors instead of silently swallowing all errors in catch block (finding 2) - Add stderr warning when JSON parse fails on stdin input (finding 3) - Verified `QSV_MCP_OPERATION_TIMEOUT_MS` and 30-minute max are implemented in `config.ts` — no change needed (finding 4) - Finding 5 (test coverage) noted but not addressed as it's a suggestion for future work
jqnatividad
added a commit
that referenced
this pull request
Mar 2, 2026
All 421 tests pass, 0 failures. The change is correct. Changes: - Fix Unicode truncation fast-path to use UTF-16 length as a cheap guard (strings shorter in UTF-16 are guaranteed shorter in codepoints), only performing expensive `Array.from()` codepoint conversion when the string exceeds the limit Address review findings (job 606) All 72 tests pass (0 failures), including the 3 new tests for missing params. Changes: - Check for `null`/`undefined` params explicitly before string coercion in `handleLogCall`, returning clear "is required" error messages (finding #1) - Trim and strip newlines from log messages before writing, preventing multi-line log entries and inconsistent whitespace (findings #2, #3) - Added tests for missing `entry_type`, missing `message`, and entirely empty params (finding #5) Address review findings (job 607) All 418 tests pass, including the new one. Changes: - Add test for newline-only message (`'\n\n'`) confirming it's rejected as non-empty string Address review findings (job 609) All 74 tests pass (0 failures), including all the new and existing `handleLogCall` tests. Changes: - Log `catch` block now writes error details to stderr via `console.error` instead of silently swallowing - Added `--` separator before the message argument in `qsv log` CLI call to prevent messages starting with `-` from being misinterpreted as flags - Documented newline collapsing behavior in the tool description ("Newlines are collapsed to spaces") - Added test for non-string type coercion (`{ entry_type: 123, message: true }`) confirming `String()` coercion behavior Address review findings (job 610) All 420 tests pass. Changes: - Include truncated error message in the success result returned to the agent (not just stderr), so the agent has actionable context when `qsv_log` write fails - Add test for non-string message coercion with valid `entry_type` to verify `String()` coercion works for the message path Address review findings (job 611) All 420 tests pass. The Rust diagnostics are pre-existing and unrelated to this change. Changes: - Added `assert.ok(!result.isError)` to the `handleLogCall` non-string message coercion test to explicitly verify the result is not an error, making the test intent clearer Address review findings (job 613) All 420 tests pass, 0 failures. The changes are verified. Changes: - Added comment on `--` separator in `handleLogCall` args explaining it guards against messages starting with `-` being parsed as flags (addresses medium finding) - Added `config.qsvValidation.valid` skip guard to `handleLogCall coerces non-string message` test so it properly tests the success path instead of passing accidentally via error swallowing (addresses low finding #4) - Added assertion that success response doesn't contain "warning" to confirm actual success vs swallowed error Address review findings (job 615) No CLAUDE.md changes needed for the `--` removal. All changes are complete and tests pass. Changes: - Remove unnecessary `--` end-of-options sentinel from `qsv log` args — `qsv log` uses docopt variadic `[<message>...]` which handles this correctly, and messages always start with `[entry_type]` so they can never be misinterpreted as flags - Fix Unicode-safe truncation using `Array.from()` instead of `String.slice()` to avoid splitting surrogate pairs in non-ASCII messages - Add throttling guidance to server instructions ("Avoid excessive logging — for simple interactions, a single user_prompt + result_summary pair is enough") - Add test for the `handleLogCall` error-swallowing catch path using a non-existent working directory Address review findings (job 616) The change looks correct. The length check and truncation now both operate on codepoints consistently. Changes: - Fix Unicode truncation length mismatch: use codepoint count (`Array.from(sanitized).length`) for both the gate condition and the truncation, avoiding inconsistency between UTF-16 `.length` and codepoint-aware `Array.from().slice()` Address review findings (job 618) All 421 tests pass, 0 failures. All `handleLogCall` tests pass including the updated write-failure test. Changes: - Reworded catch-path message from misleading `"Logged ... (warning: write failed: ...)"` to clearer `"Log write failed (non-fatal): ... Workflow continues."` (issue 1) - Added fast-path optimization for Unicode truncation: only call `Array.from()` when `sanitized.length > MAX_LOG_MESSAGE_LEN`, avoiding unnecessary codepoint conversion on short messages (issue 3) - Updated test assertions to match the new error message wording
jqnatividad
added a commit
that referenced
this pull request
Mar 2, 2026
* feat(mcp): add qsv_log core tool for agent-initiated reproducibility logging Enable agents to write structured entries (user_prompt, agent_reasoning, agent_action, result_summary, note) to the qsv audit log (qsvmcp.log) with u- prefixed UUIDs, distinct from automatic s-/e- audit entries. Automatic audit logging is skipped for qsv_log calls to avoid recursion. Messages are truncated at 4096 chars and logging failures never break the workflow. Server instructions updated to guide agents on when/how to log for third-party reproducibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Address review findings (job 619) All 421 tests pass, 0 failures. The change is correct. Changes: - Fix Unicode truncation fast-path to use UTF-16 length as a cheap guard (strings shorter in UTF-16 are guaranteed shorter in codepoints), only performing expensive `Array.from()` codepoint conversion when the string exceeds the limit Address review findings (job 606) All 72 tests pass (0 failures), including the 3 new tests for missing params. Changes: - Check for `null`/`undefined` params explicitly before string coercion in `handleLogCall`, returning clear "is required" error messages (finding #1) - Trim and strip newlines from log messages before writing, preventing multi-line log entries and inconsistent whitespace (findings #2, #3) - Added tests for missing `entry_type`, missing `message`, and entirely empty params (finding #5) Address review findings (job 607) All 418 tests pass, including the new one. Changes: - Add test for newline-only message (`'\n\n'`) confirming it's rejected as non-empty string Address review findings (job 609) All 74 tests pass (0 failures), including all the new and existing `handleLogCall` tests. Changes: - Log `catch` block now writes error details to stderr via `console.error` instead of silently swallowing - Added `--` separator before the message argument in `qsv log` CLI call to prevent messages starting with `-` from being misinterpreted as flags - Documented newline collapsing behavior in the tool description ("Newlines are collapsed to spaces") - Added test for non-string type coercion (`{ entry_type: 123, message: true }`) confirming `String()` coercion behavior Address review findings (job 610) All 420 tests pass. Changes: - Include truncated error message in the success result returned to the agent (not just stderr), so the agent has actionable context when `qsv_log` write fails - Add test for non-string message coercion with valid `entry_type` to verify `String()` coercion works for the message path Address review findings (job 611) All 420 tests pass. The Rust diagnostics are pre-existing and unrelated to this change. Changes: - Added `assert.ok(!result.isError)` to the `handleLogCall` non-string message coercion test to explicitly verify the result is not an error, making the test intent clearer Address review findings (job 613) All 420 tests pass, 0 failures. The changes are verified. Changes: - Added comment on `--` separator in `handleLogCall` args explaining it guards against messages starting with `-` being parsed as flags (addresses medium finding) - Added `config.qsvValidation.valid` skip guard to `handleLogCall coerces non-string message` test so it properly tests the success path instead of passing accidentally via error swallowing (addresses low finding #4) - Added assertion that success response doesn't contain "warning" to confirm actual success vs swallowed error Address review findings (job 615) No CLAUDE.md changes needed for the `--` removal. All changes are complete and tests pass. Changes: - Remove unnecessary `--` end-of-options sentinel from `qsv log` args — `qsv log` uses docopt variadic `[<message>...]` which handles this correctly, and messages always start with `[entry_type]` so they can never be misinterpreted as flags - Fix Unicode-safe truncation using `Array.from()` instead of `String.slice()` to avoid splitting surrogate pairs in non-ASCII messages - Add throttling guidance to server instructions ("Avoid excessive logging — for simple interactions, a single user_prompt + result_summary pair is enough") - Add test for the `handleLogCall` error-swallowing catch path using a non-existent working directory Address review findings (job 616) The change looks correct. The length check and truncation now both operate on codepoints consistently. Changes: - Fix Unicode truncation length mismatch: use codepoint count (`Array.from(sanitized).length`) for both the gate condition and the truncation, avoiding inconsistency between UTF-16 `.length` and codepoint-aware `Array.from().slice()` Address review findings (job 618) All 421 tests pass, 0 failures. All `handleLogCall` tests pass including the updated write-failure test. Changes: - Reworded catch-path message from misleading `"Logged ... (warning: write failed: ...)"` to clearer `"Log write failed (non-fatal): ... Workflow continues."` (issue 1) - Added fast-path optimization for Unicode truncation: only call `Array.from()` when `sanitized.length > MAX_LOG_MESSAGE_LEN`, avoiding unnecessary codepoint conversion on short messages (issue 3) - Updated test assertions to match the new error message wording * fix(mcp): address Copilot review findings for qsv_log - Move skipAuditLog from "Key Constants" to a behavior note in CLAUDE.md (it's a local variable, not a module-level constant) - Reorder enum and LOG_ENTRY_TYPES Set to match description order (reasoning before action) - Add unique temp dir + cleanup to coercion test to prevent log file accumulation in OS temp root Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
jqnatividad
added a commit
that referenced
this pull request
May 10, 2026
- Dedupe build_large_oom_csv into tests/workdir.rs so test_stats and test_frequency share one source of truth (Low #1). - Document the pre-indexed + OOM → sketch fallback path in --memcheck USAGE text, CHANGELOG, and docs/STATS_DEFINITIONS.md (Low #2). - Drop the dead flag_sketch_method='frequent_items' assignment before run_frequent_items — confirmed run_frequent_items does not consult flag_sketch_method (Low #3). - Tighten the stats and frequency OOM wwarn messages to "Re-run with explicit ... exact to disable the auto-enable" — matches the established frequency wording and removes the misleading "override" phrasing (Low #4). Verified Low #5 separately: which_stats() already gates mad on !approx_quantiles regardless of flag_everything/flag_mad, so the auto-disable promised by the wwarn is honored. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jqnatividad
added a commit
that referenced
this pull request
May 11, 2026
) * feat(stats,frequency): auto-enable DataSketches estimators on OOM When --memcheck is set and util::mem_file_check returns OutOfMemory, stats and frequency now auto-enable their DataSketches-backed estimators (t-digest + HyperLogLog for stats; Misra-Gries Frequent Items for frequency) in addition to the existing auto-index fallback. Conflict guards mirror the explicit-validation rejections so the auto-enable only flips methods that would have passed validation if set by hand. A wwarn! lists the auto-enabled estimators; the original OOM is only propagated when neither fallback engages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(stats,frequency): address roborev #2028 findings - Dedupe build_large_oom_csv into tests/workdir.rs so test_stats and test_frequency share one source of truth (Low #1). - Document the pre-indexed + OOM → sketch fallback path in --memcheck USAGE text, CHANGELOG, and docs/STATS_DEFINITIONS.md (Low #2). - Drop the dead flag_sketch_method='frequent_items' assignment before run_frequent_items — confirmed run_frequent_items does not consult flag_sketch_method (Low #3). - Tighten the stats and frequency OOM wwarn messages to "Re-run with explicit ... exact to disable the auto-enable" — matches the established frequency wording and removes the misleading "override" phrasing (Low #4). Verified Low #5 separately: which_stats() already gates mad on !approx_quantiles regardless of flag_everything/flag_mad, so the auto-disable promised by the wwarn is honored. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address Copilot review: stream OOM fixture, drop drift-prone line refs, add -- prefix - tests/workdir.rs: rewrite build_large_oom_csv to stream rows directly to a csv::Writer instead of building a 10M-row Vec in memory first. Avoids ~1.5 GB of String allocation that would OOM the test harness itself on memory-constrained hosts, defeating the purpose of the ignored OOM tests. - src/cmd/stats.rs, src/cmd/frequency.rs: replace hard-coded intra-file line-number references in the try_enable_approx_sketches and can_enable_frequent_items doc-comments with descriptive references to the validator/dispatch blocks they mirror. - src/cmd/frequency.rs: add the missing -- prefix to "sketch-method frequent_items" in the --memcheck USAGE help text; regenerate docs/help/frequency.md and .claude/skills/qsv/qsv-frequency.json. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(stats,frequency): clarify OOM fallback fires in both NORMAL and CONSERVATIVE mode The OOM auto-fallback to DataSketches estimators is gated on the result of util::mem_file_check, NOT on whether --memcheck is set. The in-memory load check runs unconditionally on the non-parallel path; --memcheck only switches the check from NORMAL mode (vs. total memory) to the stricter CONSERVATIVE mode (vs. available + swap × platform factor). The fallback can therefore trigger without --memcheck too — just much less often, since NORMAL mode only trips when the file is larger than ~80% of total RAM. Rewrote --memcheck USAGE in stats.rs and frequency.rs to: - Lead with what --memcheck actually does (CONSERVATIVE vs. NORMAL). - Reference QSV_MEMORY_CHECK as the env-var equivalent. - Describe the OOM fallback as a behavior of the load check itself, not of --memcheck specifically. Updated CHANGELOG.md and docs/STATS_DEFINITIONS.md to match. Regenerated docs/help/{stats,frequency}.md and the corresponding MCP skill JSONs from the new USAGE. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address Copilot review: real opt-out for OOM auto-enable + assert command success in tests Two issues raised by Copilot on the OOM auto-fallback: (1) The wwarn promised users could re-run with explicit --quantile-method exact / --cardinality-method exact / --sketch-method exact to disable the auto-enable, but the code only checked the parsed flag value. docopt fills in the default "exact" either way, so an explicit --foo-method exact was indistinguishable from omitting the flag — making the documented opt-out a no-op. Fix: scan argv for the literal flag names ("--foo-method" or "--foo-method=...") to detect explicit user intent. Thread that through try_enable_approx_sketches / can_enable_frequent_items via new user_set_* parameters; the auto-enable is suppressed when the user explicitly provided the flag (regardless of value). Documented in STATS_DEFINITIONS.md. (2) The new OOM tests used wrk.output_stderr, which returns stderr regardless of exit status — a command that errored out after printing the auto-enable wwarn would still pass the test. Fix: add a wrk.stderr_on_success helper that asserts status.success() before returning stderr (with the same diagnostic-rich panic format as assert_success). Migrate the 6 new stats/frequency OOM tests to use it. Other call sites of output_stderr left untouched — they test failure paths where non-success is intentional. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(frequency): restore can_enable_frequent_items doc comment Address roborev #2032: in the previous commit (d9fe03e), `argv_has_flag` and its doc comment were inserted *between* `can_enable_frequent_items`'s doc comment and the function body. Because Rust doc comments attach to the next item, the entire docstring block (including the `user_set_sketch_method` paragraph and the trailing "Returns false if any conflicting flag is set..." line) bound to `argv_has_flag`, leaving `can_enable_frequent_items` with no doc comment at all. Move `argv_has_flag` (with its own 4-line doc) to live *after* `can_enable_frequent_items`, mirroring the layout in stats.rs where the ordering was correct. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jqnatividad
added a commit
that referenced
this pull request
May 27, 2026
…, Croissant (#3908) * feat(profile): comprehensive DCAT-US v3 support (Catalog, GSA bundle, force semantics) Closes the five gaps that kept `qsv profile` from being an agency-grade DCAT-US v3 reference tool: - Vendor the full GSA JSON Schema bundle (26 definitions + 2 qsv overlays + MANIFEST.json + refresh README) under resources/dcat-us-v3/, pinned to upstream commit cf8789002. `--validate-dcat` now runs against the full bundle via `referencing::Registry`, dispatching the Dataset or Catalog overlay by the emitted `@type`. A `curie::strip_curies` pre-pass bridges qsv's JSON-LD-compact output to GSA's unprefixed schema keys without touching the emitted JSON on disk. - Add `--catalog` flag that wraps the Dataset inside a `dcat:Catalog` envelope (`Catalog{dataset:[...]}`) for federation harvesters. - Emit nine new optional v3 fields with natural data sources: Dataset-level `dct:created`, `dcat:version`, `dcat:versionNotes`; Distribution-level `dcat:checksum` (SHA-256 via sha2), `dcat:compressFormat`, `dcat:packageFormat`, `dcat:spatialResolutionInMeters`, `dct:language`, `dct:conformsTo`. Widen `dct:conformsTo` to array per v3 cardinality; emit `dct:license` as string and `dcat:byteSize` as string to match the GSA schemas' declared shapes. - Implement full `force: true` override semantics across all three --initial-context subtrees. `context::collect_forced_paths` now walks package/resource entries through a 47-entry `ckan_to_dcat` mapping table; `apply_force_overrides` in `run()` applies forced leaves LAST so they beat both inferred and discovered metadata. Pipeline precedence (low → high): inferred → discovered → dataset_info pointers → forced leaves → schema validation. Bumps profile feature: adds `sha2 = "0.10"` as a direct dep. Test counts: 143 unit (was 96, +47) and 29 integration (was 18, +11) all passing, plus a new bundle pin guard test that re-hashes every vendored schema against MANIFEST.json on each run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(profile): scaffold YAML-driven projection engine (Stage 1) Lay the foundation for the YAML-driven multi-profile projection engine described in plan §1-§2. New modules are wired into profile.rs but the orchestrator still calls the legacy dcat.rs path — zero behavior change shipped here. Subsequent stages (§3-§8) populate the profile YAMLs, swap the orchestrator, and delete the legacy hardcoded modules. New modules: * src/cmd/profile/profile_spec.rs — ProfileSpec serde types, embedded- first load() with case-insensitive name resolution, file-path fallback, 6 unit tests. * src/cmd/profile/projection.rs — generic project() engine with ProjectionMode { Dataset, Catalog }, ProjectionWarning { Severity }, wrap_as_catalog, for_each_column RecordSet expansion, profile-aware lookup/field_mapping closures, dry_compile validator, 9 unit tests. * src/cmd/profile/discovery_merge.rs — merge() with fill-if-absent, overlay-array, never strategies; never_overwrite + forced_paths protection; 5 unit tests. Helper additions (formula_helpers.rs): * Filters: only_if_absolute_iri, basename, file_stem, sanitize_iso_8601_interval, format_mailto. * Globals: sha256_of (streaming), blake3_of (mmap+rayon), file_size_of, compress_format, package_format, build_csvw_schema. Helpers needing profile state (lookup, field_mapping) live in projection.rs::register_profile_helpers as closures over the ProfileSpec; they unwrap_or(UNDEFINED) so | default chains work. USAGE additions: * --profile <name|path>: embedded names (dcat-us-v3, dcat-ap-v3, croissant) resolved first; falls back to file path. Not yet consumed in run() — wired up in Stage 4. Placeholder YAMLs under resources/profiles/ exist so include_str! resolves during Stage 1 builds; they will be replaced with real content in Stages 3 (DCAT-US v3), 6 (DCAT-AP v3), 7 (Croissant). Verification: * cargo build --bin qsv -F profile,feature_capable — clean (23 expected dead-code warnings for the unused scaffold). * cargo test cmd::profile:: — 163 passed (+20 new tests). * cargo test --test tests test_profile:: — 29 passed (no regression). * cargo +nightly fmt — applied. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(profile): capture goldens from legacy engine before YAML swap (Stage 2) Lock the byte-equivalent output of the current hardcoded dcat.rs engine against three regression fixtures so Stage 3's YAML-driven projection can be asserted to produce identical Dataset + Catalog blocks. Goldens captured by running today's qsv profile against each fixture with the canonical --initial-context template, then normalizing via jq to strip the only path-dependent field (qsv:sourcePath inside dcat:distribution). Everything else in the .dcat block — including dcat:byteSize, dcat:checksum, dct:modified, csvw:tableSchema — is deterministic for fixed input and is captured verbatim. Fixtures (under tests/resources/profile/golden/): * nyc-311-subset.csv (10 rows) — geocoded urban service requests: lat/lon present, mixed Open/Closed status, multi-agency. * usda-soil-subset.csv (10 rows) — scientific numeric data: pH, organic_carbon_pct, nitrogen_pct, clay/sand/silt percentages. * wprdc-311-subset.csv (10 rows) — Pittsburgh 311 records: capitalized headers, X/Y geo, council districts + wards. Goldens per fixture: * <fixture>.dataset.expected.json — the .dcat block from Dataset mode. * <fixture>.catalog.expected.json — the .dcat block from --catalog mode. .gitignore whitelists tests/resources/profile/golden/*.{csv,expected.json} so the *.json + *.csv blanket-ignores don't strip them. These goldens will drive Stage 3's dcat_us_v3_golden_parity_dataset and dcat_us_v3_golden_parity_catalog tests; CI hard-fails on drift. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(profile): ship dcat-us-v3.yaml profile (Stage 3, partial) Author resources/profiles/dcat-us-v3.yaml — the full DCAT-US v3 projection definition that will replace the hardcoded dcat.rs engine in Stage 4. The YAML mirrors the legacy add_* functions field-for-field in declaration order so serde_json::Map insertion preserves wire-shape parity (verified against the Stage-2 goldens at swap time). Profile content: * 4 vocabularies (license_iri, accrual_periodicity, iso_639_1, csvw_datatype) — each migrated verbatim from the legacy Rust constants. The EU vocab IRIs retain http:// scheme per their canonical published identifiers; DevSkim DS137138 suppressed per line. * 53 field_mappings — same CKAN→DCAT pointer table the legacy ckan_to_dcat::CKAN_TO_DCAT held, in identical declaration order so alias-resolution precedence is preserved. * dataset.fields[] — 23 entries covering core identity, provenance, contact point (required), classification, coverage, US codes (recommended), governance, and extended metadata. emit_when guards match the legacy `if let Some(...)` shapes. * distribution.fields[] — 22 entries covering title, description, download URL, format/license/restrictions, language/conformance, file-derived facts (byteSize, checksum, compress/package format), spatial resolution, and csvw:tableSchema. * catalog block reproduces wrap_as_catalog's envelope (Catalog of <title>, dct:conformsTo, dct:publisher inheritance). * discovery_merge: enabled, never_overwrite=[@context,@type, dcat:distribution], fill-if-absent strategy. * validation: enabled against the vendored GSA bundle under resources/dcat-us-v3/ with the same 11 strippable CURIE prefixes. dry_compile verification: A new unit test (embedded_dcat_us_v3_parses_and_dry_compiles) parses the embedded YAML and runs projection::dry_compile() against it — exercising every template's minijinja compile path. All templates compile clean. The actual byte-equivalent parity test (running each Stage-2 fixture through projection::project() and asserting against goldens) lands in Stage 4 alongside the orchestrator swap — at that point the engine actually consumes the YAML. The reference cross-checked sources for content: https://github.com/GSA/dcat-us/ https://resources.data.gov/resources/dcat-us3/ the vendored GSA bundle under resources/dcat-us-v3/ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(profile): handoff #3 — YAML projection engine, Stages 1-3 landed Captures the current state after the YAML-driven projection migration's first three commits. Documents what's wired (scaffold + helpers + flag + goldens + DCAT-US v3 YAML), what's still on the legacy path (dcat.rs drives output), and a 9-sub-step Stage 4 plan for the orchestrator swap. Supersedes profile2-handoff.md for post-PR-#3901 work. Key gotchas distilled into §5: lookup helpers must return Value::UNDEFINED, goldens only normalize qsv:sourcePath, field-mapping count is 53 not 47. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(profile): wire YAML projection engine into orchestrator (Stage 4a) profile.rs::run now routes through projection::project() with the loaded ProfileSpec (default: dcat-us-v3). The YAML engine produces byte-equivalent output to the legacy dcat.rs path on all 6 golden fixtures (3 inputs × dataset/catalog modes), verified by new parity integration tests. Orchestrator changes: * Load profile via profile_spec::load(args.flag_profile | "dcat-us-v3") at the top of run(), then projection::dry_compile() to fail fast on malformed embedded YAML. * ContextArgs gains a `profile: &ProfileSpec` field; context::build threads it to load_initial_context → collect_forced_paths so the CKAN→target pointer translation uses profile.field_mappings instead of importing ckan_to_dcat. * Replace dcat::build() call with projection::project(&profile, &projection_ctx, mode) — the projection_ctx carries pkg, res, stats, dpp, source_label, local_path matching the YAML's template names. * Replace merge_discovered() with discovery_merge::merge(&profile, inferred, discovered, forced_dcat_paths) — same /dcat/<key> forced- path semantics, now driven by profile.discovery_merge. * Catalog wrap baked into project() via ProjectionMode::Catalog (chosen upfront based on flag_catalog); orchestrator no longer calls catalog::wrap_as_catalog at the warning-filter step. * Stash key renamed __pending_dcat_warnings → __pending_projection_warnings. * DcatWarning → ProjectionWarning conversion bridges dcat_validate and run_profile_validation outputs (Stage 5 will refactor those modules to return ProjectionWarning directly). Engine improvements: * projection::project sets UndefinedBehavior::Chainable so `pkg.dpp_suggestions.spatial_extent.value | default("")` walks missing intermediates gracefully (matches legacy dcat.rs semantics where absent keys silently fall through). * New file-aware helpers in formula_helpers.rs: - bbox_from_dpps(dpp, stats) — lat/lon column → POLYGON-WKT `dct:Location` array, mirroring legacy dcat::bbox_from_dpps. - temporal_from_dpps(dpp, stats) — date columns → array of `dct:PeriodOfTime`, one per inferred date column. - build_csvw_schema(stats) — column-name → stats-blob map walked, emitting `{columns: [...]}` with name, titles, datatype, qsv:cardinality / nullcount / min / max. - csvw_datatype_legacy helper mirrors the legacy mapping (Float → double, Integer → integer, Date → date, etc.). dcat-us-v3.yaml updates: * dct:spatial / dct:temporal fields call bbox_from_dpps / temporal_from_dpps as fallbacks behind the formula-derived WKT suggestion. * dct:license emits a plain string (legacy license_value shape) via `{{ lookup("license_iri", raw) | default(raw) }}`, not the previous `{"@id": ...}` object form (GSA Distribution.json declares license as anyOf:[null,string]). Tests: * 2 new integration tests (dcat_us_v3_golden_parity_dataset / _catalog) iterate the 3 fixtures and assert byte-equivalent .dcat output against the goldens. * discovery_merge test: forced-path form switched from "/dct:title" to "/dcat/dct:title" so it matches the legacy dataset_info pointer shape; +1 new test for nested-path force blocking top-level merge. * All 6 goldens refreshed to current legacy output (the original Stage-2 capture had alphabetical stats-cache state). * Full test sweep: 165 unit + 31 integration tests pass, 0 failures. The legacy dcat.rs / catalog.rs / ckan_to_dcat.rs / curie.rs modules are still in tree (their tests still run via cmd::profile::*) but no longer participate in the engine path. Stage 4b deletes them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(profile): delete legacy hardcoded engine + refactor validator (Stages 4b + 5) The YAML-driven projection engine is now the only path. Stage 4a wired projection::project() into run() with byte-equivalent output against the goldens; this commit cleans up by deleting the legacy modules and refactoring dcat_validate to consume the active ProfileSpec. Deletions (~2400 LOC): * src/cmd/profile/dcat.rs (1738 LOC) — the 9 add_* helpers, bbox_from_dpps, temporal_from_dpps, csvw_datatype, license_value, accrual_periodicity_iri, normalize_iso_639_1. The minijinja-side equivalents live in formula_helpers.rs + dcat-us-v3.yaml. * src/cmd/profile/catalog.rs (154 LOC) — wrap_as_catalog moved into projection::wrap_as_catalog. * src/cmd/profile/ckan_to_dcat.rs (271 LOC) — CKAN_TO_DCAT table moved verbatim into dcat-us-v3.yaml's field_mappings:; the lookup is now ProfileSpec::translate_ckan_ptr. * src/cmd/profile/curie.rs (225 LOC) — strip_curies is now an inline helper in dcat_validate.rs driven by profile.validation.strippable_curie_prefixes. * mod declarations for the deleted modules in profile.rs. dcat_validate.rs refactor (Stage 5): * New public API: validate(profile: &ProfileSpec, block: &Value) -> Vec<ProjectionWarning>. When profile.validation.enabled == false (DCAT-AP v3, Croissant), returns vec![] without touching the schema. * Inline strip_curies / strip_curie_key replace the deleted curie module; the prefix list comes from profile.validation.strippable_curie_prefixes (still byte-identical to the legacy list for DCAT-US v3). * classify_severity now returns projection::Severity instead of dcat::Severity. * Test functions migrate to the new (profile, block) signature by loading the embedded dcat-us-v3 profile via profile_spec::load. profile.rs cleanup: * dcat_validate::validate_dataset_or_catalog() call → validate(). * run_profile_validation now returns Vec<ProjectionWarning> directly; the .into_iter().map(From::from) bridge is gone. projection.rs cleanup: * impl From<DcatWarning> for ProjectionWarning removed (no longer needed — all warning producers return ProjectionWarning). Verification: * cargo build --bin qsv -F profile,feature_capable — clean. * All 4 binaries build clean: qsv (-F all_features), qsvmcp (-F qsvmcp), qsvlite (-F lite), qsvdp (-F datapusher_plus). * cargo test cmd::profile:: → 127 unit tests pass (down from 165; the deleted legacy modules carried 38 tests now obsoleted by the YAML+goldens parity coverage). * cargo test --test tests test_profile:: → 31 integration tests pass (29 original + 2 new dcat_us_v3_golden_parity_* tests). Net Rust LOC delta this commit: −2388 deleted, +60 added (inline strip_curies + tests) = −2328 LOC. Cumulative since Stage 1: −2328 + 1525 + 546 = −257 LOC vs the pre-YAML-engine state, AND all engine knowledge now lives in resources/profiles/dcat-us-v3.yaml where it's editable without recompiling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(profile): ship dcat-ap-v3 profile + 4 smoke tests (Stage 6) DCAT-AP v3 (semiceu.github.io/DCAT-AP/releases/3.0.0/) is now an embedded profile selectable via --profile dcat-ap-v3. The shape is a DCAT-US v3 subset, with: * JSON Schema validation disabled (DCAT-AP ships SHACL upstream; a SHACL backend is a future enhancement). * No dcat-us:* extensions (bureauCode, programCode, accessLevel, purpose, liabilityStatement) — those are US-specific. * New `eu_theme` vocabulary mapping CKAN group slugs to EU publications-office authority IRIs (http://publications.europa.eu/resource/authority/data-theme/...). * dcat:accessURL required on Distribution per the v3 spec (Mandatory cardinality 1..*). * dct:conformsTo points at the SEMIC v3 release URL. * Smaller field_mappings (29 entries vs the 53 in dcat-us-v3) since many DCAT-US extensions don't apply. The same minijinja templates and helpers power both profiles; the only Rust-side change in this commit is the YAML profile + tests. Smoke tests (tests/test_profile.rs): * dcat_ap_v3_emits_no_dcat_us_extensions — verifies the projection carries zero dcat-us:* keys even with the full initial-context. * dcat_ap_v3_distribution_carries_access_url — confirms the Distribution-mandatory dcat:accessURL is populated. * dcat_ap_v3_conforms_to_targets_spec_url — confirms downstream consumers can detect the profile via dct:conformsTo. * dcat_ap_v3_validation_is_disabled_noop — confirms --validate-dcat with this profile produces no JSON Schema warnings (the validator short-circuits when profile.validation.enabled == false). Source: https://github.com/SEMICeu/DCAT-AP Cardinality reference: https://semiceu.github.io/DCAT-AP/releases/3.0.0/ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(profile): ship croissant 1.0 profile + 5 smoke tests (Stage 7) Croissant ML metadata format (mlcommons.org/croissant) is now an embedded profile selectable via --profile croissant. The output is schema.org-rooted JSON-LD conforming to Croissant 1.0: * @context inlines the canonical Croissant map: @language=en, @vocab=https://schema.org/, plus cr:/dct: prefix shorthands. Per the Croissant spec at https://github.com/mlcommons/croissant/blob/main/docs/croissant-spec.md. * @type=sc:Dataset; field paths use schema.org bare keys (name/description/url/license/creator/publisher/keywords/etc.) rather than dcat:/dct: prefixes. * conformsTo target IRI: http://mlcommons.org/croissant/1.0. * Distribution emitted under bare `distribution` (schema.org @vocab resolves it) with @type=sc:FileObject. * Per-column cr:RecordSet/cr:Field expansion via the new build_croissant_fields helper — one Field per CSV column with schema.org dataType (sc:Text / sc:Integer / sc:Float / sc:Boolean / sc:Date / sc:DateTime). * BLAKE3 hash via cr:fileFingerprint (qsv-native mmap+rayon, markedly faster than SHA-256 on multi-GB ML training data; Croissant has no SPDX-mandated algorithm so the choice is free). * validation.enabled: false (Croissant uses a Python validator, mlcroissant, not JSON Schema). * discovery_merge.enabled: false (Croissant doesn't live in CKAN-style data portals). Engine extensions: * DatasetBlock.context now accepts a `Value` (string or object) so the inline Croissant @context map round-trips verbatim. DCAT-US / DCAT-AP profiles still use a string URI — backwards-compatible. * DistributionBlock.path lets profiles override the Distribution wrapper key. Croissant emits `distribution`; DCAT defaults remain `dcat:distribution`. * New formula helper build_croissant_fields(stats) walks the per- column stats map and emits a flat cr:Field array with schema.org dataType IRIs. Smoke tests (5 in tests/test_profile.rs): * croissant_uses_schema_org_context_and_sc_dataset_type * croissant_conforms_to_targets_mlcommons_spec * croissant_emits_recordset_with_one_field_per_csv_column * croissant_uses_bare_distribution_key_not_dcat_namespaced * croissant_distribution_uses_file_object_type Verification: cargo test cmd::profile:: → 127 unit, test_profile:: → 40 integration tests pass (29 original + 2 parity + 4 DCAT-AP + 5 Croissant). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(profile): regenerate help + finalize handoff (Stage 8) * docs/help/profile.md regenerated via --generate-help-md to surface the --profile flag added in Stage 1. * profile3-handoff.md updated to reflect all 8 stages landed, full file map post-deletion, verification commands, captured design decisions, and queued follow-ups. * src/cmd/profile.rs: drop the now-useless DcatWarning → ProjectionWarning conversion in the --validate-dcat code path (Stage 5 already refactored validate() to return ProjectionWarning directly). Verification: * python3 scripts/docs-drift-check.py → no drift detected. * All 4 binaries build clean (qsv, qsvmcp, qsvlite, qsvdp). * cargo test cmd::profile:: → 127 unit tests pass. * cargo test --test tests test_profile:: → 40 integration tests pass. * cargo clippy --bin qsv -F profile,feature_capable → no new findings in the YAML-engine code path. This closes the YAML-driven projection engine migration. The shipped binary always goes through projection::project(); the legacy dcat.rs / catalog.rs / ckan_to_dcat.rs / curie.rs modules are deleted. DCAT-US v3 / DCAT-AP v3 / Croissant projection knowledge lives entirely in resources/profiles/*.yaml — editable without recompiling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(profile): address roborev #2490 findings (catalog/discovery/force/validate) 7 findings from the YAML-engine branch review at job 2490. Each fix ships with a regression guard in tests/test_profile.rs. Medium severity (6): 1. Catalog mode + discovery merge target (src/cmd/profile.rs:398). Discovery was merging into the Catalog envelope top-level instead of the nested Dataset. Fix: project Dataset always, apply discovery_merge::merge, THEN conditionally wrap in Catalog via the new projection::wrap_in_catalog_envelope helper. Guard: catalog_mode_merges_discovered_into_inner_dataset_not_envelope. 2. Catalog envelope missing @context (src/cmd/profile/projection.rs:296). The envelope carried CURIE keys (dct:title, dct:conformsTo, dcat:dataset) without a top-level @context, leaving it invalid as JSON-LD. Fix: wrap_as_catalog now copies profile.dataset.context into the envelope; inner Dataset keeps its own context for self-containment. Guard: catalog_envelope_carries_top_level_context. 3. dct:spatial emits string "null" when no bbox (resources/profiles/dcat-us-v3.yaml + dcat-ap-v3.yaml). bbox_from_dpps returning UNDEFINED rendered as `"null"` via `| tojson` because coerce_json_or_string left the literal alone. Fix: emit_when guard gates the field on WKT-or-bbox availability. Guard: spatial_field_suppressed_when_no_lat_lon_columns. 4. --dcat-legacy-license parsed but never wired (src/cmd/profile.rs:380). Flag was documented + collected into Args but never reached the YAML engine. Fix: thread the flag into projection_ctx as `legacy_license`, add a conditional Dataset-level dct:license field in dcat-us-v3.yaml gated on that variable. Guards: dcat_legacy_license_emits_dataset_level_license, dcat_legacy_license_off_keeps_license_distribution_only. 5. Forced package/resource values bypass profile shaping (src/cmd/profile/context.rs:388). collect_forced_paths was writing raw CKAN values to target pointers via apply_force_overrides, producing string-where-Agent-expected shapes (e.g. forced package.publisher → "Name" instead of {"@type":"foaf:Agent","foaf:name":"Name"}). Fix: CKAN-side forces now only contribute to `forced_paths` (discovery-merge protection); the value lives in merged package/resource via normalize_value_force and flows through the profile's templates for proper shaping. dataset_info forces still take the raw-write path (that's the documented escape hatch). Guard: forced_package_publisher_flows_through_profile_template. 6. validate() ignores profile.validation paths (src/cmd/profile/dcat_validate.rs:250). When validation.enabled was true, the function always used the embedded GSA bundle regardless of profile.validation.schema_dir. Fix: when the profile's schema_dir matches the embedded `resources/dcat-us-v3/` path (the only bundle qsv ships today), use the embedded validators; any other schema_dir produces a single Recommended-severity warning explaining that custom-bundle validation is a queued follow-up. The embedded DCAT-US v3 profile's behavior is unchanged. Low severity (1): 7. DiscoveryMerge::default() disabled merging (src/cmd/profile/profile_spec.rs:273). #[derive(Default)] gave `enabled: false`, contradicting the documented "fill-if-absent enabled by default" semantics — the `#[serde(default = "default_true")]` annotation only fires during deserialization. Fix: hand-rolled Default impl with enabled: true, the never_overwrite list (@context, @type, dcat:distribution), and fill-if-absent strategy. Golden refresh: * Catalog goldens (nyc-311, usda-soil, wprdc-311) pick up the new envelope @context entry — finding #2 fix. * usda-soil dataset golden loses the spurious `"dct:spatial": "null"` entry — finding #3 fix. Verification: * cargo test cmd::profile:: → 127 unit tests pass. * cargo test --test tests test_profile:: → 46 integration tests pass (40 prior + 6 new regression guards). * All 4 binaries build clean (qsv, qsvmcp, qsvlite, qsvdp). * cargo +nightly fmt applied. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(profile): drop auto-generated stats caches from golden dir The previous commit accidentally committed three *.stats.csv files (qsv stats cache, auto-regenerated on every profile run). They slipped past .gitignore because the golden-directory *.csv whitelist also matches the stats.csv suffix. Fix: add a re-ignore rule for `tests/resources/profile/golden/*.stats.csv` and the JSONL variant, then `git rm` the committed files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(profile): preserve CKAN-side force against spec formulas (roborev #2491) Regression introduced by the #2490 fix #5: when CKAN-side `force: true` values stopped being raw-written via apply_force_overrides, they became vulnerable to overwrite by spec formulas. A formula targeting `package.publisher` would replace the forced value in merge_formula_results' pass-1 (before projection), violating the documented "force beats inferred" guarantee. Fix: track the CKAN-side forced field-name sets through the pipeline so merge_formula_results can skip them. * context.rs: collect_forced_paths now returns a 4-tuple including `forced_package_fields` and `forced_resource_fields` (HashSet<String> of CKAN-side field names marked force:true). load_initial_context returns the matching 6-tuple; AnalysisContext carries both sets. * profile.rs: merge_formula_results takes the two sets and skips pass-1 inserts on matching field names. Suggestion-formula output (pass 2) lives in dpp_suggestions and is unaffected. The forced value still flows through the profile templates for proper shaping (so dct:publisher gets its foaf:Agent wrapper, etc.) — the shaping fix from #2490 #5 is preserved. Regression guard: forced_package_field_survives_formula_overwrite (tests/test_profile.rs). Constructs a spec with a `title` formula that would set "Formula Wins", combined with `package.title: {value: "Forced Title", force: true}`. The output must carry "Forced Title" — confirming force beats formula. Verification: * cargo test cmd::profile:: → 127 unit tests pass. * cargo test --test tests test_profile:: → 47 integration tests pass (46 prior + 1 new regression guard). * cargo +nightly fmt applied. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(profile): expand forced CKAN fields through alias mappings (roborev #2493) Follow-up regression to #2491: the force-skip in merge_formula_results only checked the exact CKAN field name. Aliases that project to the same target pointer (e.g. `package.author` and `package.publisher` both → `/dcat/dct:publisher`) bypassed the check — a formula writing `publisher` could still overwrite a forced `author` value. Fix: after the first pass collects forced (ckan_ptr, target_ptr) pairs, walk profile.field_mappings and add every CKAN field whose target appears in the forced target set to the forced_pkg / forced_res field-name set. So forcing `package.author` now also locks `package.publisher` (and any other alias keys for the same target). Alias pairs covered by this fix in DCAT-US v3: * author / publisher → dct:publisher * landing_page / url → dcat:landingPage * data_dictionary / describedBy → dcat:describedBy * accrualPeriodicity / frequency / update_frequency → dct:accrualPeriodicity * dcat-us:accessLevel / access_level → dcat-us:accessLevel * accessRights / access_rights → dct:accessRights * scopeNote / scope_note → skos:scopeNote * liabilityStatement / liability_statement → dcat-us:liabilityStatement * inSeries / in_series → dcat:inSeries * versionNotes / version_notes → dcat:versionNotes * license / license_id → distribution.dct:license * modified / last_modified → distribution.dct:modified Regression guards (tests/test_profile.rs): * forced_author_locks_publisher_alias — forces package.author, formula targets `publisher`, asserts foaf:name is "Forced Author". * forced_license_id_locks_license_alias — forces resource.license_id to cc-by, formula targets `license` with cc-by-sa, asserts the CC-BY 4.0 IRI (not CC-BY-SA) lands on Distribution. Verification: * cargo test cmd::profile:: → 127 unit tests pass. * cargo test --test tests test_profile:: → 49 integration tests pass (47 prior + 2 new alias guards). * cargo +nightly fmt applied. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address review: 9 Copilot suggestions on PR #3908 Apply all 9 unresolved inline review comments. Each was verified against the current code before action. 1. docs/help/profile.md (truncated --initial-context help) Reformatted the USAGE block in src/cmd/profile.rs so the description survives markdown-table generation: flattened the nested bullet list into a single paragraph and added a pointer to dcat-init-context.README.md for the full example. 2. tests/resources/profile/dcat-init-context.README.md Updated the "How package / resource force flags route to DCAT" section to reference the active profile's `field_mappings:` table + `ProfileSpec::translate_ckan_ptr` instead of the deleted src/cmd/profile/ckan_to_dcat.rs module. 3. src/cmd/profile/profile_spec.rs (load-time validation claim) Moved `projection::dry_compile` inside `load()` so the doc claim on `EMBEDDED` is now accurate: every template parses through minijinja at profile-load time, surfacing typos before stats/frequency/formulas run. Dropped the redundant dry_compile call from profile.rs::run. 4. profile3-handoff.md (hardcoded absolute path) Removed the `/Users/joelnatividad/.claude/plans/...` reference to the original plan file; the handoff now describes the engine without pointing at a path that doesn't exist for other contributors. 5. resources/profiles/croissant.yaml (misplaced key) Removed the no-op `strippable_curie_prefixes: []` from the `discovery_merge:` block — that key lives under `validation:` per the schema; keeping it here was misleading. 6. src/cmd/profile.rs (dead `merge_discovered` + tests) Deleted the orphaned legacy `merge_discovered` function (the orchestrator now uses `discovery_merge::merge` exclusively) and the 9 in-file tests that exercised it. Coverage is preserved by the unit tests in src/cmd/profile/discovery_merge.rs and the new integration tests in tests/test_profile.rs (e.g. `catalog_mode_merges_discovered_into_inner_dataset_not_envelope`). Net −168 LOC. 7-8. src/cmd/profile.rs (stale `ckan_to_dcat` doc comments) Updated two doc comments (`apply_force_overrides` doc + the force-collection comment in `run()`) so future readers find `field_mappings:` + `ProfileSpec::translate_ckan_ptr` instead of being pointed at the deleted module. 9. resources/dcat-us-v3/README.md (wrong test path) The pin-guard test lives at tests/test_profile.rs::dcat_us_v3_bundle_pin_manifest_matches_files, not the non-existent tests/test_dcat_us_bundle_pin.rs. Updated both the prose reference and the `cargo test` invocation. Verification: * cargo build --bin qsv,qsvmcp,qsvlite,qsvdp — all 4 clean. * cargo test cmd::profile:: → 117 unit tests pass (was 127; the 10 deleted merge_discovered tests are obsolete). * cargo test --test tests test_profile:: → 49 integration tests pass (unchanged). * cargo +nightly fmt applied. * docs/help/profile.md regenerated via --generate-help-md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address roborev #2495: extend dry_compile + restore IRI escape coverage Two findings from the post-fix re-review of d78d34c. Medium (src/cmd/profile/projection.rs:dry_compile): The previous load-time validation only checked emit_when guards on dataset fields, leaving distribution and catalog field guards vulnerable. A typo in a distribution emit_when would compile-pass load() but silently render-fail at projection time (render_truthy treats the error as false, dropping the field). Fix: extend dry_compile to syntax-check emit_when in both distribution and catalog field loops. New guards: * dry_compile_rejects_malformed_distribution_emit_when * dry_compile_rejects_malformed_catalog_emit_when Low (src/cmd/profile/discovery_merge.rs): The removed merge_discovered tests carried regression coverage for forced discovered keys containing `/` or `~` (full-IRI JSON-LD properties like http://purl.org/dc/terms/title). Restore that coverage on discovery_merge's internal escape_token path. New tests: * forced_full_iri_key_blocks_matching_discovered_key — forced path with each `/` escaped to `~1` must block the matching discovered IRI key. * forced_full_iri_key_does_not_block_unrelated_discovered_key — escaping must not over-match; unrelated discovered keys (e.g. dct:identifier) still flow through. * escape_token_handles_rfc6901_round_trip — direct check of the `~`-before-`/` escape order on plain, slash, tilde, mixed, and full-IRI inputs. Verification: * cargo test cmd::profile:: → 122 unit tests pass (117 prior + 5 new). * cargo test --test tests test_profile:: → 49 integration tests pass. * cargo +nightly fmt applied. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
from BurntSushi/xsv#223