feat(profile): Croissant mlcroissant validator + generic external validator framework#3911

jqnatividad · 2026-05-27T14:25:34Z

Summary

The Croissant 1.0 spec ships no JSON Schema — its canonical validator is the mlcroissant Python package. Until now resources/profiles/croissant.yaml had validation.enabled: false because the engine had no way to run anything but the bundled GSA JSON Schema validator.

This PR adds a generic out-of-process validator path so any profile can declare a validation.external block pointing at a binary on PATH, wires Croissant up to mlcroissant by default, and sets the same pattern up for a future DCAT-AP SHACL backend (pyshacl) with no further engine work.

Engine

New ExternalValidator struct on Validation: command, args (with {file} token substitution), default_severity, label, install_hint. Independent of Validation::enabled (JSON Schema gate) — both can coexist.
New src/cmd/profile/external_validate.rs (~300 LOC + 13 tests): writes block to a .json tempfile, spawns the configured command, captures stderr/stdout. Missing binary → single Severity::Info warning including the configured install_hint so users see the install path at the moment they need it. Non-zero exit → one warning per non-empty stderr line (stdout fallback when stderr empty). Exit 0 → empty Vec.
Orchestrator (src/cmd/profile.rs) wires external validation next to dcat_validate::validate under --validate-dcat. Strict mode filters out Info-severity findings — only actual violations can trip --strict-dcat, never missing-binary notices.

Profile YAML

resources/profiles/croissant.yaml opts in with command: mlcroissant, args: [validate, --jsonld, {file}], and an install_hint carrying the pip command + project URL.

End-to-end verified. On a system without mlcroissant installed:

$ qsv profile <input> --profile croissant --validate-dcat -o out.json
$ jq '.dcat_warnings[] | select(.field=="external_validate")' out.json
{
  "field": "external_validate",
  "severity": "info",
  "message": "\`mlcroissant\` not installed; skipped croissant validation. Install: pip install mlcroissant (https://github.com/mlcommons/croissant/tree/main/python/mlcroissant)"
}

Test plan

cargo test --bin qsv -F profile,feature_capable cmd::profile:: — 154 unit tests pass (13 new in external_validate::tests)
cargo test --test tests -F profile,feature_capable -- test_profile:: — 49 integration tests pass, no regressions in dcat_us_v3_golden_parity_*
All 4 binaries build clean: qsv (-F all_features), qsvmcp (-F qsvmcp), qsvlite (-F lite), qsvdp (-F datapusher_plus)
cargo clippy --bin qsv -F profile,feature_capable — no new warnings
cargo +nightly fmt applied
python3 scripts/docs-drift-check.py — no drift
resources/profiles/README.md updated with new "Validation" section + external config table + Croissant example
Smoke test on tests/resources/profile/golden/nyc-311-subset.csv confirms install hint surfaces with --validate-dcat when binary is absent

Notes

The same external_validate module is ready to host a pyshacl config for DCAT-AP v3 in a follow-up PR — no further engine work needed, just YAML wiring + a small install_hint.

🤖 Generated with Claude Code

The Croissant 1.0 spec ships no JSON Schema — its canonical validator is the `mlcroissant` Python package. Until now, `resources/profiles/croissant.yaml` had `validation.enabled: false` because the engine had no way to run anything but the bundled GSA JSON-Schema validator. This commit adds a generic out-of-process validator path so any profile can declare a `validation.external` block pointing at a binary on PATH, wires Croissant up to mlcroissant by default, and sets the same pattern up for a future DCAT-AP SHACL backend (pyshacl) without further engine work. Engine - New `ExternalValidator` struct on `Validation` with `command`, `args`, `default_severity`, `label`, `install_hint`. Independent of `Validation::enabled` (JSON Schema gate) — both can coexist. - New `src/cmd/profile/external_validate.rs`: `validate(profile, block) -> Vec<ProjectionWarning>`. Writes `block` to a `.json` tempfile, spawns `command` with `{file}` arg substitution (or appends path when no token), captures stderr/stdout. Missing binary → single `Severity::Info` warning including `install_hint` so users see the install path at the moment they need it. Spawn errors → `Severity::Recommended`. Non-zero exit → one warning per non-empty stderr line (or stdout when stderr is empty); defensive fallback when output is empty. Exit 0 → empty Vec. - `src/cmd/profile.rs` orchestrator wires external validation next to `dcat_validate::validate` under `--validate-dcat`. Strict mode filters out `Info`-severity findings (those are the missing-binary notices, not real violations) — only actual findings can trip `--strict-dcat`. Profile YAML - `resources/profiles/croissant.yaml` opts in: `command: mlcroissant`, `args: ["validate", "--jsonld", "{file}"]`, `install_hint: "pip install mlcroissant (https://github.com/mlcommons/croissant/tree/main/python/mlcroissant)"`. Docs - `resources/profiles/README.md` gains a "Validation" section documenting both validator paths + a config table for the `external` block + the Croissant example. Tests - 13 unit tests in `external_validate::tests` covering: missing binary path (with + without install_hint), label override, successful exit, multi-line stderr findings, stdout fallback, exit-with-no-output diagnostic, `{file}` substitution, args-append-when-no-token, severity parsing, and config-absent no-op. Unix-only paths gate on `cfg!(unix)`. - `embedded_croissant_parses_and_dry_compiles` extended to assert the external validator config is present + reachable. End-to-end verified: on a system without `mlcroissant` installed, `qsv profile <input> --profile croissant --validate-dcat` emits exactly one `Severity::Info` warning that surfaces the install command + URL. Addresses the final §6 follow-up from PR #3908's handoff (Croissant `mlcroissant` Python validator integration). The same `external_validate` module is ready to host a `pyshacl` config for DCAT-AP v3 in a future PR. All 154 profile unit tests + 49 integration tests pass. Clippy + docs-drift-check clean. All four binaries (qsv/qsvmcp/qsvlite/qsvdp) build green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codacy-production · 2026-05-27T14:28:03Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 complexity

Metric Results

Complexity 0

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

Copilot

Pull request overview

Adds a generic “external validator” hook to the YAML-driven qsv profile engine so profiles can invoke canonical, out-of-process validators (starting with Croissant’s mlcroissant) alongside/independent of the built-in JSON Schema validation flow.

Changes:

Introduces validation.external in profile specs, plus a new external_validate runner that writes JSON-LD to a tempfile and spawns a configured command with {file} substitution.
Wires external validation into qsv profile --validate-dcat, integrating results into dcat_warnings and enforcing failures under --strict-dcat (excluding Info severity).
Updates Croissant profile YAML and profile authoring docs to enable/configure mlcroissant by default.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
src/cmd/profile/profile_spec.rs	Adds `validation.external` configuration and tests Croissant’s embedded profile wiring.
src/cmd/profile/external_validate.rs	New external validator execution module with unit tests.
src/cmd/profile.rs	Runs external validation during `--validate-dcat` and enforces `--strict-dcat` behavior.
resources/profiles/README.md	Documents JSON Schema vs external validation and external config keys.
resources/profiles/croissant.yaml	Enables Croissant external validation via `mlcroissant` with install hint.

Three Copilot findings on PR #3911: 1. SECURITY (`src/cmd/profile.rs:507`) — `--profile <path>` loads arbitrary YAML from disk, and that YAML controls `validation.external.command`. Under `--validate-dcat` the engine would spawn whatever the file said, enabling arbitrary command execution if a user runs an untrusted profile. Added a trust gate: - New `ProfileSource` enum (`Embedded` / `FilePath`) on `ProfileSpec`, stamped by `load()` after parsing. - New CLI flag `--allow-external-validator` (default off). - Orchestrator runs `validation.external` only when `profile.source.is_embedded() || args.flag_allow_external_validator`. Otherwise emits a `Severity::Recommended` warning naming the would-be validator and pointing at the opt-in flag. Bundled profiles (dcat-us-v3, dcat-ap-v3, croissant) keep running their declared validators frictionlessly because the YAMLs are vetted at qsv release time; only file-loaded profiles require the explicit opt-in. 2. FIELD STABILITY (`external_validate.rs:128`) — `ProjectionWarning.field` is used as a JSON-LD key / pointer elsewhere, but external_validate was building it as `external_validate/{label}` where `label` is user-configurable. That broke downstream filtering and looked like a JSON pointer. Findings now use a stable `field: "external_validate"` and the label moves into the message as `<label>: <line>`. Users still trace findings back to source via the message; downstream filters can target a single string. 3. OS PATH FIDELITY (`external_validate.rs:71`) — the tempfile path was being run through `to_string_lossy()` before substitution. Non-UTF-8 paths (legal on Unix) would get mangled and the spawned validator would then complain that the file doesn't exist. Tempfile path is kept as `OsString` end-to-end. New `substitute_file_token(&str, &OsStr) -> OsString` helper stitches the OsStr path between UTF-8 string parts (args come from YAML so they're always UTF-8; only the path needs OsStr fidelity). `resolve_args` returns `Vec<OsString>` which `Command::args` accepts directly. Tests - `resolve_args_preserves_non_utf8_path_bytes` (Unix-gated, uses `OsStrExt::from_bytes` with bytes `\xFF` and `\xFE`) confirms a path with invalid UTF-8 sequences round-trips byte-for-byte through substitution. - `non_zero_exit_surfaces_one_warning_per_stderr_line` and related tests updated to assert the stable field + new `<label>: <line>` message format. End-to-end verified with a hand-crafted `/tmp/evil.yaml`: - Embedded croissant → spawns; missing-binary Info warning. - File-loaded evil.yaml without flag → spawn refused; Recommended warning surfaces the would-be command + opt-in instructions. - File-loaded + `--allow-external-validator` → spawn proceeds; `--strict-dcat` honors findings as expected. All 155 profile unit tests + 49 integration tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Roborev #2509 flagged that the trust gate added in 2f02c1c had no automated coverage — the security-sensitive orchestrator path wasn't exercised by any test. Adds 4 integration tests in tests/test_profile.rs: - external_validator_gated_for_file_loaded_profile_without_flag: loads a sh-based external validator from a file, runs profile with --validate-dcat only, asserts the gate warning is present AND no SPAWNED- marker appears anywhere in warnings (proving the command was NOT spawned). - external_validator_runs_for_file_loaded_profile_with_flag: same fixture + --allow-external-validator. Asserts the gate warning is absent AND a finding with the stable field "external_validate" and the new "fake-validator: SPAWNED-fake-validator" message format is surfaced (proving the spawn happened). - external_validator_strict_dcat_fails_on_file_loaded_findings: flag + --strict-dcat. Asserts the command fails with the "external validator finding(s)" summary in stderr. - external_validator_embedded_profile_skips_gate_warning: uses the embedded croissant profile. Asserts the file-loaded gate warning does NOT appear, regardless of whether mlcroissant itself is installed on the CI runner (Info-severity missing-binary warning is acceptable; the gate warning is not). Three of the four tests gate on `cfg(unix)` because they spawn `sh`; the embedded-profile test is portable. All 53 profile integration tests + 155 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e framework (#3912) Closes the final §6 follow-up from profile3-handoff.md. DCAT-AP v3 ships SHACL constraints upstream, not JSON Schema, so qsv's built-in JSON-Schema validator can't validate the emitted block. This commit generalizes the `validation.external` framework introduced in #3911 to support multi-file validators and wires up the canonical Python SHACL engine, `pyshacl`, as the DCAT-AP v3 reference validator. Framework generalization - New `ExternalValidatorResource` struct + `resources: Vec<...>` field on `ExternalValidator`. Each entry declares a logical embedded-resource name resolved against a compile-time `EMBEDDED_RESOURCES` table (`include_str!`-bundled), a token name used as `{<name>}` in args, and an optional file suffix. - New `EMBEDDED_RESOURCES` table in `src/cmd/profile/external_validate.rs` with one entry today: `dcat-ap-v3-shacl-shapes` mapping to the vendored Turtle file. Custom YAML profiles can reference any name listed there but cannot register new ones (qsv-release-time decision). - `resolve_args` / `substitute_tokens` rewritten for multi-token substitution. `{file}` keeps its "append if absent" fallback; named tokens (`{shapes}`, `{schema}`, etc.) substitute in place; unknown `{tokens}` pass through verbatim so a validator that genuinely wants `{foo}` in its CLI gets it. The OsString path-fidelity guarantee from #3911 is preserved. - Validation: reserved `name: "file"` and unknown `embedded` values are rejected at spawn time with `Severity::Required` warnings so a misconfigured profile fails fast. Vendored SHACL bundle - `resources/dcat-ap-v3/shacl/dcat-ap-SHACL.ttl` (176KB, 2321 lines): canonical SHACL shapes from SEMICeu DCAT-AP v3.0.0, fetched verbatim from https://github.com/SEMICeu/DCAT-AP/blob/master/releases/3.0.0/shacl/dcat-ap-SHACL.ttl - `resources/dcat-ap-v3/shacl/README.md`: source + re-vendor procedure, similar to the GSA bundle's documentation. Profile wiring - `resources/profiles/dcat-ap-v3.yaml` declares the pyshacl external validator with `args: [-s, {shapes}, -sf, turtle, -df, json-ld, -f, human, {file}]`, references the embedded SHACL bundle via `resources: [{name: shapes, embedded: dcat-ap-v3-shacl-shapes, suffix: .ttl}]`, and carries an `install_hint` so users see `pip install pyshacl` exactly when pyshacl is missing. Docs - `resources/profiles/README.md`: config table extended with the `resources` field; full DCAT-AP/pyshacl example added; instructions for adding new `EMBEDDED_RESOURCES` entries. Tests (+7 new in external_validate, embedded smoke extended) - resolve_args_substitutes_named_extras - embedded_resources_table_includes_dcat_ap_v3_shapes (sanity- checks the bundle is real Turtle, guarding against bad re-vendor steps) - lookup_embedded_returns_none_for_unknown - resource_with_reserved_name_file_is_rejected - resource_with_unknown_embedded_is_rejected - resource_tempfile_is_materialized_with_correct_suffix - embedded_dcat_ap_v3_parses_and_dry_compiles extended End-to-end verified on a host without pyshacl: profile run with `--validate-dcat --profile dcat-ap-v3` emits one Severity::Info warning carrying the install hint and the projection still ships. All 161 profile unit tests + 53 integration tests pass. Clippy + docs-drift-check clean. All four binaries build green. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Nine clippy warnings have been visible in every `cargo +nightly clippy -F all_features` run since the YAML projection engine landed. None of them affect behavior, but they crowd the lint output and obscure new issues. This commit clears all nine in two files: - `escape_json_pointer_token` was a private helper with no non-test callers; the only reference was a self-referential test in the same file. Both deleted. RFC 6901 escape coverage lives in `discovery_merge::tests::escape_token_handles_rfc6901_round_trip`. - `ProjectionMode::Catalog` is currently only constructed in tests — the orchestrator calls `wrap_in_catalog_envelope` after `discovery_merge` for `--catalog` mode (Roborev #2490 finding #1), bypassing the in-`project` catalog match arm. Added `#[allow(dead_code)]` on the variant with an explanatory doc-comment so the API surface stays intact and the never- constructed lint stops firing in non-test builds. - `coerce_json_or_string` in projection.rs had a collapsible_if pattern flagged by clippy. Collapsed via Rust 2024 let-chain: `if (open/close braces) || (open/close brackets) && let Ok(v) = serde_json::from_str(...)`. Saves one indentation level. - The trust-gate orchestrator path in `src/cmd/profile.rs` used `is_some() && ... .unwrap()` (introduced in #3911); refactored to `if let Some(cfg) = ... && !allow_external` via let-chain. - `is_uuid_like` gained an explicit `assert!(bytes.len() > 23)` before its fixed-offset byte index reads, satisfying the repo-enabled `clippy::missing_asserts_for_indexing` lint and letting the compiler elide bounds checks. The underlying `len() == 36` guard already made the access safe; the assert documents that contract. - Four `doc_lazy_continuation` warnings on `is_uuid_like`'s doc comment were fixed by inserting a blank line between the list and the descriptive paragraph below it. `cargo +nightly clippy --bin qsv -F all_features -- -W clippy::perf` now exits clean except for the harmless "src/main.rs found in multiple build targets" notice (Cargo workspace artifact from qsv/qsvmcp sharing main.rs). All 160 profile unit tests + 53 integration tests pass. Docs drift check clean. Net change: -9 LOC across 2 files. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jqnatividad requested a review from Copilot May 27, 2026 14:26

Copilot started reviewing on behalf of jqnatividad May 27, 2026 14:27 View session

Copilot AI reviewed May 27, 2026

View reviewed changes

Comment thread src/cmd/profile.rs Outdated

Comment thread src/cmd/profile/external_validate.rs Outdated

Comment thread src/cmd/profile/external_validate.rs

jqnatividad and others added 2 commits May 27, 2026 11:14

jqnatividad merged commit 10a97da into master May 27, 2026
18 of 20 checks passed

jqnatividad deleted the croissant-mlcroissant-validator branch May 27, 2026 16:41

jqnatividad mentioned this pull request May 27, 2026

feat(profile): DCAT-AP v3 SHACL validator via pyshacl + multi-resource framework #3912

Merged

8 tasks

jqnatividad mentioned this pull request May 27, 2026

chore(profile): sweep pre-existing clippy warnings #3913

Merged

6 tasks

BrewTestBot mentioned this pull request Jun 8, 2026

qsv 21.0.0 Homebrew/homebrew-core#286889

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(profile): Croissant mlcroissant validator + generic external validator framework#3911