feat(profile): Croissant mlcroissant validator + generic external validator framework#3911
Merged
Merged
Conversation
The Croissant 1.0 spec ships no JSON Schema — its canonical
validator is the `mlcroissant` Python package. Until now,
`resources/profiles/croissant.yaml` had `validation.enabled: false`
because the engine had no way to run anything but the bundled
GSA JSON-Schema validator.
This commit adds a generic out-of-process validator path so any
profile can declare a `validation.external` block pointing at a
binary on PATH, wires Croissant up to mlcroissant by default, and
sets the same pattern up for a future DCAT-AP SHACL backend
(pyshacl) without further engine work.
Engine
- New `ExternalValidator` struct on `Validation` with `command`,
`args`, `default_severity`, `label`, `install_hint`. Independent
of `Validation::enabled` (JSON Schema gate) — both can coexist.
- New `src/cmd/profile/external_validate.rs`:
`validate(profile, block) -> Vec<ProjectionWarning>`. Writes
`block` to a `.json` tempfile, spawns `command` with `{file}`
arg substitution (or appends path when no token), captures
stderr/stdout. Missing binary → single `Severity::Info` warning
including `install_hint` so users see the install path at the
moment they need it. Spawn errors → `Severity::Recommended`.
Non-zero exit → one warning per non-empty stderr line (or
stdout when stderr is empty); defensive fallback when output
is empty. Exit 0 → empty Vec.
- `src/cmd/profile.rs` orchestrator wires external validation
next to `dcat_validate::validate` under `--validate-dcat`.
Strict mode filters out `Info`-severity findings (those are
the missing-binary notices, not real violations) — only actual
findings can trip `--strict-dcat`.
Profile YAML
- `resources/profiles/croissant.yaml` opts in:
`command: mlcroissant`,
`args: ["validate", "--jsonld", "{file}"]`,
`install_hint: "pip install mlcroissant
(https://github.com/mlcommons/croissant/tree/main/python/mlcroissant)"`.
Docs
- `resources/profiles/README.md` gains a "Validation" section
documenting both validator paths + a config table for the
`external` block + the Croissant example.
Tests
- 13 unit tests in `external_validate::tests` covering: missing
binary path (with + without install_hint), label override,
successful exit, multi-line stderr findings, stdout fallback,
exit-with-no-output diagnostic, `{file}` substitution,
args-append-when-no-token, severity parsing, and config-absent
no-op. Unix-only paths gate on `cfg!(unix)`.
- `embedded_croissant_parses_and_dry_compiles` extended to
assert the external validator config is present + reachable.
End-to-end verified: on a system without `mlcroissant` installed,
`qsv profile <input> --profile croissant --validate-dcat` emits
exactly one `Severity::Info` warning that surfaces the install
command + URL.
Addresses the final §6 follow-up from PR #3908's handoff (Croissant
`mlcroissant` Python validator integration). The same
`external_validate` module is ready to host a `pyshacl` config for
DCAT-AP v3 in a future PR.
All 154 profile unit tests + 49 integration tests pass. Clippy +
docs-drift-check clean. All four binaries (qsv/qsvmcp/qsvlite/qsvdp)
build green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 0 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Contributor
There was a problem hiding this comment.
Pull request overview
Adds a generic “external validator” hook to the YAML-driven qsv profile engine so profiles can invoke canonical, out-of-process validators (starting with Croissant’s mlcroissant) alongside/independent of the built-in JSON Schema validation flow.
Changes:
- Introduces
validation.externalin profile specs, plus a newexternal_validaterunner that writes JSON-LD to a tempfile and spawns a configured command with{file}substitution. - Wires external validation into
qsv profile --validate-dcat, integrating results intodcat_warningsand enforcing failures under--strict-dcat(excludingInfoseverity). - Updates Croissant profile YAML and profile authoring docs to enable/configure
mlcroissantby default.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/cmd/profile/profile_spec.rs | Adds validation.external configuration and tests Croissant’s embedded profile wiring. |
| src/cmd/profile/external_validate.rs | New external validator execution module with unit tests. |
| src/cmd/profile.rs | Runs external validation during --validate-dcat and enforces --strict-dcat behavior. |
| resources/profiles/README.md | Documents JSON Schema vs external validation and external config keys. |
| resources/profiles/croissant.yaml | Enables Croissant external validation via mlcroissant with install hint. |
Three Copilot findings on PR #3911: 1. SECURITY (`src/cmd/profile.rs:507`) — `--profile <path>` loads arbitrary YAML from disk, and that YAML controls `validation.external.command`. Under `--validate-dcat` the engine would spawn whatever the file said, enabling arbitrary command execution if a user runs an untrusted profile. Added a trust gate: - New `ProfileSource` enum (`Embedded` / `FilePath`) on `ProfileSpec`, stamped by `load()` after parsing. - New CLI flag `--allow-external-validator` (default off). - Orchestrator runs `validation.external` only when `profile.source.is_embedded() || args.flag_allow_external_validator`. Otherwise emits a `Severity::Recommended` warning naming the would-be validator and pointing at the opt-in flag. Bundled profiles (dcat-us-v3, dcat-ap-v3, croissant) keep running their declared validators frictionlessly because the YAMLs are vetted at qsv release time; only file-loaded profiles require the explicit opt-in. 2. FIELD STABILITY (`external_validate.rs:128`) — `ProjectionWarning.field` is used as a JSON-LD key / pointer elsewhere, but external_validate was building it as `external_validate/{label}` where `label` is user-configurable. That broke downstream filtering and looked like a JSON pointer. Findings now use a stable `field: "external_validate"` and the label moves into the message as `<label>: <line>`. Users still trace findings back to source via the message; downstream filters can target a single string. 3. OS PATH FIDELITY (`external_validate.rs:71`) — the tempfile path was being run through `to_string_lossy()` before substitution. Non-UTF-8 paths (legal on Unix) would get mangled and the spawned validator would then complain that the file doesn't exist. Tempfile path is kept as `OsString` end-to-end. New `substitute_file_token(&str, &OsStr) -> OsString` helper stitches the OsStr path between UTF-8 string parts (args come from YAML so they're always UTF-8; only the path needs OsStr fidelity). `resolve_args` returns `Vec<OsString>` which `Command::args` accepts directly. Tests - `resolve_args_preserves_non_utf8_path_bytes` (Unix-gated, uses `OsStrExt::from_bytes` with bytes `\xFF` and `\xFE`) confirms a path with invalid UTF-8 sequences round-trips byte-for-byte through substitution. - `non_zero_exit_surfaces_one_warning_per_stderr_line` and related tests updated to assert the stable field + new `<label>: <line>` message format. End-to-end verified with a hand-crafted `/tmp/evil.yaml`: - Embedded croissant → spawns; missing-binary Info warning. - File-loaded evil.yaml without flag → spawn refused; Recommended warning surfaces the would-be command + opt-in instructions. - File-loaded + `--allow-external-validator` → spawn proceeds; `--strict-dcat` honors findings as expected. All 155 profile unit tests + 49 integration tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Roborev #2509 flagged that the trust gate added in 2f02c1c had no automated coverage — the security-sensitive orchestrator path wasn't exercised by any test. Adds 4 integration tests in tests/test_profile.rs: - external_validator_gated_for_file_loaded_profile_without_flag: loads a sh-based external validator from a file, runs profile with --validate-dcat only, asserts the gate warning is present AND no SPAWNED- marker appears anywhere in warnings (proving the command was NOT spawned). - external_validator_runs_for_file_loaded_profile_with_flag: same fixture + --allow-external-validator. Asserts the gate warning is absent AND a finding with the stable field "external_validate" and the new "fake-validator: SPAWNED-fake-validator" message format is surfaced (proving the spawn happened). - external_validator_strict_dcat_fails_on_file_loaded_findings: flag + --strict-dcat. Asserts the command fails with the "external validator finding(s)" summary in stderr. - external_validator_embedded_profile_skips_gate_warning: uses the embedded croissant profile. Asserts the file-loaded gate warning does NOT appear, regardless of whether mlcroissant itself is installed on the CI runner (Info-severity missing-binary warning is acceptable; the gate warning is not). Three of the four tests gate on `cfg(unix)` because they spawn `sh`; the embedded-profile test is portable. All 53 profile integration tests + 155 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8 tasks
jqnatividad
added a commit
that referenced
this pull request
May 27, 2026
…e framework (#3912) Closes the final §6 follow-up from profile3-handoff.md. DCAT-AP v3 ships SHACL constraints upstream, not JSON Schema, so qsv's built-in JSON-Schema validator can't validate the emitted block. This commit generalizes the `validation.external` framework introduced in #3911 to support multi-file validators and wires up the canonical Python SHACL engine, `pyshacl`, as the DCAT-AP v3 reference validator. Framework generalization - New `ExternalValidatorResource` struct + `resources: Vec<...>` field on `ExternalValidator`. Each entry declares a logical embedded-resource name resolved against a compile-time `EMBEDDED_RESOURCES` table (`include_str!`-bundled), a token name used as `{<name>}` in args, and an optional file suffix. - New `EMBEDDED_RESOURCES` table in `src/cmd/profile/external_validate.rs` with one entry today: `dcat-ap-v3-shacl-shapes` mapping to the vendored Turtle file. Custom YAML profiles can reference any name listed there but cannot register new ones (qsv-release-time decision). - `resolve_args` / `substitute_tokens` rewritten for multi-token substitution. `{file}` keeps its "append if absent" fallback; named tokens (`{shapes}`, `{schema}`, etc.) substitute in place; unknown `{tokens}` pass through verbatim so a validator that genuinely wants `{foo}` in its CLI gets it. The OsString path-fidelity guarantee from #3911 is preserved. - Validation: reserved `name: "file"` and unknown `embedded` values are rejected at spawn time with `Severity::Required` warnings so a misconfigured profile fails fast. Vendored SHACL bundle - `resources/dcat-ap-v3/shacl/dcat-ap-SHACL.ttl` (176KB, 2321 lines): canonical SHACL shapes from SEMICeu DCAT-AP v3.0.0, fetched verbatim from https://github.com/SEMICeu/DCAT-AP/blob/master/releases/3.0.0/shacl/dcat-ap-SHACL.ttl - `resources/dcat-ap-v3/shacl/README.md`: source + re-vendor procedure, similar to the GSA bundle's documentation. Profile wiring - `resources/profiles/dcat-ap-v3.yaml` declares the pyshacl external validator with `args: [-s, {shapes}, -sf, turtle, -df, json-ld, -f, human, {file}]`, references the embedded SHACL bundle via `resources: [{name: shapes, embedded: dcat-ap-v3-shacl-shapes, suffix: .ttl}]`, and carries an `install_hint` so users see `pip install pyshacl` exactly when pyshacl is missing. Docs - `resources/profiles/README.md`: config table extended with the `resources` field; full DCAT-AP/pyshacl example added; instructions for adding new `EMBEDDED_RESOURCES` entries. Tests (+7 new in external_validate, embedded smoke extended) - resolve_args_substitutes_named_extras - embedded_resources_table_includes_dcat_ap_v3_shapes (sanity- checks the bundle is real Turtle, guarding against bad re-vendor steps) - lookup_embedded_returns_none_for_unknown - resource_with_reserved_name_file_is_rejected - resource_with_unknown_embedded_is_rejected - resource_tempfile_is_materialized_with_correct_suffix - embedded_dcat_ap_v3_parses_and_dry_compiles extended End-to-end verified on a host without pyshacl: profile run with `--validate-dcat --profile dcat-ap-v3` emits one Severity::Info warning carrying the install hint and the projection still ships. All 161 profile unit tests + 53 integration tests pass. Clippy + docs-drift-check clean. All four binaries build green. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6 tasks
jqnatividad
added a commit
that referenced
this pull request
May 27, 2026
Nine clippy warnings have been visible in every `cargo +nightly clippy -F all_features` run since the YAML projection engine landed. None of them affect behavior, but they crowd the lint output and obscure new issues. This commit clears all nine in two files: - `escape_json_pointer_token` was a private helper with no non-test callers; the only reference was a self-referential test in the same file. Both deleted. RFC 6901 escape coverage lives in `discovery_merge::tests::escape_token_handles_rfc6901_round_trip`. - `ProjectionMode::Catalog` is currently only constructed in tests — the orchestrator calls `wrap_in_catalog_envelope` after `discovery_merge` for `--catalog` mode (Roborev #2490 finding #1), bypassing the in-`project` catalog match arm. Added `#[allow(dead_code)]` on the variant with an explanatory doc-comment so the API surface stays intact and the never- constructed lint stops firing in non-test builds. - `coerce_json_or_string` in projection.rs had a collapsible_if pattern flagged by clippy. Collapsed via Rust 2024 let-chain: `if (open/close braces) || (open/close brackets) && let Ok(v) = serde_json::from_str(...)`. Saves one indentation level. - The trust-gate orchestrator path in `src/cmd/profile.rs` used `is_some() && ... .unwrap()` (introduced in #3911); refactored to `if let Some(cfg) = ... && !allow_external` via let-chain. - `is_uuid_like` gained an explicit `assert!(bytes.len() > 23)` before its fixed-offset byte index reads, satisfying the repo-enabled `clippy::missing_asserts_for_indexing` lint and letting the compiler elide bounds checks. The underlying `len() == 36` guard already made the access safe; the assert documents that contract. - Four `doc_lazy_continuation` warnings on `is_uuid_like`'s doc comment were fixed by inserting a blank line between the list and the descriptive paragraph below it. `cargo +nightly clippy --bin qsv -F all_features -- -W clippy::perf` now exits clean except for the harmless "src/main.rs found in multiple build targets" notice (Cargo workspace artifact from qsv/qsvmcp sharing main.rs). All 160 profile unit tests + 53 integration tests pass. Docs drift check clean. Net change: -9 LOC across 2 files. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The Croissant 1.0 spec ships no JSON Schema — its canonical validator is the
mlcroissantPython package. Until nowresources/profiles/croissant.yamlhadvalidation.enabled: falsebecause the engine had no way to run anything but the bundled GSA JSON Schema validator.This PR adds a generic out-of-process validator path so any profile can declare a
validation.externalblock pointing at a binary on PATH, wires Croissant up tomlcroissantby default, and sets the same pattern up for a future DCAT-AP SHACL backend (pyshacl) with no further engine work.Engine
ExternalValidatorstruct onValidation:command,args(with{file}token substitution),default_severity,label,install_hint. Independent ofValidation::enabled(JSON Schema gate) — both can coexist.src/cmd/profile/external_validate.rs(~300 LOC + 13 tests): writes block to a.jsontempfile, spawns the configured command, captures stderr/stdout. Missing binary → singleSeverity::Infowarning including the configuredinstall_hintso users see the install path at the moment they need it. Non-zero exit → one warning per non-empty stderr line (stdout fallback when stderr empty). Exit 0 → empty Vec.src/cmd/profile.rs) wires external validation next todcat_validate::validateunder--validate-dcat. Strict mode filters outInfo-severity findings — only actual violations can trip--strict-dcat, never missing-binary notices.Profile YAML
resources/profiles/croissant.yamlopts in withcommand: mlcroissant,args: [validate, --jsonld, {file}], and aninstall_hintcarrying the pip command + project URL.End-to-end verified. On a system without
mlcroissantinstalled:Test plan
cargo test --bin qsv -F profile,feature_capable cmd::profile::— 154 unit tests pass (13 new inexternal_validate::tests)cargo test --test tests -F profile,feature_capable -- test_profile::— 49 integration tests pass, no regressions indcat_us_v3_golden_parity_*qsv(-F all_features),qsvmcp(-F qsvmcp),qsvlite(-F lite),qsvdp(-F datapusher_plus)cargo clippy --bin qsv -F profile,feature_capable— no new warningscargo +nightly fmtappliedpython3 scripts/docs-drift-check.py— no driftresources/profiles/README.mdupdated with new "Validation" section +externalconfig table + Croissant exampletests/resources/profile/golden/nyc-311-subset.csvconfirms install hint surfaces with--validate-dcatwhen binary is absentNotes
The same
external_validatemodule is ready to host apyshaclconfig for DCAT-AP v3 in a follow-up PR — no further engine work needed, just YAML wiring + a smallinstall_hint.🤖 Generated with Claude Code