Skip to content

feat(profile): Croissant mlcroissant validator + generic external validator framework#3911

Merged
jqnatividad merged 3 commits into
masterfrom
croissant-mlcroissant-validator
May 27, 2026
Merged

feat(profile): Croissant mlcroissant validator + generic external validator framework#3911
jqnatividad merged 3 commits into
masterfrom
croissant-mlcroissant-validator

Conversation

@jqnatividad

Copy link
Copy Markdown
Collaborator

Summary

The Croissant 1.0 spec ships no JSON Schema — its canonical validator is the mlcroissant Python package. Until now resources/profiles/croissant.yaml had validation.enabled: false because the engine had no way to run anything but the bundled GSA JSON Schema validator.

This PR adds a generic out-of-process validator path so any profile can declare a validation.external block pointing at a binary on PATH, wires Croissant up to mlcroissant by default, and sets the same pattern up for a future DCAT-AP SHACL backend (pyshacl) with no further engine work.

Engine

  • New ExternalValidator struct on Validation: command, args (with {file} token substitution), default_severity, label, install_hint. Independent of Validation::enabled (JSON Schema gate) — both can coexist.
  • New src/cmd/profile/external_validate.rs (~300 LOC + 13 tests): writes block to a .json tempfile, spawns the configured command, captures stderr/stdout. Missing binary → single Severity::Info warning including the configured install_hint so users see the install path at the moment they need it. Non-zero exit → one warning per non-empty stderr line (stdout fallback when stderr empty). Exit 0 → empty Vec.
  • Orchestrator (src/cmd/profile.rs) wires external validation next to dcat_validate::validate under --validate-dcat. Strict mode filters out Info-severity findings — only actual violations can trip --strict-dcat, never missing-binary notices.

Profile YAML

  • resources/profiles/croissant.yaml opts in with command: mlcroissant, args: [validate, --jsonld, {file}], and an install_hint carrying the pip command + project URL.

End-to-end verified. On a system without mlcroissant installed:

$ qsv profile <input> --profile croissant --validate-dcat -o out.json
$ jq '.dcat_warnings[] | select(.field=="external_validate")' out.json
{
  "field": "external_validate",
  "severity": "info",
  "message": "\`mlcroissant\` not installed; skipped croissant validation. Install: pip install mlcroissant (https://github.com/mlcommons/croissant/tree/main/python/mlcroissant)"
}

Test plan

  • cargo test --bin qsv -F profile,feature_capable cmd::profile::154 unit tests pass (13 new in external_validate::tests)
  • cargo test --test tests -F profile,feature_capable -- test_profile::49 integration tests pass, no regressions in dcat_us_v3_golden_parity_*
  • All 4 binaries build clean: qsv (-F all_features), qsvmcp (-F qsvmcp), qsvlite (-F lite), qsvdp (-F datapusher_plus)
  • cargo clippy --bin qsv -F profile,feature_capable — no new warnings
  • cargo +nightly fmt applied
  • python3 scripts/docs-drift-check.py — no drift
  • resources/profiles/README.md updated with new "Validation" section + external config table + Croissant example
  • Smoke test on tests/resources/profile/golden/nyc-311-subset.csv confirms install hint surfaces with --validate-dcat when binary is absent

Notes

The same external_validate module is ready to host a pyshacl config for DCAT-AP v3 in a follow-up PR — no further engine work needed, just YAML wiring + a small install_hint.

🤖 Generated with Claude Code

The Croissant 1.0 spec ships no JSON Schema — its canonical
validator is the `mlcroissant` Python package. Until now,
`resources/profiles/croissant.yaml` had `validation.enabled: false`
because the engine had no way to run anything but the bundled
GSA JSON-Schema validator.

This commit adds a generic out-of-process validator path so any
profile can declare a `validation.external` block pointing at a
binary on PATH, wires Croissant up to mlcroissant by default, and
sets the same pattern up for a future DCAT-AP SHACL backend
(pyshacl) without further engine work.

Engine
- New `ExternalValidator` struct on `Validation` with `command`,
  `args`, `default_severity`, `label`, `install_hint`. Independent
  of `Validation::enabled` (JSON Schema gate) — both can coexist.
- New `src/cmd/profile/external_validate.rs`:
  `validate(profile, block) -> Vec<ProjectionWarning>`. Writes
  `block` to a `.json` tempfile, spawns `command` with `{file}`
  arg substitution (or appends path when no token), captures
  stderr/stdout. Missing binary → single `Severity::Info` warning
  including `install_hint` so users see the install path at the
  moment they need it. Spawn errors → `Severity::Recommended`.
  Non-zero exit → one warning per non-empty stderr line (or
  stdout when stderr is empty); defensive fallback when output
  is empty. Exit 0 → empty Vec.
- `src/cmd/profile.rs` orchestrator wires external validation
  next to `dcat_validate::validate` under `--validate-dcat`.
  Strict mode filters out `Info`-severity findings (those are
  the missing-binary notices, not real violations) — only actual
  findings can trip `--strict-dcat`.

Profile YAML
- `resources/profiles/croissant.yaml` opts in:
  `command: mlcroissant`,
  `args: ["validate", "--jsonld", "{file}"]`,
  `install_hint: "pip install mlcroissant
   (https://github.com/mlcommons/croissant/tree/main/python/mlcroissant)"`.

Docs
- `resources/profiles/README.md` gains a "Validation" section
  documenting both validator paths + a config table for the
  `external` block + the Croissant example.

Tests
- 13 unit tests in `external_validate::tests` covering: missing
  binary path (with + without install_hint), label override,
  successful exit, multi-line stderr findings, stdout fallback,
  exit-with-no-output diagnostic, `{file}` substitution,
  args-append-when-no-token, severity parsing, and config-absent
  no-op. Unix-only paths gate on `cfg!(unix)`.
- `embedded_croissant_parses_and_dry_compiles` extended to
  assert the external validator config is present + reachable.

End-to-end verified: on a system without `mlcroissant` installed,
`qsv profile <input> --profile croissant --validate-dcat` emits
exactly one `Severity::Info` warning that surfaces the install
command + URL.

Addresses the final §6 follow-up from PR #3908's handoff (Croissant
`mlcroissant` Python validator integration). The same
`external_validate` module is ready to host a `pyshacl` config for
DCAT-AP v3 in a future PR.

All 154 profile unit tests + 49 integration tests pass. Clippy +
docs-drift-check clean. All four binaries (qsv/qsvmcp/qsvlite/qsvdp)
build green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codacy-production

codacy-production Bot commented May 27, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 complexity

Metric Results
Complexity 0

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a generic “external validator” hook to the YAML-driven qsv profile engine so profiles can invoke canonical, out-of-process validators (starting with Croissant’s mlcroissant) alongside/independent of the built-in JSON Schema validation flow.

Changes:

  • Introduces validation.external in profile specs, plus a new external_validate runner that writes JSON-LD to a tempfile and spawns a configured command with {file} substitution.
  • Wires external validation into qsv profile --validate-dcat, integrating results into dcat_warnings and enforcing failures under --strict-dcat (excluding Info severity).
  • Updates Croissant profile YAML and profile authoring docs to enable/configure mlcroissant by default.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/cmd/profile/profile_spec.rs Adds validation.external configuration and tests Croissant’s embedded profile wiring.
src/cmd/profile/external_validate.rs New external validator execution module with unit tests.
src/cmd/profile.rs Runs external validation during --validate-dcat and enforces --strict-dcat behavior.
resources/profiles/README.md Documents JSON Schema vs external validation and external config keys.
resources/profiles/croissant.yaml Enables Croissant external validation via mlcroissant with install hint.

Comment thread src/cmd/profile.rs Outdated
Comment thread src/cmd/profile/external_validate.rs Outdated
Comment thread src/cmd/profile/external_validate.rs
jqnatividad and others added 2 commits May 27, 2026 11:14
Three Copilot findings on PR #3911:

1. SECURITY (`src/cmd/profile.rs:507`) — `--profile <path>` loads
   arbitrary YAML from disk, and that YAML controls
   `validation.external.command`. Under `--validate-dcat` the
   engine would spawn whatever the file said, enabling arbitrary
   command execution if a user runs an untrusted profile.

   Added a trust gate:
   - New `ProfileSource` enum (`Embedded` / `FilePath`) on
     `ProfileSpec`, stamped by `load()` after parsing.
   - New CLI flag `--allow-external-validator` (default off).
   - Orchestrator runs `validation.external` only when
     `profile.source.is_embedded() || args.flag_allow_external_validator`.
     Otherwise emits a `Severity::Recommended` warning naming the
     would-be validator and pointing at the opt-in flag.

   Bundled profiles (dcat-us-v3, dcat-ap-v3, croissant) keep
   running their declared validators frictionlessly because the
   YAMLs are vetted at qsv release time; only file-loaded
   profiles require the explicit opt-in.

2. FIELD STABILITY (`external_validate.rs:128`) —
   `ProjectionWarning.field` is used as a JSON-LD key / pointer
   elsewhere, but external_validate was building it as
   `external_validate/{label}` where `label` is user-configurable.
   That broke downstream filtering and looked like a JSON pointer.

   Findings now use a stable `field: "external_validate"` and the
   label moves into the message as `<label>: <line>`. Users still
   trace findings back to source via the message; downstream
   filters can target a single string.

3. OS PATH FIDELITY (`external_validate.rs:71`) — the tempfile
   path was being run through `to_string_lossy()` before
   substitution. Non-UTF-8 paths (legal on Unix) would get
   mangled and the spawned validator would then complain that
   the file doesn't exist.

   Tempfile path is kept as `OsString` end-to-end. New
   `substitute_file_token(&str, &OsStr) -> OsString` helper
   stitches the OsStr path between UTF-8 string parts (args come
   from YAML so they're always UTF-8; only the path needs OsStr
   fidelity). `resolve_args` returns `Vec<OsString>` which
   `Command::args` accepts directly.

Tests
- `resolve_args_preserves_non_utf8_path_bytes` (Unix-gated, uses
  `OsStrExt::from_bytes` with bytes `\xFF` and `\xFE`) confirms a
  path with invalid UTF-8 sequences round-trips byte-for-byte
  through substitution.
- `non_zero_exit_surfaces_one_warning_per_stderr_line` and
  related tests updated to assert the stable field + new
  `<label>: <line>` message format.

End-to-end verified with a hand-crafted `/tmp/evil.yaml`:
- Embedded croissant → spawns; missing-binary Info warning.
- File-loaded evil.yaml without flag → spawn refused;
  Recommended warning surfaces the would-be command + opt-in
  instructions.
- File-loaded + `--allow-external-validator` → spawn proceeds;
  `--strict-dcat` honors findings as expected.

All 155 profile unit tests + 49 integration tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Roborev #2509 flagged that the trust gate added in 2f02c1c had no
automated coverage — the security-sensitive orchestrator path wasn't
exercised by any test. Adds 4 integration tests in tests/test_profile.rs:

- external_validator_gated_for_file_loaded_profile_without_flag:
  loads a sh-based external validator from a file, runs profile with
  --validate-dcat only, asserts the gate warning is present AND no
  SPAWNED- marker appears anywhere in warnings (proving the command
  was NOT spawned).

- external_validator_runs_for_file_loaded_profile_with_flag:
  same fixture + --allow-external-validator. Asserts the gate warning
  is absent AND a finding with the stable field "external_validate"
  and the new "fake-validator: SPAWNED-fake-validator" message format
  is surfaced (proving the spawn happened).

- external_validator_strict_dcat_fails_on_file_loaded_findings:
  flag + --strict-dcat. Asserts the command fails with the
  "external validator finding(s)" summary in stderr.

- external_validator_embedded_profile_skips_gate_warning:
  uses the embedded croissant profile. Asserts the file-loaded gate
  warning does NOT appear, regardless of whether mlcroissant itself
  is installed on the CI runner (Info-severity missing-binary
  warning is acceptable; the gate warning is not).

Three of the four tests gate on `cfg(unix)` because they spawn `sh`;
the embedded-profile test is portable.

All 53 profile integration tests + 155 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 10a97da into master May 27, 2026
18 of 20 checks passed
@jqnatividad jqnatividad deleted the croissant-mlcroissant-validator branch May 27, 2026 16:41
jqnatividad added a commit that referenced this pull request May 27, 2026
…e framework (#3912)

Closes the final §6 follow-up from profile3-handoff.md. DCAT-AP v3
ships SHACL constraints upstream, not JSON Schema, so qsv's built-in
JSON-Schema validator can't validate the emitted block. This commit
generalizes the `validation.external` framework introduced in #3911
to support multi-file validators and wires up the canonical Python
SHACL engine, `pyshacl`, as the DCAT-AP v3 reference validator.

Framework generalization
- New `ExternalValidatorResource` struct + `resources: Vec<...>`
  field on `ExternalValidator`. Each entry declares a logical
  embedded-resource name resolved against a compile-time
  `EMBEDDED_RESOURCES` table (`include_str!`-bundled), a token
  name used as `{<name>}` in args, and an optional file suffix.
- New `EMBEDDED_RESOURCES` table in
  `src/cmd/profile/external_validate.rs` with one entry today:
  `dcat-ap-v3-shacl-shapes` mapping to the vendored Turtle file.
  Custom YAML profiles can reference any name listed there but
  cannot register new ones (qsv-release-time decision).
- `resolve_args` / `substitute_tokens` rewritten for multi-token
  substitution. `{file}` keeps its "append if absent" fallback;
  named tokens (`{shapes}`, `{schema}`, etc.) substitute in
  place; unknown `{tokens}` pass through verbatim so a validator
  that genuinely wants `{foo}` in its CLI gets it. The OsString
  path-fidelity guarantee from #3911 is preserved.
- Validation: reserved `name: "file"` and unknown `embedded`
  values are rejected at spawn time with `Severity::Required`
  warnings so a misconfigured profile fails fast.

Vendored SHACL bundle
- `resources/dcat-ap-v3/shacl/dcat-ap-SHACL.ttl` (176KB, 2321
  lines): canonical SHACL shapes from SEMICeu DCAT-AP v3.0.0,
  fetched verbatim from
  https://github.com/SEMICeu/DCAT-AP/blob/master/releases/3.0.0/shacl/dcat-ap-SHACL.ttl
- `resources/dcat-ap-v3/shacl/README.md`: source + re-vendor
  procedure, similar to the GSA bundle's documentation.

Profile wiring
- `resources/profiles/dcat-ap-v3.yaml` declares the pyshacl
  external validator with `args: [-s, {shapes}, -sf, turtle,
  -df, json-ld, -f, human, {file}]`, references the embedded
  SHACL bundle via `resources: [{name: shapes, embedded:
  dcat-ap-v3-shacl-shapes, suffix: .ttl}]`, and carries an
  `install_hint` so users see `pip install pyshacl` exactly when
  pyshacl is missing.

Docs
- `resources/profiles/README.md`: config table extended with
  the `resources` field; full DCAT-AP/pyshacl example added;
  instructions for adding new `EMBEDDED_RESOURCES` entries.

Tests (+7 new in external_validate, embedded smoke extended)
- resolve_args_substitutes_named_extras
- embedded_resources_table_includes_dcat_ap_v3_shapes (sanity-
  checks the bundle is real Turtle, guarding against bad
  re-vendor steps)
- lookup_embedded_returns_none_for_unknown
- resource_with_reserved_name_file_is_rejected
- resource_with_unknown_embedded_is_rejected
- resource_tempfile_is_materialized_with_correct_suffix
- embedded_dcat_ap_v3_parses_and_dry_compiles extended

End-to-end verified on a host without pyshacl: profile run with
`--validate-dcat --profile dcat-ap-v3` emits one Severity::Info
warning carrying the install hint and the projection still ships.

All 161 profile unit tests + 53 integration tests pass. Clippy +
docs-drift-check clean. All four binaries build green.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jqnatividad added a commit that referenced this pull request May 27, 2026
Nine clippy warnings have been visible in every `cargo +nightly
clippy -F all_features` run since the YAML projection engine
landed. None of them affect behavior, but they crowd the lint
output and obscure new issues. This commit clears all nine in
two files:

- `escape_json_pointer_token` was a private helper with no
  non-test callers; the only reference was a self-referential
  test in the same file. Both deleted. RFC 6901 escape coverage
  lives in `discovery_merge::tests::escape_token_handles_rfc6901_round_trip`.
- `ProjectionMode::Catalog` is currently only constructed in
  tests — the orchestrator calls `wrap_in_catalog_envelope`
  after `discovery_merge` for `--catalog` mode (Roborev #2490
  finding #1), bypassing the in-`project` catalog match arm.
  Added `#[allow(dead_code)]` on the variant with an explanatory
  doc-comment so the API surface stays intact and the never-
  constructed lint stops firing in non-test builds.
- `coerce_json_or_string` in projection.rs had a collapsible_if
  pattern flagged by clippy. Collapsed via Rust 2024 let-chain:
  `if (open/close braces) || (open/close brackets) && let Ok(v)
  = serde_json::from_str(...)`. Saves one indentation level.
- The trust-gate orchestrator path in `src/cmd/profile.rs` used
  `is_some() && ... .unwrap()` (introduced in #3911); refactored
  to `if let Some(cfg) = ... && !allow_external` via let-chain.
- `is_uuid_like` gained an explicit `assert!(bytes.len() > 23)`
  before its fixed-offset byte index reads, satisfying the
  repo-enabled `clippy::missing_asserts_for_indexing` lint and
  letting the compiler elide bounds checks. The underlying
  `len() == 36` guard already made the access safe; the assert
  documents that contract.
- Four `doc_lazy_continuation` warnings on `is_uuid_like`'s
  doc comment were fixed by inserting a blank line between the
  list and the descriptive paragraph below it.

`cargo +nightly clippy --bin qsv -F all_features -- -W clippy::perf`
now exits clean except for the harmless "src/main.rs found in
multiple build targets" notice (Cargo workspace artifact from
qsv/qsvmcp sharing main.rs).

All 160 profile unit tests + 53 integration tests pass. Docs
drift check clean. Net change: -9 LOC across 2 files.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants