feat(profile): five §5 follow-ups (sibling-URL discovery, profile-driven validation, force:true, UUID URL-title walk-up, GSA-bundle deferral)#3904

Merged
jqnatividad merged 8 commits into master from profile-followups on May 26, 2026

Conversation

@jqnatividad

Collaborator

Summary

Five queued profile-handoff §5 follow-ups landed on a single branch, plus one roborev fix on top:

| Commit | § | What |
|---|---|---|
| 87141d920 | 5.9 | url_title_default walks past UUID-like basenames (canonical 8-4-4-4-12 + compact 32-hex) up to 3 levels; /datastore/dump/<uuid> now yields "dump" instead of the opaque hex |
| 443576f54 | 5.2 | Sibling-URL + JSON-LD <script> mechanisms added to dcat_discover::discover. Probes .metadata.json, .dcat.json, <dirname>/datapackage.json, <host>/.well-known/data.json; HTML JSON-LD sniff on the parent landing page |
| 21a871eef | 5.8 | qsv validate spawns when scheming spec declares validators; RFC4180 failures surface as qsv:validation entries in dcat_warnings |
| cfe65c6cc | 5.3 | Deferral doc: full GSA jsonschema bundle vendoring blocked by JSON-LD prefix mismatch (bundle uses unprefixed otherIdentifier, we emit prefixed dct:identifier). Doc spells out the three real adoption paths |
| eeeff8bb0 | 5.4 | dataset_info force: true honored at merge time; merge_discovered now refuses to overlay paths the user marked forced |
| 2ffc2bb4a | | Roborev #2469 fix: RFC 6901 escape discovered keys before comparing against forced paths (matters for full-IRI JSON-LD properties like http://purl.org/dc/terms/title) |

What's NOT here

  • §5.7 --freq-cache flag: implemented (default-on, with two roborev rounds catching equivalence bugs) but ultimately dropped from this branch as scope creep relative to the perf win. The cache work is preserved on branch profile-followups-backup so it can be revisited if the perf case strengthens.
  • §5.5 Croissant ML projection — explicitly out of scope (separate big feature, not a follow-up).
  • §5.3 GSA bundle adoption — deferred with explanatory doc; needs JSON-LD expansion / key-translation layer / dcat::build refactor before the bundle can drive validation.

Roborev cycles closed during development

| # | Commit | Verdict | Fix commit |
|---|---|---|---|
| 2469 | eeeff8bb0 | F (1 Medium) | 2ffc2bb4a: RFC 6901 escaping for forced-path comparison |

(Two earlier rounds 2462/2463 were resolved on the dropped §5.7 work and are preserved in the backup branch.)

Test plan

  • cargo test --bin qsv -F all_features cmd::profile — 122 unit tests pass (was 96 pre-branch)
  • cargo test --test tests -F all_features -- test_profile:: — 17 integration tests pass
  • cargo test --test tests -F datapusher_plus -- test_profile:: — 17 integration tests pass against qsvdp binary
  • cargo build clean for qsv, qsvmcp, qsvlite, qsvdp
  • cargo +nightly fmt clean
  • cargo clippy --bin qsv -F all_features clean
  • python3 scripts/docs-drift-check.py reports no drift

🤖 Generated with Claude Code

jqnatividad and others added 6 commits May 26, 2026 07:29
For CKAN-style `/datastore/dump/<uuid>` URLs the leaf basename is
an opaque UUID: better than the random tempfile suffix but still
not a usable title. Walk up one level at a time (capped at 3
levels) and return the first non-UUID-like segment we find. The
classic CKAN dump URL now yields "dump" instead of a 36-char UUID.

New helper `is_uuid_like()` matches:
  - canonical 8-4-4-4-12 hex with dashes
  - compact 32 contiguous hex characters
Both case-insensitive. Other ID-like patterns (MongoDB ObjectId at
24 hex, ULIDs, slugified IDs) are intentionally NOT matched —
over-eager matching would walk past legitimate titles like
"2024-Q3".

Behavior:
  /datastore/dump/<uuid>               -> "dump"          (was: uuid)
  /path/snapshots/<32-hex>             -> "snapshots"     (was: hex)
  /datastore/dump/2024-Q3-payments.csv -> "2024-Q3-payments" (unchanged)
  /<uuid>/<uuid>/<uuid>                -> leaf uuid (fallback after cap)
  36-char non-hex string               -> unchanged (length-collision check)

If every candidate up the 3-level cap is UUID-like, falls back to
the leaf UUID — still reproducible, still beats the tempfile
suffix. Users wanting a prettier title supply
`--initial-context.package.title`; a CKAN
`/api/3/action/resource_show?id=<uuid>` lookup is a deferred
follow-up.
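
The walk-up described above can be sketched as follows. This is a minimal illustration, not the code in the commit: `title_from_path`, `stem`, and `MAX_LEVELS` are hypothetical names, the sketch takes a bare URL path rather than a full URL, and it assumes the 3-level cap counts the leaf itself and that the file extension is stripped before the UUID check.

```rust
/// Assumption: the cap counts the leaf plus the segments above it.
const MAX_LEVELS: usize = 3;

/// True for canonical 8-4-4-4-12 hex UUIDs or 32 contiguous hex characters,
/// case-insensitive. Other ID shapes (24-hex ObjectIds, ULIDs) do not match.
fn is_uuid_like(s: &str) -> bool {
    let b = s.as_bytes();
    match b.len() {
        32 => b.iter().all(|c| c.is_ascii_hexdigit()),
        36 => b.iter().enumerate().all(|(i, c)| match i {
            8 | 13 | 18 | 23 => *c == b'-',
            _ => c.is_ascii_hexdigit(),
        }),
        _ => false,
    }
}

/// Strip a trailing ".ext" if present (hypothetical helper).
fn stem(segment: &str) -> &str {
    match segment.rsplit_once('.') {
        Some((s, _ext)) if !s.is_empty() => s,
        _ => segment,
    }
}

/// Hypothetical stand-in for the title derivation: URL path in, title out.
fn title_from_path(path: &str) -> Option<String> {
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    let leaf = stem(*segments.last()?);
    segments
        .iter()
        .rev()
        .copied()
        .take(MAX_LEVELS)
        .map(stem)
        .find(|s| !is_uuid_like(s))
        // Every candidate was UUID-like: fall back to the leaf.
        .or(Some(leaf))
        .map(|s| s.to_string())
}
```

Keeping the matcher to exactly two lengths is what lets a title such as "2024-Q3-payments" pass through untouched.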

The previous `url_title_preserves_uuid_basename_unchanged` test
documented the old behavior — replaced with four new tests covering
the walk, the all-UUID fallback, the normal-basename regression
check (including a 36-char non-hex length-collision case), and an
`is_uuid_like` unit-level matrix of positives + negatives.

Verified: 99 profile unit tests pass; 15 integration tests pass
under both -F all_features and -F datapusher_plus. cargo +nightly
fmt clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires two more mechanisms into dcat_discover::discover, chained in
priority order after the existing Link: rel=describedBy probe:

  2. Sibling URLs by convention (qsv profile follow-ups §5.2). Four
     candidates tried in order:
       - <url>.metadata.json     (qsv profile's own output naming)
       - <url>.dcat.json         (common DCAT-JSON convention)
       - <dirname>/datapackage.json (Frictionless Data Package spec)
       - <host>/.well-known/data.json (DCAT-US site catalog)

  3. HTML JSON-LD <script type="application/ld+json"> blocks in the
     URL's parent (landing-page) HTML. Open-data portals typically
     host the dataset page one level above the raw CSV download.

Implementation:
- New `discover_via_sibling_urls` + `sibling_candidates` helper.
  Hand-rolled .metadata.json/.dcat.json suffixing preserves query
  strings (textual append); url::Url-based construction for the
  datapackage.json and /.well-known/data.json variants drops query
  & fragment since they're host-relative, not input-relative.
- New `discover_via_html_jsonld` + `extract_jsonld_blocks` helper.
  Pure-string HTML scan (no parser dep): locate <script ...> tags,
  case-insensitive type-attribute check for application/ld+json,
  parse the body as JSON, run through extract_dcat_dataset (which
  already handles @graph envelopes + bare-object shape fallback).
  Skips response if neither Content-Type nor body sniff suggests
  HTML — avoids wasted scans on PDFs or binary blobs served with
  no Content-Type.
- New `fetch_json_and_extract` shared GET helper, mirroring
  discover_via_link_header's 4 MiB body cap.
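
As an illustration of the pure-string scan, here is a minimal sketch. It is not the commit's `extract_jsonld_blocks`: the name `extract_jsonld_bodies` is hypothetical, it returns raw block bodies instead of parsed JSON, and it simplifies the type-attribute check to a substring test on the opening tag.

```rust
/// Returns the trimmed body of every <script> tag whose opening tag mentions
/// application/ld+json, matched case-insensitively. Parsing the bodies as
/// JSON is left to the caller.
fn extract_jsonld_bodies(html: &str) -> Vec<&str> {
    // ASCII-only lowercasing keeps byte offsets identical to `html`.
    let lower = html.to_ascii_lowercase();
    let mut out = Vec::new();
    let mut pos = 0;
    while let Some(rel) = lower[pos..].find("<script") {
        let tag_start = pos + rel;
        // End of the opening tag (simplification: assumes no '>' in attributes).
        let Some(rel_end) = lower[tag_start..].find('>') else { break };
        let body_start = tag_start + rel_end + 1;
        let Some(rel_close) = lower[body_start..].find("</script") else { break };
        let body_end = body_start + rel_close;
        if lower[tag_start..body_start].contains("application/ld+json") {
            out.push(html[body_start..body_end].trim());
        }
        pos = body_end;
    }
    out
}
```

Scanning a lowercased copy while slicing the original keeps the JSON body's case intact, which matters because JSON-LD keys are case-sensitive.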

Module doc comment updated: the §5.2 "follow-up" markers are
replaced with the new active descriptions.

Nine unit tests added (sibling_candidates × 3,
extract_jsonld_blocks × 6) — pure-string, no network. Covers
typical CSV URL, query+fragment stripping, host-only URLs, basic
<script> match, mixed-case type attribute, walking past
non-dataset blocks, no-match negative, unrelated <script> tags,
and the @graph envelope variant.

Verified: 108 profile unit tests pass (was 99, +9 new); 15
integration tests pass under both -F all_features and -F
datapusher_plus. cargo +nightly fmt + clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the scheming spec declares one or more `validators` on any
field (dataset_fields or resource_fields), invoke `qsv validate`
against the input and merge any RFC4180 failures into
dcat_warnings. The presence of validators is the trigger; their
string content isn't interpreted yet — auto-generating a JSON
Schema from declared types + CKAN validators is a future
enhancement, but the architectural hook is in place.

Implementation:
- `Spec::has_validators()` walks both dataset_fields and
  resource_fields, returns true if any field's extras carry a
  non-empty, non-whitespace `validators` string. Whitespace-only
  entries are intentionally treated as "not declared" so empty
  but present entries don't accidentally trigger.
- `run_profile_validation(input_path) -> Vec<DcatWarning>` spawns
  `qsv validate <input>` directly (not via util::run_qsv_cmd,
  which errors on non-zero exit — the validate path needs to
  succeed when the subprocess fails). Best-effort: spawn errors,
  missing binary, or non-UTF-8 stderr all silently degrade to
  "no warnings". Emits a `qsv profile: ran `validate`` status
  line on stderr, mirroring the existing `ran `frequency`` /
  `ran `count`` markers so the helper's invocation is observable.
- Wired into the existing dcat_warnings merge block in
  profile.rs::run, alongside the build-time warning filter and
  --validate-dcat schema-violation path. Independent of
  --validate-dcat (which validates the emitted dcat block, not
  the input CSV).
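
The trigger check can be pictured with a small sketch. The `Field` and `Spec` shapes below are assumptions made for illustration (the real types are not shown in this PR), and `field` is a hypothetical constructor used only to keep the example short.

```rust
use std::collections::HashMap;

/// Assumed shape: each field carries free-form string extras.
struct Field {
    extras: HashMap<String, String>,
}

/// Assumed shape of the scheming spec.
struct Spec {
    dataset_fields: Vec<Field>,
    resource_fields: Vec<Field>,
}

impl Spec {
    /// True if any dataset or resource field declares a non-empty,
    /// non-whitespace `validators` string.
    fn has_validators(&self) -> bool {
        self.dataset_fields
            .iter()
            .chain(self.resource_fields.iter())
            .any(|f| {
                f.extras
                    .get("validators")
                    .map_or(false, |v| !v.trim().is_empty())
            })
    }
}

/// Hypothetical helper: build a field with an optional `validators` entry.
fn field(validators: Option<&str>) -> Field {
    let mut extras = HashMap::new();
    if let Some(v) = validators {
        extras.insert("validators".to_string(), v.to_string());
    }
    Field { extras }
}
```

Only the presence of a non-blank string matters here; its content is not interpreted, matching the description above.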

Failures land as DcatWarning entries with:
  field = "qsv:validation"
  severity = Required
  message = "input failed `qsv validate` (RFC4180): <detail>"

Tests:
- Four unit tests on Spec::has_validators: dataset-side trigger,
  resource-side trigger, none-declared negative, whitespace-only
  negative.
- Two integration tests on the trigger plumbing:
  profile_runs_validation_when_spec_declares_validators (clean
  CSV + druf spec → validate spawns, no qsv:validation warning),
  profile_skips_validation_when_spec_has_no_validators (spec-less
  → validate must NOT spawn).

Verified: 112 profile unit tests pass (was 108, +4); 17 integration
tests pass under both -F all_features and -F datapusher_plus.
cargo +nightly fmt + clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The handoff suggested vendoring the full GSA dcat-us JSON Schema
suite as a drop-in replacement for embedded_minimal_schema. While
investigating, hit a fundamental shape mismatch the original plan
didn't account for: the GSA bundle is written against the
**unprefixed** JSON-LD-expanded form (`otherIdentifier`, `@type:
"Dataset"`) while `dcat::build` emits the **prefixed JSON-LD-compact**
form (`dct:identifier`, `@type: "dcat:Dataset"`). Naïvely vendoring
the bundle and pointing the validator at it would flag every key
as missing.

Updated the dcat_validate module-level doc comment to spell out
the three real paths forward (JSON-LD expansion, key translation
layer, refactor dcat::build to emit expanded form) and why each
is bigger scope than a vendor-and-swap. Embedded minimal schema
stays in place — it catches the mandatory-field class of mistake
cheaply.

No code changes; doc-only commit so the next maintainer doesn't
re-do the same investigation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps the existing `{value, force: true}` plumbing with real
merge-time effect for `dataset_info` JSON-Pointer entries.
Discovered DCAT (Link header / sibling URL / JSON-LD <script>)
will no longer overlay paths the user marked forced — even when
the inferred projection left them absent.

Use case: declare a field "intentionally absent" and prevent
publisher DCAT discovery from silently filling it in. Example:
  {"dataset_info":
     {"/dcat/dct:rights": {"value": null, "force": true}}}
yields literal `null` at `/dcat/dct:rights` AND blocks any
discovered `dct:rights` from being merged in.

Implementation:
- `collect_forced_dataset_info_paths(raw)` walks the `dataset_info`
  subtree BEFORE `normalize_value_force` strips the wrappers and
  collects pointer paths whose value matched the exact two-key
  `{"value": ..., "force": true}` shape. `force: false` and plain
  values aren't collected.
- `load_initial_context` signature extended: returns
  `(package, resource, dataset_info, forced_dcat_paths)`. The
  previous wrapper-stripping behavior is unchanged.
- `AnalysisContext` gains `forced_dcat_paths: Vec<String>` so the
  orchestrator can hand it to `merge_discovered`.
- `merge_discovered(inferred, discovered, &forced_dcat_paths)` now
  skips each discovered top-level key whose translated path
  (`/dcat/<key>`) equals or prefixes any forced path. Nested
  forces (e.g. `/dcat/dcat:contactPoint/vcard:fn`) block the
  whole-object overlay since `merge_discovered` operates at the
  top level — nested-leaf force is satisfied by the later
  pointer-override pass.
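
A minimal sketch of the skip rule follows. It is an illustration under stated assumptions: string values stand in for JSON values, `is_forced` is a hypothetical helper name, the overlay here overwrites existing keys (the real overlay semantics are not spelled out in this message), and keys are assumed to contain no `/` or `~`.

```rust
use std::collections::BTreeMap;

/// True if the top-level key's pointer path (`/dcat/<key>`) equals a forced
/// path or is a parent of one.
fn is_forced(key: &str, forced_paths: &[String]) -> bool {
    let candidate = format!("/dcat/{key}");
    // Match nested forces on a segment boundary only.
    let nested_prefix = format!("{candidate}/");
    forced_paths
        .iter()
        .any(|p| *p == candidate || p.starts_with(&nested_prefix))
}

/// Overlay discovered top-level keys onto the inferred block, skipping any
/// key the user marked forced.
fn merge_discovered(
    inferred: &mut BTreeMap<String, String>,
    discovered: &BTreeMap<String, String>,
    forced_paths: &[String],
) {
    for (key, value) in discovered {
        if is_forced(key, forced_paths) {
            continue;
        }
        inferred.insert(key.clone(), value.clone());
    }
}
```

Comparing against `candidate + "/"` rather than a bare prefix avoids treating `/dcat/dct:rightsHolder` as nested under `/dcat/dct:rights`.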

Scope-limit: force on `package` / `resource` initial-context
entries is still accepted and stripped but NOT honored at merge
time — that needs a CKAN→DCAT JSON-Pointer mapping table
(documented in `load_initial_context`'s comment as a deferred
follow-up). USAGE is updated to spell out the new dataset_info
behavior and the package/resource gap.

Tests:
- 3 unit tests on `collect_forced_dataset_info_paths`:
  dataset_info collection with mixed wrapper / plain / force:false
  / null-value-force shapes, no-dataset_info, pathological
  non-object dataset_info.
- 4 unit tests on `merge_discovered`: forced top-level key blocks
  overlay; forced nested path blocks the whole-object overlay;
  unrelated discovered keys still fill when one is forced; forced
  paths outside the /dcat subtree are ignored.
- 1 integration test exercising the full flow against the qsv
  binary: initial-context with `{value: "MIT IRI", force: true}`
  for dct:license (lands via pointer override) and `{value: null,
  force: true}` for dct:rights (null round-trips, force blocks
  hypothetical discovery overlay).

Verified: 119 profile unit tests pass (was 112, +7); 18
integration tests pass under both -F all_features and -F
datapusher_plus (was 17, +1). cargo +nightly fmt + clippy clean,
docs/help regenerated, docs-drift-check reports no drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…borev #2469)

One Medium finding on the §5.4 commit: the candidate JSON-Pointer
path built from each discovered DCAT key was interpolated directly
without RFC 6901 token escaping. A user wanting to force a JSON-LD
property whose key contains `/` or `~` (full IRIs like
`http://purl.org/dc/terms/title`, the rare CURIE-with-tilde) would
write the path in its escaped form
(`/dcat/http:~1~1purl.org~1dc~1terms~1title`), but our candidate
construction produced the un-escaped raw form
(`/dcat/http://purl.org/dc/terms/title`) — too many pointer
segments, never matches, force is silently ignored.

Fix:
- New `escape_json_pointer_token` helper that applies RFC 6901
  section 4 escaping (`~` → `~0`, `/` → `~1`) in the correct order
  (`~` first, otherwise the `~1` from a `/` would get
  double-escaped to `~01`).
- `merge_discovered` builds `candidate = format!("/dcat/{}",
  escape_json_pointer_token(k))` so the comparison stays in the
  canonical escaped JSON-Pointer space.
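
The escaping itself is small enough to show in full. The sketch below follows RFC 6901 token escaping as described above; `forced_candidate` is a hypothetical wrapper name added only to show how the escaped token feeds the comparison path.

```rust
/// RFC 6901 token escaping: `~` becomes `~0`, then `/` becomes `~1`.
/// The order matters: escaping `/` first would turn the `~` of each new
/// `~1` into `~0` on the second pass.
fn escape_json_pointer_token(token: &str) -> String {
    token.replace('~', "~0").replace('/', "~1")
}

/// Hypothetical wrapper: build the pointer path compared against forced paths.
fn forced_candidate(key: &str) -> String {
    format!("/dcat/{}", escape_json_pointer_token(key))
}
```

With this in place a full-IRI key produces a single pointer segment, so it can match a forced path the user wrote in escaped form.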

Tests (3 new in src/cmd/profile.rs::tests):
- merge_force_match_handles_full_iri_keys_via_rfc6901_escaping:
  forced path `/dcat/http:~1~1purl.org~1dc~1terms~1title` correctly
  blocks the discovered `http://purl.org/dc/terms/title` overlay.
- merge_force_does_not_match_unrelated_keys_after_escaping:
  regression check that the same escaping doesn't over-eagerly
  match an unrelated `dct:identifier` key.
- escape_json_pointer_token_matches_rfc6901: unit-level matrix —
  plain, /-only, ~-only, the tricky `~/` ordering trap (must yield
  `~0~1`, not `~01`), and the full-IRI case.

Verified: 122 profile unit tests pass (was 119, +3); 17 integration
tests pass under both -F all_features and -F datapusher_plus.
cargo +nightly fmt + clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codacy-production

codacy-production Bot commented May 26, 2026


Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues


Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread src/cmd/profile/dcat_discover.rs Outdated
Comment thread src/cmd/profile.rs Outdated
jqnatividad and others added 2 commits May 26, 2026 09:47
…orwarding

* dcat_discover::sibling_candidates: build all four candidates via
  `url::Url` parsing so query strings and fragments on the input URL
  don't get baked into the appended suffix. An input like
  `snapshot.csv?token=abc#frag` was producing
  `snapshot.csv?token=abc.metadata.json`, which servers interpreted as
  a GET on the CSV with a polluted query value rather than a fetch of
  the sibling JSON. Falls back to textual append only when the URL
  fails to parse. Updated the corresponding test to assert the new
  behavior for all four candidate slots.

* profile::run_profile_validation: forward `--no-headers` and
  `--delimiter` to `qsv validate` so it parses the input the same way
  the rest of the profile pipeline (stats/frequency/count) does.
  Without this, non-default CSV options would yield spurious or
  missed RFC4180 failures.
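
To make the candidate construction concrete, here is a std-only sketch. The commit builds the candidates with `url::Url`; this version splits the URL by hand so it has no dependencies, and it should be read as an approximation of the described behavior rather than the actual code.

```rust
/// Build the four sibling-metadata candidates for an input URL, dropping any
/// query string and fragment first. Without a "scheme://" prefix, only the
/// two textual-suffix candidates are produced.
fn sibling_candidates(input: &str) -> Vec<String> {
    // Strip the fragment first, then the query string.
    let no_frag = input.split('#').next().unwrap_or(input);
    let base = no_frag.split('?').next().unwrap_or(no_frag);

    let mut out = vec![
        format!("{base}.metadata.json"),
        format!("{base}.dcat.json"),
    ];

    if let Some(scheme_end) = base.find("://") {
        let after = scheme_end + 3;
        // origin = scheme + host; dirname = everything before the last '/'.
        let (origin, dirname) = match base[after..].find('/') {
            Some(i) => {
                let path_start = after + i;
                let last_slash = base.rfind('/').unwrap_or(path_start);
                (&base[..path_start], &base[..last_slash])
            }
            None => (base, base),
        };
        out.push(format!("{dirname}/datapackage.json"));
        out.push(format!("{origin}/.well-known/data.json"));
    }
    out
}
```

For `https://example.org/data/snapshot.csv?token=abc#frag`, the first candidate becomes `https://example.org/data/snapshot.csv.metadata.json`, so the token no longer ends up inside the suffix.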

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(roborev #2471)

Roborev flagged the new `--no-headers` / `--delimiter` forwarding path in
`run_profile_validation` as uncovered: the existing validation test only
exercised default comma-delimited input with headers, so it would still
pass if the forwarded args were dropped or misordered.

The new test uses a `;`-delimited CSV whose rows contain unquoted
commas. When parsed as the default `,`-delimited, field counts mismatch
the 1-field header and `qsv validate` emits an RFC4180 record-length
failure. When parsed with `;`, the six fields per row line up and
validation passes. Asserting the absence of a `qsv:validation` warning
on this input proves the `--delimiter ;` flag was forwarded to the
spawned `qsv validate`.

Verified by running `qsv validate` directly on the same content without
and with `--delimiter ;`: exit 1 vs exit 0 respectively, confirming
the test would fail if the forwarding were ever removed.
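
The forwarding under test can be sketched as an argument builder. `validate_args` is a hypothetical name, and the argument order (options before the input path) is an assumption; only the `--no-headers` and `--delimiter` flags come from the commit described above.

```rust
/// Assemble the arguments for the spawned `qsv validate`, forwarding the
/// CSV parsing options the rest of the profile pipeline uses.
fn validate_args(input: &str, no_headers: bool, delimiter: Option<char>) -> Vec<String> {
    let mut args = vec!["validate".to_string()];
    if no_headers {
        args.push("--no-headers".to_string());
    }
    if let Some(d) = delimiter {
        args.push("--delimiter".to_string());
        args.push(d.to_string());
    }
    args.push(input.to_string());
    args
}
```

A caller would pass the result to `std::process::Command::new("qsv").args(...)` and treat a non-zero exit as a warning to record, not an error to propagate.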

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 1667038 into master May 26, 2026
16 of 18 checks passed
@jqnatividad jqnatividad deleted the profile-followups branch May 26, 2026 13:58
Comment thread src/cmd/profile.rs Dismissed