Skip to content

validate: correctness, perf cleanup, and polish from code review#3779

Merged
jqnatividad merged 5 commits into
masterfrom
validate-review-2026-04-28
Apr 29, 2026
Merged

validate: correctness, perf cleanup, and polish from code review#3779
jqnatividad merged 5 commits into
masterfrom
validate-review-2026-04-28

Conversation

@jqnatividad

Copy link
Copy Markdown
Collaborator

Summary

A multi-pass review of src/cmd/validate.rs (~2,650 lines) surfaced a mix of correctness bugs, perf-leaning cleanups, and polish items. This PR lands all three groups as separate commits so each can be read and reverted independently.

Commit 1 — fix(validate): correctness fixes in JSON Schema validation

  • UniqueCombinedWithValidator::is_valid always returns false, forcing callers through validate() (which atomically check-and-inserts under lock). The previous impl read seen_combinations but never inserted, so any caller invoking Validator::is_valid() against a uniqueCombinedWith schema would let duplicates pass silently.
  • Substring schema-feature detection replaced with a tree walker (detect_custom_schema_features). The substring approach was whitespace-sensitive ("format" : "currency" silently disabled the custom format) and could false-match on description/title text containing literal keyword names. The tree walk runs once after the schema is parsed.
  • load_json now checks response.status().is_success() and propagates response.text() errors instead of .unwrap_or_default() flowing empty bodies into the JSON parser as confusing parse errors.
  • Email options apply display_text and domain_literal toggles explicitly in both directions (no more reliance on jsonschema crate defaults that may flip in future versions).
  • valid_flags is now extended by batch.len() rather than batch_size, so the last partial batch doesn't leave trailing valid flags for nonexistent rows.
  • Documented the implicit serde_json/preserve_order dependency at the uniqueCombinedWith index-resolution site.
  • Updated validate_schema_subcommand_with_invalid_url_schema to accept the new "HTTP error" substring.

Commit 2 — refactor(validate): perf and cleanup tweaks

  • UniqueCombinedWithValidator: RwLock<HashSet>Mutex<HashSet> (validate path always takes the write lock; no read-only path benefits from RwLock).
  • BitVec::set_unchecked (unsafe) → BitVec::set (safe). This path is dominated by validator work, so the micro-opt is irrelevant.
  • Documented the row-number-as-extra-ByteRecord-field convention used to pass the row index into the parallel closure.
  • Added a doc comment on split_invalid_records explaining the intentional re-read (page cache + all-valid hot path) instead of streaming.
  • Extracted a shared load_dynenum_set helper; lite and non-lite factories now delegate column resolution + drain to one function instead of duplicating ~60 lines.

Commit 3 — chore(validate): polish — re-entrant safety, arg checks, test hygiene

  • OnceLock::set(...).unwrap() / ? replaced with get_or_init / let _ = .set() for NULL_TYPE, DELIMITER, QSV_CACHE_DIR, CKAN_API, and CKAN_TOKEN. The previous code panics or returns Err on re-entry (e.g. qsv used as a library, or under qsvmcp).
  • flag_no_headers rejection moved earlier in run() so we don't open a CSV reader only to immediately drop it on user error.
  • Warn when a .json file appears earlier in the input list — schema detection only inspects the last positional, so a misordered argument silently fell into RFC 4180 mode.
  • test_load_json_via_url marked #[ignore] so CI/offline runs are deterministic; run with --ignored to exercise it.

Test plan

  • cargo build --bin qsv -F all_features
  • cargo build --bin qsvlite -F lite
  • cargo clippy --bin qsv -F all_features -- -D warnings (clean)
  • cargo test -F all_features2715 passed, 0 failed, 29 ignored
  • Each commit independently builds and passes cargo test validate -F all_features

🤖 Generated with Claude Code

- UniqueCombinedWithValidator::is_valid now always returns false, forcing
  callers through validate() which atomically check-and-inserts under lock.
  The previous impl read seen_combinations but never inserted, so any caller
  using Validator::is_valid() against a uniqueCombinedWith schema would let
  duplicates pass silently.
- Replaced four substring searches against the raw schema text with a
  detect_custom_schema_features tree walker. The substring approach was
  whitespace-sensitive and could false-match on description/title text
  containing literal keyword names; the tree walk runs once after the
  schema is parsed.
- load_json now checks response.status().is_success() and propagates the
  response.text() error properly, instead of unwrap_or_default() flowing
  empty bodies into the JSON parser as confusing parse errors.
- Email options apply display_text and domain_literal toggles explicitly
  in both directions (no more reliance on lib defaults that may flip).
- valid_flags is now extended by batch.len() rather than batch_size, so
  the last partial batch doesn't leave trailing valid flags for nonexistent
  rows.
- Documented the serde_json/preserve_order dependency at the
  uniqueCombinedWith index-resolution site.
- Test now accepts the new "HTTP error" substring for invalid URL schema.
- UniqueCombinedWithValidator: replace RwLock<HashSet> with Mutex<HashSet>.
  validate() always takes the write lock; there is no read-only path that
  benefits from an RwLock, so Mutex is the correct (and cheaper) primitive.
- Replace unsafe BitVec::set_unchecked with safe BitVec::set; this code
  path is dominated by validator work, so the micro-optimization is
  irrelevant.
- Document the row-number-as-extra-ByteRecord-field convention used to
  pass the row index into the parallel closure.
- Add a doc comment on split_invalid_records explaining why the input is
  intentionally re-read rather than streaming during the validation loop
  (page cache + all-valid hot path).
- Extract a shared load_dynenum_set helper; lite and non-lite factories
  now delegate the column-resolution + drain loop to a single function
  instead of duplicating ~60 lines.
- Replace OnceLock::set(...).unwrap()/? with get_or_init / let _ = .set()
  for NULL_TYPE, DELIMITER, QSV_CACHE_DIR, CKAN_API, and CKAN_TOKEN. The
  previous code panics or returns Err when run() is invoked twice in the
  same process (e.g. qsv used as a library, or under qsvmcp).
- Move the no-headers rejection earlier in run() so we don't open a CSV
  reader only to immediately drop it on a user-facing error.
- Warn when a .json file appears earlier in the input list — the schema
  detection only inspects the last positional, so a misordered argument
  used to silently fall into RFC 4180 mode.
- Mark test_load_json_via_url as #[ignore] so CI/offline runs are
  deterministic; run explicitly with --ignored to exercise it.
@codacy-production

codacy-production Bot commented Apr 28, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 12 complexity

Metric Results
Complexity 12

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the validate command’s JSON Schema validation by fixing correctness issues (notably around stateful uniqueCombinedWith), tightening network/file error handling for schema loading, and applying a set of performance/maintainability cleanups.

Changes:

  • Fix uniqueCombinedWith correctness and concurrency behavior; improve schema feature detection via a parsed-schema tree walk.
  • Refactor/cleanup validation hot-path code (batch flag handling, removal of unchecked bitvec writes, shared dynamicEnum loader).
  • Harden operational behavior (re-entrant OnceLock usage, earlier arg rejection, deterministic test behavior).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
src/cmd/validate.rs Correctness fixes for custom validators, safer schema feature detection, error-handling improvements, and perf/cleanup refactors.
tests/test_validate.rs Adjust test to accept the new HTTP error wording when schema fetch returns a non-success status.

Comment thread src/cmd/validate.rs Outdated
Comment thread src/cmd/validate.rs Outdated
Comment thread src/cmd/validate.rs Outdated
jqnatividad and others added 2 commits April 28, 2026 17:59
…al paths

In the qsvlite variant of dyn_enum_validator_factory:

- Move NamedTempFile::new() inside the URL branch so local-file dynamicEnum
  validation no longer pays for an unused temp file on the hot path.
- Replace temp_download.path().to_str().unwrap() with a fallible match that
  returns a ValidationError on non-UTF-8 paths instead of panicking.
- Same treatment for uri_path.to_str().unwrap() on local paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The L8 refactor unified the lite and non-lite dyn_enum_validator_factory
variants behind load_dynenum_set, which emits a single error message:
"Column '...' not found in lookup table". The test still had a feature-
gated assertion expecting the old lite-only wording, which broke CI under
the lite feature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit bd9faf8 into master Apr 29, 2026
17 checks passed
@jqnatividad jqnatividad deleted the validate-review-2026-04-28 branch April 29, 2026 02:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants