validate: correctness, perf cleanup, and polish from code review#3779
Merged
Conversation
- UniqueCombinedWithValidator::is_valid now always returns false, forcing callers through validate() which atomically check-and-inserts under lock. The previous impl read seen_combinations but never inserted, so any caller using Validator::is_valid() against a uniqueCombinedWith schema would let duplicates pass silently. - Replaced four substring searches against the raw schema text with a detect_custom_schema_features tree walker. The substring approach was whitespace-sensitive and could false-match on description/title text containing literal keyword names; the tree walk runs once after the schema is parsed. - load_json now checks response.status().is_success() and propagates the response.text() error properly, instead of unwrap_or_default() flowing empty bodies into the JSON parser as confusing parse errors. - Email options apply display_text and domain_literal toggles explicitly in both directions (no more reliance on lib defaults that may flip). - valid_flags is now extended by batch.len() rather than batch_size, so the last partial batch doesn't leave trailing valid flags for nonexistent rows. - Documented the serde_json/preserve_order dependency at the uniqueCombinedWith index-resolution site. - Test now accepts the new "HTTP error" substring for invalid URL schema.
- UniqueCombinedWithValidator: replace RwLock<HashSet> with Mutex<HashSet>. validate() always takes the write lock; there is no read-only path that benefits from an RwLock, so Mutex is the correct (and cheaper) primitive. - Replace unsafe BitVec::set_unchecked with safe BitVec::set; this code path is dominated by validator work, so the micro-optimization is irrelevant. - Document the row-number-as-extra-ByteRecord-field convention used to pass the row index into the parallel closure. - Add a doc comment on split_invalid_records explaining why the input is intentionally re-read rather than streaming during the validation loop (page cache + all-valid hot path). - Extract a shared load_dynenum_set helper; lite and non-lite factories now delegate the column-resolution + drain loop to a single function instead of duplicating ~60 lines.
- Replace OnceLock::set(...).unwrap()/? with get_or_init / let _ = .set() for NULL_TYPE, DELIMITER, QSV_CACHE_DIR, CKAN_API, and CKAN_TOKEN. The previous code panics or returns Err when run() is invoked twice in the same process (e.g. qsv used as a library, or under qsvmcp). - Move the no-headers rejection earlier in run() so we don't open a CSV reader only to immediately drop it on a user-facing error. - Warn when a .json file appears earlier in the input list — the schema detection only inspects the last positional, so a misordered argument used to silently fall into RFC 4180 mode. - Mark test_load_json_via_url as #[ignore] so CI/offline runs are deterministic; run explicitly with --ignored to exercise it.
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 12 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR improves the validate command’s JSON Schema validation by fixing correctness issues (notably around stateful uniqueCombinedWith), tightening network/file error handling for schema loading, and applying a set of performance/maintainability cleanups.
Changes:
- Fix
uniqueCombinedWithcorrectness and concurrency behavior; improve schema feature detection via a parsed-schema tree walk. - Refactor/cleanup validation hot-path code (batch flag handling, removal of unchecked bitvec writes, shared dynamicEnum loader).
- Harden operational behavior (re-entrant OnceLock usage, earlier arg rejection, deterministic test behavior).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/cmd/validate.rs |
Correctness fixes for custom validators, safer schema feature detection, error-handling improvements, and perf/cleanup refactors. |
tests/test_validate.rs |
Adjust test to accept the new HTTP error wording when schema fetch returns a non-success status. |
…al paths In the qsvlite variant of dyn_enum_validator_factory: - Move NamedTempFile::new() inside the URL branch so local-file dynamicEnum validation no longer pays for an unused temp file on the hot path. - Replace temp_download.path().to_str().unwrap() with a fallible match that returns a ValidationError on non-UTF-8 paths instead of panicking. - Same treatment for uri_path.to_str().unwrap() on local paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The L8 refactor unified the lite and non-lite dyn_enum_validator_factory variants behind load_dynenum_set, which emits a single error message: "Column '...' not found in lookup table". The test still had a feature- gated assertion expecting the old lite-only wording, which broke CI under the lite feature. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A multi-pass review of
src/cmd/validate.rs(~2,650 lines) surfaced a mix of correctness bugs, perf-leaning cleanups, and polish items. This PR lands all three groups as separate commits so each can be read and reverted independently.Commit 1 —
fix(validate): correctness fixes in JSON Schema validationUniqueCombinedWithValidator::is_validalways returnsfalse, forcing callers throughvalidate()(which atomically check-and-inserts under lock). The previous impl readseen_combinationsbut never inserted, so any caller invokingValidator::is_valid()against auniqueCombinedWithschema would let duplicates pass silently.detect_custom_schema_features). The substring approach was whitespace-sensitive ("format" : "currency"silently disabled the custom format) and could false-match ondescription/titletext containing literal keyword names. The tree walk runs once after the schema is parsed.load_jsonnow checksresponse.status().is_success()and propagatesresponse.text()errors instead of.unwrap_or_default()flowing empty bodies into the JSON parser as confusing parse errors.display_textanddomain_literaltoggles explicitly in both directions (no more reliance on jsonschema crate defaults that may flip in future versions).valid_flagsis now extended bybatch.len()rather thanbatch_size, so the last partial batch doesn't leave trailing valid flags for nonexistent rows.serde_json/preserve_orderdependency at theuniqueCombinedWithindex-resolution site.validate_schema_subcommand_with_invalid_url_schemato accept the new"HTTP error"substring.Commit 2 —
refactor(validate): perf and cleanup tweaksUniqueCombinedWithValidator:RwLock<HashSet>→Mutex<HashSet>(validate path always takes the write lock; no read-only path benefits from RwLock).BitVec::set_unchecked(unsafe) →BitVec::set(safe). This path is dominated by validator work, so the micro-opt is irrelevant.ByteRecord-field convention used to pass the row index into the parallel closure.split_invalid_recordsexplaining the intentional re-read (page cache + all-valid hot path) instead of streaming.load_dynenum_sethelper; lite and non-lite factories now delegate column resolution + drain to one function instead of duplicating ~60 lines.Commit 3 —
chore(validate): polish — re-entrant safety, arg checks, test hygieneOnceLock::set(...).unwrap()/?replaced withget_or_init/let _ = .set()forNULL_TYPE,DELIMITER,QSV_CACHE_DIR,CKAN_API, andCKAN_TOKEN. The previous code panics or returnsErron re-entry (e.g. qsv used as a library, or under qsvmcp).flag_no_headersrejection moved earlier inrun()so we don't open a CSV reader only to immediately drop it on user error..jsonfile appears earlier in the input list — schema detection only inspects the last positional, so a misordered argument silently fell into RFC 4180 mode.test_load_json_via_urlmarked#[ignore]so CI/offline runs are deterministic; run with--ignoredto exercise it.Test plan
cargo build --bin qsv -F all_featurescargo build --bin qsvlite -F litecargo clippy --bin qsv -F all_features -- -D warnings(clean)cargo test -F all_features→ 2715 passed, 0 failed, 29 ignoredcargo test validate -F all_features🤖 Generated with Claude Code