feat: new scoresql cmd#3612
Merged
Merged
Conversation
…execution Analyzes SQL queries against stats, moarstats, and frequency caches of input CSV files to produce a performance score (0-100) with actionable optimization suggestions. Supports both Polars (default) and DuckDB query plan analysis. Scoring covers type optimization, join cardinality, filter selectivity, data distribution, and query anti-pattern detection (SELECT *, ORDER BY without LIMIT, cartesian joins, etc.). Caches are auto-generated when missing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ustness - Fix test feature gate to require both `polars` and `feature_capable` (matching main.rs registration), preventing test failures on polars-only builds - Remove unused SqlInfo fields (_has_group_by, _referenced_tables, _select_columns) and the extract_select_columns function - Improve subquery detection: check for SELECT inside parentheses instead of counting all SELECT occurrences, avoiding false positives from string literals - Fix DuckDB plan table name substitution: sort replacements longest-first to prevent partial matches (e.g., "data" matching inside "data2") - Surface user-visible warnings (wwarn!) when cache generation fails, not just log::warn, so users know scoring may be inaccurate Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Track single-quote state in subquery detector to avoid false positives from SELECT appearing inside string literals (e.g. WHERE col = '(SELECT ...') - Add word-boundary check after SELECT to prevent matching identifiers like SELECTIVITY - Deduplicate table-name replacements to avoid redundant substitutions when a table name happens to equal an alias Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…handling Add a preceding word-boundary check so identifiers like PRESELECT are not falsely detected as subqueries. Also add a comment explaining why SQL '' escaped quotes work correctly via toggle symmetry. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR introduces a new scoresql command to analyze a SQL query against CSV-derived caches (stats/frequency/index) and emit a performance score plus optimization suggestions, aligning with qsv’s “smart command” approach to pre-flight analysis.
Changes:
- Adds a new
scoresqlCLI command (help text, command enum wiring, module export) gated behindpolars+feature_capable. - Implements
src/cmd/scoresql.rsto generate query plans (Polars/DuckDB), load or generate caches, compute a score breakdown, and output human-readable or JSON reports. - Adds an integration test suite for
scoresqland wires it into the test harness under the appropriate feature gates.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/tests.rs | Registers the new test_scoresql module under feature gating. |
| tests/test_scoresql.rs | Adds integration tests validating text/JSON output, suggestions, joins, LIMIT behavior, and output file support. |
| src/main.rs | Exposes scoresql in the command list, command enum, and dispatcher. |
| src/cmd/mod.rs | Exports the new scoresql module under polars + feature gating. |
| src/cmd/scoresql.rs | Implements the scoring command: cache handling, plan generation, heuristic SQL parsing, scoring, and formatting. |
…SQL parsing, regex * Remove incorrect first-line skip in load_stats_cache (stats JSONL has no metadata header) * Sort alias replacements by length descending to prevent partial matches (_t_1 inside _t_10) * Add split_on_operators() to correctly parse column names from `col=value` patterns * Use word-boundary regex in get_duckdb_plan instead of naive String::replace Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Adds a new feature-gated qsv scoresql command that analyzes a SQL query against CSV-derived caches to produce a performance score, query plan, and optimization suggestions before execution.
Changes:
- Introduces
scoresqlcommand implementation (Polars plan analysis by default, optional DuckDB plan mode). - Registers the new command in the CLI command enum/help output under
polars+feature_capable. - Adds an integration test suite covering JSON/human output structure and several scoring heuristics.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/tests.rs | Wires test_scoresql module into the integration test harness behind feature flags. |
| tests/test_scoresql.rs | New integration tests for scoresql output structure and key suggestion heuristics. |
| src/main.rs | Registers scoresql as an enabled command and routes execution to cmd::scoresql::run. |
| src/cmd/scoresql.rs | Implements query parsing heuristics, cache loading/generation, plan extraction, scoring, and output formatting. |
| src/cmd/mod.rs | Exposes the scoresql module behind feature gates. |
…Polars plan - Use labeled loop (`'on_clause`) in `extract_join_columns` so that stop-keywords (WHERE, ORDER, etc.) break the outer token loop, not just the inner operator-split loop. - Apply word-boundary regex replacement in `get_polars_plan` (matching the existing `get_duckdb_plan` strategy) to prevent alias partial matches (e.g., alias "data" inside "metadata_col"). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…, cache freshness, dedup * Use >= in is_cache_fresh to avoid churn on coarse-timestamp filesystems * Deduplicate extracted join columns to prevent double-counting * Pass --delimiter to qsv stats/frequency when generating caches * Use canonical path for cache generation (matches cache lookup path) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ter assertions - Use sort_unstable_by/dedup_by with case-insensitive comparison to match the existing comment's claim of case-insensitive deduplication - Add debug_assert!(delim.is_ascii()) guards in stats/freq cache generation to document the ASCII delimiter assumption Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.