Add a simple forward fill command.#4

jqnatividad · 2020-12-27T18:01:39Z

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

All 24 individual tests pass (all checkmarks). The "FAIL" is a pre-existing vitest configuration issue ("No test suite found") unrelated to my changes — all actual test cases pass. Changes: - In `translateSql`, only the first standalone `_t_1` occurrence gets the `AS _t_1` alias; subsequent standalone occurrences get the bare `readExpr` to avoid duplicate alias issues in UNION/subquery contexts - Added test case for multiple standalone `_t_1` references verifying alias-on-first-only behavior Address review findings (job 115) All 274 tests pass. The changes compile cleanly and tests are green. Changes: - Include Parquet conversion warning in the error path (not just success), so users see the full chain of failures when both DuckDB and sqlp fail - Log when Parquet warning cannot be prepended due to unexpected content structure Address review findings (job 117) All changes are complete. Build passes and all 274 tests pass. Changes: - Fix SQL injection via `compression`/codec parameter: whitelist valid codecs (`zstd`, `snappy`, `gzip`, `lz4`, `uncompressed`) and reject anything else - Fix SQL injection via `delimiter` parameter: validate exactly 1 character and escape single quotes - Fix race condition in `ensureParquet()`: add in-memory lock map to deduplicate concurrent conversions for the same file - Add error checking after each qsv subprocess call in `ensureParquet()`: wrap each step in try/catch with descriptive error messages identifying which step failed - Verify Parquet output file exists after conversion before returning success Address review findings (job 118) No tests reference the codec list. The change is minimal and the build succeeds. Changes: - Add `brotli` and `lzo` to `VALID_PARQUET_CODECS` set to match DuckDB's supported compression codecs Address review findings (job 120) All changes complete. Build succeeds and all 276 tests pass. Changes: - Remove unused `runQsvSimple` import from `duckdb.ts` (finding #1) - Add trust assumption comment documenting that `sql` parameter comes from trusted MCP agent in `tryDuckDbExecution` (finding #2) - Clean up partial `.parquet` file on conversion failure in `doParquetConversion` to prevent stale corrupted files (finding #4) - Detect multi-table SQL queries (`_t_2`, `_t_3`, etc.) and fall back to sqlp preemptively instead of sending untranslated references to DuckDB (finding #5) - Add tests for multi-table reference detection regex and `_t_2+` non-translation behavior Address review findings (job 121) Everything looks clean. Here's a summary: Changes: - Fix multi-table sqlp fallback: stop setting `SKIP_INPUT` for multi-table queries so sqlp receives the original input file and can resolve `_t_2+` references natively - Extract `MULTI_TABLE_PATTERN` regex to a shared constant exported from `duckdb.ts`, eliminating duplication between `mcp-tools.ts` and `duckdb.test.ts` - Clean up asymmetric brace style in the DuckDB/sqlp interception block into a clean if/else Address review findings (job 122) All 276 tests pass. The changes are minimal and focused: Changes: - Remove case-insensitive `/i` flag from `MULTI_TABLE_PATTERN` regex since sqlp table references (`_t_N`) are case-sensitive - Add test cases verifying uppercase `_T_2` and `_T_10` do not match the multi-table pattern Address review findings (job 124) All changes applied, build and tests pass (276/276). Changes: - Add `.sz` compressed file variants to `isCsvLikeFile()` so Snappy-compressed CSV/TSV/TAB/SSV files are recognized and converted to Parquet (finding 1) - Move `ensureParquet()` call after `MULTI_TABLE_PATTERN` check to avoid wasted Parquet conversion on multi-table queries that fall through to sqlp (finding 2) - Restore case-insensitive flag on `MULTI_TABLE_PATTERN` regex for consistency with case-insensitive `_t_1` replacement in `translateSql` (finding 3) - Update test assertions to expect uppercase `_T_2`/`_T_10` to match the now case-insensitive pattern Address review findings (job 125) All 280 tests pass. Here's the summary: Changes: - Reverted `MULTI_TABLE_PATTERN` regex to case-sensitive (removed `/i` flag) since sqlp table references (`_t_N`) are case-sensitive and uppercase `_T_2` would cause query errors - Reverted duckdb tests to assert uppercase `_T_N` does NOT match the pattern - Exported `isCsvLikeFile` function from `mcp-tools.ts` to enable testing - Added unit tests for `isCsvLikeFile` covering standard extensions, `.sz` Snappy-compressed variants, case-insensitivity, and non-CSV rejection Address review findings (job 127) Changes: - `getParquetPath` now uses `basename()` for extension matching, preventing false matches when directory paths contain CSV-like extensions (e.g., `/data/CSV_FILES/test.csv`) - `MULTI_TABLE_PATTERN` made case-insensitive (`/i` flag) to be consistent with the `_t_1` replacement pattern, preventing confusing DuckDB errors when agents send uppercase `_T_2` - `translateSql` now logs a warning for unknown file extensions (e.g., `.xlsx`) that are treated as CSV, helping diagnose cryptic DuckDB errors - Secondary SIGKILL timer in `executeDuckDbQuery` is now properly cleared in both `close` and `error` handlers, preventing unnecessary SIGKILL on already-exited processes - Updated `duckdb.test.ts` to match the case-insensitive `MULTI_TABLE_PATTERN` behavior Address review findings (job 128) All changes are complete. TypeScript compiles cleanly and all 286 tests pass. Changes: - Fix `MULTI_TABLE_PATTERN` flip-flop: make regex case-sensitive again, add `normalizeTableRefs()` to lowercase `_T_N` → `_t_N` before routing, so sqlp receives consistent lowercase refs - Extract `CSV_LIKE_EXTENSIONS` shared constant in `duckdb.ts`, reuse in both `isCsvLikeFile()` and `translateSql()` to eliminate duplicated extension lists - Export `getParquetPath()` and add regression tests for directory-path false-match bug (e.g., `/data/csv_files/test.json`) - Add tests for `normalizeTableRefs()` and its interaction with `MULTI_TABLE_PATTERN` Address review findings (job 129) All 286 tests pass. The diagnostics about `qsv_bin` are pre-existing Rust issues unrelated to these changes. Changes: - Added comment explaining `.txt`/`.dat` are warning-only suppressions, not in `CSV_LIKE_EXTENSIONS` (no Parquet conversion) - Simplified `params.sql` assignment by removing conditional check — always assign normalized SQL Address review findings (job 131) All 286 tests pass. Here's the summary: Changes: - Remove `params.output` fallback in `tryDuckDbExecution` to use only the pre-validated `outputFile` parameter, closing a potential path validation bypass - Add clarifying comment to `getParquetPath` explaining why the mixed-case logic (lowercase matching + original-path slicing) is correct - Add comment to `translateSql` documenting that `_t_1` replacement also applies inside SQL string literals (acceptable limitation for MCP agent usage) Address review findings (job 133) All 287 tests pass, 0 failures. Changes: - Consolidated `ChildProcess` type import with `spawn` into a single `child_process` import in `duckdb.ts` - Extracted inline `maxStderrSize` magic number to named constant `MAX_STDERR_SIZE` alongside `DUCKDB_VALIDATION_TIMEOUT_MS` - Exported `ensureParquet` and added tests for its early-return paths (non-CSV file passthrough) Address review findings (job 135) All 287 tests pass. Changes: - Made `MULTI_TABLE_PATTERN` case-insensitive (`/i` flag) for defense-in-depth, so uppercase `_T_N` references are caught even without `normalizeTableRefs` being called first - Fixed TOCTOU race in `ensureParquet` by checking the lock map before any async `stat()` calls - Added zero-byte file validation after Parquet conversion to catch corrupted/empty output files - Updated test assertions to match the case-insensitive `MULTI_TABLE_PATTERN` behavior Address review findings (job 136) All 287 tests pass. The diagnostics about `qsv_bin` and `lesson_*.rs` are pre-existing issues unrelated to this change. Changes: - Restructured zero-byte file validation in `doParquetConversion` to use separate try/catch for `stat()` and a plain `if` for the size check, eliminating brittle string-matching error re-throw logic Address review findings (job 138) All changes are complete. Build passes and all 289 tests pass. Changes: - Fix `translateSql` to skip `_t_1` replacements inside single-quoted SQL string literals using a quote-aware regex - Fix `normalizeTableRefs` regex to handle both `_T_N` and `_t_N` variants (was only matching uppercase `_T_`) - Prefix delimiter validation error with `[DuckDB]` for clearer error attribution - Document `getParquetPath` behavior for non-CSV files (appends `.parquet` suffix) - Add tests for string literal protection (`SELECT '_t_1'` preserved) and escaped quotes - Add test for `normalizeTableRefs` idempotency on already-lowercase refs Address review findings (job 140) All 290 tests pass. The changes are minimal and focused on the review findings. Changes: - Added comment documenting that multi-table queries (`_t_2+`) don't benefit from Parquet auto-conversion, directing users to manually convert with `qsv_to_parquet` - Added comment clarifying delimiter validation: actual tab character (`\t`, length 1) works correctly; only the 2-char string literal `\\t` would be rejected - Added test verifying multi-character delimiter strings are rejected by `translateSql` Address review findings (job 142) All 290 tests pass, build succeeds. Here's a summary: Changes: - Fix `translateSql` to use `(read_expr) AS _t_1` alias so qualified column refs like `_t_1.column` remain valid (only standalone `_t_1` not followed by `.` is replaced) - Reuse `translateSql()` in sqlp fallback path instead of duplicating the `_t_1` → `read_parquet()` regex logic inline - Add early `inputFile` existence check in `doParquetConversion` with clear "Input file not found" error message - Simplify the stats/schema freshness check by removing unnecessary nested try/catch (input file is now validated upfront) - Update all test assertions to match the new aliased read expression format

All 421 tests pass, 0 failures. The change is correct. Changes: - Fix Unicode truncation fast-path to use UTF-16 length as a cheap guard (strings shorter in UTF-16 are guaranteed shorter in codepoints), only performing expensive `Array.from()` codepoint conversion when the string exceeds the limit Address review findings (job 606) All 72 tests pass (0 failures), including the 3 new tests for missing params. Changes: - Check for `null`/`undefined` params explicitly before string coercion in `handleLogCall`, returning clear "is required" error messages (finding #1) - Trim and strip newlines from log messages before writing, preventing multi-line log entries and inconsistent whitespace (findings #2, #3) - Added tests for missing `entry_type`, missing `message`, and entirely empty params (finding #5) Address review findings (job 607) All 418 tests pass, including the new one. Changes: - Add test for newline-only message (`'\n\n'`) confirming it's rejected as non-empty string Address review findings (job 609) All 74 tests pass (0 failures), including all the new and existing `handleLogCall` tests. Changes: - Log `catch` block now writes error details to stderr via `console.error` instead of silently swallowing - Added `--` separator before the message argument in `qsv log` CLI call to prevent messages starting with `-` from being misinterpreted as flags - Documented newline collapsing behavior in the tool description ("Newlines are collapsed to spaces") - Added test for non-string type coercion (`{ entry_type: 123, message: true }`) confirming `String()` coercion behavior Address review findings (job 610) All 420 tests pass. Changes: - Include truncated error message in the success result returned to the agent (not just stderr), so the agent has actionable context when `qsv_log` write fails - Add test for non-string message coercion with valid `entry_type` to verify `String()` coercion works for the message path Address review findings (job 611) All 420 tests pass. The Rust diagnostics are pre-existing and unrelated to this change. Changes: - Added `assert.ok(!result.isError)` to the `handleLogCall` non-string message coercion test to explicitly verify the result is not an error, making the test intent clearer Address review findings (job 613) All 420 tests pass, 0 failures. The changes are verified. Changes: - Added comment on `--` separator in `handleLogCall` args explaining it guards against messages starting with `-` being parsed as flags (addresses medium finding) - Added `config.qsvValidation.valid` skip guard to `handleLogCall coerces non-string message` test so it properly tests the success path instead of passing accidentally via error swallowing (addresses low finding #4) - Added assertion that success response doesn't contain "warning" to confirm actual success vs swallowed error Address review findings (job 615) No CLAUDE.md changes needed for the `--` removal. All changes are complete and tests pass. Changes: - Remove unnecessary `--` end-of-options sentinel from `qsv log` args — `qsv log` uses docopt variadic `[<message>...]` which handles this correctly, and messages always start with `[entry_type]` so they can never be misinterpreted as flags - Fix Unicode-safe truncation using `Array.from()` instead of `String.slice()` to avoid splitting surrogate pairs in non-ASCII messages - Add throttling guidance to server instructions ("Avoid excessive logging — for simple interactions, a single user_prompt + result_summary pair is enough") - Add test for the `handleLogCall` error-swallowing catch path using a non-existent working directory Address review findings (job 616) The change looks correct. The length check and truncation now both operate on codepoints consistently. Changes: - Fix Unicode truncation length mismatch: use codepoint count (`Array.from(sanitized).length`) for both the gate condition and the truncation, avoiding inconsistency between UTF-16 `.length` and codepoint-aware `Array.from().slice()` Address review findings (job 618) All 421 tests pass, 0 failures. All `handleLogCall` tests pass including the updated write-failure test. Changes: - Reworded catch-path message from misleading `"Logged ... (warning: write failed: ...)"` to clearer `"Log write failed (non-fatal): ... Workflow continues."` (issue 1) - Added fast-path optimization for Unicode truncation: only call `Array.from()` when `sanitized.length > MAX_LOG_MESSAGE_LEN`, avoiding unnecessary codepoint conversion on short messages (issue 3) - Updated test assertions to match the new error message wording

* feat(mcp): add qsv_log core tool for agent-initiated reproducibility logging Enable agents to write structured entries (user_prompt, agent_reasoning, agent_action, result_summary, note) to the qsv audit log (qsvmcp.log) with u- prefixed UUIDs, distinct from automatic s-/e- audit entries. Automatic audit logging is skipped for qsv_log calls to avoid recursion. Messages are truncated at 4096 chars and logging failures never break the workflow. Server instructions updated to guide agents on when/how to log for third-party reproducibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Address review findings (job 619) All 421 tests pass, 0 failures. The change is correct. Changes: - Fix Unicode truncation fast-path to use UTF-16 length as a cheap guard (strings shorter in UTF-16 are guaranteed shorter in codepoints), only performing expensive `Array.from()` codepoint conversion when the string exceeds the limit Address review findings (job 606) All 72 tests pass (0 failures), including the 3 new tests for missing params. Changes: - Check for `null`/`undefined` params explicitly before string coercion in `handleLogCall`, returning clear "is required" error messages (finding #1) - Trim and strip newlines from log messages before writing, preventing multi-line log entries and inconsistent whitespace (findings #2, #3) - Added tests for missing `entry_type`, missing `message`, and entirely empty params (finding #5) Address review findings (job 607) All 418 tests pass, including the new one. Changes: - Add test for newline-only message (`'\n\n'`) confirming it's rejected as non-empty string Address review findings (job 609) All 74 tests pass (0 failures), including all the new and existing `handleLogCall` tests. Changes: - Log `catch` block now writes error details to stderr via `console.error` instead of silently swallowing - Added `--` separator before the message argument in `qsv log` CLI call to prevent messages starting with `-` from being misinterpreted as flags - Documented newline collapsing behavior in the tool description ("Newlines are collapsed to spaces") - Added test for non-string type coercion (`{ entry_type: 123, message: true }`) confirming `String()` coercion behavior Address review findings (job 610) All 420 tests pass. Changes: - Include truncated error message in the success result returned to the agent (not just stderr), so the agent has actionable context when `qsv_log` write fails - Add test for non-string message coercion with valid `entry_type` to verify `String()` coercion works for the message path Address review findings (job 611) All 420 tests pass. The Rust diagnostics are pre-existing and unrelated to this change. Changes: - Added `assert.ok(!result.isError)` to the `handleLogCall` non-string message coercion test to explicitly verify the result is not an error, making the test intent clearer Address review findings (job 613) All 420 tests pass, 0 failures. The changes are verified. Changes: - Added comment on `--` separator in `handleLogCall` args explaining it guards against messages starting with `-` being parsed as flags (addresses medium finding) - Added `config.qsvValidation.valid` skip guard to `handleLogCall coerces non-string message` test so it properly tests the success path instead of passing accidentally via error swallowing (addresses low finding #4) - Added assertion that success response doesn't contain "warning" to confirm actual success vs swallowed error Address review findings (job 615) No CLAUDE.md changes needed for the `--` removal. All changes are complete and tests pass. Changes: - Remove unnecessary `--` end-of-options sentinel from `qsv log` args — `qsv log` uses docopt variadic `[<message>...]` which handles this correctly, and messages always start with `[entry_type]` so they can never be misinterpreted as flags - Fix Unicode-safe truncation using `Array.from()` instead of `String.slice()` to avoid splitting surrogate pairs in non-ASCII messages - Add throttling guidance to server instructions ("Avoid excessive logging — for simple interactions, a single user_prompt + result_summary pair is enough") - Add test for the `handleLogCall` error-swallowing catch path using a non-existent working directory Address review findings (job 616) The change looks correct. The length check and truncation now both operate on codepoints consistently. Changes: - Fix Unicode truncation length mismatch: use codepoint count (`Array.from(sanitized).length`) for both the gate condition and the truncation, avoiding inconsistency between UTF-16 `.length` and codepoint-aware `Array.from().slice()` Address review findings (job 618) All 421 tests pass, 0 failures. All `handleLogCall` tests pass including the updated write-failure test. Changes: - Reworded catch-path message from misleading `"Logged ... (warning: write failed: ...)"` to clearer `"Log write failed (non-fatal): ... Workflow continues."` (issue 1) - Added fast-path optimization for Unicode truncation: only call `Array.from()` when `sanitized.length > MAX_LOG_MESSAGE_LEN`, avoiding unnecessary codepoint conversion on short messages (issue 3) - Updated test assertions to match the new error message wording * fix(mcp): address Copilot review findings for qsv_log - Move skipAuditLog from "Key Constants" to a behavior note in CLAUDE.md (it's a local variable, not a module-level constant) - Reorder enum and LOG_ENTRY_TYPES Set to match description order (reasoning before action) - Add unique temp dir + cleanup to coercion test to prevent log file accumulation in OS temp root Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

…ls (#3734) * feat(generators): detect required options from Usage: line Both help_markdown_gen.rs and mcp_skills_gen.rs now identify options shown outside [options]/[...] groups in the USAGE's `Usage:` section (e.g. `qsv implode [options] -k <keys> -v <value>`) and mark them accordingly. - `docs/help/*.md`: required options get ` **(required)**` appended to their description column in the options table. - `.claude/skills/qsv/*.json`: option entries gain `"required": true` when the flag is required. Optional options continue to emit nothing (the field is skipped when absent). A small wrinkle worth noting: qsv-docopt's Parser does not always emit Atom::Short entries paired with the Long atom for the `-k, --keys` declaration style, so we can't rely on its pairing to expand short↔long forms. Both generators do their own pairing pass by scanning the options sections for `-X, --xxx` declarations. Closes roborev review #1618 findings #3 and #4 at a project-wide (generator) level. Findings #1 (empty positional arg descriptions) and #2 (unfenced CSV examples) remain as separate work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: regenerate help markdown and MCP skills for required-option markers Commands with required options in their Usage: line now show them: - applydp: --new-column, --replacement, --formatstr - apply, describegpt, fetchpost, implode, joinp, luau, py, split: various required options previously unmarked All regenerated via `qsv --generate-help-md` and `qsv --update-mcp-skills`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(generators): only mark an option required when it's required in all Usage variants The previous heuristic produced seven false positives flagged by roborev review #1624. The detector now: 1. Computes a per-Usage-variant required set (tokens outside `[...]` AND outside any `(A | B)` alternative group), then takes the intersection across all non-`--help` Usage lines. An option must be required in every variant to be marked globally required. 2. Handles `(A | B | C)` alternative groups by masking them out entirely — inside alternatives no individual token is required. 3. Expands short→long aliases per Usage line, before intersection, so `-n`'s use as a positional-style flag on one Usage variant doesn't leak into `--no-headers` as globally required. Fixes false-positive required markers on: - split: `--size` / `--chunks` / `--kb-size` (alternative group) - joinp: `--cross` / `--non-equi` (separate Usage lines) - apply / applydp: `--new-column` / `--replacement` / `--formatstr` (subcommand-scoped, not global) - describegpt: `--prepare-context` / `--process-response` (separate Usage lines) - fetchpost: `--payload-tpl` (alternative inside `(A | B)`) - luau / py: `--no-headers` (the `-n <main-script>` Usage role only appears in one variant; intersection excludes it) - py: `--helper` (only required on one Usage variant) Implode's `--keys` / `--value` markers are preserved (genuinely required in the single Usage variant). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(generators): share required-options detection in a common module Addresses roborev review #1625. Extracts the previously-duplicated required-option detection (and its helpers) out of both generators and into a new crate::generators_common module. Both mcp_skills_gen and help_markdown_gen now delegate to it, keeping their detection semantics in lockstep. Fixes additional review findings while consolidating: - Bidirectional short↔long expansion via a new FlagPairs type, so Usage lines that mention only the long form also surface the short form in the required set (and vice versa). - Bracket-depth is now u32, and uses u32::saturating_sub so an unbalanced `]` cannot underflow the counter (on i32 it would have saturated at i32::MIN, silently dropping later required tokens). - The pair regex now matches long-first (`--keys, -k`) declarations in addition to short-first (`-k, --keys`). - The pair regex scans only the options-declaration portion of the USAGE string (after the Usage: block), so a future quirk in the Usage: block can't introduce a bogus pair. Adds 12 unit tests covering: single-variant required expansion, alternative groups `(A|B|C)`, multi-variant intersection, subcommand- scoped options, short-role collision (luau/py), plain `(X)` grouping without a pipe, nested optional inside an alt group, long-first declarations, long-only Usage mentions, no Usage block, unbalanced brackets, and Usage-block scoping of the pair regex. Regenerator outputs (docs/help and .claude/skills) are unchanged by this refactor (confirmed with --generate-help-md / --update-mcp-skills producing no diffs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(generators_common): narrow options-section scope, cache regexes, handle continuations Addresses roborev review #1626: 1. Options-section scope: replace the blank-line-after-Usage heuristic with a proper line-anchored, case-insensitive `options:` / `Options:` header regex. The previous logic landed on description paragraphs (most qsv USAGEs have a description between the Usage block and the options section), so the pair scan was broader than intended. 2. Fallback for minimal fixtures: if no options header is found, fall back to scanning the whole USAGE so short↔long pairs declared in non-standard layouts (or small test fixtures) still register. 3. Compiled regexes (short-first pair, long-first pair, flag scanner, options header) are now cached via `std::sync::OnceLock`, removing per-call recompilation overhead. 4. Usage-block collection now terminates only on a blank line and merges docopt continuation lines (indented lines within the block that do not begin with `qsv`) into their parent variant, preventing a wrapped Usage line from showing up as a standalone variant and silently narrowing the intersection. Adds three more unit tests covering: continuation-line joining (`continuation_line_does_not_truncate_usage_block`), scope narrowing (`pair_regex_scans_only_the_options_section_not_description`), and the whole-text fallback (`fallback_to_whole_text_when_no_options_section`). 15/15 unit tests pass; regenerator output is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(generators_common): tighten options-header regex and indentation-aware continuation Addresses roborev review #1627: 1. options_section regex: the leading class is now `[ \t\w-]` instead of `[\s\w-]`, so it can't straddle a newline. Trailing `[ \t]*` matches only the same-line whitespace, keeping the "line-anchored" claim in the doc comment literally true. 2. collect_usage_lines: drop the dead `if trimmed.is_empty()` branch — the `take_while(!blank)` already filters blank lines out. 3. collect_usage_lines: continuation detection now prefers indentation depth (leading-whitespace count strictly greater than the parent variant's) with the `qsv`-prefix rule as a tiebreaker. A continuation line whose positional begins with `qsv` (e.g. `qsv-input`) would previously have been treated as a new variant. 4. collect_usage_lines: inline `Usage: qsv foo ...` header variants are now retained. Previously the `Usage:` line was unconditionally dropped, silently losing the only variant for that style of help text. Three additional unit tests lock this in: `Common options:` / `map options:` prefix-word headers both matched; a tab-indented `\toptions:` header; and an inline `Usage: qsv foo ...` variant. 18/18 unit tests pass; full `cargo test` passes; generator outputs unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(generators_common): baseline-indent continuation detection + cleanups Addresses roborev review #1628: 1. Indentation comparison now uses a single *baseline indent* (the leading whitespace of the first non-blank line in the Usage block) instead of comparing each line against the previous variant's indent. This fixes the inline-vs-non-inline asymmetry: an inline `Usage: qsv foo --bar` header was stored trimmed (leading_ws=0), so a following standard- column variant like ` qsv foo --baz` (leading_ws=7) was wrongly merged as a continuation. Inline variants are now synthesized at the baseline indent for consistent comparison. 2. With the baseline-indent rule, indentation genuinely *outranks* any prefix test: continuation == leading_ws > baseline. No more confused comment claiming "tiebreaker" while the code actually OR'd. The `qsv`-prefix check is gone — indentation is the sole signal. 3. `if let Some(last) = variants.last_mut()` replaces the `match + later unwrap()` pattern, removing the SAFETY comment. Three new unit tests lock this in: - `indented_wrap_line_merges_into_parent_variant` - `continuation_starting_with_qsv_prefix_is_still_a_continuation` (a deeper-indented `qsv-foo` continuation must fold, regression for the indentation-outranks-prefix intent) - `inline_usage_plus_indented_second_variant_stays_separate` (regression for the storage asymmetry from finding #1) 21/21 unit tests pass; generator outputs unchanged; clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(generators_common): derive baseline indent from min leading-ws Addresses roborev review #1629: 1. Baseline-indent is now the *minimum* leading whitespace across all non-blank lines in the Usage block (not the leading_ws of raw[0]). In well-formed docopt, continuation lines are always indented deeper than their variants, so min reliably picks the variant column — including when raw[0] happens to be a wrapped continuation of an inline `Usage:` variant (previously baseline would have drifted to the continuation's indent, causing real variants at the standard column to be misclassified). 2. The continuation branch now fails loudly (debug_assert) when hit with an empty variants vec rather than silently promoting the line, so a future refactor of the baseline derivation can't regress unnoticed. 3. Replaced `.map().next().unwrap_or(0)` with `.iter().map(...).min()` — both a cleaner expression and the fix for the baseline-drift bug. 4. Added `inline_usage_with_wrapped_continuation_and_second_variant` to lock in the exact corner case: inline `Usage: qsv foo --bar`, an indented wrap line, and a second variant at the standard column — baseline must land on the standard column, the wrap line must merge into variant 1, and the second variant must stay separate. 5. Doc-comment reworded to describe the actual invariant: "at or below the baseline indent start new variants; strictly deeper lines are merged." 22/22 unit tests pass; generator outputs unchanged; clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(skills): add optional Option.required to TypeScript schema The Rust generator emits `"required": true` on required options in the MCP skill JSON (via the newly-added Option_.required field in mcp_skills_gen.rs). The TypeScript Option interface didn't declare it, so consumers using the typed view wouldn't see the field even though the JSON carries it. Adds `required?: boolean` with a doc comment matching the generator's semantics: emitted only when true, omitted for optional options. Addresses Copilot review on PR #3734. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fetch/fetchpost: review-driven cache, panic and safety fixes - fix disk-cache cache_remove using bare URL instead of the composite key built by the io_cached convert macro, so non-200 responses were never evicted when --cache-error was unset (fetch.rs Disk path; fetchpost.rs InMemory/Disk/Redis paths — all three were broken) - reorder the sel.len() != 1 check before sel.iter().next().unwrap() so an empty URL-column selection returns a friendly error instead of panicking - replace HeaderValue::to_str().unwrap() with .unwrap_or("") in retry-after and ratelimit-* parsing so non-ASCII header bytes can't crash the process - bump the post-error 10ns progress-bar drain sleep to 100ms (one 5Hz cycle) - use match on first byte of --report instead of two to_lowercase() allocs - convert expand_tilde(dir).unwrap() to a graceful fail_clierror! and add // safety: comments to the remaining infallible unwraps per qsv style - update test_fetch::fetch_simple_diskcache: it was asserting cache_hit=1 for error rows on the second pass, encoding the disk-cache bug — now asserts cache_hit=0, matching the corrected eviction behavior Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fetch/fetchpost: address roborev 1680 Low findings - extract cross_session_cache_key() helper in both files so the disk/redis cache-key format string lives in one place; the convert macros and the cache_remove call sites now both go through it, preventing the kind of drift that introduced the original eviction bug - replace the misleading "we only need url in the cache key" comment above fetchpost::get_cached_response with an accurate NOTE that the in-memory cache is keyed on form_body_jsonmap only and flags the URL-collision edge case (pre-existing; widening deferred as a behavior change) - add ASCII-only-by-design comment on the --report first-byte match in both files so future readers don't assume Unicode case folding Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fetch/fetchpost: document JSON response normalization Address review finding #4 by adding a JSON RESPONSE HANDLING section to the USAGE block of both commands. Without --jaq, every successful response is parsed by serde_json and re-serialized, which silently normalizes object key order, integer/float representation (e.g. 1e2 -> 100), and whitespace. Users who need byte-exact server output now have a documented escape hatch (post-process or use --jaq). Help-md regenerated via `qsv --generate-help-md`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fetch/fetchpost: correct JSON RESPONSE HANDLING note Address roborev 1682 findings: - Medium: replace the incorrect "JSON object key order is sorted" claim with "object key order is preserved" — qsv enables the preserve_order feature on serde_json (Cargo.toml:272), so round-tripping through serde_json::Value preserves insertion order - Low: expand "trailing whitespace is stripped" to "all insignificant whitespace is removed (compact) or re-indented (--pretty)" — the prior wording understated CompactFormatter's behavior - Low: add the two normalizations users actually hit on this path — duplicate JSON object keys are collapsed (last value wins), and responses that are not valid JSON are written as an empty cell (or the parse error if --store-error is set) Help-md regenerated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…panic (#3758) * py: review-driven cleanup, hoist Python module setup, fix jagged-row panic, propagate filter errors Match the style of the recent dedup/sort/sortcheck/foreach review-cleanup PRs. Correctness: - Remove `assert!(record.len() == headers_len)` that panicked on jagged CSVs; short rows already yield "" via `unwrap_or_default`, long rows are ignored beyond `headers_len` via `.take()`. - Replace silent `let _ = batch_locals.set_item(...).map_err(...)` (which kept evaluating with stale locals after a counted error) with `?`. - Replace `result.extract().unwrap_or(false)` on the filter cast with `?` so internal extraction failures aren't silently dropped. Performance: - Build helpers/user_helpers PyModules, builtins/math/random/datetime imports, and the QSVRow instance ONCE up front. Hold them as `Py<...>` across `Python::attach` calls and `.bind(py)` per batch. Previously these were rebuilt every batch (default 50k rows) -- 20x wasted work on a 1M-row file. - Replace the per-row `Vec<&str>` argument buffer with an inline `PyList::new` from a `0..headers_len` iterator: one Python allocation, no intermediate Rust Vec, and the `&str` borrows on `record` are released before `push_field`/`write_record`. Cleanups: - Delete duplicate `let debug_flag` shadow. - Remove stale `#[allow(unused_assignments)]` on `batch_record`. - Apply `intern!` to `__builtins__`/`math`/`random`/`datetime`/`col` keys for consistency with the existing `intern!` on `qsv_uh`. - Unify the three "Cannot load ..." error messages to `"Cannot read Python {expression,helper} file '{path}': {e}"` so the user sees which file failed. Fix moved-value shadowing on `helper_file` so the path is available in the Err arm. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * py: address roborev #1721, single-pass row loop, hoist header_vec before new column - Compute header_vec from the input headers BEFORE the map command appends its new output column to `headers`, so header_vec.len() == headers_len always. Drops the `.take(headers_len)` guard in the per-row loop, which was masking the off-by-one in map mode. - Combine the two per-row iterations into a single pass that both sets the per-column local in `batch_locals` and appends to a fresh PyList. Eliminates the duplicate `record.get(i).unwrap_or_default()` calls. Roborev review 1721 also flagged the stricter `?`-based error semantics on `set_item`/`extract` (Low #1) and the up-front module build for empty inputs (Low #4); both are intentional per the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address review: build QSVRow from input headers only, add jagged-row regression test Copilot pointed out that QSVRow was being initialized with `headers.iter()` *after* the map output column had been appended, so `col["<new-column>"]` would map to an out-of-range index and IndexError. Pass only the input headers (`headers.iter().take(headers_len)`) so the QSVRow mapping lines up with the actual row data. Also adds a regression test that locks in the csv-reader-level rejection of jagged rows (with `flexible: false`, the default in python.rs). If someone later flips the reader to `flexible(true)` without thinking through the per-row eval path, this test will fail and force them to revisit the unwrap_or_default behavior added in d12591d. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- (#1 #2) Replace SAMPLE_TEST_PORT + SAMPLE_TEST_HOST (which duplicated the port and could drift) with a single SAMPLE_TEST_PORT + SAMPLE_TEST_BIND_HOST literal. URL-builder and bind() both derive from the same source — no more brittle .split(':').next().unwrap() that would also panic on IPv6 hosts. - (#3) Wrap the ServerHandle in a SampleWebServer RAII guard. The server now stops in Drop, so a panic inside read_stdout / stdout doesn't leak the port and cascade into "Address already in use" on the next #[serial] test. - (#4) Call wrk.assert_success(&mut cmd) before reading stdout in the success-path tests, so a regression that makes qsv exit non-zero surfaces qsv's stderr instead of a generic CSV-parse error. 77/77 sample tests pass; clippy --bin qsv clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…3775) * test/sample: add integration tests for streaming Bernoulli URL path Closes the test-coverage gap flagged in PR #3774. Stands up a local actix-web fixture (port 8082, distinct from test_fetch's 8081) and exercises the boundary detection and validation guards added there: - sample_bernoulli_url_quoted_newline_header: header field 0 contains a literal `\n` inside an RFC-4180 quote. Asserts the header arrives intact (3 fields, embedded newline preserved) and that every emitted data row also has 3 fields. Old code would have split on the raw byte and corrupted every following record. - sample_bernoulli_url_max_size_truncation: serves a ~1.2 MiB CSV with fixed 100-byte records so `--max-size 1` cuts deterministically inside record 10486. Asserts max id <= 10485 (no half-record at the cap) and that every emitted row is well-formed. - sample_bernoulli_url_404_fails_fast: hits an unmapped path on the fixture server. Asserts qsv exits with error instead of feeding the HTML 404 body into the csv parser (regression for the missing `error_for_status()`). - sample_bernoulli_url_custom_delimiter: serves a TSV and passes `--delimiter '\t'`. Reads raw stdout and splits on tab (the writer also honors --delimiter, so read_stdout's comma parser would collapse rows). Asserts header and data rows split into 3 fields. Tests use #[serial] so they don't race on the port. 77/77 sample tests pass; clippy --bin qsv clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * typo: mis-split->split incorrectly * test/sample: address review feedback on streaming Bernoulli tests - (#1 #2) Replace SAMPLE_TEST_PORT + SAMPLE_TEST_HOST (which duplicated the port and could drift) with a single SAMPLE_TEST_PORT + SAMPLE_TEST_BIND_HOST literal. URL-builder and bind() both derive from the same source — no more brittle .split(':').next().unwrap() that would also panic on IPv6 hosts. - (#3) Wrap the ServerHandle in a SampleWebServer RAII guard. The server now stops in Drop, so a panic inside read_stdout / stdout doesn't leak the port and cascade into "Address already in use" on the next #[serial] test. - (#4) Call wrk.assert_success(&mut cmd) before reading stdout in the success-path tests, so a regression that makes qsv exit non-zero surfaces qsv's stderr instead of a generic CSV-parse error. 77/77 sample tests pass; clippy --bin qsv clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test/sample: address Copilot review on PR #3775 — single-run cmd, server start timeout - Replace the assert_success-then-read_stdout double-run pattern with a single capture-and-parse helper. The previous shape ran qsv twice per test, doubling fixture-server requests (and the ~1.2 MiB max-size download) and meaning the parsed stdout came from a different execution than the one whose status was asserted. - Added run_and_assert_success(): runs once, asserts status, returns Output (with stderr surfaced on failure). - Added parse_csv_stdout(): mirrors wrk.read_stdout's Vec<Vec<String>> shape but reads from a captured buffer. - All three success-path tests (quoted newline header, max-size truncation, custom delimiter) now use these helpers. - Switch the SampleWebServer startup channel to send Result<ServerHandle, String> and use recv_timeout(10s) instead of recv(). A failed bind (e.g., port already in use) used to leave start() blocked forever; it now panics fast with the bind error surfaced from the server thread. 77/77 sample tests pass; clippy --bin qsv clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Dedupe build_large_oom_csv into tests/workdir.rs so test_stats and test_frequency share one source of truth (Low #1). - Document the pre-indexed + OOM → sketch fallback path in --memcheck USAGE text, CHANGELOG, and docs/STATS_DEFINITIONS.md (Low #2). - Drop the dead flag_sketch_method='frequent_items' assignment before run_frequent_items — confirmed run_frequent_items does not consult flag_sketch_method (Low #3). - Tighten the stats and frequency OOM wwarn messages to "Re-run with explicit ... exact to disable the auto-enable" — matches the established frequency wording and removes the misleading "override" phrasing (Low #4). Verified Low #5 separately: which_stats() already gates mad on !approx_quantiles regardless of flag_everything/flag_mad, so the auto-disable promised by the wwarn is honored. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

) * feat(stats,frequency): auto-enable DataSketches estimators on OOM When --memcheck is set and util::mem_file_check returns OutOfMemory, stats and frequency now auto-enable their DataSketches-backed estimators (t-digest + HyperLogLog for stats; Misra-Gries Frequent Items for frequency) in addition to the existing auto-index fallback. Conflict guards mirror the explicit-validation rejections so the auto-enable only flips methods that would have passed validation if set by hand. A wwarn! lists the auto-enabled estimators; the original OOM is only propagated when neither fallback engages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(stats,frequency): address roborev #2028 findings - Dedupe build_large_oom_csv into tests/workdir.rs so test_stats and test_frequency share one source of truth (Low #1). - Document the pre-indexed + OOM → sketch fallback path in --memcheck USAGE text, CHANGELOG, and docs/STATS_DEFINITIONS.md (Low #2). - Drop the dead flag_sketch_method='frequent_items' assignment before run_frequent_items — confirmed run_frequent_items does not consult flag_sketch_method (Low #3). - Tighten the stats and frequency OOM wwarn messages to "Re-run with explicit ... exact to disable the auto-enable" — matches the established frequency wording and removes the misleading "override" phrasing (Low #4). Verified Low #5 separately: which_stats() already gates mad on !approx_quantiles regardless of flag_everything/flag_mad, so the auto-disable promised by the wwarn is honored. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address Copilot review: stream OOM fixture, drop drift-prone line refs, add -- prefix - tests/workdir.rs: rewrite build_large_oom_csv to stream rows directly to a csv::Writer instead of building a 10M-row Vec in memory first. Avoids ~1.5 GB of String allocation that would OOM the test harness itself on memory-constrained hosts, defeating the purpose of the ignored OOM tests. - src/cmd/stats.rs, src/cmd/frequency.rs: replace hard-coded intra-file line-number references in the try_enable_approx_sketches and can_enable_frequent_items doc-comments with descriptive references to the validator/dispatch blocks they mirror. - src/cmd/frequency.rs: add the missing -- prefix to "sketch-method frequent_items" in the --memcheck USAGE help text; regenerate docs/help/frequency.md and .claude/skills/qsv/qsv-frequency.json. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(stats,frequency): clarify OOM fallback fires in both NORMAL and CONSERVATIVE mode The OOM auto-fallback to DataSketches estimators is gated on the result of util::mem_file_check, NOT on whether --memcheck is set. The in-memory load check runs unconditionally on the non-parallel path; --memcheck only switches the check from NORMAL mode (vs. total memory) to the stricter CONSERVATIVE mode (vs. available + swap × platform factor). The fallback can therefore trigger without --memcheck too — just much less often, since NORMAL mode only trips when the file is larger than ~80% of total RAM. Rewrote --memcheck USAGE in stats.rs and frequency.rs to: - Lead with what --memcheck actually does (CONSERVATIVE vs. NORMAL). - Reference QSV_MEMORY_CHECK as the env-var equivalent. - Describe the OOM fallback as a behavior of the load check itself, not of --memcheck specifically. Updated CHANGELOG.md and docs/STATS_DEFINITIONS.md to match. Regenerated docs/help/{stats,frequency}.md and the corresponding MCP skill JSONs from the new USAGE. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address Copilot review: real opt-out for OOM auto-enable + assert command success in tests Two issues raised by Copilot on the OOM auto-fallback: (1) The wwarn promised users could re-run with explicit --quantile-method exact / --cardinality-method exact / --sketch-method exact to disable the auto-enable, but the code only checked the parsed flag value. docopt fills in the default "exact" either way, so an explicit --foo-method exact was indistinguishable from omitting the flag — making the documented opt-out a no-op. Fix: scan argv for the literal flag names ("--foo-method" or "--foo-method=...") to detect explicit user intent. Thread that through try_enable_approx_sketches / can_enable_frequent_items via new user_set_* parameters; the auto-enable is suppressed when the user explicitly provided the flag (regardless of value). Documented in STATS_DEFINITIONS.md. (2) The new OOM tests used wrk.output_stderr, which returns stderr regardless of exit status — a command that errored out after printing the auto-enable wwarn would still pass the test. Fix: add a wrk.stderr_on_success helper that asserts status.success() before returning stderr (with the same diagnostic-rich panic format as assert_success). Migrate the 6 new stats/frequency OOM tests to use it. Other call sites of output_stderr left untouched — they test failure paths where non-success is intentional. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(frequency): restore can_enable_frequent_items doc comment Address roborev #2032: in the previous commit (d9fe03e), `argv_has_flag` and its doc comment were inserted *between* `can_enable_frequent_items`'s doc comment and the function body. Because Rust doc comments attach to the next item, the entire docstring block (including the `user_set_sketch_method` paragraph and the trailing "Returns false if any conflicting flag is set..." line) bound to `argv_has_flag`, leaving `can_enable_frequent_items` with no doc comment at all. Move `argv_has_flag` (with its own 4-line doc) to live *after* `can_enable_frequent_items`, mirroring the layout in stats.rs where the ordering was correct. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a simple forward fill command.

9432729

jqnatividad merged commit 31a13ac into dathere:master Dec 27, 2020

jqnatividad added a commit that referenced this pull request Dec 27, 2025

docs: frequency add doc string per GH Copilot suggestion #4

839a018

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

jqnatividad added a commit that referenced this pull request Dec 27, 2025

docs: accept GH Copilot review #4

e8cc5ff

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

jqnatividad added a commit that referenced this pull request Jan 1, 2026

rename test to reflect new xsd_gdata_scan mode names #4

f7e4905

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a simple forward fill command.#4

Add a simple forward fill command.#4
jqnatividad merged 1 commit into
dathere:masterfrom
alexrudy:cmd-fill

jqnatividad commented Dec 27, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jqnatividad commented Dec 27, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants