Skip to content

Add a simple forward fill command.#4

Merged
jqnatividad merged 1 commit into
dathere:masterfrom
alexrudy:cmd-fill
Dec 27, 2020
Merged

Add a simple forward fill command.#4
jqnatividad merged 1 commit into
dathere:masterfrom
alexrudy:cmd-fill

Conversation

@jqnatividad

Copy link
Copy Markdown
Collaborator

@jqnatividad jqnatividad merged commit 31a13ac into dathere:master Dec 27, 2020
jqnatividad added a commit that referenced this pull request Dec 27, 2025
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
jqnatividad added a commit that referenced this pull request Dec 27, 2025
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
jqnatividad added a commit that referenced this pull request Jan 1, 2026
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
jqnatividad added a commit that referenced this pull request Feb 16, 2026
All 24 individual tests pass (all checkmarks). The "FAIL" is a pre-existing vitest configuration issue ("No test suite found") unrelated to my changes — all actual test cases pass.
Changes:
- In `translateSql`, only the first standalone `_t_1` occurrence gets the `AS _t_1` alias; subsequent standalone occurrences get the bare `readExpr` to avoid duplicate alias issues in UNION/subquery contexts
- Added test case for multiple standalone `_t_1` references verifying alias-on-first-only behavior

Address review findings (job 115)

All 274 tests pass. The changes compile cleanly and tests are green.
Changes:
- Include Parquet conversion warning in the error path (not just success), so users see the full chain of failures when both DuckDB and sqlp fail
- Log when Parquet warning cannot be prepended due to unexpected content structure

Address review findings (job 117)

All changes are complete. Build passes and all 274 tests pass.
Changes:
- Fix SQL injection via `compression`/codec parameter: whitelist valid codecs (`zstd`, `snappy`, `gzip`, `lz4`, `uncompressed`) and reject anything else
- Fix SQL injection via `delimiter` parameter: validate exactly 1 character and escape single quotes
- Fix race condition in `ensureParquet()`: add in-memory lock map to deduplicate concurrent conversions for the same file
- Add error checking after each qsv subprocess call in `ensureParquet()`: wrap each step in try/catch with descriptive error messages identifying which step failed
- Verify Parquet output file exists after conversion before returning success

Address review findings (job 118)

No tests reference the codec list. The change is minimal and the build succeeds.
Changes:
- Add `brotli` and `lzo` to `VALID_PARQUET_CODECS` set to match DuckDB's supported compression codecs

Address review findings (job 120)

All changes complete. Build succeeds and all 276 tests pass.
Changes:
- Remove unused `runQsvSimple` import from `duckdb.ts` (finding #1)
- Add trust assumption comment documenting that `sql` parameter comes from trusted MCP agent in `tryDuckDbExecution` (finding #2)
- Clean up partial `.parquet` file on conversion failure in `doParquetConversion` to prevent stale corrupted files (finding #4)
- Detect multi-table SQL queries (`_t_2`, `_t_3`, etc.) and fall back to sqlp preemptively instead of sending untranslated references to DuckDB (finding #5)
- Add tests for multi-table reference detection regex and `_t_2+` non-translation behavior

Address review findings (job 121)

Everything looks clean. Here's a summary:
Changes:
- Fix multi-table sqlp fallback: stop setting `SKIP_INPUT` for multi-table queries so sqlp receives the original input file and can resolve `_t_2+` references natively
- Extract `MULTI_TABLE_PATTERN` regex to a shared constant exported from `duckdb.ts`, eliminating duplication between `mcp-tools.ts` and `duckdb.test.ts`
- Clean up asymmetric brace style in the DuckDB/sqlp interception block into a clean if/else

Address review findings (job 122)

All 276 tests pass. The changes are minimal and focused:
Changes:
- Remove case-insensitive `/i` flag from `MULTI_TABLE_PATTERN` regex since sqlp table references (`_t_N`) are case-sensitive
- Add test cases verifying uppercase `_T_2` and `_T_10` do not match the multi-table pattern

Address review findings (job 124)

All changes applied, build and tests pass (276/276).
Changes:
- Add `.sz` compressed file variants to `isCsvLikeFile()` so Snappy-compressed CSV/TSV/TAB/SSV files are recognized and converted to Parquet (finding 1)
- Move `ensureParquet()` call after `MULTI_TABLE_PATTERN` check to avoid wasted Parquet conversion on multi-table queries that fall through to sqlp (finding 2)
- Restore case-insensitive flag on `MULTI_TABLE_PATTERN` regex for consistency with case-insensitive `_t_1` replacement in `translateSql` (finding 3)
- Update test assertions to expect uppercase `_T_2`/`_T_10` to match the now case-insensitive pattern

Address review findings (job 125)

All 280 tests pass. Here's the summary:
Changes:
- Reverted `MULTI_TABLE_PATTERN` regex to case-sensitive (removed `/i` flag) since sqlp table references (`_t_N`) are case-sensitive and uppercase `_T_2` would cause query errors
- Reverted duckdb tests to assert uppercase `_T_N` does NOT match the pattern
- Exported `isCsvLikeFile` function from `mcp-tools.ts` to enable testing
- Added unit tests for `isCsvLikeFile` covering standard extensions, `.sz` Snappy-compressed variants, case-insensitivity, and non-CSV rejection

Address review findings (job 127)

Changes:
- `getParquetPath` now uses `basename()` for extension matching, preventing false matches when directory paths contain CSV-like extensions (e.g., `/data/CSV_FILES/test.csv`)
- `MULTI_TABLE_PATTERN` made case-insensitive (`/i` flag) to be consistent with the `_t_1` replacement pattern, preventing confusing DuckDB errors when agents send uppercase `_T_2`
- `translateSql` now logs a warning for unknown file extensions (e.g., `.xlsx`) that are treated as CSV, helping diagnose cryptic DuckDB errors
- Secondary SIGKILL timer in `executeDuckDbQuery` is now properly cleared in both `close` and `error` handlers, preventing unnecessary SIGKILL on already-exited processes
- Updated `duckdb.test.ts` to match the case-insensitive `MULTI_TABLE_PATTERN` behavior

Address review findings (job 128)

All changes are complete. TypeScript compiles cleanly and all 286 tests pass.
Changes:
- Fix `MULTI_TABLE_PATTERN` flip-flop: make regex case-sensitive again, add `normalizeTableRefs()` to lowercase `_T_N` → `_t_N` before routing, so sqlp receives consistent lowercase refs
- Extract `CSV_LIKE_EXTENSIONS` shared constant in `duckdb.ts`, reuse in both `isCsvLikeFile()` and `translateSql()` to eliminate duplicated extension lists
- Export `getParquetPath()` and add regression tests for directory-path false-match bug (e.g., `/data/csv_files/test.json`)
- Add tests for `normalizeTableRefs()` and its interaction with `MULTI_TABLE_PATTERN`

Address review findings (job 129)

All 286 tests pass. The diagnostics about `qsv_bin` are pre-existing Rust issues unrelated to these changes.
Changes:
- Added comment explaining `.txt`/`.dat` are warning-only suppressions, not in `CSV_LIKE_EXTENSIONS` (no Parquet conversion)
- Simplified `params.sql` assignment by removing conditional check — always assign normalized SQL

Address review findings (job 131)

All 286 tests pass. Here's the summary:
Changes:
- Remove `params.output` fallback in `tryDuckDbExecution` to use only the pre-validated `outputFile` parameter, closing a potential path validation bypass
- Add clarifying comment to `getParquetPath` explaining why the mixed-case logic (lowercase matching + original-path slicing) is correct
- Add comment to `translateSql` documenting that `_t_1` replacement also applies inside SQL string literals (acceptable limitation for MCP agent usage)

Address review findings (job 133)

All 287 tests pass, 0 failures.
Changes:
- Consolidated `ChildProcess` type import with `spawn` into a single `child_process` import in `duckdb.ts`
- Extracted inline `maxStderrSize` magic number to named constant `MAX_STDERR_SIZE` alongside `DUCKDB_VALIDATION_TIMEOUT_MS`
- Exported `ensureParquet` and added tests for its early-return paths (non-CSV file passthrough)

Address review findings (job 135)

All 287 tests pass.
Changes:
- Made `MULTI_TABLE_PATTERN` case-insensitive (`/i` flag) for defense-in-depth, so uppercase `_T_N` references are caught even without `normalizeTableRefs` being called first
- Fixed TOCTOU race in `ensureParquet` by checking the lock map before any async `stat()` calls
- Added zero-byte file validation after Parquet conversion to catch corrupted/empty output files
- Updated test assertions to match the case-insensitive `MULTI_TABLE_PATTERN` behavior

Address review findings (job 136)

All 287 tests pass. The diagnostics about `qsv_bin` and `lesson_*.rs` are pre-existing issues unrelated to this change.
Changes:
- Restructured zero-byte file validation in `doParquetConversion` to use separate try/catch for `stat()` and a plain `if` for the size check, eliminating brittle string-matching error re-throw logic

Address review findings (job 138)

All changes are complete. Build passes and all 289 tests pass.
Changes:
- Fix `translateSql` to skip `_t_1` replacements inside single-quoted SQL string literals using a quote-aware regex
- Fix `normalizeTableRefs` regex to handle both `_T_N` and `_t_N` variants (was only matching uppercase `_T_`)
- Prefix delimiter validation error with `[DuckDB]` for clearer error attribution
- Document `getParquetPath` behavior for non-CSV files (appends `.parquet` suffix)
- Add tests for string literal protection (`SELECT '_t_1'` preserved) and escaped quotes
- Add test for `normalizeTableRefs` idempotency on already-lowercase refs

Address review findings (job 140)

All 290 tests pass. The changes are minimal and focused on the review findings.
Changes:
- Added comment documenting that multi-table queries (`_t_2+`) don't benefit from Parquet auto-conversion, directing users to manually convert with `qsv_to_parquet`
- Added comment clarifying delimiter validation: actual tab character (`\t`, length 1) works correctly; only the 2-char string literal `\\t` would be rejected
- Added test verifying multi-character delimiter strings are rejected by `translateSql`

Address review findings (job 142)

All 290 tests pass, build succeeds. Here's a summary:
Changes:
- Fix `translateSql` to use `(read_expr) AS _t_1` alias so qualified column refs like `_t_1.column` remain valid (only standalone `_t_1` not followed by `.` is replaced)
- Reuse `translateSql()` in sqlp fallback path instead of duplicating the `_t_1` → `read_parquet()` regex logic inline
- Add early `inputFile` existence check in `doParquetConversion` with clear "Input file not found" error message
- Simplify the stats/schema freshness check by removing unnecessary nested try/catch (input file is now validated upfront)
- Update all test assertions to match the new aliased read expression format
jqnatividad added a commit that referenced this pull request Mar 2, 2026
All 421 tests pass, 0 failures. The change is correct.
Changes:
- Fix Unicode truncation fast-path to use UTF-16 length as a cheap guard (strings shorter in UTF-16 are guaranteed shorter in codepoints), only performing expensive `Array.from()` codepoint conversion when the string exceeds the limit

Address review findings (job 606)

All 72 tests pass (0 failures), including the 3 new tests for missing params.
Changes:
- Check for `null`/`undefined` params explicitly before string coercion in `handleLogCall`, returning clear "is required" error messages (finding #1)
- Trim and strip newlines from log messages before writing, preventing multi-line log entries and inconsistent whitespace (findings #2, #3)
- Added tests for missing `entry_type`, missing `message`, and entirely empty params (finding #5)

Address review findings (job 607)

All 418 tests pass, including the new one.
Changes:
- Add test for newline-only message (`'\n\n'`) confirming it's rejected as non-empty string

Address review findings (job 609)

All 74 tests pass (0 failures), including all the new and existing `handleLogCall` tests.
Changes:
- Log `catch` block now writes error details to stderr via `console.error` instead of silently swallowing
- Added `--` separator before the message argument in `qsv log` CLI call to prevent messages starting with `-` from being misinterpreted as flags
- Documented newline collapsing behavior in the tool description ("Newlines are collapsed to spaces")
- Added test for non-string type coercion (`{ entry_type: 123, message: true }`) confirming `String()` coercion behavior

Address review findings (job 610)

All 420 tests pass.
Changes:
- Include truncated error message in the success result returned to the agent (not just stderr), so the agent has actionable context when `qsv_log` write fails
- Add test for non-string message coercion with valid `entry_type` to verify `String()` coercion works for the message path

Address review findings (job 611)

All 420 tests pass. The Rust diagnostics are pre-existing and unrelated to this change.
Changes:
- Added `assert.ok(!result.isError)` to the `handleLogCall` non-string message coercion test to explicitly verify the result is not an error, making the test intent clearer

Address review findings (job 613)

All 420 tests pass, 0 failures. The changes are verified.
Changes:
- Added comment on `--` separator in `handleLogCall` args explaining it guards against messages starting with `-` being parsed as flags (addresses medium finding)
- Added `config.qsvValidation.valid` skip guard to `handleLogCall coerces non-string message` test so it properly tests the success path instead of passing accidentally via error swallowing (addresses low finding #4)
- Added assertion that success response doesn't contain "warning" to confirm actual success vs swallowed error

Address review findings (job 615)

No CLAUDE.md changes needed for the `--` removal. All changes are complete and tests pass.
Changes:
- Remove unnecessary `--` end-of-options sentinel from `qsv log` args — `qsv log` uses docopt variadic `[<message>...]` which handles this correctly, and messages always start with `[entry_type]` so they can never be misinterpreted as flags
- Fix Unicode-safe truncation using `Array.from()` instead of `String.slice()` to avoid splitting surrogate pairs in non-ASCII messages
- Add throttling guidance to server instructions ("Avoid excessive logging — for simple interactions, a single user_prompt + result_summary pair is enough")
- Add test for the `handleLogCall` error-swallowing catch path using a non-existent working directory

Address review findings (job 616)

The change looks correct. The length check and truncation now both operate on codepoints consistently.
Changes:
- Fix Unicode truncation length mismatch: use codepoint count (`Array.from(sanitized).length`) for both the gate condition and the truncation, avoiding inconsistency between UTF-16 `.length` and codepoint-aware `Array.from().slice()`

Address review findings (job 618)

All 421 tests pass, 0 failures. All `handleLogCall` tests pass including the updated write-failure test.
Changes:
- Reworded catch-path message from misleading `"Logged ... (warning: write failed: ...)"` to clearer `"Log write failed (non-fatal): ... Workflow continues."` (issue 1)
- Added fast-path optimization for Unicode truncation: only call `Array.from()` when `sanitized.length > MAX_LOG_MESSAGE_LEN`, avoiding unnecessary codepoint conversion on short messages (issue 3)
- Updated test assertions to match the new error message wording
jqnatividad added a commit that referenced this pull request Mar 2, 2026
* feat(mcp): add qsv_log core tool for agent-initiated reproducibility logging

Enable agents to write structured entries (user_prompt, agent_reasoning,
agent_action, result_summary, note) to the qsv audit log (qsvmcp.log)
with u- prefixed UUIDs, distinct from automatic s-/e- audit entries.
Automatic audit logging is skipped for qsv_log calls to avoid recursion.
Messages are truncated at 4096 chars and logging failures never break
the workflow. Server instructions updated to guide agents on when/how
to log for third-party reproducibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address review findings (job 619)

All 421 tests pass, 0 failures. The change is correct.
Changes:
- Fix Unicode truncation fast-path to use UTF-16 length as a cheap guard (strings shorter in UTF-16 are guaranteed shorter in codepoints), only performing expensive `Array.from()` codepoint conversion when the string exceeds the limit

Address review findings (job 606)

All 72 tests pass (0 failures), including the 3 new tests for missing params.
Changes:
- Check for `null`/`undefined` params explicitly before string coercion in `handleLogCall`, returning clear "is required" error messages (finding #1)
- Trim and strip newlines from log messages before writing, preventing multi-line log entries and inconsistent whitespace (findings #2, #3)
- Added tests for missing `entry_type`, missing `message`, and entirely empty params (finding #5)

Address review findings (job 607)

All 418 tests pass, including the new one.
Changes:
- Add test for newline-only message (`'\n\n'`) confirming it's rejected as non-empty string

Address review findings (job 609)

All 74 tests pass (0 failures), including all the new and existing `handleLogCall` tests.
Changes:
- Log `catch` block now writes error details to stderr via `console.error` instead of silently swallowing
- Added `--` separator before the message argument in `qsv log` CLI call to prevent messages starting with `-` from being misinterpreted as flags
- Documented newline collapsing behavior in the tool description ("Newlines are collapsed to spaces")
- Added test for non-string type coercion (`{ entry_type: 123, message: true }`) confirming `String()` coercion behavior

Address review findings (job 610)

All 420 tests pass.
Changes:
- Include truncated error message in the success result returned to the agent (not just stderr), so the agent has actionable context when `qsv_log` write fails
- Add test for non-string message coercion with valid `entry_type` to verify `String()` coercion works for the message path

Address review findings (job 611)

All 420 tests pass. The Rust diagnostics are pre-existing and unrelated to this change.
Changes:
- Added `assert.ok(!result.isError)` to the `handleLogCall` non-string message coercion test to explicitly verify the result is not an error, making the test intent clearer

Address review findings (job 613)

All 420 tests pass, 0 failures. The changes are verified.
Changes:
- Added comment on `--` separator in `handleLogCall` args explaining it guards against messages starting with `-` being parsed as flags (addresses medium finding)
- Added `config.qsvValidation.valid` skip guard to `handleLogCall coerces non-string message` test so it properly tests the success path instead of passing accidentally via error swallowing (addresses low finding #4)
- Added assertion that success response doesn't contain "warning" to confirm actual success vs swallowed error

Address review findings (job 615)

No CLAUDE.md changes needed for the `--` removal. All changes are complete and tests pass.
Changes:
- Remove unnecessary `--` end-of-options sentinel from `qsv log` args — `qsv log` uses docopt variadic `[<message>...]` which handles this correctly, and messages always start with `[entry_type]` so they can never be misinterpreted as flags
- Fix Unicode-safe truncation using `Array.from()` instead of `String.slice()` to avoid splitting surrogate pairs in non-ASCII messages
- Add throttling guidance to server instructions ("Avoid excessive logging — for simple interactions, a single user_prompt + result_summary pair is enough")
- Add test for the `handleLogCall` error-swallowing catch path using a non-existent working directory

Address review findings (job 616)

The change looks correct. The length check and truncation now both operate on codepoints consistently.
Changes:
- Fix Unicode truncation length mismatch: use codepoint count (`Array.from(sanitized).length`) for both the gate condition and the truncation, avoiding inconsistency between UTF-16 `.length` and codepoint-aware `Array.from().slice()`

Address review findings (job 618)

All 421 tests pass, 0 failures. All `handleLogCall` tests pass including the updated write-failure test.
Changes:
- Reworded catch-path message from misleading `"Logged ... (warning: write failed: ...)"` to clearer `"Log write failed (non-fatal): ... Workflow continues."` (issue 1)
- Added fast-path optimization for Unicode truncation: only call `Array.from()` when `sanitized.length > MAX_LOG_MESSAGE_LEN`, avoiding unnecessary codepoint conversion on short messages (issue 3)
- Updated test assertions to match the new error message wording

* fix(mcp): address Copilot review findings for qsv_log

- Move skipAuditLog from "Key Constants" to a behavior note in CLAUDE.md
  (it's a local variable, not a module-level constant)
- Reorder enum and LOG_ENTRY_TYPES Set to match description order
  (reasoning before action)
- Add unique temp dir + cleanup to coercion test to prevent log
  file accumulation in OS temp root

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
jqnatividad added a commit that referenced this pull request Apr 23, 2026
…ls (#3734)

* feat(generators): detect required options from Usage: line

Both help_markdown_gen.rs and mcp_skills_gen.rs now identify options
shown outside [options]/[...] groups in the USAGE's `Usage:` section
(e.g. `qsv implode [options] -k <keys> -v <value>`) and mark them
accordingly.

- `docs/help/*.md`: required options get ` **(required)**` appended to
  their description column in the options table.
- `.claude/skills/qsv/*.json`: option entries gain `"required": true`
  when the flag is required. Optional options continue to emit nothing
  (the field is skipped when absent).

A small wrinkle worth noting: qsv-docopt's Parser does not always emit
Atom::Short entries paired with the Long atom for the `-k, --keys`
declaration style, so we can't rely on its pairing to expand short↔long
forms. Both generators do their own pairing pass by scanning the
options sections for `-X, --xxx` declarations.

Closes roborev review #1618 findings #3 and #4 at a project-wide
(generator) level. Findings #1 (empty positional arg descriptions) and
#2 (unfenced CSV examples) remain as separate work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: regenerate help markdown and MCP skills for required-option markers

Commands with required options in their Usage: line now show them:
- applydp: --new-column, --replacement, --formatstr
- apply, describegpt, fetchpost, implode, joinp, luau, py, split:
  various required options previously unmarked

All regenerated via `qsv --generate-help-md` and `qsv --update-mcp-skills`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(generators): only mark an option required when it's required in all Usage variants

The previous heuristic produced seven false positives flagged by roborev
review #1624. The detector now:

1. Computes a per-Usage-variant required set (tokens outside `[...]` AND
   outside any `(A | B)` alternative group), then takes the intersection
   across all non-`--help` Usage lines. An option must be required in
   every variant to be marked globally required.
2. Handles `(A | B | C)` alternative groups by masking them out entirely —
   inside alternatives no individual token is required.
3. Expands short→long aliases per Usage line, before intersection, so
   `-n`'s use as a positional-style flag on one Usage variant doesn't
   leak into `--no-headers` as globally required.

Fixes false-positive required markers on:
- split: `--size` / `--chunks` / `--kb-size` (alternative group)
- joinp: `--cross` / `--non-equi` (separate Usage lines)
- apply / applydp: `--new-column` / `--replacement` / `--formatstr`
  (subcommand-scoped, not global)
- describegpt: `--prepare-context` / `--process-response`
  (separate Usage lines)
- fetchpost: `--payload-tpl` (alternative inside `(A | B)`)
- luau / py: `--no-headers` (the `-n <main-script>` Usage role only
  appears in one variant; intersection excludes it)
- py: `--helper` (only required on one Usage variant)

Implode's `--keys` / `--value` markers are preserved (genuinely required
in the single Usage variant).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(generators): share required-options detection in a common module

Addresses roborev review #1625. Extracts the previously-duplicated
required-option detection (and its helpers) out of both generators and
into a new crate::generators_common module. Both mcp_skills_gen and
help_markdown_gen now delegate to it, keeping their detection semantics
in lockstep.

Fixes additional review findings while consolidating:

- Bidirectional short↔long expansion via a new FlagPairs type, so Usage
  lines that mention only the long form also surface the short form in
  the required set (and vice versa).
- Bracket-depth is now u32, and uses u32::saturating_sub so an
  unbalanced `]` cannot underflow the counter (on i32 it would have
  saturated at i32::MIN, silently dropping later required tokens).
- The pair regex now matches long-first (`--keys, -k`) declarations in
  addition to short-first (`-k, --keys`).
- The pair regex scans only the options-declaration portion of the
  USAGE string (after the Usage: block), so a future quirk in the
  Usage: block can't introduce a bogus pair.

Adds 12 unit tests covering: single-variant required expansion,
alternative groups `(A|B|C)`, multi-variant intersection, subcommand-
scoped options, short-role collision (luau/py), plain `(X)` grouping
without a pipe, nested optional inside an alt group, long-first
declarations, long-only Usage mentions, no Usage block, unbalanced
brackets, and Usage-block scoping of the pair regex.

Regenerator outputs (docs/help and .claude/skills) are unchanged by
this refactor (confirmed with --generate-help-md / --update-mcp-skills
producing no diffs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(generators_common): narrow options-section scope, cache regexes, handle continuations

Addresses roborev review #1626:

1. Options-section scope: replace the blank-line-after-Usage heuristic
   with a proper line-anchored, case-insensitive `options:` / `Options:`
   header regex. The previous logic landed on description paragraphs
   (most qsv USAGEs have a description between the Usage block and the
   options section), so the pair scan was broader than intended.

2. Fallback for minimal fixtures: if no options header is found, fall
   back to scanning the whole USAGE so short↔long pairs declared in
   non-standard layouts (or small test fixtures) still register.

3. Compiled regexes (short-first pair, long-first pair, flag scanner,
   options header) are now cached via `std::sync::OnceLock`, removing
   per-call recompilation overhead.

4. Usage-block collection now terminates only on a blank line and merges
   docopt continuation lines (indented lines within the block that do
   not begin with `qsv`) into their parent variant, preventing a wrapped
   Usage line from showing up as a standalone variant and silently
   narrowing the intersection.

Adds three more unit tests covering: continuation-line joining
(`continuation_line_does_not_truncate_usage_block`), scope narrowing
(`pair_regex_scans_only_the_options_section_not_description`), and the
whole-text fallback (`fallback_to_whole_text_when_no_options_section`).
15/15 unit tests pass; regenerator output is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(generators_common): tighten options-header regex and indentation-aware continuation

Addresses roborev review #1627:

1. options_section regex: the leading class is now `[ \t\w-]` instead of
   `[\s\w-]`, so it can't straddle a newline. Trailing `[ \t]*` matches
   only the same-line whitespace, keeping the "line-anchored" claim in
   the doc comment literally true.

2. collect_usage_lines: drop the dead `if trimmed.is_empty()` branch —
   the `take_while(!blank)` already filters blank lines out.

3. collect_usage_lines: continuation detection now prefers indentation
   depth (leading-whitespace count strictly greater than the parent
   variant's) with the `qsv`-prefix rule as a tiebreaker. A continuation
   line whose positional begins with `qsv` (e.g. `qsv-input`) would
   previously have been treated as a new variant.

4. collect_usage_lines: inline `Usage: qsv foo ...` header variants are
   now retained. Previously the `Usage:` line was unconditionally
   dropped, silently losing the only variant for that style of help
   text.

Three additional unit tests lock this in: `Common options:` /
`map options:` prefix-word headers both matched; a tab-indented
`\toptions:` header; and an inline `Usage: qsv foo ...` variant. 18/18
unit tests pass; full `cargo test` passes; generator outputs unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(generators_common): baseline-indent continuation detection + cleanups

Addresses roborev review #1628:

1. Indentation comparison now uses a single *baseline indent* (the leading
   whitespace of the first non-blank line in the Usage block) instead of
   comparing each line against the previous variant's indent. This fixes
   the inline-vs-non-inline asymmetry: an inline `Usage: qsv foo --bar`
   header was stored trimmed (leading_ws=0), so a following standard-
   column variant like `       qsv foo --baz` (leading_ws=7) was wrongly
   merged as a continuation. Inline variants are now synthesized at the
   baseline indent for consistent comparison.

2. With the baseline-indent rule, indentation genuinely *outranks* any
   prefix test: continuation == leading_ws > baseline. No more confused
   comment claiming "tiebreaker" while the code actually OR'd. The
   `qsv`-prefix check is gone — indentation is the sole signal.

3. `if let Some(last) = variants.last_mut()` replaces the `match + later
   unwrap()` pattern, removing the SAFETY comment.

Three new unit tests lock this in:
- `indented_wrap_line_merges_into_parent_variant`
- `continuation_starting_with_qsv_prefix_is_still_a_continuation`
  (a deeper-indented `qsv-foo` continuation must fold, regression for
  the indentation-outranks-prefix intent)
- `inline_usage_plus_indented_second_variant_stays_separate`
  (regression for the storage asymmetry from finding #1)

21/21 unit tests pass; generator outputs unchanged; clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(generators_common): derive baseline indent from min leading-ws

Addresses roborev review #1629:

1. Baseline-indent is now the *minimum* leading whitespace across all
   non-blank lines in the Usage block (not the leading_ws of raw[0]).
   In well-formed docopt, continuation lines are always indented deeper
   than their variants, so min reliably picks the variant column —
   including when raw[0] happens to be a wrapped continuation of an
   inline `Usage:` variant (previously baseline would have drifted to
   the continuation's indent, causing real variants at the standard
   column to be misclassified).

2. The continuation branch now fails loudly (debug_assert) when hit
   with an empty variants vec rather than silently promoting the line,
   so a future refactor of the baseline derivation can't regress
   unnoticed.

3. Replaced `.map().next().unwrap_or(0)` with `.iter().map(...).min()`
   — both a cleaner expression and the fix for the baseline-drift bug.

4. Added `inline_usage_with_wrapped_continuation_and_second_variant` to
   lock in the exact corner case: inline `Usage: qsv foo --bar`, an
   indented wrap line, and a second variant at the standard column —
   baseline must land on the standard column, the wrap line must merge
   into variant 1, and the second variant must stay separate.

5. Doc-comment reworded to describe the actual invariant: "at or below
   the baseline indent start new variants; strictly deeper lines are
   merged."

22/22 unit tests pass; generator outputs unchanged; clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(skills): add optional Option.required to TypeScript schema

The Rust generator emits `"required": true` on required options in the
MCP skill JSON (via the newly-added Option_.required field in
mcp_skills_gen.rs). The TypeScript Option interface didn't declare it,
so consumers using the typed view wouldn't see the field even though
the JSON carries it.

Adds `required?: boolean` with a doc comment matching the generator's
semantics: emitted only when true, omitted for optional options.

Addresses Copilot review on PR #3734.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jqnatividad added a commit that referenced this pull request Apr 26, 2026
* fetch/fetchpost: review-driven cache, panic and safety fixes

- fix disk-cache cache_remove using bare URL instead of the composite key
  built by the io_cached convert macro, so non-200 responses were never
  evicted when --cache-error was unset (fetch.rs Disk path; fetchpost.rs
  InMemory/Disk/Redis paths — all three were broken)
- reorder the sel.len() != 1 check before sel.iter().next().unwrap() so an
  empty URL-column selection returns a friendly error instead of panicking
- replace HeaderValue::to_str().unwrap() with .unwrap_or("") in retry-after
  and ratelimit-* parsing so non-ASCII header bytes can't crash the process
- bump the post-error 10ns progress-bar drain sleep to 100ms (one 5Hz cycle)
- use match on first byte of --report instead of two to_lowercase() allocs
- convert expand_tilde(dir).unwrap() to a graceful fail_clierror! and add
  // safety: comments to the remaining infallible unwraps per qsv style
- update test_fetch::fetch_simple_diskcache: it was asserting cache_hit=1
  for error rows on the second pass, encoding the disk-cache bug — now
  asserts cache_hit=0, matching the corrected eviction behavior

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fetch/fetchpost: address roborev 1680 Low findings

- extract cross_session_cache_key() helper in both files so the disk/redis
  cache-key format string lives in one place; the convert macros and the
  cache_remove call sites now both go through it, preventing the kind of
  drift that introduced the original eviction bug
- replace the misleading "we only need url in the cache key" comment above
  fetchpost::get_cached_response with an accurate NOTE that the in-memory
  cache is keyed on form_body_jsonmap only and flags the URL-collision
  edge case (pre-existing; widening deferred as a behavior change)
- add ASCII-only-by-design comment on the --report first-byte match in
  both files so future readers don't assume Unicode case folding

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fetch/fetchpost: document JSON response normalization

Address review finding #4 by adding a JSON RESPONSE HANDLING section to the
USAGE block of both commands. Without --jaq, every successful response is
parsed by serde_json and re-serialized, which silently normalizes object key
order, integer/float representation (e.g. 1e2 -> 100), and whitespace. Users
who need byte-exact server output now have a documented escape hatch
(post-process or use --jaq).

Help-md regenerated via `qsv --generate-help-md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fetch/fetchpost: correct JSON RESPONSE HANDLING note

Address roborev 1682 findings:

- Medium: replace the incorrect "JSON object key order is sorted" claim
  with "object key order is preserved" — qsv enables the preserve_order
  feature on serde_json (Cargo.toml:272), so round-tripping through
  serde_json::Value preserves insertion order
- Low: expand "trailing whitespace is stripped" to "all insignificant
  whitespace is removed (compact) or re-indented (--pretty)" — the prior
  wording understated CompactFormatter's behavior
- Low: add the two normalizations users actually hit on this path —
  duplicate JSON object keys are collapsed (last value wins), and
  responses that are not valid JSON are written as an empty cell (or
  the parse error if --store-error is set)

Help-md regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jqnatividad added a commit that referenced this pull request Apr 27, 2026
…panic (#3758)

* py: review-driven cleanup, hoist Python module setup, fix jagged-row panic, propagate filter errors

Match the style of the recent dedup/sort/sortcheck/foreach review-cleanup PRs.

Correctness:
- Remove `assert!(record.len() == headers_len)` that panicked on jagged
  CSVs; short rows already yield "" via `unwrap_or_default`, long rows
  are ignored beyond `headers_len` via `.take()`.
- Replace silent `let _ = batch_locals.set_item(...).map_err(...)` (which
  kept evaluating with stale locals after a counted error) with `?`.
- Replace `result.extract().unwrap_or(false)` on the filter cast with `?`
  so internal extraction failures aren't silently dropped.

Performance:
- Build helpers/user_helpers PyModules, builtins/math/random/datetime
  imports, and the QSVRow instance ONCE up front. Hold them as `Py<...>`
  across `Python::attach` calls and `.bind(py)` per batch. Previously
  these were rebuilt every batch (default 50k rows) -- 20x wasted work
  on a 1M-row file.
- Replace the per-row `Vec<&str>` argument buffer with an inline
  `PyList::new` from a `0..headers_len` iterator: one Python allocation,
  no intermediate Rust Vec, and the `&str` borrows on `record` are
  released before `push_field`/`write_record`.

Cleanups:
- Delete duplicate `let debug_flag` shadow.
- Remove stale `#[allow(unused_assignments)]` on `batch_record`.
- Apply `intern!` to `__builtins__`/`math`/`random`/`datetime`/`col`
  keys for consistency with the existing `intern!` on `qsv_uh`.
- Unify the three "Cannot load ..." error messages to
  `"Cannot read Python {expression,helper} file '{path}': {e}"` so the
  user sees which file failed. Fix moved-value shadowing on
  `helper_file` so the path is available in the Err arm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* py: address roborev #1721, single-pass row loop, hoist header_vec before new column

- Compute header_vec from the input headers BEFORE the map command appends
  its new output column to `headers`, so header_vec.len() == headers_len
  always. Drops the `.take(headers_len)` guard in the per-row loop, which
  was masking the off-by-one in map mode.
- Combine the two per-row iterations into a single pass that both sets
  the per-column local in `batch_locals` and appends to a fresh PyList.
  Eliminates the duplicate `record.get(i).unwrap_or_default()` calls.

Roborev review 1721 also flagged the stricter `?`-based error semantics
on `set_item`/`extract` (Low #1) and the up-front module build for empty
inputs (Low #4); both are intentional per the previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* address review: build QSVRow from input headers only, add jagged-row regression test

Copilot pointed out that QSVRow was being initialized with `headers.iter()`
*after* the map output column had been appended, so `col["<new-column>"]`
would map to an out-of-range index and IndexError. Pass only the input
headers (`headers.iter().take(headers_len)`) so the QSVRow mapping lines
up with the actual row data.

Also adds a regression test that locks in the csv-reader-level rejection
of jagged rows (with `flexible: false`, the default in python.rs). If
someone later flips the reader to `flexible(true)` without thinking
through the per-row eval path, this test will fail and force them to
revisit the unwrap_or_default behavior added in d12591d.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jqnatividad added a commit that referenced this pull request Apr 28, 2026
- (#1 #2) Replace SAMPLE_TEST_PORT + SAMPLE_TEST_HOST (which
  duplicated the port and could drift) with a single SAMPLE_TEST_PORT +
  SAMPLE_TEST_BIND_HOST literal. URL-builder and bind() both derive from
  the same source — no more brittle .split(':').next().unwrap() that
  would also panic on IPv6 hosts.
- (#3) Wrap the ServerHandle in a SampleWebServer RAII guard. The
  server now stops in Drop, so a panic inside read_stdout / stdout
  doesn't leak the port and cascade into "Address already in use" on
  the next #[serial] test.
- (#4) Call wrk.assert_success(&mut cmd) before reading stdout in
  the success-path tests, so a regression that makes qsv exit non-zero
  surfaces qsv's stderr instead of a generic CSV-parse error.

77/77 sample tests pass; clippy --bin qsv clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jqnatividad added a commit that referenced this pull request Apr 28, 2026
…3775)

* test/sample: add integration tests for streaming Bernoulli URL path

Closes the test-coverage gap flagged in PR #3774. Stands up a local
actix-web fixture (port 8082, distinct from test_fetch's 8081) and
exercises the boundary detection and validation guards added there:

- sample_bernoulli_url_quoted_newline_header: header field 0 contains a
  literal `\n` inside an RFC-4180 quote. Asserts the header arrives
  intact (3 fields, embedded newline preserved) and that every emitted
  data row also has 3 fields. Old code would have split on the raw
  byte and corrupted every following record.
- sample_bernoulli_url_max_size_truncation: serves a ~1.2 MiB CSV with
  fixed 100-byte records so `--max-size 1` cuts deterministically
  inside record 10486. Asserts max id <= 10485 (no half-record at the
  cap) and that every emitted row is well-formed.
- sample_bernoulli_url_404_fails_fast: hits an unmapped path on the
  fixture server. Asserts qsv exits with error instead of feeding the
  HTML 404 body into the csv parser (regression for the missing
  `error_for_status()`).
- sample_bernoulli_url_custom_delimiter: serves a TSV and passes
  `--delimiter '\t'`. Reads raw stdout and splits on tab (the writer
  also honors --delimiter, so read_stdout's comma parser would
  collapse rows). Asserts header and data rows split into 3 fields.

Tests use #[serial] so they don't race on the port. 77/77 sample tests
pass; clippy --bin qsv clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* typo: mis-split->split incorrectly

* test/sample: address review feedback on streaming Bernoulli tests

- (#1 #2) Replace SAMPLE_TEST_PORT + SAMPLE_TEST_HOST (which
  duplicated the port and could drift) with a single SAMPLE_TEST_PORT +
  SAMPLE_TEST_BIND_HOST literal. URL-builder and bind() both derive from
  the same source — no more brittle .split(':').next().unwrap() that
  would also panic on IPv6 hosts.
- (#3) Wrap the ServerHandle in a SampleWebServer RAII guard. The
  server now stops in Drop, so a panic inside read_stdout / stdout
  doesn't leak the port and cascade into "Address already in use" on
  the next #[serial] test.
- (#4) Call wrk.assert_success(&mut cmd) before reading stdout in
  the success-path tests, so a regression that makes qsv exit non-zero
  surfaces qsv's stderr instead of a generic CSV-parse error.

77/77 sample tests pass; clippy --bin qsv clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test/sample: address Copilot review on PR #3775 — single-run cmd, server start timeout

- Replace the assert_success-then-read_stdout double-run pattern with a
  single capture-and-parse helper. The previous shape ran qsv twice per
  test, doubling fixture-server requests (and the ~1.2 MiB max-size
  download) and meaning the parsed stdout came from a different
  execution than the one whose status was asserted.
  - Added run_and_assert_success(): runs once, asserts status, returns
    Output (with stderr surfaced on failure).
  - Added parse_csv_stdout(): mirrors wrk.read_stdout's Vec<Vec<String>>
    shape but reads from a captured buffer.
  - All three success-path tests (quoted newline header, max-size
    truncation, custom delimiter) now use these helpers.
- Switch the SampleWebServer startup channel to send
  Result<ServerHandle, String> and use recv_timeout(10s) instead of
  recv(). A failed bind (e.g., port already in use) used to leave
  start() blocked forever; it now panics fast with the bind error
  surfaced from the server thread.

77/77 sample tests pass; clippy --bin qsv clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jqnatividad added a commit that referenced this pull request May 10, 2026
- Dedupe build_large_oom_csv into tests/workdir.rs so test_stats and
  test_frequency share one source of truth (Low #1).
- Document the pre-indexed + OOM → sketch fallback path in --memcheck
  USAGE text, CHANGELOG, and docs/STATS_DEFINITIONS.md (Low #2).
- Drop the dead flag_sketch_method='frequent_items' assignment before
  run_frequent_items — confirmed run_frequent_items does not consult
  flag_sketch_method (Low #3).
- Tighten the stats and frequency OOM wwarn messages to "Re-run with
  explicit ... exact to disable the auto-enable" — matches the
  established frequency wording and removes the misleading "override"
  phrasing (Low #4).

Verified Low #5 separately: which_stats() already gates mad on
!approx_quantiles regardless of flag_everything/flag_mad, so the
auto-disable promised by the wwarn is honored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jqnatividad added a commit that referenced this pull request May 11, 2026
)

* feat(stats,frequency): auto-enable DataSketches estimators on OOM

When --memcheck is set and util::mem_file_check returns OutOfMemory,
stats and frequency now auto-enable their DataSketches-backed estimators
(t-digest + HyperLogLog for stats; Misra-Gries Frequent Items for
frequency) in addition to the existing auto-index fallback. Conflict
guards mirror the explicit-validation rejections so the auto-enable only
flips methods that would have passed validation if set by hand. A wwarn!
lists the auto-enabled estimators; the original OOM is only propagated
when neither fallback engages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(stats,frequency): address roborev #2028 findings

- Dedupe build_large_oom_csv into tests/workdir.rs so test_stats and
  test_frequency share one source of truth (Low #1).
- Document the pre-indexed + OOM → sketch fallback path in --memcheck
  USAGE text, CHANGELOG, and docs/STATS_DEFINITIONS.md (Low #2).
- Drop the dead flag_sketch_method='frequent_items' assignment before
  run_frequent_items — confirmed run_frequent_items does not consult
  flag_sketch_method (Low #3).
- Tighten the stats and frequency OOM wwarn messages to "Re-run with
  explicit ... exact to disable the auto-enable" — matches the
  established frequency wording and removes the misleading "override"
  phrasing (Low #4).

Verified Low #5 separately: which_stats() already gates mad on
!approx_quantiles regardless of flag_everything/flag_mad, so the
auto-disable promised by the wwarn is honored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* address Copilot review: stream OOM fixture, drop drift-prone line refs, add -- prefix

- tests/workdir.rs: rewrite build_large_oom_csv to stream rows directly to
  a csv::Writer instead of building a 10M-row Vec in memory first. Avoids
  ~1.5 GB of String allocation that would OOM the test harness itself on
  memory-constrained hosts, defeating the purpose of the ignored OOM tests.
- src/cmd/stats.rs, src/cmd/frequency.rs: replace hard-coded intra-file
  line-number references in the try_enable_approx_sketches and
  can_enable_frequent_items doc-comments with descriptive references to
  the validator/dispatch blocks they mirror.
- src/cmd/frequency.rs: add the missing -- prefix to "sketch-method
  frequent_items" in the --memcheck USAGE help text; regenerate
  docs/help/frequency.md and .claude/skills/qsv/qsv-frequency.json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(stats,frequency): clarify OOM fallback fires in both NORMAL and CONSERVATIVE mode

The OOM auto-fallback to DataSketches estimators is gated on the result
of util::mem_file_check, NOT on whether --memcheck is set. The
in-memory load check runs unconditionally on the non-parallel path;
--memcheck only switches the check from NORMAL mode (vs. total memory)
to the stricter CONSERVATIVE mode (vs. available + swap × platform
factor). The fallback can therefore trigger without --memcheck too —
just much less often, since NORMAL mode only trips when the file is
larger than ~80% of total RAM.

Rewrote --memcheck USAGE in stats.rs and frequency.rs to:
- Lead with what --memcheck actually does (CONSERVATIVE vs. NORMAL).
- Reference QSV_MEMORY_CHECK as the env-var equivalent.
- Describe the OOM fallback as a behavior of the load check itself,
  not of --memcheck specifically.

Updated CHANGELOG.md and docs/STATS_DEFINITIONS.md to match.
Regenerated docs/help/{stats,frequency}.md and the corresponding MCP
skill JSONs from the new USAGE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* address Copilot review: real opt-out for OOM auto-enable + assert command success in tests

Two issues raised by Copilot on the OOM auto-fallback:

(1) The wwarn promised users could re-run with explicit
    --quantile-method exact / --cardinality-method exact / --sketch-method
    exact to disable the auto-enable, but the code only checked the
    parsed flag value. docopt fills in the default "exact" either way,
    so an explicit --foo-method exact was indistinguishable from
    omitting the flag — making the documented opt-out a no-op.

    Fix: scan argv for the literal flag names ("--foo-method" or
    "--foo-method=...") to detect explicit user intent. Thread that
    through try_enable_approx_sketches / can_enable_frequent_items via
    new user_set_* parameters; the auto-enable is suppressed when the
    user explicitly provided the flag (regardless of value). Documented
    in STATS_DEFINITIONS.md.

(2) The new OOM tests used wrk.output_stderr, which returns stderr
    regardless of exit status — a command that errored out after
    printing the auto-enable wwarn would still pass the test.

    Fix: add a wrk.stderr_on_success helper that asserts status.success()
    before returning stderr (with the same diagnostic-rich panic format
    as assert_success). Migrate the 6 new stats/frequency OOM tests to
    use it. Other call sites of output_stderr left untouched — they
    test failure paths where non-success is intentional.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(frequency): restore can_enable_frequent_items doc comment

Address roborev #2032: in the previous commit (d9fe03e), `argv_has_flag`
and its doc comment were inserted *between* `can_enable_frequent_items`'s
doc comment and the function body. Because Rust doc comments attach to
the next item, the entire docstring block (including the
`user_set_sketch_method` paragraph and the trailing "Returns false if any
conflicting flag is set..." line) bound to `argv_has_flag`, leaving
`can_enable_frequent_items` with no doc comment at all.

Move `argv_has_flag` (with its own 4-line doc) to live *after*
`can_enable_frequent_items`, mirroring the layout in stats.rs where the
ordering was correct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants