move --help output from stderr to stdout#2138

jqnatividad · 2024-09-13T10:57:12Z

as per gnu guidelines

Also convert match with two arms to if let for CSV error from() impl

as per gnu guidelines fixes #2105 Also convert match with two arms to if let for CSV error from() impl

…sage text those still go to stderr

The prior commit hoisted `let mut value = String::new()` outside the loop under the banner of "reuse String buffer", but `content_type_to_value` returns a freshly-allocated `String` each call (the underlying fake-rs fakers always materialize an owned `String`). Reassigning the hoisted variable just drops the previous allocation and moves in the new one — no allocations are amortized. The `#[allow(unused_assignments)]` and the extra block were also pure noise. Dropped the hoist, the attribute, and the block. Behavior unchanged: let value = faker_map::content_type_to_value(content_type, rng)?; if seen.insert(value.clone()) { pool.push(value); } Actual buffer reuse would require changing `content_type_to_value` to write into a `&mut String`, but that wouldn't help either — fake-rs's `fake_with_rng::<String, _>` still allocates internally on every call. Tests: 12 unit + 9 integration, all passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(synthesize): add `synthesize` command to generate statistically-faithful synthetic CSVs `qsv synthesize <input.csv>` runs `stats` + `frequency` + `count` on the source and emits N rows of synthetic data that reproduce the source's per-column attributes: * Categorical / low-cardinality columns are reproduced by frequency-weighted sampling of the *real* value set — cardinality, weights and repetition structure preserved exactly. * Numeric and date/datetime columns use quartile-bucketed generation so the *shape* of the distribution is preserved, not just `[min, max]`. * Null ratios are reproduced per column. * `--seed` makes output fully reproducible (single master `StdRng` threads through both selection logic and every faker call). * `--dictionary <file>` layers in semantic Content Types from `describegpt --dictionary --infer-content-type` — each token maps to a `fake-rs` faker (40-token vocabulary covers names, emails, addresses, UUIDs, etc.). Bounded-cardinality faker columns sample from a fixed pre-generated pool of distinct fake values (the consistency mechanism, so a given logical value maps consistently). * `--infer-content-type` runs `describegpt` internally to build the dictionary on the fly (needs `QSV_LLM_APIKEY`). Cross-column correlation is explicitly out of scope for v1; columns are generated independently. Implementation notes: * New module `src/cmd/synthesize/{mod,dictionary,faker_map,generator}.rs`. Reuses `util::run_qsv_cmd` to shell out to `stats` / `frequency` / `count` (the same primitive `describegpt::perform_analysis` uses) — no refactor of describegpt itself was needed. * Bumped a handful of `pub(super)` items in `src/cmd/describegpt/dictionary.rs` to `pub(crate)` (`StatsRecord`, `FrequencyRecord`, `parse_stats_csv`, `parse_frequency_csv`, `CONTENT_TYPE_VOCAB`) and made the `dictionary` submodule `pub(crate)` so `synthesize` can consume them. * `fake = "5"` (features: `derive`, `chrono`, `uuid`, `random_color`) added as an optional dep. The new `synthesize` feature gates the dep and is wired into `distrib_features` and `qsvmcp` (not `lite`, not `datapusher_plus`). `fake` v5.1.0 uses `rand 0.10` (same as qsv), so its `fake_with_rng` accepts qsv's seeded `StdRng` directly — no RNG version-bridging needed. * Worked around a determinism bug in fake-rs's `Dummy<IP<L>> for String` (ignores the passed RNG); `ip_address` uses `IPv4` directly instead. * Made the help-markdown and MCP-skills generators support module-dir commands (`src/cmd/<name>/mod.rs`), not just flat `src/cmd/<name>.rs`. * Regen also picked up genuine doc drift in `qsv-stats.json`, `qsv-frequency.json`, and `docs/help/frequency.md` (Apache DataSketches big-endian notes) — included. v1 is English-only; `--locale` is reserved for future multi-locale support (each fake-rs locale is a distinct type, so multi-locale dispatch needs macro expansion). Tests: 12 unit tests + 9 integration tests, all passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address review: suppress DevSkim false positives in synthesize 15 DevSkim code-scanning alerts on the synthesize PR, all false positives: * DS148264 ("weak/non-cryptographic RNG") — 12 hits on `StdRng::seed_from_u64` in *test code*. The whole point of these tests is determinism with a fixed seed; synthesize generates *fake data*, not security tokens. Added `// DevSkim: ignore DS148264` inline (matches the existing pattern in `select.rs:142`, `sample.rs:406`, `sort.rs:202-214`, `pragmastat.rs:668`). * DS126858 ("weak/broken hash algorithm") — 3 hits on the variable name `cmd2` in `tests/test_synthesize.rs`. The substring `md2` matched DevSkim's legacy-hash regex (md2/md4/md5/sha0/sha1). Renamed `cmd1`/`cmd2` → `first_cmd`/`second_cmd` — clearer anyway, no behavior change. Tests: 12 unit + 9 integration, all still passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address review: Copilot fixes — count/--jobs, index fallback, date sampling Three valid issues caught by Copilot's PR review: * `--jobs` was passed to `qsv count`, which doesn't accept it. Split the analysis args into `delim_only_args` (for `count`, which accepts `-d` but not `--jobs`) and `analysis_common_args` (for `stats` / `frequency`, which accept both). `index` continues to get no extra args. * Best-effort indexing was effectively dead code: `Config::index_files()` returns `Ok(None)` (not `Err`) when no index exists, so `.is_err()` was false for the common unindexed case and the `qsv index` subprocess never fired. Switched to `!matches!(..., Ok(Some(_)))` so we now actually create the index when one is missing. Confirmed end-to-end — the "Indexed input" status line now appears for unindexed inputs. * `Date` columns severely under-sampled the max date: `stats` min/max/q* values for `Date` are at midnight, so sampling uniformly over epoch seconds put only a single tick of the max-day in the last bucket (the rest of that day fell outside any bucket). Now `parse_epoch` returns *whole days since the UNIX epoch* for `Date` columns (and still seconds for `DateTime`); the `DateQuantile::next` arm samples a whole day uniformly within the bucket and multiplies back to seconds for formatting. `DateTime` sampling is unchanged. Tests: 12 unit + 9 integration, all passing. Verified `qsv synthesize ... --jobs 4` works end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address roborev #2134: NaN guard, cardinality caveat, example descriptions 5 findings on the initial commit (4d706e1) — 1 HIGH already addressed in 899de97 (index_files() Ok(None) check), 4 still open: * MEDIUM (generator.rs `parse_f64`): `parse_f64("NaN")` returned `Some(f64::NAN)`, which slipped past both the `hi < lo` short-circuit and the quartile sanity check, and would then panic in `rng.random_range`. Added `.filter(|v| v.is_finite())` so non-finite endpoints fail at parse time. Same path now rejects `±Infinity`. * LOW (USAGE / docs): "logical value maps consistently" was unqualified, but for cardinality > CARDINALITY_POOL_CAP (100k) we generate a fresh fake per row. Spelled this out in the USAGE. * LOW (mod.rs USAGE Examples block): the second example had two `$ qsv` lines under one comment, so the MCP-skill generator gave both the same description. Split into two comments — "First, generate the Data Dictionary with describegpt" / "Then layer in semantic fakers from the dictionary" — and regenerated qsv-synthesize.json + synthesize.md. * LOW (qsv-synthesize.json missing `--jobs`): not addressed by design. `--jobs` is intentionally skipped from MCP skills as "infrastructure setting" by mcp_skills_gen.rs:367 — verified the same flag is omitted from qsv-stats.json (29 options, no jobs), qsv-frequency.json (35 options, no jobs). This is a project-wide convention, not a synthesize-specific bug. Tests: 12 unit + 9 integration, all passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf(synthesize): Reuse String buffer and simplify Option handling Declare a mutable String outside the loop and reuse it to avoid repeated allocations, adding #[allow(unused_assignments)] to suppress warnings. Replace the match on faker_map::content_type_to_value(...) with a direct call using the ? operator to return None early on failure, reducing nesting and simplifying control flow in build_faker_pool. * address roborev #2138: drop fake "buffer reuse" in build_faker_pool The prior commit hoisted `let mut value = String::new()` outside the loop under the banner of "reuse String buffer", but `content_type_to_value` returns a freshly-allocated `String` each call (the underlying fake-rs fakers always materialize an owned `String`). Reassigning the hoisted variable just drops the previous allocation and moves in the new one — no allocations are amortized. The `#[allow(unused_assignments)]` and the extra block were also pure noise. Dropped the hoist, the attribute, and the block. Behavior unchanged: let value = faker_map::content_type_to_value(content_type, rng)?; if seen.insert(value.clone()) { pool.push(value); } Actual buffer reuse would require changing `content_type_to_value` to write into a `&mut String`, but that wouldn't help either — fake-rs's `fake_with_rng::<String, _>` still allocates internally on every call. Tests: 12 unit + 9 integration, all passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jqnatividad added 2 commits September 13, 2024 06:56

move --help output from stderr to stdout

2b7dbdc

as per gnu guidelines fixes #2105 Also convert match with two arms to if let for CSV error from() impl

differentiate --help invocations from other docopt errors that show u…

c8ef543

…sage text those still go to stderr

jqnatividad merged commit b997c76 into master Sep 13, 2024

BrewTestBot mentioned this pull request Sep 24, 2024

qsv 0.135.0 Homebrew/homebrew-core#191718

Merged

jqnatividad deleted the 2105-move-help-to-stdout branch October 22, 2024 02:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

move --help output from stderr to stdout#2138

move --help output from stderr to stdout#2138
jqnatividad merged 2 commits into
masterfrom
2105-move-help-to-stdout

jqnatividad commented Sep 13, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jqnatividad commented Sep 13, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant