Releases: dathere/qsv

21.0.0

07 Jun 15:08

[21.0.0] - 2026-06-08 🌐 The "F-AI-Rification" Release 📇

FAIR Data is AI-Ready Data. It is the perfect context for AI applications, as it is compact and token-efficient, and it vastly improves an Agent's understanding of your Data. A few hundred kilobytes of FAIR metadata is often all it takes to comprehensively describe gigabyte- or terabyte-scale data.

This major release builds even further on qsv's existing FAIRification capabilities, with two new commands and expanded geocoding:

  1. profile - extracts standards-compliant dataset FAIR metadata (DCAT-US v3, DCAT-AP v3, Croissant 1.1 and Geoconnex);
  2. get - fetches tabular data from HTTP(S), cloud object stores (AWS S3/Google Cloud/Microsoft Azure) and CKAN portals into a content-addressed local cache - making it even easier to FAIRify remote data and/or to use these remote data to enrich your data corpus;
  3. geocode goes online with OpenCage. Geospatially contextualize and normalize your location data. Unlike other geocoders, OpenCage is built from open data, has the most permissive licensing (results can be displayed on ANY map and cached indefinitely), and is much cheaper to boot!
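
As a rough illustration of what "content-addressed local cache" means for get, here is a minimal Python sketch. It is not qsv's implementation: the class and function names are hypothetical, hashlib.blake2b stands in for BLAKE3 (which is not in the Python standard library), and the "server" is simulated in-process. It shows bytes stored under their digest, plus the If-None-Match / 304 Not Modified revalidation mentioned in the Highlights.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Entry:
    digest: str  # content hash: the address of the stored bytes
    etag: str    # validator handed back by the (simulated) server


class ContentCache:
    """Hypothetical in-memory stand-in for a content-addressed disk cache."""

    def __init__(self):
        self.blobs = {}  # digest -> bytes
        self.names = {}  # cache name -> Entry

    def put(self, name, body, etag):
        # ASSUMPTION: blake2b is used here only because BLAKE3 is not in the stdlib.
        digest = hashlib.blake2b(body, digest_size=32).hexdigest()
        self.blobs.setdefault(digest, body)  # identical content is stored once
        self.names[name] = Entry(digest, etag)
        return digest

    def get(self, name, fetch):
        entry = self.names.get(name)
        # Send the stored ETag, as an If-None-Match header would.
        status, body, etag = fetch(entry.etag if entry else None)
        if status == 304 and entry is not None:
            return self.blobs[entry.digest]  # unchanged upstream: reuse cached bytes
        self.put(name, body, etag)
        return body


def make_server(body, etag):
    """Simulated origin that answers 304 when the client's ETag still matches."""
    calls = {"full": 0, "not_modified": 0}

    def fetch(if_none_match):
        if if_none_match == etag:
            calls["not_modified"] += 1
            return 304, b"", etag
        calls["full"] += 1
        return 200, body, etag

    return fetch, calls


fetch, calls = make_server(b"id,name\n1,a\n", '"v1"')
cache = ContentCache()
first = cache.get("demo", fetch)   # full download
second = cache.get("demo", fetch)  # revalidated, body not re-sent
```

The second call transfers no body: the cache presents its ETag, the origin answers 304, and the bytes are served from the digest-addressed store.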

qsv 21.0.0 raises the minimum supported Rust version to 1.96 and upgrades to Polars 0.54, which is why this is a major version bump - existing pipelines are otherwise source-compatible.

Highlights

  • profile - generate standards-compliant dataset metadata.
    A new command that profiles a dataset and projects it into open metadata standards (DCAT-US v3, DCAT-AP v3, Croissant 1.1, and Geoconnex) via a configurable, YAML-driven, MiniJinja-powered projection engine, with optional SHACL/mlcroissant/pyshacl validation and embedded descriptive statistics & frequency tables so you can further customize the metadata schema mappings (#3898, #3901, #3908, #3912, #3916, #3918).
  • get - fetch tabular data from anywhere into a local cache.
    A new command (issue #2263) that retrieves CSV/TSV and other tabular data from HTTP(S) URLs, cloud object stores (s3://, gs://, az://), and CKAN portals (ckan://), then stores it in a content-addressed disk cache. Cached entries are addressable via a dc:<name> input prefix usable by any other qsv command, carry BLAKE3 + ETag provenance, support TTL/policy controls, and revalidate conditionally (HTTP If-None-Match / 304 Not Modified). Cloud sources are gated behind the opt-in get_cloud sub-feature; streaming, ranged/parallel downloads and a dc: stats cache landed in Phase 3 (#3953, #3958).
  • geocode with OpenCage support.
    New geocode subcommands call the OpenCage geocoding API for forward and reverse geocoding, with a persistent on-disk result cache and %dyncols: support (issue #1295, #3876, #3878).
  • describegpt describes meaning, not just types.
    A richer Semantic Markdown Data Dictionary format optimized for agents & data catalogs; a JSON Schema (draft 2020-12) output format; and LLM-inferred date/datetime content types round out describegpt's semantic-description capabilities (#3933, #3935, #3871, #3884).
  • Mergeable / variance-bounded sampling in sample
    two new sampling modes plus a sketch-IO surface that lets users sample sharded inputs and combine the results without re-reading the whole corpus. Both modes are native Rust implementations written from the original algorithm papers. The Apache DataSketches project's Sampling family implements the same family of algorithms in C++/Java/Python — qsv does not bind to or depend on that code (the datasketches Rust crate doesn't expose Sampling-family sketches), so the on-disk format is qsv-specific and not interoperable with DataSketches serialized sketches.
    • --varopt <col>
      variance-bounded weighted reservoir sampling using the A-ExpJ keying scheme of Efraimidis & Spirakis (2006). Each record gets a key u^(1/w) and the top-k keys are retained. Unlike --weighted (which is single-pass acceptance-rejection requiring a max_weight from the stats cache), --varopt is a true reservoir sampler — no stats cache required, single pass, bounded memory, and mergeable across partitions.
    • --mergeable-reservoir
      uniform reservoir using Vitter's Algorithm R. Same statistical distribution as the default RESERVOIR method, but the resulting sampler state is mergeable.
    • --sketch-out <file> / --sketch-in <file1,file2,...>
      serialize the sampler state to a binary blob and merge across runs. Sketches embed the source CSV header so --sketch-in re-emits a schema-bearing CSV without consulting the source files. Sampler-kind mismatch (mixing a reservoir blob with a varopt blob) is rejected. Works with both new sampling modes.
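
To make the "mergeable" property concrete, here is a small Python sketch of key-based weighted reservoir sampling as described above: each record gets a key u^(1/w) and the top-k keys are kept. It is an illustration only, not qsv's code or sketch format; the function names are hypothetical, and it uses the plain keyed variant without the exponential-jump optimization. Because a sketch is just "the k largest keys seen so far", merging shards is taking the k largest keys across their sketches.

```python
import heapq
import random


def assign_keys(rows, rng):
    """Give each (row, weight) pair a key u**(1/w), with u drawn from Uniform(0, 1)."""
    return [(rng.random() ** (1.0 / w), row) for row, w in rows if w > 0]


def top_k(keyed_rows, k):
    """A sketch is simply the k entries with the largest keys."""
    return heapq.nlargest(k, keyed_rows)


def merge(sketches, k):
    """Merging sketches = top-k over the union of their retained entries."""
    return top_k([item for sketch in sketches for item in sketch], k)


rng = random.Random(42)
rows = [(f"row{i}", 1.0 + (i % 5)) for i in range(100)]  # hypothetical weights
keyed = assign_keys(rows, rng)

k = 10
full = top_k(keyed, k)            # one pass over everything
shard_a = top_k(keyed[:50], k)    # two shards sampled independently
shard_b = top_k(keyed[50:], k)
merged = merge([shard_a, shard_b], k)
```

With the same keys, the merged sample is identical to the single-pass sample, which is why shards never need to be re-read.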

Detailed MCP Server and Cowork Plugin changes are documented in the MCP Server/Cowork Plugin CHANGELOG.


Added

  • sample: --varopt <col> flag for variance-bounded weighted reservoir sampling (A-ExpJ keying, Efraimidis & Spirakis 2006). See Headline above.
  • sample: --mergeable-reservoir flag for a uniform reservoir sampler whose state is mergeable across runs (same distribution as the default RESERVOIR method). See Headline above.
  • sample: --sketch-out <file> / --sketch-in <files> for serializing and merging sampler state across runs. Sketches carry their source CSV header so merged output is schema-bearing.
  • geocode: new cache-clear, cache-prune & cache-info subcommands to manage the persistent on-disk OpenCage result cache. cache-clear wipes the cache, cache-prune --older-than <val> deletes entries older than an absolute date or a relative age (e.g. 30d, 2w), and cache-info reports the cache directory, entry count, on-disk size and oldest/newest entry timestamps.
  • profile: new bundled geoconnex projection profile + pyshacl validator wired to the Internet of Water's Geoconnex SHACL shapes (vendored under resources/geoconnex/shacl/, embedded in the qsv binary). Phase 1 is dataset-level only — DatasetShape / ProviderShape / PublisherShape / DistributionShape coverage; the row-per-feature LocationOrientedShape (with mandatory gsp:asWKT geometry synthesis from lat/lon columns) is deferred to a follow-up. Gated behind a new geoconnex cargo feature — present in qsv (via distrib_features) and as an opt-in for qsvdp (-F datapusher_plus,geoconnex); not available in qsvlite / qsvmcp.
  • 🆕 get: new command for fetching tabular data from HTTP(S) URLs, cloud object stores (s3://, gs://, az://) and CKAN portals (ckan://) into a content-addressed local disk cache. Cached entries are reusable by any other qsv command via the dc:<name> input prefix, carry BLAKE3/ETag provenance plus record-count and TTL metadata, and revalidate conditionally over HTTP (If-None-Match / 304 Not Modified). Subcommands include cache-set-ttl, cache-set-policy and cache-list --verify. Cloud sources are gated behind the opt-in get_cloud sub-feature (via object_store, no new transitive crates). Available in qsv/qsvmcp/qsvdp (not qsvlite). Issue #2263 (#3953, #3958).
  • 🆕 profile: new command for profiling a dataset and projecting it into open metadata standards — DCAT-US v3, DCAT-AP v3 and Croissant — through a YAML-driven projection engine, with optional external validation (mlcroissant for Croissant, pyshacl for DCAT-AP/Geoconnex SHACL shapes) and embedded descriptive statistics & frequency tables. Accepts local files, URL inputs and stdin. Available in qsv, qsvmcp and qsvdp; not in qsvlite. (The bundled geoconnex projection profile is the only part gated further — to qsv/qsvdp via the geoconnex feature.) (#3898, #3901, #3904, #3908, #3910, #3911, #3912, #3918).
  • geocode: new OpenCage online geocoding subcommands for forward and reverse geocoding via the OpenCage API, including %dyncols: support to materialize multiple result fields as new columns. Issue #1295 ([#3876](https://gith...

20.1.0

18 May 03:45

[20.1.0] - 2026-05-18 🤖 The "Synthetic Data" Release 🎲

A feature-packed minor release headlined by a brand-new synthesize command for generating realistic fake CSV data, a much smarter describegpt that can now describe what your columns mean (not just their data types), and new "approximate stats" modes that let stats and frequency keep working on files that are much bigger than your computer's memory. No breaking changes — pipelines built on 20.0.0 will upgrade in place.

Highlights

  • 🆕 synthesize — generate realistic fake CSVs from a real one. Point it at a source file and it produces a new CSV of any size whose columns look and behave like the original — same value mix, same distribution shape, same null rate — but without any of the original records. Useful for sharing test data, populating staging environments, or building demos without leaking real customer data.

    • Categorical columns (e.g. country, status) are rebuilt by sampling the real values in the same proportions they appear.
    • Numeric and date columns preserve the shape of the distribution, not just the min/max — so the synthetic data has realistic clusters, not a flat random spread.
    • Null rates are matched per column.
    • --seed makes output fully reproducible — same seed, same file, every time.
    • --dictionary / --infer-content-type plugs in the new describegpt Content Types (see next bullet) so columns recognized as e.g. email, phone, city, or credit_card are filled with realistic-looking fakes instead of generic random strings. --locale picks from 14 regional flavors (US, FR, JP, etc.) so the fakes match your audience.
    • Cross-column correlations (e.g. keeping city and zip_code consistent within a row) aren't modeled by default, but turning on describegpt's --two-pass option (see next bullet) lets the LLM detect related fields, and synthesize will then keep those relationships consistent in the generated rows.
  • 🧠 describegpt got a lot smarter — it can now label what your columns mean. In addition to qsv's existing type detection (Integer, Float, Date, etc.), describegpt can now ask an LLM to classify each column with a semantic label from a 47-token vocabulary covering people, addresses, companies, technical identifiers, and more — so a column of strings isn't just "String", it's email, street_address, job_title, or credit_card. These labels are what powers synthesize's realistic fakes, but they're also useful on their own as auto-generated data dictionaries.

    • --two-pass runs the LLM a second time over the ENTIRE Data Dictionary so it can spot relationships between columns (e.g. "this is a state_abbr because the next column is a zip_code") and fix sloppy first-pass labels. This is also what unlocks cross-column consistency in synthesize (see previous bullet).
    • Deterministic unique_id tag — columns where every value is unique (like IDs and UUIDs) are tagged by qsv directly, before the LLM ever sees them. That means the label is 100% reproducible and doesn't drift between LLM versions.
    • Smarter time/duration handling — duration columns can carry a realistic upper bound (e.g. "0–1 hour") so synthetic latency or TTL values stay believable instead of ranging out to absurd numbers.
    • --markdown-template lets you customize the generated Data Dictionary's Markdown output — add your team's review checklist, restructure the per-column layout, whatever fits your docs.
    • Lower LLM costs — the default prompts were restructured to stop re-sending the dictionary on every step, measurably cutting token usage on multi-phase runs.
  • 📊 Approximate stats for huge files — stats and frequency no longer give up when a file is much bigger than your RAM. New opt-in modes use Apache DataSketches algorithms that compute approximate-but-bounded-error answers in a tiny fraction of the memory. Three new modes across two commands:

    • For stats: --quantile-method tdigest for approximate percentiles (t-digest) and --cardinality-method hll for approximate distinct counts (HyperLogLog).
    • For frequency: --sketch-method misra-gries for approximate top-K most-frequent values (Misra-Gries Frequent Items).
    • Automatic when you'd otherwise OOM: if qsv detects the file is too big to fit in memory, it now auto-switches to the approximate modes (and tells you which ones), instead of failing. Pass --quantile-method exact (etc.) to force the precise calculation regardless.
    • Cache stays correct: the stats cache key now includes the chosen mode, so switching between exact and approximate modes won't accidentally return stale results.
    • Note: these modes require a "little-endian" CPU, which covers all common hardware (Intel, AMD, Apple Silicon, ARM, etc.). Exotic platforms like IBM s390x get a clear error message instead.
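
For readers curious how a bounded-memory top-K works, here is a minimal Python sketch of the classic Misra-Gries frequent-items algorithm named above. It is the textbook version, not the Apache DataSketches implementation that qsv uses, and the function name is hypothetical. With k - 1 counters, every value occurring more than n/k times survives, and each reported count undercounts by at most n/k.

```python
def misra_gries(stream, k):
    """Textbook Misra-Gries summary using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement everything and drop counters that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a", "b", "a", "c", "a", "d", "a", "e", "a", "a"]
summary = misra_gries(stream, k=3)
true_count = stream.count("a")
```

Here "a" occurs 6 times in 10 records, so it is guaranteed to remain in the summary; its reported count is lower than the true count, but by no more than n/k.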

Detailed MCP Server and Cowork Plugin changes are documented in the MCP Server/Cowork Plugin CHANGELOG.


Added

  • synthesize: new top-level command (see Headline) #3854
  • synthesize: --consistent-fakes for stable source→fake mapping #3865
  • synthesize: --locale option for 14 fake-rs locales #3860
  • describegpt: --two-pass cross-field Data Dictionary refinement #3863
  • describegpt: deterministic unique_id Content Type for ALL_UNIQUE fields #3862
  • describegpt,synthesize: infer Content Type for temporal fields with LLM-hinted duration cap #3861
  • describegpt,synthesize: 5 new Content Type tokens — street_name, license_plate, industry, profession, ipv6_address
  • describegpt: --markdown-template for customizable Markdown output #3834
  • pivotp: --agg quantile@<p> (alias q@<p>) with linear interpolation #3842
  • stats/frequency: opt-in Apache DataSketches modes — HLL cardinality, Frequent Items top-K #3840
  • stats: widened BLAKE3 fingerprint to cover all streaming stats #3824

Changed

  • stats/frequency: auto-enable Apache DataSketches estimators (t-digest + HyperLogLog for stats; Misra-Gries Frequent Items for frequency) when util::mem_file_check reports OOM, in addition to the existing auto-index fallback. A wwarn! is emitted listing the auto-enabled estimators; explicit --quantile-method exact / --cardinality-method exact / --sketch-method exact still suppresses the auto-enable #3843
  • stats: three opt-in micro-optimizations — simdutf8 output, t-digest quantiles, mode-cardinality cap #3839
  • synthesize: use string-length stats for unstructured text columns #3864
  • describegpt: inline {{ dictionary }} in default description/tags prompts; skip redundant chat-message dictionary injection when the template already inlines it
  • synthesize: handle both describegpt-wrapped and raw dictionary JSON
  • refactor: adopt Rust 1.95 cfg_select! macro at platform-conditional sites #3846
  • perf: promote bytes_to_cow_str helper to util and sweep callsites
  • perf(moarstats): hint rare branches with core::hint::cold_path() #3823
  • perf(stats): mark non-UTF-8 branch cold
  • perf(frequency): hint UTF-8 failure as cold in the ignore-case hot loop #3821
  • refactor(stats): shrink and tidy WhichStats #3822
  • refactor(publish): fetch tags and enforce SemVer for debian package releases
  • refactor(benchmarks): harden benchmarks.sh error handling and cross-platform support #3814
  • deps: bump polars (latest upstream), calamine 0.34→0.35, csvlens fork with bumped arrow, sysinfo 0.38.4→0.39.2, rust_decimal 1.41→1.42, tokio 1.52.1→1.52.3, filetime 0.2.27→0.2.29, jsonschema 0.46.4→0.46.5, rand_xoshiro 0.8.0→0.8.1, redis 1.2.0→1.2.1, qsv-dateparser 0.14→0.15 (adds support for ISO 8601 T-separated datetimes without a timezone suffix — e.g. 2020-01-15T08:00:00, the form produced by Python's datetime.isoformat() without astimezone(); previously misclassified by qsv stats --infer-dates as String)
  • assorted clippy cleanups across stats, frequency, pivotp, partition

Fixed

  • stats: preserve length & lex stats when column type widens to String #3856
  • stats: remove duplicate big-endian TDigestStub/HllSketchStub defs #3857
  • stats: restore big-endian build by giving slot fallbacks an accessible .0 #3850
  • stats/frequency: gate Apache DataSketches behind little-endian targets #3847
  • apply/applydp: thousands negative fractions; scope <NULL> to regex_replace #3845
  • moarstats: retry on stats coverage mismatch + fsync joined CSV parent dir [#3838](https://github.com/da...

20.0.0

03 May 03:42

[20.0.0] - 2026-05-03 🧹 The "Spring Cleaning" Release 🌱

Over the past four weeks, we did the first end-to-end pass over the qsv codebase using Claude Code, roborev, Serena, Context7 and GitHub Copilot orchestrated using a multi-agent, adversarial review workflow - systematically auditing every command for correctness, safety, and performance. The result is the largest correctness-and-safety sweep in qsv's history: ALL commands were touched by review-driven cleanups, with dozens of latent bugs, panic paths, and performance cliffs swept out, while adding more than 250 new tests across the board.

This is a major version bump because that sweep also surfaced four user-visible behaviors that were demonstrably wrong and could not be fixed without breaking compatibility:

  • safenames verify-mode now correctly counts duplicate-suffix renames as unsafe (previously under-reported).
  • enum --hash is now collision-resistant across multi-column inputs (previously ["ab","c"] and ["a","bc"] hashed identically).
  • excel --metadata csv column ordering now actually matches its header row (previously the type, visible, and headers columns held each other's values).
  • util::safe_header_names now enforces its 60-char cap in bytes end-to-end (previously chars-based, allowing UTF-8 names up to 240 bytes — past Postgres' maximum identifier length).
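
The enum --hash collision is easy to reproduce in miniature. The Python sketch below is illustrative only: SHA-256 is a stand-in (the hash function qsv uses is not stated here), the little-endian byte order of the prefix is an assumption, and both function names are hypothetical. It shows why feeding fields back-to-back collides, and how a u64 length prefix per field removes the ambiguity.

```python
import hashlib
import struct


def naive_digest(fields):
    """Hash the fields back-to-back: field boundaries are lost."""
    h = hashlib.sha256()
    for field in fields:
        h.update(field.encode("utf-8"))
    return h.hexdigest()


def prefixed_digest(fields):
    """Hash each field preceded by its byte length as a u64."""
    h = hashlib.sha256()
    for field in fields:
        data = field.encode("utf-8")
        h.update(struct.pack("<Q", len(data)))  # ASSUMPTION: little-endian u64 prefix
        h.update(data)
    return h.hexdigest()


collides = naive_digest(["ab", "c"]) == naive_digest(["a", "bc"])
still_collides = prefixed_digest(["ab", "c"]) == prefixed_digest(["a", "bc"])
```

The prefix also changes the bytes hashed for a single-column input, which is why every stored digest from earlier versions stops matching.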

Plus a few smaller but breaking corrections: headers --intersect is renamed to --union (the flag never computed an intersection), luau qsv_loadcsv headers are now 1-indexed per Lua convention, and MSRV is bumped to Rust 1.95.

Beyond the cleanup, this release adds one new top-level command:

  • NEW implode command: the inverse of explode. Groups rows by key column(s) and joins a value column into a single delimited string per group — useful for collapsing normalized output back into compact form.
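
Conceptually, implode does something like the following Python sketch. This is a simplified illustration, not the command's implementation: the column names, the "|" separator and the first-seen group ordering are all assumptions made for the example.

```python
def implode(rows, key, value, sep="|"):
    """Group rows by `key` and join each group's `value` entries with `sep`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row[value])
    # dicts preserve insertion order, so groups come out in first-seen order
    return [{key: k, value: sep.join(vals)} for k, vals in groups.items()]


exploded = [
    {"id": "1", "tag": "red"},
    {"id": "1", "tag": "blue"},
    {"id": "2", "tag": "green"},
]
imploded = implode(exploded, key="id", value="tag")
```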

And a notable performance win:

  • frequency: parallel tree-reduce of partial frequency tables delivers a ~1.3x speedup on multi-core machines. Smaller per-command perf wins also landed in fill (+22%), datefmt (+9%), cat, dedup, replace, search/searchset, and transpose.
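
The idea behind a tree-reduce of partial frequency tables can be sketched in a few lines of Python. This version runs sequentially and uses collections.Counter as a stand-in for qsv's internal tables; the function name is hypothetical. The point is the shape of the merge: tables are combined pairwise, level by level, and the merges within one level are independent of each other, so they can run on separate cores.

```python
from collections import Counter


def tree_reduce(tables):
    """Merge partial frequency tables pairwise until one table remains."""
    if not tables:
        return Counter()
    level = list(tables)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            # Each pairwise merge is independent of the others in this level.
            next_level.append(pair[0] + pair[1] if len(pair) == 2 else pair[0])
        level = next_level
    return level[0]


chunks = [["a", "b"], ["a"], ["c", "a"]]        # rows split across workers
partials = [Counter(chunk) for chunk in chunks]  # one partial table per chunk
total = tree_reduce(partials)
```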

Detailed MCP Server and Cowork Plugin changes are documented in the MCP Server/Cowork Plugin CHANGELOG.

Important

This is a major release with breaking changes. Pipelines that consume qsv excel --metadata csv by column position, store qsv enum --hash digests across versions, parse qsv safenames verify-mode output, or invoke qsv headers --intersect will need updates. See the Changed and Removed sections below for migration notes.


Added

  • implode: new command — inverse of explode #3733 (closes #917)
  • generators: mark required options in help markdown and MCP skills #3734
  • sortcheck: add --numeric and --natural flags; allocation-free streaming loop #3756
  • exclude: add stdin support and memcheck #3749

Changed

  • BREAKING excel: --metadata csv column ordering for type, visible, and headers is corrected. Previously the CSV header row declared type, visible, headers but the data rows pushed values in the order headers, typ, visible, so under each named column the wrong values appeared (the type column held the headers list, visible held the type, and headers held the visibility). The CSV output now matches the --metadata json (SheetMetadata struct) field order: index, sheet_name, type, visible, headers, column_count, …. Pipelines that consumed qsv excel --metadata csv and indexed by column position must shift those three columns; consumers that indexed by header name see corrected values automatically.
  • BREAKING enum: --hash digest values change. The hashed input now carries a u64 length prefix per field (to fix the multi-column collision bug above), so every --hash digest differs from earlier qsv versions — single-column hashes change identity values too, and stored hashes from earlier qsv versions will not match. Same input still hashes deterministically across rows and runs in ≥ this version.
  • BREAKING luau: qsv_loadcsv now returns the headers table 1-indexed (per Lua convention). Scripts that accessed headers[0] or iterated for i = 0, #headers - 1 must shift to headers[1] and for i = 1, #headers (or ipairs(headers)). Previously headers[1] returned the second header.
  • BREAKING headers: rename --intersect to --union. The flag has always computed a deduplicated union of headers across inputs, not a true set intersection — the name was a long-standing misnomer. --intersect is removed entirely (no alias) given the surrounding breaking-change window. Migration: replace qsv headers --intersect … with qsv headers --union …; output is unchanged.
  • BREAKING safenames: verify-mode (--mode v / V / j / J) outputs change. (1) Verify counts now include header positions that would be renamed by the duplicate-suffix pass — inputs containing duplicate column names will report higher unsafe counts than earlier qsv versions; the count now matches what --mode a would actually rewrite. (2) --mode V / j / J displays unsafe-header strings with leading/trailing whitespace and surrounding " already trimmed (matching what the safe-rename pass actually evaluates), and duplicate_headers is now sorted alphabetically rather than appearing in undefined HashMap iteration order. Pipelines that parsed verbose/JSON output and depended on the old ordering or untrimmed strings must update.
  • BREAKING util::safe_header_names: the 60-length cap is now enforced in bytes on the final name, including any duplicate-disambiguation suffix. Previously the truncation was chars-based (take(60).sum()) and only applied to the base, so non-ASCII headers could produce up to ~240-byte names and duplicate-disambiguated headers added _<n> after truncation, pushing past Postgres' NAMEDATALEN (63 bytes). Now the rewrite path lowercases and prepends the leading-_ prefix before truncating, then snaps to a UTF-8 char boundary at ≤60 bytes. ASCII-only inputs see the same output as before for non-suffixed cases. Long ASCII headers that previously generated 61–63-char suffixed variants will be 1–2 chars shorter at the boundary. Headers containing multibyte UTF-8 (CJK, accented chars, emoji) that previously produced names >60 bytes will now be aggressively trimmed to fit. Affects every caller (safenames, applydp, apply, fetch, python); stored mappings keyed on the old over-long forms will not match.
  • describegpt: split process_phase_output into per-branch helpers (dictionary context-only, full dictionary, JSON, TSV, TOON, Markdown). No behavior change — same output, smaller functions.
  • luau: qsv_coalesce now stringifies non-string values (numbers and booleans render via to_string; nil / arrays / objects are skipped). Previously, numbers and booleans were silently treated as missing values via as_str().unwrap_or_default(). Scripts relying on qsv_coalesce(some_bool, fallback) to skip booleans will now return "true"/"false" for the boolean.
  • describegpt: per-phase helper split, widened cache key, ~21% LOC reduction #3720 #3721 #3722
  • frequency: parallel tree-reduce of partial FTables (~1.3x speedup) #3728
  • moarstats: collapse duplicated outlier bivariate scan; safety/perf cleanup, unit tests #3718 #3719
  • validate: use cold_hint (stabilized in Rust 1.95) #3717; correctness, perf cleanup #3743 #3779
  • frequency: correctness, perf, refactor cleanup #3745
  • apply: review-driven cleanup, perf #3741
  • template: subdir bug fix, lookup perf, render-error visibility, helper extraction #3740
  • dedup: allocation-free ignore-case #3754
  • datefmt: ~9% perf #3753
  • fill: ~22% faster hot path #3762
  • replace: streaming parallel write; dead match-flag tracking #3777
  • search/searchset: parallel memory streaming; --quick fixes; USAGE alignment #3776
  • cat: rowskey speedup #3750
  • transpose: correctness, perf cleanup, polish #3781
  • cleanup: rename fail_oom_clierror; surface geocode update-check error #3806
  • applied select clippy lints
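
For the util::safe_header_names entry in this list, the difference between a chars-based and a bytes-based cap is easiest to see in code. The Python sketch below covers only the final "truncate to at most 60 bytes, then snap to a UTF-8 char boundary" step; lowercasing, the leading-_ prefix and duplicate suffixes are not modeled, and the function name is hypothetical.

```python
def truncate_utf8(name, max_bytes=60):
    """Cut `name` to at most `max_bytes` bytes without splitting a character."""
    data = name.encode("utf-8")
    if len(data) <= max_bytes:
        return name
    cut = max_bytes
    # UTF-8 continuation bytes look like 10xxxxxx. If the byte at the cut point
    # is one, the cut would split a character, so back up to its first byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


ascii_name = truncate_utf8("a" * 70)        # 1 byte per char
cjk_name = truncate_utf8("日本語" * 10)      # 3 bytes per char, 90 bytes total
accented = truncate_utf8("é" * 40, max_bytes=61)  # 2 bytes per char, odd cap
```

A chars-based cap of 60 would have let the CJK example through untouched at 90 bytes; the bytes-based cap keeps 20 characters.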

Fixed

  • excel: review-driven cleanup of src/cmd/excel.rs — fix four correctness bugs. (1) Negative --sheet indices that overshot the sheet count silently selected a wrong sheet because the abs_diff clamp "bounced" past zero (e.g. --sheet -4 on a 3-sheet workbook returned the 2nd sheet); now errors with usage error: negative sheet index N is out of range for K sheets. (2) get_requested_range l...

19.1.0

13 Apr 02:56

[19.1.0] - 2026-04-13

Note

Self-update for the pre-built binaries was broken in qsv 18.0.0 and 19.0.0. This was caused by a bug in the self-update crate that has since been fixed.
WORKAROUND: Download qsv 17.0.0 which predates the self-update bug, and use its --update or --updatenow options to upgrade to the latest release.

Detailed MCP Server and Claude Cowork Plugin changes are documented in the MCP Server/Cowork Plugin CHANGELOG.


Added

  • pivotp: add group-by mode #3698 (closes #3697)
  • pivotp: expand smart aggregation with 7 more statistics #3699

Changed

  • self_update: show actual error message when available if self_update errors out
  • moarstats: use fused multiply add for theil_sum (perf)
  • Switch to crates.io mimalloc, removing git override
  • Add HTML anchors to some stats definitions

Fixed

  • Fix 10 documentation-codebase drifts found by audit #3689
  • Fix 10 documentation-codebase drifts found by audit #3692
  • Document index support for describegpt and join
  • Use latest upstream self_update (our PR merged)
  • Homebrew qsv distribution enables more features now

Dependencies

  • Bump polars to latest upstream
  • Ensure all polars_sql features are enabled
  • Bump jsonschema from 0.45.1 to 0.46.0 #3695
  • Bump pragmastat from 12.0.1 to 12.1.0 #3693
  • Bump qsv-stats from 0.48.1 to 0.48.2 #3702
  • Bump rand from 0.10.0 to 0.10.1 #3700
  • Bump tokio from 1.51.0 to 1.51.1 #3691
  • Use nightly-2026-04-01 (same as polars)
  • bump indirect dependencies

Full Changelog: 19.0.0...19.1.0

19.0.0

06 Apr 15:12

[19.0.0] - 2026-04-07 🔐 The "FAIR Answers" Release 📐

The Reproducibility Crisis in Scientific Research is one of the principal motivators for FAIR Principles in Data Management.

With AI increasingly used in data pipelines, the need for reproducibility and auditability has become even more critical as "hallucinations" and non-deterministic outputs are inherent challenges in Generative AI.

That's why in this release, we instrumented qsv with several features to help users track, audit, and reproduce their AI-assisted data wrangling workflows more effectively. As FAIR Principles do not only apply to data, we also want "FAIR Answers" - with the last R for "Reproducible":

  • Enhanced Logging: The qsv_log tool now supports structured logging with JSON output, making it easier to parse and analyze logs for reproducibility audits (note that this is only available from the qsv MCP Server).
  • NEW blake3 Command: A new blake3 command computes BLAKE3 hashes of files or data streams, providing a fast and reliable way to verify data integrity and track file versions in workflows. Unlike the oft-used SHA-256 hash, BLAKE3 is up to 16 times faster without sacrificing security, making it ideal for large datasets and iterative processing.
  • Cowork Project Reproducibility Manifest: Building on the Cowork Project support released in 18.0.0, the qsv Cowork Plugin now creates a Project Reproducibility Manifest - a structured log of all prompts, commands, and outputs generated during a Cowork session. This manifest can be used for detailed audits of the data wrangling process, helping users understand how specific outputs were derived and enabling them to reproduce or modify the workflow with confidence.
  • Even Moarstats: The moarstats command gets even "moar" statistical tests and metrics (Trimean, Midhinge, Robust CV, Jarque-Bera, Theil Index, Mean Absolute Deviation and Simpson's Diversity Index), giving users deeper insights into their data distributions and relationships, which can be crucial for reproducibility in data analysis.
  • To Parquet Improvements: The to parquet command is re-added with a new implementation powered by Polars' LazyFrame API, providing faster and more reliable CSV-to-Parquet conversion with better schema inference and support for complex data types. New options like --infer-len and --try-parse-dates enhance the accuracy of type inference, further improving the fidelity of Parquet outputs for faster downstream analysis and reproducibility.
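
Two of the new moarstats metrics have simple textbook definitions, sketched here in Python: the trimean is (Q1 + 2·median + Q3) / 4 and the midhinge is (Q1 + Q3) / 2. The quartile method moarstats uses is not stated in these notes, so the "inclusive" method of Python's statistics.quantiles is an assumption, and the function name is hypothetical.

```python
import statistics


def trimean_midhinge(values):
    """Return (trimean, midhinge) computed from the three quartiles."""
    # ASSUMPTION: inclusive quartiles; other quartile conventions give other values.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    trimean = (q1 + 2 * q2 + q3) / 4
    midhinge = (q1 + q3) / 2
    return trimean, midhinge


data = [1, 2, 3, 4, 100]  # one extreme outlier
trimean, midhinge = trimean_midhinge(data)
mean = statistics.mean(data)
```

Both measures stay near the bulk of the data, while the arithmetic mean is pulled toward the outlier.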

Detailed MCP Server and Cowork Plugin changes are documented in the MCP CHANGELOG.


Added

  • blake3: new BLAKE3 hashing command #3658
  • to parquet: re-add subcommand powered by Polars #3674
  • to parquet: pschema.json support, --infer-len and --try-parse-dates #3680
  • pivotp: totals support #3635
  • moarstats: even moar stats #3654

Changed

  • to parquet: use LazyFrame for parquet conversion #3679
  • tojsonl: implement proper JSONL writer instead of abusing CSV writer
  • Document first-N sampling; use to_string_lossy
  • help: suppress linebreaks for options by using non-breaking hyphens #3662
  • Switch default allocator from mimalloc to jemalloc - the default allocator of polars #3684
  • Add debug_assert! to moarstats map lookups
  • Remove some unwraps

Fixed

  • docs: fix 27 stale claims found in documentation audit #3637
  • docs: correct 5 documentation inaccuracies found during audit
  • typo: | character not escaped, prematurely truncating content

Dependencies

Full Changelog: 18.0.0...19.0.0

18.0.0

20 Mar 15:04

[18.0.0] - 2026-03-20 The "StatsSighting" Cowork Plugin Release

"StatsSighting" is like "VibeCoding" but for iterative, blazing-fast, deep data analysis. "Stats" for Statistics. "Sight" for Insight - doing a comprehensive statistical profile of datasets first to inform the analysis pipeline.

The Claude Cowork Plugin comes with several agents - the "Data Analyst Agent" for deep data exploration and analysis, the "Data Wrangler Agent" for transformation and cleaning, and the "Policy Analyst Agent" for helping with policy evaluation and decision-making. Each agent has a specific role and skill set, with a shared emphasis on leveraging the qsv MCP Server's profiling and querying capabilities to understand the data before acting on it.

The qsv MCP server received major enhancements - including session logging, DuckDB-powered Parquet conversion, SQL translation hardening, and interactive working directory elicitation.

The core qsv suite also gets significant updates in this release, including the new scoresql command for pre-query SQL analysis, smarter pragmastat with stats-cache integration and comparison mode, pivotp optimizations with moarstats awareness, and formatted table output for to.


Major Features

New scoresql Command

Analyze SQL queries against CSV file caches (stats, moarstats, frequency) to produce a performance score with actionable optimization suggestions before running the query. Scoring factors include query plan analysis (EXPLAIN), type optimization, join key cardinality, filter selectivity, anti-pattern detection (SELECT *, missing LIMIT, cartesian joins), and infrastructure checks (index files, cache freshness). Supports Polars and DuckDB modes, SQL file input, and JSON output. Integrates with describegpt for AI-assisted query review. #3612, #3616, #3624

Smarter pragmastat — Stats-Cache Aware with Comparison Mode

pragmastat now reads the stats cache to automatically skip non-numeric/non-date columns, and writes its own results back to the cache for downstream commands. New --compare1 and --compare2 options let you compare two distributions side-by-side. Multiple performance optimizations make it significantly faster. #3591, #3593, #3596, #3595, #3611

pivotp — Smarter Pivoting with moarstats

pivotp now integrates with moarstats to auto-validate pivot column cardinality before execution, preventing overly wide output (>1000 columns) and guiding users toward better pivot strategies. #3606
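
The cardinality check described above can be sketched in a few lines of Python. This is an illustration only: pivotp reads cardinality from the stats/moarstats cache, whereas this hypothetical `pivot_width_ok` helper scans the data directly, and the parameter names are assumptions.

```python
import csv
import io

MAX_PIVOT_COLUMNS = 1000  # the width threshold mentioned in the release notes


def pivot_width_ok(csv_text: str, pivot_column: str,
                   limit: int = MAX_PIVOT_COLUMNS) -> tuple[bool, int]:
    """Return (ok, cardinality) for the column that would become pivot headers.

    Hypothetical helper: counts distinct values by scanning the CSV text.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    distinct = {row[pivot_column] for row in reader}
    return len(distinct) <= limit, len(distinct)
```

Each distinct value in the pivot column becomes one output column, so the distinct count is a direct proxy for output width.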

to — Named Table Support

The to command gains a --table option for CSV, XLSX and ODS output, letting you write data to a named sheet/table in workbook formats. #3572, #3580

Detailed MCP changes are documented in the MCP CHANGELOG.


Added

  • scoresql: new command — score SQL queries for safety, complexity and performance #3612
  • scoresql: SQL file support, DuckDB PATH fallback & QSV_DUCKDB_PATH rename #3616
  • to: add --table option for CSV, XLSX and ODS output #3572, #3580
  • searchset: ignore line comments in regexset files #3622
  • pragmastat: add --compare1 and --compare2 options #3591
  • pragmastat: use stats cache to only process numeric/date/datetime columns #3593
  • pragmastat: write results to stats cache #3596
  • pragmastat: multiple performance optimizations #3595, #3611
  • pivotp: smarter pivoting with moarstats integration #3606
  • describegpt: scoresql integration #3624

Changed

  • stats: reduce day-valued precision to 5 decimals #3607
  • frequency: use array_windows for pairwise comparisons
  • Use mul_add for numeric ops across the codebase for more accurate FMA
  • MSRV bumped to latest stable Rust 1.94
  • Switch csvlens dependency to upstream
  • Polars bumped to 0.53.0 (py-1.39.x series)

Fixed

  • stats: fixed big performance regression caused by memory-aware chunking logic error #3598
  • help: fine-tune markdown generation of docopt usage text #3600

Dependencies

  • Polars 0.53.0 (py-1.39.3)
  • pragmastat 11.1.0 → 12.0.0 #3589
  • qsv-stats 0.47.0 → 0.48.0 #3587
  • jsonschema 0.44.0 → 0.45.0 #3592
  • minijinja/minijinja-contrib 2.16.0 → 2.18.0
  • calamine 0.33 → 0.34
  • cached 0.58 #3594
  • Removed patched forks of self_update and pragmastat (upstream releases available)
  • Various other dependency bumps (toml, toon-format, tempfile, redis, libc, sysinfo, once_cell, spreadsheet-ods)

Full Changelog: 17.0.0...18.0.0

Note

qsv 18.0.0 is not published to crates.io. qsv depends on an unreleased git revision of Polars, and cargo publish strips [patch.crates-io] entries, causing dependency resolution to fail against the published Polars v0.53.0 on crates.io (which caps chrono <=0.4.41, incompatible with chrono 0.4.44). This will be resolved once Polars publishes a new crates.io release with updated chrono support. In the meantime, install qsv via the prebuilt binaries, various package managers, or by building from source.

17.0.0

03 Mar 14:56

[17.0.0] - 2026-03-03 "The User 🧑🏻 and Agent 🤖 Experience (UAX) Release"

This release is all about getting Human Users and AI Agents working together in harmony to wrangle data faster and more effectively - whether you're a solo analyst or a data team using Claude Desktop/Cowork/Code or Gemini.

The UAX theme introduced in 16.1.0 reaches full stride — the new qsvmcp binary variant gives AI agents a purpose-built, leaner binary; the MCP server levels up with better tool guidance, TSV output for token efficiency, reproducibility logging, DuckDB-powered Parquet conversion, automatic moarstats enrichment, SQL translation hardening, and interactive working directory elicitation. On the core CLI side, stats cache reliability improves across delimiters and output formats, sniff resolves symlinks correctly, and moarstats gets faster hot-path performance.


Major Features

New qsvmcp Binary Variant

A purpose-built binary optimized for use with the qsv MCP server, adding session logging while dispensing with unneeded features (like apply, fetch, fetchpost, foreach, to) for a faster, smaller build. The MCP server now prefers qsvmcp with automatic fallback to the full qsv binary. qsvmcp is now included in release distributions alongside qsv, qsvlite, and qsvdp.

qsv MCP Server: Agent-Native Enhancements

The MCP server (now v17.0.0) receives its biggest update yet, with features designed to make AI agents more effective at data wrangling:

  • TSV Output Format — Default output switched to TSV for ~30% token reduction in agent responses, configurable via QSV_MCP_OUTPUT_FORMAT
  • Session Logging — New qsv_log tool and automatic qsvmcp.log audit trail for reproducibility, with configurable log levels via QSV_MCP_LOG_LEVEL
  • DuckDB Parquet Conversion — When DuckDB is available, CSV-to-Parquet conversion uses DuckDB instead of sqlp for faster, more reliable conversion
  • Auto-moarstats — moarstats automatically runs after stats execution for richer statistical context at minimal cost
  • SQL Translation Hardening — Major translateSql overhaul: unique table aliases (_tbl_N), string literal protection, user-provided alias preservation, and pre-scan qualified ref fixing
  • Working Directory Elicitation — Interactive directory picker via MCP Elicitation protocol for first-time setup
  • Reserved Cache Filename Guard — Prevents accidental --output overwrites of .stats.csv and .freq.csv cache files
  • Cache-Aware SQL Guidance — Server instructions now guide agents to leverage stats and frequency caches when composing sqlp, joinp, and pivotp queries
  • Polars SQL Engine Header — Clear engine indicator differentiates Polars SQL vs DuckDB query results
  • Absolute Path Resolution — All file-path arguments now resolved to absolute paths for robustness
  • Cowork CLAUDE.md Auto-Deploy — Automatically deploys project CLAUDE.md to Claude Cowork working folder on session start (cross-platform Node.js implementation)
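
As an illustration of the Reserved Cache Filename Guard listed above, the check amounts to refusing an output path whose filename ends in one of the cache suffixes. The sketch below is in Python for brevity and is not the MCP server's code; the function name and the case-insensitive comparison are assumptions.

```python
from pathlib import Path

RESERVED_SUFFIXES = (".stats.csv", ".freq.csv")  # cache suffixes named above


def is_reserved_cache_path(output_path: str) -> bool:
    """True if writing to output_path would overwrite a cache file.

    Hypothetical helper; compares case-insensitively as an assumption.
    """
    name = Path(output_path).name.lower()
    return name.endswith(RESERVED_SUFFIXES)
```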

Detailed MCP changes are documented in the MCP CHANGELOG.


Added

  • feat: qsvmcp binary variant — purpose-built for MCP server usage, included in release distributions

Changed

  • perf(moarstats): fix outlier key bug and optimize hot-path allocations
  • perf(stats): optimize to_record() output path and weighted_mad()
  • refactor(describegpt): simplify code for clarity and reduce redundancy
  • deps: bump pragmastat from 10.0 to 11.1.0
  • deps: bump polars to latest upstream (rev 802550b)
  • deps: bump Luau from 0.708 to 0.709
  • deps: bump chrono from 0.4.43 to 0.4.44
  • deps: bump csv-nose from 0.8.0 to 1.0.1
  • deps: bump jsonschema from 0.42 to 0.44.0
  • deps: bump strum/strum_macros from 0.27.2 to 0.28.0
  • deps: bump tempfile from 3.25.0 to 3.26.0
  • deps: bump serial_test from 3.3.1 to 3.4.0
  • deps: bump actions/upload-artifact from 6 to 7
  • deps: switch csvlens to patched fork using csv-nose 1.0.1
  • deps: update ort dependency to include tls-rustls feature (by @kulnor)
  • applied select clippy suggestions

Fixed

  • fix(stats): always write stats cache as CSV regardless of output format (Snappy, TSV, etc.)
  • fix(stats): decouple Snappy compression from cache — cache files always use comma delimiter
  • fix(sniff): resolve symlinks before MIME detection and metadata lookup (#3529)
  • fix(moarstats): harden outlier test assertion and fix comment inconsistency
  • fix(describegpt): restore error logging in Redis connection failure
  • docs: fix ~70 false claims found by documentation audits across qsv and MCP server

Full Changelog: 16.1.0...17.0.0

Note

qsv 17.0.0 is not published to crates.io. qsv depends on an unreleased git revision of Polars (rev = 802550b), and cargo publish strips [patch.crates-io] entries, causing dependency resolution to fail against the published Polars v0.53.0 on crates.io (which caps chrono <=0.4.41, incompatible with chrono 0.4.44). This will be resolved once Polars publishes a new crates.io release with updated chrono support. In the meantime, install qsv via the prebuilt binaries, Homebrew, or by building from source.

16.1.0

15 Feb 16:53

[16.1.0] - 2026-02-15 📊 "The Accelerated Civic Intelligence (ACI) Release" 📊

Statistical analysis gets faster and more robust; User & Agent Experience (UAX) improvements keep the CLI parser, docs, shell completions, and MCP tool definitions in sync from a single source; and the qsv MCP Server gets leaner and smarter.

With a properly configured environment, a User can team up with several AI Agents for accelerated analysis of large, real-world, messy data (raw datasets, presentations, reports, spreadsheets, etc.) without uploading it all to the cloud or manually wrangling it into shape first. Work that would otherwise take days, if not weeks, to compile can be analyzed in a few minutes.


🌟 Major Features

New pragmastat Command

A pragmatic statistical toolkit by @AndreyAkinshin — Compute robust, median-of-pairwise statistics with the Pragmastat library. Designed for messy, heavy-tailed, or outlier-prone data where mean/stddev can mislead. See pragmastat.dev for details on the underlying algorithms and design philosophy.
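
To show why median-of-pairwise statistics resist outliers, here is a small Python sketch. It assumes the classic Hodges-Lehmann (median of pairwise averages) and Shamos (median of pairwise absolute differences) definitions; the function names are hypothetical, and the Pragmastat library's exact estimators should be checked at pragmastat.dev.

```python
from itertools import combinations, combinations_with_replacement
from statistics import mean, median


def pairwise_center(xs):
    """Median of all pairwise averages, including each value paired with itself."""
    return median((a + b) / 2 for a, b in combinations_with_replacement(xs, 2))


def pairwise_spread(xs):
    """Median of all pairwise absolute differences between distinct positions."""
    return median(abs(a - b) for a, b in combinations(xs, 2))


sample = [1, 2, 3, 4, 100]      # one extreme outlier
print(mean(sample))             # the mean is pulled toward the outlier
print(pairwise_center(sample))  # the pairwise center stays near the bulk
print(pairwise_spread(sample))
```

This brute-force version is quadratic in the number of values, which is fine for a demonstration but not for large columns.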

Frequency Cache System

New --frequency-jsonl option for the frequency command creates a JSONL cache (analogous to stats --stats-jsonl) that accelerates repeated frequency analysis. Uses a hybrid strategy for high-cardinality columns with configurable thresholds.

Improved UAX: Unified Documentation & Shell Completions

A new docopt-based parsing system now generates markdown documentation, shell completions, and MCP tool definitions from the same USAGE text that powers qsv's CLI parsing. Everything stays in sync automatically — no more drift between help text, docs, completions and AI tooling.

  • --generate-help-md flag produces polished markdown docs with section navigation, emoji legends, clickable URLs, and argument/option tables that are both Human and Agent-friendly.
  • Shell completions are now auto-generated, replacing 68 manually maintained completion files.

qsv MCP Server: Leaner Architecture

The qsv_pipeline tool has been removed in favor of direct sequential command execution. In practice, agents were already calling commands one at a time, and removing the pipeline abstraction made the server simpler, more predictable, and easier to debug. Additional MCP improvements include:

  • Extended AI agent guidance to take advantage of frequency and stats caches
  • Seamless support for Google Gemini CLI thanks to @kulnor's continuing contributions
  • Major codebase refactoring: deduplicated helpers, extracted filesystem tools, fixed any types, and various bug fixes

Detailed MCP changes are documented in the MCP CHANGELOG.


Added

  • feat: pragmastat command — pragmatic statistical toolkit with parallelism, progress bar, and memcheck (by @AndreyAkinshin)
  • feat: frequency --frequency-jsonl — JSONL frequency cache with hybrid strategy for high-cardinality columns
  • feat: --generate-help-md flag — auto-generate markdown docs from USAGE text with section navigation, emoji legends, and clickable URLs
  • docs: add QSV_FREQ_HIGH_CARD_THRESHOLD and QSV_FREQ_HIGH_CARD_THRESHOLD_PCT env vars

Changed

  • perf: stats — skip redundant modes tracking, reduce allocations, optimize cache line layout, deterministic antimode sorting
  • perf: pragmastat — reduce redundant computations, add parallelism
  • perf: frequency — use sort_unstable_by for faster sorting; parallel computation for high-cardinality columns
  • refactor: shell completions auto-generated from USAGE text (removed 68 manual files)
  • refactor: describegpt — disambiguate "Other" bucket from literal "Other" in Data Dictionary Examples column
  • deps: bump anstream from 0.6.21 to 1.0.0
  • deps: bump futures to 0.3.32
  • deps: bump jsonschema from 0.41 to 0.42
  • deps: bump libc from 0.2.180 to 0.2.181
  • deps: bump memmap2 from 0.9.9 to 0.9.10
  • deps: bump polars to latest upstream
  • deps: bump pyo3 from 0.28.0 to 0.28.1
  • deps: bump quickcheck from 1.0.3 to 1.1.0
  • deps: bump rand from 0.9 to 0.10, rand_hc to 0.5, rand_xoshiro to 0.8
  • deps: bump sysinfo from 0.37.2 to 0.38.2
  • deps: bump tempfile from 3.24.0 to 3.25.0
  • deps: bump toml from 0.9.12 to 1.0.1
  • deps: bump uuid from 1.20.0 to 1.21.0
  • deps: bump zmij from 1.0.20 to 1.0.21
  • deps: update csv patched fork MSRV to 1.93

Fixed

  • fix: frequency — normalize delimiter for cache compatibility; deterministic output with secondary sort key; hybrid cache for high-cardinality columns
  • fix: stats — remove unsafe block; deterministic antimode sorting
  • fix(help): section detection, acronym casing, and option word-wrap in markdown generation
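
The deterministic-output fix for frequency relies on a secondary sort key to break ties between equal counts. A minimal Python sketch of the idea follows; the release notes do not say which secondary key qsv uses, so ordering ties by value ascending is an assumption, as is the function name.

```python
def rank_frequencies(counts: dict[str, int]) -> list[tuple[str, int]]:
    """Sort by count descending, then by value ascending to break ties.

    Hypothetical helper; the tie-break key is an assumption.
    """
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Without the tie-break, values with equal counts can come out in whatever order the hash map or parallel workers produced them, so two runs over the same data may differ.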

Removed

  • removed 68 manual shell completion files (now auto-generated from USAGE text)

Full Changelog: 16.0.0...16.1.0

16.0.0

09 Feb 04:29

[16.0.0] - 2026-02-08 🤖 "The AI-Native Release" 🤖

This release makes qsv deeply AI-native, from smarter date detection that flows through to Polars schemas to an MCP Plugin layer that lets AI agents wield qsv as a first-class data tool.

Claude Desktop, Code, and Cowork users can now use qsv's powerful data-wrangling capabilities directly within their AI workflows, with intelligent guidance and seamless integration. Google Gemini is now also supported thanks to @kulnor.


🌟 Major Features

Smarter Date/DateTime Detection

qsv can now automatically detect date and datetime columns and carry that knowledge through the entire pipeline:

  • stats --dates-whitelist sniff is now the default — qsv sniffs the first 1000 rows to identify date/datetime field candidates for further guaranteed date/datetime type inferencing
  • schema auto-detects Date/DateTime columns when generating Polars schemas (.pschema.json)
  • DateTime type support in Polars schema parsing — temporal types are preserved through sqlp, joinp, and Parquet conversion
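
The sniffing step above can be pictured as follows: sample up to 1000 rows and keep only the columns whose non-empty values all parse as dates. This Python sketch is an illustration under stated assumptions, not qsv's implementation. The format list is a small hypothetical subset and the function names are invented.

```python
import csv
import io
from datetime import datetime

SNIFF_ROWS = 1000  # sample size mentioned above
FORMATS = ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S", "%m/%d/%Y")  # illustrative subset


def looks_like_date(value: str) -> bool:
    """True if value parses with any of the sample formats."""
    for fmt in FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def sniff_date_columns(csv_text: str, sample_rows: int = SNIFF_ROWS) -> list[str]:
    """Return columns whose sampled non-empty values all look like dates."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    candidates = set(fieldnames)
    has_value = set()  # columns with at least one non-empty sampled value
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break
        for col in list(candidates):
            value = (row[col] or "").strip()
            if not value:
                continue
            has_value.add(col)
            if not looks_like_date(value):
                candidates.discard(col)
    return [c for c in fieldnames if c in candidates and c in has_value]
```

Sniffing only narrows the candidate list cheaply; the full pass over the data still has to confirm the type for every row.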

Hardened Stats Cache

The stats cache system that accelerates frequency, schema, tojsonl, sqlp, joinp, pivotp, diff, and sample is now more robust:

  • Simplified API: Removed dataset_stats from get_stats_records(), streamlining all downstream consumers
  • Safe fallback: Corrupted or unparsable cache files are gracefully handled instead of erroring out
  • Auto-regeneration: Stats cache regenerates on parse error rather than failing
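
The fallback and auto-regeneration behavior above follows a common cache pattern: try to parse the cache, and on any read or parse failure recompute and rewrite it instead of erroring out. The Python sketch below illustrates that pattern only. `compute_stats`, `load_stats`, and the JSONL layout are hypothetical stand-ins, not qsv's API.

```python
import json
from pathlib import Path


def compute_stats(data_path: Path) -> list[dict]:
    """Stand-in for the expensive stats computation (hypothetical)."""
    lines = data_path.read_text().splitlines()
    header = lines[0].split(",") if lines else []
    return [{"field": name, "rows": len(lines) - 1} for name in header]


def load_stats(data_path: Path, cache_path: Path) -> list[dict]:
    """Read cached stats; if the cache is missing or corrupt, regenerate it."""
    try:
        text = cache_path.read_text()
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    except (OSError, ValueError):  # json.JSONDecodeError is a ValueError
        records = compute_stats(data_path)
        cache_path.write_text("\n".join(json.dumps(r) for r in records) + "\n")
        return records
```

Treating a bad cache as a cache miss keeps downstream commands working, at the cost of one recomputation.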

Enhanced MCP Server (16.0.0)

The qsv MCP Server receives its largest update yet — see MCP CHANGELOG for full details.


Breaking Changes

  1. diff command: --force option removed
    • Was used for short-circuiting diffs based on dataset_stats
    • No longer needed after stats cache API simplification
  2. to command: parquet subcommand removed
    • Use dedicated qsv_to_parquet MCP tool or sqlp for Parquet output

Added

  • feat: stats — add 'sniff' support for --dates-whitelist
  • feat: schema — auto-detect Date/DateTime columns for Polars schema via sniff
  • feat: Support DateTime type in Polars schema parsing

Changed

  • refactor: stats — make --dates-whitelist sniff the default
  • perf: Use foldhash HashMap/HashSet across codebase for faster hashing
    • Replaces std::collections with foldhash in 14 modules
    • foldhash is much faster than std::collections for non-crypto hashing
  • refactor: stats Remove dataset_stats from stats cache system
    • Simplified get_stats_records() API
    • Centralized rowcount handling in sample command
    • Adapted diff, pivotp, sample, and other commands to new API
  • refactor: stats Stats cache now regenerates on parse error (improved robustness)
  • refactor: stats Safe fallback on corrupted stats cache
  • refactor: pivotp use sparsity for suggestions and uniqueness_ratio for pivot heuristics
  • refactor: sample lazily compute row_count only for sampling methods that need it
  • deps: bump async-compression to 0.4.39
  • deps: bump bytes from 1.11.0 to 1.11.1
  • deps: bump calamine to 0.33
  • deps: bump csv-nose from 0.7.0 to 0.8.0
  • deps: bump csvlens to latest upstream (PR merged)
  • deps: bump geosuggest to latest upstream
  • deps: bump flate2 from 1.1.8 to 1.1.9
  • deps: bump jsonschema from 0.40.0 to 0.41 (latest upstream with unreleased perf improvements)
  • deps: bump polars from 0.52.0 at py-1.38.1 tag to 0.53
  • deps: bump pyo3 from 0.27.2 to 0.28.0
  • deps: bump redis from 1.0.2 to 1.0.3
  • deps: bump regex from 1.12.2 to 1.12.3
  • deps: bump reqwest from 0.13.1 to 0.13.2
  • deps: bump zerocopy from 0.8.35 to 0.8.36
  • deps: bump zip from 6 to 7
  • deps: bump zmij from 1.0.17 to 1.0.20
  • deps: bump bundled Luau from 0.706 to 0.708
  • deps: bump @modelcontextprotocol/sdk (MCP)
  • applied several clippy lint suggestions
  • applied several GH Copilot and Claude review suggestions

Fixed

  • fix: frequency column selection when using --select option in different order
    • Now looks up cardinality by column name instead of index
    • Handles user-selected/reordered column subsets correctly
  • fix: sample handle missing min weight in stats cache
  • fix: validate adapt tests to jsonschema 0.40.2 error message format changes
  • fix: joinp switch pschema serialization to serde_json for compound type support
  • fix: excel adjust jsonl path usage caused by calamine 0.33 release
  • fix: stats return sentinel when sniff finds no date columns
  • fix: config QSV_NO_HEADERS environment variable being ignored; split no_headers into explicit setter and CLI flag method

Removed

  • removed to parquet subcommand in favor of dedicated qsv_to_parquet MCP tool and sqlp Parquet output support
  • removed cargo install instructions from README: qsv is rarely cargo-installable because it regularly uses patched forks, and cargo install doesn't support git dependencies.

Full Changelog: 15.0.1...16.0.0

15.0.1

28 Jan 12:38

[15.0.1] - 2026-01-28

Oops, we celebrated color and the magika-powered revamped sniff but forgot to actually enable them in the release prebuilts! 🤦🏻‍♂️
This patch enables the new color command and turns on magika, along with several fixes and dependency bumps.

Changed

  • deps: bump polars to latest upstream
  • deps: bump csv-nose from 0.6.0 to 0.7.0
  • deps: bump mlua from 0.11.5 to 0.11.6
  • deps: bump minijinja from 2.14.0 to 2.15.1
  • deps: bump minijinja-contrib from 2.14.0 to 2.15.1
  • deps: bump siphasher from 1.0.1 to 1.0.2
  • deps: bump iana-time-zone from 0.1.64 to 0.1.65
  • deps: bump hono from 4.11.4 to 4.11.7 (MCP)
  • build: add color feature to build and test workflows
  • build: add magika feature to publishing workflows
  • docs: updated luau documentation to reflect bundled Luau 0.706
  • docs: sniff is now also 🤖-powered with its use of Magika mime-type detection

Fixed

  • tests: fix flaky color test_get_theme test (now ignored due to environment dependencies)
  • tests: fix flaky search JSON test by using semantic rather than byte-by-byte compare
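
The difference between a semantic and a byte-by-byte JSON comparison is easy to show. The following is a minimal Python sketch of the idea, not the actual Rust test code; `json_equal` is a hypothetical helper name.

```python
import json


def json_equal(left: str, right: str) -> bool:
    """Compare two JSON documents by parsed value rather than raw text."""
    return json.loads(left) == json.loads(right)


a = '{"count": 2, "rows": [1, 2]}'
b = '{ "rows":[1,2], "count":2 }'  # same data, different key order and spacing
print(a == b)            # raw text differs
print(json_equal(a, b))  # parsed values match
```

Key order and whitespace are not significant in JSON objects, so a byte-level comparison can fail even when the output is correct.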

Full Changelog: 15.0.0...15.0.1