Releases: dathere/qsv
21.0.0
[21.0.0] - 2026-06-08 🌐 The "F-AI-Rification" Release 📇
FAIR Data is AI-Ready Data. It is the perfect context for AI applications - it's compact and token-efficient, and it vastly improves an Agent's understanding of your Data. A few hundred kilobytes of FAIR metadata is often all it takes to comprehensively describe gigabyte- or terabyte-scale data.
This major release builds even further on qsv's existing FAIRification capabilities, with two new commands and expanded geocoding:
- `profile` - extracts standards-compliant dataset FAIR metadata (DCAT-US v3, DCAT-AP v3, Croissant 1.1 and Geoconnex);
- `get` - fetches tabular data from HTTP(S), cloud object stores (AWS S3/Google Cloud/Microsoft Azure) and CKAN portals into a content-addressed local cache, making it even easier to FAIRify remote data and/or use it to enrich your data corpus;
- `geocode` goes online with OpenCage. Geospatially contextualize and normalize your location data with OpenCage. Unlike other geocoders, OpenCage is built from open data; has the most permissive licensing, allowing data to be displayed on ANY map and cached indefinitely; and is much cheaper to boot!
qsv 21.0.0 raises the minimum supported Rust version to 1.96 and upgrades to Polars 0.54, which is why this is a major version bump - existing pipelines are otherwise source-compatible.
Highlights
- `profile` - generate standards-compliant dataset metadata.
  A new command that profiles a dataset and projects it into open metadata standards (DCAT-US v3, DCAT-AP v3, Croissant 1.1, and Geoconnex) via a configurable, YAML-driven, MiniJinja-powered projection engine, with optional SHACL/`mlcroissant`/`pyshacl` validation and embedded descriptive statistics & frequency tables so you can further customize the metadata schema mappings (#3898, #3901, #3908, #3912, #3916, #3918).
- `get` - fetch tabular data from anywhere into a local cache.
  A new command (issue #2263) that retrieves CSV/TSV and other tabular data from HTTP(S) URLs, cloud object stores (`s3://`, `gs://`, `az://`), and CKAN portals (`ckan://`), then stores it in a content-addressed disk cache. Cached entries are addressable via a `dc:<name>` input prefix usable by any other qsv command, carry BLAKE3 + ETag provenance, support TTL/policy controls, and revalidate conditionally (HTTP `If-None-Match` / `304 Not Modified`). Cloud sources are gated behind the opt-in `get_cloud` sub-feature; streaming, ranged/parallel downloads and a `dc:` stats cache landed in Phase 3 (#3953, #3958).
- `geocode` with OpenCage support.
  New `geocode` subcommands call the OpenCage geocoding API for forward and reverse geocoding, with a persistent on-disk result cache and `%dyncols:` support (issue #1295, #3876, #3878).
- `describegpt` describes meaning, not just types.
  A richer Semantic Markdown Data Dictionary format optimized for agents & data catalogs; a JSON Schema (draft 2020-12) output format; and LLM-inferred date/datetime content types round out `describegpt`'s semantic-description capabilities (#3933, #3935, #3871, #3884).
- Mergeable / variance-bounded sampling in `sample`.
  Two new sampling modes plus a sketch-IO surface that lets users sample sharded inputs and combine the results without re-reading the whole corpus. Both modes are native Rust implementations written from the original algorithm papers. The Apache DataSketches project's Sampling family implements the same family of algorithms in C++/Java/Python, but qsv does not bind to or depend on that code (the `datasketches` Rust crate doesn't expose Sampling-family sketches), so the on-disk format is qsv-specific and not interoperable with DataSketches serialized sketches.
  - `--varopt <col>`: variance-bounded weighted reservoir sampling using the A-ExpJ keying scheme of Efraimidis & Spirakis (2006). Each record gets a key `u^(1/w)` and the top-`k` keys are retained. Unlike `--weighted` (which is single-pass acceptance-rejection requiring a `max_weight` from the stats cache), `--varopt` is a true reservoir sampler: no stats cache required, single pass, bounded memory, and mergeable across partitions.
  - `--mergeable-reservoir`: uniform reservoir using Vitter's Algorithm R. Same statistical distribution as the default RESERVOIR method, but the resulting sampler state is mergeable.
  - `--sketch-out <file>` / `--sketch-in <file1,file2,...>`: serialize the sampler state to a binary blob and merge across runs. Sketches embed the source CSV header so `--sketch-in` re-emits a schema-bearing CSV without consulting the source files. Sampler-kind mismatch (mixing a reservoir blob with a varopt blob) is rejected. Works with both new sampling modes.
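The keying idea behind `--varopt` and the reason its state merges cleanly can be sketched in a few lines of Python. This is an illustration only, not qsv's Rust implementation or its sketch file format: the function names `weighted_reservoir` and `merge` are hypothetical, and the sketch uses the plain key-per-record form of Efraimidis & Spirakis (every record draws a key) rather than the exponential-jump optimization that gives A-ExpJ its name.

```python
import heapq
import random


def weighted_reservoir(records, k, rng):
    """Keep the k records with the largest keys u**(1/w).

    `records` yields (item, weight) pairs. The returned list of
    (key, item) pairs is the sampler state for this shard.
    """
    heap = []  # min-heap on key, so heap[0] is the weakest survivor
    for item, weight in records:
        if weight <= 0:
            continue  # non-positive weights can never be selected
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return heap


def merge(states, k):
    """Merge per-shard states by keeping the k largest keys overall."""
    return heapq.nlargest(k, (entry for state in states for entry in state))


# Hypothetical data: 100 rows with weights cycling through 1.0..5.0
data = [(f"row{i}", 1.0 + (i % 5)) for i in range(100)]
shard_a = weighted_reservoir(data[:40], 10, random.Random(1))
shard_b = weighted_reservoir(data[40:], 10, random.Random(2))
sample = [item for _, item in merge([shard_a, shard_b], 10)]
```

Because each record's key depends only on its own weight and its own random draw, the top-`k` keys of the whole input are always contained in the union of the per-shard top-`k` sets, which is why shards can be sampled independently and combined afterwards.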
Detailed MCP Server and Cowork Plugin changes are documented in the MCP Server/Cowork Plugin CHANGELOG.
Added
- `sample`: `--varopt <col>` flag for variance-bounded weighted reservoir sampling (A-ExpJ keying, Efraimidis & Spirakis 2006). See Highlights above.
- `sample`: `--mergeable-reservoir` flag for a uniform reservoir sampler whose state is mergeable across runs (same distribution as the default RESERVOIR method). See Highlights above.
- `sample`: `--sketch-out <file>` / `--sketch-in <files>` for serializing and merging sampler state across runs. Sketches carry their source CSV header so merged output is schema-bearing.
- `geocode`: new `cache-clear`, `cache-prune` & `cache-info` subcommands to manage the persistent on-disk OpenCage result cache. `cache-clear` wipes the cache, `cache-prune --older-than <val>` deletes entries older than an absolute date or a relative age (e.g. `30d`, `2w`), and `cache-info` reports the cache directory, entry count, on-disk size and oldest/newest entry timestamps.
- `profile`: new bundled `geoconnex` projection profile + `pyshacl` validator wired to the Internet of Water's Geoconnex SHACL shapes (vendored under `resources/geoconnex/shacl/`, embedded in the qsv binary). Phase 1 is dataset-level only: `DatasetShape`/`ProviderShape`/`PublisherShape`/`DistributionShape` coverage; the row-per-feature `LocationOrientedShape` (with mandatory `gsp:asWKT` geometry synthesis from lat/lon columns) is deferred to a follow-up. Gated behind a new `geoconnex` cargo feature, present in `qsv` (via `distrib_features`) and as an opt-in for `qsvdp` (`-F datapusher_plus,geoconnex`); not available in `qsvlite`/`qsvmcp`.
- 🆕 `get`: new command for fetching tabular data from HTTP(S) URLs, cloud object stores (`s3://`/`gs://`/`az://`) and CKAN portals (`ckan://`) into a content-addressed local disk cache. Cached entries are reusable by any other qsv command via the `dc:<name>` input prefix, carry BLAKE3/ETag provenance plus record-count and TTL metadata, and revalidate conditionally over HTTP (`If-None-Match` → `304 Not Modified`). Subcommands include `cache-set-ttl`, `cache-set-policy` and `cache-list --verify`. Cloud sources are gated behind the opt-in `get_cloud` sub-feature (via `object_store`, no new transitive crates). Available in `qsv`/`qsvmcp`/`qsvdp` (not `qsvlite`). Issue #2263 (#3953, #3958).
- 🆕 `profile`: new command for profiling a dataset and projecting it into open metadata standards (DCAT-US v3, DCAT-AP v3 and Croissant) through a YAML-driven projection engine, with optional external validation (`mlcroissant` for Croissant, `pyshacl` for DCAT-AP/Geoconnex SHACL shapes) and embedded descriptive statistics & frequency tables. Accepts local files, URL inputs and stdin. Available in `qsv`, `qsvmcp` and `qsvdp`; not in `qsvlite`. (The bundled `geoconnex` projection profile is the only part gated further, to `qsv`/`qsvdp` via the `geoconnex` feature.) (#3898, #3901, #3904, #3908, #3910, #3911, #3912, #3918).
- `geocode`: new OpenCage online geocoding subcommands for forward and reverse geocoding via the OpenCage API, including `%dyncols:` support to materialize multiple result fields as new columns. Issue #1295 ([#3876](https://gith...
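As a rough illustration of the caching behaviour described for `get` (a content-addressed store plus ETag-based conditional revalidation), here is a small in-memory Python model. It is a sketch under stated assumptions, not qsv's code: the `Cache` class, `make_server` helper and the `fetch(etag)` transport are hypothetical, and `hashlib.blake2b` stands in for BLAKE3 only because the Python standard library does not ship BLAKE3.

```python
import hashlib


class Cache:
    """Toy content-addressed cache with named entries (illustrative only)."""

    def __init__(self):
        self.blobs = {}    # content digest -> bytes
        self.entries = {}  # entry name -> {"digest": ..., "etag": ...}

    def get(self, name, fetch):
        """`fetch(etag)` is a hypothetical transport returning (status, etag, body)."""
        entry = self.entries.get(name)
        # Send the stored ETag, if any, as the If-None-Match validator.
        status, etag, body = fetch(entry["etag"] if entry else None)
        if status == 304 and entry:
            # Not Modified: serve the cached bytes, nothing was re-downloaded.
            return self.blobs[entry["digest"]], "revalidated"
        digest = hashlib.blake2b(body).hexdigest()  # stand-in for BLAKE3
        self.blobs[digest] = body                   # identical content is stored once
        self.entries[name] = {"digest": digest, "etag": etag}
        return body, "downloaded"


def make_server(body, etag):
    """Simulated origin that honours If-None-Match."""
    seen = []

    def fetch(if_none_match):
        seen.append(if_none_match)
        if if_none_match == etag:
            return 304, etag, b""
        return 200, etag, body

    return fetch, seen


fetch, seen = make_server(b"a,b\n1,2\n", '"v1"')
cache = Cache()
first = cache.get("demo", fetch)
second = cache.get("demo", fetch)
```

In this model the second call sends the stored ETag, receives a `304`, and returns the cached bytes without transferring the body again.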
20.1.0
[20.1.0] - 2026-05-18 🤖 The "Synthetic Data" Release 🎲
A feature-packed minor release headlined by a brand-new synthesize command for generating realistic fake CSV data, a much smarter describegpt that can now describe what your columns mean (not just their data types), and new "approximate stats" modes that let stats and frequency keep working on files that are much bigger than your computer's memory. No breaking changes — pipelines built on 20.0.0 will upgrade in place.
Highlights
- 🆕 `synthesize` - generate realistic fake CSVs from a real one. Point it at a source file and it produces a new CSV of any size whose columns look and behave like the original (same value mix, same distribution shape, same null rate) but without any of the original records. Useful for sharing test data, populating staging environments, or building demos without leaking real customer data.
  - Categorical columns (e.g. `country`, `status`) are rebuilt by sampling the real values in the same proportions they appear.
  - Numeric and date columns preserve the shape of the distribution, not just the min/max, so the synthetic data has realistic clusters, not a flat random spread.
  - Null rates are matched per column.
  - `--seed` makes output fully reproducible: same seed, same file, every time.
  - `--dictionary` / `--infer-content-type` plugs in the new `describegpt` Content Types (see next bullet) so columns recognized as e.g. `email`, `phone`, `city`, or `credit_card` are filled with realistic-looking fakes instead of generic random strings.
  - `--locale` picks from 14 regional flavors (US, FR, JP, etc.) so the fakes match your audience.
  - Cross-column correlations (e.g. keeping `city` ↔ `zip_code` consistent within a row) aren't modeled by default, but turning on `describegpt`'s `--two-pass` option (see next bullet) lets the LLM detect related fields, and `synthesize` will then keep those relationships consistent in the generated rows.
- 🧠 `describegpt` got a lot smarter - it can now label what your columns mean. In addition to qsv's existing type detection (Integer, Float, Date, etc.), `describegpt` can now ask an LLM to classify each column with a semantic label from a 47-token vocabulary covering people, addresses, companies, technical identifiers, and more, so a column of strings isn't just "String", it's `email`, `street_address`, `job_title`, or `credit_card`. These labels are what power `synthesize`'s realistic fakes, but they're also useful on their own as auto-generated data dictionaries.
  - `--two-pass` runs the LLM a second time over the ENTIRE Data Dictionary so it can spot relationships between columns (e.g. "this is a `state_abbr` because the next column is a `zip_code`") and fix sloppy first-pass labels. This is also what unlocks cross-column consistency in `synthesize` (see previous bullet).
  - Deterministic `unique_id` tag - columns where every value is unique (like IDs and UUIDs) are tagged by qsv directly, before the LLM ever sees them. That means the label is 100% reproducible and doesn't drift between LLM versions.
  - Smarter time/duration handling - duration columns can carry a realistic upper bound (e.g. "0–1 hour") so synthetic latency or TTL values stay believable instead of ranging out to absurd numbers.
  - `--markdown-template` lets you customize the generated Data Dictionary's Markdown output - add your team's review checklist, restructure the per-column layout, whatever fits your docs.
  - Lower LLM costs - the default prompts were restructured to stop re-sending the dictionary on every step, measurably cutting token usage on multi-phase runs.
- 📊 Approximate stats for huge files - `stats` and `frequency` no longer give up when a file is much bigger than your RAM. New opt-in modes use Apache DataSketches algorithms that compute approximate-but-bounded-error answers in a tiny fraction of the memory. Three new modes across two commands:
  - For `stats`: `--quantile-method tdigest` for approximate percentiles (t-digest) and `--cardinality-method hll` for approximate distinct counts (HyperLogLog).
  - For `frequency`: `--sketch-method misra-gries` for approximate top-K most-frequent values (Misra-Gries Frequent Items).
  - Automatic when you'd otherwise OOM: if qsv detects the file is too big to fit in memory, it now auto-switches to the approximate modes (and tells you which ones), instead of failing. Pass `--quantile-method exact` (etc.) to force the precise calculation regardless.
  - Cache stays correct: the `stats` cache key now includes the chosen mode, so switching between exact and approximate modes won't accidentally return stale results.
  - Note: these modes require a "little-endian" CPU, which covers all common hardware (Intel, AMD, Apple Silicon, ARM, etc.). Exotic platforms like IBM s390x get a clear error message instead.
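To give a feel for how a bounded-memory frequent-items estimate works, below is a textbook Misra-Gries summary in Python. It is a minimal sketch of the general algorithm, not the Apache DataSketches implementation that qsv uses; the function name `misra_gries` and the sample stream are made up for illustration.

```python
def misra_gries(stream, k):
    """Track at most k-1 counters over a stream of n items.

    Any item occurring more than n/k times is guaranteed to survive,
    and each surviving count underestimates the true count by at most n/k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free slot: decrement every counter and drop those reaching zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream of 12 values in which "a" appears 6 times.
stream = "a b a c a d a b a e a b".split()
summary = misra_gries(stream, k=3)
```

Memory use depends only on `k`, not on the number of distinct values, which is what lets an estimator of this kind keep working on inputs far larger than RAM.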
Detailed MCP Server and Cowork Plugin changes are documented in the MCP Server/Cowork Plugin CHANGELOG.
Added
- `synthesize`: new top-level command (see Highlights) #3854
- `synthesize`: `--consistent-fakes` for stable source→fake mapping #3865
- `synthesize`: `--locale` option for 14 fake-rs locales #3860
- `describegpt`: `--two-pass` cross-field Data Dictionary refinement #3863
- `describegpt`: deterministic `unique_id` Content Type for ALL_UNIQUE fields #3862
- `describegpt`, `synthesize`: infer Content Type for temporal fields with LLM-hinted duration cap #3861
- `describegpt`, `synthesize`: 5 new Content Type tokens - `street_name`, `license_plate`, `industry`, `profession`, `ipv6_address`
- `describegpt`: `--markdown-template` for customizable Markdown output #3834
- `pivotp`: `--agg quantile@<p>` (alias `q@<p>`) with linear interpolation #3842
- `stats`/`frequency`: opt-in Apache DataSketches modes - HLL cardinality, Frequent Items top-K #3840
- `stats`: widened BLAKE3 fingerprint to cover all streaming stats #3824
Changed
- `stats`/`frequency`: auto-enable Apache DataSketches estimators (t-digest + HyperLogLog for `stats`; Misra-Gries Frequent Items for `frequency`) when `util::mem_file_check` reports OOM, in addition to the existing auto-index fallback. A `wwarn!` is emitted listing the auto-enabled estimators; explicit `--quantile-method exact` / `--cardinality-method exact` / `--sketch-method exact` still suppresses the auto-enable #3843
- `stats`: three opt-in micro-optimizations - simdutf8 output, t-digest quantiles, mode-cardinality cap #3839
- `synthesize`: use string-length stats for unstructured text columns #3864
- `describegpt`: inline `{{ dictionary }}` in default description/tags prompts; skip redundant chat-message dictionary injection when the template already inlines it
- `synthesize`: handle both describegpt-wrapped and raw dictionary JSON
- `refactor`: adopt Rust 1.95 `cfg_select!` macro at platform-conditional sites #3846
- `perf`: promote `bytes_to_cow_str` helper to `util` and sweep callsites
- `perf(moarstats)`: hint rare branches with `core::hint::cold_path()` #3823
- `perf(stats)`: mark non-UTF-8 branch cold
- `perf(frequency)`: hint UTF-8 failure as cold in the ignore-case hot loop #3821
- `refactor(stats)`: shrink and tidy `WhichStats` #3822
- `refactor(publish)`: fetch tags and enforce SemVer for debian package releases
- `refactor(benchmarks)`: harden `benchmarks.sh` error handling and cross-platform support #3814
- `deps`: bump polars (latest upstream), calamine 0.34→0.35, csvlens fork with bumped arrow, sysinfo 0.38.4→0.39.2, rust_decimal 1.41→1.42, tokio 1.52.1→1.52.3, filetime 0.2.27→0.2.29, jsonschema 0.46.4→0.46.5, rand_xoshiro 0.8.0→0.8.1, redis 1.2.0→1.2.1, qsv-dateparser 0.14→0.15 (adds support for ISO 8601 `T`-separated datetimes without a timezone suffix, e.g. `2020-01-15T08:00:00`, the form produced by Python's `datetime.isoformat()` without `astimezone()`; previously misclassified by `qsv stats --infer-dates` as `String`)
- assorted clippy cleanups across `stats`, `frequency`, `pivotp`, `partition`
Fixed
- `stats`: preserve length & lex stats when column type widens to String #3856
- `stats`: remove duplicate big-endian `TDigestStub`/`HllSketchStub` defs #3857
- `stats`: restore big-endian build by giving slot fallbacks an accessible `.0` #3850
- `stats`/`frequency`: gate Apache DataSketches behind little-endian targets #3847
- `apply`/`applydp`: thousands negative fractions; scope `<NULL>` to `regex_replace` #3845
- `moarstats`: retry on stats coverage mismatch + fsync joined CSV parent dir [#3838](https://github.com/da...
20.0.0
[20.0.0] - 2026-05-03 🧹 The "Spring Cleaning" Release 🌱
Over the past four weeks, we did the first end-to-end pass over the qsv codebase using Claude Code, roborev, Serena, Context7 and GitHub Copilot orchestrated using a multi-agent, adversarial review workflow - systematically auditing every command for correctness, safety, and performance. The result is the largest correctness-and-safety sweep in qsv's history: ALL commands were touched by review-driven cleanups, with dozens of latent bugs, panic paths, and performance cliffs swept out, while adding more than 250 new tests across the board.
This is a major version bump because that sweep also surfaced four user-visible behaviors that were demonstrably wrong and could not be fixed without breaking compatibility:
- `safenames` verify-mode now correctly counts duplicate-suffix renames as unsafe (previously under-reported).
- `enum --hash` is now collision-resistant across multi-column inputs (previously `["ab","c"]` and `["a","bc"]` hashed identically).
- `excel --metadata csv` column ordering now actually matches its header row (previously the `type`, `visible`, and `headers` columns held each other's values).
- `util::safe_header_names` now enforces its 60-char cap in bytes end-to-end (previously chars-based, allowing UTF-8 names up to 240 bytes, past Postgres' maximum identifier length).
Plus a few smaller but breaking corrections: headers --intersect is renamed to --union (the flag never computed an intersection), luau qsv_loadcsv headers are now 1-indexed per Lua convention, and MSRV is bumped to Rust 1.95.
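The `enum --hash` collision listed above is a general property of hashing concatenated fields, and the length-prefix remedy is easy to demonstrate. The Python sketch below is illustrative only: qsv's actual hash function and byte layout are not stated here, so SHA-256, the little-endian byte order, and the function names `naive_digest` and `prefixed_digest` are all assumptions.

```python
import hashlib
import struct


def naive_digest(fields):
    """Hash the plain concatenation of fields: field boundaries are lost."""
    return hashlib.sha256("".join(fields).encode("utf-8")).hexdigest()


def prefixed_digest(fields):
    """Hash each field preceded by its byte length as an unsigned 64-bit integer."""
    h = hashlib.sha256()
    for field in fields:
        data = field.encode("utf-8")
        h.update(struct.pack("<Q", len(data)))  # u64 length prefix (byte order assumed)
        h.update(data)
    return h.hexdigest()


same_naive = naive_digest(["ab", "c"]) == naive_digest(["a", "bc"])
same_prefixed = prefixed_digest(["ab", "c"]) == prefixed_digest(["a", "bc"])
single_changed = prefixed_digest(["abc"]) != naive_digest(["abc"])
```

Here `same_naive` is true (both inputs reduce to the bytes `abc`), while `same_prefixed` is false. `single_changed` is true as well, which mirrors why even single-column digests differ from those of earlier versions once a prefix is added.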
Beyond the cleanup, this release adds one new top-level command:
- NEW `implode` command: the inverse of `explode`. Groups rows by key column(s) and joins a value column into a single delimited string per group - useful for collapsing normalized output back into compact form.
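The transformation itself can be pictured with a short Python sketch. This shows the general group-and-join idea only and makes no claim about `implode`'s options, output ordering, or default separator; the function name, the `|` separator, and the first-seen group order are assumptions chosen for the example.

```python
def implode(rows, key, value, sep="|"):
    """Group rows by `key` and join each group's `value` cells with `sep`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row[value])
    # Emit one row per key, in first-seen order (dicts preserve insertion order).
    return [{key: k, value: sep.join(v)} for k, v in groups.items()]


# Hypothetical normalized input: one tag per row
rows = [
    {"id": "1", "tag": "red"},
    {"id": "1", "tag": "blue"},
    {"id": "2", "tag": "green"},
]
collapsed = implode(rows, key="id", value="tag")
```

With this input, `collapsed` holds one row per `id`, with the tags for `id` 1 joined as `red|blue`.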
And a notable performance win:
- `frequency`: parallel tree-reduce of partial frequency tables delivers a ~1.3x speedup on multi-core machines. Smaller per-command perf wins also landed in `fill` (+22%), `datefmt` (+9%), `cat`, `dedup`, `replace`, `search`/`searchset`, and `transpose`.
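For readers unfamiliar with the term, a tree-reduce merges partial results pairwise in rounds instead of folding them one by one into a single accumulator. The Python sketch below shows only that structure, using `collections.Counter` as a stand-in for a frequency table; it is not qsv's Rust code, the helper names are hypothetical, and Python threads will not reproduce the reported speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def merge_pair(pair):
    left, right = pair
    left.update(right)  # Counter.update adds the counts from `right`
    return left


def tree_reduce(tables, pool):
    """Merge tables pairwise, halving the number of tables each round."""
    while len(tables) > 1:
        pairs = [(tables[i], tables[i + 1]) for i in range(0, len(tables) - 1, 2)]
        merged = list(pool.map(merge_pair, pairs))  # each round's merges are independent
        if len(tables) % 2:
            merged.append(tables[-1])  # odd table out is carried to the next round
        tables = merged
    return tables[0] if tables else Counter()


def parallel_frequency(values, chunk_size=4, workers=4):
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(Counter, chunks))  # one partial table per chunk
        return tree_reduce(partials, pool)


table = parallel_frequency(list("abracadabra"))
```

The merges within a round touch disjoint tables, so they can run concurrently, and the number of sequential merge steps grows with the logarithm of the chunk count rather than linearly.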
Detailed MCP Server and Cowork Plugin changes are documented in the MCP Server/Cowork Plugin CHANGELOG.
Important
This is a major release with breaking changes. Pipelines that consume qsv excel --metadata csv by column position, store qsv enum --hash digests across versions, parse qsv safenames verify-mode output, or invoke qsv headers --intersect will need updates. See the Changed and Removed sections below for migration notes.
Added
- `implode`: new command - inverse of `explode` #3733 (closes #917)
- `generators`: mark required options in help markdown and MCP skills #3734
- `sortcheck`: add `--numeric` and `--natural` flags; allocation-free streaming loop #3756
- `exclude`: add stdin support and memcheck #3749
Changed
- BREAKING `excel`: `--metadata csv` column ordering for `type`, `visible`, and `headers` is corrected. Previously the CSV header row declared `type, visible, headers` but the data rows pushed values in the order `headers, typ, visible`, so under each named column the wrong values appeared (the `type` column held the headers list, `visible` held the type, and `headers` held the visibility). The CSV output now matches the `--metadata json` (`SheetMetadata` struct) field order: `index, sheet_name, type, visible, headers, column_count, …`. Pipelines that consumed `qsv excel --metadata csv` and indexed by column position must shift those three columns; consumers that indexed by header name see corrected values automatically.
- BREAKING `enum`: `--hash` digest values change. The hashed input now carries a `u64` length prefix per field (to fix the multi-column collision bug above), so every `--hash` digest differs from earlier qsv versions. Single-column hashes change too, and stored hashes from earlier qsv versions will not match. The same input still hashes deterministically across rows and runs from this version onward.
- BREAKING `luau`: `qsv_loadcsv` now returns the headers table 1-indexed (per Lua convention). Scripts that accessed `headers[0]` or iterated `for i = 0, #headers - 1` must shift to `headers[1]` and `for i = 1, #headers` (or `ipairs(headers)`). Previously `headers[1]` returned the second header.
- BREAKING `headers`: rename `--intersect` to `--union`. The flag has always computed a deduplicated union of headers across inputs, not a true set intersection; the name was a long-standing misnomer. `--intersect` is removed entirely (no alias) given the surrounding breaking-change window. Migration: replace `qsv headers --intersect …` with `qsv headers --union …`; output is unchanged.
- BREAKING `safenames`: verify-mode (`--mode v / V / j / J`) outputs change. (1) Verify counts now include header positions that would be renamed by the duplicate-suffix pass, so inputs containing duplicate column names will report higher unsafe counts than earlier qsv versions; the count now matches what `--mode a` would actually rewrite. (2) `--mode V / j / J` displays unsafe-header strings with leading/trailing whitespace and surrounding `"` already trimmed (matching what the safe-rename pass actually evaluates), and `duplicate_headers` is now sorted alphabetically rather than appearing in undefined HashMap iteration order. Pipelines that parsed verbose/JSON output and depended on the old ordering or untrimmed strings must update.
- BREAKING `util::safe_header_names`: the 60-length cap is now enforced in bytes on the final name, including any duplicate-disambiguation suffix. Previously the truncation was chars-based (`take(60).sum()`) and only applied to the base, so non-ASCII headers could produce up to ~240-byte names and duplicate-disambiguated headers added `_<n>` after truncation, pushing past Postgres' `NAMEDATALEN` (63 bytes). Now the rewrite path lowercases and prepends the leading-`_` prefix before truncating, then snaps to a UTF-8 char boundary at ≤60 bytes. ASCII-only inputs see the same output as before for non-suffixed cases. Long ASCII headers that previously generated 61–63-char suffixed variants will be 1–2 chars shorter at the boundary. Headers containing multibyte UTF-8 (CJK, accented chars, emoji) that previously produced names >60 bytes will now be aggressively trimmed to fit. Affects every caller (`safenames`, `applydp`, `apply`, `fetch`, `python`); stored mappings keyed on the old over-long forms will not match.
- `describegpt`: split `process_phase_output` into per-branch helpers (dictionary context-only, full dictionary, JSON, TSV, TOON, Markdown). No behavior change: same output, smaller functions.
- `luau`: `qsv_coalesce` now stringifies non-string values (numbers and booleans render via `to_string`; nil / arrays / objects are skipped). Previously, numbers and booleans were silently treated as missing values via `as_str().unwrap_or_default()`. Scripts relying on `qsv_coalesce(some_bool, fallback)` to skip booleans will now return `"true"`/`"false"` for the boolean.
- `describegpt`: per-phase helper split, widened cache key, ~21% LOC reduction #3720 #3721 #3722
- `frequency`: parallel tree-reduce of partial FTables (~1.3x speedup) #3728
- `moarstats`: collapse duplicated outlier bivariate scan; safety/perf cleanup, unit tests #3718 #3719
- `validate`: use `cold_hint` (stabilized in Rust 1.95) #3717; correctness, perf cleanup #3743 #3779
- `frequency`: correctness, perf, refactor cleanup #3745
- `apply`: review-driven cleanup, perf #3741
- `template`: subdir bug fix, lookup perf, render-error visibility, helper extraction #3740
- `dedup`: allocation-free ignore-case #3754
- `datefmt`: ~9% perf #3753
- `fill`: ~22% faster hot path #3762
- `replace`: streaming parallel write; dead match-flag tracking #3777
- `search`/`searchset`: parallel memory streaming; `--quick` fixes; USAGE alignment #3776
- `cat`: rowskey speedup #3750
- `transpose`: correctness, perf cleanup, polish #3781
- `cleanup`: rename `fail_oom_cli` error; surface `geocode` update-check error #3806
- applied select clippy lints
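The difference between a chars-based cap and the byte-based cap described for `util::safe_header_names` can be shown with a small Python sketch. It covers only the "truncate to at most 60 bytes on a UTF-8 character boundary" step, not lowercasing, prefixing, or duplicate suffixes, and the function name `truncate_to_bytes` is hypothetical rather than qsv's.

```python
def truncate_to_bytes(name, max_bytes=60):
    """Cut `name` to at most `max_bytes` UTF-8 bytes without splitting a character."""
    data = name.encode("utf-8")
    if len(data) <= max_bytes:
        return name
    cut = max_bytes
    # If the byte at the cut point is a continuation byte (0b10xxxxxx),
    # the cut falls inside a character, so back up to that character's first byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


# Hypothetical header: one ASCII letter followed by 25 three-byte CJK characters
header = "a" + "日" * 25
by_chars = header[:60]                # chars-based cap keeps all 26 characters
by_bytes = truncate_to_bytes(header)  # byte-based cap snaps back to a boundary
```

Here the chars-based cap leaves a 76-byte name, while the byte-based cap yields 58 bytes: the limit of 60 falls in the middle of a three-byte character, so the cut moves back to the previous boundary.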
Fixed
- `excel`: review-driven cleanup of `src/cmd/excel.rs` - fix four correctness bugs. (1) Negative `--sheet` indices that overshot the sheet count silently selected a wrong sheet because the `abs_diff` clamp "bounced" past zero (e.g. `--sheet -4` on a 3-sheet workbook returned the 2nd sheet); now errors with `usage error: negative sheet index N is out of range for K sheets`. (2) get_requested_rangel...
19.1.0
[19.1.0] - 2026-04-13
Note
Self-update for the pre-built binaries was broken in qsv 18.0.0 and 19.0.0. This was caused by a bug in the self-update crate that has since been fixed.
WORKAROUND: Download qsv 17.0.0 which predates the self-update bug, and use its --update or --updatenow options to upgrade to the latest release.
Detailed MCP Server and Claude Cowork Plugin changes are documented in the MCP Server/Cowork Plugin CHANGELOG.
Added
- `pivotp`: add group-by mode #3698 (closes #3697)
- `pivotp`: expand smart aggregation with 7 more statistics #3699
Changed
- `self_update`: show actual error message when available if self_update errors out
- `moarstats`: use fused multiply add for theil_sum (perf)
- Switch to crates.io mimalloc, removing git override
- Add HTML anchors to some stats definitions
Fixed
- Fix 10 documentation-codebase drifts found by audit #3689
- Fix 10 documentation-codebase drifts found by audit #3692
- Document index support for `describegpt` and `join`
- Use latest upstream self_update (our PR merged)
- Homebrew qsv distribution enables more features now
Dependencies
- Bump polars to latest upstream
- Ensure all polars_sql features are enabled
- Bump jsonschema from 0.45.1 to 0.46.0 #3695
- Bump pragmastat from 12.0.1 to 12.1.0 #3693
- Bump qsv-stats from 0.48.1 to 0.48.2 #3702
- Bump rand from 0.10.0 to 0.10.1 #3700
- Bump tokio from 1.51.0 to 1.51.1 #3691
- Use nightly-2026-04-01 (same as polars)
- bump indirect dependencies
Full Changelog: 19.0.0...19.1.0
19.0.0
[19.0.0] - 2026-04-07 🔐 The "FAIR Answers" Release 📐
The Reproducibility Crisis in Scientific Research is one of the principal motivators for FAIR Principles in Data Management.
With AI increasingly used in data pipelines, the need for reproducibility and auditability has become even more critical as "hallucinations" and non-deterministic outputs are inherent challenges in Generative AI.
That's why in this release, we instrumented qsv with several features to help users track, audit, and reproduce their AI-assisted data wrangling workflows more effectively. As FAIR Principles do not only apply to data, we also want "FAIR Answers" - with the last R for "Reproducible":
- Enhanced Logging: The `qsv_log` tool now supports structured logging with JSON output, making it easier to parse and analyze logs for reproducibility audits (note that this is only available from the qsv MCP Server).
- NEW blake3 Command: A new `blake3` command computes BLAKE3 hashes of files or data streams, providing a fast and reliable way to verify data integrity and track file versions in workflows. Unlike the oft-used SHA-256 hash, BLAKE3 is up to 16 times faster without sacrificing security, making it ideal for large datasets and iterative processing.
- Cowork Project Reproducibility Manifest: Building on the Cowork Project support released in 18.0.0, the qsv Cowork Plugin now creates a Project Reproducibility Manifest - a structured log of all prompts, commands, and outputs generated during a Cowork session. This manifest can be used for detailed audits of the data wrangling process, helping users understand how specific outputs were derived and enabling them to reproduce or modify the workflow with confidence.
- Even Moarstats: The `moarstats` command gets even "moar" statistical tests and metrics (Trimean, Midhinge, Robust CV, Jarque-Bera, Theil Index, Mean Absolute Deviation and Simpson's Diversity Index), giving users deeper insights into their data distributions and relationships, which can be crucial for reproducibility in data analysis.
- To Parquet Improvements: The `to parquet` command is re-added with a new implementation powered by Polars' LazyFrame API, providing faster and more reliable CSV-to-Parquet conversion with better schema inference and support for complex data types. New options like `--infer-len` and `--try-parse-dates` enhance the accuracy of type inference, further improving the fidelity of Parquet outputs for faster downstream analysis and reproducibility.
Detailed MCP Server and Cowork Plugin changes are documented in the MCP CHANGELOG.
Added
- `blake3`: new BLAKE3 hashing command #3658
- `to parquet`: re-add subcommand powered by Polars #3674
- `to parquet`: pschema.json support, `--infer-len` and `--try-parse-dates` #3680
- `pivotp`: totals support #3635
- `moarstats`: even moar stats #3654
Changed
- `to parquet`: use LazyFrame for parquet conversion #3679
- `tojsonl`: implement proper JSONL writer instead of abusing CSV writer
- Document first-N sampling; use to_string_lossy
- `help`: suppress linebreaks for options by using non-breaking hyphens #3662
- Switch default allocator from mimalloc to jemalloc - the default allocator of polars #3684
- Add debug_assert! to moarstats map lookups
- Remove some unwraps
Fixed
- docs: fix 27 stale claims found in documentation audit #3637
- docs: correct 5 documentation inaccuracies found during audit
- typo: `|` character not escaped, prematurely truncating content
Dependencies
- bump atoi simd and sysinfo #3663
- bump cached from 0.58.0 to 0.59.0 by @dependabot[bot] in #3639
- bump file-format from 0.28.0 to 0.29.0 by @dependabot[bot] in #3649
- bump human-panic from 2.0.6 to 2.0.7 by @dependabot[bot] in #3661
- bump human-panic from 2.0.7 to 2.0.8 by @dependabot[bot] in #3670
- bump indexmap from 2.13.0 to 2.13.1 by @dependabot[bot] in #3671
- bump jaq from 2 to 3; jaq-json from 1 to 2 #3653
- bump jsonschema from 0.45.0 to 0.45.1 by @dependabot[bot] in #3685
- bump lodash from 4.17.23 to 4.18.1 in /.claude/skills by @dependabot[bot] in #3669
- bump minijinja from 2.18.0 to 2.19.0 by @dependabot[bot] in #3666
- bump minijinja-contrib from 2.18.0 to 2.19.0 by @dependabot[bot] in #3665
- bump path-to-regexp from 8.3.0 to 8.4.0 in /.claude/skills by @dependabot[bot] in #3652
- bump polars to latest upstream at the time of release (rev efe654e)
- bump pyo3 from 0.28.2 to 0.28.3 by @dependabot[bot] in #3667
- bump redis from 1.0.5 to 1.1.0 by @dependabot[bot] in #3636
- bump redis from 1.1.0 to 1.2.0 by @dependabot[bot] in #3677
- bump rust_decimal from 1.40.0 to 1.41.0 by @dependabot[bot] in #3648
- bump rustls-webpki from 0.103.9 to 0.103.10 by @dependabot[bot] in #3632
- bump self_update from 0.43.1 to 0.44.0 by @dependabot[bot] in #3683
- bump semver from 1.0.27 to 1.0.28 by @dependabot[bot] in #3678
- bump tokio from 1.50.0 to 1.51.0 by @dependabot[bot] in #3672
- bump toml from 1.0.7 to 1.1.0 by @dependabot[bot] in #3640
- bump toml from 1.1.0 to 1.1.1 by @dependabot[bot] in #3660
- bump toml from 1.1.1 to 1.1.2 by @dependabot[bot] in #3664
Full Changelog: 18.0.0...19.0.0
18.0.0
[18.0.0] - 2026-03-20 The "StatsSighting" Cowork Plugin Release
"StatsSighting" is like "VibeCoding" but for iterative, blazing-fast, deep data analysis. "Stats" for Statistics. "Sight" for Insight - doing a comprehensive statistical profile of datasets first to inform the analysis pipeline.
The Claude Cowork Plugin comes with several agents - the "Data Analyst Agent" for deep data exploration and analysis, the "Data Wrangler Agent" for transformation and cleaning, and the "Policy Analyst Agent" for helping with policy evaluation and decision-making. Each agent has a specific role and skill set, with a shared emphasis on leveraging the qsv MCP Server's profiling and querying capabilities to understand the data before acting on it.
The qsv MCP server received major enhancements - including session logging, DuckDB-powered Parquet conversion, SQL translation hardening, and interactive working directory elicitation.
The core qsv suite also gets significant updates in this release, including the new scoresql command for pre-query SQL analysis, smarter pragmastat with stats-cache integration and comparison mode, pivotp optimizations with moarstats awareness, and formatted table output for to.
Major Features
New scoresql Command
Analyze SQL queries against CSV file caches (stats, moarstats, frequency) to produce a performance score with actionable optimization suggestions before running the query. Scoring factors include query plan analysis (EXPLAIN), type optimization, join key cardinality, filter selectivity, anti-pattern detection (SELECT *, missing LIMIT, cartesian joins), and infrastructure checks (index files, cache freshness). Supports Polars and DuckDB modes, SQL file input, and JSON output. Integrates with describegpt for AI-assisted query review. #3612, #3616, #3624
Smarter pragmastat — Stats-Cache Aware with Comparison Mode
pragmastat now reads the stats cache to automatically skip non-numeric/non-date columns, and writes its own results back to the cache for downstream commands. New --compare1 and --compare2 options let you compare two distributions side-by-side. Multiple performance optimizations make it significantly faster. #3591, #3593, #3596, #3595, #3611
pivotp — Smarter Pivoting with moarstats
pivotp now integrates with moarstats to auto-validate pivot column cardinality before execution, preventing overly wide output (>1000 columns) and guiding users toward better pivot strategies. #3606
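The cardinality guard can be pictured as follows. This Python sketch is illustrative only: pivotp integrates with moarstats instead of scanning the data as done here, and the `MAX_PIVOT_COLUMNS` constant and `check_pivot_cardinality` helper are hypothetical names. Only the 1000-column threshold comes from the text above.

```python
import csv
import io

MAX_PIVOT_COLUMNS = 1000  # threshold mentioned in the release notes

def check_pivot_cardinality(csv_text: str, pivot_column: str,
                            limit: int = MAX_PIVOT_COLUMNS) -> int:
    """Count distinct values in pivot_column; refuse to pivot if output would be too wide."""
    reader = csv.DictReader(io.StringIO(csv_text))
    distinct = {row[pivot_column] for row in reader}
    if len(distinct) > limit:
        raise ValueError(
            f"pivoting on {pivot_column!r} would create {len(distinct)} columns (limit {limit})"
        )
    return len(distinct)

data = "region,product,sales\nEast,A,10\nWest,B,20\nEast,B,5\n"
print(check_pivot_cardinality(data, "region"))
```

Each distinct value in the pivot column becomes an output column, so checking cardinality first is what prevents an accidentally enormous table.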
to — Named Table Support
The to command gains a --table option for CSV, XLSX and ODS output, letting you write data to a named sheet/table in workbook formats. #3572, #3580
Detailed MCP changes are documented in the MCP CHANGELOG.
Added
- scoresql: new command - score SQL queries for safety, complexity and performance #3612
- scoresql: SQL file support, DuckDB PATH fallback & QSV_DUCKDB_PATH rename #3616
- to: add --table option for CSV, XLSX and ODS output #3572, #3580
- searchset: ignore line comments in regexset files #3622
- pragmastat: add --compare1 and --compare2 options #3591
- pragmastat: use stats cache to only process numeric/date/datetime columns #3593
- pragmastat: write results to stats cache #3596
- pragmastat: multiple performance optimizations #3595, #3611
- pivotp: smarter pivoting with moarstats integration #3606
- describegpt: scoresql integration #3624
Changed
- stats: reduce day-valued precision to 5 decimals #3607
- frequency: use array_windows for pairwise comparisons
- Use mul_add for numeric ops across the codebase for more accurate FMA
- MSRV bumped to latest stable Rust 1.94
- Switch csvlens dependency to upstream
- Polars bumped to 0.53.0 (py-1.39.x series)
Fixed
- stats: fixed big performance regression caused by memory-aware chunking logic error #3598
- help: fine-tune markdown generation of docopt usage text #3600
Dependencies
- Polars 0.53.0 (py-1.39.3)
- pragmastat 11.1.0 → 12.0.0 #3589
- qsv-stats 0.47.0 → 0.48.0 #3587
- jsonschema 0.44.0 → 0.45.0 #3592
- minijinja/minijinja-contrib 2.16.0 → 2.18.0
- calamine 0.33 → 0.34
- cached 0.58 #3594
- Removed patched forks of self_update and pragmastat (upstream releases available)
- Various other dependency bumps (toml, toon-format, tempfile, redis, libc, sysinfo, once_cell, spreadsheet-ods)
Full Changelog: 17.0.0...18.0.0
Note
qsv 18.0.0 is not published to crates.io. qsv depends on an unreleased git revision of Polars, and cargo publish strips [patch.crates-io] entries, causing dependency resolution to fail against the published Polars v0.53.0 on crates.io (which caps chrono <=0.4.41, incompatible with chrono 0.4.44). This will be resolved once Polars publishes a new crates.io release with updated chrono support. In the meantime, install qsv via the prebuilt binaries, various package managers, or by building from source.
17.0.0
[17.0.0] - 2026-03-03 "The User 🧑🏻 and Agent 🤖 Experience (UAX) Release"
This release is all about getting Human Users and AI Agents working together in harmony to wrangle data faster and more effectively - whether you're a solo analyst or a data team using Claude Desktop/Cowork/Code or Gemini.
The UAX theme introduced in 16.1.0 reaches full stride — the new qsvmcp binary variant gives AI agents a purpose-built, leaner binary; the MCP server levels up with better tool guidance, TSV output for token efficiency, reproducibility logging, DuckDB-powered Parquet conversion, automatic moarstats enrichment, SQL translation hardening, and interactive working directory elicitation. On the core CLI side, stats cache reliability improves across delimiters and output formats, sniff resolves symlinks correctly, and moarstats gets faster hot-path performance.
Major Features
New qsvmcp Binary Variant
A purpose-built binary optimized for use with the qsv MCP server, adding session logging while dispensing with unneeded features (like apply, fetch, fetchpost, foreach, to) for a faster, smaller build. The MCP server now prefers qsvmcp with automatic fallback to the full qsv binary. qsvmcp is now included in release distributions alongside qsv, qsvlite, and qsvdp.
qsv MCP Server: Agent-Native Enhancements
The MCP server (now v17.0.0) receives its biggest update yet, with features designed to make AI agents more effective at data wrangling:
- TSV Output Format - Default output switched to TSV for ~30% token reduction in agent responses, configurable via QSV_MCP_OUTPUT_FORMAT
- Session Logging - New qsv_log tool and automatic qsvmcp.log audit trail for reproducibility, with configurable log levels via QSV_MCP_LOG_LEVEL
- DuckDB Parquet Conversion - When DuckDB is available, CSV-to-Parquet conversion uses DuckDB instead of sqlp for faster, more reliable conversion
- Auto-moarstats - moarstats automatically runs after stats execution for richer statistical context at minimal cost
- SQL Translation Hardening - Major translateSql overhaul: unique table aliases (_tbl_N), string literal protection, user-provided alias preservation, and pre-scan qualified ref fixing
- Working Directory Elicitation - Interactive directory picker via MCP Elicitation protocol for first-time setup
- Reserved Cache Filename Guard - Prevents accidental --output overwrites of .stats.csv and .freq.csv cache files
- Cache-Aware SQL Guidance - Server instructions now guide agents to leverage stats and frequency caches when composing sqlp, joinp, and pivotp queries
- Polars SQL Engine Header - Clear engine indicator differentiates Polars SQL vs DuckDB query results
- Absolute Path Resolution - All file-path arguments now resolved to absolute paths for robustness
- Cowork CLAUDE.md Auto-Deploy - Automatically deploys project CLAUDE.md to Claude Cowork working folder on session start (cross-platform Node.js implementation)
Detailed MCP changes are documented in the MCP CHANGELOG.
Added
- feat: qsvmcp binary variant - purpose-built for MCP server usage, included in release distributions
Changed
- perf(moarstats): fix outlier key bug and optimize hot-path allocations
- perf(stats): optimize to_record() output path and weighted_mad()
- refactor(describegpt): simplify code for clarity and reduce redundancy
- deps: bump pragmastat from 10.0 to 11.1.0
- deps: bump polars to latest upstream (rev 802550b)
- deps: bump Luau from 0.708 to 0.709
- deps: bump chrono from 0.4.43 to 0.4.44
- deps: bump csv-nose from 0.8.0 to 1.0.1
- deps: bump jsonschema from 0.42 to 0.44.0
- deps: bump strum/strum_macros from 0.27.2 to 0.28.0
- deps: bump tempfile from 3.25.0 to 3.26.0
- deps: bump serial_test from 3.3.1 to 3.4.0
- deps: bump actions/upload-artifact from 6 to 7
- deps: switch csvlens to patched fork using csv-nose 1.0.1
- deps: update ort dependency to include tls-rustls feature (by @kulnor)
- applied select clippy suggestions
Fixed
- fix(stats): always write stats cache as CSV regardless of output format (Snappy, TSV, etc.)
- fix(stats): decouple Snappy compression from cache — cache files always use comma delimiter
- fix(sniff): resolve symlinks before MIME detection and metadata lookup (#3529)
- fix(moarstats): harden outlier test assertion and fix comment inconsistency
- fix(describegpt): restore error logging in Redis connection failure
- docs: fix ~70 false claims found by documentation audits across qsv and MCP server
Full Changelog: 16.1.0...17.0.0
Note
qsv 17.0.0 is not published to crates.io. qsv depends on an unreleased git revision of Polars (rev = 802550b), and cargo publish strips [patch.crates-io] entries, causing dependency resolution to fail against the published Polars v0.53.0 on crates.io (which caps chrono <=0.4.41, incompatible with chrono 0.4.44). This will be resolved once Polars publishes a new crates.io release with updated chrono support. In the meantime, install qsv via the prebuilt binaries, Homebrew, or by building from source.
16.1.0
[16.1.0] - 2026-02-15 📊 "The Accelerated Civic Intelligence (ACI) Release" 📊
Statistical analysis gets faster and more robust; User & Agent Experience (UAX) improvements keep the CLI parser, docs, shell completions, and MCP tool definitions in sync from a single source; and the qsv MCP Server gets leaner and smarter.
With a properly configured environment, a User can team up with several AI Agents for accelerated analysis of large, real-world, messy data (raw datasets, presentations, reports, spreadsheets, etc.) without uploading it all to the cloud or manually wrangling it into shape first. Analysis that would otherwise take days, if not weeks, to compile can be done in a few minutes.
🌟 Major Features
New pragmastat Command
A pragmatic statistical toolkit by @AndreyAkinshin — Compute robust, median-of-pairwise statistics with the Pragmastat library. Designed for messy, heavy-tailed, or outlier-prone data where mean/stddev can mislead. See pragmastat.dev for details on the underlying algorithms and design philosophy.
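As a rough illustration of what "median-of-pairwise" means, the Python sketch below computes a Hodges-Lehmann-style center (median of pairwise averages) and a Shamos-style spread (median of pairwise absolute differences). It is a naive O(n²) illustration of the general idea, not the Pragmastat library's algorithm or API, and the function names are made up for this example.

```python
from itertools import combinations, combinations_with_replacement
from statistics import mean, median

def pairwise_center(xs):
    """Median of all pairwise averages (x_i + x_j) / 2 for i <= j."""
    return median((a + b) / 2 for a, b in combinations_with_replacement(xs, 2))

def pairwise_spread(xs):
    """Median of all pairwise absolute differences |x_i - x_j| for i < j."""
    return median(abs(a - b) for a, b in combinations(xs, 2))

sample = [1, 2, 3, 4, 100]  # one extreme outlier

print("mean:  ", mean(sample))
print("center:", pairwise_center(sample))
print("spread:", pairwise_spread(sample))
```

On this sample the mean is dragged to 22 by the single outlier, while the pairwise center stays at 3.0 and the pairwise spread at 2.5, which is why estimators of this kind suit heavy-tailed or outlier-prone data.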
Frequency Cache System
New --frequency-jsonl option for the frequency command creates a JSONL cache (analogous to stats --stats-jsonl) that accelerates repeated frequency analysis. Uses a hybrid strategy for high-cardinality columns with configurable thresholds.
Improved UAX: Unified Documentation & Shell Completions
A new docopt-based parsing system now generates markdown documentation, shell completions, and MCP tool definitions from the same USAGE text that powers qsv's CLI parsing. Everything stays in sync automatically — no more drift between help text, docs, completions and AI tooling.
- --generate-help-md flag produces polished markdown docs with section navigation, emoji legends, clickable URLs, and argument/option tables that are both Human and Agent-friendly.
- Shell completions are now auto-generated, replacing 68 manually maintained completion files.
qsv MCP Server: Leaner Architecture
The qsv_pipeline tool has been removed in favor of direct sequential command execution. In practice, agents were already calling commands one at a time, and removing the pipeline abstraction made the server simpler, more predictable, and easier to debug. Additional MCP improvements include:
- Extended AI agent guidance to take advantage of frequency and stats caches
- Seamless support for Google Gemini CLI thanks to @kulnor's continuing contributions
- Major codebase refactoring: deduplicated helpers, extracted filesystem tools, fixed any types, and various bug fixes
Detailed MCP changes are documented in the MCP CHANGELOG.
Added
- feat: pragmastat command - pragmatic statistical toolkit with parallelism, progress bar, and memcheck (by @AndreyAkinshin)
- feat: frequency --frequency-jsonl - JSONL frequency cache with hybrid strategy for high-cardinality columns
- feat: --generate-help-md flag - auto-generate markdown docs from USAGE text with section navigation, emoji legends, and clickable URLs
- docs: add QSV_FREQ_HIGH_CARD_THRESHOLD and QSV_FREQ_HIGH_CARD_THRESHOLD_PCT env vars
Changed
- perf: stats - skip redundant modes tracking, reduce allocations, optimize cache line layout, deterministic antimode sorting
- perf: pragmastat - reduce redundant computations, add parallelism
- perf: frequency - use sort_unstable_by for faster sorting; parallel computation for high-cardinality columns
- refactor: shell completions auto-generated from USAGE text (removed 68 manual files)
- refactor: describegpt - disambiguate "Other" bucket from literal "Other" in Data Dictionary Examples column
- deps: bump anstream from 0.6.21 to 1.0.0
- deps: bump futures to 0.3.32
- deps: bump jsonschema from 0.41 to 0.42
- deps: bump libc from 0.2.180 to 0.2.181
- deps: bump memmap2 from 0.9.9 to 0.9.10
- deps: bump polars to latest upstream
- deps: bump pyo3 from 0.28.0 to 0.28.1
- deps: bump quickcheck from 1.0.3 to 1.1.0
- deps: bump rand from 0.9 to 0.10, rand_hc to 0.5, rand_xoshiro to 0.8
- deps: bump sysinfo from 0.37.2 to 0.38.2
- deps: bump tempfile from 3.24.0 to 3.25.0
- deps: bump toml from 0.9.12 to 1.0.1
- deps: bump uuid from 1.20.0 to 1.21.0
- deps: bump zmij from 1.0.20 to 1.0.21
- deps: update csv patched fork MSRV to 1.93
Fixed
- fix: frequency - normalize delimiter for cache compatibility; deterministic output with secondary sort key; hybrid cache for high-cardinality columns
- fix: stats - remove unsafe block; deterministic antimode sorting
- fix(help): section detection, acronym casing, and option word-wrap in markdown generation
Removed
- removed 68 manual shell completion files (now auto-generated from USAGE text)
Full Changelog: 16.0.0...16.1.0
16.0.0
[16.0.0] - 2026-02-08 🤖 "The AI-Native Release" 🤖
This release makes qsv deeply AI-native: from smarter date detection that flows through to Polars schemas, to an MCP Plugin layer that lets AI agents wield qsv as a first-class data tool.
Claude Desktop, Code, and Cowork users can now use qsv's powerful data-wrangling capabilities directly within their AI workflows, with intelligent guidance and seamless integration. Google Gemini is now also supported thanks to @kulnor.
🌟 Major Features
Smarter Date/DateTime Detection
qsv can now automatically detect date and datetime columns and carry that knowledge through the entire pipeline:
- stats --dates-whitelist sniff is now the default - qsv sniffs the first 1000 rows to identify date/datetime field candidates for further guaranteed date/datetime type inferencing
- schema auto-detects Date/DateTime columns when generating Polars schemas (.pschema.json)
- DateTime type support in Polars schema parsing - temporal types are preserved through sqlp, joinp, and Parquet conversion
Hardened Stats Cache
The stats cache system that accelerates frequency, schema, tojsonl, sqlp, joinp, pivotp, diff, and sample is now more robust:
- Simplified API: Removed dataset_stats from get_stats_records(), streamlining all downstream consumers
- Safe fallback: Corrupted or unparsable cache files are gracefully handled instead of erroring out
- Auto-regeneration: Stats cache regenerates on parse error rather than failing
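The fallback-and-regenerate behavior can be sketched like this in Python. It is a conceptual illustration, not qsv's Rust code: the cache here is a simple JSONL file, and `load_or_regenerate` and its `compute` callback are hypothetical names.

```python
import json
import os
import tempfile

def load_or_regenerate(cache_path, compute):
    """Return cached records; if the cache is missing or unparsable, rebuild it."""
    try:
        with open(cache_path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
    except (OSError, ValueError):  # json.JSONDecodeError is a ValueError
        # Safe fallback: recompute instead of erroring out, then rewrite the cache.
        records = compute()
        with open(cache_path, "w", encoding="utf-8") as f:
            for rec in records:
                f.write(json.dumps(rec) + "\n")
        return records

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.stats.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write("{not valid json\n")  # simulate a corrupted cache file

    def fresh():
        return [{"field": "age", "type": "Integer"}]

    print(load_or_regenerate(path, fresh))  # parse error, so the cache is rebuilt
    print(load_or_regenerate(path, fresh))  # served from the repaired cache
```

Treating a bad cache as a cache miss keeps downstream commands working, at the cost of one recomputation.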
Enhanced MCP Server (16.0.0)
The qsv MCP Server receives its largest update yet — see MCP CHANGELOG for full details.
Breaking Changes
- diff command: --force option removed
  - Was used for short-circuiting diffs based on dataset_stats
  - No longer needed after stats cache API simplification
- to command: parquet subcommand removed
  - Use dedicated qsv_to_parquet MCP tool or sqlp for Parquet output
Added
- feat: stats - add 'sniff' support for --dates-whitelist
- feat: schema - auto-detect Date/DateTime columns for Polars schema via sniff
- feat: Support DateTime type in Polars schema parsing
Changed
- refactor: stats - make --dates-whitelist sniff the default
- perf: Use foldhash HashMap/HashSet across codebase for faster hashing
  - Replaces std::collections with foldhash in 14 modules
  - foldhash is much faster than std::collections for non-crypto hashing
- refactor: stats - Remove dataset_stats from stats cache system
  - Simplified get_stats_records() API
  - Centralized rowcount handling in sample command
  - Adapted diff, pivotp, sample, and other commands to new API
- refactor: stats - Stats cache now regenerates on parse error (improved robustness)
- refactor: stats - Safe fallback on corrupted stats cache
- refactor: pivotp - use sparsity for suggestions and uniqueness_ratio for pivot heuristics
- refactor: sample - lazily compute row_count only for sampling methods that need it
- deps: bump async-compression to 0.4.39
- deps: bump bytes from 1.11.0 to 1.11.1
- deps: bump calamine to 0.33
- deps: bump csv-nose from 0.7.0 to 0.8.0
- deps: bump csvlens to latest upstream (PR merged)
- deps: bump geosuggest to latest upstream
- deps: bump flate2 from 1.1.8 to 1.1.9
- deps: bump jsonschema from 0.40.0 to 0.41 (latest upstream with unreleased perf improvements)
- deps: bump polars from 0.52.0 at py-1.38.1 tag to 0.53
- deps: bump pyo3 from 0.27.2 to 0.28.0
- deps: bump redis from 1.0.2 to 1.0.3
- deps: bump regex from 1.12.2 to 1.12.3
- deps: bump reqwest from 0.13.1 to 0.13.2
- deps: bump zerocopy from 0.8.35 to 0.8.36
- deps: bump zip from 6 to 7
- deps: bump zmij from 1.0.17 to 1.0.20
- deps: bump bundled Luau from 0.706 to 0.708
- deps: bump @modelcontextprotocol/sdk (MCP)
- applied several clippy lint suggestions
- applied several GH Copilot and Claude review suggestions
Fixed
- fix: frequency - column selection when using --select option in different order
  - Now lookup cardinality by column name instead of index
  - Handles user-selected/reordered column subsets correctly
- fix: sample - handle missing min weight in stats cache
- fix: validate - adapt tests to jsonschema 0.40.2 error message format changes
- fix: joinp - switch pschema serialization to serde_json for compound type support
- fix: excel - adjust jsonl path usage caused by calamine 0.33 release
- fix: stats - return sentinel when sniff finds no date columns
- fix: config - QSV_NO_HEADERS environment variable being ignored; split no_headers into explicit setter and CLI flag method
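The frequency fix above comes down to keying cached per-column metadata by column name instead of by position. A small Python sketch of the difference, using made-up column names and cardinalities (this is not qsv's code):

```python
# Cardinalities as they might be recorded in original file column order (made-up values).
header = ["id", "city", "state"]
cardinality_by_position = [1000, 42, 5]
cardinality_by_name = dict(zip(header, cardinality_by_position))

# The user selects a reordered subset of columns.
selected = ["state", "id"]

# Buggy approach: index into the cache using the position within the selection.
by_index = [cardinality_by_position[i] for i, _ in enumerate(selected)]

# Fixed approach: look each selected column up by its name.
by_name = [cardinality_by_name[col] for col in selected]

print(by_index)
print(by_name)
```

With the positional lookup, "state" is wrongly assigned the cardinality of "id" and "id" that of "city". The name-based lookup returns the right value whatever the selection order.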
Removed
- removed to parquet subcommand in favor of dedicated qsv_to_parquet MCP tool and sqlp Parquet output support
- removed cargo install instructions from README: qsv is rarely cargo-installable because it regularly uses patched forks, and cargo install doesn't support git dependencies.
Full Changelog: 15.0.1...16.0.0
15.0.1
[15.0.1] - 2026-01-28
Oops, we celebrated color and the magika-powered revamped sniff but forgot to actually enable them in the release prebuilts! 🤦🏻‍♂️
This patch enables the new color command and turns on magika, along with several fixes and dependency bumps.
Changed
- deps: bump polars to latest upstream
- deps: bump csv-nose from 0.6.0 to 0.7.0
- deps: bump mlua from 0.11.5 to 0.11.6
- deps: bump minijinja from 2.14.0 to 2.15.1
- deps: bump minijinja-contrib from 2.14.0 to 2.15.1
- deps: bump siphasher from 1.0.1 to 1.0.2
- deps: bump iana-time-zone from 0.1.64 to 0.1.65
- deps: bump hono from 4.11.4 to 4.11.7 (MCP)
- build: add color feature to build and test workflows
- build: add magika feature to publishing workflows
- docs: updated luau documentation to reflect bundled Luau 0.706
- docs: sniff is now also 🤖-powered with its use of Magika mime-type detection
Fixed
- tests: fix flaky color test_get_theme test (now ignored due to environment dependencies)
- tests: fix flaky search JSON test by using semantic rather than byte-by-byte compare
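Comparing JSON semantically means parsing both documents and comparing the resulting values, so that key order and whitespace no longer matter. A minimal Python sketch of the idea follows; it is not the test's actual code, and the sample documents are invented.

```python
import json

expected = '{"name": "qsv", "tags": ["csv", "cli"]}'
actual = '{ "tags": ["csv", "cli"],\n  "name": "qsv" }'

# Byte-by-byte comparison fails on harmless formatting and key-order differences.
bytes_equal = expected == actual

# Semantic comparison parses both sides first; object key order is ignored,
# while array order is still significant.
semantically_equal = json.loads(expected) == json.loads(actual)

print(bytes_equal, semantically_equal)
```

A test written this way only fails when the data itself differs, which removes flakiness caused by serializer formatting or map ordering.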
Full Changelog: 15.0.0...15.0.1