Releases: Blosc/python-blosc2
Release 4.3.1
Changes from 4.3.0 to 4.3.1
This is a maintenance release focused on CTable nested-column ergonomics,
grouped reductions, and API/documentation polish.
CTable nested columns and grouped reductions
- Nested column names in group_by() results: grouped output columns can now
preserve dotted/nested names such as trip.sec instead of requiring valid
Python identifiers.
- Column-object selectors: CTable.group_by() and CTable.sort_by() now
accept Column objects as well as string names, enabling idioms such as
t.group_by(t.trip.sec) and t.sort_by(t.trip.sec).
- Grouped arg reductions: CTableGroupBy now supports argmin() and
argmax(), plus agg({"col": "argmin"}) / agg({"col": "argmax"}).
Results are logical row positions in the grouped table or view; groups with no
non-null values return -1.
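The grouped arg-reduction semantics described above can be illustrated with a minimal pure-Python sketch. This is not the blosc2 implementation: the function name grouped_argmin is hypothetical, None stands in for a null value, and keeping the first occurrence on ties is an assumption.

```python
from collections import defaultdict


def grouped_argmin(keys, values):
    """Map each key to the row position of its smallest non-null value, or -1."""
    best = defaultdict(lambda: -1)
    for pos, (key, value) in enumerate(zip(keys, values)):
        current = best[key]  # also registers the group, even if all its values are null
        if value is None:
            continue  # nulls never become the winning position
        if current == -1 or value < values[current]:
            best[key] = pos
    return dict(best)


keys = ["a", "b", "a", "c", "b"]
values = [3.0, None, 1.5, None, 2.0]
result = grouped_argmin(keys, values)
print(result)  # {'a': 2, 'b': 4, 'c': -1}
```

Group "c" has only null values, so it reports -1, matching the rule stated in the release note.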
NDArray constructor ergonomics
- blosc2.array(): added a NumPy-like constructor for NDArrays. It mirrors
blosc2.asarray() but defaults to copy=True, so passing an existing
NDArray creates a copy unless copy=False or copy=None is requested.
Documentation
- Expanded the CTable reference with RowTransformer, Column.row_transformer,
and CTableGroupBy.argmin/argmax documentation.
- Added blosc2.ndarray(), blosc2.dictionary(), and related public schema
factory functions to the Schema Specs reference.
- Moved blosc2.group_reduce() into the Reduction Functions reference and
updated its example to use Blosc2 NDArrays.
Release 4.3.0
Changes from 4.2.0 to 4.3.0
CTable: N-dimensional (ndarray) columns
- Multidimensional columns: CTable columns can now hold NDArray-backed cells, allowing
each row of a column to contain a full n-dimensional compressed array. This enables
use cases such as embedding vectors, image patches, time-series windows, or any other
multidimensional per-row payload.
- CSV and DataFrame import/export: Multidimensional column data can be imported and
exported via CSV and pandas DataFrames, with automatic detection of array-valued cells.
- Nullable ndarray columns: Multidimensional columns fully support the nullable
semantics (null_count, sentinel handling, null_policy) already available for scalar
columns.
- from_pandas() improvements: CTable.from_pandas() now creates the correct
specialized backing storage for DictionarySpec, ListSpec, VLStringSpec,
VLBytesSpec, and other variable-length scalar specifications.
- Improved schema coverage: New CTable timestamp schema type and extended
Column.info output with shape, chunks, and blocks descriptors.
- Arg reductions: Added argmin() and argmax() for scalar and ndarray
CTable columns, plus row-transformer support for generated columns such as
per-row peak-hour or dominant-embedding-dimension features.
CTable: Group-by and filtered aggregation
- CTable.group_by(): The primary group-by interface. Call
t.group_by("city", sort=True).agg({"qty": "mean"}) to produce a new
CTable with aggregated results. Single-key and multi-key groupings are
supported, along with convenience methods such as .size(), .count(),
.sum(), .mean(), .min() and .max():

.. code-block:: python

    by_city = t.group_by("city", sort=True)
    by_city.size()  # COUNT(*)
    by_city.sum("sales")  # SUM(sales) per city
    by_city.agg({"sales": ["sum", "mean"]})  # SUM(sales), AVG(sales) per city

- Performance accelerators: Dedicated Cython fast paths deliver significant speedups:
~25× for float32/64 group-by keys, ~8× for integer and dictionary-code keys, and a
general-purpose hash table for arbitrary float keys.
- Filtered aggregate pushdown: The where= parameter is now accepted in aggregation
methods, pushing the filter into the compute engine so that only matching rows are
read and reduced.
- Persistent grouped output: Group-by results can be saved directly to persistent
storage via the urlpath= parameter.
- blosc2.group_reduce(): New public function that performs group-by reduction over
NDArray instances and CTable columns, with Cython-accelerated backends for common
key/reduction combinations.
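As a rough illustration of what filtered aggregate pushdown means, the sketch below applies the filter inside the scan so that rejected rows never reach the accumulators. It is a pure-Python model under stated assumptions, not the blosc2 engine: group_mean is a hypothetical helper, rows are plain dicts, and where is modelled as a Python callable, which may differ from the form the real where= parameter accepts.

```python
from collections import defaultdict


def group_mean(rows, key, value, where=None):
    """Mean of `value` per `key`; rows rejected by `where` are skipped during the scan."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if where is not None and not where(row):
            continue  # filtered rows never reach the accumulators
        sums[row[key]] += row[value]
        counts[row[key]] += 1
    # Sorted keys loosely mirror the sort=True behaviour mentioned above.
    return {k: sums[k] / counts[k] for k in sorted(sums)}


rows = [
    {"city": "Oslo", "qty": 4},
    {"city": "Lima", "qty": 10},
    {"city": "Oslo", "qty": 8},
    {"city": "Lima", "qty": 1},
]
means = group_mean(rows, "city", "qty", where=lambda r: r["qty"] > 2)
print(means)  # {'Lima': 10.0, 'Oslo': 6.0}
```

The alternative, materializing a filtered copy of the table and then grouping it, gives the same answer but reads and stores rows that are immediately discarded.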
CTable: Dictionary / categorical columns
- DictionarySpec column type: Introduced a new dictionary-encoded (categorical)
column type that stores string or integer codes mapped to a shared dictionary, providing
compact storage and accelerated equality and membership queries.
- Dictionary types in where clauses: Dictionary columns can be queried with the same
where= expression syntax as other column types, including nested dotted-name access.
- Improved display: CTable printing now adapts to the terminal width, and dictionary
values are shown in their decoded form. Column.info has been extended with type
details, shape, chunks, and blocks.
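The general idea behind dictionary encoding can be sketched in a few lines of pure Python. This is a conceptual model only; it says nothing about how DictionarySpec stores its data, and both helper names are hypothetical.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): unique values in first-seen order plus integer codes."""
    dictionary = []
    index = {}
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def rows_equal_to(dictionary, codes, target):
    """Row positions whose value equals `target`, comparing integer codes only."""
    if target not in dictionary:
        return []  # unknown value: no row can match, so the codes are never scanned
    code = dictionary.index(target)
    return [pos for pos, c in enumerate(codes) if c == code]


cities = ["Oslo", "Lima", "Oslo", "Kyiv", "Lima", "Oslo"]
dictionary, codes = dictionary_encode(cities)
matches = rows_equal_to(dictionary, codes, "Oslo")
print(dictionary)  # ['Oslo', 'Lima', 'Kyiv']
print(codes)       # [0, 1, 0, 2, 1, 0]
print(matches)     # [0, 2, 5]
```

An equality query resolves the target to a code once and then compares small integers, which is why such columns tend to be both compact and fast for equality and membership tests.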
CTable: Nested columns and field-name escaping
- Dotted nested column access: Columns whose names contain literal .
(e.g., "root.nested") are now fully addressable via the dotted accessor syntax in
where expressions, __getitem__, and the public API.
- Hierarchical _cols storage paths: The internal column storage layout now preserves
a hierarchical structure that mirrors the logical nesting, improving introspection
and interop.
- Nested-field pipeline: A new flattened-storage pipeline with logical mapping
preserves nested schema structure (field names, types, and hierarchy) through
Arrow and Parquet import/export. For unnamed top-level list<struct<...>> Parquet
files, the logical schema round-trips faithfully, though the original physical row
grouping is intentionally not preserved.
- Field-name escaping: Special characters (. and /) in column names are
automatically escaped during schema construction and metadata round-trips.
Parquet import/export improvements
- Arrow serializer by default: CTable.from_parquet() now defaults to the Arrow
serializer, providing better schema fidelity and nested-type support.
- Progress reporting: A --progress flag and an ETA estimator have been added to
the parquet-to-blosc2 CLI for long-running imports.
- --max-rows parameter: CTable.from_parquet() and the CLI now accept max_rows
to limit the number of imported rows.
- --timestamp-unit: New CLI option to control timestamp unit conversion on import.
- --float-trunc-prec: New CLI option to truncate floating-point precision on import.
- Separated nested columns enabled by default: The separate_nested_cols flag is now
True by default for both the Python API and the CLI, ensuring nested Arrow structs
are always expanded into flat columns.
- list_serializer parameter: New option to control how list-type columns are
serialized, with sensible defaults for different list layouts.
- Validation optimizations: Arrow datetime values are validated only during import,
reducing runtime overhead on subsequent operations.
TreeStore: Inline CTable support
- CTables inside TreeStore: CTable objects can now be stored inline as items
inside a TreeStore, enabling hierarchical storage that mixes arrays and tables in a
single persistent container.
- Cache hardening: TreeStore cache assignments now use defensive copies and cache
effective object roots to avoid aliasing and stale-cache errors.
- Examples and tutorials: New tutorials and docstring examples demonstrate how to
store, retrieve, and query CTables within a TreeStore.
Performance and usability enhancements
- Faster open and import: blosc2.open() and store constructors now assume valid
file extensions and defer column metainfo loading, making CTable.open() and
package import noticeably faster.
- CTable.nrows is now lazy: The row count is computed on demand rather than eagerly,
speeding up open and schema-inspection workflows.
- Accelerated scalar and small-slice access: The batch/list path for reading scalar
values or small column slices has been overhauled, eliminating internal placeholder
materialization and yielding lower latency.
- Late-import optimizations: Heavy optional dependencies are imported lazily at the
blosc2 package level, reducing the baseline import blosc2 overhead.
- iter_arrow_batches() optimization: Avoids full Python object materialization of
batches during iteration, reducing memory pressure.
- NDArray-to-list conversion: Small optimization when converting NDArray objects
to Python lists.
- _last_pos invalidation skipped: Mid-table deletes no longer eagerly invalidate
cached positional state, improving delete latency.
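The lazy row count mentioned above follows a common pattern that can be sketched with the standard library. The LazyTable class below is hypothetical and only models the idea of deferring the count until first use; it does not reflect how CTable.nrows is implemented.

```python
from functools import cached_property


class LazyTable:
    """Toy table whose row count is computed on first access only."""

    def __init__(self, chunk_lengths):
        self._chunk_lengths = chunk_lengths
        self.count_calls = 0  # instrumentation for this example only

    @cached_property
    def nrows(self):
        # Runs once; the result is then stored on the instance.
        self.count_calls += 1
        return sum(self._chunk_lengths)


t = LazyTable([1000, 1000, 250])
calls_after_open = t.count_calls  # 0: constructing the table did not count rows
first = t.nrows                   # triggers the computation
second = t.nrows                  # served from the cached value
print(calls_after_open, first, second, t.count_calls)  # 0 2250 2250 1
```

Workflows that only open a table to inspect its schema never pay for the count, which is the benefit the release note describes.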
Documentation, examples and benchmarks
- API reference expanded: blosc2.group_reduce() has been added to the Sphinx
reference, along with updated CTable, Column, and TreeStore pages.
- New tutorials and examples: Added sections on CTable–TreeStore integration,
nested fields, dictionary columns, aggregates, grouping and querying with where=.
- New benchmarks: Graph benchmarks for CTable insert time, column count, memory usage,
and where= queries, plus dedicated group-by, nested-filter, and Parquet round-trip
benchmarks.
Fixes and compatibility
- Null and NaN handling: NumPy scalar null sentinels are now normalized to plain Python
scalars, and floating-point NaN sentinels are treated consistently with Python
float('nan').
- Empty aggregate results: Filtered aggregations that produce no rows now handle the
empty result gracefully.
- Generated column safety: Accessing a stalled (unfillable) generated column now raises
a clear exception instead of producing undefined results.
- Miniexpr bundling: Miniexpr’s bundled libtcc and related runtime files are now
kept inside the blosc2 package, avoiding conflicts with other TCC installations.
- Test improvements: Torch-dependent tests are marked as heavy, PyArrow-optional
tests are skipped when the library is absent, and parametrization matrices have been
trimmed to reduce CI time.
- Missing Cython validation: Added validation guards for several Cython extension
functions that previously lacked explicit error checking.
- C-Blosc2 update: Bundled C-Blosc2 has been updated to the latest version (3.0.3).
- blosc2.open() default mode changed from 'a' to 'r': Removed the FutureWarning that
was added to prepare for this transition.
Release 4.2.0
Changes from 4.1.2 to 4.2.0
CTable: columnar compressed tables
- Introduced blosc2.CTable, a new columnar table container for compressed, typed columns. CTables support dataclass- and schema-based construction, row iteration, column access, table views, head()/tail()/sample(), sorting, selection and compact where expressions.
- Added persistent CTables backed by TreeStore, with support for blosc2.open(), CTable.open(), CTable.load(), CTable.save(), CTable.to_b2d() and CTable.to_b2z(). CTable views can be saved too, and .b2z/.b2d path handling has been tightened.
- Added mutation operations for CTables, including append(), extend(), delete(), compact(), add_column(), drop_column(), rename_column() and related schema validation.
- Added computed columns, including virtual computed columns backed by lazy expressions, materialized computed columns and automatic filling of materialized computed columns during inserts.
- Added CTable indexing support, including persistent indexes, direct expression indexes, ordered index reuse, boolean LazyExpr/NDArray masks in CTable.__getitem__, iter_sorted() and indexing support for .b2z tables.
- Added nullable schema support and null policies for CTable scalar columns, preserving nullable scalar Parquet round-trips.
- Added variable-length CTable column support via ListArray/ObjectArray, including vlstring and vlbytes schema specs, fixed-length string/bytes import support and list/struct Arrow/Parquet round-trips.
- Added Arrow, Parquet and CSV interoperability for CTables, including batch-wise Arrow/Parquet import/export, Arrow schema metadata preservation, CTable.from_arrow_batches() improvements and a new parquet-to-blosc2 CLI utility.
- Added CTable documentation, tutorials, examples and benchmarks covering schema definition, persistence, querying, indexing, mutations, nullable columns, computed columns and variable-length columns.
Indexing and ordering
- Added a new indexing subsystem for NDArrays and CTables, including full, partial/bucket, light/medium and OPSI-style index kinds, out-of-core index builders and sidecar storage.
- Added blosc2.Index as the unified public index handle, plus APIs such as create_index(), compact_index(), iter_sorted(), will_use_index() and related query explanation support.
- Added materialized expression indexes for NDArrays and direct expression indexes for CTables.
- Added persistent query-result caching for indexed lookups, with FIFO pruning and cache accounting.
- Added blosc2.argsort() and refactored indexing APIs around explicit index enums and sorting helpers.
- Improved indexed query performance with Cython accelerators, threaded chunk batching, zero-copy/cached mmap reads, chunk-aware and reduced-order layouts and faster scattered row gathering.
- Reduced memory usage during index creation and lookup by avoiding full sidecar materialization, replacing memmap staging with Blosc2 scratch arrays and adding tmpdir support for full out-of-core indexes.
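For readers unfamiliar with FIFO pruning, the query-result cache item above can be pictured with the small in-memory sketch below. It is only an analogy: the FIFOCache class is hypothetical, it is not persistent, and the hit/miss counters are a guess at what "cache accounting" might track.

```python
from collections import OrderedDict


class FIFOCache:
    """Bounded cache that evicts the oldest inserted entry, regardless of use."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            self.hits += 1
            return self._entries[key]  # no reordering: this is FIFO, not LRU
        self.misses += 1
        return None

    def put(self, key, value):
        if key not in self._entries and len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)  # drop the oldest insertion
        self._entries[key] = value


cache = FIFOCache(max_entries=2)
cache.put("x > 1", [3, 4])
cache.put("x > 2", [4])
cache.get("x > 1")       # a hit, but it does not protect the entry from eviction
cache.put("x > 3", [])   # cache is full, so the oldest entry ("x > 1") is pruned
```

Unlike an LRU policy, a recently read entry is still evicted if it was inserted first, which keeps the bookkeeping trivial.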
Persistence, stores and serialization
- Added structured Blosc2 serialization based on b2object carriers, including persisted C2Array, LazyExpr and DSL LazyUDF objects.
- Added blosc2.Ref for serializing external references, plus examples for b2object bundles and persisted expressions/UDFs.
- Added blosc2.load() as a convenience loader.
- Added vlmeta support to LazyArray objects.
- Improved store handling by preserving lazy b2object carriers in DictStore, allowing reopened proxies to refill caches after read-only opens, relaxing DictStore/TreeStore suffix requirements and adding DictStore.to_b2d().
- Accelerated blosc2.open() by trying standard opens first and warning on implicit append mode.
Arrays, computation and containers
- Added ObjectArray for fully general object data and renamed the earlier VLArray work accordingly; added ListArray docstrings and Arrow integration improvements.
- Added schema helpers including numeric specs, blosc2.struct() and blosc2.object() for nested/fully general column declarations.
- Improved fromiter() with direct chunked construction and substantially lower peak memory use.
- Improved asarray() behavior for NDArray inputs when copy-inducing keyword arguments are supplied.
- Added SChunk.reorder_offsets().
- Improved BatchArray defaults and documentation; the default compression level is now tuned for faster lookup/scan behavior.
- Continued matmul/linalg optimization work and shared-thread-pool integration.
CLI, docs and examples
- Added the parquet-to-blosc2 command with options such as --max-rows, --parquet-batch-size, --blosc2-items-per-block and --use-dict.
- Added new CTable, ObjectArray, BatchArray, containers, indexing and serialization tutorials and examples.
- Reorganized and expanded the API reference for CTable, Column, schema specs, Index, save/load helpers and miscellaneous APIs.
- Updated benchmark suites for CTables, indexing, Parquet import/export, BatchArray and NDArray construction/indexing.
Fixes and compatibility
- Updated bundled C-Blosc2 to v3.0.2 and require C-Blosc2 >= 3.0.0 when building against a system library.
- Updated bundled C-Blosc2 and miniexpr sources multiple times.
- Restored compatibility with NumPy < 2.
- Fixed Windows and mmap/file-locking issues in index creation, rebuilds and temporary file cleanup.
- Fixed full-index query failures for large CTable columns and full out-of-core merge failures on systems with small /tmp.
- Fixed stale sidecar/cache reuse and targeted cache invalidation when persistent sidecars are replaced.
- Fixed .b2z double-open corruption caused by GC-triggered repacking and made temporary .b2z unpacking default to the source file directory.
- Fixed a regression when reopening persisted proxies in read-only mode.
- Fixed GC-induced thread hangs on macOS with Python 3.14 and hardened async chunk reading/cache cleanup paths.
- Fixed lazy-chunk source-size handling in decode/getitem callers.
- Fixed nullable validation, dictionary extend validation, CTable close propagation, print alignment and NumPy mask support.
- Fixed arange() regressions and several pre-existing set_slice error-handling issues.
- Clamped indexing/thread defaults for wasm32.
Blosc2 v4.1.2
Updated c-blosc2 for memory leak and other bug fixes
Blosc2 v4.1.1
Update miniexpr version to fix bug on Ubuntu-arm64.
Blosc2 v4.1.0
- Add DSL kernel functionality for faster, compiled, user-defined functions which broadly respect Python syntax and implement the LazyArray interface. See the introductory tutorial at: https://blosc.org/python-blosc2/getting_started/tutorials/03.lazyarray-udf-kernels.html
- Add read-only mmap support for store containers: DictStore, TreeStore, and EmbedStore now accept mmap_mode="r"
when opened with mode="r" (including via blosc2.open for .b2d,
.b2z, and .b2e).
- New .meta entry for store containers, allowing better store recognition at blosc2.open() time. Fixes #546.
- Add cumulative_sum and cumulative_prod functions for Array API compliance.
- Add Unicode string arrays, support comparison operations with them, and optimised compression path.
- Add endswith and startswith and extend contains to support strings and offer miniexpr multithreaded computation when possible.
- Use DSL kernels to accelerate arange/linspace constructors by 6-10x.
- Improve documentation for filters and filters_meta.
- Fix edge case issues with resize and constructors so that chunks may be set independently of shape, and arrays may be extended from empty consistently.
- Continued work on miniexpr integration, interface, and support.
- Ruff fixes and implementation of PEP recommendations.
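For reference, the one-dimensional behaviour of cumulative_sum and cumulative_prod can be modelled on plain Python lists as below. This is a sketch of the semantics, not the blosc2 functions: the include_initial flag is taken from the Array API standard, and whether the blosc2 versions accept it is an assumption to verify against the API reference.

```python
from itertools import accumulate
import operator


def cumulative_sum(values, include_initial=False):
    """Running sum; optionally prepend the additive identity 0."""
    out = list(accumulate(values))
    return [0] + out if include_initial else out


def cumulative_prod(values, include_initial=False):
    """Running product; optionally prepend the multiplicative identity 1."""
    out = list(accumulate(values, operator.mul))
    return [1] + out if include_initial else out


data = [1, 2, 3, 4]
sums = cumulative_sum(data)
prods = cumulative_prod(data)
sums_with_initial = cumulative_sum(data, include_initial=True)
print(sums)               # [1, 3, 6, 10]
print(prods)              # [1, 2, 6, 24]
print(sums_with_initial)  # [0, 1, 3, 6, 10]
```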
Blosc2 v4.0.0
What's Changed
The main change is hyperfast, fully multithreaded computation with miniexpr (final PR: Miniexpr for Windows by @FrancescAlted in #565).
The internal wheel structure has also been changed to implement PEP 427 (@lshaw8317 in #560). In addition:
- feat: add support for .b2z, .b2d, .b2e files and update related tests by @bossbeagle1509 in #541
- Add none indexing for lazyudf/lazyarray by @lshaw8317 in #545
- Respect NUMEXPR_MAX_THREADS when setting numexpr thread count by @skmendez in #567
- Add openzl_plugin support by @lshaw8317 in #559
Full Changelog: v3.12.2...v4.0.0
Blosc2 v4.0.0-b1
This is a beta version with hyperfast multithreaded expression calculation via the incorporation of miniexpr, as well as better support for plugins (stay tuned for blosc2_openzl plugin!).
What's Changed
- Update pre-commit hooks by @pre-commit-ci[bot] in #537
- Fix fancy index item bug by @ykcUconn in #543
- feat: add support for .b2z, .b2d, .b2e files and update related tests by @bossbeagle1509 in #541
- Add none indexing for lazyudf/lazyarray by @lshaw8317 in #545
- Bump actions/download-artifact from 6 to 7 by @dependabot[bot] in #547
- Bump actions/upload-artifact from 5 to 6 by @dependabot[bot] in #548
- Update pre-commit hooks by @pre-commit-ci[bot] in #550
- PEP 639 compliance by @DimitriPapadopoulos in #552
- Multi-threaded reductions by @FrancescAlted in #549
- Implement PEP recommendations by @lshaw8317 in #560
- Add openzl_plugin support by @lshaw8317 in #559
New Contributors
- @ykcUconn made their first contribution in #543
- @bossbeagle1509 made their first contribution in #541
Full Changelog: v3.12.2...v4.0.0-b1
Blosc2 v3.12.2
What's Changed
- Hotfix to change WASM wheel hosting to separate repo
Blosc2 v3.12.1
What's Changed
- Allow saving of numba-decorated lazyudfs by @lshaw8317 in #538
- Automate upload of WASM wheels to GitHub pages