Releases: Blosc/python-blosc2

Release 4.3.1

19 May 17:38

Changes from 4.3.0 to 4.3.1

This is a maintenance release focused on CTable nested-column ergonomics,
grouped reductions, and API/documentation polish.

CTable nested columns and grouped reductions

  • Nested column names in group_by() results: grouped output columns can now
    preserve dotted/nested names such as trip.sec instead of requiring valid
    Python identifiers.
  • Column-object selectors: CTable.group_by() and CTable.sort_by() now
    accept Column objects as well as string names, enabling idioms such as
    t.group_by(t.trip.sec) and t.sort_by(t.trip.sec).
  • Grouped arg reductions: CTableGroupBy now supports argmin() and
    argmax(), plus agg({"col": "argmin"}) / agg({"col": "argmax"}).
    Results are logical row positions in the grouped table or view; groups with no
    non-null values return -1.
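
The grouped arg-reduction behaviour described above can be pictured with a small plain-Python sketch. It only illustrates the stated semantics (one row position per group, -1 when a group has no non-null values) and is not the CTableGroupBy implementation; the helper name grouped_argmin and the use of None as the null marker are assumptions made for this example.

```python
def grouped_argmin(keys, values):
    """Return {key: row position of the smallest non-null value, or -1}."""
    best = {}    # key -> smallest non-null value seen so far
    result = {}  # key -> row position of that value (-1 until one is found)
    for pos, (key, value) in enumerate(zip(keys, values)):
        result.setdefault(key, -1)
        if value is None:  # null values never win
            continue
        if key not in best or value < best[key]:
            best[key] = value
            result[key] = pos
    return result


keys = ["a", "b", "a", "c", "b"]
values = [3.0, None, 1.5, None, 2.0]
positions = grouped_argmin(keys, values)
print(positions)
```

Group "c" contains only nulls here, so it maps to -1, while "a" and "b" map to the row positions of their smallest values.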

NDArray constructor ergonomics

  • blosc2.array(): added a NumPy-like constructor for NDArrays. It mirrors
    blosc2.asarray() but defaults to copy=True, so passing an existing
    NDArray creates a copy unless copy=False or copy=None is requested.

Documentation

  • Expanded the CTable reference with RowTransformer, Column.row_transformer,
    and CTableGroupBy.argmin / argmax documentation.
  • Added blosc2.ndarray(), blosc2.dictionary(), and related public schema
    factory functions to the Schema Specs reference.
  • Moved blosc2.group_reduce() into the Reduction Functions reference and
    updated its example to use Blosc2 NDArrays.

Release 4.3.0

18 May 16:55

Changes from 4.2.0 to 4.3.0

CTable: N-dimensional (ndarray) columns

  • Multidimensional columns: CTable columns can now hold NDArray-backed cells, allowing
    each row of a column to contain a full n-dimensional compressed array. This enables
    use cases such as embedding vectors, image patches, time-series windows, or any other
    multidimensional per-row payload.
  • CSV and DataFrame import/export: Multidimensional column data can be imported and
    exported via CSV and pandas DataFrames, with automatic detection of array-valued cells.
  • Nullable ndarray columns: Multidimensional columns fully support the nullable
    semantics (null_count, sentinel handling, null_policy) already available for scalar
    columns.
  • from_pandas() improvements: CTable.from_pandas() now creates the correct
    specialized backing storage for DictionarySpec, ListSpec, VLStringSpec,
    VLBytesSpec, and other variable-length scalar specifications.
  • Improved schema coverage: New CTable timestamp schema type and extended
    Column.info output with shape, chunks, and blocks descriptors.
  • Arg reductions: Added argmin() and argmax() for scalar and ndarray
    CTable columns, plus row-transformer support for generated columns such as
    per-row peak-hour or dominant-embedding-dimension features.

CTable: Group-by and filtered aggregation

  • CTable.group_by(): The primary group-by interface. Call
    t.group_by("city", sort=True).agg({"qty": "mean"}) to produce a new
    CTable with aggregated results. Single-key and multi-key groupings are
    supported, along with convenience methods such as .size(), .count(),
    .sum(), .mean(), .min() and .max():

    .. code-block:: python

        by_city = t.group_by("city", sort=True)
        by_city.size()  # COUNT(*)
        by_city.sum("sales")  # SUM(sales) per city
        by_city.agg({"sales": ["sum", "mean"]})  # SUM(sales), AVG(sales) per city
    
  • Performance accelerators: Dedicated Cython fast paths deliver significant speedups:
    ~25× for float32/64 group-by keys, ~8× for integer and dictionary-code keys, and a
    general-purpose hash table for arbitrary float keys.

  • Filtered aggregate pushdown: The where= parameter is now accepted in aggregation
    methods, pushing the filter into the compute engine so that only matching rows are
    read and reduced.

  • Persistent grouped output: Group-by results can be saved directly to persistent
    storage via the urlpath= parameter.

  • blosc2.group_reduce(): New public function that performs group-by reduction over
    NDArray instances and CTable columns, with Cython-accelerated backends for common
    key/reduction combinations.
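
As a rough mental model of what a grouped aggregation with a where= filter computes, the following plain-Python sketch may help. It does not call blosc2 and says nothing about how the compute engine works internally; the function name group_agg and the row-dictionary layout are hypothetical.

```python
from collections import defaultdict


def group_agg(rows, key, column, where=None):
    """Group rows by `key`; return {key: {"sum": ..., "mean": ...}} for `column`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if where is not None and not where(row):
            continue  # filtered rows are skipped before any reduction
        totals[row[key]] += row[column]
        counts[row[key]] += 1
    # sorted() plays the role of sort=True in the example above
    return {k: {"sum": totals[k], "mean": totals[k] / counts[k]} for k in sorted(totals)}


rows = [
    {"city": "Paris", "sales": 10.0},
    {"city": "Oslo", "sales": 4.0},
    {"city": "Paris", "sales": 20.0},
    {"city": "Oslo", "sales": 1.0},
]
result = group_agg(rows, "city", "sales", where=lambda r: r["sales"] > 2.0)
print(result)
```

The row with sales of 1.0 is dropped by the filter before it reaches the reduction, which is the effect the pushdown bullet describes.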

CTable: Dictionary / categorical columns

  • DictionarySpec column type: Introduced a new dictionary-encoded (categorical)
    column type that stores string or integer codes mapped to a shared dictionary, providing
    compact storage and accelerated equality and membership queries.
  • Dictionary types in where clauses: Dictionary columns can be queried with the same
    where= expression syntax as other column types, including nested dotted-name access.
  • Improved display: CTable printing now adapts to the terminal width, and dictionary
    values are shown in their decoded form. Column.info has been extended with type
    details, shape, chunks, and blocks.
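
The idea behind dictionary encoding can be sketched in a few lines of plain Python. This is a generic illustration of the technique, not the DictionarySpec storage layout; the helper name dict_encode is an assumption made for the example.

```python
def dict_encode(values):
    """Return (dictionary, codes): unique values in first-seen order, one code per row."""
    dictionary = []
    index = {}
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


cities = ["Oslo", "Paris", "Oslo", "Rome", "Paris"]
dictionary, codes = dict_encode(cities)

# Equality query: look the value up once, then compare small integer codes.
target = dictionary.index("Paris")
matches = [i for i, c in enumerate(codes) if c == target]

# Decoding maps every code back to its dictionary entry.
decoded = [dictionary[c] for c in codes]
```

Because each distinct value is stored once and rows hold only integer codes, equality and membership tests reduce to integer comparisons.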

CTable: Nested columns and field-name escaping

  • Dotted nested column access: Columns whose names contain literal .
    (e.g., "root.nested") are now fully addressable via the dotted accessor syntax in
    where expressions, __getitem__, and the public API.
  • Hierarchical _cols storage paths: The internal column storage layout now preserves
    a hierarchical structure that mirrors the logical nesting, improving introspection
    and interop.
  • Nested-field pipeline: A new flattened-storage pipeline with logical mapping
    preserves nested schema structure (field names, types, and hierarchy) through
    Arrow and Parquet import/export. For unnamed top-level list<struct<...>> Parquet
    files, the logical schema round-trips faithfully, though the original physical row
    grouping is intentionally not preserved.
  • Field-name escaping: Special characters (. and /) in column names are
    automatically escaped during schema construction and metadata round-trips.
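
To see why escaping matters once nested fields are flattened into dotted names, consider the sketch below. The percent-style escape scheme and the helper names escape, unescape and flatten are assumptions chosen for illustration; the release notes do not specify which escape sequences blosc2 actually uses.

```python
def escape(name):
    """Escape '%', '.' and '/' so they cannot be mistaken for path separators."""
    return name.replace("%", "%25").replace(".", "%2E").replace("/", "%2F")


def unescape(name):
    """Inverse of escape(); '%25' is restored last so it cannot form new sequences."""
    return name.replace("%2F", "/").replace("%2E", ".").replace("%25", "%")


def flatten(record, prefix=""):
    """Flatten nested dicts into {dotted.path: value}, escaping each field name."""
    flat = {}
    for name, value in record.items():
        path = prefix + escape(name)
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


record = {"trip": {"sec": 42, "km/h": 88.5}, "v1.0": True}
flat = flatten(record)
print(flat)
```

In the flattened result an unescaped dot always separates nesting levels, while a dot that was part of the original field name survives as an escape sequence and can be restored with unescape().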

Parquet import/export improvements

  • Arrow serializer by default: CTable.from_parquet() now defaults to the Arrow
    serializer, providing better schema fidelity and nested-type support.
  • Progress reporting: A --progress flag and an ETA estimator have been added to
    the parquet-to-blosc2 CLI for long-running imports.
  • --max-rows parameter: CTable.from_parquet() and the CLI now accept max_rows
    to limit the number of imported rows.
  • --timestamp-unit: New CLI option to control timestamp unit conversion on import.
  • --float-trunc-prec: New CLI option to truncate floating-point precision on import.
  • Separated nested columns enabled by default: The separate_nested_cols flag is now
    True by default for both the Python API and the CLI, ensuring nested Arrow structs
    are always expanded into flat columns.
  • list_serializer parameter: New option to control how list-type columns are
    serialized, with sensible defaults for different list layouts.
  • Validation optimizations: Arrow datetime values are validated only during import,
    reducing runtime overhead on subsequent operations.

TreeStore: Inline CTable support

  • CTables inside TreeStore: CTable objects can now be stored inline as items
    inside a TreeStore, enabling hierarchical storage that mixes arrays and tables in a
    single persistent container.
  • Cache hardening: TreeStore cache assignments now use defensive copies and cache
    effective object roots to avoid aliasing and stale-cache errors.
  • Examples and tutorials: New tutorials and docstring examples demonstrate how to
    store, retrieve, and query CTables within a TreeStore.

Performance and usability enhancements

  • Faster open and import: blosc2.open() and store constructors now assume valid
    file extensions and defer column metainfo loading, making CTable.open() and
    package import noticeably faster.
  • CTable.nrows is now lazy: The row count is computed on demand rather than eagerly,
    speeding up open and schema-inspection workflows.
  • Accelerated scalar and small-slice access: The batch/list path for reading scalar
    values or small column slices has been overhauled, eliminating internal placeholder
    materialization and yielding lower latency.
  • Late-import optimizations: Heavy optional dependencies are imported lazily at the
    blosc2 package level, reducing the baseline import blosc2 overhead.
  • iter_arrow_batches() optimization: Avoids full Python object materialization of
    batches during iteration, reducing memory pressure.
  • NDArray-to-list conversion: Small optimization when converting NDArray objects
    to Python lists.
  • _last_pos invalidation skipped: Mid-table deletes no longer eagerly invalidate
    cached positional state, improving delete latency.
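
The lazy row-count idea can be illustrated with a generic Python pattern. This is a sketch only: the LazyTable class, its fields, and the choice to cache the value after the first access are assumptions, not a description of how CTable.nrows is implemented.

```python
from functools import cached_property


class LazyTable:
    """Toy table whose row count is computed on first access, not at open time."""

    def __init__(self, valid_rows):
        self._valid_rows = valid_rows  # one boolean per physical row
        self.count_calls = 0           # how many times the count was computed

    @cached_property
    def nrows(self):
        self.count_calls += 1
        return sum(self._valid_rows)


table = LazyTable([True, False, True, True])
opened_without_counting = table.count_calls == 0
first = table.nrows   # triggers the count
second = table.nrows  # served from the cached value
```

Workflows that only open the table or inspect its schema never pay for the count in this model.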

Documentation, examples and benchmarks

  • API reference expanded: blosc2.group_reduce() has been added to the Sphinx
    reference, along with updated CTable, Column, and TreeStore pages.
  • New tutorials and examples: Added sections on CTable–TreeStore integration,
    nested fields, dictionary columns, aggregates, grouping and querying with where=.
  • New benchmarks: Graph benchmarks for CTable insert time, column count, memory usage,
    and where= queries, plus dedicated group-by, nested-filter, and Parquet round-trip
    benchmarks.

Fixes and compatibility

  • Null and NaN handling: NumPy scalar null sentinels are now normalized to plain Python
    scalars, and floating-point NaN sentinels are treated consistently with Python
    float('nan').
  • Empty aggregate results: Filtered aggregations that produce no rows now handle the
    empty result gracefully.
  • Generated column safety: Accessing a stalled (unfillable) generated column now raises
    a clear exception instead of producing undefined results.
  • Miniexpr bundling: Miniexpr’s bundled libtcc and related runtime files are now
    kept inside the blosc2 package, avoiding conflicts with other TCC installations.
  • Test improvements: Torch-dependent tests are marked as heavy, PyArrow-optional
    tests are skipped when the library is absent, and parametrization matrices have been
    trimmed to reduce CI time.
  • Missing Cython validation: Added validation guards for several Cython extension
    functions that previously lacked explicit error checking.
  • C-Blosc2 update: Bundled C-Blosc2 has been updated to the latest version (3.0.3).
  • blosc2.open() default mode changed from 'a' to 'r': Removed the FutureWarning that
    was added to prepare for this transition.
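
The NaN-sentinel item is easier to appreciate with a short plain-Python sketch: NaN never compares equal to itself, so a null check based on == silently misses NaN sentinels. The helper name is_null is hypothetical and the code illustrates the general pitfall, not the blosc2 code path.

```python
import math


def is_null(value, sentinel):
    """True when `value` matches the null sentinel, treating NaN as equal to NaN."""
    if isinstance(sentinel, float) and math.isnan(sentinel):
        return isinstance(value, float) and math.isnan(value)
    return value == sentinel


nan = float("nan")
column = [1.0, nan, 2.0]

naive = [v == nan for v in column]            # NaN != NaN, so nothing matches
checked = [is_null(v, nan) for v in column]   # NaN-aware comparison
int_checked = [is_null(v, -1) for v in [3, -1]]
```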

Release 4.2.0

07 May 11:38

Changes from 4.1.2 to 4.2.0

CTable: columnar compressed tables

  • Introduced blosc2.CTable, a new columnar table container for compressed, typed columns. CTables support dataclass- and schema-based construction, row iteration, column access, table views, head() / tail() / sample(), sorting, selection and compact where expressions.
  • Added persistent CTables backed by TreeStore, with support for blosc2.open(), CTable.open(), CTable.load(), CTable.save(), CTable.to_b2d() and CTable.to_b2z(). CTable views can be saved too, and .b2z/.b2d path handling has been tightened.
  • Added mutation operations for CTables, including append(), extend(), delete(), compact(), add_column(), drop_column(), rename_column() and related schema validation.
  • Added computed columns, including virtual computed columns backed by lazy expressions, materialized computed columns and automatic filling of materialized computed columns during inserts.
  • Added CTable indexing support, including persistent indexes, direct expression indexes, ordered index reuse, boolean LazyExpr/NDArray masks in CTable.__getitem__, iter_sorted() and indexing support for .b2z tables.
  • Added nullable schema support and null policies for CTable scalar columns, preserving nullable scalar Parquet round-trips.
  • Added variable-length CTable column support via ListArray / ObjectArray, including vlstring and vlbytes schema specs, fixed-length string/bytes import support and list/struct Arrow/Parquet round-trips.
  • Added Arrow, Parquet and CSV interoperability for CTables, including batch-wise Arrow/Parquet import/export, Arrow schema metadata preservation, CTable.from_arrow_batches() improvements and a new parquet-to-blosc2 CLI utility.
  • Added CTable documentation, tutorials, examples and benchmarks covering schema definition, persistence, querying, indexing, mutations, nullable columns, computed columns and variable-length columns.

Indexing and ordering

  • Added a new indexing subsystem for NDArrays and CTables, including full, partial/bucket, light/medium and OPSI-style index kinds, out-of-core index builders and sidecar storage.
  • Added blosc2.Index as the unified public index handle, plus APIs such as create_index(), compact_index(), iter_sorted(), will_use_index() and related query explanation support.
  • Added materialized expression indexes for NDArrays and direct expression indexes for CTables.
  • Added persistent query-result caching for indexed lookups, with FIFO pruning and cache accounting.
  • Added blosc2.argsort() and refactored indexing APIs around explicit index enums and sorting helpers.
  • Improved indexed query performance with Cython accelerators, threaded chunk batching, zero-copy/cached mmap reads, chunk-aware and reduced-order layouts and faster scattered row gathering.
  • Reduced memory usage during index creation and lookup by avoiding full sidecar materialization, replacing memmap staging with Blosc2 scratch arrays and adding tmpdir support for full out-of-core indexes.
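
FIFO pruning with cache accounting, as mentioned for the query-result cache, can be sketched with a small in-memory class. The class name FIFOCache, its methods, and the hit/miss counters are assumptions for illustration; the actual cache is persistent and its interface is not described in these notes.

```python
from collections import OrderedDict


class FIFOCache:
    """Bounded cache that evicts the oldest inserted entry and counts hits/misses."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            self.hits += 1
            return self._entries[key]  # no reordering: this is FIFO, not LRU
        self.misses += 1
        return None

    def put(self, key, value):
        if key not in self._entries and len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)  # drop the oldest insertion
        self._entries[key] = value


cache = FIFOCache(max_entries=2)
cache.put("q1", [1])
cache.put("q2", [2])
hit = cache.get("q1")       # found
cache.put("q3", [3])        # evicts "q1", the oldest entry, despite the recent read
gone = cache.get("q1")      # no longer cached
kept = cache.get("q2")      # still cached
```

Unlike an LRU policy, reading an entry does not protect it from eviction; only insertion order matters.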

Persistence, stores and serialization

  • Added structured Blosc2 serialization based on b2object carriers, including persisted C2Array, LazyExpr and DSL LazyUDF objects.
  • Added blosc2.Ref for serializing external references, plus examples for b2object bundles and persisted expressions/UDFs.
  • Added blosc2.load() as a convenience loader.
  • Added vlmeta support to LazyArray objects.
  • Improved store handling by preserving lazy b2object carriers in DictStore, allowing reopened proxies to refill caches after read-only opens, relaxing DictStore/TreeStore suffix requirements and adding DictStore.to_b2d().
  • Accelerated blosc2.open() by trying standard opens first and warning on implicit append mode.

Arrays, computation and containers

  • Added ObjectArray for fully general object data and renamed the earlier VLArray work accordingly; added ListArray docstrings and Arrow integration improvements.
  • Added schema helpers including numeric specs, blosc2.struct() and blosc2.object() for nested/fully general column declarations.
  • Improved fromiter() with direct chunked construction and substantially lower peak memory use.
  • Improved asarray() behavior for NDArray inputs when copy-inducing keyword arguments are supplied.
  • Added SChunk.reorder_offsets().
  • Improved BatchArray defaults and documentation; the default compression level is now tuned for faster lookup/scan behavior.
  • Continued matmul/linalg optimization work and shared-thread-pool integration.

CLI, docs and examples

  • Added the parquet-to-blosc2 command with options such as --max-rows, --parquet-batch-size, --blosc2-items-per-block and --use-dict.
  • Added new CTable, ObjectArray, BatchArray, containers, indexing and serialization tutorials and examples.
  • Reorganized and expanded the API reference for CTable, Column, schema specs, Index, save/load helpers and miscellaneous APIs.
  • Updated benchmark suites for CTables, indexing, Parquet import/export, BatchArray and NDArray construction/indexing.

Fixes and compatibility

  • Updated bundled C-Blosc2 to v3.0.2 and require C-Blosc2 >= 3.0.0 when building against a system library.
  • Updated bundled C-Blosc2 and miniexpr sources multiple times.
  • Restored compatibility with NumPy < 2.
  • Fixed Windows and mmap/file-locking issues in index creation, rebuilds and temporary file cleanup.
  • Fixed full-index query failures for large CTable columns and full out-of-core merge failures on systems with small /tmp.
  • Fixed stale sidecar/cache reuse and targeted cache invalidation when persistent sidecars are replaced.
  • Fixed .b2z double-open corruption caused by GC-triggered repacking and made temporary .b2z unpacking default to the source file directory.
  • Fixed a regression when reopening persisted proxies in read-only mode.
  • Fixed GC-induced thread hangs on macOS with Python 3.14 and hardened async chunk reading/cache cleanup paths.
  • Fixed lazy-chunk source-size handling in decode/getitem callers.
  • Fixed nullable validation, dictionary extend validation, CTable close propagation, print alignment and NumPy mask support.
  • Fixed arange() regressions and several pre-existing set_slice error-handling issues.
  • Clamped indexing/thread defaults for wasm32.

Blosc2 v4.1.2

03 Mar 11:09

Updated C-Blosc2 to pick up a memory-leak fix and other bug fixes.

Blosc2 v4.1.1

02 Mar 15:03

Update miniexpr version to fix bug on Ubuntu-arm64.

Blosc2 v4.1.0

28 Feb 07:13

  • Add DSL kernel functionality for faster, compiled, user-defined functions that broadly respect Python syntax and implement the LazyArray interface. See the introductory tutorial at: https://blosc.org/python-blosc2/getting_started/tutorials/03.lazyarray-udf-kernels.html
  • Add read-only mmap support for store containers:
    DictStore, TreeStore, and EmbedStore now accept mmap_mode="r"
    when opened with mode="r" (including via blosc2.open for .b2d,
    .b2z, and .b2e).
  • New .meta entry for store containers, allowing better store recognition at blosc2.open() time. Fixes #546.
  • Add cumulative_sum and cumulative_prod functions for Array API compliance.
  • Add Unicode string arrays, with support for comparison operations and an optimised compression path.
  • Add endswith and startswith, and extend contains to support strings; these use miniexpr multithreaded computation when possible.
  • Use DSL kernels to accelerate arange/linspace constructors by 6-10x.
  • Improve documentation for filters and filters_meta.
  • Fix edge case issues with resize and constructors so that chunks may be set independently of shape, and arrays may be extended from empty consistently.
  • Continued work on miniexpr integration, interface, and support.
  • Ruff fixes and implementation of PEP recommendations.
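
For readers unfamiliar with the cumulative reductions named above, the sketch below shows what a cumulative sum and a cumulative product compute on a one-dimensional sequence. It uses only the Python standard library and does not call blosc2, so it says nothing about the signatures or keyword arguments of the blosc2 functions.

```python
from itertools import accumulate
import operator

data = [1, 2, 3, 4]

# Element i holds the sum (or product) of data[0] .. data[i].
running_sum = list(accumulate(data))
running_prod = list(accumulate(data, operator.mul))
```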

Blosc2 v4.0.0

29 Jan 14:18

What's Changed

The main change is hyperfast, fully multithreaded computation with miniexpr (final PR: "Miniexpr for Windows" by @FrancescAlted in #565).
In addition, the internal wheel structure has been changed to implement PEP 427 (@lshaw8317 in #560).

Full Changelog: v3.12.2...v4.0.0

Blosc2 v4.0.0-b1

22 Jan 15:43

Pre-release

This is a beta version with hyperfast multithreaded expression calculation via the incorporation of miniexpr, as well as better support for plugins (stay tuned for the blosc2_openzl plugin!).

Full Changelog: v3.12.2...v4.0.0-b1

Blosc2 v3.12.2

04 Dec 11:46

What's Changed

  • Hotfix to change WASM wheel hosting to separate repo

Blosc2 v3.12.1

03 Dec 17:10

What's Changed

  • Allow saving of numba-decorated lazyudfs by @lshaw8317 in #538
  • Automate upload of WASM wheels to GitHub pages