CTable

A columnar compressed table backed by one physical container per column. Scalar columns use NDArray; list-valued columns use ListArray. Each column is stored, compressed, and queried independently; rows are never materialised in their entirety unless you explicitly call to_arrow() or iterate with __iter__().

class blosc2.CTable(row_type: type[RowT], new_data=None, *, urlpath: str | None = None, mode: str = 'a', expected_size: int | None = None, compact: bool = False, validate: bool = True, cparams: dict[str, Any] | None = None, dparams: dict[str, Any] | None = None)[source]

Columnar compressed table with typed columns and row-oriented access.

Attributes:
cbytes

Total compressed size in bytes (all columns + valid_rows mask).

computed_columns

Read-only view of the computed-column definitions.

cratio

Compression ratio for the whole table payload.

indexes

Return a list of blosc2.Index handles for all active indexes.

info

Get information about this table.

info_items

Structured summary items used by info().

nbytes

Total uncompressed size in bytes (all columns + valid_rows mask).

ncols

Total number of columns, including computed (virtual) columns.

nrows
schema

The compiled schema that drives this table’s columns and validation.
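The size attributes can be related with a minimal sketch. It assumes that cratio is the uncompressed size divided by the compressed size, which the text above does not spell out, and it uses zlib as a stand-in for the Blosc2 codec. The column name and data are made up for illustration.

```python
import zlib
from array import array

# Hypothetical table payload: one int64 column plus the valid_rows mask.
# Each buffer is compressed independently (zlib stands in for Blosc2 here).
payload = {
    "sensor_id": array("q", range(10_000)).tobytes(),
    "valid_rows": bytes([1]) * 10_000,
}

nbytes = sum(len(raw) for raw in payload.values())                 # uncompressed total
cbytes = sum(len(zlib.compress(raw)) for raw in payload.values())  # compressed total
cratio = nbytes / cbytes  # assumed definition: uncompressed / compressed

print(f"nbytes={nbytes} cbytes={cbytes} cratio={cratio:.1f}")
```

A ratio above 1 means the stored payload is smaller than its in-memory representation.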

Methods

add_column(name, spec)

Add a new column filled from the default declared in spec.

add_computed_column(name, expr, *[, dtype])

Add a read-only virtual column computed from stored columns.

add_generated_column(name, *, values[, ...])

Add a stored generated column maintained by the table.

append(data)

Append a single row to the table.

close()

Close any persistent backing store held by this table.

column_schema(name)

Return the CompiledColumn descriptor for name.

compact()

Physically rewrite every column array keeping only live rows.

compact_index([col_name, expression, name])

Compact an index, merging any incremental append runs.

copy([compact, urlpath, overwrite])

Return a new standalone copy of this table.

cov()

Return the covariance matrix as a numpy array.

create_index([col_name, field, expression, ...])

Build and register an index for a stored column or table expression.

delete(ind)

Mark one or more rows as deleted (tombstone deletion).

describe()

Print a per-column statistical summary.

drop_column(name)

Remove a column from the table.

drop_computed_column(name)

Remove a computed column from the table.

drop_index([col_name, expression, name])

Remove an index and delete any sidecar files.

extend(data, *[, validate])

Append multiple rows at once.

from_arrow(schema, batches, *[, urlpath, ...])

Build a CTable from an Arrow schema and iterable of record batches.

from_csv(path, row_cls, *[, header, sep])

Build a CTable from a CSV file.

from_pandas(df, row_cls)

Build a CTable from a pandas DataFrame.

from_parquet(path, *[, columns, batch_size, ...])

Read a Parquet file into a CTable.

group_by(keys, *[, sort, dropna, engine, ...])

Return a deferred group-by object for this table.

head([N])

Return a view of the first N live rows (default 5).

index([col_name, expression, name])

Return the index handle for a stored-column or expression target.

iter_arrow_batches(*[, columns, batch_size, ...])

Yield live rows as bounded-size pyarrow.RecordBatch objects.

iter_sorted(cols[, ascending, start, stop, ...])

Iterate rows in sorted order without materializing a full copy.

load(urlpath)

Load a persistent table from urlpath into RAM.

materialize_computed_column(name, *[, ...])

Materialize a computed column into a new stored snapshot column.

open(urlpath, *[, mode])

Open a persistent CTable from urlpath.

rebuild_index([col_name, expression, name])

Drop and recreate an index with the same parameters.

refresh_generated_column(name)

Recompute a stored generated/materialized column from its source columns.

refresh_generated_columns(*[, source])

Refresh all generated columns, optionally only those depending on source.

rename_column(old, new)

Rename a column.

sample(n, *[, seed])

Return a read-only view of n randomly chosen live rows.

save(urlpath, *[, overwrite])

Persist this table to disk at urlpath.

schema_dict()

Return a JSON-compatible dict describing this table's schema.

select(cols)

Return a column-projection view exposing only cols.

sort_by(cols[, ascending, inplace])

Return a copy of the table sorted by one or more columns.

tail([N])

Return a view of the last N live rows (default 5).

to_arrow()

Convert all live rows to a pyarrow.Table.

to_b2d(urlpath, *[, overwrite, compact])

Write this table to a directory-backed store.

to_b2z(urlpath, *[, overwrite, compact])

Write this table to a compact .b2z container.

to_csv(path, *[, header, sep])

Write all live rows to a CSV file.

to_pandas()

Convert to a pandas DataFrame.

to_parquet(path, *[, columns, batch_size, ...])

Write this table to a Parquet file batch-wise using pyarrow.

view(new_valid_rows)

Return a row-filter view backed by a boolean mask array without copying data.

where(expr_result, *[, columns])

Return a row-filtered view matching a boolean predicate.

Special methods

CTable.__len__()

Return the number of live (non-deleted) rows.

CTable.__iter__()

Iterate over live rows in insertion order, yielding namedtuple-like row objects.

CTable.__getitem__(key)

Type-driven indexing for columns, rows, projections, and filters.

CTable.__getattr__(s)

Convenience fallback for attribute-style column access.

CTable.__repr__()

Short CTable<cols>(N rows, X compressed) summary string.

CTable.__str__()

Pandas-style tabular display with column names, dtypes, and a row count footer.

__len__()[source]

Return the number of live (non-deleted) rows.

__iter__()[source]

Iterate over live rows in insertion order, yielding namedtuple-like row objects with one attribute per column.

__getitem__ supports type-driven indexing:

  • str — column name returns a Column; any other string is interpreted as a boolean expression and behaves like where().

  • boolean LazyExpr / NDArray — filtered row view, same as where(), e.g. t[t.temperature_f > 70].

  • int — single row as a namedtuple-like object.

  • slice — row-range view.

  • list[int] / ndarray[int] — gathered-row view.

  • ndarray[bool] — boolean-mask filtered view.

  • list[str] — column-projection view (same as select()).

__getattr__ provides convenience attribute-style column access only after normal Python attribute lookup fails; use t["name"] for columns that conflict with table attributes or methods.
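The key-type rules listed above can be pictured as a dispatch on the type of the key. The following is a simplified sketch of that idea, not the library's implementation: classify_key is a hypothetical helper, it only returns a label, and it ignores LazyExpr, NDArray and NumPy array keys.

```python
def classify_key(key, column_names):
    """Return a label for the kind of access a key would trigger (illustrative only)."""
    if isinstance(key, str):
        # A known column name selects a column; anything else is treated as an expression.
        return "column" if key in column_names else "filter expression"
    if isinstance(key, bool):
        # bool is a subclass of int in Python, so rule it out before the int branch.
        raise TypeError("a bare bool is not a supported key in this sketch")
    if isinstance(key, int):
        return "single row"
    if isinstance(key, slice):
        return "row-range view"
    if isinstance(key, list) and key:
        if all(isinstance(k, str) for k in key):
            return "column projection"
        if all(isinstance(k, bool) for k in key):
            return "boolean-mask view"
        if all(isinstance(k, int) for k in key):
            return "gathered-row view"
    raise TypeError(f"unsupported key: {key!r}")


cols = {"sensor_id", "temperature"}
print(classify_key("temperature", cols))
print(classify_key("temperature > 20", cols))
print(classify_key(["sensor_id", "temperature"], cols))
```

Checking the bool case before the int case matters because isinstance(True, int) is true in Python.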

__repr__() str[source]

Short CTable<cols>(N rows, X compressed) summary string.

__str__() str[source]

Pandas-style tabular display with column names, dtypes, and a row count footer.

classmethod from_arrow(schema, batches, *, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, capacity_hint: int | None = None, string_max_length: int | Mapping[str, int] | None = None, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, list_serializer: Literal['msgpack', 'arrow'] = 'msgpack', object_fallback: bool = False, column_cparams: Mapping[str, dict[str, Any]] | None = None, separate_nested_cols: bool = False) CTable[source]

Build a CTable from an Arrow schema and iterable of record batches.

Nested struct flattening: top-level Arrow struct<…> fields are automatically and recursively flattened into dotted leaf columns. For example, a field trip: struct<begin: struct<lon: float64, lat: float64>> becomes two CTable columns trip.begin.lon and trip.begin.lat. Each leaf is stored as an independent compressed NDArray. Row reads via t[i] reconstruct the original nested dict shape. Use t["trip.begin.lon"] or t.trip.begin.lon to access a leaf:

import pyarrow as pa, blosc2
trip_type = pa.struct([("begin", pa.struct([("lon", pa.float64())]))])
schema = pa.schema([pa.field("trip", trip_type)])
t = blosc2.CTable.from_arrow(schema, batches)
t.col_names          # ['trip.begin.lon']
t["trip.begin.lon"].mean()
t.trip.begin.lon.max()
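The flattening rule, and the reverse step used when a row is read back as a nested dict, can be sketched in plain Python. The flatten and unflatten helpers below are hypothetical illustrations of the naming scheme and are not part of blosc2.

```python
def flatten(row, prefix=""):
    """Turn a nested dict into a flat dict keyed by dotted leaf names."""
    flat = {}
    for key, value in row.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))  # recurse into nested structs
        else:
            flat[name] = value
    return flat


def unflatten(flat):
    """Rebuild the nested dict shape from dotted leaf names."""
    nested = {}
    for name, value in flat.items():
        *parents, leaf = name.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


row = {"trip": {"begin": {"lon": -87.6, "lat": 41.8}}, "payment": {"fare": 12.5}}
flat = flatten(row)
print(flat)
print(unflatten(flat) == row)
```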

When string_max_length is None (the default), scalar Arrow string / large_string columns are imported as vlstring() columns and binary / large_binary columns are imported as vlbytes() columns. Struct columns that are not flattened (those not containing only scalar leaves) are imported as struct() columns backed by batched variable-length storage. Null values in these variable-length scalar columns are represented as native None, so no sentinel is needed.

When string_max_length is set to a positive integer, scalar string and binary columns are imported as fixed-width string() / bytes() columns whose dtype is sized to string_max_length characters/bytes. It may also be a mapping from column name to max length; omitted string/binary columns remain vlstring() / vlbytes() columns.
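The three accepted forms of string_max_length (None, a single integer, or a per-column mapping) can be summarised with a small resolver. This is a sketch of the rule as described above: string_column_kind is a hypothetical helper, and the returned labels are illustrative rather than actual blosc2 schema objects.

```python
from collections.abc import Mapping


def string_column_kind(col_name, string_max_length=None):
    """Return ("variable", None) or ("fixed", n) for a scalar string/binary column."""
    if isinstance(string_max_length, Mapping):
        # Columns missing from the mapping stay variable-length.
        max_len = string_max_length.get(col_name)
    else:
        max_len = string_max_length
    if max_len is None:
        return ("variable", None)
    if max_len <= 0:
        raise ValueError("string_max_length must be a positive integer")
    return ("fixed", max_len)


print(string_column_kind("name"))
print(string_column_kind("name", 32))
print(string_column_kind("city", {"name": 16}))
```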

blosc2_batch_size controls how many rows are buffered before BatchArray-backed imported columns (list columns and variable-length scalar columns such as vlstring, vlbytes, struct, and schema-less object columns) are flushed to their backend. Set it to None to keep those columns pending until the final flush.
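The buffering behaviour can be modelled with a toy buffer that flushes whenever it reaches the batch size and keeps everything pending when the size is None. BatchBuffer is a hypothetical class written for this illustration; the real flush path inside blosc2 is not shown in this document.

```python
class BatchBuffer:
    """Toy model of rows buffered before being flushed to a backend."""

    def __init__(self, batch_size):
        self.batch_size = batch_size  # None means: never flush early
        self.pending = []
        self.flushed_batches = []

    def add(self, row):
        self.pending.append(row)
        if self.batch_size is not None and len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.flushed_batches.append(self.pending)
            self.pending = []


def batch_sizes(batch_size, nrows):
    buf = BatchBuffer(batch_size)
    for i in range(nrows):
        buf.add(i)
    buf.flush()  # final flush for whatever is still pending
    return [len(batch) for batch in buf.flushed_batches]


print(batch_sizes(3, 7))
print(batch_sizes(None, 7))
```

With batch_size=None the whole import is held in memory until the final flush, which trades memory for fewer, larger writes.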

list_serializer selects the backend serializer for imported list columns. "msgpack" is the default; "arrow" stores Arrow list batches directly and can be much faster for deeply nested list columns.

Unsupported Arrow types raise by default. Pass object_fallback=True to import such columns as schema-less object() columns. This fallback is intentionally not used by from_parquet().

column_cparams optionally maps column names to per-column compression parameters. These override the table-level cparams for fixed-width columns imported from Arrow.

classmethod from_csv(path: str, row_cls, *, header: bool = True, sep: str = ',') CTable[source]

Build a CTable from a CSV file.

Schema comes from row_cls (a dataclass) — CTable is always typed. All rows are read in a single pass into per-column Python lists, then each column is bulk-written into a pre-allocated NDArray (one slice assignment per column, no extend()).

Parameters:
  • path – Source CSV file path.

  • row_cls – A dataclass whose fields define the column names and types.

  • header – If True (default), the first row is treated as a header and skipped. Column order in the file must match row_cls field order regardless.

  • sep – Field delimiter. Defaults to ","; use "\t" for TSV.

Returns:

A new in-memory CTable containing all rows from the CSV file.

Return type:

CTable

Raises:
  • TypeError – If row_cls is not a dataclass.

  • ValueError – If a row has a different number of fields than the schema.
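The single-pass, per-column read described for from_csv can be approximated with the standard csv module. The sketch below is an illustration under assumptions: Reading is a hypothetical row class, read_columns is a hypothetical helper, cells are converted with the plain field types (int, float), and the final bulk write into compressed arrays is omitted.

```python
import csv
import io
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Reading:  # hypothetical row class used only for this example
    sensor_id: int
    temperature: float


def read_columns(text, row_cls, *, header=True, sep=","):
    """Read CSV text in one pass into one Python list per dataclass field."""
    if not is_dataclass(row_cls):
        raise TypeError("row_cls must be a dataclass")
    specs = fields(row_cls)
    columns = {f.name: [] for f in specs}
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    if header:
        next(reader, None)  # the header row is skipped, not used for matching
    for lineno, row in enumerate(reader, start=1):
        if len(row) != len(specs):
            raise ValueError(f"row {lineno}: expected {len(specs)} fields, got {len(row)}")
        for f, cell in zip(specs, row):
            columns[f.name].append(f.type(cell))  # assumes f.type is a callable like int
    return columns


cols = read_columns("sensor_id,temperature\n1,20.5\n2,21.0\n", Reading)
print(cols)
```

Because values are matched by position, a file whose columns are in a different order than the dataclass fields would be read into the wrong columns even when a header is present.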

classmethod from_pandas(df, row_cls) CTable[source]

Build a CTable from a pandas DataFrame.

Schema comes from row_cls (a dataclass) — CTable is always typed. Object-dtype DataFrame columns are not automatically inferred as ndarray columns; the row_cls must explicitly declare blosc2.ndarray() fields.

Parameters:
  • df – Source pandas DataFrame.

  • row_cls – A dataclass whose fields define the column names and types.

Returns:

A new CTable containing all DataFrame rows.

Return type:

CTable

Raises:
  • TypeError – If row_cls is not a dataclass.

  • ValueError – If DataFrame columns do not match the row_cls schema.

classmethod from_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, list_serializer: Literal['msgpack', 'arrow'] = 'arrow', separate_nested_cols: bool = True, max_rows: int | None = None, **kwargs) CTable[source]

Read a Parquet file into a CTable.

The Parquet file is streamed batch by batch through pyarrow and then converted into a typed CTable. By default, the result is created in memory, but you can also persist it on disk via urlpath.

This method delegates the actual table construction to CTable.from_arrow(), so Arrow schema handling, nullable-column support, and Blosc2 write tuning follow the same rules as that method.

Nested struct flattening: top-level Parquet struct<…> fields are automatically and recursively flattened into dotted leaf columns — the same as in from_arrow(). For example, a Parquet file that contains a column trip: struct<begin: struct<lon: double, lat: double>> produces two CTable columns trip.begin.lon and trip.begin.lat. Row reads reconstruct the original nested dict shape; individual leaves are accessed via dotted names or attribute-chain proxies:

t = blosc2.CTable.from_parquet("trips.parquet")
t.col_names               # e.g. ['trip.begin.lon', 'trip.begin.lat', ...]
t["trip.begin.lon"].mean()
t.trip.begin.lon.max()

Unsupported Parquet types are not silently imported as schema-less object() columns; they raise so callers can decide how to handle them explicitly.

Parameters:
  • path (str or path-like) – Path to the source Parquet file.

  • columns (list[str] or None, optional) – Subset of columns to read from the Parquet file. If provided, only these columns are loaded and their order in the resulting table matches the order in this list. Column names must be unique.

  • batch_size (int, optional) – Number of rows per Arrow batch read from the Parquet file. This controls how much data is pulled from the file at a time before being handed off to the CTable builder. Must be greater than 0.

  • urlpath (str or None, optional) – Destination storage path for the resulting CTable. If None (the default), the table is created in memory. If provided, the table is backed by persistent on-disk storage.

  • mode (str, optional) – Storage open mode for urlpath. Defaults to "w". This is passed through to CTable.from_arrow().

  • cparams (object, optional) – Compression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().

  • dparams (object, optional) – Decompression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().

  • validate (bool, optional) – Whether to enable extra internal validation while building the table. Defaults to False.

  • auto_null_sentinels (bool, optional) – If True (default), nullable scalar columns imported from Parquet may automatically receive per-column null sentinel values when needed. Sentinel selection follows the current null-policy rules used by CTable schema handling.

  • blosc2_batch_size (int or None, optional) – Number of items written to Blosc2 containers per internal write batch. Passed through to CTable.from_arrow().

  • blosc2_items_per_block (int or None, optional) – Target number of items per internal Blosc2 block. Passed through to CTable.from_arrow(). In general, a larger number of items per block favors the compression ratio but makes random access slower.

  • list_serializer ({"msgpack", "arrow"}, optional) – Serializer used for imported list columns. The default, "arrow", stores Arrow list batches directly and is much faster for deeply nested or list<struct<...>> columns. The tradeoff is that accessing those list columns later requires PyArrow. Use "msgpack" to keep list-column stores independent of PyArrow at read time; it can be smaller for simple lists but is much slower and more memory-intensive for deeply nested data.

  • separate_nested_cols (bool, optional) – Whether to separate qualifying nested columns during import. Defaults to True. In particular, a single unnamed top-level list<struct<...>> field is treated as a root record stream: each list element becomes a CTable row and struct leaves become ordinary nested CTable columns. Use separate_nested_cols=False when closer fidelity to the original Parquet row/schema shape is more important than the separated column layout.

  • max_rows (int or None, optional) – Maximum number of rows to import. For ordinary Parquet files this limits Parquet/CTable rows. For unnamed-root list<struct<...>> files imported with separate_nested_cols=True, this limits flattened element rows.

  • **kwargs – Additional keyword arguments forwarded to pyarrow.parquet.ParquetFile. Use these for Parquet-reader-specific options supported by PyArrow.
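The interaction between separate_nested_cols and max_rows for an unnamed-root list<struct<...>> file can be pictured with plain Python lists: each list element becomes one output row, and max_rows counts elements rather than source rows. element_rows is a hypothetical helper and the data is made up.

```python
from itertools import islice


def element_rows(parquet_rows, max_rows=None):
    """Return one output row per list element, stopping after max_rows elements."""
    if max_rows is not None and max_rows < 0:
        raise ValueError("max_rows must be non-negative")
    elements = (element for row in parquet_rows for element in row)
    return list(islice(elements, max_rows))  # islice(..., None) takes everything


# Three source rows, each holding a list of struct-like dicts.
source = [[{"id": 1}, {"id": 2}], [{"id": 3}], [{"id": 4}, {"id": 5}]]

print(len(element_rows(source)))
print(element_rows(source, max_rows=3))
```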

Returns:

A new CTable populated from the Parquet file. The table contains all selected columns and all rows from the file, or at most max_rows rows when max_rows is set. If urlpath is provided, the returned table is disk-backed; otherwise it is in-memory.

Return type:

CTable

Raises:
  • ImportError – If pyarrow is not installed.

  • ValueError – If batch_size is not greater than 0.

  • ValueError – If max_rows is negative.

  • ValueError – If columns contains duplicate names.

  • Exception – Any exception raised by pyarrow while opening or reading the Parquet file, or by CTable.from_arrow() while converting Arrow data into a CTable.

Examples

Load an entire Parquet file into an in-memory table:

>>> import blosc2
>>> t = blosc2.CTable.from_parquet("data.parquet")

Load only a subset of columns:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     columns=["user_id", "amount", "country"],
... )

Create a disk-backed table while reading in batches:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     batch_size=50_000,
...     urlpath="data.ctable",
... )

Pass additional options through to PyArrow’s Parquet reader:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     memory_map=True,
... )

classmethod load(urlpath: str) CTable[source]

Load a persistent table from urlpath into RAM.

The schema is read from the table’s metadata — the original Python dataclass is not required. The returned table is fully in-memory and read/write.

Parameters:

urlpath – Path to the table root directory.

Raises:
  • FileNotFoundError – If urlpath does not contain a CTable.

  • ValueError – If the metadata at urlpath does not identify a CTable.

classmethod open(urlpath: str, *, mode: str = 'r') CTable[source]

Open a persistent CTable from urlpath.

Parameters:
  • urlpath – Path to the table root directory (created by passing urlpath to CTable).

  • mode – 'r' (default): read-only. 'a': read/write.

Raises:
  • FileNotFoundError – If urlpath does not contain a CTable.

  • ValueError – If the metadata at urlpath does not identify a CTable.

__getattr__(s: str)[source]

Convenience fallback for attribute-style column access.

This is called only after normal Python attribute lookup fails. Thus t.name can return a column only for non-conflicting identifier-like column names. For columns whose names conflict with existing CTable attributes/methods, or are not valid identifiers, use the canonical item access form t["name"].
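The fallback rule is standard Python behaviour: __getattr__ runs only when ordinary attribute lookup finds nothing. The toy class below (MiniTable, a hypothetical stand-in and not the real CTable) shows why a column named "where" is reachable through item access but shadowed by the method under attribute access.

```python
class MiniTable:
    """Toy table with attribute-style column access as a fallback."""

    def __init__(self, columns):
        self._cols = dict(columns)

    def where(self, predicate):
        return f"filtered by {predicate}"

    def __getitem__(self, name):
        return self._cols[name]

    def __getattr__(self, name):
        # Reached only after normal lookup (instance dict, class, bases) has failed.
        cols = self.__dict__.get("_cols", {})
        try:
            return cols[name]
        except KeyError:
            raise AttributeError(name) from None


t = MiniTable({"temperature": [20.5, 21.0], "where": ["lab", "roof"]})

print(t.temperature)    # column, found through the __getattr__ fallback
print(t["where"])       # column named "where", via item access
print(callable(t.where))  # the method wins under attribute access
```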

__getitem__(key)[source]

Type-driven indexing for columns, rows, projections, and filters.

Supported keys are:

  • str: return a Column when it matches a stored or computed column name; otherwise evaluate it as a boolean expression via where(). Dotted names (e.g. "trip.begin.lon") select nested leaf columns directly; a struct-prefix name (e.g. "trip.begin") that matches multiple descendant leaves returns a _StructPathColumn view. This item-access form is the canonical way to access columns and works for every column name, including names that are not valid Python identifiers or that collide with existing CTable attributes or methods.

  • boolean blosc2.LazyExpr or blosc2.NDArray: return the same filtered view as where(), e.g. t[t.temperature_f > 70].

  • int: return one live row as a namedtuple-like object.

  • slice: return a row-range view.

  • integer array/list: return a gathered-row view.

  • boolean NumPy array/list: return a boolean-mask filtered view.

  • string list: return a column-projection view, equivalent to select().

Examples

Access columns and rows:

temps = t["temperature"]
first = t[0]
view = t[10:20]

Filter rows with a string expression, a stored-column expression, or a computed-column expression:

warm = t["temperature > 20"]
warm_active = t[(t.temperature > 20) & t.active]
hot_fahrenheit = t[t.temperature_f > 70]

Project columns:

slim = t[["sensor_id", "temperature_f"]]

Access a nested leaf column with a dotted name or an attribute chain:

lons = t["trip.begin.lon"]   # Column for the nested leaf
lons = t.trip.begin.lon      # equivalent attribute-chain form

Attribute access is only a convenience fallback. If a column name is not a valid identifier, or if it conflicts with an existing table attribute or method such as nrows, where or sort_by, use item access instead:

col = t["where"]             # column named "where"
method = t.where             # CTable.where method

add_column(name: str, spec: SchemaSpec | Field) None[source]

Add a new column filled from the default declared in spec.

Parameters:
  • name – Column name. Must follow the same naming rules as schema fields.

  • spec – A schema descriptor such as b2.int64(ge=0) or a field descriptor such as b2.field(b2.int64(ge=0), default=0). When the table already has live rows, use blosc2.field(...) with a default declared so those rows can be backfilled.

Raises:
  • ValueError – If the table is read-only, is a view, the column already exists, or a non-empty table is given a column with no default declared.

  • TypeError – If a declared default cannot be coerced to spec’s dtype.
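The backfill rule for add_column can be modelled on a dict of lists: an empty table accepts any new column, while a table with live rows needs a default so existing rows receive a value. add_column_sketch and the _MISSING marker are hypothetical names for this illustration; schema validation and dtype coercion are left out.

```python
_MISSING = object()  # marker for "no default declared"


def add_column_sketch(table, name, default=_MISSING):
    """Add a column to a dict-of-lists table, backfilling existing rows."""
    if name in table:
        raise ValueError(f"column {name!r} already exists")
    nrows = len(next(iter(table.values()), []))
    if nrows and default is _MISSING:
        raise ValueError("non-empty table: a default is required to backfill rows")
    table[name] = [default] * nrows if nrows else []


table = {"sensor_id": [1, 2, 3]}
add_column_sketch(table, "retries", default=0)
print(table)
```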

add_computed_column(name: str, expr: str | LazyExpr | Callable[[dict[str, Any]], LazyExpr], *, dtype: dtype | None = None) None[source]

Add a read-only virtual column computed from stored columns.

A computed column has no physical storage. It is backed by a blosc2.LazyExpr and is evaluated when values are read, filtered, displayed, exported, or aggregated. Because it is virtual, it is read-only, cannot be indexed directly, and is not supplied in append() / extend() inputs. To store and optionally index a computed result, use add_generated_column() or materialize an existing computed column with materialize_computed_column().

Supported signatures are:

add_computed_column(name, "price * qty", dtype=None)
add_computed_column(name, lazy_expr, dtype=None)
add_computed_column(name, lambda cols: cols["price"] * cols["qty"], dtype=None)

Parameters:
  • name – Name of the virtual computed column. It must be a valid column name and must not collide with an existing stored or computed column.

  • expr

    Definition of the virtual column. Accepted forms:

    • str: scalar expression over stored scalar columns, e.g. "price * qty".

    • blosc2.LazyExpr: lazy expression over stored columns of this table.

    • callable: called as expr(self._cols) and must return a blosc2.LazyExpr over stored columns of this table.

    Expressions must depend only on stored columns of this table; computed columns cannot depend on other computed columns in this version. Fixed-shape ndarray columns are not accepted in computed column expressions yet. For row-wise ndarray projections or reductions, use add_generated_column() with values=t.ndarray_col.row_transformer....

  • dtype – Optional dtype override for the computed values. When omitted, the dtype is inferred from the resulting blosc2.LazyExpr. This changes the dtype reported by the CTable column wrapper; it does not create physical storage.

Examples

Add a computed column from a string expression and use it like a normal read-only column:

t.add_computed_column("total", "price * qty")
assert t.total[:].shape == (t.nrows,)

Add a computed column from a callable. The callable receives the table’s stored column mapping:

import numpy as np

t.add_computed_column(
    "price_with_tax",
    lambda cols: cols["price"] * 1.21,
    dtype=np.float64,
)

Callable expressions can use normal Python logic while still returning a lazy expression:

include_tax = True  # example flag read by total_expr when it is called

def total_expr(cols):
    base = cols["price"] * cols["qty"]
    return base * 1.21 if include_tax else base

t.add_computed_column("total", total_expr)

They are also convenient for reusable, parameterized helpers:

def ratio(num, den):
    return lambda cols: cols[num] / cols[den]

t.add_computed_column("margin", ratio("profit", "revenue"))

Computed columns participate in filters and aggregates:

expensive = t.where(t.total > 100)
total_revenue = t.total.sum()

Computed columns are virtual and read-only. Materialize one when a stored snapshot or an indexable column is needed:

t.materialize_computed_column("total", new_name="total_stored")
t.create_index("total_stored")

For maintained stored results, prefer generated columns:

t.add_generated_column(
    "total_stored",
    values="price * qty",
    dtype=blosc2.float64(),
    create_index=True,
)

Raises:
  • ValueError – If called on a view or read-only table, if name already exists, or if an expression operand does not reference a stored column of this table.

  • TypeError – If expr has an unsupported form, does not produce a blosc2.LazyExpr, references unsupported source columns, or if a RowTransformer is passed. Row transformers are only accepted by add_generated_column().

add_generated_column(name: str, *, values: str | LazyExpr | Callable[[dict[str, Any]], LazyExpr] | RowTransformer, dtype=None, create_index: bool = False) None[source]

Add a stored generated column maintained by the table.

A generated column is physical storage, not a virtual expression. The initial values are computed for all current live rows, and later append() / extend() calls automatically compute values for newly inserted rows when source columns are provided. If a source column is modified in-place, dependent generated columns are marked stale; call refresh_generated_column() or refresh_generated_columns() to recompute them.

Supported signatures are:

add_generated_column(name, *, values="price * qty", dtype=..., create_index=False)
add_generated_column(name, *, values=lazy_expr, dtype=..., create_index=False)
add_generated_column(name, *, values=lambda cols: cols["price"] * 1.21, dtype=...)
add_generated_column(name, *, values=t.embedding.row_transformer.norm(axis=0), dtype=...)
add_generated_column(name, *, values=t.image.row_transformer.mean(axis=(0, 1)),
                     dtype=blosc2.ndarray((3,), dtype=...))

Parameters:
  • name – Name of the generated column to create. It must be a valid column name and must not collide with an existing stored or computed column.

  • values

    Definition used to compute the generated values. Accepted forms:

    • str: scalar expression over stored scalar columns, e.g. "price * qty". The expression must produce one scalar value per row.

    • blosc2.LazyExpr: scalar lazy expression over stored columns of this table. It must produce a 1-D scalar stream.

    • callable: called as values(self._cols) and must return a blosc2.LazyExpr over stored columns of this table.

    • RowTransformer: row-wise projection/reduction bound to a fixed-shape ndarray column, e.g. t.embedding.row_transformer.norm(axis=0) or t.image.row_transformer.mean(axis=(0, 1)). Row transformers may produce either one scalar per row or one fixed-shape ndarray item per row.

    Expression forms currently cannot depend on computed columns and cannot directly consume fixed-shape ndarray columns; use a row-transformer for ndarray row projections/reductions.

  • dtype – Output schema or dtype. Scalar outputs may pass a NumPy dtype or a Blosc2 scalar spec such as blosc2.float64(). Fixed-shape ndarray outputs must pass an ndarray spec such as blosc2.ndarray((3,), dtype=blosc2.float32()) unless the table has existing rows from which the output shape can be inferred. When omitted, dtype and fixed-shape output shape are inferred from the current generated values; this is not possible for an empty table.

  • create_index – If True, create an index on the generated column immediately. Only scalar generated columns can be indexed; fixed-shape ndarray generated columns raise ValueError when indexing is requested.

Examples

Create and index a scalar generated column from a string expression:

t.add_generated_column(
    "total",
    values="price * qty",
    dtype=blosc2.float64(),
    create_index=True,
)

Use a callable when normal Python composition is more convenient:

t.add_generated_column(
    "price_with_tax",
    values=lambda cols: cols["price"] * 1.21,
    dtype=blosc2.float64(),
)

Generate a scalar from each fixed-shape ndarray row. For row transformers, axes refer to the per-row item shape, so axis=0 is the embedding-coordinate axis for item_shape=(dim,):

t.add_generated_column(
    "embedding_norm",
    values=t.embedding.row_transformer.norm(axis=0, ord=2),
    dtype=blosc2.float64(),
    create_index=True,
)

Generate a fixed-shape ndarray value per row. Here an image column has item_shape=(height, width, 3) and the generated column stores one RGB vector per row:

t.add_generated_column(
    "image_mean_rgb",
    values=t.image.row_transformer.mean(axis=(0, 1)),
    dtype=blosc2.ndarray((3,), dtype=blosc2.float32()),
)

Generated columns are maintained on append/extend:

t.append((new_id, new_embedding, new_image))
assert np.isclose(t.embedding_norm[-1], np.linalg.norm(new_embedding))

If source values are changed in place, refresh dependent generated columns before relying on them:

t.embedding[0] = new_embedding
t.refresh_generated_column("embedding_norm")

Raises:
  • ValueError – If called on a view or read-only table, if name already exists, if generated output length/shape is incompatible with the table, or if create_index=True is requested for an ndarray generated column.

  • TypeError – If values has an unsupported form, references unsupported source columns, or cannot be coerced to dtype.

  • KeyError – If a RowTransformer references a missing source column.
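The difference between a virtual computed column and a stored generated column can be shown with a toy model. PriceTable below is a hypothetical class, not the CTable implementation: the computed value is re-evaluated on every read, while the generated value is stored, maintained on append, and marked stale after an in-place change until it is refreshed.

```python
class PriceTable:
    """Toy table with one computed (virtual) and one generated (stored) total."""

    def __init__(self):
        self.price, self.qty = [], []
        self.total = []          # generated column: physically stored values
        self.total_stale = False

    def computed_total(self):
        # Computed column: no storage, evaluated from the sources on each read.
        return [p * q for p, q in zip(self.price, self.qty)]

    def append(self, price, qty):
        self.price.append(price)
        self.qty.append(qty)
        self.total.append(price * qty)  # generated value maintained automatically

    def set_price(self, index, value):
        self.price[index] = value
        self.total_stale = True  # stored values no longer match their sources

    def refresh_total(self):
        self.total = self.computed_total()
        self.total_stale = False


t = PriceTable()
t.append(2.0, 3)
t.append(1.5, 4)
t.set_price(0, 10.0)
print(t.computed_total(), t.total, t.total_stale)
t.refresh_total()
print(t.total, t.total_stale)
```

The stored column costs space and needs refreshing, but it can be read without re-evaluating the expression, which is what makes indexing it practical.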

append(data: list | void | ndarray) None[source]

Append a single row to the table.

data may be a list, tuple, numpy.void, or structured numpy.ndarray whose fields match the schema column order. Materialized columns whose values are omitted are auto-filled from their recorded expression. Raises ValueError if the table is read-only or a view.

For tables with nested (dotted) column names, the row may also be supplied as a dict, either as a flat mapping of dotted keys or as a nested dict that mirrors the original struct shape. Both forms are accepted and automatically flattened to the physical dotted leaf names:

# flat dotted keys
t.append({"trip.begin.lon": -87.6, "trip.begin.lat": 41.8,
          "payment.fare": 12.5})

# original nested dict (auto-flattened)
t.append({"trip": {"begin": {"lon": -87.6, "lat": 41.8}},
          "payment": {"fare": 12.5}})

close() None[source]

Close any persistent backing store held by this table.

column_schema(name: str) CompiledColumn[source]

Return the CompiledColumn descriptor for name.

Raises:

KeyError – If name is not a column in this table.

compact()[source]

Physically rewrite every column array keeping only live rows.

Closes the gaps left by prior delete() calls by shuffling live data to the front of each column array. The underlying NDArray allocations are not resized — each column retains its original capacity. To actually reclaim memory, use copy() with compact=True instead, which allocates fresh arrays sized to the live row count. All existing indexes are dropped and must be recreated afterwards. Raises ValueError if the table is read-only or a view.
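The contrast between in-place compaction and a compacting copy can be sketched with a list of values and a validity mask. Both helpers are hypothetical and only model the behaviour described above: the in-place version moves live values to the front and keeps the original capacity, while the copy allocates a new list sized to the live rows.

```python
def compact_in_place(values, valid):
    """Move live values to the front; the list length (capacity) is unchanged."""
    live = [v for v, ok in zip(values, valid) if ok]
    n = len(live)
    values[:n] = live  # same-length slice assignment, so no resize
    valid[:] = [i < n for i in range(len(valid))]
    return n


def compact_copy(values, valid):
    """Return a fresh list holding only the live values."""
    return [v for v, ok in zip(values, valid) if ok]


values = [10, 20, 30, 40, 50]
valid = [True, False, True, False, True]

fresh = compact_copy(values, valid)
n_live = compact_in_place(values, valid)
print(fresh, len(fresh))
print(values[:n_live], len(values), valid)
```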

compact_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]

Compact an index, merging any incremental append runs.

copy(compact: bool = True, *, urlpath: str | PathLike[str] | None = None, overwrite: bool = False) CTable[source]

Return a new standalone copy of this table.

This is the only operation that truly reclaims memory: when compact=True the new table allocates fresh arrays sized exactly to the live row count, discarding all deleted-row gaps and unused capacity.

Parameters:
  • compact – If True (default), only live (non-deleted) rows are copied. The result is a dense table with no tombstones and no parent dependency — ideal for materialising a filtered view or freeing memory after heavy deletions. If False, all physical slots are copied including deleted gaps, preserving the tombstone state exactly for in-memory copies.

  • urlpath – Destination path for a persistent copy. The .b2z extension selects a compact zip-backed store; any other path uses a directory-backed store. A .b2d suffix is recommended for directory-backed stores. If None (default), return an in-memory copy.

  • overwrite – If True, replace an existing persistent destination.

cov() ndarray[source]

Return the covariance matrix as a numpy array.

Only int, float, and bool columns are supported. Bool columns are cast to int (0/1) before computation. Complex columns raise TypeError.

Returns:

Shape (ncols, ncols). Column order matches col_names.

Return type:

numpy.ndarray

Raises:
  • TypeError – If any column has an unsupported dtype (complex, string, …).

  • ValueError – If the table has fewer than 2 live rows (covariance undefined).
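For orientation, here is a minimal pure-Python sketch of a covariance matrix over a list of columns. It assumes the sample covariance (division by n - 1), which this reference does not state explicitly; covariance_matrix is a hypothetical helper, not part of blosc2.

```python
def covariance_matrix(columns):
    """Sample covariance of equal-length numeric columns (bools cast to 0/1)."""
    nrows = len(columns[0])
    if nrows < 2:
        raise ValueError("covariance is undefined for fewer than 2 rows")
    # float(True) == 1.0, so bool columns are cast to 0/1 here.
    cols = [[float(v) for v in col] for col in columns]
    means = [sum(col) / nrows for col in cols]
    return [
        [
            sum((a - mean_a) * (b - mean_b) for a, b in zip(col_a, col_b)) / (nrows - 1)
            for col_b, mean_b in zip(cols, means)
        ]
        for col_a, mean_a in zip(cols, means)
    ]


matrix = covariance_matrix([[1, 2, 3], [2, 4, 6]])
```

The result has shape (ncols, ncols), with variances on the diagonal.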

create_index(col_name: str | None = None, *, field: str | None = None, expression: str | None = None, operands: dict | None = None, kind: IndexKind = IndexKind.BUCKET, optlevel: int = 5, name: str | None = None, build: str = 'auto', tmpdir: str | None = None, **kwargs) Index[source]

Build and register an index for a stored column or table expression.

For tables with nested (dotted) column names, pass the dotted leaf name directly:

t.create_index("trip.begin.lon")
t.where("trip.begin.lon > -87.7").nrows   # index is used automatically
delete(ind: int | slice | str | Iterable) None[source]

Mark one or more rows as deleted (tombstone deletion).

ind may be a logical row index (int), a slice, or an iterable of logical indices. Deleted rows are excluded from all subsequent queries and aggregates. Their physical slots stay in place until compact() is called; to actually reclaim memory, use copy() with compact=True. Raises ValueError if the table is read-only or a view.

describe() None[source]

Print a per-column statistical summary.

Numeric columns (int, float): count, mean, std, min, max. Bool columns: count, true-count, true-%. String columns: count, min (lex), max (lex), n-unique.

drop_column(name: str) None[source]

Remove a column from the table.

For disk-backed tables, the corresponding persisted column leaf is deleted.

Raises:
  • ValueError – If the table is read-only, is a view, or name is the last column.

  • KeyError – If name does not exist.

drop_computed_column(name: str) None[source]

Remove a computed column from the table.

Parameters:

name – Name of the computed column to remove.

Raises:
  • KeyError – If name is not a computed column.

  • ValueError – If called on a view.

drop_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) None[source]

Remove an index and delete any sidecar files.

extend(data: list | CTable | Any, *, validate: bool | None = None) None[source]

Append multiple rows at once.

data may be:

  • a dict of arrays {"col": array, ...} — all arrays must have the same length; omitted columns are filled from their declared default; columns with no default declared must be provided;

  • a list of rows, each compatible with append();

  • another CTable — columns are matched by name.

Pass validate=False to skip per-row Pydantic validation on trusted bulk imports. Raises ValueError if the table is read-only or a view.
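The dict-of-arrays rules above can be sketched in plain Python. This is an illustrative model, not blosc2's implementation: fill_columns, its defaults mapping, and the use of ValueError for every failure are assumptions made for the example.

```python
def fill_columns(data, col_names, defaults):
    """Normalise a dict of arrays to one equal-length list per column."""
    lengths = {len(values) for values in data.values()}
    if len(lengths) != 1:
        raise ValueError("all arrays must have the same length")
    nrows = lengths.pop()

    out = {}
    for name in col_names:
        if name in data:
            out[name] = list(data[name])
        elif name in defaults:
            # Omitted column: repeat its declared default for every new row.
            out[name] = [defaults[name]] * nrows
        else:
            raise ValueError(f"column {name!r} has no default and must be provided")
    return out


batch = fill_columns(
    {"id": [1, 2, 3]},
    col_names=["id", "score"],
    defaults={"score": 0.0},
)
```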

For tables with nested (dotted) column names both the dict-of-arrays and list-of-dicts forms accept the original nested dict shape and auto-flatten it to physical dotted leaf names:

# nested dict of arrays
t.extend({
    "trip": {"begin": {"lon": lons, "lat": lats}},
    "payment": {"fare": fares},
})

# list of nested dicts
t.extend([
    {"trip": {"begin": {"lon": -87.6, "lat": 41.8}}, "payment": {"fare": 12.5}},
    {"trip": {"begin": {"lon": -87.5, "lat": 41.7}}, "payment": {"fare": 8.0}},
])
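The auto-flattening of nested dicts to dotted leaf names can be pictured with a short recursive helper. This is a sketch of the idea only; flatten is a hypothetical function and the library's internal routine may differ.

```python
def flatten(row, prefix=""):
    """Flatten a nested dict into a flat dict keyed by dotted leaf names."""
    flat = {}
    for key, value in row.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into the struct and extend the dotted prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


nested = {"trip": {"begin": {"lon": -87.6, "lat": 41.8}}, "payment": {"fare": 12.5}}
flat = flatten(nested)
```

Input that is already flat passes through unchanged, which is why both spellings can be accepted by the same code path in this model.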
group_by(keys: str | Sequence[str], *, sort: bool = False, dropna: bool = True, engine: str = 'auto', chunk_size: int | None = None)[source]

Return a deferred group-by object for this table.

Parameters:
  • keys – Column name or sequence of column names to group by.

  • sort – If True, sort the result by the group keys. The default False preserves the hash aggregation order and is usually faster.

  • dropna – If True (default), rows with null/NaN group keys are skipped. If False, null/NaN keys form their own group.

  • engine – Execution engine. Currently only "auto" is accepted; it uses the NumPy chunked implementation.

  • chunk_size – Optional number of physical rows processed per chunk.

Returns:

A lightweight deferred operation builder. Call methods such as .size(), .count(column) or .agg({...}) to materialize a grouped result as a new CTable.

Return type:

CTableGroupBy
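The effect of sort and dropna on a grouped result can be sketched with a plain dict used as a hash table. This is a toy model of a size aggregation, not the library's chunked implementation; group_sizes is hypothetical, and placing the null group last when sorting is an assumption.

```python
import math


def group_sizes(keys, *, sort=False, dropna=True):
    """Count rows per key, mimicking the sort/dropna options described above."""

    def is_null(key):
        return key is None or (isinstance(key, float) and math.isnan(key))

    counts = {}
    for key in keys:
        if is_null(key):
            if dropna:
                continue
            key = None  # fold None and NaN into a single null group
        counts[key] = counts.get(key, 0) + 1

    items = list(counts.items())  # first-seen (hash aggregation) order
    if sort:
        nulls = [kv for kv in items if kv[0] is None]
        rest = sorted((kv for kv in items if kv[0] is not None), key=lambda kv: kv[0])
        items = rest + nulls
    return items


unsorted_sizes = group_sizes(["b", "a", "b", None])
sorted_sizes = group_sizes(["b", "a", "b", None], sort=True)
with_nulls = group_sizes(["b", "a", "b", None], dropna=False)
```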

head(N: int = 5) CTable[source]

Return a view of the first N live rows (default 5).

index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]

Return the index handle for a stored-column or expression target.

iter_arrow_batches(*, columns: list[str] | None = None, batch_size: int = 2048, include_computed: bool = True)[source]

Yield live rows as bounded-size pyarrow.RecordBatch objects.

iter_sorted(cols: str | list[str], ascending: bool | list[bool] = True, *, start: int | None = None, stop: int | None = None, step: int | None = None, batch_size: int = 4096)[source]

Iterate rows in sorted order without materializing a full copy.

Uses a FULL index when available (no sort needed); otherwise falls back to np.lexsort on live physical positions. Yields namedtuple-like row objects in the same way as __iter__.

The sorted positions array is stored as a compressed blosc2.NDArray to keep RAM usage low for large tables. batch_size positions are decompressed at a time during iteration.

Parameters:
  • cols – Column name or list of column names to sort by.

  • ascending – Sort direction. A single bool applies to all keys; a list must have the same length as cols.

  • start, stop, step – Optional slice parameters applied to the sorted sequence before iteration. E.g. stop=10 yields only the top-10 rows; step=2 yields every other row in sorted order.

  • batch_size – Number of positions decompressed per iteration step. Larger values reduce decompression overhead; smaller values use less transient RAM. Default is 4096.
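A rough pure-Python model of the parameters above: sort positions rather than rows, slice the positions, then walk them in batches. The function name iter_sorted_sketch and the list-of-dicts row representation are assumptions for illustration; the real method keeps the positions in a compressed array.

```python
def iter_sorted_sketch(rows, key, *, ascending=True, start=None, stop=None,
                       step=None, batch_size=4096):
    """Yield rows in sorted order by sorting row positions, not the rows."""
    positions = sorted(range(len(rows)), key=lambda i: rows[i][key],
                       reverse=not ascending)
    # The slice is applied to the sorted sequence before iteration.
    positions = positions[start:stop:step]
    for lo in range(0, len(positions), batch_size):
        # Only one batch of positions is handled at a time.
        for pos in positions[lo:lo + batch_size]:
            yield rows[pos]


rows = [{"fare": 12.5}, {"fare": 8.0}, {"fare": 20.0}, {"fare": 3.0}]
cheapest_two = list(iter_sorted_sketch(rows, "fare", stop=2, batch_size=1))
every_other = list(iter_sorted_sketch(rows, "fare", step=2))
```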

materialize_computed_column(name: str, *, new_name: str | None = None, dtype: dtype | None = None, cparams: dict | CParams | None = None) None[source]

Materialize a computed column into a new stored snapshot column.

Parameters:
  • name – Existing computed column to materialize.

  • new_name – Name of the new stored column. Defaults to f"{name}_stored".

  • dtype – Optional target dtype for the stored column. Defaults to the computed column dtype.

  • cparams – Optional compression parameters for the new stored column.

Raises:
  • ValueError – If called on a view, on a read-only table, or if the target name collides with an existing stored or computed column.

  • KeyError – If name is not a computed column.

  • TypeError – If dtype is incompatible with the computed values.

rebuild_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]

Drop and recreate an index with the same parameters.

refresh_generated_column(name: str) None[source]

Recompute a stored generated/materialized column from its source columns.

refresh_generated_columns(*, source: str | None = None) None[source]

Refresh all generated columns, optionally only those depending on source.

rename_column(old: str, new: str) None[source]

Rename a column.

For disk-backed tables, the corresponding persisted column leaf is renamed.

Renaming a flat column to a dotted name (e.g. "trip.begin.lon") promotes it to a nested leaf column: it will be stored under the hierarchical path /_cols/trip/begin/lon on disk and can be accessed via t["trip.begin.lon"] or the attribute-chain proxy t.trip.begin.lon. This is the primary way to define nested columns when importing from non-Arrow sources:

t.rename_column("trip_begin_lon", "trip.begin.lon")
t["trip.begin.lon"].mean()   # works as a regular Column
Raises:
  • ValueError – If the table is read-only, is a view, or new already exists.

  • KeyError – If old does not exist.

sample(n: int, *, seed: int | None = None) CTable[source]

Return a read-only view of n randomly chosen live rows.

Parameters:
  • n – Number of rows to sample. If n >= number of live rows, returns a view of the whole table.

  • seed – Optional random seed for reproducibility.

Returns:

A read-only view sharing columns with this table.

Return type:

CTable

save(urlpath: str, *, overwrite: bool = False) None[source]

Persist this table to disk at urlpath.

This writes a standalone copy and returns None; use copy() directly when the copied CTable object is needed.

Only live rows are written — the on-disk table is always compacted. A .b2z suffix selects the compact zip-backed format; any other suffix creates a directory-backed store. Use a .b2d suffix for directory-backed stores when possible so the format is clear.

Parameters:
  • urlpath – Destination path. Use a .b2z suffix for a compact zip-backed store; any other suffix creates a directory-backed store. A .b2d suffix is recommended for directory-backed stores.

  • overwrite – If False (default), raise ValueError when urlpath already exists. Set to True to replace an existing table.

Raises:

ValueError – If urlpath already exists and overwrite=False.

schema_dict() dict[str, Any][source]

Return a JSON-compatible dict describing this table’s schema.

select(cols: list[str]) CTable[source]

Return a column-projection view exposing only cols.

The returned object shares the underlying NDArrays with this table (no data is copied). Row filtering and value writes work as usual; structural mutations (add/drop/rename column, append, …) are blocked.

Parameters:

cols

Ordered list of column names to keep. For tables with nested (dotted) column names, a struct-prefix name automatically expands to all descendant leaves:

t.select(["trip.begin"])   # expands to trip.begin.lon, trip.begin.lat
t.select(["trip"])          # expands to all trip.* leaves

Raises:
  • KeyError – If any name in cols is not a column of this table (and does not match any struct prefix).

  • ValueError – If cols is empty.
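Struct-prefix expansion can be sketched as a prefix match over the dotted leaf names. expand_selection below is a hypothetical helper written for illustration; it mirrors the documented KeyError and ValueError cases but is not the library's code.

```python
def expand_selection(cols, col_names):
    """Expand struct prefixes such as "trip" into their dotted leaf columns."""
    if not cols:
        raise ValueError("cols must not be empty")
    selected = []
    for name in cols:
        if name in col_names:
            selected.append(name)
            continue
        # Not an exact column: treat it as a struct prefix.
        leaves = [c for c in col_names if c.startswith(name + ".")]
        if not leaves:
            raise KeyError(name)
        selected.extend(leaves)
    return selected


col_names = ["trip.begin.lon", "trip.begin.lat", "trip.end.lon", "payment.fare"]
begin_only = expand_selection(["trip.begin"], col_names)
trip_and_fare = expand_selection(["trip", "payment.fare"], col_names)
```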

sort_by(cols: str | list[str], ascending: bool | list[bool] = True, *, inplace: bool = False) CTable[source]

Return a copy of the table sorted by one or more columns.

Parameters:
  • cols

    Column name or list of column names to sort by. When multiple columns are given, the first is the primary key, the second is the tiebreaker, and so on. For tables with nested (dotted) column names, pass the dotted leaf name directly:

    t.sort_by("trip.begin.lon")
    t.sort_by(["trip.begin.lon", "payment.fare"], ascending=[True, False])
    

  • ascending – Sort direction. A single bool applies to all keys; a list must have the same length as cols.

  • inplace – If True, rewrite the physical data in place and return self (like compact() but sorted). If False (default), return a new in-memory CTable leaving this one untouched.

Raises:
  • ValueError – If called on a view or a read-only table when inplace=True.

  • KeyError – If any column name is not found.

  • TypeError – If a column used as a sort key does not support ordering (e.g. complex numbers).
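How a primary key and tiebreakers combine with per-key directions can be sketched with repeated stable sorts, applied from the last key to the first. This is a plain-Python illustration, not the method's actual algorithm; sort_positions and the dict-of-lists layout are assumptions.

```python
def sort_positions(columns, cols, ascending=True):
    """Return row positions ordered by several keys with per-key direction."""
    if isinstance(cols, str):
        cols = [cols]
    if isinstance(ascending, bool):
        ascending = [ascending] * len(cols)
    if len(ascending) != len(cols):
        raise ValueError("ascending must have the same length as cols")

    positions = list(range(len(columns[cols[0]])))
    # Stable sorts applied last key first leave the first key as primary.
    for name, asc in reversed(list(zip(cols, ascending))):
        positions.sort(key=lambda i: columns[name][i], reverse=not asc)
    return positions


columns = {"lon": [1, 1, 0], "fare": [5, 9, 7]}
order = sort_positions(columns, ["lon", "fare"], ascending=[True, False])
```

Here rows are ordered by lon ascending, and rows with equal lon are ordered by fare descending.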

tail(N: int = 5) CTable[source]

Return a view of the last N live rows (default 5).

to_arrow()[source]

Convert all live rows to a pyarrow.Table.

to_b2d(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]

Write this table to a directory-backed store.

Directory-backed CTable stores may use any path that does not end in .b2z; using a .b2d suffix is recommended for clarity. For persistent, non-view .b2z tables opened read-only, with compact=False, this uses a fast physical-unpack path: the zip members are extracted as already-compressed leaves. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns.

For in-memory tables, views, writable .b2z tables, existing directory-backed tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new directory-backed store.

Examples

Fast-unpack an existing compact zip store into a directory-backed table:

table = blosc2.CTable.open("data.b2z", mode="r")
table.to_b2d("data.b2d", overwrite=True)
table.close()

Materialize a filtered view into a directory-backed store:

view = table.where(table["score"] > 10)
view.to_b2d("high-score.b2d", overwrite=True)

Force a logical compacted copy, even for a persistent .b2z table:

table.to_b2d("data-compact.b2d", overwrite=True, compact=True)
to_b2z(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]

Write this table to a compact .b2z container.

.b2z is the compact zip-backed CTable format. For persistent, non-view directory-backed tables, with compact=False, this uses a fast physical-pack path: the backing TreeStore directory is zipped with already-compressed leaves stored as-is. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns. A .b2d suffix is recommended for directory-backed stores, but not required.

For in-memory tables, views, existing .b2z tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new .b2z store.

Examples

Fast-pack an existing directory-backed table into a compact zip store:

table = blosc2.CTable.open("data.b2d", mode="r")
table.to_b2z("data.b2z", overwrite=True)
table.close()

Materialize a filtered view into a new compact store:

view = table.where(table["score"] > 10)
view.to_b2z("high-score.b2z", overwrite=True)

Force a logical compacted copy, even for a persistent .b2d table:

table.to_b2z("data-compact.b2z", overwrite=True, compact=True)
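The conditions under which to_b2d() and to_b2z() take the fast physical path, as opposed to the logical save() path, can be restated as two small predicates. This is only a restatement of the prose for these two methods; the function names, the source labels ("memory", "b2z", "b2d") and the returned strings are hypothetical.

```python
def to_b2d_path(*, source, is_view, read_only, compact):
    """Which path to_b2d() is described as taking for a given table state."""
    fast = source == "b2z" and not is_view and read_only and not compact
    return "physical-unpack" if fast else "logical-save"


def to_b2z_path(*, source, is_view, compact):
    """Which path to_b2z() is described as taking for a given table state."""
    fast = source == "b2d" and not is_view and not compact
    return "physical-pack" if fast else "logical-save"


unpack = to_b2d_path(source="b2z", is_view=False, read_only=True, compact=False)
writable = to_b2d_path(source="b2z", is_view=False, read_only=False, compact=False)
pack = to_b2z_path(source="b2d", is_view=False, compact=False)
forced = to_b2z_path(source="b2d", is_view=False, compact=True)
```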
to_csv(path: str, *, header: bool = True, sep: str = ',') None[source]

Write all live rows to a CSV file.

Uses Python’s stdlib csv module — no extra dependency required. Fixed-shape ndarray column cells are serialised as JSON arrays for readability and shape safety (e.g. "[1.0, 2.0, 3.0]").

Parameters:
  • path – Destination file path. Created or overwritten.

  • header – If True (default), write column names as the first row.

  • sep – Field delimiter. Defaults to ","; use "\t" for TSV.
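The JSON-array serialisation of array-valued cells can be reproduced with the stdlib csv and json modules. This sketch works on plain Python tuples and lists rather than a CTable; write_csv is a hypothetical helper, and writing to an open file object instead of a path is a simplification.

```python
import csv
import io
import json


def write_csv(rows, col_names, fh, *, header=True, sep=","):
    """Write rows to fh, serialising list-valued cells as JSON arrays."""
    writer = csv.writer(fh, delimiter=sep)
    if header:
        writer.writerow(col_names)
    for row in rows:
        writer.writerow(
            [json.dumps(cell) if isinstance(cell, (list, tuple)) else cell for cell in row]
        )


buffer = io.StringIO()
write_csv([(1, [1.0, 2.0, 3.0])], ["id", "embedding"], buffer)
text = buffer.getvalue()
```

Because the JSON array contains commas, the csv writer quotes that field, so it survives a round trip as a single cell.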

to_pandas()[source]

Convert to a pandas DataFrame.

Scalar columns become regular DataFrame columns. Fixed-shape ndarray columns become object-dtype columns whose cells hold NumPy arrays of per-row shape item_shape.

Return type:

pandas.DataFrame

Examples

>>> import blosc2
>>> from dataclasses import dataclass
>>> import numpy as np
>>> @dataclass
... class Row:
...     id: int = blosc2.field(blosc2.int64())
...     embedding: object = blosc2.field(blosc2.ndarray((3,), dtype=blosc2.float32()))
>>> t = blosc2.CTable(Row, new_data=[
...     (1, np.array([1, 2, 3], dtype=np.float32)),
...     (2, np.array([4, 5, 6], dtype=np.float32)),
... ])
>>> df = t.to_pandas()
>>> df["id"].tolist()
[1, 2]
>>> df["embedding"].dtype
dtype('O')
>>> np.testing.assert_array_equal(df["embedding"][0], np.array([1, 2, 3], dtype=np.float32))
to_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, compression: str | None = 'zstd', row_group_size: int | None = None, include_computed: bool = True, **kwargs) None[source]

Write this table to a Parquet file batch-wise using pyarrow.

view(new_valid_rows)[source]

Return a row-filter view backed by a boolean mask array without copying data.

where(expr_result: str | ndarray | NDArray | LazyExpr | Column, *, columns: list[str] | tuple[str, ...] | None = None) CTable[source]

Return a row-filtered view matching a boolean predicate.

Signature:

where(expr_result, *, columns=None) -> CTable

The predicate can be supplied as a boolean blosc2.LazyExpr, a boolean blosc2.NDArray, a boolean NumPy array, a boolean Column, or a string expression evaluated against this table’s columns. String expressions can reference stored and computed columns directly by name.

The returned object is a CTable view sharing the original column data. The row-selection mask is evaluated immediately and intersected with the table’s current live rows; selected column data is not copied.

Parameters:

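  • expr_result – Boolean predicate selecting rows. Strings are converted to a lazy expression with table columns as operands, e.g. "value * category >= 150". Column objects can also be used in Python expressions, e.g. (t.value * t.category) >= 150.

  • columns – Optional ordered subset of column names. When provided, the returned view is additionally projected to those columns.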

Returns:

A view over the same columns containing only rows where the predicate is true and the source row is live. When columns is provided, the returned view is additionally projected to that ordered subset of columns.

Return type:

CTable

Raises:

TypeError – If expr_result does not evaluate to a boolean Blosc2/NumPy array or lazy expression.

Examples

Filter using a string expression:

view = t.where("value * category >= 150")
slim = t.where("value * category >= 150", columns=["value", "category"])

Filter using column arithmetic:

view = t.where((t.value * t.category) >= 150)

Blosc2 lazy functions can be used in column expressions:

view = t.where(((t.value + 2) * blosc2.sin(t.category)) >= 10)

For column names that are not valid Python identifiers, use item access:

view = t.where((t["unit price"] * t["quantity"]) > 100)

For tables with nested (dotted) column names, dotted leaf names and attribute-chain proxies work in both string and expression forms:

view = t.where("trip.begin.lon > -87.7 and payment.fare > 10")
view = t.where(t.trip.begin.lon > -87.7)

Notes

Use bitwise operators (&, |, ~) or string expressions for element-wise boolean logic. Python’s logical operators and, or and not cannot be overloaded and therefore do not build lazy column expressions.

Use:

t.where((t.x > 0) & (t.y < 10))
t.where(~t.returned)
t.where("not returned")

not:

t.where((t.x > 0) and (t.y < 10))
t.where(not t.returned)
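Why the bitwise operators work while and / not do not can be seen with a toy mask class. Python lets a class overload &, | and ~, but and / not only call bool() on the operand. The Mask class below is hypothetical, and raising TypeError from __bool__ is a choice made for this sketch; this reference does not say how real column expressions react.

```python
class Mask:
    """Toy element-wise boolean mask supporting &, | and ~."""

    def __init__(self, bits):
        self.bits = [bool(b) for b in bits]

    def __and__(self, other):
        return Mask(a and b for a, b in zip(self.bits, other.bits))

    def __or__(self, other):
        return Mask(a or b for a, b in zip(self.bits, other.bits))

    def __invert__(self):
        return Mask(not b for b in self.bits)

    def __bool__(self):
        # "and", "or" and "not" end up here; there is no element-wise hook.
        raise TypeError("use &, | or ~ for element-wise logic")


x_positive = Mask([True, False, True])
y_small = Mask([True, True, False])
both = x_positive & y_small
negated = ~x_positive

try:
    x_positive and y_small
    and_failed = False
except TypeError:
    and_failed = True
```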
base: CTable | None

Parent table when this instance is a row-filter or column-projection view (created by where(), select(), or view()). None for top-level tables. Structural mutations such as add_column() and drop_column() are blocked on views.

property cbytes: int

Total compressed size in bytes (all columns + valid_rows mask).

col_names: list[str]

Ordered list of stored column names. Computed columns are not included; access those via computed_columns.

property computed_columns: dict[str, dict]

Read-only view of the computed-column definitions.

Each value is a dict with keys expression, col_deps, lazy (blosc2.LazyExpr), and dtype.

property cratio: float

Compression ratio for the whole table payload.

property indexes: list[Index]

Return a list of blosc2.Index handles for all active indexes.

property info: _CTableInfoReporter

Get information about this table.

Examples

>>> print(t.info)
>>> t.info()
property info_items: list[tuple[str, object]]

Structured summary items used by info().

property nbytes: int

Total uncompressed size in bytes (all columns + valid_rows mask).

property ncols: int

Total number of columns, including computed (virtual) columns.

property schema: CompiledSchema

The compiled schema that drives this table’s columns and validation.

Construction

CTable.__init__(row_type[, new_data, ...])

CTable.open(urlpath, *[, mode])

Open a persistent CTable from urlpath.

CTable.load(urlpath)

Load a persistent table from urlpath into RAM.

CTable.from_arrow(schema, batches, *[, ...])

Build a CTable from an Arrow schema and iterable of record batches.

CTable.from_parquet(path, *[, columns, ...])

Read a Parquet file into a CTable.

CTable.from_csv(path, row_cls, *[, header, sep])

Build a CTable from a CSV file.

CTable.__init__(row_type: type[RowT], new_data=None, *, urlpath: str | None = None, mode: str = 'a', expected_size: int | None = None, compact: bool = False, validate: bool = True, cparams: dict[str, Any] | None = None, dparams: dict[str, Any] | None = None) None[source]
classmethod CTable.open(urlpath: str, *, mode: str = 'r') CTable[source]

Open a persistent CTable from urlpath.

Parameters:
  • urlpath – Path to the table root directory (created by passing urlpath to CTable).

  • mode – 'r' (default) opens the table read-only; 'a' opens it read/write.

Raises:
  • FileNotFoundError – If urlpath does not contain a CTable.

  • ValueError – If the metadata at urlpath does not identify a CTable.

classmethod CTable.load(urlpath: str) CTable[source]

Load a persistent table from urlpath into RAM.

The schema is read from the table’s metadata — the original Python dataclass is not required. The returned table is fully in-memory and read/write.

Parameters:

urlpath – Path to the table root directory.

Raises:
  • FileNotFoundError – If urlpath does not contain a CTable.

  • ValueError – If the metadata at urlpath does not identify a CTable.

classmethod CTable.from_arrow(schema, batches, *, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, capacity_hint: int | None = None, string_max_length: int | Mapping[str, int] | None = None, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, list_serializer: Literal['msgpack', 'arrow'] = 'msgpack', object_fallback: bool = False, column_cparams: Mapping[str, dict[str, Any]] | None = None, separate_nested_cols: bool = False) CTable[source]

Build a CTable from an Arrow schema and iterable of record batches.

Nested struct flattening: top-level Arrow struct<…> fields are automatically and recursively flattened into dotted leaf columns. For example, a field trip: struct<begin: struct<lon: float64, lat: float64>> becomes two CTable columns trip.begin.lon and trip.begin.lat. Each leaf is stored as an independent compressed NDArray. Row reads via t[i] reconstruct the original nested dict shape. Use t["trip.begin.lon"] or t.trip.begin.lon to access a leaf:

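import pyarrow as pa, blosc2
trip_type = pa.struct([("begin", pa.struct([("lon", pa.float64())]))])
schema = pa.schema([pa.field("trip", trip_type)])
# One record batch with two rows of nested trip data
batches = [pa.RecordBatch.from_pylist(
    [{"trip": {"begin": {"lon": -87.6}}}, {"trip": {"begin": {"lon": -87.5}}}],
    schema=schema,
)]
t = blosc2.CTable.from_arrow(schema, batches)
t.col_names          # ['trip.begin.lon']
t["trip.begin.lon"].mean()
t.trip.begin.lon.max()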
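Reconstructing the nested dict shape from dotted leaf names on a row read is the inverse of flattening. A minimal sketch, assuming a row is available as a flat dict keyed by leaf name; unflatten is a hypothetical helper, not a blosc2 function.

```python
def unflatten(flat_row):
    """Rebuild a nested dict from a flat dict keyed by dotted leaf names."""
    nested = {}
    for dotted, value in flat_row.items():
        *parents, leaf = dotted.split(".")
        node = nested
        for part in parents:
            # Create intermediate struct levels on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


row = unflatten({"trip.begin.lon": -87.6, "trip.begin.lat": 41.8, "payment.fare": 12.5})
```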

When string_max_length is None (the default), scalar Arrow string / large_string columns are imported as vlstring() columns and binary / large_binary columns are imported as vlbytes() columns. Struct columns that do not contain only scalar leaves are imported as struct() columns backed by batched variable-length storage. Null values for these variable-length scalar columns are represented as native None with no sentinel needed.

When string_max_length is set to a positive integer, scalar string and binary columns are imported as fixed-width string() / bytes() columns whose dtype is sized to string_max_length characters/bytes. It may also be a mapping from column name to max length; omitted string/binary columns remain vlstring() / vlbytes() columns.
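The three accepted forms of string_max_length (None, a single integer, or a per-column mapping) can be summarised as a small decision function. The function name and the (kind, max_length) tuples it returns are hypothetical and used only to label the outcome.

```python
def string_column_kind(col_name, string_max_length):
    """Return (kind, max_length) for an imported scalar string column."""
    if string_max_length is None:
        return ("vlstring", None)
    if isinstance(string_max_length, int):
        return ("string", string_max_length)
    if col_name in string_max_length:
        # Mapping form: only the listed columns become fixed-width.
        return ("string", string_max_length[col_name])
    return ("vlstring", None)


default_kind = string_column_kind("country", None)
fixed_kind = string_column_kind("country", 8)
mapped_kind = string_column_kind("country", {"country": 2})
omitted_kind = string_column_kind("comment", {"country": 2})
```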

blosc2_batch_size controls how many rows are buffered before BatchArray-backed imported columns (list columns and variable-length scalar columns such as vlstring, vlbytes, struct, and schema-less object columns) are flushed to their backend. Set it to None to keep those columns pending until the final flush.

list_serializer selects the backend serializer for imported list columns. "msgpack" is the default; "arrow" stores Arrow list batches directly and can be much faster for deeply nested list columns.

Unsupported Arrow types raise by default. Pass object_fallback=True to import such columns as schema-less object() columns. This fallback is intentionally not used by from_parquet().

column_cparams optionally maps column names to per-column compression parameters. These override the table-level cparams for fixed-width columns imported from Arrow.

classmethod CTable.from_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, list_serializer: Literal['msgpack', 'arrow'] = 'arrow', separate_nested_cols: bool = True, max_rows: int | None = None, **kwargs) CTable[source]

Read a Parquet file into a CTable.

The Parquet file is streamed batch by batch through pyarrow and then converted into a typed CTable. By default, the result is created in memory, but you can also persist it on disk via urlpath.

This method delegates the actual table construction to CTable.from_arrow(), so Arrow schema handling, nullable-column support, and Blosc2 write tuning follow the same rules as that method.

Nested struct flattening: top-level Parquet struct<…> fields are automatically and recursively flattened into dotted leaf columns — the same as in from_arrow(). For example, a Parquet file that contains a column trip: struct<begin: struct<lon: double, lat: double>> produces two CTable columns trip.begin.lon and trip.begin.lat. Row reads reconstruct the original nested dict shape; individual leaves are accessed via dotted names or attribute-chain proxies:

t = blosc2.CTable.from_parquet("trips.parquet")
t.col_names               # e.g. ['trip.begin.lon', 'trip.begin.lat', ...]
t["trip.begin.lon"].mean()
t.trip.begin.lon.max()

Unsupported Parquet types are not silently imported as schema-less object() columns; they raise so callers can decide how to handle them explicitly.

Parameters:
  • path (str or path-like) – Path to the source Parquet file.

  • columns (list[str] or None, optional) – Subset of columns to read from the Parquet file. If provided, only these columns are loaded and their order in the resulting table matches the order in this list. Column names must be unique.

  • batch_size (int, optional) – Number of rows per Arrow batch read from the Parquet file. This controls how much data is pulled from the file at a time before being handed off to the CTable builder. Must be greater than 0.

  • urlpath (str or None, optional) – Destination storage path for the resulting CTable. If None (the default), the table is created in memory. If provided, the table is backed by persistent on-disk storage.

  • mode (str, optional) – Storage open mode for urlpath. Defaults to "w". This is passed through to CTable.from_arrow().

  • cparams (object, optional) – Compression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().

  • dparams (object, optional) – Decompression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().

  • validate (bool, optional) – Whether to enable extra internal validation while building the table. Defaults to False.

  • auto_null_sentinels (bool, optional) – If True (default), nullable scalar columns imported from Parquet may automatically receive per-column null sentinel values when needed. Sentinel selection follows the current null-policy rules used by CTable schema handling.

  • blosc2_batch_size (int or None, optional) – Number of items written to Blosc2 containers per internal write batch. Passed through to CTable.from_arrow().

  • blosc2_items_per_block (int or None, optional) – Target number of items per internal Blosc2 block. Passed through to CTable.from_arrow(). In general, a larger number of items per block favors compression ratio but makes random access slower.

  • list_serializer ({"msgpack", "arrow"}, optional) – Serializer used for imported list columns. The default, "arrow", stores Arrow list batches directly and is much faster for deeply nested or list<struct<...>> columns. The tradeoff is that accessing those list columns later requires PyArrow. Use "msgpack" to keep list-column stores independent of PyArrow at read time; it can be smaller for simple lists but is much slower and more memory-intensive for deeply nested data.

  • separate_nested_cols (bool, optional) – Whether to separate qualifying nested columns during import. Defaults to True. In particular, a single unnamed top-level list<struct<...>> field is treated as a root record stream: each list element becomes a CTable row and struct leaves become ordinary nested CTable columns. Use separate_nested_cols=False when closer fidelity to the original Parquet row/schema shape is more important than the separated column layout.

  • max_rows (int or None, optional) – Maximum number of rows to import. For ordinary Parquet files this limits Parquet/CTable rows. For unnamed-root list<struct<...>> files imported with separate_nested_cols=True, this limits flattened element rows.

  • **kwargs – Additional keyword arguments forwarded to pyarrow.parquet.ParquetFile. Use these for Parquet-reader-specific options supported by PyArrow.

Returns:

A new CTable populated from the Parquet file. The table contains all selected columns and all rows from the file, or at most max_rows rows when max_rows is given. If urlpath is provided, the returned table is disk-backed; otherwise it is in-memory.

Return type:

CTable

Raises:
  • ImportError – If pyarrow is not installed.

  • ValueError – If batch_size is not greater than 0.

  • ValueError – If max_rows is negative.

  • ValueError – If columns contains duplicate names.

  • Exception – Any exception raised by pyarrow while opening or reading the Parquet file, or by CTable.from_arrow() while converting Arrow data into a CTable.

Examples

Load an entire Parquet file into an in-memory table:

>>> import blosc2
>>> t = blosc2.CTable.from_parquet("data.parquet")

Load only a subset of columns:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     columns=["user_id", "amount", "country"],
... )

Create a disk-backed table while reading in batches:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     batch_size=50_000,
...     urlpath="data.ctable",
... )

Pass additional options through to PyArrow’s Parquet reader:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     memory_map=True,
... )
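The root record stream behaviour described for separate_nested_cols and max_rows can be pictured as exploding each Parquet row's list into individual table rows, then truncating. This is a plain-Python model on nested lists, not the importer itself; explode_root_list is a hypothetical name.

```python
from itertools import chain, islice


def explode_root_list(parquet_rows, max_rows=None):
    """Turn rows holding a list of structs into one row per list element."""
    if max_rows is not None and max_rows < 0:
        raise ValueError("max_rows must not be negative")
    elements = chain.from_iterable(parquet_rows)
    # In this mode max_rows counts flattened elements, not Parquet rows.
    return list(islice(elements, max_rows))


parquet_rows = [[{"id": 1}, {"id": 2}], [{"id": 3}]]
all_rows = explode_root_list(parquet_rows)
limited = explode_root_list(parquet_rows, max_rows=2)
```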
classmethod CTable.from_csv(path: str, row_cls, *, header: bool = True, sep: str = ',') CTable[source]

Build a CTable from a CSV file.

Schema comes from row_cls (a dataclass) — CTable is always typed. All rows are read in a single pass into per-column Python lists, then each column is bulk-written into a pre-allocated NDArray (one slice assignment per column, no extend()).

Parameters:
  • path – Source CSV file path.

  • row_cls – A dataclass whose fields define the column names and types.

  • header – If True (default), the first row is treated as a header and skipped. Column order in the file must match row_cls field order regardless.

  • sep – Field delimiter. Defaults to ","; use "\t" for TSV.

Returns:

A new in-memory CTable containing all rows from the CSV file.

Return type:

CTable

Raises:
  • TypeError – If row_cls is not a dataclass.

  • ValueError – If a row has a different number of fields than the schema.
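The single-pass read into per-column lists can be sketched with the stdlib csv module and a plain dataclass. The sketch assumes simple int / float / str annotations that can be called directly on a cell; a real row class would use blosc2.field specs, and read_columns is a hypothetical helper.

```python
import csv
import io
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Row:
    id: int
    score: float
    name: str


def read_columns(fh, row_cls, *, header=True, sep=","):
    """Read a CSV stream into one Python list per dataclass field."""
    if not is_dataclass(row_cls):
        raise TypeError("row_cls must be a dataclass")
    flds = fields(row_cls)
    columns = {f.name: [] for f in flds}
    reader = csv.reader(fh, delimiter=sep)
    if header:
        next(reader, None)  # skip the header; file order must match field order
    for record in reader:
        if len(record) != len(flds):
            raise ValueError("row has a different number of fields than the schema")
        for f, cell in zip(flds, record):
            columns[f.name].append(f.type(cell))
    return columns


columns = read_columns(io.StringIO("id,score,name\n1,2.5,a\n2,3.0,b\n"), Row)
```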

Parquet interoperability

Parquet import/export is intended as logical data interchange between Parquet and Blosc2 CTable, not as exact preservation of Parquet’s physical layout. For example, Parquet files whose top-level schema is an unnamed list<struct<...>> may be imported as a regular CTable whose rows are the list elements and whose nested scalar fields are exposed as ordinary dotted columns. Exporting such a table writes a valid logical Parquet table, but does not attempt to reconstruct the original unnamed root-list grouping, row groups, encoding choices, or file metadata exactly.

Null policy

Nullable scalar CTable columns are represented with per-column sentinel values, not native validity bitmaps. When CTable has to infer those sentinels, the selection can be customized with NullPolicy and scoped with null_policy():

policy = blosc2.NullPolicy(
    signed_int_strategy="max",
    string_value="<NULL>",
    column_null_values={"user_id": -1, "country": "NA"},
)

with blosc2.null_policy(policy):
    table = blosc2.CTable.from_parquet("data.parquet")

The same policy is used by explicit nullable schema specs when no null_value is supplied:

from dataclasses import dataclass

@dataclass
class Row:
    user_id: int = blosc2.field(blosc2.int64(nullable=True))
    country: str = blosc2.field(blosc2.string(nullable=True))

with blosc2.null_policy(policy):
    table = blosc2.CTable(Row)

Sentinels are resolved in this order: explicit null_value in the schema, NullPolicy.column_null_values for a matching column, then the type-wide NullPolicy default. Columns without nullable=True or an explicit null_value are not nullable.
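The resolution order can be modelled with a small standalone function. This is a sketch of the precedence rule only: resolve_sentinel and its parameters are hypothetical names, not part of the blosc2 API, and the type-wide default is passed in by the caller instead of being derived from a dtype.

```python
_MISSING = object()  # marks "no explicit null_value given in the schema"


def resolve_sentinel(column, *, explicit_null_value=_MISSING,
                     column_null_values=None, type_default=None):
    """Pick a sentinel: schema value, then per-column override, then type default."""
    if explicit_null_value is not _MISSING:
        return explicit_null_value
    if column_null_values and column in column_null_values:
        return column_null_values[column]
    return type_default


overrides = {"user_id": -1, "country": "NA"}
int64_min = -2**63  # assumed type-wide default for this illustration

from_schema = resolve_sentinel("user_id", explicit_null_value=0,
                               column_null_values=overrides, type_default=int64_min)
from_override = resolve_sentinel("user_id", column_null_values=overrides,
                                 type_default=int64_min)
from_type_default = resolve_sentinel("age", column_null_values=overrides,
                                     type_default=int64_min)
```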

NullPolicy(string_value, bytes_value, ...)

Default sentinels for inferred CTable scalar nulls.

null_policy(policy)

Temporarily set the default policy for CTable null sentinel inference.

get_null_policy()

Return the current default null policy.

class blosc2.NullPolicy(string_value: str = '__BLOSC2_NULL__', bytes_value: bytes = b'__BLOSC2_NULL__', float_value: float = nan, bool_value: int = 255, signed_int_strategy: Literal['min', 'max'] = 'min', unsigned_int_strategy: Literal['min', 'max'] = 'max', timestamp_value: int = -9223372036854775808, column_null_values: Mapping[str, Any] = <factory>)

Default sentinels for inferred CTable scalar nulls.

CTable nullable scalar columns are represented with per-column sentinel values. This policy is used when CTable has to infer those sentinels, such as when importing nullable scalar Arrow or Parquet columns without an explicit column-level null sentinel. The selected sentinel is stored in the resulting CTable schema, so existing tables remain self-describing.

Examples

Use blosc2.null_policy() to apply a policy while creating a CTable from data with nullable scalar columns:

policy = blosc2.NullPolicy(
    signed_int_strategy="max",
    string_value="<NULL>",
    column_null_values={"user_id": -1, "country": "NA"},
)

with blosc2.null_policy(policy):
    table = blosc2.CTable.from_parquet("data.parquet")

The same policy is used for explicit nullable schema specs:

@dataclass
class Row:
    user_id: int = blosc2.field(blosc2.int64(nullable=True))
    country: str = blosc2.field(blosc2.string(nullable=True))

with blosc2.null_policy(policy):
    table = blosc2.CTable(Row)

column_null_values takes precedence over the type-wide defaults in the policy. This is useful when a particular column needs a sentinel that is known not to collide with its real values.

Methods

sentinel_for_arrow_type(pa, pa_type)

Return the default sentinel for pa_type, or None if unsupported.

blosc2.null_policy(policy: NullPolicy)

Temporarily set the default policy for CTable null sentinel inference.

blosc2.get_null_policy() -> NullPolicy

Return the current default null policy.

Attributes

CTable.col_names

Ordered list of stored column names.

CTable.computed_columns

Read-only view of the computed-column definitions.

CTable.nrows

CTable.ncols

Total number of columns, including computed (virtual) columns.

CTable.cbytes

Total compressed size in bytes (all columns + valid_rows mask).

CTable.nbytes

Total uncompressed size in bytes (all columns + valid_rows mask).

CTable.schema

The compiled schema that drives this table's columns and validation.

CTable.base

Parent table when this instance is a row-filter or column-projection view (created by where(), select(), or view()).

CTable.col_names: list[str]

Ordered list of stored column names. Computed columns are not included; access those via computed_columns.

property CTable.computed_columns: dict[str, dict]

Read-only view of the computed-column definitions.

Each value is a dict with keys expression, col_deps, lazy (blosc2.LazyExpr), and dtype.

property CTable.nrows: int

property CTable.ncols: int

Total number of columns, including computed (virtual) columns.

property CTable.cbytes: int

Total compressed size in bytes (all columns + valid_rows mask).

property CTable.nbytes: int

Total uncompressed size in bytes (all columns + valid_rows mask).

property CTable.schema: CompiledSchema

The compiled schema that drives this table’s columns and validation.

CTable.base: CTable | None

Parent table when this instance is a row-filter or column-projection view (created by where(), select(), or view()). None for top-level tables. Structural mutations such as add_column() and drop_column() are blocked on views.

Inserting data

CTable.append(data)

Append a single row to the table.

CTable.extend(data, *[, validate])

Append multiple rows at once.

CTable.append(data: list | void | ndarray) -> None

Append a single row to the table.

data may be a list, tuple, numpy.void, or structured numpy.ndarray whose fields match the schema column order. Materialized columns whose values are omitted are auto-filled from their recorded expression. Raises ValueError if the table is read-only or a view.

For tables with nested (dotted) column names the row dict may be supplied either as a flat mapping of dotted keys or as a nested dict that mirrors the original struct shape — both are accepted and automatically flattened to the physical dotted leaf names:

# flat dotted keys
t.append({"trip.begin.lon": -87.6, "trip.begin.lat": 41.8,
          "payment.fare": 12.5})

# original nested dict (auto-flattened)
t.append({"trip": {"begin": {"lon": -87.6, "lat": 41.8}},
          "payment": {"fare": 12.5}})

CTable.extend(data: list | CTable | Any, *, validate: bool | None = None) -> None

Append multiple rows at once.

data may be:

  • a dict of arrays {"col": array, ...} — all arrays must have the same length; omitted columns are filled from their declared default; columns with no default declared must be provided;

  • a list of rows, each compatible with append();

  • another CTable — columns are matched by name.

Pass validate=False to skip per-row Pydantic validation on trusted bulk imports. Raises ValueError if the table is read-only or a view.

For tables with nested (dotted) column names both the dict-of-arrays and list-of-dicts forms accept the original nested dict shape and auto-flatten it to physical dotted leaf names:

# nested dict of arrays
t.extend({
    "trip": {"begin": {"lon": lons, "lat": lats}},
    "payment": {"fare": fares},
})

# list of nested dicts
t.extend([
    {"trip": {"begin": {"lon": -87.6, "lat": 41.8}}, "payment": {"fare": 12.5}},
    {"trip": {"begin": {"lon": -87.5, "lat": 41.7}}, "payment": {"fare": 8.0}},
])
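The auto-flattening of nested dicts to dotted leaf names, used by both append() and extend(), can be illustrated with a short recursive helper. flatten_row is a hypothetical name for this sketch; the library's own flattening code may differ in details such as validation.

```python
def flatten_row(row, prefix=""):
    """Flatten a nested dict into a flat dict keyed by dotted leaf names."""
    flat = {}
    for key, value in row.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_row(value, name))  # descend into the struct
        else:
            flat[name] = value  # leaf value (scalar or array)
    return flat


nested = {"trip": {"begin": {"lon": -87.6, "lat": 41.8}}, "payment": {"fare": 12.5}}
dotted = {"trip.begin.lon": -87.6, "trip.begin.lat": 41.8, "payment.fare": 12.5}

flat_from_nested = flatten_row(nested)
flat_from_dotted = flatten_row(dotted)  # already-flat input passes through unchanged
```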

Querying

Boolean expressions

Use bitwise operators (&, |, ~) or string expressions for row-wise boolean logic. Python’s logical operators and, or and not cannot be overloaded and therefore do not build lazy column expressions.

Use column expressions with explicit parentheses around comparisons:

t.where((t.amount > 100) & (t.region == "North"))
t.where(~t.returned)

or use string expressions when that reads better:

t.where("amount > 100 and region == 'North'")
t.where("not returned")
t["not returned"]

The three forms for negating a boolean column are equivalent: t.where(~t.returned), t.where("not returned"), and t["not returned"].
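The reason bitwise operators are required can be seen in a toy expression class. Python lets a class define &, | and ~ through __and__, __or__ and __invert__, whereas and, or and not only ask the operand for its truth value. The Expr class below is hypothetical and is not how blosc2 builds lazy expressions; raising in __bool__ is a choice made here to make the mistake visible.

```python
class Expr:
    """Toy lazy expression that records its own text."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        return Expr(f"({self.text} & {other.text})")

    def __or__(self, other):
        return Expr(f"({self.text} | {other.text})")

    def __invert__(self):
        return Expr(f"~{self.text}")

    def __bool__(self):
        # `and`, `or` and `not` call this; they cannot return a new Expr.
        raise TypeError("use &, | or ~ to combine lazy expressions")


def fails_with_type_error(thunk):
    """Return True if calling thunk raises TypeError."""
    try:
        thunk()
    except TypeError:
        return True
    return False


a = Expr("(x > 0)")
b = Expr("(y < 10)")
combined = a & b
negated = ~a
and_fails = fails_with_type_error(lambda: a and b)
not_fails = fails_with_type_error(lambda: not a)
```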

Indexing & projection

CTable indexing is type-driven:

t["amount"]                 # column access
t[3]                        # one row as a namedtuple-like object
t[3:8]                      # row view
t[[1, 4, 7]]                # gathered-row view
t[mask]                     # filtered row view
t[t.amount > 100]           # LazyExpr filtered row view, like where()
t[["region", "amount"]]     # projected column view

String keys first try exact column-name lookup. If the string is not a column name, it is interpreted as a boolean expression and behaves like CTable.where(). Boolean LazyExpr and boolean NDArray keys also behave like CTable.where(), so computed column predicates such as t[t.temperature_f > 70] are supported.
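The type-driven dispatch can be summarised with a small classifier. This is a simplified sketch under stated assumptions: classify_key is a hypothetical helper, plain Python lists stand in for NumPy and Blosc2 arrays, and LazyExpr keys are left out.

```python
def classify_key(key, column_names):
    """Describe how an indexing key would be interpreted in this sketch."""
    if isinstance(key, str):
        # Exact column names win; any other string is treated as an expression.
        return "column" if key in column_names else "expression filter"
    if isinstance(key, bool):
        raise TypeError("a single bool is not a valid key")
    if isinstance(key, int):
        return "row"
    if isinstance(key, slice):
        return "row view"
    if isinstance(key, (list, tuple)) and key:
        if all(isinstance(k, str) for k in key):
            return "column projection"
        if all(isinstance(k, bool) for k in key):  # check bool before int
            return "filtered row view"
        if all(isinstance(k, int) for k in key):
            return "gathered-row view"
    raise TypeError(f"unsupported key: {key!r}")


names = {"amount", "region"}
kinds = [
    classify_key("amount", names),
    classify_key("amount > 100", names),
    classify_key(3, names),
    classify_key(slice(3, 8), names),
    classify_key([1, 4, 7], names),
    classify_key([True, False, True], names),
    classify_key(["region", "amount"], names),
]
```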

For explicit filtered projection, use:

t.where("amount > 100", columns=["region", "amount"])

When a NumPy structured array is needed, materialize explicitly:

np.asarray(t[:10])

CTable.where(expr_result, *[, columns])

Return a row-filtered view matching a boolean predicate.

CTable.view(new_valid_rows)

Return a row-filter view backed by a boolean mask array without copying data.

CTable.select(cols)

Return a column-projection view exposing only cols.

CTable.head([N])

Return a view of the first N live rows (default 5).

CTable.tail([N])

Return a view of the last N live rows (default 5).

CTable.sample(n, *[, seed])

Return a read-only view of n randomly chosen live rows.

CTable.sort_by(cols[, ascending, inplace])

Return a copy of the table sorted by one or more columns.

CTable.iter_sorted(cols[, ascending, start, ...])

Iterate rows in sorted order without materializing a full copy.

CTable.group_by(keys, *[, sort, dropna, ...])

Return a deferred group-by object for this table.

CTable.where(expr_result: str | ndarray | NDArray | LazyExpr | Column, *, columns: list[str] | tuple[str, ...] | None = None) -> CTable

Return a row-filtered view matching a boolean predicate.

Signature:

where(expr_result, *, columns=None) -> CTable

The predicate can be supplied as a boolean blosc2.LazyExpr, a boolean blosc2.NDArray, a boolean NumPy array, a boolean Column, or a string expression evaluated against this table’s columns. String expressions can reference stored and computed columns directly by name.

The returned object is a CTable view sharing the original column data. The row-selection mask is evaluated immediately and intersected with the table’s current live rows; selected column data is not copied.

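Parameters:
  • expr_result – Boolean predicate selecting rows. Strings are converted to a lazy expression with table columns as operands, e.g. "value * category >= 150". Column objects can also be used in Python expressions, e.g. (t.value * t.category) >= 150.

  • columns – Optional ordered list or tuple of column names. When provided, the returned view is also projected to that subset of columns.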

Returns:

A view over the same columns containing only rows where the predicate is true and the source row is live. When columns is provided, the returned view is additionally projected to that ordered subset of columns.

Return type:

CTable

Raises:

TypeError – If expr_result does not evaluate to a boolean Blosc2/NumPy array or lazy expression.
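The view semantics described above (a mask evaluated immediately, intersected with the live rows, and no column data copied) can be modelled in a few lines. MaskView is a hypothetical class for illustration; Python lists stand in for the compressed column containers and the valid_rows mask.

```python
class MaskView:
    """Toy table view: shared column data plus a boolean live-row mask."""

    def __init__(self, columns, valid):
        self.columns = columns  # shared reference, never copied
        self.valid = valid      # one bool per physical row

    @property
    def nrows(self):
        return sum(self.valid)

    def where(self, predicate):
        # Intersect the predicate with the rows that are currently live.
        combined = [live and bool(hit) for live, hit in zip(self.valid, predicate)]
        return MaskView(self.columns, combined)

    def column(self, name):
        return [v for v, ok in zip(self.columns[name], self.valid) if ok]


base = MaskView({"x": [1, 2, 3, 4]}, [True, False, True, True])  # row 1 is deleted
view = base.where([x > 1 for x in base.columns["x"]])
```

Here the predicate is true for the deleted row as well, but the intersection keeps it out of the view.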

Examples

Filter using a string expression:

view = t.where("value * category >= 150")
slim = t.where("value * category >= 150", columns=["value", "category"])

Filter using column arithmetic:

view = t.where((t.value * t.category) >= 150)

Blosc2 lazy functions can be used in column expressions:

view = t.where(((t.value + 2) * blosc2.sin(t.category)) >= 10)

For column names that are not valid Python identifiers, use item access:

view = t.where((t["unit price"] * t["quantity"]) > 100)

For tables with nested (dotted) column names, dotted leaf names and attribute-chain proxies work in both string and expression forms:

view = t.where("trip.begin.lon > -87.7 and payment.fare > 10")
view = t.where(t.trip.begin.lon > -87.7)

Notes

Use bitwise operators (&, |, ~) or string expressions for element-wise boolean logic. Python’s logical operators and, or and not cannot be overloaded and therefore do not build lazy column expressions.

Use:

t.where((t.x > 0) & (t.y < 10))
t.where(~t.returned)
t.where("not returned")

not:

t.where((t.x > 0) and (t.y < 10))
t.where(not t.returned)

CTable.view(new_valid_rows)

Return a row-filter view backed by a boolean mask array without copying data.

CTable.select(cols: list[str]) -> CTable

Return a column-projection view exposing only cols.

The returned object shares the underlying NDArrays with this table (no data is copied). Row filtering and value writes work as usual; structural mutations (add/drop/rename column, append, …) are blocked.

Parameters:

cols

Ordered list of column names to keep. For tables with nested (dotted) column names, a struct-prefix name automatically expands to all descendant leaves:

t.select(["trip.begin"])    # expands to trip.begin.lon, trip.begin.lat
t.select(["trip"])          # expands to all trip.* leaves

Raises:
  • KeyError – If any name in cols is not a column of this table (and does not match any struct prefix).

  • ValueError – If cols is empty.
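The struct-prefix expansion and the two error cases can be sketched as follows. expand_columns is a hypothetical helper, not the library's code; it assumes the table's leaf names are available as an ordered list.

```python
def expand_columns(cols, leaf_names):
    """Expand exact names and struct prefixes into an ordered list of leaves."""
    if not cols:
        raise ValueError("cols must not be empty")
    selected = []
    for name in cols:
        if name in leaf_names:
            matches = [name]
        else:
            # A struct prefix matches every leaf below it ("trip" -> "trip.*").
            matches = [leaf for leaf in leaf_names if leaf.startswith(name + ".")]
        if not matches:
            raise KeyError(name)
        selected.extend(m for m in matches if m not in selected)
    return selected


leaves = ["trip.begin.lon", "trip.begin.lat", "trip.end.lon", "payment.fare"]
begin_only = expand_columns(["trip.begin"], leaves)
whole_trip = expand_columns(["trip"], leaves)
```

Matching on `name + "."` avoids treating a name such as "trip.beginning" as a child of "trip.begin".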

CTable.head(N: int = 5) -> CTable

Return a view of the first N live rows (default 5).

CTable.tail(N: int = 5) -> CTable

Return a view of the last N live rows (default 5).

CTable.sample(n: int, *, seed: int | None = None) -> CTable

Return a read-only view of n randomly chosen live rows.

Parameters:
  • n – Number of rows to sample. If n >= number of live rows, returns a view of the whole table.

  • seed – Optional random seed for reproducibility.

Returns:

A read-only view sharing columns with this table.

Return type:

CTable

CTable.sort_by(cols: str | list[str], ascending: bool | list[bool] = True, *, inplace: bool = False) -> CTable

Return a copy of the table sorted by one or more columns.

Parameters:
  • cols

    Column name or list of column names to sort by. When multiple columns are given, the first is the primary key, the second is the tiebreaker, and so on. For tables with nested (dotted) column names, pass the dotted leaf name directly:

    t.sort_by("trip.begin.lon")
    t.sort_by(["trip.begin.lon", "payment.fare"], ascending=[True, False])
    

  • ascending – Sort direction. A single bool applies to all keys; a list must have the same length as cols.

  • inplace – If True, rewrite the physical data in place and return self (like compact() but sorted). If False (default), return a new in-memory CTable leaving this one untouched.

Raises:
  • ValueError – If called on a view or a read-only table when inplace=True.

  • KeyError – If any column name is not found.

  • TypeError – If a column used as a sort key does not support ordering (e.g. complex numbers).

CTable.iter_sorted(cols: str | list[str], ascending: bool | list[bool] = True, *, start: int | None = None, stop: int | None = None, step: int | None = None, batch_size: int = 4096)

Iterate rows in sorted order without materializing a full copy.

Uses a FULL index when available (no sort needed); otherwise falls back to np.lexsort on live physical positions. Yields namedtuple-like row objects in the same way as __iter__.

The sorted positions array is stored as a compressed blosc2.NDArray to keep RAM usage low for large tables. batch_size positions are decompressed at a time during iteration.

Parameters:
  • cols – Column name or list of column names to sort by.

  • ascending – Sort direction. A single bool applies to all keys; a list must have the same length as cols.

  • start, stop, step – Optional slice bounds applied to the sorted sequence before iteration. For example, stop=10 yields only the first 10 rows in sorted order, and step=2 yields every other row in sorted order.

  • batch_size – Number of positions decompressed per iteration step. Larger values reduce decompression overhead; smaller values use less transient RAM. Default is 4096.
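How the sort keys, the per-key ascending flags, and the start/stop/step slice interact can be shown with a pure-Python model. sorted_positions is a hypothetical helper: it uses repeated stable sorts instead of np.lexsort or an index, and plain lists instead of compressed position arrays.

```python
def sorted_positions(columns, keys, ascending=True, *, start=None, stop=None, step=None):
    """Return row positions in sorted order, then apply an optional slice."""
    if isinstance(keys, str):
        keys = [keys]
    if isinstance(ascending, bool):
        ascending = [ascending] * len(keys)
    if len(ascending) != len(keys):
        raise ValueError("ascending must have one entry per sort key")
    positions = list(range(len(columns[keys[0]])))
    # Stable sorts applied from the last key to the first make the first key primary.
    for key, asc in reversed(list(zip(keys, ascending))):
        positions.sort(key=lambda i: columns[key][i], reverse=not asc)
    return positions[slice(start, stop, step)]


data = {"city": ["b", "a", "b", "a"], "sales": [10, 5, 30, 20]}
full_order = sorted_positions(data, ["city", "sales"], [True, False])
first_two = sorted_positions(data, ["city", "sales"], [True, False], stop=2)
every_other = sorted_positions(data, ["city", "sales"], [True, False], step=2)
```

The slice is applied to the sorted sequence of positions, so stop=2 selects the first two rows in sorted order, not the first two physical rows.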

CTable.group_by(keys: str | Sequence[str], *, sort: bool = False, dropna: bool = True, engine: str = 'auto', chunk_size: int | None = None)

Return a deferred group-by object for this table.

Parameters:
  • keys – Column name or sequence of column names to group by.

  • sort – If True, sort the result by the group keys. The default False preserves the hash aggregation order and is usually faster.

  • dropna – If True (default), rows with null/NaN group keys are skipped. If False, null/NaN keys form their own group.

  • engine – Execution engine. Phase 1 accepts "auto" and uses the NumPy chunked implementation.

  • chunk_size – Optional number of physical rows processed per chunk.

Returns:

A lightweight deferred operation builder. Call methods such as .size(), .count(column) or .agg({...}) to materialize a grouped result as a new CTable.

Return type:

CTableGroupBy

Group-by reductions

CTable.group_by() returns a lightweight deferred group-by object. It is not a table view; methods such as size(), count(), sum(), argmax(), and agg() materialize a new CTable with one row per group:

by_city = t.group_by("city", sort=True)
counts = by_city.size()                  # row count per city / COUNT(*)
non_null = by_city.count("sales")        # non-null sales count / COUNT(sales)
totals = by_city.sum("sales")            # equivalent to agg({"sales": "sum"})
means = by_city.mean("sales")
mins = by_city.min("sales")
maxs = by_city.max("sales")
min_rows = by_city.argmin("sales")       # logical row position of min sales
max_rows = by_city.argmax("sales")       # logical row position of max sales

Grouped results are in-memory by default. Pass urlpath= to a terminal method to write the result as a persistent CTable:

totals = by_city.sum("sales", urlpath="sales_by_city.b2d")

For array-oriented grouped reductions without a CTable, see blosc2.group_reduce().

class blosc2.CTableGroupBy(table: CTable, keys: str | Sequence[str], *, sort: bool = False, dropna: bool = True, engine: str = 'auto', chunk_size: int | None = None)

Deferred group-by operation returned by CTable.group_by().

The object stores the source table, grouping keys, and execution options. It is not a CTable view and does not materialize grouped data until a terminal method such as size(), count(), or agg() is called.

Methods

agg(aggregations, *[, urlpath])

Aggregate value columns per group.

argmax(column, *[, urlpath])

Return logical row positions of maximum non-null column values per group.

argmin(column, *[, urlpath])

Return logical row positions of minimum non-null column values per group.

count(column, *[, urlpath])

Return non-null value counts for column per group.

max(column, *[, urlpath])

Return maximum values of column per group.

mean(column, *[, urlpath])

Return means of column per group.

min(column, *[, urlpath])

Return minimum values of column per group.

size(*[, urlpath])

Return row counts per group as a new CTable.

sum(column, *[, urlpath])

Return sums of column per group.

agg(aggregations: Mapping[str, str | Sequence[str]], *, urlpath: str | None = None)

Aggregate value columns per group.

Parameters:

aggregations – Mapping from input column name to an aggregation name or list of names. Supported operations in Phase 1 are "count", "sum", "mean", "min", "max", "argmin", "argmax" and the special row-count spelling {"*": "size"}.

argmax(column: str, *, urlpath: str | None = None)

Return logical row positions of maximum non-null column values per group.

Ties keep the first row in the grouped input table or view. Groups with no non-null values for column receive -1.

argmin(column: str, *, urlpath: str | None = None)

Return logical row positions of minimum non-null column values per group.

Ties keep the first row in the grouped input table or view. Groups with no non-null values for column receive -1.
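The tie-breaking and the -1 result for groups without non-null values can be modelled in plain Python. group_argmax is a hypothetical function written for this illustration; None stands in for a null value, and the returned positions are indexes into the input sequences.

```python
def group_argmax(keys, values):
    """Per group, return the position of the first maximum non-null value."""
    best = {}
    for pos, (key, value) in enumerate(zip(keys, values)):
        best.setdefault(key, -1)  # -1 until a non-null value is seen
        if value is None:
            continue
        current = best[key]
        # Strict ">" keeps the earliest row when values tie.
        if current == -1 or value > values[current]:
            best[key] = pos
    return best


keys = ["a", "a", "b", "a", "c"]
values = [3, 7, None, 7, None]
positions = group_argmax(keys, values)
```

An argmin variant would be identical apart from using `<` in the comparison.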

count(column: str, *, urlpath: str | None = None)

Return non-null value counts for column per group.

This is equivalent to SQL COUNT(column) and to group_by(...).agg({column: "count"}).

max(column: str, *, urlpath: str | None = None)

Return maximum values of column per group.

This is equivalent to group_by(...).agg({column: "max"}).

mean(column: str, *, urlpath: str | None = None)

Return means of column per group.

This is equivalent to group_by(...).agg({column: "mean"}).

min(column: str, *, urlpath: str | None = None)

Return minimum values of column per group.

This is equivalent to group_by(...).agg({column: "min"}).

size(*, urlpath: str | None = None)

Return row counts per group as a new CTable.

This is equivalent to SQL COUNT(*): it counts rows in each group and is independent of null values in non-key columns. If urlpath is provided, the result is written as a persistent CTable at that path.
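The difference between size() and count(column) mirrors COUNT(*) versus COUNT(column) in SQL. The sketch below illustrates it with plain Python; group_size and group_count are hypothetical helpers, and None stands in for a null value.

```python
from collections import Counter


def group_size(keys):
    """Rows per group, regardless of nulls in other columns (COUNT(*))."""
    return dict(Counter(keys))


def group_count(keys, values):
    """Non-null values per group (COUNT(column))."""
    counts = {key: 0 for key in keys}
    for key, value in zip(keys, values):
        if value is not None:
            counts[key] += 1
    return counts


cities = ["x", "x", "y"]
sales = [1, None, None]
sizes = group_size(cities)
non_null = group_count(cities, sales)
```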

sum(column: str, *, urlpath: str | None = None)

Return sums of column per group.

This is equivalent to group_by(...).agg({column: "sum"}).

Mutations

In addition to physical schema changes such as CTable.add_column(), CTables can host computed columns backed by a lazy expression over stored columns. Computed columns are read-only, use no extra storage, participate in display, filtering, sorting, and aggregates, and are persisted across CTable.save(), CTable.load(), and CTable.open().

When a computed result should become a normal stored column, use CTable.materialize_computed_column(). The materialized column is a stored snapshot that can be indexed like any other stored column. New rows inserted later via CTable.append() or CTable.extend() auto-fill omitted materialized-column values from the recorded expression metadata.
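The distinction between a virtual computed column and a materialized snapshot can be illustrated with a toy table. TinyTable is a hypothetical class, not the CTable implementation: a per-row Python callable stands in for the lazy expression, and the auto-fill of materialized values on later appends is not modelled.

```python
class TinyTable:
    """Toy table with stored columns and virtual computed columns."""

    def __init__(self, **columns):
        self.stored = {name: list(values) for name, values in columns.items()}
        self.computed = {}

    def add_computed(self, name, fn):
        self.computed[name] = fn  # no data is stored for a computed column

    def read(self, name):
        if name in self.computed:
            fn = self.computed[name]
            nrows = len(next(iter(self.stored.values())))
            # Evaluated on every read from the current stored values.
            return [fn({k: v[i] for k, v in self.stored.items()}) for i in range(nrows)]
        return list(self.stored[name])

    def materialize(self, name, new_name=None):
        new_name = new_name or f"{name}_stored"
        self.stored[new_name] = self.read(name)  # snapshot taken now


t = TinyTable(price=[2.0, 3.0], qty=[5, 4])
t.add_computed("total", lambda row: row["price"] * row["qty"])
t.materialize("total")
t.stored["price"][0] = 4.0  # later change to a source column

live_total = t.read("total")
snapshot_total = t.read("total_stored")
```

The computed column follows the change to price, while the materialized snapshot keeps the values it had when it was created.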

CTable.delete(ind)

Mark one or more rows as deleted (tombstone deletion).

CTable.compact()

Physically rewrite every column array keeping only live rows.

CTable.add_column(name, spec)

Add a new column filled from the default declared in spec.

CTable.add_computed_column(name, expr, *[, ...])

Add a read-only virtual column computed from stored columns.

CTable.materialize_computed_column(name, *)

Materialize a computed column into a new stored snapshot column.

CTable.drop_computed_column(name)

Remove a computed column from the table.

CTable.drop_column(name)

Remove a column from the table.

CTable.rename_column(old, new)

Rename a column.

CTable.delete(ind: int | slice | str | Iterable) -> None

Mark one or more rows as deleted (tombstone deletion).

ind may be a logical row index (int), a slice, or an iterable of logical indices. Deleted rows are excluded from all subsequent queries and aggregates. Physical storage is not reclaimed until compact() is called. Raises ValueError if the table is read-only or a view.

CTable.compact()

Physically rewrite every column array keeping only live rows.

Closes the gaps left by prior delete() calls by shuffling live data to the front of each column array. The underlying NDArray allocations are not resized — each column retains its original capacity. To actually reclaim memory, use copy() with compact=True instead, which allocates fresh arrays sized to the live row count. All existing indexes are dropped and must be recreated afterwards. Raises ValueError if the table is read-only or a view.
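Tombstone deletion followed by compaction can be modelled with a data list and a validity mask. TombstoneColumn is a hypothetical class for illustration only; it shows logical indexes being mapped to physical positions, and compaction moving live data to the front without changing capacity.

```python
class TombstoneColumn:
    """Toy column with a validity mask instead of physical deletion."""

    def __init__(self, values):
        self.data = list(values)
        self.valid = [True] * len(self.data)

    def live_positions(self):
        return [i for i, ok in enumerate(self.valid) if ok]

    def live_values(self):
        return [self.data[i] for i in self.live_positions()]

    def delete(self, logical_index):
        # Logical indexes count live rows only; map to the physical slot first.
        physical = self.live_positions()[logical_index]
        self.valid[physical] = False

    def compact(self):
        live = self.live_values()
        self.data[:len(live)] = live  # move live data to the front
        # Capacity is unchanged; slots past the live count are simply invalid.
        self.valid = [i < len(live) for i in range(len(self.data))]


col = TombstoneColumn([10, 20, 30, 40])
col.delete(1)                    # removes 20
after_first = col.live_values()
col.delete(1)                    # logical index 1 is now 30
col.compact()
```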

CTable.add_column(name: str, spec: SchemaSpec | Field) -> None

Add a new column filled from the default declared in spec.

Parameters:
  • name – Column name. Must follow the same naming rules as schema fields.

  • spec – A schema descriptor such as blosc2.int64(ge=0) or a field descriptor such as blosc2.field(blosc2.int64(ge=0), default=0). When the table already has live rows, use blosc2.field(...) with a default declared so those rows can be backfilled.

Raises:
  • ValueError – If the table is read-only, is a view, the column already exists, or a non-empty table is given a column with no default declared.

  • TypeError – If a declared default cannot be coerced to spec’s dtype.

CTable.add_computed_column(name: str, expr: str | LazyExpr | Callable[[dict[str, Any]], LazyExpr], *, dtype: dtype | None = None) -> None

Add a read-only virtual column computed from stored columns.

A computed column has no physical storage. It is backed by a blosc2.LazyExpr and is evaluated when values are read, filtered, displayed, exported, or aggregated. Because it is virtual, it is read-only, cannot be indexed directly, and is not supplied in append() / extend() inputs. To store and optionally index a computed result, use add_generated_column() or materialize an existing computed column with materialize_computed_column().

Supported signatures are:

add_computed_column(name, "price * qty", dtype=None)
add_computed_column(name, lazy_expr, dtype=None)
add_computed_column(name, lambda cols: cols["price"] * cols["qty"], dtype=None)

Parameters:
  • name – Name of the virtual computed column. It must be a valid column name and must not collide with an existing stored or computed column.

  • expr

    Definition of the virtual column. Accepted forms:

    • str: scalar expression over stored scalar columns, e.g. "price * qty".

    • blosc2.LazyExpr: lazy expression over stored columns of this table.

    • callable: called as expr(self._cols) and must return a blosc2.LazyExpr over stored columns of this table.

    Expressions must depend only on stored columns of this table; computed columns cannot depend on other computed columns in this version. Fixed-shape ndarray columns are not accepted in computed column expressions yet. For row-wise ndarray projections or reductions, use add_generated_column() with values=t.ndarray_col.row_transformer....

  • dtype – Optional dtype override for the computed values. When omitted, the dtype is inferred from the resulting blosc2.LazyExpr. This changes the dtype reported by the CTable column wrapper; it does not create physical storage.

Examples

Add a computed column from a string expression and use it like a normal read-only column:

t.add_computed_column("total", "price * qty")
assert t.total[:].shape == (t.nrows,)

Add a computed column from a callable. The callable receives the table’s stored column mapping:

t.add_computed_column(
    "price_with_tax",
    lambda cols: cols["price"] * 1.21,
    dtype=np.float64,
)

Callable expressions can use normal Python logic while still returning a lazy expression:

def total_expr(cols):
    base = cols["price"] * cols["qty"]
    return base * 1.21 if include_tax else base

t.add_computed_column("total", total_expr)

They are also convenient for reusable, parameterized helpers:

def ratio(num, den):
    return lambda cols: cols[num] / cols[den]

t.add_computed_column("margin", ratio("profit", "revenue"))

Computed columns participate in filters and aggregates:

expensive = t.where(t.total > 100)
total_revenue = t.total.sum()

Computed columns are virtual and read-only. Materialize one when a stored snapshot or an indexable column is needed:

t.materialize_computed_column("total", new_name="total_stored")
t.create_index("total_stored")

For maintained stored results, prefer generated columns:

t.add_generated_column(
    "total_stored",
    values="price * qty",
    dtype=blosc2.float64(),
    create_index=True,
)

Raises:
  • ValueError – If called on a view or read-only table, if name already exists, or if an expression operand does not reference a stored column of this table.

  • TypeError – If expr has an unsupported form, does not produce a blosc2.LazyExpr, references unsupported source columns, or if a RowTransformer is passed. Row transformers are only accepted by add_generated_column().

CTable.materialize_computed_column(name: str, *, new_name: str | None = None, dtype: dtype | None = None, cparams: dict | CParams | None = None) -> None

Materialize a computed column into a new stored snapshot column.

Parameters:
  • name – Existing computed column to materialize.

  • new_name – Name of the new stored column. Defaults to f"{name}_stored".

  • dtype – Optional target dtype for the stored column. Defaults to the computed column dtype.

  • cparams – Optional compression parameters for the new stored column.

Raises:
  • ValueError – If called on a view, on a read-only table, or if the target name collides with an existing stored or computed column.

  • KeyError – If name is not a computed column.

  • TypeError – If dtype is incompatible with the computed values.

CTable.drop_computed_column(name: str) -> None

Remove a computed column from the table.

Parameters:

name – Name of the computed column to remove.

Raises:
  • KeyError – If name is not a computed column.

  • ValueError – If called on a view.

CTable.drop_column(name: str) -> None

Remove a column from the table.

For on-disk tables, the corresponding persisted column leaf is deleted.

Raises:
  • ValueError – If the table is read-only, is a view, or name is the last column.

  • KeyError – If name does not exist.

CTable.rename_column(old: str, new: str) -> None

Rename a column.

For on-disk tables, the corresponding persisted column leaf is renamed.

Renaming a flat column to a dotted name (e.g. "trip.begin.lon") promotes it to a nested leaf column: it will be stored under the hierarchical path /_cols/trip/begin/lon on disk and can be accessed via t["trip.begin.lon"] or the attribute-chain proxy t.trip.begin.lon. This is the primary way to define nested columns when importing from non-Arrow sources:

t.rename_column("trip_begin_lon", "trip.begin.lon")
t["trip.begin.lon"].mean()   # works as a regular Column

Raises:
  • ValueError – If the table is read-only, is a view, or new already exists.

  • KeyError – If old does not exist.
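The mapping from a dotted column name to its hierarchical storage path can be written as a one-line helper. leaf_path is a hypothetical function for illustration; the nested case follows the /_cols/trip/begin/lon example above, while the path shown for a flat name is an assumption of this sketch.

```python
def leaf_path(column_name, root="/_cols"):
    """Map a dotted column name to a slash-separated storage path."""
    return root + "/" + "/".join(column_name.split("."))


nested_path = leaf_path("trip.begin.lon")
flat_path = leaf_path("trip_begin_lon")  # assumed layout for a flat column
```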

Indexes

CTable indexes are created with CTable.create_index() and returned as blosc2.Index handles. For tables, Index refers to an entry stored in the table index catalog and delegates maintenance operations such as drop(), rebuild(), and compact() back to the owning table. Users normally only receive these handles from the CTable API; they do not instantiate them directly.

Indexes can target stored columns or direct expressions over stored columns via create_index(expression=...). This lets queries reuse indexes for derived predicates without adding either a computed column or a materialized stored one. A matching FULL direct-expression index can also be reused by ordering paths such as CTable.sort_by() when sorting by a computed column backed by the same expression. OPSI indexes are a separate exact-filtering tier with a tunable number of iterative ordering cycles; they are not intended to converge to a completely sorted FULL/CSI index, so use FULL when globally sorted ordered reuse is required.

CTable.create_index([col_name, field, ...])

Build and register an index for a stored column or table expression.

CTable.index([col_name, expression, name])

Return the index handle for a stored-column or expression target.

CTable.indexes

Return a list of blosc2.Index handles for all active indexes.

CTable.drop_index([col_name, expression, name])

Remove an index and delete any sidecar files.

CTable.rebuild_index([col_name, expression, ...])

Drop and recreate an index with the same parameters.

CTable.compact_index([col_name, expression, ...])

Compact an index, merging any incremental append runs.

CTable.create_index(col_name: str | None = None, *, field: str | None = None, expression: str | None = None, operands: dict | None = None, kind: IndexKind = IndexKind.BUCKET, optlevel: int = 5, name: str | None = None, build: str = 'auto', tmpdir: str | None = None, **kwargs) -> Index

Build and register an index for a stored column or table expression.

For tables with nested (dotted) column names, pass the dotted leaf name directly:

t.create_index("trip.begin.lon")
t.where("trip.begin.lon > -87.7").nrows   # index is used automatically

CTable.index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) -> Index

Return the index handle for a stored-column or expression target.

CTable.indexes

Return a list of blosc2.Index handles for all active indexes.

CTable.drop_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) -> None

Remove an index and delete any sidecar files.

CTable.rebuild_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) -> Index

Drop and recreate an index with the same parameters.

CTable.compact_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) -> Index

Compact an index, merging any incremental append runs.

See blosc2.Index for the returned handle attributes and methods.

Persistence

Persist CTables to disk or interchange formats, and restore them later without losing schema information. These methods cover native Blosc2 persistence as well as import/export paths for CSV, Arrow, and Parquet data.

CTable.load(urlpath)

Load a persistent table from urlpath into RAM.

CTable.open(urlpath, *[, mode])

Open a persistent CTable from urlpath.

CTable.save(urlpath, *[, overwrite])

Persist this table to disk at urlpath.

CTable.to_b2z(urlpath, *[, overwrite, compact])

Write this table to a compact .b2z container.

CTable.to_b2d(urlpath, *[, overwrite, compact])

Write this table to a directory-backed store.

CTable.to_csv(path, *[, header, sep])

Write all live rows to a CSV file.

CTable.to_arrow()

Convert all live rows to a pyarrow.Table.

CTable.to_parquet(path, *[, columns, ...])

Write this table to a Parquet file batch-wise using pyarrow.

CTable.from_arrow(schema, batches, *[, ...])

Build a CTable from an Arrow schema and iterable of record batches.

CTable.from_parquet(path, *[, columns, ...])

Read a Parquet file into a CTable.

CTable.from_csv(path, row_cls, *[, header, sep])

Build a CTable from a CSV file.

classmethod CTable.load(urlpath: str) CTable[source]

Load a persistent table from urlpath into RAM.

The schema is read from the table’s metadata — the original Python dataclass is not required. The returned table is fully in-memory and read/write.

Parameters:

urlpath – Path to the table root directory.

Raises:
  • FileNotFoundError – If urlpath does not contain a CTable.

  • ValueError – If the metadata at urlpath does not identify a CTable.

classmethod CTable.open(urlpath: str, *, mode: str = 'r') CTable[source]

Open a persistent CTable from urlpath.

Parameters:
  • urlpath – Path to the table root directory (created by passing urlpath to CTable).

  • mode – 'r' (default): read-only. 'a': read/write.

Raises:
  • FileNotFoundError – If urlpath does not contain a CTable.

  • ValueError – If the metadata at urlpath does not identify a CTable.

CTable.save(urlpath: str, *, overwrite: bool = False) None[source]

Persist this table to disk at urlpath.

This writes a standalone copy and returns None; use copy() directly when the copied CTable object is needed.

Only live rows are written — the on-disk table is always compacted. A .b2z suffix selects the compact zip-backed format; any other suffix creates a directory-backed store. Use a .b2d suffix for directory-backed stores when possible so the format is clear.

Parameters:
  • urlpath – Destination path. Use a .b2z suffix for a compact zip-backed store; any other suffix creates a directory-backed store. A .b2d suffix is recommended for directory-backed stores.

  • overwrite – If False (default), raise ValueError when urlpath already exists. Set to True to replace an existing table.

Raises:

ValueError – If urlpath already exists and overwrite=False.
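
The suffix rule described above can be sketched as a small helper. This is an illustration only: store_kind is a hypothetical name, not part of blosc2, and the real save() may apply further checks (for example on letter case) that are not documented here.

```python
from pathlib import Path


def store_kind(urlpath: str) -> str:
    """Hypothetical helper mirroring the documented suffix rule."""
    # Only a .b2z suffix selects the zip-backed format; every other path,
    # including the recommended .b2d suffix, is directory-backed.
    return "zip" if Path(urlpath).suffix == ".b2z" else "directory"


print(store_kind("data.b2z"))
print(store_kind("data.b2d"))
print(store_kind("data"))
```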

CTable.to_b2z(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]

Write this table to a compact .b2z container.

.b2z is the compact zip-backed CTable format. For persistent, non-view directory-backed tables and compact=False, this uses a fast physical-pack path: the backing TreeStore directory is zipped with already-compressed leaves stored as-is. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns. A .b2d suffix is recommended for directory-backed stores, but not required.

For in-memory tables, views, existing .b2z tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new .b2z store.

Examples

Fast-pack an existing directory-backed table into a compact zip store:

table = blosc2.CTable.open("data.b2d", mode="r")
table.to_b2z("data.b2z", overwrite=True)
table.close()

Materialize a filtered view into a new compact store:

view = table.where(table["score"] > 10)
view.to_b2z("high-score.b2z", overwrite=True)

Force a logical compacted copy, even for a persistent .b2d table:

table.to_b2z("data-compact.b2z", overwrite=True, compact=True)
CTable.to_b2d(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]

Write this table to a directory-backed store.

Directory-backed CTable stores may use any path that does not end in .b2z; using a .b2d suffix is recommended for clarity. For persistent, non-view .b2z tables opened read-only and compact=False, this uses a fast physical-unpack path: the zip members are extracted as already-compressed leaves. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns.

For in-memory tables, views, writable .b2z tables, existing directory-backed tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new directory-backed store.

Examples

Fast-unpack an existing compact zip store into a directory-backed table:

table = blosc2.CTable.open("data.b2z", mode="r")
table.to_b2d("data.b2d", overwrite=True)
table.close()

Materialize a filtered view into a directory-backed store:

view = table.where(table["score"] > 10)
view.to_b2d("high-score.b2d", overwrite=True)

Force a logical compacted copy, even for a persistent .b2z table:

table.to_b2d("data-compact.b2d", overwrite=True, compact=True)
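
The choice between the fast physical path and the logical save() path, as described for to_b2z() and to_b2d(), can be summarised as a decision sketch. TableState, to_b2z_path and to_b2d_path are hypothetical names introduced here for illustration; they only restate the documented conditions and are not how blosc2 implements them.

```python
from dataclasses import dataclass


@dataclass
class TableState:
    """Hypothetical description of a table, for illustration only."""
    backing: str            # "memory", "b2d" (directory) or "b2z" (zip)
    is_view: bool = False
    read_only: bool = False


def to_b2z_path(state: TableState, compact: bool = False) -> str:
    # Fast physical pack: persistent, non-view, directory-backed, compact=False.
    if state.backing == "b2d" and not state.is_view and not compact:
        return "physical-pack"
    return "logical-save"


def to_b2d_path(state: TableState, compact: bool = False) -> str:
    # Fast physical unpack: persistent, non-view .b2z opened read-only,
    # compact=False.
    if (state.backing == "b2z" and state.read_only
            and not state.is_view and not compact):
        return "physical-unpack"
    return "logical-save"
```

The physical paths keep deleted rows and spare capacity; the logical path writes live rows only.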
CTable.to_csv(path: str, *, header: bool = True, sep: str = ',') None[source]

Write all live rows to a CSV file.

Uses Python’s stdlib csv module — no extra dependency required. Fixed-shape ndarray column cells are serialised as JSON arrays for readability and shape safety (e.g. "[1.0, 2.0, 3.0]").

Parameters:
  • path – Destination file path. Created or overwritten.

  • header – If True (default), write column names as the first row.

  • sep – Field delimiter. Defaults to ","; use "\t" for TSV.
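
The JSON-array encoding of fixed-shape array cells can be reproduced with the stdlib alone. The sketch below does not call CTable.to_csv(); it assumes a plain list of (id, vector) rows and shows how such a cell might round-trip through csv and json.

```python
import csv
import io
import json

rows = [
    (1, [1.0, 2.0, 3.0]),
    (2, [4.0, 5.0, 6.0]),
]

buf = io.StringIO()
writer = csv.writer(buf, delimiter=",")
writer.writerow(["id", "vec"])                  # header row
for row_id, vec in rows:
    writer.writerow([row_id, json.dumps(vec)])  # array cell -> JSON text

text = buf.getvalue()

# Reading back: csv unquotes the cell, json restores the list.
reader = csv.reader(io.StringIO(text))
next(reader)                                    # skip the header
restored = [(int(r[0]), json.loads(r[1])) for r in reader]
```

Because the JSON text contains the delimiter, the csv module quotes the cell, so the array stays in a single field.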

CTable.to_arrow()[source]

Convert all live rows to a pyarrow.Table.

CTable.to_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, compression: str | None = 'zstd', row_group_size: int | None = None, include_computed: bool = True, **kwargs) None[source]

Write this table to a Parquet file batch-wise using pyarrow.

classmethod CTable.from_arrow(schema, batches, *, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, capacity_hint: int | None = None, string_max_length: int | Mapping[str, int] | None = None, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, list_serializer: Literal['msgpack', 'arrow'] = 'msgpack', object_fallback: bool = False, column_cparams: Mapping[str, dict[str, Any]] | None = None, separate_nested_cols: bool = False) CTable[source]

Build a CTable from an Arrow schema and iterable of record batches.

Nested struct flattening: top-level Arrow struct<…> fields are automatically and recursively flattened into dotted leaf columns. For example, a field trip: struct<begin: struct<lon: float64, lat: float64>> becomes two CTable columns trip.begin.lon and trip.begin.lat. Each leaf is stored as an independent compressed NDArray. Row reads via t[i] reconstruct the original nested dict shape. Use t["trip.begin.lon"] or t.trip.begin.lon to access a leaf:

import pyarrow as pa, blosc2
trip_type = pa.struct([("begin", pa.struct([("lon", pa.float64())]))])
schema = pa.schema([pa.field("trip", trip_type)])
t = blosc2.CTable.from_arrow(schema, batches)
t.col_names          # ['trip.begin.lon']
t["trip.begin.lon"].mean()
t.trip.begin.lon.max()
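
The flattening and row reconstruction described above can be pictured with plain dicts. This is a minimal sketch, independent of blosc2 and pyarrow; flatten and unflatten are hypothetical helpers that only mimic the dotted-name convention.

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Recursively flatten nested dicts into dotted leaf names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def unflatten(flat: dict) -> dict:
    """Rebuild the nested dict shape from dotted leaf names."""
    nested = {}
    for name, value in flat.items():
        *parents, leaf = name.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


row = {"trip": {"begin": {"lon": -87.6, "lat": 41.9}}}
flat = flatten(row)
```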

When string_max_length is None (the default), scalar Arrow string / large_string columns are imported as vlstring() columns and binary / large_binary columns are imported as vlbytes() columns. Struct columns that do not contain only scalar leaves are imported as struct() columns backed by batched variable-length storage. Null values for these variable-length scalar columns are represented as native None with no sentinel needed.

When string_max_length is set to a positive integer, scalar string and binary columns are imported as fixed-width string() / bytes() columns whose dtype is sized to string_max_length characters/bytes. It may also be a mapping from column name to max length; omitted string/binary columns remain vlstring() / vlbytes() columns.
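
The three accepted forms of string_max_length (None, a single integer, or a per-column mapping) can be summarised as follows. resolve_max_length is a hypothetical helper written for this illustration; it is not a blosc2 function.

```python
from collections.abc import Mapping


def resolve_max_length(column, string_max_length):
    """Return the fixed width for `column`, or None for variable-length."""
    if string_max_length is None:
        return None                            # vlstring() / vlbytes()
    if isinstance(string_max_length, Mapping):
        # Columns omitted from the mapping stay variable-length.
        return string_max_length.get(column)
    return int(string_max_length)              # one width for all columns
```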

blosc2_batch_size controls how many rows are buffered before BatchArray-backed imported columns (list columns and variable-length scalar columns such as vlstring, vlbytes, struct, and schema-less object columns) are flushed to their backend. Set it to None to keep those columns pending until the final flush.

list_serializer selects the backend serializer for imported list columns. "msgpack" is the default; "arrow" stores Arrow list batches directly and can be much faster for deeply nested list columns.

Unsupported Arrow types raise by default. Pass object_fallback=True to import such columns as schema-less object() columns. This fallback is intentionally not used by from_parquet().

column_cparams optionally maps column names to per-column compression parameters. These override the table-level cparams for fixed-width columns imported from Arrow.

classmethod CTable.from_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, list_serializer: Literal['msgpack', 'arrow'] = 'arrow', separate_nested_cols: bool = True, max_rows: int | None = None, **kwargs) CTable[source]

Read a Parquet file into a CTable.

The Parquet file is streamed batch by batch through pyarrow and then converted into a typed CTable. By default, the result is created in memory, but you can also persist it on disk via urlpath.

This method delegates the actual table construction to CTable.from_arrow(), so Arrow schema handling, nullable-column support, and Blosc2 write tuning follow the same rules as that method.

Nested struct flattening: top-level Parquet struct<…> fields are automatically and recursively flattened into dotted leaf columns — the same as in from_arrow(). For example, a Parquet file that contains a column trip: struct<begin: struct<lon: double, lat: double>> produces two CTable columns trip.begin.lon and trip.begin.lat. Row reads reconstruct the original nested dict shape; individual leaves are accessed via dotted names or attribute-chain proxies:

t = blosc2.CTable.from_parquet("trips.parquet")
t.col_names               # e.g. ['trip.begin.lon', 'trip.begin.lat', ...]
t["trip.begin.lon"].mean()
t.trip.begin.lon.max()

Unsupported Parquet types are not silently imported as schema-less object() columns; they raise so callers can decide how to handle them explicitly.

Parameters:
  • path (str or path-like) – Path to the source Parquet file.

  • columns (list[str] or None, optional) – Subset of columns to read from the Parquet file. If provided, only these columns are loaded and their order in the resulting table matches the order in this list. Column names must be unique.

  • batch_size (int, optional) – Number of rows per Arrow batch read from the Parquet file. This controls how much data is pulled from the file at a time before being handed off to the CTable builder. Must be greater than 0.

  • urlpath (str or None, optional) – Destination storage path for the resulting CTable. If None (the default), the table is created in memory. If provided, the table is backed by persistent on-disk storage.

  • mode (str, optional) – Storage open mode for urlpath. Defaults to "w". This is passed through to CTable.from_arrow().

  • cparams (object, optional) – Compression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().

  • dparams (object, optional) – Decompression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().

  • validate (bool, optional) – Whether to enable extra internal validation while building the table. Defaults to False.

  • auto_null_sentinels (bool, optional) – If True (default), nullable scalar columns imported from Parquet may automatically receive per-column null sentinel values when needed. Sentinel selection follows the current null-policy rules used by CTable schema handling.

  • blosc2_batch_size (int or None, optional) – Number of items written to Blosc2 containers per internal write batch. Passed through to CTable.from_arrow().

  • blosc2_items_per_block (int or None, optional) – Target number of items per internal Blosc2 block. Passed through to CTable.from_arrow(). In general, a larger number of items per block favors compression ratio but makes random access slower.

  • list_serializer ({"msgpack", "arrow"}, optional) – Serializer used for imported list columns. The default, "arrow", stores Arrow list batches directly and is much faster for deeply nested or list<struct<...>> columns. The tradeoff is that accessing those list columns later requires PyArrow. Use "msgpack" to keep list-column stores independent of PyArrow at read time; it can be smaller for simple lists but is much slower and more memory-intensive for deeply nested data.

  • separate_nested_cols (bool, optional) – Whether to separate qualifying nested columns during import. Defaults to True. In particular, a single unnamed top-level list<struct<...>> field is treated as a root record stream: each list element becomes a CTable row and struct leaves become ordinary nested CTable columns. Use separate_nested_cols=False when closer fidelity to the original Parquet row/schema shape is more important than the separated column layout.

  • max_rows (int or None, optional) – Maximum number of rows to import. For ordinary Parquet files this limits Parquet/CTable rows. For unnamed-root list<struct<...>> files imported with separate_nested_cols=True, this limits flattened element rows.

  • **kwargs – Additional keyword arguments forwarded to pyarrow.parquet.ParquetFile. Use these for Parquet-reader-specific options supported by PyArrow.

Returns:

A new CTable populated from the Parquet file. The table contains all selected columns and all rows from the file, or at most max_rows rows when that limit is set. If urlpath is provided, the returned table is disk-backed; otherwise it is in-memory.

Return type:

CTable

Raises:
  • ImportError – If pyarrow is not installed.

  • ValueError – If batch_size is not greater than 0.

  • ValueError – If max_rows is negative.

  • ValueError – If columns contains duplicate names.

  • Exception – Any exception raised by pyarrow while opening or reading the Parquet file, or by CTable.from_arrow() while converting Arrow data into a CTable.

Examples

Load an entire Parquet file into an in-memory table:

>>> import blosc2
>>> t = blosc2.CTable.from_parquet("data.parquet")

Load only a subset of columns:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     columns=["user_id", "amount", "country"],
... )

Create a disk-backed table while reading in batches:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     batch_size=50_000,
...     urlpath="data.ctable",
... )

Pass additional options through to PyArrow’s Parquet reader:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     memory_map=True,
... )
classmethod CTable.from_csv(path: str, row_cls, *, header: bool = True, sep: str = ',') CTable[source]

Build a CTable from a CSV file.

Schema comes from row_cls (a dataclass) — CTable is always typed. All rows are read in a single pass into per-column Python lists, then each column is bulk-written into a pre-allocated NDArray (one slice assignment per column, no extend()).

Parameters:
  • path – Source CSV file path.

  • row_cls – A dataclass whose fields define the column names and types.

  • header – If True (default), the first row is treated as a header and skipped. Column order in the file must match row_cls field order regardless.

  • sep – Field delimiter. Defaults to ","; use "\t" for TSV.

Returns:

A new in-memory CTable containing all rows from the CSV file.

Return type:

CTable

Raises:
  • TypeError – If row_cls is not a dataclass.

  • ValueError – If a row has a different number of fields than the schema.

Inspection & statistics

Compute common descriptive statistics directly on CTable data without materializing rows first. These methods operate column-wise on the compressed representation, making it easy to summarize distributions or measure relationships between numeric columns.

CTable.column_schema(name)

Return the CompiledColumn descriptor for name.

CTable.info

Get information about this table.

CTable.schema_dict()

Return a JSON-compatible dict describing this table's schema.

CTable.describe()

Print a per-column statistical summary.

CTable.cov()

Return the covariance matrix as a numpy array.

CTable.column_schema(name: str) CompiledColumn[source]

Return the CompiledColumn descriptor for name.

Raises:

KeyError – If name is not a column in this table.

CTable.info()

Get information about this table.

Examples

>>> print(t.info)
>>> t.info()
CTable.schema_dict() dict[str, Any][source]

Return a JSON-compatible dict describing this table’s schema.

CTable.describe() None[source]

Print a per-column statistical summary.

Numeric columns (int, float): count, mean, std, min, max. Bool columns: count, true-count, true-%. String columns: count, min (lex), max (lex), n-unique.

CTable.cov() ndarray[source]

Return the covariance matrix as a numpy array.

Only int, float, and bool columns are supported. Bool columns are cast to int (0/1) before computation. Complex columns raise TypeError.

Returns:

Shape (ncols, ncols). Column order matches col_names.

Return type:

numpy.ndarray

Raises:
  • TypeError – If any column has an unsupported dtype (complex, string, …).

  • ValueError – If the table has fewer than 2 live rows (covariance undefined).
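
As a rough picture of what cov() computes, the sketch below casts bool values to int and builds the matrix in pure Python. It assumes the N-1 (sample) normalisation that numpy.cov uses by default; the source does not state which normalisation CTable.cov() applies, so treat that as an assumption. cov_matrix is a hypothetical helper.

```python
def cov_matrix(columns):
    """Sample covariance matrix of equally long numeric/bool columns."""
    # Bool values are cast to int (0/1) before any arithmetic.
    cols = [[int(v) if isinstance(v, bool) else v for v in col]
            for col in columns]
    n = len(cols[0])
    if n < 2:
        raise ValueError("covariance needs at least 2 rows")
    means = [sum(col) / n for col in cols]
    return [
        [
            # Assumed normalisation: divide by N-1.
            sum((a - ma) * (b - mb) for a, b in zip(ca, cb)) / (n - 1)
            for cb, mb in zip(cols, means)
        ]
        for ca, ma in zip(cols, means)
    ]


score = [1.0, 2.0, 3.0, 4.0]
active = [True, False, True, False]
m = cov_matrix([score, active])
```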


Column

A lazy column accessor returned by table["col_name"] or table.col_name. All index operations and aggregates apply the table’s tombstone mask (_valid_rows) so deleted rows are silently excluded.

class blosc2.Column(table: CTable, col_name: str, mask=None)[source]

Column view for a CTable, with vectorized operations and reductions.

Attributes:
dtype

NumPy dtype of the underlying storage, or None for variable-length columns (vlstring(), vlbytes(), list()).

info

Get information about this column.

info_items

Structured summary items used by info.

is_computed

True if this column is a virtual computed column (read-only).

is_dictionary

True if this column is a dictionary-encoded string column.

is_generated

True if this column is a stored generated/materialized column.

is_list
is_ndarray

True if this column stores fixed-shape N-D array values per row.

is_stale

True if this generated column needs to be refreshed before use.

is_varlen_scalar

True if this column holds variable-length scalar strings or bytes.

item_ndim

Number of per-row item dimensions.

item_shape

Per-row item shape; () for scalar columns.

item_size

Number of scalar values stored in each row item.

ndim

Number of logical dimensions.

null_value

The sentinel value that represents NULL for this column, or None.

row_transformer

Build row-wise projections/reductions for generated columns.

shape

Logical shape of the live column values.

size

Number of live scalar values in the logical column array.

view

Return a ColumnViewIndexer for creating logical sub-views.

Methods

all()

Return True if every live, non-null value is True.

any()

Return True if at least one live, non-null value is True.

argmax([axis, where])

Index of the maximum live, non-null value.

argmin([axis, where])

Index of the minimum live, non-null value.

assign(data)

Replace all live values in this column with data.

is_null()

Return a boolean array True where the live value is the null sentinel.

isin(values)

Return a boolean array True where the live value is in values.

iter_chunks([size])

Iterate over live column values in chunks of size rows.

max([axis, where])

Maximum live, non-null value.

mean([axis, where])

Arithmetic mean of all live, non-null values.

min([axis, where])

Minimum live, non-null value.

norm([ord, axis, where])

Vector/matrix norm of a fixed-shape ndarray column.

notnull()

Return a boolean array True where the live value is not the null sentinel.

null_count()

Return the number of live rows whose value equals the null sentinel.

read_stale([key])

Read stored values even when this generated column is marked stale.

std([ddof, axis, where])

Standard deviation of all live, non-null values (single-pass, Welford's algorithm).

sum([dtype, axis, where, jit, jit_backend])

Sum of all live, non-null values.

summary()

Return and print a compact summary for this column.

unique()

Return sorted array of unique live, non-null values.

value_counts()

Return a {value: count} dict sorted by count descending.

Special methods

Column.__len__()

Return the number of live (non-deleted) values in this column.

Column.__iter__()

Iterate over live column values in insertion order, skipping deleted rows.

Column.__getitem__(key)

Return values for the given logical index.

Column.__setitem__(key, value)

Set one or more live column values; accepts the same index forms as __getitem__().

__len__()[source]

Return the number of live (non-deleted) values in this column.


__iter__()[source]

Iterate over live column values in insertion order, skipping deleted rows.


__getitem__(key: int | slice | list | ndarray)[source]

Return values for the given logical index.

  • int → scalar

  • slice → numpy.ndarray

  • list / np.ndarray → numpy.ndarray

  • bool np.ndarray → numpy.ndarray

For a writable logical sub-view use view.
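
Logical indexing means that positions count live rows only. The following sketch, using plain lists rather than a real Column, shows one way a logical index could be resolved against a tombstone mask; the names physical, valid_rows and get are illustrative assumptions.

```python
physical = [10, 20, 30, 40, 50]                 # stored values, deleted rows included
valid_rows = [True, False, True, True, False]   # tombstone mask: False = deleted

# Physical positions of the live rows, in insertion order.
live_positions = [i for i, ok in enumerate(valid_rows) if ok]


def get(key):
    """Resolve a logical int or slice against the live rows only."""
    if isinstance(key, slice):
        return [physical[i] for i in live_positions[key]]
    return physical[live_positions[key]]
```

Logical index 1 here maps to physical position 2, because physical row 1 has been deleted.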

__setitem__(key: int | slice | list | ndarray, value)[source]

Set one or more live column values; accepts the same index forms as __getitem__().


all() bool[source]

Return True if every live, non-null value is True.

Supported dtypes: bool. Null sentinel values are skipped. Short-circuits on the first False found.

any() bool[source]

Return True if at least one live, non-null value is True.

Supported dtypes: bool. Null sentinel values are skipped. Short-circuits on the first True found.

argmax(axis=None, *, where=None)[source]

Index of the maximum live, non-null value.

For fixed-shape ndarray columns, this follows NumPy axis semantics on the logical array of shape (nrows, *item_shape). For scalar columns, the result is the logical row position within this column (or filtered view).

argmin(axis=None, *, where=None)[source]

Index of the minimum live, non-null value.

For fixed-shape ndarray columns, this follows NumPy axis semantics on the logical array of shape (nrows, *item_shape). For scalar columns, the result is the logical row position within this column (or filtered view).

assign(data) None[source]

Replace all live values in this column with data.

Works on both full tables and views — on a view, only the rows visible through the view’s mask are overwritten.

Parameters:

data – List, numpy array, or any iterable. Must have exactly as many elements as there are live rows in this column. Values are coerced to the column’s dtype if possible.

Raises:
  • ValueError – If len(data) does not match the number of live rows, or the table is opened read-only.

  • TypeError – If values cannot be coerced to the column’s dtype.

is_null() ndarray[source]

Return a boolean array True where the live value is the null sentinel.

For varlen scalar columns (vlstring/vlbytes) nullability is represented as native None values, so this returns True wherever the value is None. For dictionary columns, returns True where the code equals the null_code (-1 by default).

isin(values) ndarray[source]

Return a boolean array True where the live value is in values.

For dictionary columns this performs efficient integer-code membership testing (no decoding of all values). Values absent from the dictionary are treated as not-present.

For non-dictionary columns this decodes all live values and tests membership in a set.
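
The integer-code membership test for dictionary columns can be pictured as below. This is a sketch over plain lists under the assumption that each row stores a code into a list of distinct values; it does not reflect the internal layout blosc2 actually uses.

```python
dictionary = ["DE", "ES", "FR"]     # distinct values; position = integer code
codes = [0, 2, 2, 1, -1, 0]         # one code per live row; -1 marks null here


def isin(values):
    lookup = {value: code for code, value in enumerate(dictionary)}
    # Values absent from the dictionary have no code and can never match.
    wanted = {lookup[v] for v in values if v in lookup}
    # Only integer codes are compared; no row value is decoded.
    return [code in wanted for code in codes]


mask = isin(["FR", "IT"])
```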

iter_chunks(size: int = 65536)[source]

Iterate over live column values in chunks of size rows.

Yields numpy arrays of at most size elements each, skipping deleted rows. The last chunk may be smaller than size.

Parameters:

size – Number of live rows per yielded chunk. Defaults to 65536.

Yields:

numpy.ndarray – A 1-D array of up to size live values with this column’s dtype.

Examples

>>> for chunk in t["score"].iter_chunks(size=100_000):
...     process(chunk)
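
The chunking behaviour (at most size live values per chunk, deleted rows skipped, shorter final chunk) can be sketched with a plain generator. This stand-in yields Python lists instead of NumPy arrays and is not the blosc2 implementation.

```python
def iter_chunks(values, valid_rows, size=65536):
    """Yield lists of at most `size` live values, skipping deleted rows."""
    chunk = []
    for value, ok in zip(values, valid_rows):
        if not ok:
            continue
        chunk.append(value)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:                                  # final, possibly shorter chunk
        yield chunk


values = list(range(10))
valid_rows = [i != 3 for i in range(10)]       # row 3 is deleted
chunks = list(iter_chunks(values, valid_rows, size=4))
```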
max(axis=None, *, where=None)[source]

Maximum live, non-null value.

Supported dtypes: bool, int, uint, float, string, bytes. Strings are compared lexicographically. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included.

mean(axis=None, *, where=None)[source]

Arithmetic mean of all live, non-null values.

Supported dtypes: bool, int, uint, float. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included. Always returns a Python float.

min(axis=None, *, where=None)[source]

Minimum live, non-null value.

Supported dtypes: bool, int, uint, float, string, bytes. Strings are compared lexicographically. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included.

norm(ord=None, axis=None, *, where=None)[source]

Vector/matrix norm of a fixed-shape ndarray column.

The column is treated as a logical array of shape (nrows, *item_shape). For example, axis=1 computes one norm per row for a 1-D item shape.

notnull() ndarray[source]

Return a boolean array True where the live value is not the null sentinel.

null_count() int[source]

Return the number of live rows whose value equals the null sentinel.

Returns 0 in O(1) if no null_value is configured for this column and the column is not a varlen scalar column.

read_stale(key=slice(None, None, None))[source]

Read stored values even when this generated column is marked stale.

This is an explicit escape hatch for inspecting the last materialized values. Normal reads raise for stale generated columns so outdated values are not used accidentally.

std(ddof: int = 0, axis=None, *, where=None)[source]

Standard deviation of all live, non-null values (single-pass, Welford’s algorithm).

Parameters:
  • ddof – Delta degrees of freedom. 0 (default) gives the population std; 1 gives the sample std (divides by N-1).

  • where – Optional boolean predicate. Only rows where the predicate is true, the table row is live, and this column is non-null are included.

Null sentinel values are skipped. Always returns a Python float.
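
For reference, a single-pass Welford computation with ddof and sentinel skipping might look like the sketch below. It is a generic textbook version, not the blosc2 source; returning NaN when too few values remain is an assumption made here.

```python
import math


def welford_std(values, ddof=0, null_value=None):
    """Single-pass standard deviation that skips the null sentinel."""
    n = 0
    mean = 0.0
    m2 = 0.0                      # running sum of squared deviations
    for x in values:
        if null_value is not None and x == null_value:
            continue
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n - ddof <= 0:
        return float("nan")       # assumed behaviour for too few values
    return math.sqrt(m2 / (n - ddof))


data = [2, 4, 4, 4, 5, 5, 7, 9, -1]    # -1 plays the role of the sentinel
pop = welford_std(data, null_value=-1)
sample = welford_std(data, ddof=1, null_value=-1)
```

The update keeps only a count, a running mean and a running sum of squared deviations, so the data is never stored or read twice.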

sum(dtype=None, axis=None, *, where=None, jit=None, jit_backend=None)[source]

Sum of all live, non-null values.

Returns zero for an empty column or filtered view.

Supported dtypes: bool, int, uint, float, complex. Bool values are counted as 0 / 1. Null sentinel values are skipped.

Parameters:
  • dtype – Optional accumulator dtype. When omitted, float columns use np.float64, complex columns use np.complex128, and integer / bool columns use np.int64.

  • where – Optional boolean predicate. Only rows where the predicate is true, the table row is live, and this column is non-null are included. This enables direct filtered aggregate pushdown, avoiding creation of an intermediate filtered table view.

  • jit – Optional miniexpr JIT policy passed to the lazy reduction engine.

  • jit_backend – Optional miniexpr JIT backend. Use "tcc" or "cc".

Examples

Sum values matching a predicate without materializing a filtered view:

total = t["amount"].sum(where=t.category == 3)

Combine several column predicates:

total = t.col2.sum(where=(t.col1 < 300) & (t.col2 < 400))

Nullable sentinel values are skipped automatically:

# Equivalent to summing only live rows where predicate is true and
# t.col2 is not its configured null sentinel.
total = t.col2.sum(where=t.col1 < 300)
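
The three conditions that decide whether a row contributes (row is live, value is non-null, predicate is true) can be spelled out with plain lists. The column contents, the -1 sentinel and the mask below are made-up illustration data.

```python
col1 = [100, 250, 400, 120, 90]
col2 = [10, -1, 30, 40, 50]                    # -1 is a hypothetical null sentinel
valid_rows = [True, True, True, False, True]   # row 3 has been deleted
NULL = -1

total = sum(
    v2
    for v1, v2, live in zip(col1, col2, valid_rows)
    if live and v2 != NULL and v1 < 300        # live, non-null, predicate true
)
print(total)
```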
summary() str[source]

Return and print a compact summary for this column.

For fixed-shape ndarray columns this includes logical shape, storage, and row-norm statistics when numeric. Scalar columns fall back to info.

unique() ndarray[source]

Return sorted array of unique live, non-null values.

Null sentinel values are excluded. Processes data in chunks — never loads the full column at once.

value_counts() dict[source]

Return a {value: count} dict sorted by count descending.

Null sentinel values are excluded. Processes data in chunks — never loads the full column at once.

Example

>>> t["active"].value_counts()
{True: 8432, False: 1568}
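
Chunk-wise counting can be sketched with collections.Counter. This is a minimal stand-in that assumes the chunks arrive as ordinary Python lists; value_counts here is a free function written for illustration, not the Column method.

```python
from collections import Counter


def value_counts(chunks, null_value=None):
    """Accumulate counts chunk by chunk, then sort by count descending."""
    counts = Counter()
    for chunk in chunks:
        # Only one chunk is held at a time; null sentinels are excluded.
        counts.update(v for v in chunk if v != null_value)
    return dict(counts.most_common())


chunks = [[True, True, False], [True, False, True]]
result = value_counts(chunks)
```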
property dtype

NumPy dtype of the underlying storage, or None for variable-length columns (vlstring(), vlbytes(), list()).

property info: _CTableInfoReporter

Get information about this column.

The report includes both logical/live-row details and, when available, the physical storage details used internally by lazy predicates.

Examples

>>> print(t["score"].info)
>>> t["score"].info()
property info_items: list[tuple[str, object]]

Structured summary items used by info.

property is_computed: bool

True if this column is a virtual computed column (read-only).

property is_dictionary: bool

True if this column is a dictionary-encoded string column.

property is_generated: bool

True if this column is a stored generated/materialized column.

property is_ndarray: bool

True if this column stores fixed-shape N-D array values per row.

property is_stale: bool

True if this generated column needs to be refreshed before use.

property is_varlen_scalar: bool

True if this column holds variable-length scalar strings or bytes.

property item_ndim: int

Number of per-row item dimensions.

property item_shape: tuple[int, ...]

Per-row item shape; () for scalar columns.

property item_size: int

Number of scalar values stored in each row item.

property ndim: int

Number of logical dimensions.

property null_value

The sentinel value that represents NULL for this column, or None.

property row_transformer: RowTransformer

Build row-wise projections/reductions for generated columns.

property shape: tuple[int, ...]

Logical shape of the live column values.

property size: int

Number of live scalar values in the logical column array.

property view: ColumnViewIndexer

Return a ColumnViewIndexer for creating logical sub-views.

Examples

Read a sub-view for chained aggregates:

sub = t.price.view[2:10]
sub.sum()

Bulk write through a sub-view:

t.price.view[0:5][:] = np.zeros(5)

Attributes

Column.dtype

NumPy dtype of the underlying storage, or None for variable-length columns (vlstring(), vlbytes(), list()).

Column.null_value

The sentinel value that represents NULL for this column, or None.

Column.row_transformer

Build row-wise projections/reductions for generated columns.

property Column.dtype

NumPy dtype of the underlying storage, or None for variable-length columns (vlstring(), vlbytes(), list()).

property Column.null_value

The sentinel value that represents NULL for this column, or None.

property Column.row_transformer: RowTransformer

Build row-wise projections/reductions for generated columns.

Data access

Column.view

Return a ColumnViewIndexer for creating logical sub-views.

Column.iter_chunks([size])

Iterate over live column values in chunks of size rows.

Column.assign(data)

Replace all live values in this column with data.

property Column.view: ColumnViewIndexer

Return a ColumnViewIndexer for creating logical sub-views.

Examples

Read a sub-view for chained aggregates:

sub = t.price.view[2:10]
sub.sum()

Bulk write through a sub-view:

t.price.view[0:5][:] = np.zeros(5)

Column.iter_chunks(size: int = 65536)[source]

Iterate over live column values in chunks of size rows.

Yields numpy arrays of at most size elements each, skipping deleted rows. The last chunk may be smaller than size.

Parameters:

size – Number of live rows per yielded chunk. Defaults to 65 536.

Yields:

numpy.ndarray – A 1-D array of up to size live values with this column’s dtype.

Examples

>>> for chunk in t["score"].iter_chunks(size=100_000):
...     process(chunk)
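
The chunking behaviour described above can be pictured with a small pure-Python sketch. It is not the library's implementation: `iter_live_chunks`, `values`, and `live_mask` are hypothetical names, and plain lists stand in for NumPy arrays.

```python
from itertools import islice


def iter_live_chunks(values, live_mask, size=65536):
    """Yield lists of at most `size` live values, skipping deleted rows."""
    # Lazily filter out deleted rows so the full column is never copied.
    live = (v for v, keep in zip(values, live_mask) if keep)
    while True:
        chunk = list(islice(live, size))
        if not chunk:
            return
        yield chunk


values = [10, 11, 12, 13, 14, 15, 16]
live_mask = [True, False, True, True, True, False, True]  # False = deleted row
chunks = list(iter_live_chunks(values, live_mask, size=2))
# chunks == [[10, 12], [13, 14], [16]]; the last chunk is smaller than size
```
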
Column.assign(data) None[source]

Replace all live values in this column with data.

Works on both full tables and views — on a view, only the rows visible through the view’s mask are overwritten.

Parameters:

data – List, numpy array, or any iterable. Must have exactly as many elements as there are live rows in this column. Values are coerced to the column’s dtype if possible.

Raises:
  • ValueError – If len(data) does not match the number of live rows, or the table is opened read-only.

  • TypeError – If values cannot be coerced to the column’s dtype.
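
As an illustration of the view semantics and the length check, here is a minimal sketch using plain lists. `assign_visible`, `storage`, and `visible_mask` are hypothetical names chosen for this example; the real method writes into compressed storage rather than returning a new list.

```python
def assign_visible(storage, visible_mask, data):
    """Return a copy of `storage` where only visible positions are overwritten."""
    data = list(data)
    n_visible = sum(visible_mask)
    if len(data) != n_visible:
        raise ValueError(f"expected {n_visible} values, got {len(data)}")
    replacement = iter(data)
    # Rows hidden by the mask keep their old value.
    return [next(replacement) if vis else old
            for old, vis in zip(storage, visible_mask)]


storage = [1.0, 2.0, 3.0, 4.0]
visible_mask = [False, True, False, True]  # the view exposes rows 1 and 3
updated = assign_visible(storage, visible_mask, [20.0, 40.0])
# updated == [1.0, 20.0, 3.0, 40.0]
```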

Row transformers

Column.row_transformer builds row-wise projections and reductions for fixed-shape ndarray columns. Use these transformers with CTable.add_generated_column() when the generated value should be computed from each row’s ndarray payload rather than from scalar columns:

t.add_generated_column(
    "embedding_norm",
    values=t.embedding.row_transformer.norm(axis=0),
    dtype=blosc2.float64(),
)
t.add_generated_column(
    "image_mean_rgb",
    values=t.image.row_transformer.mean(axis=(0, 1)),
    dtype=blosc2.ndarray((3,), dtype=blosc2.float32()),
)

class blosc2.RowTransformer(source: str, *, selection=(), op: str | None = None, axis=None, ord=None)[source]

Row-wise transformer for fixed-shape ndarray columns.

A row transformer sees one table row at a time. For a source column with physical shape (nrows, *item_shape), axes passed to reductions are axes within item_shape (so they are shifted by one for batch evaluation).

Methods

argmax

argmin

max

mean

min

norm

sum
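
The "shifted by one" rule for axes can be sketched as a small helper. `batch_axes` is a hypothetical function written for this example, not part of blosc2; it only shows how an axis given relative to item_shape maps onto the physical (nrows, *item_shape) array.

```python
def batch_axes(axis, item_ndim):
    """Map axes within item_shape to axes of the (nrows, *item_shape) array."""
    if axis is None:
        # Reduce over every item axis, but never over the row axis 0.
        return tuple(range(1, item_ndim + 1))
    axes = (axis,) if isinstance(axis, int) else tuple(axis)
    shifted = []
    for ax in axes:
        if not -item_ndim <= ax < item_ndim:
            raise ValueError(f"axis {ax} is out of bounds for item_ndim={item_ndim}")
        # Normalise negative axes first, then skip over the leading row axis.
        shifted.append(ax % item_ndim + 1)
    return tuple(shifted)


# An embedding column with item_shape (d,): norm(axis=0) reduces batch axis 1.
embedding_axes = batch_axes(0, item_ndim=1)    # (1,)
# An image column with item_shape (H, W, 3): mean(axis=(0, 1)) leaves 3 values per row.
image_axes = batch_axes((0, 1), item_ndim=3)   # (1, 2)
```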

Nullable helpers

Column.is_null()

Return a boolean array True where the live value is the null sentinel.

Column.notnull()

Return a boolean array True where the live value is not the null sentinel.

Column.null_count()

Return the number of live rows whose value equals the null sentinel.

Column.is_null() ndarray[source]

Return a boolean array True where the live value is the null sentinel.

For varlen scalar columns (vlstring/vlbytes) nullability is represented as native None values, so this returns True wherever the value is None. For dictionary columns, returns True where the code equals the null_code (-1 by default).

Column.notnull() ndarray[source]

Return a boolean array True where the live value is not the null sentinel.

Column.null_count() int[source]

Return the number of live rows whose value equals the null sentinel.

Returns 0 in O(1) if no null_value is configured for this column and the column is not a varlen scalar column.
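
The sentinel convention behind these helpers can be sketched in plain Python. The function names mirror the methods above but are standalone, hypothetical helpers; the NaN branch is included only because NaN never compares equal to itself, not because the library is known to use NaN as a sentinel.

```python
import math


def _matches(value, null_value):
    # NaN != NaN, so a NaN sentinel needs a dedicated check.
    if isinstance(null_value, float) and math.isnan(null_value):
        return isinstance(value, float) and math.isnan(value)
    return value == null_value


def is_null(values, null_value):
    if null_value is None:          # no sentinel configured
        return [False] * len(values)
    return [_matches(v, null_value) for v in values]


def notnull(values, null_value):
    return [not flag for flag in is_null(values, null_value)]


def null_count(values, null_value):
    if null_value is None:
        return 0                    # nothing to scan
    return sum(is_null(values, null_value))


ages = [31, -1, 27, -1]             # -1 plays the role of the null sentinel
mask = is_null(ages, null_value=-1)  # [False, True, False, True]
```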

Unique values

Column.unique()

Return sorted array of unique live, non-null values.

Column.value_counts()

Return a {value: count} dict sorted by count descending.

Column.unique() ndarray[source]

Return sorted array of unique live, non-null values.

Null sentinel values are excluded. Processes data in chunks — never loads the full column at once.

Column.value_counts() dict[source]

Return a {value: count} dict sorted by count descending.

Null sentinel values are excluded. Processes data in chunks — never loads the full column at once.

Example

>>> t["active"].value_counts()
{True: 8432, False: 1568}
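
A chunk-at-a-time count like the one described above might look as follows. This is a sketch under stated assumptions: `value_counts` and `unique` here are standalone helpers operating on an iterable of plain lists, and the empty string is used as an example sentinel.

```python
from collections import Counter


def value_counts(chunks, null_value=None):
    """Count values chunk by chunk, skipping the null sentinel."""
    counts = Counter()
    for chunk in chunks:
        counts.update(v for v in chunk if null_value is None or v != null_value)
    # most_common() orders by count, descending; dict keeps that order.
    return dict(counts.most_common())


def unique(chunks, null_value=None):
    seen = set()
    for chunk in chunks:
        seen.update(v for v in chunk if null_value is None or v != null_value)
    return sorted(seen)


chunks = [["a", "b", "a"], ["", "a", "b"], ["c"]]
counts = value_counts(chunks, null_value="")
# counts == {"a": 3, "b": 2, "c": 1}
```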

Aggregates

Null sentinel values are automatically excluded from all aggregates.

Column.sum([dtype, axis, where, jit, ...])

Sum of all live, non-null values.

Column.min([axis, where])

Minimum live, non-null value.

Column.max([axis, where])

Maximum live, non-null value.

Column.argmin([axis, where])

Index of the minimum live, non-null value.

Column.argmax([axis, where])

Index of the maximum live, non-null value.

Column.mean([axis, where])

Arithmetic mean of all live, non-null values.

Column.std([ddof, axis, where])

Standard deviation of all live, non-null values (single-pass, Welford's algorithm).

Column.any()

Return True if at least one live, non-null value is True.

Column.all()

Return True if every live, non-null value is True.

Column.sum(dtype=None, axis=None, *, where=None, jit=None, jit_backend=None)[source]

Sum of all live, non-null values.

Returns zero for an empty column or filtered view.

Supported dtypes: bool, int, uint, float, complex. Bool values are counted as 0 / 1. Null sentinel values are skipped.

Parameters:
  • dtype – Optional accumulator dtype. When omitted, float columns use np.float64, complex columns use np.complex128, and integer / bool columns use np.int64.

  • where – Optional boolean predicate. Only rows where the predicate is true, the table row is live, and this column is non-null are included. This enables direct filtered aggregate pushdown, avoiding creation of an intermediate filtered table view.

  • jit – Optional miniexpr JIT policy passed to the lazy reduction engine.

  • jit_backend – Optional miniexpr JIT backend. Use "tcc" or "cc".

Examples

Sum values matching a predicate without materializing a filtered view:

total = t["amount"].sum(where=t.category == 3)

Combine several column predicates:

total = t.col2.sum(where=(t.col1 < 300) & (t.col2 < 400))

Nullable sentinel values are skipped automatically:

# Equivalent to summing only live rows where predicate is true and
# t.col2 is not its configured null sentinel.
total = t.col2.sum(where=t.col1 < 300)
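
The three conditions combined by where (predicate true, row live, value non-null) can be sketched with plain lists. `filtered_sum` and its arguments are hypothetical names; the real engine evaluates this lazily over compressed chunks.

```python
def filtered_sum(values, live, predicate, null_value=None):
    """Sum values where the row is live, the predicate holds, and the value is non-null."""
    total = 0
    for value, is_live, keep in zip(values, live, predicate):
        if is_live and keep and (null_value is None or value != null_value):
            total += value
    return total


col2 = [100, 250, -1, 400, 50]            # -1 is the example null sentinel
live = [True, True, True, False, True]    # row 3 has been deleted
pred = [True, False, True, True, True]    # e.g. the result of col1 < 300
total = filtered_sum(col2, live, pred, null_value=-1)
# total == 150 (rows 0 and 4 only)
```
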
Column.min(axis=None, *, where=None)[source]

Minimum live, non-null value.

Supported dtypes: bool, int, uint, float, string, bytes. Strings are compared lexicographically. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included.

Column.max(axis=None, *, where=None)[source]

Maximum live, non-null value.

Supported dtypes: bool, int, uint, float, string, bytes. Strings are compared lexicographically. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included.

Column.argmin(axis=None, *, where=None)[source]

Index of the minimum live, non-null value.

For fixed-shape ndarray columns, this follows NumPy axis semantics on the logical array of shape (nrows, *item_shape). For scalar columns, the result is the logical row position within this column (or filtered view).

Column.argmax(axis=None, *, where=None)[source]

Index of the maximum live, non-null value.

For fixed-shape ndarray columns, this follows NumPy axis semantics on the logical array of shape (nrows, *item_shape). For scalar columns, the result is the logical row position within this column (or filtered view).

Column.mean(axis=None, *, where=None)[source]

Arithmetic mean of all live, non-null values.

Supported dtypes: bool, int, uint, float. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included. Always returns a Python float.

Column.std(ddof: int = 0, axis=None, *, where=None)[source]

Standard deviation of all live, non-null values (single-pass, Welford’s algorithm).

Parameters:
  • ddof – Delta degrees of freedom. 0 (default) gives the population std; 1 gives the sample std (divides by N-1).

  • where – Optional boolean predicate. Only rows where the predicate is true, the table row is live, and this column is non-null are included.

Null sentinel values are skipped. Always returns a Python float.
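
For reference, a textbook single-pass Welford update with a ddof parameter looks like this. It is a generic sketch of the algorithm named above, not the library's code; returning NaN when N - ddof is not positive is an assumption of this example.

```python
import math


def welford_std(values, ddof=0):
    """Single-pass standard deviation using Welford's running update."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n - ddof <= 0:
        return math.nan  # assumption: undefined result reported as NaN
    return math.sqrt(m2 / (n - ddof))


data = [2, 4, 4, 4, 5, 5, 7, 9]
population_std = welford_std(data)         # 2.0
sample_std = welford_std(data, ddof=1)     # sqrt(32 / 7)
```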

Column.any() bool[source]

Return True if at least one live, non-null value is True.

Supported dtypes: bool. Null sentinel values are skipped. Short-circuits on the first True found.

Column.all() bool[source]

Return True if every live, non-null value is True.

Supported dtypes: bool. Null sentinel values are skipped. Short-circuits on the first False found.


Schema Specs

Schema specs are passed to field() to declare a column’s type, storage constraints, and optional null sentinel. They are also available directly in the blosc2 namespace (e.g. blosc2.int64).

blosc2.field(spec: ~blosc2.schema.SchemaSpec, *, default=<dataclasses._MISSING_TYPE object>, cparams: dict[str, ~typing.Any] | None = None, dparams: dict[str, ~typing.Any] | None = None, chunks: tuple[int, ...] | None = None, blocks: tuple[int, ...] | None = None) Field[source]

Attach a Blosc2 schema spec and per-column storage options to a dataclass field.

Parameters:
  • spec – A schema descriptor such as b2.int64(ge=0) or b2.float64().

  • default – Default value for the field. Omit for required fields.

  • cparams – Compression parameters for this column’s NDArray.

  • dparams – Decompression parameters for this column’s NDArray.

  • chunks – Chunk shape for this column’s NDArray.

  • blocks – Block shape for this column’s NDArray.

Examples

>>> from dataclasses import dataclass
>>> import blosc2 as b2
>>> @dataclass
... class Row:
...     id: int = b2.field(b2.int64(ge=0))
...     score: float = b2.field(b2.float64(ge=0, le=100))
...     active: bool = b2.field(b2.bool(), default=True)

Numeric

int8(*[, ge, gt, le, lt, nullable, null_value])

8-bit signed integer column (−128 … 127).

int16(*[, ge, gt, le, lt, nullable, null_value])

16-bit signed integer column (−32 768 … 32 767).

int32(*[, ge, gt, le, lt, nullable, null_value])

32-bit signed integer column (−2 147 483 648 … 2 147 483 647).

int64(*[, ge, gt, le, lt, nullable, null_value])

64-bit signed integer column.

uint8(*[, ge, gt, le, lt, nullable, null_value])

8-bit unsigned integer column (0 … 255).

uint16(*[, ge, gt, le, lt, nullable, null_value])

16-bit unsigned integer column (0 … 65 535).

uint32(*[, ge, gt, le, lt, nullable, null_value])

32-bit unsigned integer column (0 … 4 294 967 295).

uint64(*[, ge, gt, le, lt, nullable, null_value])

64-bit unsigned integer column.

float32(*[, ge, gt, le, lt, nullable, ...])

32-bit floating-point column (single precision).

float64(*[, ge, gt, le, lt, nullable, ...])

64-bit floating-point column (double precision).

timestamp(*[, unit, timezone, nullable, ...])

Timestamp column stored as signed 64-bit epoch offsets.
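
The integer ranges quoted above follow directly from the bit width, which also gives the 64-bit limits that the summaries leave implicit. `int_range` is a hypothetical helper for illustration only.

```python
def int_range(bits, signed=True):
    """Return (min, max) for a two's-complement or unsigned integer of `bits` bits."""
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


int8_range = int_range(8)                    # (-128, 127)
uint16_range = int_range(16, signed=False)   # (0, 65535)
int64_range = int_range(64)
uint64_range = int_range(64, signed=False)
```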

class blosc2.int8(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

8-bit signed integer column (−128 … 127).

Methods

python_type

alias of int

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of int8

class blosc2.int16(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

16-bit signed integer column (−32 768 … 32 767).

Methods

python_type

alias of int

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of int16

class blosc2.int32(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

32-bit signed integer column (−2 147 483 648 … 2 147 483 647).

Methods

python_type

alias of int

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of int32

class blosc2.int64(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

64-bit signed integer column.

Methods

python_type

alias of int

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of int64

class blosc2.uint8(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

8-bit unsigned integer column (0 … 255).

Methods

python_type

alias of int

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of uint8

class blosc2.uint16(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

16-bit unsigned integer column (0 … 65 535).

Methods

python_type

alias of int

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of uint16

class blosc2.uint32(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

32-bit unsigned integer column (0 … 4 294 967 295).

Methods

python_type

alias of int

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of uint32

class blosc2.uint64(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

64-bit unsigned integer column.

Methods

python_type

alias of int

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of uint64

class blosc2.float32(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

32-bit floating-point column (single precision).

Methods

python_type

alias of float

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of float32

class blosc2.float64(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

64-bit floating-point column (double precision).

Methods

python_type

alias of float

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of float64

class blosc2.timestamp(*, unit: str = 'us', timezone: str | None = None, nullable: bool = False, null_value=None)[source]

Timestamp column stored as signed 64-bit epoch offsets.

The physical storage dtype is int64. unit follows Arrow/NumPy datetime units: "s", "ms", "us" or "ns". timezone is metadata preserved for Arrow/Parquet roundtrips.

Methods

python_type

alias of object

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of int64
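
To make "signed 64-bit epoch offsets" concrete, the sketch below converts between Python datetime objects and integer offsets for each unit using only the standard library. The helper names are hypothetical, and treating naive datetimes as UTC is an assumption of this example rather than documented library behaviour.

```python
from datetime import datetime, timedelta, timezone

UNITS_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_offset(when, unit="us"):
    """Convert a datetime to an integer count of `unit` since the Unix epoch."""
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    delta = when - EPOCH
    micros = (delta.days * 86400 + delta.seconds) * 10**6 + delta.microseconds
    return micros * UNITS_PER_SECOND[unit] // 10**6


def from_epoch_offset(offset, unit="us"):
    """Convert an integer offset back to an aware UTC datetime (microsecond precision)."""
    micros = offset * 10**6 // UNITS_PER_SECOND[unit]
    return EPOCH + timedelta(microseconds=micros)


noon = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
offset_us = to_epoch_offset(noon, "us")   # 1735732800000000
```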

Complex

complex64()

64-bit complex number column (two 32-bit floats).

complex128()

128-bit complex number column (two 64-bit floats).

class blosc2.complex64[source]

64-bit complex number column (two 32-bit floats).

Methods

python_type

alias of complex

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of complex64

class blosc2.complex128[source]

128-bit complex number column (two 64-bit floats).

Methods

python_type

alias of complex

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of complex128

Boolean

bool(*[, nullable, null_value])

Boolean column.

class blosc2.bool(*, nullable: bool = False, null_value=None)[source]

Boolean column.

Nullable bool columns use uint8 physical storage with values 0 (false), 1 (true), and 255 (null).

Methods

python_type

alias of bool

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of bool
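
The three-state uint8 encoding for nullable booleans can be sketched as a pair of helpers. `encode_bool` and `decode_bool` are hypothetical names; only the 0 / 1 / 255 mapping comes from the description above.

```python
NULL_CODE = 255  # uint8 value reserved for null, per the description above


def encode_bool(value):
    """Map True/False/None to the uint8 codes 1/0/255."""
    if value is None:
        return NULL_CODE
    return 1 if value else 0


def decode_bool(code):
    """Map a stored uint8 code back to True/False/None."""
    if code == NULL_CODE:
        return None
    if code in (0, 1):
        return bool(code)
    raise ValueError(f"invalid nullable-bool code: {code}")


stored = bytes(encode_bool(v) for v in [True, False, None])
# stored == b"\x01\x00\xff"
decoded = [decode_bool(c) for c in stored]
```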

Text & binary

string(*[, min_length, max_length, pattern, ...])

Fixed-width Unicode string column.

bytes(*[, min_length, max_length, nullable, ...])

Fixed-width bytes column.

vlstring(*[, nullable, serializer, ...])

Build a variable-length scalar string schema descriptor.

vlbytes(*[, nullable, serializer, ...])

Build a variable-length scalar bytes schema descriptor.

class blosc2.string(*, min_length=None, max_length=None, pattern=None, nullable: bool = False, null_value=None)[source]

Fixed-width Unicode string column.

Parameters:
  • max_length – Maximum number of characters. Determines the NumPy U<n> dtype. Defaults to 32 if not specified.

  • min_length – Minimum number of characters (validation only, no effect on dtype).

  • pattern – Regex pattern the value must match (validation only).

  • nullable – If True and null_value is not set, choose a null sentinel from the current CTable null policy when the schema is compiled.

  • null_value – Explicit null sentinel. Takes precedence over nullable=True.

Methods

python_type

alias of str

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

class blosc2.bytes(*, min_length=None, max_length=None, nullable: bool = False, null_value=None)[source]

Fixed-width bytes column.

Parameters:
  • max_length – Maximum number of bytes. Determines the NumPy S<n> dtype. Defaults to 32 if not specified.

  • min_length – Minimum number of bytes (validation only, no effect on dtype).

  • nullable – If True and null_value is not set, choose a null sentinel from the current CTable null policy when the schema is compiled.

  • null_value – Explicit null sentinel. Takes precedence over nullable=True.

Methods

python_type

alias of bytes

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

blosc2.vlstring(*, nullable: bool = False, serializer: str = 'msgpack', batch_rows: int | None = 2048, items_per_block: int | None = None) VLStringSpec[source]

Build a variable-length scalar string schema descriptor.

Use this as an explicit opt-in when a CTable column holds long or wildly variable-length strings that would waste space in a fixed-width string(max_length=N) column. Must be requested via blosc2.field(blosc2.vlstring()) — it is never inferred automatically from plain str annotations.
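
A rough, uncompressed size comparison shows why a fixed-width column can be wasteful for skewed string lengths. The helpers below are hypothetical; they rely only on the fact that NumPy's U<n> dtype reserves 4 bytes per character slot, and they ignore compression, which narrows the gap in practice.

```python
def fixed_width_nbytes(strings, max_length):
    """Uncompressed size of a U<max_length> column: 4 bytes per character slot."""
    return 4 * max_length * len(strings)


def utf8_payload_nbytes(strings):
    """Size of the raw UTF-8 payload, ignoring any per-item framing overhead."""
    return sum(len(s.encode("utf-8")) for s in strings)


messages = ["ok", "x" * 500, "warn"]
fixed = fixed_width_nbytes(messages, max_length=500)   # 6000 bytes
payload = utf8_payload_nbytes(messages)                # 506 bytes
```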

blosc2.vlbytes(*, nullable: bool = False, serializer: str = 'msgpack', batch_rows: int | None = 2048, items_per_block: int | None = None) VLBytesSpec[source]

Build a variable-length scalar bytes schema descriptor.

Use this as an explicit opt-in when a CTable column holds long or wildly variable-length byte strings. Must be requested via blosc2.field(blosc2.vlbytes()) — it is never inferred automatically from plain bytes annotations.

Array, encoded, and compound specs

ndarray(item_shape[, dtype, nullable, ...])

Build a fixed-shape N-D array descriptor for CTable columns.

dictionary(*[, index_type, value_type, ...])

Build a dictionary-encoded string column descriptor.

struct(fields, *[, nullable])

Build a structured schema descriptor for dict-like CTable values.

list(item_spec, *[, nullable, storage, ...])

Build a list-valued schema descriptor for CTable and ListArray.

object(*[, nullable, serializer, ...])

Build a schema-less Python object column descriptor for CTable.

blosc2.ndarray(item_shape, dtype=<class 'numpy.float64'>, *, nullable: ~blosc2.schema.bool = False, null_value=None) NDArraySpec[source]

Build a fixed-shape N-D array descriptor for CTable columns.

blosc2.dictionary(*, index_type=None, value_type=None, ordered: bool = False, nullable: bool = True) DictionarySpec[source]

Build a dictionary-encoded string column descriptor.

Dictionary columns store repeated string values as compact int32 codes with a separate global dictionary of unique string values. This matches Arrow dictionary encoding and is ideal for low-cardinality string columns such as categories or enumerated values.

Parameters:
  • index_type – The physical type for category codes. Must be blosc2.int32() in v1. Defaults to blosc2.int32() when not specified.

  • value_type – The type of dictionary values. Must be blosc2.vlstring() in v1. Defaults to blosc2.vlstring() when not specified.

  • ordered – If True, dictionary order is semantically meaningful.

  • nullable – If True (default), null row values are allowed (stored as code -1).
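
Dictionary encoding itself is easy to sketch in plain Python. `dictionary_encode` and `dictionary_decode` are hypothetical helpers; lists stand in for the int32 code array and the global dictionary, and -1 is the null code mentioned above.

```python
def dictionary_encode(values, null_code=-1):
    """Return (codes, dictionary) with one integer code per row."""
    dictionary = []   # unique values in first-seen order
    index = {}        # value -> code lookup
    codes = []
    for value in values:
        if value is None:
            codes.append(null_code)
            continue
        code = index.get(value)
        if code is None:
            code = len(dictionary)
            index[value] = code
            dictionary.append(value)
        codes.append(code)
    return codes, dictionary


def dictionary_decode(codes, dictionary, null_code=-1):
    return [None if c == null_code else dictionary[c] for c in codes]


colors = ["red", "blue", "red", None, "blue"]
codes, dictionary = dictionary_encode(colors)
# codes == [0, 1, 0, -1, 1]; dictionary == ["red", "blue"]
```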

blosc2.struct(fields: dict[str, SchemaSpec], *, nullable: bool = False) StructSpec[source]

Build a structured schema descriptor for dict-like CTable values.

Top-level struct columns store one dictionary (or None when nullable) per row. Struct specs may also be nested as list item specs.

blosc2.list(item_spec: SchemaSpec, *, nullable: bool = False, storage: str = 'batch', serializer: str = 'msgpack', batch_rows: int | None = None, items_per_block: int | None = None) ListSpec[source]

Build a list-valued schema descriptor for CTable and ListArray.

blosc2.object(*, nullable: bool = False, serializer: str = 'msgpack', batch_rows: int | None = 2048, items_per_block: int | None = None) ObjectSpec[source]

Build a schema-less Python object column descriptor for CTable.

Values are stored via batched msgpack serialization. Prefer typed specs such as struct(), list(), vlstring(), or vlbytes() when the data has a stable schema; use object for heterogeneous per-row payloads.

Timestamp columns

Timestamp columns are declared with blosc2.timestamp and store signed 64-bit epoch offsets with timestamp metadata. Column reads return numpy.datetime64 values; comparisons accept numpy.datetime64 values, ISO-like strings, or Python datetime objects; and Arrow/Parquet import/export preserves timestamp units and time zones:

from dataclasses import dataclass
import numpy as np
import blosc2 as b2

@dataclass
class Event:
    when: np.datetime64 = b2.field(b2.timestamp(unit="us", nullable=True))
    value: int = b2.field(b2.int64())

table = b2.CTable(Event)
table.append(["2025-01-01T12:00:00", 42])
recent = table[table.when >= np.datetime64("2025-01-01", "us")]

Object columns

Schema-less object columns are declared with blosc2.object() and store one msgpack-serializable Python object (or None when nullable) per row in batched variable-length storage. Prefer typed specs such as blosc2.struct() or blosc2.list() when the payload has a stable schema; use object columns for heterogeneous per-row payloads:

from dataclasses import dataclass
import blosc2 as b2

@dataclass
class Event:
    id: int = b2.field(b2.int64())
    payload: object = b2.field(b2.object(nullable=True))

table = b2.CTable(Event)
table.append([1, {"kind": "click", "xy": [10, 20]}])
table.append([2, ("custom", {"nested": True})])
table.append([3, None])

Object columns have no fixed Arrow type, so CTable.to_arrow() and CTable.to_parquet() raise an error for them unless users first convert the payloads to a typed representation. They are not used as an implicit fallback during Parquet import; unsupported Arrow/Parquet types still raise an error unless explicitly imported through CTable.from_arrow() with object_fallback=True.

Nested fields

CTable supports first-class nested struct schemas by physically flattening struct leaves into independent compressed columns. This keeps analytics fast (each leaf is an ordinary NDArray), while preserving the logical nested row shape on read.

Automatic flattening from Arrow / Parquet

When CTable.from_arrow() or CTable.from_parquet() encounters a top-level struct<…> field, it recursively flattens every scalar leaf into a dotted column name and stores each leaf as its own physical column:

import pyarrow as pa
import blosc2

trip_type = pa.struct([
    ("begin", pa.struct([("lon", pa.float64()), ("lat", pa.float64())])),
    ("end",   pa.struct([("lon", pa.float64()), ("lat", pa.float64())])),
])
schema = pa.schema([pa.field("trip", trip_type),
                    pa.field("fare", pa.float64())])
batch = pa.record_batch(
    [pa.array([{"begin": {"lon": -87.6, "lat": 41.8},
                "end":   {"lon": -87.7, "lat": 41.9}}],
              type=trip_type),
     pa.array([12.5])],
    schema=schema,
)

t = blosc2.CTable.from_arrow(schema, [batch])
# t.col_names → ['trip.begin.lon', 'trip.begin.lat',
#                 'trip.end.lon',   'trip.end.lat', 'fare']
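
The recursive flattening can be pictured with a few lines of plain Python operating on a nested dict. `flatten` is a hypothetical helper, and this sketch deliberately ignores escaping of special characters in field names.

```python
def flatten(row, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted leaf names."""
    flat = {}
    for key, value in row.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))   # recurse into struct fields
        else:
            flat[name] = value                  # scalar leaf
    return flat


row = {"trip": {"begin": {"lon": -87.6, "lat": 41.8},
                "end":   {"lon": -87.7, "lat": 41.9}},
       "fare": 12.5}
flat_row = flatten(row)
# list(flat_row) == ['trip.begin.lon', 'trip.begin.lat',
#                    'trip.end.lon', 'trip.end.lat', 'fare']
```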

Column access

Nested leaves are accessed with their dotted logical name or via chained attribute proxies:

t["trip.begin.lon"].mean()      # Column object (fast path)
t.trip.begin.lon.max()          # attribute proxy, same column

A literal ., /, or \ inside an Arrow field name is escaped with a backslash in the logical column name. For example, path segments ("trip.info", "begin/point", "lon.deg") become:

t[r"trip\.info.begin\/point.lon\.deg"]

Such leaves are stored with percent-encoded path segments under _cols; the example above is stored at _cols/trip%2Einfo/begin%2Fpoint/lon%2Edeg.
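
Both naming schemes can be reproduced for the example above with two small helpers. `logical_name` and `storage_path` are hypothetical functions; percent-encoding of % and \ in the storage path is an assumption of this sketch, since the text only shows . and / being encoded.

```python
def logical_name(segments):
    """Join path segments with '.', backslash-escaping '.', '/' and '\\'."""
    def escape(segment):
        # Escape the backslash first so later escapes are not doubled.
        for ch in ("\\", ".", "/"):
            segment = segment.replace(ch, "\\" + ch)
        return segment
    return ".".join(escape(s) for s in segments)


def storage_path(segments, root="_cols"):
    """Join path segments with '/', percent-encoding characters that would be ambiguous."""
    def encode(segment):
        # Assumption: '%' and '\\' are encoded too; '%' goes first to stay unambiguous.
        for ch in ("%", ".", "/", "\\"):
            segment = segment.replace(ch, f"%{ord(ch):02X}")
        return segment
    return "/".join([root, *(encode(s) for s in segments)])


segments = ("trip.info", "begin/point", "lon.deg")
name = logical_name(segments)   # trip\.info.begin\/point.lon\.deg
path = storage_path(segments)   # _cols/trip%2Einfo/begin%2Fpoint/lon%2Edeg
```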

Filtering and expressions

Dotted names work everywhere a flat column name would:

t.where("trip.begin.lon > -87.7 and fare > 10")
t.where(t.trip.begin.lon > -87.7)

Select / projection

A struct prefix expands to all descendant leaves:

t.select(["trip.begin"])        # → columns trip.begin.lon, trip.begin.lat
t.select(["trip"])              # → all four trip.* leaves
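
Prefix expansion amounts to matching each requested name against the dotted leaf names. The sketch below uses a hypothetical `expand_selection` helper and ignores escaped dots inside field names.

```python
def expand_selection(names, col_names):
    """Expand exact names and struct prefixes into the matching leaf columns."""
    selected = []
    for name in names:
        matches = [c for c in col_names if c == name or c.startswith(name + ".")]
        if not matches:
            raise KeyError(name)
        for col in matches:
            if col not in selected:   # keep order, avoid duplicates
                selected.append(col)
    return selected


col_names = ["trip.begin.lon", "trip.begin.lat",
             "trip.end.lon", "trip.end.lat", "fare"]
begin_cols = expand_selection(["trip.begin"], col_names)
# begin_cols == ['trip.begin.lon', 'trip.begin.lat']
```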

Indexes and aggregates

Scalar leaf columns support all the same operations as flat columns:

t.create_index(col_name="trip.begin.lon")
t.where("trip.begin.lon > -87.7").nrows   # uses the index

Row reconstruction

Single-row access reconstructs the original nested dict shape:

row = t[0]
row.trip       # → {"begin": {"lon": ..., "lat": ...}, "end": {...}}
row.fare       # → 12.5
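
Reconstruction is the inverse of flattening: dotted leaf names are split and nested dicts are rebuilt level by level. `unflatten` is a hypothetical helper that ignores escaped dots.

```python
def unflatten(flat):
    """Rebuild a nested dict from a dict keyed by dotted leaf names."""
    nested = {}
    for name, value in flat.items():
        *parents, leaf = name.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})   # create intermediate structs on demand
        node[leaf] = value
    return nested


flat_row = {"trip.begin.lon": -87.6, "trip.begin.lat": 41.8,
            "trip.end.lon": -87.7, "trip.end.lat": 41.9, "fare": 12.5}
nested_row = unflatten(flat_row)
# nested_row["trip"]["begin"] == {"lon": -87.6, "lat": 41.8}
```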

Inserting nested rows

CTable.append() and CTable.extend() accept either the flat dotted form or the original nested dict / list-of-dicts shape:

# flat dotted keys
t.append({"trip.begin.lon": -87.6, "trip.begin.lat": 41.8,
          "trip.end.lon": -87.7,   "trip.end.lat": 41.9, "fare": 12.5})

# original nested dict (auto-flattened)
t.append({"trip": {"begin": {"lon": -87.6, "lat": 41.8},
                    "end":   {"lon": -87.7, "lat": 41.9}},
          "fare": 12.5})

# extend with a list of nested dicts
t.extend([
    {"trip": {"begin": {"lon": -87.6, "lat": 41.8},
              "end":   {"lon": -87.7, "lat": 41.9}}, "fare": 12.5},
    {"trip": {"begin": {"lon": -87.5, "lat": 41.7},
              "end":   {"lon": -87.8, "lat": 41.6}}, "fare": 8.0},
])

Physical storage layout

Leaf columns are stored under a hierarchical path in the backing container: /_cols/trip/begin/lon, /_cols/trip/begin/lat, etc. Intermediate nodes are namespaces only; no data is stored at non-leaf levels.

Arrow / Parquet round-trip

CTable.to_parquet() and CTable.to_arrow() reconstruct the original nested Arrow schema from the stored metadata, so round-trips are lossless:

t.to_parquet("out.parquet")    # Arrow schema has top-level "trip" struct

Struct columns

Struct columns are declared with blosc2.struct() and store one dictionary (or None when nullable) per row in batched variable-length storage. They are also used for top-level Arrow/Parquet struct<...> columns that are imported without the nested-leaf flattening path described above:

from dataclasses import dataclass
import blosc2 as b2

@dataclass
class Product:
    properties: dict = b2.field(
        b2.struct({"code": b2.int32(), "label": b2.vlstring()}, nullable=True)
    )

table = b2.CTable(Product)
table.append([{"code": 1, "label": "fresh"}])
table.append([None])

List columns

List columns are declared with blosc2.list(), for example:

from dataclasses import dataclass
import blosc2 as b2

@dataclass
class Product:
    code: str = b2.field(b2.string(max_length=8))
    tags: list[str] = b2.field(b2.list(b2.string(), nullable=True))

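Only whole-cell replacement is supported: a list read from the column is a local Python copy, so mutating it does not change the table. Assign the modified list back to persist it: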

table = b2.CTable(Product)
table.append(["A1", ["new"]])

row_tags = table.tags[0]
row_tags.append("extra")      # local Python list only
table.tags[0] = row_tags      # explicit write-back
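
The reason in-place edits are lost can be shown with a toy column that serializes each cell on write and deserializes on read. `ToyListColumn` is purely illustrative and uses JSON from the standard library as a stand-in for the real serializer.

```python
import json


class ToyListColumn:
    """Stores each cell in serialized form, so reads return fresh copies."""

    def __init__(self, cells):
        self._cells = [json.dumps(cell) for cell in cells]

    def __getitem__(self, i):
        return json.loads(self._cells[i])    # a new list object on every read

    def __setitem__(self, i, value):
        self._cells[i] = json.dumps(value)   # whole-cell replacement


tags = ToyListColumn([["new"]])
row_tags = tags[0]
row_tags.append("extra")       # mutates the local copy only
unchanged = tags[0]            # still ["new"]
tags[0] = row_tags             # explicit write-back
persisted = tags[0]            # ["new", "extra"]
```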