CTable¶
A columnar compressed table backed by one physical container per column.
Scalar columns use NDArray; list-valued columns use
ListArray. Each column is stored, compressed, and queried
independently; rows are never materialised in their entirety unless you
explicitly call to_arrow() or iterate with
__iter__().
- class blosc2.CTable(row_type: type[RowT], new_data=None, *, urlpath: str | None = None, mode: str = 'a', expected_size: int | None = None, compact: bool = False, validate: bool = True, cparams: dict[str, Any] | None = None, dparams: dict[str, Any] | None = None)[source]¶
Columnar compressed table with typed columns and row-oriented access.
- Attributes:
cbytes – Total compressed size in bytes (all columns + valid_rows mask).
computed_columns – Read-only view of the computed-column definitions.
cratio – Compression ratio for the whole table payload.
indexes – Return a list of blosc2.Index handles for all active indexes.
info – Get information about this table.
info_items – Structured summary items used by info().
nbytes – Total uncompressed size in bytes (all columns + valid_rows mask).
ncols – Total number of columns, including computed (virtual) columns.
nrows
schema – The compiled schema that drives this table’s columns and validation.
Methods
add_column(name, spec) – Add a new column filled from the default declared in spec.
add_computed_column(name, expr, *[, dtype]) – Add a read-only virtual column computed from stored columns.
add_generated_column(name, *, values[, ...]) – Add a stored generated column maintained by the table.
append(data) – Append a single row to the table.
close() – Close any persistent backing store held by this table.
column_schema(name) – Return the CompiledColumn descriptor for name.
compact() – Physically rewrite every column array keeping only live rows.
compact_index([col_name, expression, name]) – Compact an index, merging any incremental append runs.
copy([compact, urlpath, overwrite]) – Return a new standalone copy of this table.
cov() – Return the covariance matrix as a numpy array.
create_index([col_name, field, expression, ...]) – Build and register an index for a stored column or table expression.
delete(ind) – Mark one or more rows as deleted (tombstone deletion).
describe() – Print a per-column statistical summary.
drop_column(name) – Remove a column from the table.
drop_computed_column(name) – Remove a computed column from the table.
drop_index([col_name, expression, name]) – Remove an index and delete any sidecar files.
extend(data, *[, validate]) – Append multiple rows at once.
from_arrow(schema, batches, *[, urlpath, ...]) – Build a CTable from an Arrow schema and iterable of record batches.
from_csv(path, row_cls, *[, header, sep]) – Build a CTable from a CSV file.
from_pandas(df, row_cls) – Build a CTable from a pandas DataFrame.
from_parquet(path, *[, columns, batch_size, ...]) – Read a Parquet file into a CTable.
group_by(keys, *[, sort, dropna, engine, ...]) – Return a deferred group-by object for this table.
head([N]) – Return a view of the first N live rows (default 5).
index([col_name, expression, name]) – Return the index handle for a stored-column or expression target.
iter_arrow_batches(*[, columns, batch_size, ...]) – Yield live rows as bounded-size pyarrow.RecordBatch objects.
iter_sorted(cols[, ascending, start, stop, ...]) – Iterate rows in sorted order without materializing a full copy.
load(urlpath) – Load a persistent table from urlpath into RAM.
materialize_computed_column(name, *[, ...]) – Materialize a computed column into a new stored snapshot column.
open(urlpath, *[, mode]) – Open a persistent CTable from urlpath.
rebuild_index([col_name, expression, name]) – Drop and recreate an index with the same parameters.
refresh_generated_column(name) – Recompute a stored generated/materialized column from its source columns.
refresh_generated_columns(*[, source]) – Refresh all generated columns, optionally only those depending on source.
rename_column(old, new) – Rename a column.
sample(n, *[, seed]) – Return a read-only view of n randomly chosen live rows.
save(urlpath, *[, overwrite]) – Persist this table to disk at urlpath.
Return a JSON-compatible dict describing this table's schema.
select(cols) – Return a column-projection view exposing only cols.
sort_by(cols[, ascending, inplace]) – Return a copy of the table sorted by one or more columns.
tail([N]) – Return a view of the last N live rows (default 5).
to_arrow() – Convert all live rows to a pyarrow.Table.
to_b2d(urlpath, *[, overwrite, compact]) – Write this table to a directory-backed store.
to_b2z(urlpath, *[, overwrite, compact]) – Write this table to a compact .b2z container.
to_csv(path, *[, header, sep]) – Write all live rows to a CSV file.
Convert to a pandas DataFrame.
to_parquet(path, *[, columns, batch_size, ...]) – Write this table to a Parquet file batch-wise using pyarrow.
view(new_valid_rows) – Return a row-filter view backed by a boolean mask array without copying data.
where(expr_result, *[, columns]) – Return a row-filtered view matching a boolean predicate.
Special methods
CTable.__len__() – Return the number of live (non-deleted) rows.
CTable.__iter__() – Iterate over live rows in insertion order, yielding namedtuple-like row objects.
CTable.__getitem__(key) – Type-driven indexing for columns, rows, projections, and filters.
CTable.__getattr__(s) – Convenience fallback for attribute-style column access.
Short CTable<cols>(N rows, X compressed) summary string.
CTable.__str__() – Pandas-style tabular display with column names, dtypes, and a row count footer.
- __len__()[source]¶
Return the number of live (non-deleted) rows.
- __iter__()[source]¶
Iterate over live rows in insertion order, yielding namedtuple-like row objects with one attribute per column.
__getitem__ supports type-driven indexing:
str: a column name returns a Column; any other string is interpreted as a boolean expression and behaves like where().
boolean LazyExpr / NDArray: filtered row view, same as where(), e.g. t[t.temperature_f > 70].
int: single row as a namedtuple-like object.
slice: row-range view.
list[int] / ndarray[int]: gathered-row view.
ndarray[bool]: boolean-mask filtered view.
list[str]: column-projection view (same as select()).
__getattr__ provides convenience attribute-style column access only after normal Python attribute lookup fails; use t["name"] for columns that conflict with table attributes or methods.
- __str__() str[source]¶
Pandas-style tabular display with column names, dtypes, and a row count footer.
- classmethod from_arrow(schema, batches, *, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, capacity_hint: int | None = None, string_max_length: int | Mapping[str, int] | None = None, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, list_serializer: Literal['msgpack', 'arrow'] = 'msgpack', object_fallback: bool = False, column_cparams: Mapping[str, dict[str, Any]] | None = None, separate_nested_cols: bool = False) CTable[source]¶
Build a CTable from an Arrow schema and iterable of record batches.
Nested struct flattening: top-level Arrow struct<…> fields are automatically and recursively flattened into dotted leaf columns. For example, a field trip: struct<begin: struct<lon: float64, lat: float64>> becomes two CTable columns, trip.begin.lon and trip.begin.lat. Each leaf is stored as an independent compressed NDArray. Row reads via t[i] reconstruct the original nested dict shape. Use t["trip.begin.lon"] or t.trip.begin.lon to access a leaf:
import blosc2
import pyarrow as pa
trip_type = pa.struct([("begin", pa.struct([("lon", pa.float64())]))])
schema = pa.schema([pa.field("trip", trip_type)])
t = blosc2.CTable.from_arrow(schema, batches)
t.col_names  # ['trip.begin.lon']
t["trip.begin.lon"].mean()
t.trip.begin.lon.max()
When string_max_length is None (the default), scalar Arrow string/large_string columns are imported as vlstring() columns and binary/large_binary columns are imported as vlbytes() columns. Arrow struct columns that do not contain only scalar leaves are imported as struct() columns backed by batched variable-length storage. Null values for these variable-length scalar columns are represented as native None with no sentinel needed.
When string_max_length is set to a positive integer, scalar string and binary columns are imported as fixed-width string()/bytes() columns whose dtype is sized to string_max_length characters/bytes. It may also be a mapping from column name to max length; omitted string/binary columns remain vlstring()/vlbytes() columns.
blosc2_batch_size controls how many rows are buffered before BatchArray-backed imported columns (list columns and variable-length scalar columns such as vlstring, vlbytes, struct, and schema-less object columns) are flushed to their backend. Set it to None to keep those columns pending until the final flush.
list_serializer selects the backend serializer for imported list columns. "msgpack" is the default; "arrow" stores Arrow list batches directly and can be much faster for deeply nested list columns.
Unsupported Arrow types raise by default. Pass object_fallback=True to import such columns as schema-less object() columns. This fallback is intentionally not used by from_parquet().
column_cparams optionally maps column names to per-column compression parameters. These override the table-level cparams for fixed-width columns imported from Arrow.
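The nested struct flattening described for from_arrow() can be pictured with plain dictionaries. The following is a minimal pure-Python sketch of the idea only: flatten() and unflatten() are hypothetical helper names, not part of blosc2, and the library's actual implementation may differ.

```python
from typing import Any


def flatten(row: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into one dict keyed by dotted leaf names (hypothetical helper)."""
    flat: dict[str, Any] = {}
    for key, value in row.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into the struct, extending the dotted path.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def unflatten(flat: dict[str, Any]) -> dict[str, Any]:
    """Rebuild the nested dict shape from dotted leaf names (hypothetical helper)."""
    nested: dict[str, Any] = {}
    for dotted, value in flat.items():
        *parents, leaf = dotted.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


row = {"trip": {"begin": {"lon": -87.6, "lat": 41.8}}, "payment": {"fare": 12.5}}
flat = flatten(row)          # dotted leaf names, one per would-be column
restored = unflatten(flat)   # same shape as the original row
```

In this sketch a row that already uses flat dotted keys passes through flatten() unchanged, which mirrors how both the flat and the nested forms can describe the same leaf columns.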
- classmethod from_csv(path: str, row_cls, *, header: bool = True, sep: str = ',') CTable[source]¶
Build a CTable from a CSV file.
Schema comes from row_cls (a dataclass); CTable is always typed. All rows are read in a single pass into per-column Python lists, then each column is bulk-written into a pre-allocated NDArray (one slice assignment per column, no extend()).
- Parameters:
path¶ – Source CSV file path.
row_cls¶ – A dataclass whose fields define the column names and types.
header¶ – If True (default), the first row is treated as a header and skipped. Column order in the file must match row_cls field order regardless.
sep¶ – Field delimiter. Defaults to ","; use "\t" for TSV.
- Returns:
A new in-memory CTable containing all rows from the CSV file.
- Return type:
CTable
- Raises:
TypeError – If row_cls is not a dataclass.
ValueError – If a row has a different number of fields than the schema.
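The single-pass, per-column reading strategy described for from_csv() can be sketched with the standard library alone. This is an illustration under stated assumptions, not blosc2 code: Reading is a hypothetical row class, read_columns() is a hypothetical helper, and the final bulk write into compressed arrays is left out.

```python
import csv
import dataclasses
import io


@dataclasses.dataclass
class Reading:  # hypothetical row class, used only for this sketch
    sensor_id: int
    temperature: float


def read_columns(text, row_cls, *, header=True, sep=","):
    """Read CSV text in one pass into per-column Python lists (hypothetical helper)."""
    if not dataclasses.is_dataclass(row_cls):
        raise TypeError("row_cls must be a dataclass")
    fields = dataclasses.fields(row_cls)
    columns = {f.name: [] for f in fields}
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    if header:
        next(reader, None)  # skip the header; column order comes from row_cls
    for lineno, record in enumerate(reader, start=1):
        if len(record) != len(fields):
            raise ValueError(f"row {lineno}: expected {len(fields)} fields, got {len(record)}")
        for f, raw in zip(fields, record):
            # Assumes plain type annotations (int, float, str) usable as converters.
            columns[f.name].append(f.type(raw))
    return columns


columns = read_columns("sensor_id,temperature\n1,20.5\n2,21.0\n", Reading)
```

Each list in columns could then be written to its column container with one bulk assignment, which is the step the description above attributes to the pre-allocated NDArray.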
- classmethod from_pandas(df, row_cls) CTable[source]¶
Build a CTable from a pandas DataFrame.
Schema comes from row_cls (a dataclass); CTable is always typed. Object-dtype DataFrame columns are not automatically inferred as ndarray columns; the row_cls must explicitly declare blosc2.ndarray() fields.
- Parameters:
- Returns:
A new CTable containing all DataFrame rows.
- Return type:
CTable
- Raises:
TypeError – If row_cls is not a dataclass.
ValueError – If DataFrame columns do not match the row_cls schema.
- classmethod from_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, list_serializer: Literal['msgpack', 'arrow'] = 'arrow', separate_nested_cols: bool = True, max_rows: int | None = None, **kwargs) CTable[source]¶
Read a Parquet file into a CTable.
The Parquet file is streamed batch by batch through pyarrow and then converted into a typed CTable. By default, the result is created in memory, but you can also persist it on disk via urlpath.
This method delegates the actual table construction to CTable.from_arrow(), so Arrow schema handling, nullable-column support, and Blosc2 write tuning follow the same rules as that method.
Nested struct flattening: top-level Parquet struct<…> fields are automatically and recursively flattened into dotted leaf columns, the same as in from_arrow(). For example, a Parquet file that contains a column trip: struct<begin: struct<lon: double, lat: double>> produces two CTable columns, trip.begin.lon and trip.begin.lat. Row reads reconstruct the original nested dict shape; individual leaves are accessed via dotted names or attribute-chain proxies:
t = blosc2.CTable.from_parquet("trips.parquet")
t.col_names  # e.g. ['trip.begin.lon', 'trip.begin.lat', ...]
t["trip.begin.lon"].mean()
t.trip.begin.lon.max()
Unsupported Parquet types are not silently imported as schema-less object() columns; they raise so callers can decide how to handle them explicitly.
- Parameters:
path¶ (str or path-like) – Path to the source Parquet file.
columns¶ (list[str] or None, optional) – Subset of columns to read from the Parquet file. If provided, only these columns are loaded and their order in the resulting table matches the order in this list. Column names must be unique.
batch_size¶ (int, optional) – Number of rows per Arrow batch read from the Parquet file. This controls how much data is pulled from the file at a time before being handed off to the CTable builder. Must be greater than 0.
urlpath¶ (str or None, optional) – Destination storage path for the resulting CTable. If None (the default), the table is created in memory. If provided, the table is backed by persistent on-disk storage.
mode¶ (str, optional) – Storage open mode for urlpath. Defaults to "w". This is passed through to CTable.from_arrow().
cparams¶ (object, optional) – Compression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().
dparams¶ (object, optional) – Decompression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().
validate¶ (bool, optional) – Whether to enable extra internal validation while building the table. Defaults to False.
auto_null_sentinels¶ (bool, optional) – If True (default), nullable scalar columns imported from Parquet may automatically receive per-column null sentinel values when needed. Sentinel selection follows the current null-policy rules used by CTable schema handling.
blosc2_batch_size¶ (int or None, optional) – Number of items written to Blosc2 containers per internal write batch. Passed through to CTable.from_arrow().
blosc2_items_per_block¶ (int or None, optional) – Target number of items per internal Blosc2 block. Passed through to CTable.from_arrow(). In general, a larger number of items per block favors compression ratio but makes random access slower.
list_serializer¶ ({"msgpack", "arrow"}, optional) – Serializer used for imported list columns. The default, "arrow", stores Arrow list batches directly and is much faster for deeply nested or list<struct<...>> columns. The tradeoff is that accessing those list columns later requires PyArrow. Use "msgpack" to keep list-column stores independent of PyArrow at read time; it can be smaller for simple lists but is much slower and more memory-intensive for deeply nested data.
separate_nested_cols¶ (bool, optional) – Whether to separate qualifying nested columns during import. Defaults to True. In particular, a single unnamed top-level list<struct<...>> field is treated as a root record stream: each list element becomes a CTable row and struct leaves become ordinary nested CTable columns. Use separate_nested_cols=False when closer fidelity to the original Parquet row/schema shape is more important than the separated column layout.
max_rows¶ (int or None, optional) – Maximum number of rows to import. For ordinary Parquet files this limits Parquet/CTable rows. For unnamed-root list<struct<...>> files imported with separate_nested_cols=True, this limits flattened element rows.
**kwargs¶ – Additional keyword arguments forwarded to pyarrow.parquet.ParquetFile. Use these for Parquet-reader-specific options supported by PyArrow.
- Returns:
A new CTable populated from the Parquet file. The table contains all selected columns and all rows from the file, or at most max_rows rows when max_rows is set. If urlpath is provided, the returned table is disk-backed; otherwise it is in-memory.
- Return type:
CTable
- Raises:
ImportError – If pyarrow is not installed.
ValueError – If batch_size is not greater than 0.
ValueError – If max_rows is negative.
ValueError – If columns contains duplicate names.
Exception – Any exception raised by pyarrow while opening or reading the Parquet file, or by CTable.from_arrow() while converting Arrow data into a CTable.
Examples
Load an entire Parquet file into an in-memory table:
>>> import blosc2
>>> t = blosc2.CTable.from_parquet("data.parquet")
Load only a subset of columns:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     columns=["user_id", "amount", "country"],
... )
Create a disk-backed table while reading in batches:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     batch_size=50_000,
...     urlpath="data.ctable",
... )
Pass additional options through to PyArrow’s Parquet reader:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     memory_map=True,
... )
- classmethod load(urlpath: str) CTable[source]¶
Load a persistent table from urlpath into RAM.
The schema is read from the table’s metadata — the original Python dataclass is not required. The returned table is fully in-memory and read/write.
- Parameters:
urlpath¶ – Path to the table root directory.
- Raises:
FileNotFoundError – If urlpath does not contain a CTable.
ValueError – If the metadata at urlpath does not identify a CTable.
- classmethod open(urlpath: str, *, mode: str = 'r') CTable[source]¶
Open a persistent CTable from urlpath.
- __getattr__(s: str)[source]¶
Convenience fallback for attribute-style column access.
This is called only after normal Python attribute lookup fails. Thus t.name can return a column only for non-conflicting identifier-like column names. For columns whose names conflict with existing CTable attributes/methods, or are not valid identifiers, use the canonical item access form t["name"].
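The fallback behaviour described above follows from how Python resolves attributes: __getattr__ runs only when the normal lookup has already failed. The sketch below demonstrates this with a toy class; MiniTable is a hypothetical stand-in and not blosc2.CTable.

```python
class MiniTable:
    """Toy stand-in used only to illustrate the __getattr__ fallback."""

    def __init__(self, columns):
        self._columns = dict(columns)

    @property
    def nrows(self):
        # A real attribute: normal lookup finds it, so __getattr__ never runs for it.
        return len(next(iter(self._columns.values()), []))

    def __getitem__(self, name):
        # Item access always reaches the column mapping.
        return self._columns[name]

    def __getattr__(self, name):
        # Called only after normal attribute lookup has failed.
        # Go through __dict__ so a missing "_columns" cannot recurse into __getattr__.
        columns = self.__dict__.get("_columns", {})
        try:
            return columns[name]
        except KeyError:
            raise AttributeError(name) from None


t = MiniTable({"temperature": [20.5, 21.0], "nrows": [7, 8]})
by_attribute = t.temperature   # no conflict, so the fallback returns the column
shadowed = t.nrows             # the property wins over the column named "nrows"
by_item = t["nrows"]           # item access reaches the conflicting column
```

Here the column named "nrows" is reachable only through item access, which is why the item form is the safe choice for conflicting names.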
- __getitem__(key)[source]¶
Type-driven indexing for columns, rows, projections, and filters.
Supported keys are:
str: return a Column when it matches a stored or computed column name; otherwise evaluate it as a boolean expression via where(). Dotted names (e.g. "trip.begin.lon") select nested leaf columns directly; a struct-prefix name (e.g. "trip.begin") that matches multiple descendant leaves returns a _StructPathColumn view. This item-access form is the canonical way to access columns and works for every column name, including names that are not valid Python identifiers or that collide with existing CTable attributes or methods.
boolean blosc2.LazyExpr or blosc2.NDArray: return the same filtered view as where(), e.g. t[t.temperature_f > 70].
int: return one live row as a namedtuple-like object.
slice: return a row-range view.
integer array/list: return a gathered-row view.
boolean NumPy array/list: return a boolean-mask filtered view.
string list: return a column-projection view, equivalent to select().
Examples
Access columns and rows:
temps = t["temperature"]
first = t[0]
view = t[10:20]
Filter rows with a string expression, a stored-column expression, or a computed-column expression:
warm = t["temperature > 20"]
warm_active = t[(t.temperature > 20) & t.active]
hot_fahrenheit = t[t.temperature_f > 70]
Project columns:
slim = t[["sensor_id", "temperature_f"]]
Access a nested leaf column with a dotted name or an attribute chain:
lons = t["trip.begin.lon"]  # Column for the nested leaf
lons = t.trip.begin.lon     # equivalent attribute-chain form
Attribute access is only a convenience fallback. If a column name is not a valid identifier, or if it conflicts with an existing table attribute or method such as nrows, where or sort_by, use item access instead:
col = t["where"]  # column named "where"
method = t.where  # CTable.where method
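Dispatching on the key type, as listed for __getitem__ above, can be modelled in a few lines. This is a simplified sketch, not the library's code: MiniTable is a hypothetical class that stores plain lists and copies rows instead of returning views, and it does not model string expressions or lazy-expression keys.

```python
class MiniTable:
    """Toy table that dispatches __getitem__ on the key type (illustrative only)."""

    def __init__(self, columns):
        self.columns = dict(columns)
        self.nrows = len(next(iter(self.columns.values()), []))

    def _rows(self, indices):
        # Gather the given physical rows into a new toy table (a copy, not a view).
        return MiniTable({n: [v[i] for i in indices] for n, v in self.columns.items()})

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.columns[key]                      # column by name
        if isinstance(key, bool):
            raise TypeError("a single bool is not a valid key")
        if isinstance(key, int):
            return tuple(v[key] for v in self.columns.values())  # one row
        if isinstance(key, slice):
            return self._rows(range(*key.indices(self.nrows)))   # row range
        if isinstance(key, list) and key:
            if all(isinstance(k, str) for k in key):
                return MiniTable({k: self.columns[k] for k in key})  # projection
            # bool is a subclass of int, so the mask check must come first.
            if all(isinstance(k, bool) for k in key):
                return self._rows([i for i, keep in enumerate(key) if keep])
            if all(isinstance(k, int) for k in key):
                return self._rows(key)                    # gathered rows
        raise TypeError(f"unsupported key: {key!r}")


t = MiniTable({"sensor_id": [1, 2, 3], "temperature": [18.0, 22.5, 25.0]})
column = t["temperature"]
row = t[0]
window = t[1:3]
masked = t[[True, False, True]]
gathered = t[[2, 0]]
slim = t[["temperature"]]
```

The ordering of the checks matters in this sketch because Python treats True and False as integers; testing for booleans before integers keeps masks and gather lists apart.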
- add_column(name: str, spec: SchemaSpec | Field) None[source]¶
Add a new column filled from the default declared in spec.
- Parameters:
name¶ – Column name. Must follow the same naming rules as schema fields.
spec¶ – A schema descriptor such as b2.int64(ge=0) or a field descriptor such as b2.field(b2.int64(ge=0), default=0). When the table already has live rows, use blosc2.field(...) with a default declared so those rows can be backfilled.
- Raises:
ValueError – If the table is read-only, is a view, the column already exists, or a non-empty table is given a column with no default declared.
TypeError – If a declared default cannot be coerced to spec’s dtype.
- add_computed_column(name: str, expr: str | LazyExpr | Callable[[dict[str, Any]], LazyExpr], *, dtype: dtype | None = None) None[source]¶
Add a read-only virtual column computed from stored columns.
A computed column has no physical storage. It is backed by a blosc2.LazyExpr and is evaluated when values are read, filtered, displayed, exported, or aggregated. Because it is virtual, it is read-only, cannot be indexed directly, and is not supplied in append()/extend() inputs. To store and optionally index a computed result, use add_generated_column() or materialize an existing computed column with materialize_computed_column().
Supported signatures are:
add_computed_column(name, "price * qty", dtype=None)
add_computed_column(name, lazy_expr, dtype=None)
add_computed_column(name, lambda cols: cols["price"] * cols["qty"], dtype=None)
- Parameters:
name¶ – Name of the virtual computed column. It must be a valid column name and must not collide with an existing stored or computed column.
expr¶ –
Definition of the virtual column. Accepted forms:
str: scalar expression over stored scalar columns, e.g. "price * qty".
blosc2.LazyExpr: lazy expression over stored columns of this table.
callable: called as expr(self._cols) and must return a blosc2.LazyExpr over stored columns of this table.
Expressions must depend only on stored columns of this table; computed columns cannot depend on other computed columns in this version. Fixed-shape ndarray columns are not accepted in computed column expressions yet. For row-wise ndarray projections or reductions, use add_generated_column() with values=t.ndarray_col.row_transformer....
dtype¶ – Optional dtype override for the computed values. When omitted, the dtype is inferred from the resulting blosc2.LazyExpr. This changes the dtype reported by the CTable column wrapper; it does not create physical storage.
Examples
Add a computed column from a string expression and use it like a normal read-only column:
t.add_computed_column("total", "price * qty")
assert t.total[:].shape == (t.nrows,)
Add a computed column from a callable. The callable receives the table’s stored column mapping:
t.add_computed_column(
    "price_with_tax",
    lambda cols: cols["price"] * 1.21,
    dtype=np.float64,
)
Callable expressions can use normal Python logic while still returning a lazy expression:
def total_expr(cols):
    base = cols["price"] * cols["qty"]
    return base * 1.21 if include_tax else base

t.add_computed_column("total", total_expr)
They are also convenient for reusable, parameterized helpers:
def ratio(num, den):
    return lambda cols: cols[num] / cols[den]

t.add_computed_column("margin", ratio("profit", "revenue"))
Computed columns participate in filters and aggregates:
expensive = t.where(t.total > 100)
total_revenue = t.total.sum()
Computed columns are virtual and read-only. Materialize one when a stored snapshot or an indexable column is needed:
t.materialize_computed_column("total", new_name="total_stored")
t.create_index("total_stored")
For maintained stored results, prefer generated columns:
t.add_generated_column(
    "total_stored",
    values="price * qty",
    dtype=blosc2.float64(),
    create_index=True,
)
- Raises:
ValueError – If called on a view or read-only table, if name already exists, or if an expression operand does not reference a stored column of this table.
TypeError – If expr has an unsupported form, does not produce a blosc2.LazyExpr, references unsupported source columns, or if a RowTransformer is passed. Row transformers are only accepted by add_generated_column().
- add_generated_column(name: str, *, values: str | LazyExpr | Callable[[dict[str, Any]], LazyExpr] | RowTransformer, dtype=None, create_index: bool = False) None[source]¶
Add a stored generated column maintained by the table.
A generated column is physical storage, not a virtual expression. The initial values are computed for all current live rows, and later append()/extend() calls automatically compute values for newly inserted rows when source columns are provided. If a source column is modified in-place, dependent generated columns are marked stale; call refresh_generated_column() or refresh_generated_columns() to recompute them.
Supported signatures are:
add_generated_column(name, *, values="price * qty", dtype=..., create_index=False)
add_generated_column(name, *, values=lazy_expr, dtype=..., create_index=False)
add_generated_column(name, *, values=lambda cols: cols["price"] * 1.21, dtype=...)
add_generated_column(name, *, values=t.embedding.row_transformer.norm(axis=0), dtype=...)
add_generated_column(name, *, values=t.image.row_transformer.mean(axis=(0, 1)), dtype=blosc2.ndarray((3,), dtype=...))
- Parameters:
name¶ – Name of the generated column to create. It must be a valid column name and must not collide with an existing stored or computed column.
values¶ –
Definition used to compute the generated values. Accepted forms:
str: scalar expression over stored scalar columns, e.g. "price * qty". The expression must produce one scalar value per row.
blosc2.LazyExpr: scalar lazy expression over stored columns of this table. It must produce a 1-D scalar stream.
callable: called as values(self._cols) and must return a blosc2.LazyExpr over stored columns of this table.
RowTransformer: row-wise projection/reduction bound to a fixed-shape ndarray column, e.g. t.embedding.row_transformer.norm(axis=0) or t.image.row_transformer.mean(axis=(0, 1)). Row transformers may produce either one scalar per row or one fixed-shape ndarray item per row.
Expression forms currently cannot depend on computed columns and cannot directly consume fixed-shape ndarray columns; use a row-transformer for ndarray row projections/reductions.
dtype¶ – Output schema or dtype. Scalar outputs may pass a NumPy dtype or a Blosc2 scalar spec such as blosc2.float64(). Fixed-shape ndarray outputs must pass an ndarray spec such as blosc2.ndarray((3,), dtype=blosc2.float32()) unless the table has existing rows from which the output shape can be inferred. When omitted, dtype and fixed-shape output shape are inferred from the current generated values; this is not possible for an empty table.
create_index¶ – If True, create an index on the generated column immediately. Only scalar generated columns can be indexed; fixed-shape ndarray generated columns raise ValueError when indexing is requested.
Examples
Create and index a scalar generated column from a string expression:
t.add_generated_column(
    "total",
    values="price * qty",
    dtype=blosc2.float64(),
    create_index=True,
)
Use a callable when normal Python composition is more convenient:
t.add_generated_column(
    "price_with_tax",
    values=lambda cols: cols["price"] * 1.21,
    dtype=blosc2.float64(),
)
Generate a scalar from each fixed-shape ndarray row. For row transformers, axes refer to the per-row item shape, so axis=0 is the embedding-coordinate axis for item_shape=(dim,):
t.add_generated_column(
    "embedding_norm",
    values=t.embedding.row_transformer.norm(axis=0, ord=2),
    dtype=blosc2.float64(),
    create_index=True,
)
Generate a fixed-shape ndarray value per row. Here an image column has item_shape=(height, width, 3) and the generated column stores one RGB vector per row:
t.add_generated_column(
    "image_mean_rgb",
    values=t.image.row_transformer.mean(axis=(0, 1)),
    dtype=blosc2.ndarray((3,), dtype=blosc2.float32()),
)
Generated columns are maintained on append/extend:
t.append((new_id, new_embedding, new_image))
assert t.embedding_norm[-1] == np.linalg.norm(new_embedding)
If source values are changed in place, refresh dependent generated columns before relying on them:
t.embedding[0] = new_embedding
t.refresh_generated_column("embedding_norm")
- Raises:
ValueError – If called on a view or read-only table, if name already exists, if generated output length/shape is incompatible with the table, or if create_index=True is requested for an ndarray generated column.
TypeError – If values has an unsupported form, references unsupported source columns, or cannot be coerced to dtype.
KeyError – If a RowTransformer references a missing source column.
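The difference between a virtual computed column and a stored generated column can be made concrete with a toy model. The sketch below is an illustration under simplifying assumptions and not blosc2 code: MiniTable and its method names are hypothetical, columns are plain lists, and every generated column is marked stale on any in-place edit rather than only the dependent ones.

```python
class MiniTable:
    """Toy model contrasting computed (virtual) and generated (stored) columns."""

    def __init__(self, price, qty):
        self.stored = {"price": list(price), "qty": list(qty)}
        self.computed = {}    # name -> function, evaluated on every read
        self.generated = {}   # name -> function used to fill a stored column
        self.stale = set()

    def add_computed(self, name, fn):
        self.computed[name] = fn              # nothing is stored

    def add_generated(self, name, fn):
        self.generated[name] = fn
        self.stored[name] = fn(self.stored)   # values are stored immediately

    def set_value(self, col, i, value):
        self.stored[col][i] = value
        self.stale.update(self.generated)     # simplification: mark all as stale

    def refresh(self, name):
        self.stored[name] = self.generated[name](self.stored)
        self.stale.discard(name)

    def read(self, name):
        if name in self.computed:
            return self.computed[name](self.stored)
        return self.stored[name]


def total(cols):
    return [p * q for p, q in zip(cols["price"], cols["qty"])]


t = MiniTable([2.0, 3.0], [5, 4])
t.add_computed("total", total)
t.add_generated("total_stored", total)

t.set_value("price", 0, 10.0)                 # in-place edit of a source column
virtual_after_edit = t.read("total")          # recomputed on read
stored_before_refresh = list(t.read("total_stored"))
was_stale = "total_stored" in t.stale
t.refresh("total_stored")
stored_after_refresh = t.read("total_stored")
```

In this model the virtual column always reflects the current source values, while the stored one keeps its snapshot until it is refreshed, which matches the stale-then-refresh workflow described above.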
- append(data: list | void | ndarray) None[source]¶
Append a single row to the table.
data may be a list, tuple, numpy.void, or structured numpy.ndarray whose fields match the schema column order. Materialized columns whose values are omitted are auto-filled from their recorded expression. Raises ValueError if the table is read-only or a view.
For tables with nested (dotted) column names, the row dict may be supplied either as a flat mapping of dotted keys or as a nested dict that mirrors the original struct shape; both are accepted and automatically flattened to the physical dotted leaf names:
# flat dotted keys
t.append({"trip.begin.lon": -87.6, "trip.begin.lat": 41.8, "payment.fare": 12.5})
# original nested dict (auto-flattened)
t.append({"trip": {"begin": {"lon": -87.6, "lat": 41.8}}, "payment": {"fare": 12.5}})
- column_schema(name: str) CompiledColumn[source]¶
Return the CompiledColumn descriptor for name.
- Raises:
KeyError – If name is not a column in this table.
- compact()[source]¶
Physically rewrite every column array keeping only live rows.
Closes the gaps left by prior delete() calls by moving live data to the front of each column array. The underlying NDArray allocations are not resized; each column retains its original capacity. To actually reclaim memory, use copy() with compact=True instead, which allocates fresh arrays sized to the live row count. All existing indexes are dropped and must be recreated afterwards. Raises ValueError if the table is read-only or a view.
- compact_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]¶
Compact an index, merging any incremental append runs.
- copy(compact: bool = True, *, urlpath: str | PathLike[str] | None = None, overwrite: bool = False) CTable[source]¶
Return a new standalone copy of this table.
This is the only operation that truly reclaims memory: when compact=True the new table allocates fresh arrays sized exactly to the live row count, discarding all deleted-row gaps and unused capacity.
- Parameters:
compact¶ – If True (default), only live (non-deleted) rows are copied. The result is a dense table with no tombstones and no parent dependency, ideal for materialising a filtered view or freeing memory after heavy deletions. If False, all physical slots are copied including deleted gaps, preserving the tombstone state exactly for in-memory copies.
urlpath¶ – Destination path for a persistent copy. The .b2z extension selects a compact zip-backed store; any other path uses a directory-backed store. A .b2d suffix is recommended for directory-backed stores. If None (default), return an in-memory copy.
overwrite¶ – If True, replace an existing persistent destination.
- cov() ndarray[source]¶
Return the covariance matrix as a numpy array.
Only int, float, and bool columns are supported. Bool columns are cast to int (0/1) before computation. Complex columns raise TypeError.
- Returns:
Shape (ncols, ncols). Column order matches col_names.
- Return type:
numpy.ndarray
- Raises:
TypeError – If any column has an unsupported dtype (complex, string, …).
ValueError – If the table has fewer than 2 live rows (covariance undefined).
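As a worked illustration of what a column-wise covariance matrix contains, the pure-Python sketch below casts bool values to 0/1 and returns an (ncols, ncols) nested list. It is not the library's implementation: covariance_matrix() is a hypothetical helper, and the n - 1 normalisation (the default of numpy.cov) is an assumption, since the text above does not state which normalisation cov() uses.

```python
def covariance_matrix(columns):
    """Sample covariance between columns (hypothetical helper, n - 1 normalisation assumed)."""
    names = list(columns)
    # float() turns bools into 0.0/1.0 and ints into floats.
    data = [[float(v) for v in columns[name]] for name in names]
    n = len(data[0])
    if n < 2:
        raise ValueError("covariance needs at least 2 rows")
    means = [sum(col) / n for col in data]
    return [
        [
            sum((a - mean_a) * (b - mean_b) for a, b in zip(col_a, col_b)) / (n - 1)
            for col_b, mean_b in zip(data, means)
        ]
        for col_a, mean_a in zip(data, means)
    ]


cov = covariance_matrix({"x": [1, 2, 3, 4], "flag": [False, False, True, True]})
```

The diagonal of the result holds each column's variance and the off-diagonal entries are symmetric, following the column order of the input mapping.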
- create_index(col_name: str | None = None, *, field: str | None = None, expression: str | None = None, operands: dict | None = None, kind: IndexKind = IndexKind.BUCKET, optlevel: int = 5, name: str | None = None, build: str = 'auto', tmpdir: str | None = None, **kwargs) Index[source]¶
Build and register an index for a stored column or table expression.
For tables with nested (dotted) column names, pass the dotted leaf name directly:
t.create_index("trip.begin.lon")
t.where("trip.begin.lon > -87.7").nrows  # index is used automatically
- delete(ind: int | slice | str | Iterable) None[source]¶
Mark one or more rows as deleted (tombstone deletion).
ind may be a logical row index (int), a slice, or an iterable of logical indices. Deleted rows are excluded from all subsequent queries and aggregates. The physical slots of deleted rows are kept until compact() is called; compact() does not shrink the underlying allocations, so use copy() with compact=True to actually reclaim memory. Raises ValueError if the table is read-only or a view.
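Tombstone deletion and in-place compaction can be modelled with a data list plus a validity mask. The sketch below is a toy under stated assumptions, not the library's storage code: MiniColumnStore is a hypothetical class holding a single column in a Python list with a fixed capacity.

```python
class MiniColumnStore:
    """Toy single-column store with a validity mask (illustrative only)."""

    def __init__(self, values, capacity):
        padding = capacity - len(values)
        self.data = list(values) + [None] * padding
        self.valid = [True] * len(values) + [False] * padding

    def _physical(self, logical):
        # Map a logical (live-row) index to its physical slot.
        live_slots = [i for i, ok in enumerate(self.valid) if ok]
        return live_slots[logical]

    def delete(self, logical):
        # Tombstone: flip the mask bit, leave the stored value in place.
        self.valid[self._physical(logical)] = False

    def live(self):
        return [v for v, ok in zip(self.data, self.valid) if ok]

    def compact(self):
        # Move live values to the front; capacity stays the same.
        kept = self.live()
        padding = len(self.data) - len(kept)
        self.data = kept + [None] * padding
        self.valid = [True] * len(kept) + [False] * padding


store = MiniColumnStore([10, 20, 30, 40], capacity=8)
store.delete(1)              # removes 20
store.delete(1)              # logical index 1 is now 30
live_before = store.live()
store.compact()
```

After compact() the live values sit at the front, yet the list still has eight slots, which mirrors the point that only a fresh copy sized to the live rows releases the unused capacity.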
- describe() None[source]¶
Print a per-column statistical summary.
Numeric columns (int, float): count, mean, std, min, max. Bool columns: count, true-count, true-%. String columns: count, min (lex), max (lex), n-unique.
- drop_column(name: str) None[source]¶
Remove a column from the table.
For on-disk tables, the corresponding persisted column leaf is deleted.
- Raises:
ValueError – If the table is read-only, is a view, or name is the last column.
KeyError – If name does not exist.
- drop_computed_column(name: str) None[source]¶
Remove a computed column from the table.
- Parameters:
name¶ – Name of the computed column to remove.
- Raises:
KeyError – If name is not a computed column.
ValueError – If called on a view.
- drop_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) None[source]¶
Remove an index and delete any sidecar files.
- extend(data: list | CTable | Any, *, validate: bool | None = None) None[source]¶
Append multiple rows at once.
data may be:
- a dict of arrays {"col": array, ...}: all arrays must have the same length; omitted columns are filled from their declared default; columns with no declared default must be provided;
- a list of rows, each compatible with append();
- another CTable: columns are matched by name.
Pass validate=False to skip per-row Pydantic validation on trusted bulk imports. Raises ValueError if the table is read-only or a view.
For tables with nested (dotted) column names, both the dict-of-arrays and list-of-dicts forms accept the original nested dict shape and auto-flatten it to physical dotted leaf names:
# nested dict of arrays
t.extend({
    "trip": {"begin": {"lon": lons, "lat": lats}},
    "payment": {"fare": fares},
})
# list of nested dicts
t.extend([
    {"trip": {"begin": {"lon": -87.6, "lat": 41.8}}, "payment": {"fare": 12.5}},
    {"trip": {"begin": {"lon": -87.5, "lat": 41.7}}, "payment": {"fare": 8.0}},
])
- group_by(keys: str | Sequence[str], *, sort: bool = False, dropna: bool = True, engine: str = 'auto', chunk_size: int | None = None)[source]¶
Return a deferred group-by object for this table.
- Parameters:
keys¶ – Column name or sequence of column names to group by.
sort¶ – If True, sort the result by the group keys. The default False preserves the hash aggregation order and is usually faster.
dropna¶ – If True (default), rows with null/NaN group keys are skipped. If False, null/NaN keys form their own group.
engine¶ – Execution engine. Phase 1 accepts "auto" and uses the NumPy chunked implementation.
chunk_size¶ – Optional number of physical rows processed per chunk.
- Returns:
A lightweight deferred operation builder. Call methods such as .size(), .count(column) or .agg({...}) to materialize a grouped result as a new CTable.
- Return type:
- index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]¶
Return the index handle for a stored-column or expression target.
- iter_arrow_batches(*, columns: list[str] | None = None, batch_size: int = 2048, include_computed: bool = True)[source]¶
Yield live rows as bounded-size pyarrow.RecordBatch objects.
- iter_sorted(cols: str | list[str], ascending: bool | list[bool] = True, *, start: int | None = None, stop: int | None = None, step: int | None = None, batch_size: int = 4096)[source]¶
Iterate rows in sorted order without materializing a full copy.
Uses a FULL index when available (no sort needed); otherwise falls back to np.lexsort on live physical positions. Yields namedtuple-like row objects in the same way as __iter__.
The sorted positions array is stored as a compressed blosc2.NDArray to keep RAM usage low for large tables. batch_size positions are decompressed at a time during iteration.
- Parameters:
cols¶ – Column name or list of column names to sort by.
ascending¶ – Sort direction. A single bool applies to all keys; a list must have the same length as cols.
start¶, stop¶, step¶ – Optional slice applied to the sorted sequence before iteration. E.g. stop=10 yields only the top-10 rows; step=2 yields every other row in sorted order.
batch_size¶ – Number of positions decompressed per iteration step. Larger values reduce decompression overhead; smaller values use less transient RAM. Default is 4096.
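The interplay of sorting, slicing, and batching can be sketched in plain Python. This is only an illustration: `iter_sorted_positions` is a hypothetical helper, `list.sort` stands in for np.lexsort or a FULL index, and an ordinary list stands in for the compressed positions array.

```python
def iter_sorted_positions(keys, live, *, start=None, stop=None, step=None, batch_size=4096):
    """Yield physical positions of live rows in sorted order (illustrative only)."""
    # Only live rows take part in the ordering.
    positions = [pos for pos, ok in enumerate(live) if ok]
    positions.sort(key=lambda pos: keys[pos])        # stand-in for np.lexsort
    positions = positions[slice(start, stop, step)]  # slice the sorted sequence first
    # Walk the positions in bounded batches, as a real reader would decompress them.
    for offset in range(0, len(positions), batch_size):
        yield from positions[offset:offset + batch_size]


scores = [50, 10, 40, 30, 20]
live = [True, True, False, True, True]   # physical row 2 is deleted
top2 = list(iter_sorted_positions(scores, live, stop=2, batch_size=1))
print([scores[p] for p in top2])
every_other = list(iter_sorted_positions(scores, live, step=2))
print([scores[p] for p in every_other])
```

Because the slice is applied to the already sorted sequence, `stop` selects the smallest values and `step` skips through the sorted order rather than the physical order.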
- materialize_computed_column(name: str, *, new_name: str | None = None, dtype: dtype | None = None, cparams: dict | CParams | None = None) None[source]¶
Materialize a computed column into a new stored snapshot column.
- Parameters:
- Raises:
ValueError – If called on a view, on a read-only table, or if the target name collides with an existing stored or computed column.
KeyError – If name is not a computed column.
TypeError – If dtype is incompatible with the computed values.
- rebuild_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]¶
Drop and recreate an index with the same parameters.
- refresh_generated_column(name: str) None[source]¶
Recompute a stored generated/materialized column from its source columns.
- refresh_generated_columns(*, source: str | None = None) None[source]¶
Refresh all generated columns, optionally only those depending on source.
- rename_column(old: str, new: str) None[source]¶
Rename a column.
For on-disk tables, the corresponding persisted column leaf is renamed.
Renaming a flat column to a dotted name (e.g. "trip.begin.lon") promotes it to a nested leaf column: it will be stored under the hierarchical path /_cols/trip/begin/lon on disk and can be accessed via t["trip.begin.lon"] or the attribute-chain proxy t.trip.begin.lon. This is the primary way to define nested columns when importing from non-Arrow sources:
t.rename_column("trip_begin_lon", "trip.begin.lon")
t["trip.begin.lon"].mean()  # works as a regular Column
- Raises:
ValueError – If the table is read-only, is a view, or new already exists.
KeyError – If old does not exist.
- sample(n: int, *, seed: int | None = None) CTable[source]¶
Return a read-only view of n randomly chosen live rows.
- save(urlpath: str, *, overwrite: bool = False) None[source]¶
Persist this table to disk at urlpath.
This writes a standalone copy and returns None; use copy() directly when the copied CTable object is needed.
Only live rows are written: the on-disk table is always compacted. A .b2z suffix selects the compact zip-backed format; any other suffix creates a directory-backed store. Use a .b2d suffix for directory-backed stores when possible so the format is clear.
- Parameters:
urlpath¶ – Destination path. Use a .b2z suffix for a compact zip-backed store; any other suffix creates a directory-backed store. A .b2d suffix is recommended for directory-backed stores.
overwrite¶ – If False (default), raise ValueError when urlpath already exists. Set to True to replace an existing table.
- Raises:
ValueError – If urlpath already exists and overwrite=False.
- select(cols: list[str]) CTable[source]¶
Return a column-projection view exposing only cols.
The returned object shares the underlying NDArrays with this table (no data is copied). Row filtering and value writes work as usual; structural mutations (add/drop/rename column, append, …) are blocked.
- Parameters:
cols¶ – Ordered list of column names to keep. For tables with nested (dotted) column names, a struct-prefix name automatically expands to all descendant leaves:
t.select(["trip.begin"])  # expands to trip.begin.lon, trip.begin.lat
t.select(["trip"])  # expands to all trip.* leaves
- Raises:
KeyError – If any name in cols is not a column of this table (and does not match any struct prefix).
ValueError – If cols is empty.
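The struct-prefix expansion and the two error cases can be sketched as follows. `expand_columns` is a hypothetical helper written for illustration, not a blosc2 function, and the column names are made up.

```python
def expand_columns(requested, col_names):
    """Expand struct prefixes to descendant leaves (hypothetical helper, not blosc2 code)."""
    if not requested:
        raise ValueError("cols is empty")
    selected = []
    for name in requested:
        if name in col_names:
            matches = [name]  # exact leaf name
        else:
            # A struct prefix matches every leaf stored below it.
            matches = [col for col in col_names if col.startswith(name + ".")]
        if not matches:
            raise KeyError(name)
        for col in matches:
            if col not in selected:
                selected.append(col)
    return selected


cols = ["trip.begin.lon", "trip.begin.lat", "trip.end.lon", "payment.fare"]
print(expand_columns(["trip.begin"], cols))
print(expand_columns(["payment.fare", "trip"], cols))
```

Matching on `name + "."` rather than on the bare prefix keeps a request such as `"tri"` from accidentally selecting the `trip.*` leaves.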
- sort_by(cols: str | list[str], ascending: bool | list[bool] = True, *, inplace: bool = False) CTable[source]¶
Return a copy of the table sorted by one or more columns.
- Parameters:
cols¶ – Column name or list of column names to sort by. When multiple columns are given, the first is the primary key, the second is the tiebreaker, and so on. For tables with nested (dotted) column names, pass the dotted leaf name directly:
t.sort_by("trip.begin.lon")
t.sort_by(["trip.begin.lon", "payment.fare"], ascending=[True, False])
ascending¶ – Sort direction. A single bool applies to all keys; a list must have the same length as cols.
inplace¶ – If True, rewrite the physical data in place and return self (like compact() but sorted). If False (default), return a new in-memory CTable leaving this one untouched.
- Raises:
ValueError – If called on a view or a read-only table when inplace=True.
KeyError – If any column name is not found.
TypeError – If a column used as a sort key does not support ordering (e.g. complex numbers).
- to_b2d(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]¶
Write this table to a directory-backed store.
Directory-backed CTable stores may use any path that does not end in .b2z; using a .b2d suffix is recommended for clarity. For persistent, non-view .b2z tables opened read-only and compact=False, this uses a fast physical-unpack path: the zip members are extracted as already-compressed leaves. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns.
For in-memory tables, views, writable .b2z tables, existing directory-backed tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new directory-backed store.
Examples
Fast-unpack an existing compact zip store into a directory-backed table:
table = blosc2.CTable.open("data.b2z", mode="r")
table.to_b2d("data.b2d", overwrite=True)
table.close()
Materialize a filtered view into a directory-backed store:
view = table.where(table["score"] > 10)
view.to_b2d("high-score.b2d", overwrite=True)
Force a logical compacted copy, even for a persistent .b2z table:
table.to_b2d("data-compact.b2d", overwrite=True, compact=True)
- to_b2z(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]¶
Write this table to a compact .b2z container.
.b2z is the compact zip-backed CTable format. For persistent, non-view directory-backed tables and compact=False, this uses a fast physical-pack path: the backing TreeStore directory is zipped with already-compressed leaves stored as-is. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns. A .b2d suffix is recommended for directory-backed stores, but not required.
For in-memory tables, views, existing .b2z tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new .b2z store.
Examples
Fast-pack an existing directory-backed table into a compact zip store:
table = blosc2.CTable.open("data.b2d", mode="r")
table.to_b2z("data.b2z", overwrite=True)
table.close()
Materialize a filtered view into a new compact store:
view = table.where(table["score"] > 10)
view.to_b2z("high-score.b2z", overwrite=True)
Force a logical compacted copy, even for a persistent .b2d table:
table.to_b2z("data-compact.b2z", overwrite=True, compact=True)
- to_csv(path: str, *, header: bool = True, sep: str = ',') None[source]¶
Write all live rows to a CSV file.
Uses Python’s stdlib csv module; no extra dependency is required. Fixed-shape ndarray column cells are serialised as JSON arrays for readability and shape safety (e.g. "[1.0, 2.0, 3.0]").
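The JSON-in-a-cell convention can be reproduced with the standard library alone. The sketch below does not call `to_csv()`; it uses plain lists in place of ndarray cells and made-up column names to show what such a cell looks like on disk and how a consumer could decode it.

```python
import csv
import io
import json

rows = [(1, [1.0, 2.0, 3.0]), (2, [4.0, 5.0, 6.0])]

# Write: each array-valued cell becomes one JSON string.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",")
writer.writerow(["id", "embedding"])
for row_id, embedding in rows:
    writer.writerow([row_id, json.dumps(embedding)])
text = buffer.getvalue()
print(text)

# Read back: the csv module unquotes the cell, json restores the list.
reader = csv.reader(io.StringIO(text))
header = next(reader)
decoded = [(int(record[0]), json.loads(record[1])) for record in reader]
print(decoded)
```

Because the JSON text contains the delimiter, the csv writer quotes the whole cell, so the array survives as a single field.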
- to_pandas()[source]¶
Convert to a pandas DataFrame.
Scalar columns become regular DataFrame columns. Fixed-shape ndarray columns become object-dtype columns whose cells hold NumPy arrays of per-row shape item_shape.
- Return type:
pandas.DataFrame
Examples
>>> import blosc2
>>> from dataclasses import dataclass
>>> import numpy as np
>>> @dataclass
... class Row:
...     id: int = blosc2.field(blosc2.int64())
...     embedding: object = blosc2.field(blosc2.ndarray((3,), dtype=blosc2.float32()))
>>> t = blosc2.CTable(Row, new_data=[
...     (1, np.array([1, 2, 3], dtype=np.float32)),
...     (2, np.array([4, 5, 6], dtype=np.float32)),
... ])
>>> df = t.to_pandas()
>>> df["id"].tolist()
[1, 2]
>>> df["embedding"].dtype
dtype('O')
>>> np.testing.assert_array_equal(df["embedding"][0], np.array([1, 2, 3], dtype=np.float32))
- to_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, compression: str | None = 'zstd', row_group_size: int | None = None, include_computed: bool = True, **kwargs) None[source]¶
Write this table to a Parquet file batch-wise using pyarrow.
- view(new_valid_rows)[source]¶
Return a row-filter view backed by a boolean mask array without copying data.
- where(expr_result: str | ndarray | NDArray | LazyExpr | Column, *, columns: list[str] | tuple[str, ...] | None = None) CTable[source]¶
Return a row-filtered view matching a boolean predicate.
Signature:
where(expr_result) -> CTable
The predicate can be supplied as a boolean blosc2.LazyExpr, a boolean blosc2.NDArray, a boolean NumPy array, a boolean Column, or a string expression evaluated against this table’s columns. String expressions can reference stored and computed columns directly by name.
The returned object is a CTable view sharing the original column data. The row-selection mask is evaluated immediately and intersected with the table’s current live rows; selected column data is not copied.
- Parameters:
expr_result¶ – Boolean predicate selecting rows. Strings are converted to a lazy expression with table columns as operands, e.g. "value * category >= 150". Column objects can also be used in Python expressions, e.g. (t.value * t.category) >= 150.
- Returns:
A view over the same columns containing only rows where the predicate is true and the source row is live. When columns is provided, the returned view is additionally projected to that ordered subset of columns.
- Return type:
CTable
- Raises:
TypeError – If expr_result does not evaluate to a boolean Blosc2/NumPy array or lazy expression.
Examples
Filter using a string expression:
view = t.where("value * category >= 150")
slim = t.where("value * category >= 150", columns=["value", "category"])
Filter using column arithmetic:
view = t.where((t.value * t.category) >= 150)
Blosc2 lazy functions can be used in column expressions:
view = t.where(((t.value + 2) * blosc2.sin(t.category)) >= 10)
For column names that are not valid Python identifiers, use item access:
view = t.where((t["unit price"] * t["quantity"]) > 100)
For tables with nested (dotted) column names, dotted leaf names and attribute-chain proxies work in both string and expression forms:
view = t.where("trip.begin.lon > -87.7 and payment.fare > 10")
view = t.where(t.trip.begin.lon > -87.7)
Notes
Use bitwise operators (&, |, ~) or string expressions for element-wise boolean logic. Python’s logical operators and, or and not cannot be overloaded and therefore do not build lazy column expressions.
Use:
t.where((t.x > 0) & (t.y < 10))
t.where(~t.returned)
t.where("not returned")
not:
t.where((t.x > 0) and (t.y < 10))
t.where(not t.returned)
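Why the bitwise operators are required can be seen with a tiny stand-in class. `Expr` below is hypothetical and much simpler than blosc2.LazyExpr. It raises from `__bool__` to make the problem visible; that behaviour is an assumption of this sketch, and a real library may instead silently collapse the expression to a single truth value.

```python
class Expr:
    """Tiny stand-in for a lazy column expression (hypothetical, not blosc2.LazyExpr)."""

    def __init__(self, text):
        self.text = text

    def __gt__(self, other):
        return Expr(f"({self.text} > {other})")

    def __lt__(self, other):
        return Expr(f"({self.text} < {other})")

    def __and__(self, other):
        # `&` dispatches here, so a combined expression can be built.
        return Expr(f"({self.text} & {other.text})")

    def __invert__(self):
        return Expr(f"(~{self.text})")

    def __bool__(self):
        # `and`, `or` and `not` only ask for a truth value; there is no
        # hook that would let them return a new expression.
        raise TypeError("use &, | or ~ instead of and, or, not")


x, y = Expr("x"), Expr("y")
combined = (x > 0) & (y < 10)
print(combined.text)

try:
    (x > 0) and (y < 10)
except TypeError as exc:
    print("and failed:", exc)
```

The parentheses around each comparison matter as well, because `&` binds more tightly than `>` and `<`.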
- base: CTable | None¶
Parent table when this instance is a row-filter or column-projection view (created by where(), select(), or view()). None for top-level tables. Structural mutations such as add_column() and drop_column() are blocked on views.
- property cbytes: int¶
Total compressed size in bytes (all columns + valid_rows mask).
- col_names: list[str]¶
Ordered list of stored column names. Computed columns are not included; access those via computed_columns.
- property computed_columns: dict[str, dict]¶
Read-only view of the computed-column definitions.
Each value is a dict with keys expression, col_deps, lazy (blosc2.LazyExpr), and dtype.
- property cratio: float¶
Compression ratio for the whole table payload.
- property indexes: list[Index]¶
Return a list of blosc2.Index handles for all active indexes.
- property info: _CTableInfoReporter¶
Get information about this table.
Examples
>>> print(t.info)
>>> t.info()
- property nbytes: int¶
Total uncompressed size in bytes (all columns + valid_rows mask).
- property ncols: int¶
Total number of columns, including computed (virtual) columns.
- property schema: CompiledSchema¶
The compiled schema that drives this table’s columns and validation.
Construction¶
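CTable.__init__ | |
CTable.open | Open a persistent CTable from urlpath. |
CTable.load | Load a persistent table from urlpath into RAM. |
CTable.from_arrow | Build a CTable from an Arrow schema and iterable of record batches. |
CTable.from_parquet | Read a Parquet file into a CTable. |
CTable.from_csv | Build a CTable from a CSV file. |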
- CTable.__init__(row_type: type[RowT], new_data=None, *, urlpath: str | None = None, mode: str = 'a', expected_size: int | None = None, compact: bool = False, validate: bool = True, cparams: dict[str, Any] | None = None, dparams: dict[str, Any] | None = None) None[source]¶
- classmethod CTable.open(urlpath: str, *, mode: str = 'r') CTable[source]¶
Open a persistent CTable from urlpath.
- classmethod CTable.load(urlpath: str) CTable[source]¶
Load a persistent table from urlpath into RAM.
The schema is read from the table’s metadata — the original Python dataclass is not required. The returned table is fully in-memory and read/write.
- Parameters:
urlpath¶ – Path to the table root directory.
- Raises:
FileNotFoundError – If urlpath does not contain a CTable.
ValueError – If the metadata at urlpath does not identify a CTable.
- classmethod CTable.from_arrow(schema, batches, *, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, capacity_hint: int | None = None, string_max_length: int | Mapping[str, int] | None = None, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, list_serializer: Literal['msgpack', 'arrow'] = 'msgpack', object_fallback: bool = False, column_cparams: Mapping[str, dict[str, Any]] | None = None, separate_nested_cols: bool = False) CTable[source]¶
Build a CTable from an Arrow schema and iterable of record batches.
Nested struct flattening: top-level Arrow struct<…> fields are automatically and recursively flattened into dotted leaf columns. For example, a field trip: struct<begin: struct<lon: float64, lat: float64>> becomes two CTable columns trip.begin.lon and trip.begin.lat. Each leaf is stored as an independent compressed NDArray. Row reads via t[i] reconstruct the original nested dict shape. Use t["trip.begin.lon"] or t.trip.begin.lon to access a leaf:
import pyarrow as pa, blosc2
trip_type = pa.struct([("begin", pa.struct([("lon", pa.float64())]))])
schema = pa.schema([pa.field("trip", trip_type)])
t = blosc2.CTable.from_arrow(schema, batches)
t.col_names  # ['trip.begin.lon']
t["trip.begin.lon"].mean()
t.trip.begin.lon.max()
When string_max_length is None (the default), scalar Arrow string/large_string columns are imported as vlstring() columns and binary/large_binary columns are imported as vlbytes() columns. Non-struct struct columns (not containing only scalar leaves) are imported as struct() columns backed by batched variable-length storage. Null values for these variable-length scalar columns are represented as native None with no sentinel needed.
When string_max_length is set to a positive integer, scalar string and binary columns are imported as fixed-width string()/bytes() columns whose dtype is sized to string_max_length characters/bytes. It may also be a mapping from column name to max length; omitted string/binary columns remain vlstring()/vlbytes() columns.
blosc2_batch_size controls how many rows are buffered before BatchArray-backed imported columns (list columns and variable-length scalar columns such as vlstring, vlbytes, struct, and schema-less object columns) are flushed to their backend. Set it to None to keep those columns pending until the final flush.
list_serializer selects the backend serializer for imported list columns. "msgpack" is the default; "arrow" stores Arrow list batches directly and can be much faster for deeply nested list columns.
Unsupported Arrow types raise by default. Pass object_fallback=True to import such columns as schema-less object() columns. This fallback is intentionally not used by from_parquet().
column_cparams optionally maps column names to per-column compression parameters. These override the table-level cparams for fixed-width columns imported from Arrow.
- classmethod CTable.from_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, list_serializer: Literal['msgpack', 'arrow'] = 'arrow', separate_nested_cols: bool = True, max_rows: int | None = None, **kwargs) CTable[source]¶
Read a Parquet file into a CTable.
The Parquet file is streamed batch by batch through pyarrow and then converted into a typed CTable. By default, the result is created in memory, but you can also persist it on disk via urlpath.
This method delegates the actual table construction to CTable.from_arrow(), so Arrow schema handling, nullable-column support, and Blosc2 write tuning follow the same rules as that method.
Nested struct flattening: top-level Parquet struct<…> fields are automatically and recursively flattened into dotted leaf columns, the same as in from_arrow(). For example, a Parquet file that contains a column trip: struct<begin: struct<lon: double, lat: double>> produces two CTable columns trip.begin.lon and trip.begin.lat. Row reads reconstruct the original nested dict shape; individual leaves are accessed via dotted names or attribute-chain proxies:
t = blosc2.CTable.from_parquet("trips.parquet")
t.col_names  # e.g. ['trip.begin.lon', 'trip.begin.lat', ...]
t["trip.begin.lon"].mean()
t.trip.begin.lon.max()
Unsupported Parquet types are not silently imported as schema-less object() columns; they raise so callers can decide how to handle them explicitly.
- Parameters:
path¶ (str or path-like) – Path to the source Parquet file.
columns¶ (list[str] or None, optional) – Subset of columns to read from the Parquet file. If provided, only these columns are loaded and their order in the resulting table matches the order in this list. Column names must be unique.
batch_size¶ (int, optional) – Number of rows per Arrow batch read from the Parquet file. This controls how much data is pulled from the file at a time before being handed off to the CTable builder. Must be greater than 0.
urlpath¶ (str or None, optional) – Destination storage path for the resulting CTable. If None (the default), the table is created in memory. If provided, the table is backed by persistent on-disk storage.
mode¶ (str, optional) – Storage open mode for urlpath. Defaults to "w". This is passed through to CTable.from_arrow().
cparams¶ (object, optional) – Compression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().
dparams¶ (object, optional) – Decompression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().
validate¶ (bool, optional) – Whether to enable extra internal validation while building the table. Defaults to False.
auto_null_sentinels¶ (bool, optional) – If True (default), nullable scalar columns imported from Parquet may automatically receive per-column null sentinel values when needed. Sentinel selection follows the current null-policy rules used by CTable schema handling.
blosc2_batch_size¶ (int or None, optional) – Number of items written to Blosc2 containers per internal write batch. Passed through to CTable.from_arrow().
blosc2_items_per_block¶ (int or None, optional) – Target number of items per internal Blosc2 block. Passed through to CTable.from_arrow(). In general, a larger number of items per block favors compression ratio but makes random access slower.
list_serializer¶ ({"msgpack", "arrow"}, optional) – Serializer used for imported list columns. The default, "arrow", stores Arrow list batches directly and is much faster for deeply nested or list<struct<...>> columns. The tradeoff is that accessing those list columns later requires PyArrow. Use "msgpack" to keep list-column stores independent of PyArrow at read time; it can be smaller for simple lists but is much slower and more memory-intensive for deeply nested data.
separate_nested_cols¶ (bool, optional) – Whether to separate qualifying nested columns during import. Defaults to True. In particular, a single unnamed top-level list<struct<...>> field is treated as a root record stream: each list element becomes a CTable row and struct leaves become ordinary nested CTable columns. Use separate_nested_cols=False when closer fidelity to the original Parquet row/schema shape is more important than the separated column layout.
max_rows¶ (int or None, optional) – Maximum number of rows to import. For ordinary Parquet files this limits Parquet/CTable rows. For unnamed-root list<struct<...>> files imported with separate_nested_cols=True, this limits flattened element rows.
**kwargs¶ – Additional keyword arguments forwarded to pyarrow.parquet.ParquetFile. Use these for Parquet-reader-specific options supported by PyArrow.
- Returns:
A new CTable populated from the Parquet file. The table contains all selected columns and all rows from the file (or at most max_rows rows when max_rows is set). If urlpath is provided, the returned table is disk-backed; otherwise it is in-memory.
- Return type:
CTable
- Raises:
ImportError – If pyarrow is not installed.
ValueError – If batch_size is not greater than 0.
ValueError – If max_rows is negative.
ValueError – If columns contains duplicate names.
Exception – Any exception raised by pyarrow while opening or reading the Parquet file, or by CTable.from_arrow() while converting Arrow data into a CTable.
Examples
Load an entire Parquet file into an in-memory table:
>>> import blosc2
>>> t = blosc2.CTable.from_parquet("data.parquet")
Load only a subset of columns:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     columns=["user_id", "amount", "country"],
... )
Create a disk-backed table while reading in batches:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     batch_size=50_000,
...     urlpath="data.ctable",
... )
Pass additional options through to PyArrow’s Parquet reader:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     memory_map=True,
... )
- classmethod CTable.from_csv(path: str, row_cls, *, header: bool = True, sep: str = ',') CTable[source]¶
Build a CTable from a CSV file.
Schema comes from row_cls (a dataclass): CTable is always typed. All rows are read in a single pass into per-column Python lists, then each column is bulk-written into a pre-allocated NDArray (one slice assignment per column, no extend()).
- Parameters:
path¶ – Source CSV file path.
row_cls¶ – A dataclass whose fields define the column names and types.
header¶ – If True (default), the first row is treated as a header and skipped. Column order in the file must match row_cls field order regardless.
sep¶ – Field delimiter. Defaults to ","; use "\t" for TSV.
- Returns:
A new in-memory CTable containing all rows from the CSV file.
- Return type:
CTable
- Raises:
TypeError – If row_cls is not a dataclass.
ValueError – If a row has a different number of fields than the schema.
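The single-pass, per-column reading strategy can be sketched with the standard library. This is not the `from_csv()` implementation: `read_columns` and the `Row` fields are hypothetical, and the final bulk write into a pre-allocated array is left out.

```python
import csv
import io
from dataclasses import dataclass, fields, is_dataclass
from typing import get_type_hints


@dataclass
class Row:
    id: int
    score: float
    name: str


def read_columns(stream, row_cls, *, header=True, sep=","):
    """Read CSV rows into per-column Python lists (hypothetical helper, not blosc2 code)."""
    if not (isinstance(row_cls, type) and is_dataclass(row_cls)):
        raise TypeError("row_cls must be a dataclass")
    hints = get_type_hints(row_cls)
    names = [f.name for f in fields(row_cls)]
    columns = {name: [] for name in names}
    reader = csv.reader(stream, delimiter=sep)
    if header:
        next(reader, None)  # skipped: column order always comes from row_cls
    for line_no, record in enumerate(reader, start=1):
        if len(record) != len(names):
            raise ValueError(f"row {line_no}: expected {len(names)} fields, got {len(record)}")
        for name, cell in zip(names, record):
            columns[name].append(hints[name](cell))  # convert using the declared type
    return columns


data = "id,score,name\n1,0.5,ana\n2,1.5,bo\n"
columns = read_columns(io.StringIO(data), Row)
print(columns)
```

Each resulting list could then be written to its column container in one assignment, which is the point of collecting values per column rather than per row.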
Parquet interoperability¶
Parquet import/export is intended as logical data interchange between Parquet
and Blosc2 CTable, not as exact preservation of Parquet’s physical layout. For
example, Parquet files whose top-level schema is an unnamed list<struct<...>>
may be imported as a regular CTable whose rows are the list elements and whose
nested scalar fields are exposed as ordinary dotted columns. Exporting such a
table writes a valid logical Parquet table, but does not attempt to reconstruct
the original unnamed root-list grouping, row groups, encoding choices, or file
metadata exactly.
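The unnamed root-list case can be pictured in plain Python, with lists of dicts standing in for Parquet rows. `explode_root_list` is a hypothetical helper and the field names are made up; the sketch only shows why the original grouping is not recoverable after import.

```python
def explode_root_list(parquet_rows, max_rows=None):
    """Yield one flat record per list element (illustrative only, not blosc2 code)."""
    emitted = 0
    for elements in parquet_rows:
        for element in elements:
            # The limit counts flattened element rows, not source rows.
            if max_rows is not None and emitted >= max_rows:
                return
            yield element
            emitted += 1


# Two source rows holding 2 and 1 elements respectively.
parquet_rows = [
    [{"id": 1, "fare": 12.5}, {"id": 2, "fare": 8.0}],
    [{"id": 3, "fare": 4.0}],
]
rows = list(explode_root_list(parquet_rows))
print(len(rows))
print([row["id"] for row in rows])
```

After flattening there are three ordinary rows and no record of which source row each element came from, which is why an export cannot rebuild the original grouping.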
Null policy¶
Nullable scalar CTable columns are represented with per-column sentinel values,
not native validity bitmaps. When CTable has to infer those sentinels, the
selection can be customized with NullPolicy and scoped with
null_policy():
policy = blosc2.NullPolicy(
    signed_int_strategy="max",
    string_value="<NULL>",
    column_null_values={"user_id": -1, "country": "NA"},
)
with blosc2.null_policy(policy):
    table = blosc2.CTable.from_parquet("data.parquet")
The same policy is used by explicit nullable schema specs when no
null_value is supplied:
from dataclasses import dataclass
@dataclass
class Row:
    user_id: int = blosc2.field(blosc2.int64(nullable=True))
    country: str = blosc2.field(blosc2.string(nullable=True))
with blosc2.null_policy(policy):
    table = blosc2.CTable(Row)
Sentinels are resolved in this order: explicit null_value in the schema,
NullPolicy.column_null_values for a matching column, then the type-wide
NullPolicy default. Columns without nullable=True or an explicit
null_value are not nullable.
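The resolution order can be written down as a short function. This is a sketch of the documented precedence only: `resolve_sentinel` and its parameters are hypothetical, and the type-wide default passed in below (the largest int64) is an assumed value for a "max" strategy.

```python
_MISSING = object()  # distinguishes "no explicit null_value" from a real value


def resolve_sentinel(name, *, explicit=_MISSING, column_null_values=None, type_default=None):
    """Pick a null sentinel following the documented precedence (illustrative only)."""
    if explicit is not _MISSING:
        return explicit                      # 1. explicit null_value in the schema
    if column_null_values and name in column_null_values:
        return column_null_values[name]      # 2. per-column value from the policy
    return type_default                      # 3. type-wide policy default


overrides = {"user_id": -1, "country": "NA"}
int64_max = 2**63 - 1  # assumed type-wide default for this example

print(resolve_sentinel("user_id", explicit=0, column_null_values=overrides, type_default=int64_max))
print(resolve_sentinel("user_id", column_null_values=overrides, type_default=int64_max))
print(resolve_sentinel("amount", column_null_values=overrides, type_default=int64_max))
```

A dedicated marker object is used for the missing case so that any real value, including 0, can still be passed as an explicit sentinel.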
blosc2.NullPolicy | Default sentinels for inferred CTable scalar nulls. |
blosc2.null_policy | Temporarily set the default policy for CTable null sentinel inference. |
blosc2.get_null_policy | Return the current default null policy. |
- class blosc2.NullPolicy(string_value: str = '__BLOSC2_NULL__', bytes_value: bytes = b'__BLOSC2_NULL__', float_value: float = nan, bool_value: int = 255, signed_int_strategy: Literal['min', 'max'] = 'min', unsigned_int_strategy: Literal['min', 'max'] = 'max', timestamp_value: int = -9223372036854775808, column_null_values: Mapping[str, Any] = <factory>)[source]¶
Default sentinels for inferred CTable scalar nulls.
CTable nullable scalar columns are represented with per-column sentinel values. This policy is used when CTable has to infer those sentinels, such as when importing nullable scalar Arrow or Parquet columns without an explicit column-level null sentinel. The selected sentinel is stored in the resulting CTable schema, so existing tables remain self-describing.
Examples
Use blosc2.null_policy() to apply a policy while creating a CTable from data with nullable scalar columns:
policy = blosc2.NullPolicy(
    signed_int_strategy="max",
    string_value="<NULL>",
    column_null_values={"user_id": -1, "country": "NA"},
)
with blosc2.null_policy(policy):
    table = blosc2.CTable.from_parquet("data.parquet")
The same policy is used for explicit nullable schema specs:
@dataclass
class Row:
    user_id: int = blosc2.field(blosc2.int64(nullable=True))
    country: str = blosc2.field(blosc2.string(nullable=True))
with blosc2.null_policy(policy):
    table = blosc2.CTable(Row)
column_null_values takes precedence over the type-wide defaults in the policy. This is useful when a particular column needs a sentinel that is known not to collide with its real values.
Methods
sentinel_for_arrow_type(pa, pa_type)
Return the default sentinel for pa_type, or None if unsupported.
- blosc2.null_policy(policy: NullPolicy)¶
Temporarily set the default policy for CTable null sentinel inference.
- blosc2.get_null_policy() NullPolicy[source]¶
Return the current default null policy.
Attributes¶
CTable.col_names | Ordered list of stored column names. |
CTable.computed_columns | Read-only view of the computed-column definitions. |
CTable.ncols | Total number of columns, including computed (virtual) columns. |
CTable.cbytes | Total compressed size in bytes (all columns + valid_rows mask). |
CTable.nbytes | Total uncompressed size in bytes (all columns + valid_rows mask). |
CTable.schema | The compiled schema that drives this table's columns and validation. |
CTable.base | Parent table when this instance is a row-filter or column-projection view (created by where(), select(), or view()). |
- CTable.col_names: list[str]¶
Ordered list of stored column names. Computed columns are not included; access those via computed_columns.
- property CTable.computed_columns: dict[str, dict]¶
Read-only view of the computed-column definitions.
Each value is a dict with keys expression, col_deps, lazy (blosc2.LazyExpr), and dtype.
- property CTable.nrows: int¶
- property CTable.ncols: int¶
Total number of columns, including computed (virtual) columns.
- property CTable.cbytes: int¶
Total compressed size in bytes (all columns + valid_rows mask).
- property CTable.nbytes: int¶
Total uncompressed size in bytes (all columns + valid_rows mask).
- property CTable.schema: CompiledSchema¶
The compiled schema that drives this table’s columns and validation.
- CTable.base: CTable | None¶
Parent table when this instance is a row-filter or column-projection view (created by where(), select(), or view()). None for top-level tables. Structural mutations such as add_column() and drop_column() are blocked on views.
Inserting data¶
| append() | Append a single row to the table. |
| extend() | Append multiple rows at once. |
- CTable.append(data: list | void | ndarray) None[source]¶
Append a single row to the table.
data may be a list, tuple, numpy.void, or structured numpy.ndarray whose fields match the schema column order. Materialized columns whose values are omitted are auto-filled from their recorded expression. Raises ValueError if the table is read-only or a view.
For tables with nested (dotted) column names the row dict may be supplied either as a flat mapping of dotted keys or as a nested dict that mirrors the original struct shape. Both are accepted and automatically flattened to the physical dotted leaf names:
# flat dotted keys
t.append({"trip.begin.lon": -87.6, "trip.begin.lat": 41.8, "payment.fare": 12.5})
# original nested dict (auto-flattened)
t.append({"trip": {"begin": {"lon": -87.6, "lat": 41.8}}, "payment": {"fare": 12.5}})
- CTable.extend(data: list | CTable | Any, *, validate: bool | None = None) None[source]¶
Append multiple rows at once.
data may be:
- a dict of arrays {"col": array, ...}: all arrays must have the same length; omitted columns are filled from their declared default; columns with no default declared must be provided;
- a list of rows, each compatible with append();
- another CTable: columns are matched by name.
Pass validate=False to skip per-row Pydantic validation on trusted bulk imports. Raises ValueError if the table is read-only or a view.
For tables with nested (dotted) column names both the dict-of-arrays and list-of-dicts forms accept the original nested dict shape and auto-flatten it to physical dotted leaf names:
# nested dict of arrays
t.extend({
    "trip": {"begin": {"lon": lons, "lat": lats}},
    "payment": {"fare": fares},
})
# list of nested dicts
t.extend([
    {"trip": {"begin": {"lon": -87.6, "lat": 41.8}}, "payment": {"fare": 12.5}},
    {"trip": {"begin": {"lon": -87.5, "lat": 41.7}}, "payment": {"fare": 8.0}},
])
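The auto-flattening of nested row dicts into dotted leaf names can be pictured with a few lines of plain Python. This is only an illustrative sketch: `flatten_row` is a hypothetical helper written for this example, not a blosc2 function, and the library's own flattening may differ in detail.

```python
from typing import Any


def flatten_row(row: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten a nested dict into a flat dict keyed by dotted leaf names.

    Hypothetical helper for illustration only; not part of blosc2.
    """
    flat: dict[str, Any] = {}
    for key, value in row.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into sub-structs, extending the dotted prefix.
            flat.update(flatten_row(value, name))
        else:
            # Leaf value: store it under its full dotted name.
            flat[name] = value
    return flat


nested = {"trip": {"begin": {"lon": -87.6, "lat": 41.8}}, "payment": {"fare": 12.5}}
flat = flatten_row(nested)
print(flat)
```

A row that is already keyed by dotted names passes through this helper unchanged, which mirrors the statement that both shapes are accepted.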
Querying¶
Boolean expressions¶
Use bitwise operators (&, |, ~) or string expressions for
row-wise boolean logic. Python’s logical operators and, or and
not cannot be overloaded and therefore do not build lazy column
expressions.
Use column expressions with explicit parentheses around comparisons:
t.where((t.amount > 100) & (t.region == "North"))
t.where(~t.returned)
or use string expressions when that reads better:
t.where("amount > 100 and region == 'North'")
t.where("not returned")
t["not returned"]
The following three forms for negating a boolean column are equivalent:
t.where(~t.returned), t.where("not returned"), and
t["not returned"].
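The reason `and`, `or` and `not` cannot build lazy expressions is that Python evaluates them by calling `bool()` on the operands, whereas `&`, `|` and `~` dispatch to overloadable methods. The toy class below sketches that mechanism; `Pred` is a hypothetical stand-in invented for this example and is not how blosc2 implements its expressions.

```python
class Pred:
    """Toy stand-in for a lazy boolean column expression (hypothetical)."""

    def __init__(self, text: str):
        self.text = text

    def __and__(self, other: "Pred") -> "Pred":
        # `a & b` calls this method, so a new lazy node can be returned.
        return Pred(f"({self.text}) & ({other.text})")

    def __or__(self, other: "Pred") -> "Pred":
        return Pred(f"({self.text}) | ({other.text})")

    def __invert__(self) -> "Pred":
        # `~a` calls this method.
        return Pred(f"~({self.text})")

    def __bool__(self) -> bool:
        # `a and b`, `a or b` and `not a` only ever call bool(a);
        # there is no hook that could return a new expression instead.
        raise TypeError("use &, | or ~ instead of and, or, not")


big = Pred("amount > 100")
north = Pred("region == 'North'")

combined = big & north
negated = ~big

try:
    big and north
    and_failed = False
except TypeError:
    and_failed = True

print(combined.text, negated.text, and_failed)
```

The explicit parentheses around comparisons matter because `&` and `|` bind more tightly than `>` or `==` in Python, so `t.amount > 100 & t.region == "North"` would group `100 & t.region` first.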
Indexing & projection¶
CTable indexing is type-driven:
t["amount"] # column access
t[3] # one row as a namedtuple-like object
t[3:8] # row view
t[[1, 4, 7]] # gathered-row view
t[mask] # filtered row view
t[t.amount > 100] # LazyExpr filtered row view, like where()
t[["region", "amount"]] # projected column view
String keys first try exact column-name lookup. If the string is not a
column name, it is interpreted as a boolean expression and behaves like
CTable.where(). Boolean LazyExpr and boolean
NDArray keys also behave like CTable.where(), so computed
column predicates such as t[t.temperature_f > 70] are supported.
For explicit filtered projection, use:
t.where("amount > 100", columns=["region", "amount"])
When a NumPy structured array is needed, materialize explicitly:
np.asarray(t[:10])
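The type-driven rules above, and in particular the "column name first, expression second" treatment of string keys, can be summarised as a small dispatch function. This is a simplified sketch under stated assumptions: `classify_key` is hypothetical, it covers only plain Python key types (not boolean masks or lazy expressions), and it says nothing about how CTable implements `__getitem__` internally.

```python
def classify_key(key, column_names):
    """Describe how an indexing key would be interpreted (hypothetical sketch)."""
    if isinstance(key, str):
        # Exact column-name lookup wins; any other string is a predicate.
        return "column" if key in column_names else "expression filter"
    if isinstance(key, int):
        return "single row"
    if isinstance(key, slice):
        return "row view"
    if isinstance(key, list) and key:
        if all(isinstance(k, str) for k in key):
            return "column projection"
        if all(isinstance(k, int) for k in key):
            return "gathered rows"
    raise TypeError(f"unsupported key in this sketch: {key!r}")


cols = ["region", "amount", "returned"]
print(classify_key("amount", cols))              # a stored column name
print(classify_key("not returned", cols))        # not a column name
print(classify_key(["region", "amount"], cols))  # list of names
```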
| where() | Return a row-filtered view matching a boolean predicate. |
| view() | Return a row-filter view backed by a boolean mask array without copying data. |
| select() | Return a column-projection view exposing only cols. |
| head() | Return a view of the first N live rows (default 5). |
| tail() | Return a view of the last N live rows (default 5). |
| sample() | Return a read-only view of n randomly chosen live rows. |
| sort_by() | Return a copy of the table sorted by one or more columns. |
| iter_sorted() | Iterate rows in sorted order without materializing a full copy. |
| group_by() | Return a deferred group-by object for this table. |
- CTable.where(expr_result: str | ndarray | NDArray | LazyExpr | Column, *, columns: list[str] | tuple[str, ...] | None = None) CTable[source]¶
Return a row-filtered view matching a boolean predicate.
Signature:
where(expr_result, *, columns=None) -> CTable
The predicate can be supplied as a boolean blosc2.LazyExpr, a boolean blosc2.NDArray, a boolean NumPy array, a boolean Column, or a string expression evaluated against this table’s columns. String expressions can reference stored and computed columns directly by name.
The returned object is a CTable view sharing the original column data. The row-selection mask is evaluated immediately and intersected with the table’s current live rows; selected column data is not copied.
- Parameters:
expr_result¶ – Boolean predicate selecting rows. Strings are converted to a lazy expression with table columns as operands, e.g. "value * category >= 150". Column objects can also be used in Python expressions, e.g. (t.value * t.category) >= 150.
- Returns:
A view over the same columns containing only rows where the predicate is true and the source row is live. When columns is provided, the returned view is additionally projected to that ordered subset of columns.
- Return type:
CTable
- Raises:
TypeError – If expr_result does not evaluate to a boolean Blosc2/NumPy array or lazy expression.
Examples
Filter using a string expression:
view = t.where("value * category >= 150")
slim = t.where("value * category >= 150", columns=["value", "category"])
Filter using column arithmetic:
view = t.where((t.value * t.category) >= 150)
Blosc2 lazy functions can be used in column expressions:
view = t.where(((t.value + 2) * blosc2.sin(t.category)) >= 10)
For column names that are not valid Python identifiers, use item access:
view = t.where((t["unit price"] * t["quantity"]) > 100)
For tables with nested (dotted) column names, dotted leaf names and attribute-chain proxies work in both string and expression forms:
view = t.where("trip.begin.lon > -87.7 and payment.fare > 10")
view = t.where(t.trip.begin.lon > -87.7)
Notes
Use bitwise operators (&, |, ~) or string expressions for element-wise boolean logic. Python’s logical operators and, or and not cannot be overloaded and therefore do not build lazy column expressions.
Use:
t.where((t.x > 0) & (t.y < 10))
t.where(~t.returned)
t.where("not returned")
not:
t.where((t.x > 0) and (t.y < 10))
t.where(not t.returned)
- CTable.view(new_valid_rows)[source]¶
Return a row-filter view backed by a boolean mask array without copying data.
- CTable.select(cols: list[str]) CTable[source]¶
Return a column-projection view exposing only cols.
The returned object shares the underlying NDArrays with this table (no data is copied). Row filtering and value writes work as usual; structural mutations (add/drop/rename column, append, …) are blocked.
- Parameters:
cols¶ –
Ordered list of column names to keep. For tables with nested (dotted) column names, a struct-prefix name automatically expands to all descendant leaves:
t.select(["trip.begin"])  # expands to trip.begin.lon, trip.begin.lat
t.select(["trip"])  # expands to all trip.* leaves
- Raises:
KeyError – If any name in cols is not a column of this table (and does not match any struct prefix).
ValueError – If cols is empty.
- CTable.sample(n: int, *, seed: int | None = None) CTable[source]¶
Return a read-only view of n randomly chosen live rows.
- CTable.sort_by(cols: str | list[str], ascending: bool | list[bool] = True, *, inplace: bool = False) CTable[source]¶
Return a copy of the table sorted by one or more columns.
- Parameters:
cols¶ –
Column name or list of column names to sort by. When multiple columns are given, the first is the primary key, the second is the tiebreaker, and so on. For tables with nested (dotted) column names, pass the dotted leaf name directly:
t.sort_by("trip.begin.lon")
t.sort_by(["trip.begin.lon", "payment.fare"], ascending=[True, False])
ascending¶ – Sort direction. A single bool applies to all keys; a list must have the same length as cols.
inplace¶ – If True, rewrite the physical data in place and return self (like compact() but sorted). If False (default), return a new in-memory CTable leaving this one untouched.
- Raises:
ValueError – If called on a view or a read-only table when inplace=True.
KeyError – If any column name is not found.
TypeError – If a column used as a sort key does not support ordering (e.g. complex numbers).
- CTable.iter_sorted(cols: str | list[str], ascending: bool | list[bool] = True, *, start: int | None = None, stop: int | None = None, step: int | None = None, batch_size: int = 4096)[source]¶
Iterate rows in sorted order without materializing a full copy.
Uses a FULL index when available (no sort needed); otherwise falls back to np.lexsort on live physical positions. Yields namedtuple-like row objects in the same way as __iter__.
The sorted positions array is stored as a compressed blosc2.NDArray to keep RAM usage low for large tables. batch_size positions are decompressed at a time during iteration.
- Parameters:
cols¶ – Column name or list of column names to sort by.
ascending¶ – Sort direction. A single bool applies to all keys; a list must have the same length as cols.
start¶, stop¶, step¶ – Optional slice applied to the sorted sequence before iteration. E.g. stop=10 yields only the top-10 rows; step=2 yields every other row in sorted order.
batch_size¶ – Number of positions decompressed per iteration step. Larger values reduce decompression overhead; smaller values use less transient RAM. Default is 4096.
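The np.lexsort fallback and the slice-then-batch iteration can be sketched with NumPy alone. The example below is an assumption-laden illustration: `iter_positions` is a hypothetical helper, the positions are kept in a plain NumPy array rather than a compressed blosc2.NDArray, and descending order is obtained by negating a numeric key, which is one common approach and not necessarily what the library does.

```python
import numpy as np

lon = np.array([-87.6, -87.5, -87.6, -87.7])
fare = np.array([12.5, 8.0, 20.0, 5.0])

# np.lexsort treats the LAST key as the primary key, so keys are passed in
# reverse priority. Negating a numeric key yields descending order for it.
order = np.lexsort((-fare, lon))  # lon ascending, then fare descending


def iter_positions(positions, *, start=None, stop=None, step=None, batch_size=4096):
    """Yield sorted positions in batches after slicing the sorted sequence.

    Hypothetical helper for illustration only.
    """
    sliced = positions[start:stop:step]
    for i in range(0, len(sliced), batch_size):
        yield sliced[i : i + batch_size]


# Take the first two rows in sorted order, one position per batch.
top2 = [int(p) for batch in iter_positions(order, stop=2, batch_size=1) for p in batch]
print(order.tolist(), top2)
```

Slicing before batching means that `stop` bounds how many positions are ever visited, independent of `batch_size`.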
- CTable.group_by(keys: str | Sequence[str], *, sort: bool = False, dropna: bool = True, engine: str = 'auto', chunk_size: int | None = None)[source]¶
Return a deferred group-by object for this table.
- Parameters:
keys¶ – Column name or sequence of column names to group by.
sort¶ – If True, sort the result by the group keys. The default False preserves the hash aggregation order and is usually faster.
dropna¶ – If True (default), rows with null/NaN group keys are skipped. If False, null/NaN keys form their own group.
engine¶ – Execution engine. Phase 1 accepts "auto" and uses the NumPy chunked implementation.
chunk_size¶ – Optional number of physical rows processed per chunk.
- Returns:
A lightweight deferred operation builder. Call methods such as .size(), .count(column) or .agg({...}) to materialize a grouped result as a new CTable.
- Return type:
CTableGroupBy
Group-by reductions¶
CTable.group_by() returns a lightweight deferred group-by object. It is
not a table view; methods such as size(),
count(), sum(),
argmax(), and agg()
materialize a new CTable with
one row per group:
by_city = t.group_by("city", sort=True)
counts = by_city.size() # row count per city / COUNT(*)
non_null = by_city.count("sales") # non-null sales count / COUNT(sales)
totals = by_city.sum("sales") # equivalent to agg({"sales": "sum"})
means = by_city.mean("sales")
mins = by_city.min("sales")
maxs = by_city.max("sales")
min_rows = by_city.argmin("sales") # logical row position of min sales
max_rows = by_city.argmax("sales") # logical row position of max sales
Grouped results are in-memory by default. Pass urlpath= to a terminal
method to write the result as a persistent CTable:
totals = by_city.sum("sales", urlpath="sales_by_city.b2d")
For array-oriented grouped reductions without a CTable, see
blosc2.group_reduce().
- class blosc2.CTableGroupBy(table: CTable, keys: str | Sequence[str], *, sort: bool = False, dropna: bool = True, engine: str = 'auto', chunk_size: int | None = None)[source]¶
Deferred group-by operation returned by CTable.group_by().
The object stores the source table, grouping keys, and execution options. It is not a CTable view and does not materialize grouped data until a terminal method such as size(), count(), or agg() is called.
Methods
| agg(aggregations, *[, urlpath]) | Aggregate value columns per group. |
| argmax(column, *[, urlpath]) | Return logical row positions of maximum non-null column values per group. |
| argmin(column, *[, urlpath]) | Return logical row positions of minimum non-null column values per group. |
| count(column, *[, urlpath]) | Return non-null value counts for column per group. |
| max(column, *[, urlpath]) | Return maximum values of column per group. |
| mean(column, *[, urlpath]) | Return means of column per group. |
| min(column, *[, urlpath]) | Return minimum values of column per group. |
| size(*[, urlpath]) | Return row counts per group as a new CTable. |
| sum(column, *[, urlpath]) | Return sums of column per group. |
- agg(aggregations: Mapping[str, str | Sequence[str]], *, urlpath: str | None = None)[source]¶
Aggregate value columns per group.
- Parameters:
aggregations¶ – Mapping from input column name to an aggregation name or list of names. Supported operations in Phase 1 are "count", "sum", "mean", "min", "max", "argmin", "argmax" and the special row-count spelling {"*": "size"}.
- argmax(column: str, *, urlpath: str | None = None)[source]¶
Return logical row positions of maximum non-null column values per group.
Ties keep the first row in the grouped input table or view. Groups with no non-null values for column receive
-1.
- argmin(column: str, *, urlpath: str | None = None)[source]¶
Return logical row positions of minimum non-null column values per group.
Ties keep the first row in the grouped input table or view. Groups with no non-null values for column receive
-1.
- count(column: str, *, urlpath: str | None = None)[source]¶
Return non-null value counts for column per group.
This is equivalent to SQL COUNT(column) and to group_by(...).agg({column: "count"}).
- max(column: str, *, urlpath: str | None = None)[source]¶
Return maximum values of column per group.
This is equivalent to
group_by(...).agg({column: "max"}).
- mean(column: str, *, urlpath: str | None = None)[source]¶
Return means of column per group.
This is equivalent to
group_by(...).agg({column: "mean"}).
- min(column: str, *, urlpath: str | None = None)[source]¶
Return minimum values of column per group.
This is equivalent to
group_by(...).agg({column: "min"}).
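The difference between size() and count(), and the tie and empty-group rules of argmax(), can be checked against a plain-Python reference. This sketch is only a model of the documented semantics: nulls are represented as `None` (an assumption made for readability; the table itself uses its own null representation), and the "logical row position" is simply the position in the input list.

```python
rows = [
    {"city": "Oslo", "sales": 10.0},  # position 0
    {"city": "Rome", "sales": None},  # position 1 (null sales)
    {"city": "Oslo", "sales": 30.0},  # position 2
    {"city": "Lima", "sales": None},  # position 3 (only row for Lima)
    {"city": "Oslo", "sales": 30.0},  # position 4 (ties with position 2)
    {"city": "Rome", "sales": 7.0},   # position 5
]

size, count, argmax, best = {}, {}, {}, {}
for pos, row in enumerate(rows):
    city, sales = row["city"], row["sales"]
    size[city] = size.get(city, 0) + 1   # COUNT(*): every row counts
    count.setdefault(city, 0)
    argmax.setdefault(city, -1)          # -1 until a non-null value is seen
    if sales is None:
        continue                         # nulls are ignored by count/argmax
    count[city] += 1                     # COUNT(sales): non-null rows only
    if argmax[city] == -1 or sales > best[city]:
        # Strict ">" keeps the first row when values tie.
        best[city] = sales
        argmax[city] = pos

print(size, count, argmax)
```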
Mutations¶
In addition to physical schema changes such as CTable.add_column(),
CTables can host computed columns backed by a lazy expression over stored
columns. Computed columns are read-only, use no extra storage, participate in
display, filtering, sorting, and aggregates, and are persisted across
CTable.save(), CTable.load(), and CTable.open().
When a computed result should become a normal stored column, use
CTable.materialize_computed_column(). The materialized column is a stored
snapshot that can be indexed like any other stored column. New rows inserted
later via CTable.append() or CTable.extend() auto-fill omitted
materialized-column values from the recorded expression metadata.
| delete() | Mark one or more rows as deleted (tombstone deletion). |
| compact() | Physically rewrite every column array keeping only live rows. |
| add_column() | Add a new column filled from the default declared in spec. |
| add_computed_column() | Add a read-only virtual column computed from stored columns. |
| materialize_computed_column() | Materialize a computed column into a new stored snapshot column. |
| drop_computed_column() | Remove a computed column from the table. |
| drop_column() | Remove a column from the table. |
| rename_column() | Rename a column. |
- CTable.delete(ind: int | slice | str | Iterable) None[source]¶
Mark one or more rows as deleted (tombstone deletion).
ind may be a logical row index (int), a slice, or an iterable of logical indices. Deleted rows are excluded from all subsequent queries and aggregates. Physical storage is not reclaimed until compact() is called. Raises ValueError if the table is read-only or a view.
- CTable.compact()[source]¶
Physically rewrite every column array keeping only live rows.
Closes the gaps left by prior delete() calls by shuffling live data to the front of each column array. The underlying NDArray allocations are not resized; each column retains its original capacity. To actually reclaim memory, use copy() with compact=True instead, which allocates fresh arrays sized to the live row count. All existing indexes are dropped and must be recreated afterwards. Raises ValueError if the table is read-only or a view.
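Tombstone deletion followed by in-place compaction can be modelled with a data list and a parallel validity mask. The class below is a toy model written for this example (`ToyColumnStore` is hypothetical and uses Python lists, not NDArray); it only illustrates why logical and physical positions diverge after a delete and why compact() keeps the original capacity.

```python
class ToyColumnStore:
    """Toy single-column store with a valid-rows mask (hypothetical)."""

    def __init__(self, values, capacity):
        spare = capacity - len(values)
        self.data = list(values) + [None] * spare
        self.valid = [True] * len(values) + [False] * spare

    def live_positions(self):
        # Physical positions of rows that are still live.
        return [i for i, ok in enumerate(self.valid) if ok]

    def rows(self):
        return [self.data[i] for i in self.live_positions()]

    def delete(self, logical_index):
        # Map the logical index to a physical slot and flip its mask bit;
        # the stored value itself is left in place (a tombstone).
        physical = self.live_positions()[logical_index]
        self.valid[physical] = False

    def compact(self):
        # Move live values to the front; capacity stays the same.
        live = self.rows()
        spare = len(self.data) - len(live)
        self.data = live + [None] * spare
        self.valid = [True] * len(live) + [False] * spare


store = ToyColumnStore([10, 20, 30, 40], capacity=6)
store.delete(1)              # removes 20 logically; it still occupies slot 1
after_first = store.rows()
slot_1_before_compact = store.data[1]
store.delete(1)              # logical index 1 now refers to 30
store.compact()
print(after_first, store.rows(), store.data)
```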
- CTable.add_column(name: str, spec: SchemaSpec | Field) None[source]¶
Add a new column filled from the default declared in spec.
- Parameters:
name¶ – Column name. Must follow the same naming rules as schema fields.
spec¶ – A schema descriptor such as b2.int64(ge=0) or a field descriptor such as b2.field(b2.int64(ge=0), default=0). When the table already has live rows, use blosc2.field(...) with a default declared so those rows can be backfilled.
- Raises:
ValueError – If the table is read-only, is a view, the column already exists, or a non-empty table is given a column with no default declared.
TypeError – If a declared default cannot be coerced to spec’s dtype.
- CTable.add_computed_column(name: str, expr: str | LazyExpr | Callable[[dict[str, Any]], LazyExpr], *, dtype: dtype | None = None) None[source]¶
Add a read-only virtual column computed from stored columns.
A computed column has no physical storage. It is backed by a blosc2.LazyExpr and is evaluated when values are read, filtered, displayed, exported, or aggregated. Because it is virtual, it is read-only, cannot be indexed directly, and is not supplied in append()/extend() inputs. To store and optionally index a computed result, use add_generated_column() or materialize an existing computed column with materialize_computed_column().
Supported signatures are:
add_computed_column(name, "price * qty", dtype=None)
add_computed_column(name, lazy_expr, dtype=None)
add_computed_column(name, lambda cols: cols["price"] * cols["qty"], dtype=None)
- Parameters:
name¶ – Name of the virtual computed column. It must be a valid column name and must not collide with an existing stored or computed column.
expr¶ –
Definition of the virtual column. Accepted forms:
- str: scalar expression over stored scalar columns, e.g. "price * qty".
- blosc2.LazyExpr: lazy expression over stored columns of this table.
- callable: called as expr(self._cols) and must return a blosc2.LazyExpr over stored columns of this table.
Expressions must depend only on stored columns of this table; computed columns cannot depend on other computed columns in this version. Fixed-shape ndarray columns are not accepted in computed column expressions yet. For row-wise ndarray projections or reductions, use add_generated_column() with values=t.ndarray_col.row_transformer....
dtype¶ – Optional dtype override for the computed values. When omitted, the dtype is inferred from the resulting blosc2.LazyExpr. This changes the dtype reported by the CTable column wrapper; it does not create physical storage.
Examples
Add a computed column from a string expression and use it like a normal read-only column:
t.add_computed_column("total", "price * qty")
assert t.total[:].shape == (t.nrows,)
Add a computed column from a callable. The callable receives the table’s stored column mapping:
t.add_computed_column(
    "price_with_tax",
    lambda cols: cols["price"] * 1.21,
    dtype=np.float64,
)
Callable expressions can use normal Python logic while still returning a lazy expression:
def total_expr(cols):
    base = cols["price"] * cols["qty"]
    return base * 1.21 if include_tax else base

t.add_computed_column("total", total_expr)
They are also convenient for reusable, parameterized helpers:
def ratio(num, den):
    return lambda cols: cols[num] / cols[den]

t.add_computed_column("margin", ratio("profit", "revenue"))
Computed columns participate in filters and aggregates:
expensive = t.where(t.total > 100)
total_revenue = t.total.sum()
Computed columns are virtual and read-only. Materialize one when a stored snapshot or an indexable column is needed:
t.materialize_computed_column("total", new_name="total_stored")
t.create_index("total_stored")
For maintained stored results, prefer generated columns:
t.add_generated_column(
    "total_stored",
    values="price * qty",
    dtype=blosc2.float64(),
    create_index=True,
)
- Raises:
ValueError – If called on a view or read-only table, if name already exists, or if an expression operand does not reference a stored column of this table.
TypeError – If expr has an unsupported form, does not produce a blosc2.LazyExpr, references unsupported source columns, or if a RowTransformer is passed. Row transformers are only accepted by add_generated_column().
- CTable.materialize_computed_column(name: str, *, new_name: str | None = None, dtype: dtype | None = None, cparams: dict | CParams | None = None) None[source]¶
Materialize a computed column into a new stored snapshot column.
- Parameters:
- Raises:
ValueError – If called on a view, on a read-only table, or if the target name collides with an existing stored or computed column.
KeyError – If name is not a computed column.
TypeError – If dtype is incompatible with the computed values.
- CTable.drop_computed_column(name: str) None[source]¶
Remove a computed column from the table.
- Parameters:
name¶ – Name of the computed column to remove.
- Raises:
KeyError – If name is not a computed column.
ValueError – If called on a view.
- CTable.drop_column(name: str) None[source]¶
Remove a column from the table.
On disk tables the corresponding persisted column leaf is deleted.
- Raises:
ValueError – If the table is read-only, is a view, or name is the last column.
KeyError – If name does not exist.
- CTable.rename_column(old: str, new: str) None[source]¶
Rename a column.
On disk tables the corresponding persisted column leaf is renamed.
Renaming a flat column to a dotted name (e.g. "trip.begin.lon") promotes it to a nested leaf column: it will be stored under the hierarchical path /_cols/trip/begin/lon on disk and can be accessed via t["trip.begin.lon"] or the attribute-chain proxy t.trip.begin.lon. This is the primary way to define nested columns when importing from non-Arrow sources:
t.rename_column("trip_begin_lon", "trip.begin.lon")
t["trip.begin.lon"].mean()  # works as a regular Column
- Raises:
ValueError – If the table is read-only, is a view, or new already exists.
KeyError – If old does not exist.
Indexes¶
CTable indexes are created with CTable.create_index() and returned as
blosc2.Index handles. For tables, Index refers to an entry stored
in the table index catalog and delegates maintenance operations such as
drop(), rebuild(), and compact() back to the owning table. Users
normally only receive these handles from the CTable API; they do not instantiate
them directly.
Indexes can target stored columns or direct expressions over stored columns
via create_index(expression=...). This lets queries reuse indexes for
derived predicates without adding either a computed column or a materialized
stored one. A matching FULL direct-expression index can also be reused by
ordering paths such as CTable.sort_by() when sorting by a computed column
backed by the same expression. OPSI indexes are a separate exact-filtering
tier with a tunable number of iterative ordering cycles; they are not intended
to converge to a completely sorted FULL/CSI index, so use FULL when
globally sorted ordered reuse is required.
| create_index() | Build and register an index for a stored column or table expression. |
| index() | Return the index handle for a stored-column or expression target. |
| indexes | Return a list of blosc2.Index handles for all active indexes. |
| drop_index() | Remove an index and delete any sidecar files. |
| rebuild_index() | Drop and recreate an index with the same parameters. |
| compact_index() | Compact an index, merging any incremental append runs. |
- CTable.create_index(col_name: str | None = None, *, field: str | None = None, expression: str | None = None, operands: dict | None = None, kind: IndexKind = IndexKind.BUCKET, optlevel: int = 5, name: str | None = None, build: str = 'auto', tmpdir: str | None = None, **kwargs) Index[source]¶
Build and register an index for a stored column or table expression.
For tables with nested (dotted) column names, pass the dotted leaf name directly:
t.create_index("trip.begin.lon")
t.where("trip.begin.lon > -87.7").nrows  # index is used automatically
- CTable.index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]¶
Return the index handle for a stored-column or expression target.
- CTable.indexes¶
Return a list of blosc2.Index handles for all active indexes.
- CTable.drop_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) None[source]¶
Remove an index and delete any sidecar files.
- CTable.rebuild_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]¶
Drop and recreate an index with the same parameters.
- CTable.compact_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]¶
Compact an index, merging any incremental append runs.
See blosc2.Index for the returned handle attributes and methods.
Persistence¶
Persist CTables to disk or interchange formats, and restore them later without losing schema information. These methods cover native Blosc2 persistence as well as import/export paths for CSV, Arrow, and Parquet data.
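| load() | Load a persistent table from urlpath into RAM. |
| open() | Open a persistent CTable from urlpath. |
| save() | Persist this table to disk at urlpath. |
| to_b2z() | Write this table to a compact .b2z container. |
| to_b2d() | Write this table to a directory-backed store. |
| to_csv() | Write all live rows to a CSV file. |
| to_arrow() | Convert all live rows to a |
| to_parquet() | Write this table to a Parquet file batch-wise using pyarrow. |
| from_arrow() | Build a CTable from an Arrow schema and iterable of record batches. |
| from_parquet() | Read a Parquet file into a CTable. |
| | Build a |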
- classmethod CTable.load(urlpath: str) CTable[source]¶
Load a persistent table from urlpath into RAM.
The schema is read from the table’s metadata — the original Python dataclass is not required. The returned table is fully in-memory and read/write.
- Parameters:
urlpath¶ – Path to the table root directory.
- Raises:
FileNotFoundError – If urlpath does not contain a CTable.
ValueError – If the metadata at urlpath does not identify a CTable.
- classmethod CTable.open(urlpath: str, *, mode: str = 'r') CTable[source]¶
Open a persistent CTable from urlpath.
- CTable.save(urlpath: str, *, overwrite: bool = False) None[source]¶
Persist this table to disk at urlpath.
This writes a standalone copy and returns None; use copy() directly when the copied CTable object is needed.
Only live rows are written, so the on-disk table is always compacted. A .b2z suffix selects the compact zip-backed format; any other suffix creates a directory-backed store. Use a .b2d suffix for directory-backed stores when possible so the format is clear.
- Parameters:
urlpath¶ – Destination path. Use a .b2z suffix for a compact zip-backed store; any other suffix creates a directory-backed store. A .b2d suffix is recommended for directory-backed stores.
overwrite¶ – If False (default), raise ValueError when urlpath already exists. Set to True to replace an existing table.
- Raises:
ValueError – If urlpath already exists and overwrite=False.
- CTable.to_b2z(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]¶
Write this table to a compact .b2z container.
.b2z is the compact zip-backed CTable format. For persistent, non-view directory-backed tables and compact=False, this uses a fast physical-pack path: the backing TreeStore directory is zipped with already-compressed leaves stored as-is. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns. A .b2d suffix is recommended for directory-backed stores, but not required.
For in-memory tables, views, existing .b2z tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new .b2z store.
Examples
Fast-pack an existing directory-backed table into a compact zip store:
table = blosc2.CTable.open("data.b2d", mode="r")
table.to_b2z("data.b2z", overwrite=True)
table.close()
Materialize a filtered view into a new compact store:
view = table.where(table["score"] > 10)
view.to_b2z("high-score.b2z", overwrite=True)
Force a logical compacted copy, even for a persistent .b2d table:
table.to_b2z("data-compact.b2z", overwrite=True, compact=True)
- CTable.to_b2d(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]¶
Write this table to a directory-backed store.
Directory-backed CTable stores may use any path that does not end in .b2z; using a .b2d suffix is recommended for clarity. For persistent, non-view .b2z tables opened read-only and compact=False, this uses a fast physical-unpack path: the zip members are extracted as already-compressed leaves. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns.
For in-memory tables, views, writable .b2z tables, existing directory-backed tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new directory-backed store.
Examples
Fast-unpack an existing compact zip store into a directory-backed table:
table = blosc2.CTable.open("data.b2z", mode="r")
table.to_b2d("data.b2d", overwrite=True)
table.close()
Materialize a filtered view into a directory-backed store:
view = table.where(table["score"] > 10)
view.to_b2d("high-score.b2d", overwrite=True)
Force a logical compacted copy, even for a persistent .b2z table:
table.to_b2d("data-compact.b2d", overwrite=True, compact=True)
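The suffix rule from save() and the fast-path conditions described for to_b2z() and to_b2d() can be restated as a small decision table. Both functions below are hypothetical helpers written for this summary; they encode only the conditions listed above, assume a case-sensitive suffix comparison, and do not call or reproduce blosc2 internals.

```python
from pathlib import PurePath


def store_kind(urlpath: str) -> str:
    """Pick the store format from the path suffix (assumed case-sensitive)."""
    return "zip" if PurePath(urlpath).suffix == ".b2z" else "directory"


def export_strategy(target_kind, *, source_kind, is_view=False,
                    read_only=False, compact=False):
    """Return which path an export would take under the documented rules.

    source_kind is one of "memory", "directory" or "zip" (labels invented
    for this sketch).
    """
    if is_view or compact or source_kind == "memory":
        return "logical save"
    if target_kind == "zip" and source_kind == "directory":
        return "physical pack"      # to_b2z() fast path
    if target_kind == "directory" and source_kind == "zip" and read_only:
        return "physical unpack"    # to_b2d() fast path
    return "logical save"


print(store_kind("data.b2z"), store_kind("data.b2d"))
print(export_strategy("zip", source_kind="directory"))
print(export_strategy("directory", source_kind="zip", read_only=True))
```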
- CTable.to_csv(path: str, *, header: bool = True, sep: str = ',') None[source]¶
Write all live rows to a CSV file.
Uses Python’s stdlib csv module, so no extra dependency is required. Fixed-shape ndarray column cells are serialised as JSON arrays for readability and shape safety (e.g. "[1.0, 2.0, 3.0]").
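The JSON-array convention for array cells can be reproduced with the standard library alone. The sketch below writes two toy rows (invented for this example) and reads them back; it shows the cell format, not the actual to_csv() implementation.

```python
import csv
import io
import json

# Toy rows: an integer id plus a fixed-shape array cell (made-up data).
rows = [(1, [1.0, 2.0, 3.0]), (2, [4.0, 5.0, 6.0])]

buf = io.StringIO()
writer = csv.writer(buf, delimiter=",")
writer.writerow(["id", "vec"])  # header row
for row_id, vec in rows:
    # JSON-encode the array cell; the csv module quotes it automatically
    # because the encoded text contains the separator.
    writer.writerow([row_id, json.dumps(vec)])

text = buf.getvalue()

# Reading back: json.loads recovers the array from the quoted cell.
reader = csv.reader(io.StringIO(text))
header = next(reader)
restored = [(int(r[0]), json.loads(r[1])) for r in reader]
print(text)
```

Because each array is a single quoted cell, the column count stays fixed regardless of the array length.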
- CTable.to_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, compression: str | None = 'zstd', row_group_size: int | None = None, include_computed: bool = True, **kwargs) None[source]¶
Write this table to a Parquet file batch-wise using pyarrow.
- classmethod CTable.from_arrow(schema, batches, *, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, capacity_hint: int | None = None, string_max_length: int | Mapping[str, int] | None = None, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, list_serializer: Literal['msgpack', 'arrow'] = 'msgpack', object_fallback: bool = False, column_cparams: Mapping[str, dict[str, Any]] | None = None, separate_nested_cols: bool = False) CTable[source]¶
Build a CTable from an Arrow schema and iterable of record batches.
Nested struct flattening: top-level Arrow struct<…> fields are automatically and recursively flattened into dotted leaf columns. For example, a field trip: struct<begin: struct<lon: float64, lat: float64>> becomes two CTable columns trip.begin.lon and trip.begin.lat. Each leaf is stored as an independent compressed NDArray. Row reads via t[i] reconstruct the original nested dict shape. Use t["trip.begin.lon"] or t.trip.begin.lon to access a leaf:
import pyarrow as pa
import blosc2
trip_type = pa.struct([("begin", pa.struct([("lon", pa.float64())]))])
schema = pa.schema([pa.field("trip", trip_type)])
t = blosc2.CTable.from_arrow(schema, batches)
t.col_names  # ['trip.begin.lon']
t["trip.begin.lon"].mean()
t.trip.begin.lon.max()
When string_max_length is None (the default), scalar Arrow string/large_string columns are imported as vlstring() columns and binary/large_binary columns are imported as vlbytes() columns. struct columns that are not flattened (those not containing only scalar leaves) are imported as struct() columns backed by batched variable-length storage. Null values for these variable-length scalar columns are represented as native None with no sentinel needed.
When string_max_length is set to a positive integer, scalar string and binary columns are imported as fixed-width string()/bytes() columns whose dtype is sized to string_max_length characters/bytes. It may also be a mapping from column name to max length; omitted string/binary columns remain vlstring()/vlbytes() columns.
blosc2_batch_size controls how many rows are buffered before BatchArray-backed imported columns (list columns and variable-length scalar columns such as vlstring, vlbytes, struct, and schema-less object columns) are flushed to their backend. Set it to None to keep those columns pending until the final flush.
list_serializer selects the backend serializer for imported list columns. "msgpack" is the default; "arrow" stores Arrow list batches directly and can be much faster for deeply nested list columns.
Unsupported Arrow types raise by default. Pass object_fallback=True to import such columns as schema-less object() columns. This fallback is intentionally not used by from_parquet().
column_cparams optionally maps column names to per-column compression parameters. These override the table-level cparams for fixed-width columns imported from Arrow.
- classmethod CTable.from_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, list_serializer: Literal['msgpack', 'arrow'] = 'arrow', separate_nested_cols: bool = True, max_rows: int | None = None, **kwargs) CTable[source]¶
Read a Parquet file into a CTable.
The Parquet file is streamed batch by batch through pyarrow and then converted into a typed CTable. By default, the result is created in memory, but you can also persist it on disk via urlpath.
This method delegates the actual table construction to CTable.from_arrow(), so Arrow schema handling, nullable-column support, and Blosc2 write tuning follow the same rules as that method.
Nested struct flattening: top-level Parquet struct<…> fields are automatically and recursively flattened into dotted leaf columns, the same as in from_arrow(). For example, a Parquet file that contains a column trip: struct<begin: struct<lon: double, lat: double>> produces two CTable columns trip.begin.lon and trip.begin.lat. Row reads reconstruct the original nested dict shape; individual leaves are accessed via dotted names or attribute-chain proxies:
t = blosc2.CTable.from_parquet("trips.parquet")
t.col_names  # e.g. ['trip.begin.lon', 'trip.begin.lat', ...]
t["trip.begin.lon"].mean()
t.trip.begin.lon.max()
Unsupported Parquet types are not silently imported as schema-less object() columns; they raise so callers can decide how to handle them explicitly.
- Parameters:
path¶ (str or path-like) – Path to the source Parquet file.
columns¶ (list[str] or None, optional) – Subset of columns to read from the Parquet file. If provided, only these columns are loaded and their order in the resulting table matches the order in this list. Column names must be unique.
batch_size¶ (int, optional) – Number of rows per Arrow batch read from the Parquet file. This controls how much data is pulled from the file at a time before being handed off to the CTable builder. Must be greater than 0.
urlpath¶ (str or None, optional) – Destination storage path for the resulting CTable. If None (the default), the table is created in memory. If provided, the table is backed by persistent on-disk storage.
mode¶ (str, optional) – Storage open mode for urlpath. Defaults to "w". This is passed through to CTable.from_arrow().
cparams¶ (object, optional) – Compression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().
dparams¶ (object, optional) – Decompression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().
validate¶ (bool, optional) – Whether to enable extra internal validation while building the table. Defaults to False.
auto_null_sentinels¶ (bool, optional) – If True (default), nullable scalar columns imported from Parquet may automatically receive per-column null sentinel values when needed. Sentinel selection follows the current null-policy rules used by CTable schema handling.
blosc2_batch_size¶ (int or None, optional) – Number of items written to Blosc2 containers per internal write batch. Passed through to CTable.from_arrow().
blosc2_items_per_block¶ (int or None, optional) – Target number of items per internal Blosc2 block. Passed through to CTable.from_arrow(). In general, a larger number of items per block favors compression ratio but makes random access slower.
list_serializer¶ ({"msgpack", "arrow"}, optional) – Serializer used for imported list columns. The default, "arrow", stores Arrow list batches directly and is much faster for deeply nested or list<struct<...>> columns. The tradeoff is that accessing those list columns later requires PyArrow. Use "msgpack" to keep list-column stores independent of PyArrow at read time; it can be smaller for simple lists but is much slower and more memory-intensive for deeply nested data.
separate_nested_cols¶ (bool, optional) – Whether to separate qualifying nested columns during import. Defaults to True. In particular, a single unnamed top-level list<struct<...>> field is treated as a root record stream: each list element becomes a CTable row and struct leaves become ordinary nested CTable columns. Use separate_nested_cols=False when closer fidelity to the original Parquet row/schema shape is more important than the separated column layout.
max_rows¶ (int or None, optional) – Maximum number of rows to import. For ordinary Parquet files this limits Parquet/CTable rows. For unnamed-root list<struct<...>> files imported with separate_nested_cols=True, this limits flattened element rows.
**kwargs¶ – Additional keyword arguments forwarded to pyarrow.parquet.ParquetFile. Use these for Parquet-reader-specific options supported by PyArrow.
- Returns:
A new CTable populated from the Parquet file. The table contains all selected columns and all rows from the file, or at most max_rows rows when max_rows is set. If urlpath is provided, the returned table is disk-backed; otherwise it is in-memory.
- Return type:
CTable
- Raises:
ImportError – If pyarrow is not installed.
ValueError – If batch_size is not greater than 0.
ValueError – If max_rows is negative.
ValueError – If columns contains duplicate names.
Exception – Any exception raised by pyarrow while opening or reading the Parquet file, or by CTable.from_arrow() while converting Arrow data into a CTable.
Examples
Load an entire Parquet file into an in-memory table:
>>> import blosc2
>>> t = blosc2.CTable.from_parquet("data.parquet")
Load only a subset of columns:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     columns=["user_id", "amount", "country"],
... )
Create a disk-backed table while reading in batches:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     batch_size=50_000,
...     urlpath="data.ctable",
... )
Pass additional options through to PyArrow’s Parquet reader:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     memory_map=True,
... )
- classmethod CTable.from_csv(path: str, row_cls, *, header: bool = True, sep: str = ',') CTable[source]¶
Build a CTable from a CSV file.
Schema comes from row_cls (a dataclass); CTable is always typed. All rows are read in a single pass into per-column Python lists, then each column is bulk-written into a pre-allocated NDArray (one slice assignment per column, no extend()).
- Parameters:
path¶ – Source CSV file path.
row_cls¶ – A dataclass whose fields define the column names and types.
header¶ – If True (default), the first row is treated as a header and skipped. Column order in the file must match row_cls field order regardless.
sep¶ – Field delimiter. Defaults to ","; use "\t" for TSV.
- Returns:
A new in-memory CTable containing all rows from the CSV file.
- Return type:
CTable
- Raises:
TypeError – If row_cls is not a dataclass.
ValueError – If a row has a different number of fields than the schema.
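The single-pass read into per-column lists can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the code behind from_csv(): read_columns is a hypothetical helper, it takes an open file object rather than a path, and it converts each field by calling the dataclass field's annotated type.

```python
import csv
import dataclasses
import io


def read_columns(fileobj, row_cls, *, header=True, sep=","):
    """Read CSV text into {column_name: list_of_values} in one pass."""
    if not dataclasses.is_dataclass(row_cls):
        raise TypeError("row_cls must be a dataclass")
    fields = dataclasses.fields(row_cls)
    columns = {f.name: [] for f in fields}
    reader = csv.reader(fileobj, delimiter=sep)
    if header:
        next(reader, None)  # skip the header; column order comes from row_cls
    for lineno, record in enumerate(reader, start=1):
        if len(record) != len(fields):
            raise ValueError(
                f"row {lineno}: expected {len(fields)} fields, got {len(record)}"
            )
        for f, text in zip(fields, record):
            columns[f.name].append(f.type(text))  # e.g. int("1"), float("9.5")
    return columns


@dataclasses.dataclass
class Row:
    id: int
    score: float


columns = read_columns(io.StringIO("id,score\n1,9.5\n2,7.25\n"), Row)
```

Each resulting list could then be written to its column container with a single slice assignment, which is the bulk-write step the description refers to.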
Inspection & statistics¶
Compute common descriptive statistics directly on CTable data without
materializing rows first. These methods operate column-wise on the compressed
representation, making it easy to summarize distributions or measure
relationships between numeric columns.
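| CTable.column_schema(name) | Return the CompiledColumn descriptor for name. |
| CTable.info | Get information about this table. |
| CTable.schema_dict() | Return a JSON-compatible dict describing this table's schema. |
| CTable.describe() | Print a per-column statistical summary. |
| CTable.cov() | Return the covariance matrix as a numpy array. |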
- CTable.column_schema(name: str) CompiledColumn[source]¶
Return the CompiledColumn descriptor for name.
- Raises:
KeyError – If name is not a column in this table.
- CTable.info()¶
Get information about this table.
Examples
>>> print(t.info)
>>> t.info()
- CTable.schema_dict() dict[str, Any][source]¶
Return a JSON-compatible dict describing this table’s schema.
- CTable.describe() None[source]¶
Print a per-column statistical summary.
- Numeric columns (int, float): count, mean, std, min, max.
- Bool columns: count, true-count, true-%.
- String columns: count, min (lex), max (lex), n-unique.
- CTable.cov() ndarray[source]¶
Return the covariance matrix as a numpy array.
Only int, float, and bool columns are supported. Bool columns are cast to int (0/1) before computation. Complex columns raise TypeError.
- Returns:
Shape (ncols, ncols). Column order matches col_names.
- Return type:
numpy.ndarray
- Raises:
TypeError – If any column has an unsupported dtype (complex, string, …).
ValueError – If the table has fewer than 2 live rows (covariance undefined).
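As a worked illustration of the computation, the sketch below builds a covariance matrix in pure Python, casting bool values to 0/1 first. It assumes the sample covariance (division by N-1), which is the numpy.cov default; the normalisation actually used by cov() is not stated here, and covariance_matrix is a hypothetical helper.

```python
def covariance_matrix(columns):
    """Sample covariance (divides by N-1) of equal-length numeric columns."""
    data = [[float(v) for v in col] for col in columns]  # bool -> 0.0 / 1.0
    n = len(data[0]) if data else 0
    if n < 2:
        raise ValueError("covariance needs at least 2 rows")
    means = [sum(col) / n for col in data]
    k = len(data)
    cov = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i, k):
            s = sum(
                (a - means[i]) * (b - means[j])
                for a, b in zip(data[i], data[j])
            )
            cov[i][j] = cov[j][i] = s / (n - 1)  # matrix is symmetric
    return cov


# One int column and one bool column, four rows each.
cov = covariance_matrix([[1, 2, 3, 4], [True, False, True, False]])
```

The result has shape (ncols, ncols) with variances on the diagonal, matching the return shape documented above.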
Column¶
A lazy column accessor returned by table["col_name"] or table.col_name.
All index operations and aggregates apply the table’s tombstone mask
(_valid_rows) so deleted rows are silently excluded.
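The effect of a tombstone mask on indexing can be pictured with plain lists. This is a conceptual sketch only: physical, valid_rows, live_positions, and get_logical are hypothetical names, and the real column works on compressed containers rather than Python lists.

```python
physical = [10, 20, 30, 40, 50]                # values as stored, including deleted rows
valid_rows = [True, False, True, True, False]  # False marks a deleted (tombstoned) row


def live_positions(mask):
    """Physical positions of the rows that are still live."""
    return [i for i, alive in enumerate(mask) if alive]


def get_logical(values, mask, index):
    """Return the value at a logical (live-row) index."""
    return values[live_positions(mask)[index]]


live = [physical[i] for i in live_positions(valid_rows)]
second_live = get_logical(physical, valid_rows, 1)
```

Logical index 1 resolves to physical position 2 because the deleted row at position 1 is skipped, which is why lengths, indexing, and aggregates only ever see live values.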
- class blosc2.Column(table: CTable, col_name: str, mask=None)[source]¶
Column view for a CTable, with vectorized operations and reductions.
- Attributes:
dtype – NumPy dtype of the underlying storage, or None for variable-length columns (vlstring(), vlbytes(), list()).
info – Get information about this column.
info_items – Structured summary items used by info.
is_computed – True if this column is a virtual computed column (read-only).
is_dictionary – True if this column is a dictionary-encoded string column.
is_generated – True if this column is a stored generated/materialized column.
is_list
is_ndarray – True if this column stores fixed-shape N-D array values per row.
is_stale – True if this generated column needs to be refreshed before use.
is_varlen_scalar – True if this column holds variable-length scalar strings or bytes.
item_ndim – Number of per-row item dimensions.
item_shape – Per-row item shape; () for scalar columns.
item_size – Number of scalar values stored in each row item.
ndim – Number of logical dimensions.
null_value – The sentinel value that represents NULL for this column, or None.
row_transformer – Build row-wise projections/reductions for generated columns.
shape – Logical shape of the live column values.
size – Number of live scalar values in the logical column array.
view – Return a ColumnViewIndexer for creating logical sub-views.
Methods
all() – Return True if every live, non-null value is True.
any() – Return True if at least one live, non-null value is True.
argmax([axis, where]) – Index of the maximum live, non-null value.
argmin([axis, where]) – Index of the minimum live, non-null value.
assign(data) – Replace all live values in this column with data.
is_null() – Return a boolean array True where the live value is the null sentinel.
isin(values) – Return a boolean array True where the live value is in values.
iter_chunks([size]) – Iterate over live column values in chunks of size rows.
max([axis, where]) – Maximum live, non-null value.
mean([axis, where]) – Arithmetic mean of all live, non-null values.
min([axis, where]) – Minimum live, non-null value.
norm([ord, axis, where]) – Vector/matrix norm of a fixed-shape ndarray column.
notnull() – Return a boolean array True where the live value is not the null sentinel.
null_count() – Return the number of live rows whose value equals the null sentinel.
read_stale([key]) – Read stored values even when this generated column is marked stale.
std([ddof, axis, where]) – Standard deviation of all live, non-null values (single-pass, Welford's algorithm).
sum([dtype, axis, where, jit, jit_backend]) – Sum of all live, non-null values.
summary() – Return and print a compact summary for this column.
unique() – Return sorted array of unique live, non-null values.
value_counts() – Return a {value: count} dict sorted by count descending.
Special methods
Column.__len__() – Return the number of live (non-deleted) values in this column.
Column.__iter__() – Iterate over live column values in insertion order, skipping deleted rows.
Column.__getitem__(key) – Return values for the given logical index.
Column.__setitem__(key, value) – Set one or more live column values; accepts the same index forms as __getitem__().
- __len__()[source]¶
Return the number of live (non-deleted) values in this column.
- __iter__()[source]¶
Iterate over live column values in insertion order, skipping deleted rows.
- __getitem__(key: int | slice | list | ndarray)[source]¶
Return values for the given logical index.
- int → scalar
- slice → numpy.ndarray
- list / np.ndarray → numpy.ndarray
- bool np.ndarray → numpy.ndarray
For a writable logical sub-view use
view.
- __setitem__(key: int | slice | list | ndarray, value)[source]¶
Set one or more live column values; accepts the same index forms as
__getitem__().
- all() bool[source]¶
Return True if every live, non-null value is True.
Supported dtypes: bool. Null sentinel values are skipped. Short-circuits on the first False found.
- any() bool[source]¶
Return True if at least one live, non-null value is True.
Supported dtypes: bool. Null sentinel values are skipped. Short-circuits on the first True found.
- argmax(axis=None, *, where=None)[source]¶
Index of the maximum live, non-null value.
For fixed-shape ndarray columns, this follows NumPy axis semantics on the logical array of shape
(nrows, *item_shape). For scalar columns, the result is the logical row position within this column (or filtered view).
- argmin(axis=None, *, where=None)[source]¶
Index of the minimum live, non-null value.
For fixed-shape ndarray columns, this follows NumPy axis semantics on the logical array of shape
(nrows, *item_shape). For scalar columns, the result is the logical row position within this column (or filtered view).
- assign(data) None[source]¶
Replace all live values in this column with data.
Works on both full tables and views — on a view, only the rows visible through the view’s mask are overwritten.
- Parameters:
data¶ – List, numpy array, or any iterable. Must have exactly as many elements as there are live rows in this column. Values are coerced to the column’s dtype if possible.
- Raises:
ValueError – If len(data) does not match the number of live rows, or the table is opened read-only.
TypeError – If values cannot be coerced to the column’s dtype.
- is_null() ndarray[source]¶
Return a boolean array True where the live value is the null sentinel.
For varlen scalar columns (vlstring/vlbytes) nullability is represented as native None values, so this returns True wherever the value is None. For dictionary columns, returns True where the code equals the null_code (-1 by default).
- isin(values) ndarray[source]¶
Return a boolean array True where the live value is in values.
For dictionary columns this performs efficient integer-code membership testing (no decoding of all values). Values absent from the dictionary are treated as not-present.
For non-dictionary columns this decodes all live values and tests membership in a set.
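Membership testing on integer codes can be sketched as follows. The layout is an assumption made for illustration (a list of distinct strings plus one integer code per row, with -1 as the null code), and dictionary_isin is a hypothetical helper rather than the library's routine.

```python
def dictionary_isin(codes, dictionary, values, null_code=-1):
    """Boolean mask: True where the row's code maps to one of `values`."""
    lookup = {value: code for code, value in enumerate(dictionary)}
    # Values absent from the dictionary have no code and can never match.
    wanted = {lookup[v] for v in values if v in lookup}
    return [code != null_code and code in wanted for code in codes]


dictionary = ["ES", "FR", "US"]   # distinct values, hypothetical content
codes = [0, 2, -1, 1, 2]          # one code per live row; -1 marks a null
mask = dictionary_isin(codes, dictionary, ["US", "DE"])
```

Only the requested values are translated to codes; the per-row strings are never decoded, which is the saving the dictionary path offers over the set-based path for ordinary columns.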
- iter_chunks(size: int = 65536)[source]¶
Iterate over live column values in chunks of size rows.
Yields numpy arrays of at most size elements each, skipping deleted rows. The last chunk may be smaller than size.
- Parameters:
size¶ – Number of live rows per yielded chunk. Defaults to 65 536.
- Yields:
numpy.ndarray – A 1-D array of up to size live values with this column’s dtype.
Examples
>>> for chunk in t["score"].iter_chunks(size=100_000):
...     process(chunk)
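The chunking contract (at most size live values per chunk, deleted rows skipped, last chunk possibly shorter) can be sketched with a generator. This is a pure-Python illustration, not the library's code: iter_live_chunks is a hypothetical helper that yields lists instead of numpy arrays.

```python
def iter_live_chunks(values, mask, size=65536):
    """Yield lists of at most `size` live values, skipping deleted rows."""
    chunk = []
    for value, alive in zip(values, mask):
        if not alive:
            continue  # deleted rows never count towards the chunk size
        chunk.append(value)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # trailing partial chunk
        yield chunk


mask = [True, True, False, True, True, True, False]
chunks = list(iter_live_chunks(range(7), mask, size=2))
```

With five live values and size=2 the generator produces two full chunks and one final chunk holding the single remaining value.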
- max(axis=None, *, where=None)[source]¶
Maximum live, non-null value.
Supported dtypes: bool, int, uint, float, string, bytes. Strings are compared lexicographically. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included.
- mean(axis=None, *, where=None)[source]¶
Arithmetic mean of all live, non-null values.
Supported dtypes: bool, int, uint, float. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included. Always returns a Python float.
- min(axis=None, *, where=None)[source]¶
Minimum live, non-null value.
Supported dtypes: bool, int, uint, float, string, bytes. Strings are compared lexicographically. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included.
- norm(ord=None, axis=None, *, where=None)[source]¶
Vector/matrix norm of a fixed-shape ndarray column.
The column is treated as a logical array of shape (nrows, *item_shape). For example, axis=1 computes one norm per row for a 1-D item shape.
- notnull() ndarray[source]¶
Return a boolean array True where the live value is not the null sentinel.
- null_count() int[source]¶
Return the number of live rows whose value equals the null sentinel.
Returns 0 in O(1) if no null_value is configured for this column and the column is not a varlen scalar column.
- read_stale(key=slice(None, None, None))[source]¶
Read stored values even when this generated column is marked stale.
This is an explicit escape hatch for inspecting the last materialized values. Normal reads raise for stale generated columns so outdated values are not used accidentally.
- std(ddof: int = 0, axis=None, *, where=None)[source]¶
Standard deviation of all live, non-null values (single-pass, Welford’s algorithm).
- Parameters:
ddof¶ – Delta degrees of freedom. 0 (default) gives the population std; 1 gives the sample std (divides by N-1).
where¶ – Optional boolean predicate. Only rows where the predicate is true, the table row is live, and this column is non-null are included.
Null sentinel values are skipped. Always returns a Python float.
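Welford's single-pass update, with the ddof handling and sentinel skipping described above, can be written in a few lines. This is a minimal sketch and not the library's implementation: welford_std is a hypothetical function, and returning NaN when too few values remain is an assumption.

```python
import math


def welford_std(values, *, ddof=0, null_value=None):
    """Single-pass standard deviation using Welford's running update."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        if null_value is not None and x == null_value:
            continue  # null sentinels do not contribute
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count - ddof <= 0:
        return math.nan  # assumption: undefined result reported as NaN
    return math.sqrt(m2 / (count - ddof))


pop = welford_std([2, 4, 4, 4, 5, 5, 7, 9])
sample = welford_std([2, 4, -1, 4, 4, 5, 5, 7, 9], ddof=1, null_value=-1)
```

The running mean and m2 are updated together, so the data is read once and no second pass over the column is needed.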
- sum(dtype=None, axis=None, *, where=None, jit=None, jit_backend=None)[source]¶
Sum of all live, non-null values.
Returns zero for an empty column or filtered view.
Supported dtypes: bool, int, uint, float, complex. Bool values are counted as 0 / 1. Null sentinel values are skipped.
- Parameters:
dtype¶ – Optional accumulator dtype. When omitted, float columns use np.float64, complex columns use np.complex128, and integer / bool columns use np.int64.
where¶ – Optional boolean predicate. Only rows where the predicate is true, the table row is live, and this column is non-null are included. This enables direct filtered aggregate pushdown, avoiding creation of an intermediate filtered table view.
jit¶ – Optional miniexpr JIT policy passed to the lazy reduction engine.
jit_backend¶ – Optional miniexpr JIT backend. Use "tcc" or "cc".
Examples
Sum values matching a predicate without materializing a filtered view:
total = t["amount"].sum(where=t.category == 3)
Combine several column predicates:
total = t.col2.sum(where=(t.col1 < 300) & (t.col2 < 400))
Nullable sentinel values are skipped automatically:
# Equivalent to summing only live rows where predicate is true and
# t.col2 is not its configured null sentinel.
total = t.col2.sum(where=t.col1 < 300)
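The three conditions that decide whether a row contributes (predicate true, row live, value non-null) can be spelled out in pure Python. This is a conceptual sketch with made-up data; masked_sum is a hypothetical helper, and the real method evaluates the predicate lazily on compressed data rather than on lists.

```python
def masked_sum(values, live, where, null_value=None):
    """Sum values whose row is live, matches `where`, and is not the sentinel."""
    total = 0
    for value, alive, keep in zip(values, live, where):
        if alive and keep and value != null_value:
            total += value
    return total  # stays 0 when nothing qualifies


# Hypothetical data: -1 is the null sentinel of col2, last row is deleted.
col1 = [100, 250, 400, 299, 50]
col2 = [5, -1, 7, 9, 11]
live = [True, True, True, True, False]
total = masked_sum(col2, live, [c < 300 for c in col1], null_value=-1)
```

Here only the first and fourth rows contribute: the second holds the sentinel, the third fails the predicate, and the fifth is deleted.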
- summary() str[source]¶
Return and print a compact summary for this column.
For fixed-shape ndarray columns this includes logical shape, storage, and row-norm statistics when numeric. Scalar columns fall back to
info.
- unique() ndarray[source]¶
Return sorted array of unique live, non-null values.
Null sentinel values are excluded. Processes data in chunks — never loads the full column at once.
- value_counts() dict[source]¶
Return a {value: count} dict sorted by count descending.
Null sentinel values are excluded. Processes data in chunks and never loads the full column at once.
Example
>>> t["active"].value_counts()
{True: 8432, False: 1568}
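Counting chunk by chunk keeps memory bounded by the number of distinct values rather than the column length. The sketch below shows the idea with collections.Counter; value_counts_chunked is a hypothetical helper and the chunks are plain lists standing in for the arrays a column would yield.

```python
from collections import Counter


def value_counts_chunked(chunks, null_value=None):
    """Accumulate counts one chunk at a time; return them by count descending."""
    counts = Counter()
    for chunk in chunks:
        counts.update(v for v in chunk if v != null_value)
    return dict(counts.most_common())  # dicts preserve this insertion order


result = value_counts_chunked([[True, False, True], [True]])
with_nulls = value_counts_chunked([[3, -1, 3], [7, -1]], null_value=-1)
```

The same accumulation pattern gives the sorted unique values as sorted(counts) without holding the whole column in memory.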
- property dtype¶
NumPy dtype of the underlying storage, or None for variable-length columns (vlstring(), vlbytes(), list()).
- property info: _CTableInfoReporter¶
Get information about this column.
The report includes both logical/live-row details and, when available, the physical storage details used internally by lazy predicates.
Examples
>>> print(t["score"].info)
>>> t["score"].info()
- property item_ndim: int¶
Number of per-row item dimensions.
- property item_shape: tuple[int, ...]¶
Per-row item shape; () for scalar columns.
- property item_size: int¶
Number of scalar values stored in each row item.
- property ndim: int¶
Number of logical dimensions.
- property null_value¶
The sentinel value that represents NULL for this column, or None.
- property row_transformer: RowTransformer¶
Build row-wise projections/reductions for generated columns.
- property shape: tuple[int, ...]¶
Logical shape of the live column values.
- property size: int¶
Number of live scalar values in the logical column array.
- property view: ColumnViewIndexer¶
Return a ColumnViewIndexer for creating logical sub-views.
Examples
Read a sub-view for chained aggregates:
sub = t.price.view[2:10]
sub.sum()
Bulk write through a sub-view:
t.price.view[0:5][:] = np.zeros(5)
Attributes¶
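| Column.dtype | NumPy dtype of the underlying storage, or None for variable-length columns (vlstring(), vlbytes(), list()). |
| Column.null_value | The sentinel value that represents NULL for this column, or None. |
| Column.row_transformer | Build row-wise projections/reductions for generated columns. |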
- property Column.dtype¶
NumPy dtype of the underlying storage, or None for variable-length columns (vlstring(), vlbytes(), list()).
- property Column.null_value¶
The sentinel value that represents NULL for this column, or None.
- property Column.row_transformer: RowTransformer¶
Build row-wise projections/reductions for generated columns.
Data access¶
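| Column.view | Return a ColumnViewIndexer for creating logical sub-views. |
| Column.iter_chunks() | Iterate over live column values in chunks of size rows. |
| Column.assign(data) | Replace all live values in this column with data. |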
- property Column.view: ColumnViewIndexer¶
Return a ColumnViewIndexer for creating logical sub-views.
Examples
Read a sub-view for chained aggregates:
sub = t.price.view[2:10]
sub.sum()
Bulk write through a sub-view:
t.price.view[0:5][:] = np.zeros(5)
- Column.iter_chunks(size: int = 65536)[source]¶
Iterate over live column values in chunks of size rows.
Yields numpy arrays of at most size elements each, skipping deleted rows. The last chunk may be smaller than size.
- Parameters:
size¶ – Number of live rows per yielded chunk. Defaults to 65 536.
- Yields:
numpy.ndarray – A 1-D array of up to size live values with this column’s dtype.
Examples
>>> for chunk in t["score"].iter_chunks(size=100_000):
...     process(chunk)
- Column.assign(data) None[source]¶
Replace all live values in this column with data.
Works on both full tables and views — on a view, only the rows visible through the view’s mask are overwritten.
- Parameters:
data¶ – List, numpy array, or any iterable. Must have exactly as many elements as there are live rows in this column. Values are coerced to the column’s dtype if possible.
- Raises:
ValueError – If len(data) does not match the number of live rows, or the table is opened read-only.
TypeError – If values cannot be coerced to the column’s dtype.
Row transformers¶
Column.row_transformer builds row-wise projections and reductions for
fixed-shape ndarray columns. Use these transformers with
CTable.add_generated_column() when the generated value should be computed
from each row’s ndarray payload rather than from scalar columns:
t.add_generated_column(
"embedding_norm",
values=t.embedding.row_transformer.norm(axis=0),
dtype=blosc2.float64(),
)
t.add_generated_column(
"image_mean_rgb",
values=t.image.row_transformer.mean(axis=(0, 1)),
dtype=blosc2.ndarray((3,), dtype=blosc2.float32()),
)
- class blosc2.RowTransformer(source: str, *, selection=(), op: str | None = None, axis=None, ord=None)[source]¶
Row-wise transformer for fixed-shape ndarray columns.
A row transformer sees one table row at a time. For a source column with physical shape (nrows, *item_shape), axes passed to reductions are axes within item_shape (so they are shifted by one for batch evaluation).
Methods
argmax
argmin
max
mean
min
norm
sum
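The axis shift mentioned in the class description can be made concrete with a small mapping function. This is a sketch of the idea only: to_batch_axes is a hypothetical helper, and treating axis=None as "reduce over every item axis" is an assumption rather than documented behaviour.

```python
def to_batch_axes(axis, item_ndim):
    """Map axes given relative to item_shape onto (nrows, *item_shape)."""
    if axis is None:
        # Assumption: None means reduce over all item axes, never over rows.
        return tuple(range(1, item_ndim + 1))
    axes = (axis,) if isinstance(axis, int) else tuple(axis)
    shifted = []
    for ax in axes:
        if not -item_ndim <= ax < item_ndim:
            raise ValueError(f"axis {ax} is out of bounds for item_ndim {item_ndim}")
        # Non-negative axes move right by one to skip the row axis;
        # negative axes count from the end and are unaffected.
        shifted.append(ax + 1 if ax >= 0 else ax)
    return tuple(shifted)


norm_axes = to_batch_axes(0, item_ndim=1)       # per-row norm of a 1-D item
mean_axes = to_batch_axes((0, 1), item_ndim=3)  # per-row mean over height, width
```

Axis 0 of the batch array, the row axis, is never reduced, so each row keeps its own result.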
Nullable helpers¶
| Column.is_null() | Return a boolean array True where the live value is the null sentinel. |
| Column.notnull() | Return a boolean array True where the live value is not the null sentinel. |
| Column.null_count() | Return the number of live rows whose value equals the null sentinel. |
- Column.is_null() ndarray[source]¶
Return a boolean array True where the live value is the null sentinel.
For varlen scalar columns (vlstring/vlbytes) nullability is represented as native None values, so this returns True wherever the value is None. For dictionary columns, returns True where the code equals the null_code (-1 by default).
Unique values¶
| Column.unique() | Return sorted array of unique live, non-null values. |
| Column.value_counts() | Return a {value: count} dict sorted by count descending. |
Aggregates¶
Null sentinel values are automatically excluded from all aggregates.
| Column.sum() | Sum of all live, non-null values. |
| Column.min() | Minimum live, non-null value. |
| Column.max() | Maximum live, non-null value. |
| Column.argmin() | Index of the minimum live, non-null value. |
| Column.argmax() | Index of the maximum live, non-null value. |
| Column.mean() | Arithmetic mean of all live, non-null values. |
| Column.std() | Standard deviation of all live, non-null values (single-pass, Welford's algorithm). |
| Column.any() | Return True if at least one live, non-null value is True. |
| Column.all() | Return True if every live, non-null value is True. |
- Column.sum(dtype=None, axis=None, *, where=None, jit=None, jit_backend=None)[source]¶
Sum of all live, non-null values.
Returns zero for an empty column or filtered view.
Supported dtypes: bool, int, uint, float, complex. Bool values are counted as 0 / 1. Null sentinel values are skipped.
- Parameters:
dtype¶ – Optional accumulator dtype. When omitted, float columns use np.float64, complex columns use np.complex128, and integer / bool columns use np.int64.
where¶ – Optional boolean predicate. Only rows where the predicate is true, the table row is live, and this column is non-null are included. This enables direct filtered aggregate pushdown, avoiding creation of an intermediate filtered table view.
jit¶ – Optional miniexpr JIT policy passed to the lazy reduction engine.
jit_backend¶ – Optional miniexpr JIT backend. Use "tcc" or "cc".
Examples
Sum values matching a predicate without materializing a filtered view:
total = t["amount"].sum(where=t.category == 3)
Combine several column predicates:
total = t.col2.sum(where=(t.col1 < 300) & (t.col2 < 400))
Nullable sentinel values are skipped automatically:
# Equivalent to summing only live rows where predicate is true and
# t.col2 is not its configured null sentinel.
total = t.col2.sum(where=t.col1 < 300)
- Column.min(axis=None, *, where=None)[source]¶
Minimum live, non-null value.
Supported dtypes: bool, int, uint, float, string, bytes. Strings are compared lexicographically. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included.
- Column.max(axis=None, *, where=None)[source]¶
Maximum live, non-null value.
Supported dtypes: bool, int, uint, float, string, bytes. Strings are compared lexicographically. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included.
- Column.argmin(axis=None, *, where=None)[source]¶
Index of the minimum live, non-null value.
For fixed-shape ndarray columns, this follows NumPy axis semantics on the logical array of shape
(nrows, *item_shape). For scalar columns, the result is the logical row position within this column (or filtered view).
- Column.argmax(axis=None, *, where=None)[source]¶
Index of the maximum live, non-null value.
For fixed-shape ndarray columns, this follows NumPy axis semantics on the logical array of shape
(nrows, *item_shape). For scalar columns, the result is the logical row position within this column (or filtered view).
- Column.mean(axis=None, *, where=None)[source]¶
Arithmetic mean of all live, non-null values.
Supported dtypes: bool, int, uint, float. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included. Always returns a Python float.
- Column.std(ddof: int = 0, axis=None, *, where=None)[source]¶
Standard deviation of all live, non-null values (single-pass, Welford’s algorithm).
- Parameters:
ddof¶ – Delta degrees of freedom. 0 (default) gives the population std; 1 gives the sample std (divides by N-1).
where¶ – Optional boolean predicate. Only rows where the predicate is true, the table row is live, and this column is non-null are included.
Null sentinel values are skipped. Always returns a Python float.
- Column.any() bool[source]¶
Return True if at least one live, non-null value is True.
Supported dtypes: bool. Null sentinel values are skipped. Short-circuits on the first True found.
- Column.all() bool[source]¶
Return True if every live, non-null value is True.
Supported dtypes: bool. Null sentinel values are skipped. Short-circuits on the first False found.
Schema Specs¶
Schema specs are passed to field() to declare a column’s type,
storage constraints, and optional null sentinel. They are also
available directly in the blosc2 namespace (e.g. blosc2.int64).
- blosc2.field(spec: ~blosc2.schema.SchemaSpec, *, default=<dataclasses._MISSING_TYPE object>, cparams: dict[str, ~typing.Any] | None = None, dparams: dict[str, ~typing.Any] | None = None, chunks: tuple[int, ...] | None = None, blocks: tuple[int, ...] | None = None) Field[source]¶
Attach a Blosc2 schema spec and per-column storage options to a dataclass field.
- Parameters:
spec¶ – A schema descriptor such as b2.int64(ge=0) or b2.float64().
default¶ – Default value for the field. Omit for required fields.
cparams¶ – Compression parameters for this column’s NDArray.
dparams¶ – Decompression parameters for this column’s NDArray.
chunks¶ – Chunk shape for this column’s NDArray.
blocks¶ – Block shape for this column’s NDArray.
Examples
>>> from dataclasses import dataclass
>>> import blosc2 as b2
>>> @dataclass
... class Row:
...     id: int = b2.field(b2.int64(ge=0))
...     score: float = b2.field(b2.float64(ge=0, le=100))
...     active: bool = b2.field(b2.bool(), default=True)
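The ge, gt, le, and lt keywords read like the usual bound constraints (greater-or-equal, greater-than, less-or-equal, less-than). Assuming they carry those Pydantic-style meanings, a check on a single value could look like the sketch below; violated_bounds is a hypothetical helper and says nothing about how the library validates rows internally.

```python
import operator

# Comparison applied as op(value, limit); a False result is a violation.
CHECKS = {
    "ge": operator.ge,
    "gt": operator.gt,
    "le": operator.le,
    "lt": operator.lt,
}


def violated_bounds(value, **bounds):
    """Return the names of the bounds that `value` does not satisfy."""
    return [
        name
        for name, limit in bounds.items()
        if limit is not None and not CHECKS[name](value, limit)
    ]


ok = violated_bounds(50.0, ge=0, le=100)   # score within 0..100
bad = violated_bounds(-1, ge=0)            # negative id
edge = violated_bounds(100, ge=0, lt=100)  # lt excludes the limit itself
```

Bounds left at None are ignored, which mirrors the None defaults in the spec signatures.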
Numeric¶
| int8 | 8-bit signed integer column (−128 … 127). |
| int16 | 16-bit signed integer column (−32 768 … 32 767). |
| int32 | 32-bit signed integer column (−2 147 483 648 … 2 147 483 647). |
| int64 | 64-bit signed integer column. |
| uint8 | 8-bit unsigned integer column (0 … 255). |
| uint16 | 16-bit unsigned integer column (0 … 65 535). |
| uint32 | 32-bit unsigned integer column (0 … 4 294 967 295). |
| uint64 | 64-bit unsigned integer column. |
| float32 | 32-bit floating-point column (single precision). |
| float64 | 64-bit floating-point column (double precision). |
| timestamp | Timestamp column stored as signed 64-bit epoch offsets. |
- class blosc2.int8(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
8-bit signed integer column (−128 … 127).
Methods
python_type – alias of int
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
type – alias of int8
- class blosc2.int16(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
16-bit signed integer column (−32 768 … 32 767).
Methods
python_type – alias of int
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
type – alias of int16
- class blosc2.int32(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
32-bit signed integer column (−2 147 483 648 … 2 147 483 647).
Methods
python_type – alias of int
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
type – alias of int32
- class blosc2.int64(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
64-bit signed integer column.
Methods
python_type – alias of int
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
type – alias of int64
- class blosc2.uint8(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
8-bit unsigned integer column (0 … 255).
Methods
python_type – alias of int
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
type – alias of uint8
- class blosc2.uint16(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
16-bit unsigned integer column (0 … 65 535).
Methods
python_type: alias of int
to_metadata_dict(): Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs(): Return kwargs for building a Pydantic field annotation.
type: alias of uint16
- class blosc2.uint32(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
32-bit unsigned integer column (0 … 4 294 967 295).
Methods
python_type: alias of int
to_metadata_dict(): Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs(): Return kwargs for building a Pydantic field annotation.
type: alias of uint32
- class blosc2.uint64(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
64-bit unsigned integer column.
Methods
python_type: alias of int
to_metadata_dict(): Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs(): Return kwargs for building a Pydantic field annotation.
type: alias of uint64
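The ranges quoted for the integer specs above follow from the usual two's-complement signed and unsigned bit widths. The following is a minimal sketch in plain Python that derives them, including the 64-bit bounds that the descriptions leave unstated. The helper name `int_range` is made up for this illustration and is not part of blosc2.

```python
def int_range(bits: int, signed: bool) -> tuple[int, int]:
    """Return (min, max) representable by an integer of the given bit width."""
    if signed:
        # Two's complement: one extra value on the negative side.
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


for bits in (8, 16, 32, 64):
    print(f"int{bits}: {int_range(bits, True)}  uint{bits}: {int_range(bits, False)}")
```

Values outside these bounds cannot be represented by the corresponding physical dtype, independent of any `ge`/`gt`/`le`/`lt` constraints declared on the spec.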
- class blosc2.float32(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
32-bit floating-point column (single precision).
Methods
python_type: alias of float
to_metadata_dict(): Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs(): Return kwargs for building a Pydantic field annotation.
type: alias of float32
- class blosc2.float64(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
64-bit floating-point column (double precision).
Methods
python_type: alias of float
to_metadata_dict(): Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs(): Return kwargs for building a Pydantic field annotation.
type: alias of float64
- class blosc2.timestamp(*, unit: str = 'us', timezone: str | None = None, nullable: bool = False, null_value=None)[source]¶
Timestamp column stored as signed 64-bit epoch offsets.
The physical storage dtype is int64. unit follows Arrow/NumPy datetime units: "s", "ms", "us" or "ns". timezone is metadata preserved for Arrow/Parquet roundtrips.
Methods
python_type: alias of object
to_metadata_dict(): Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs(): Return kwargs for building a Pydantic field annotation.
type: alias of int64
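To make "signed 64-bit epoch offsets" concrete, here is a small standard-library sketch that converts a timezone-aware datetime into an integer count of the chosen unit. It assumes the reference point is the Unix epoch (1970-01-01T00:00:00 UTC), as NumPy's datetime64 uses. The helper `to_epoch_offset` is hypothetical and only illustrates the arithmetic, not how blosc2 performs the conversion internally.

```python
from datetime import datetime, timedelta, timezone

# Ticks per second for each unit named in the description above.
TICKS_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # assumed reference point


def to_epoch_offset(dt: datetime, unit: str = "us") -> int:
    """Convert an aware datetime to an integer count of `unit` since the epoch."""
    micros = (dt - EPOCH) // timedelta(microseconds=1)  # exact integer arithmetic
    return micros * TICKS_PER_SECOND[unit] // 10**6


when = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
offset_us = to_epoch_offset(when, "us")  # 1735732800000000
offset_s = to_epoch_offset(when, "s")    # 1735732800
```

The same instant yields different integers for different units, which is why the unit has to travel with the column as metadata.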
Complex¶
| blosc2.complex64 | 64-bit complex number column (two 32-bit floats). |
| blosc2.complex128 | 128-bit complex number column (two 64-bit floats). |
Boolean¶
| blosc2.bool | Boolean column. |
- class blosc2.bool(*, nullable: bool = False, null_value=None)[source]¶
Boolean column.
Nullable bool columns use uint8 physical storage with values 0 (false), 1 (true), and 255 (null).
Methods
python_type: alias of bool
to_metadata_dict(): Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs(): Return kwargs for building a Pydantic field annotation.
type: alias of bool
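The three-state encoding described above can be sketched in a few lines of plain Python. This is only an illustration of the 0/1/255 mapping; the function names are hypothetical and this is not the library's own code path.

```python
from typing import Optional

NULL_CODE = 255  # null sentinel quoted in the description above


def encode_bools(values: list[Optional[bool]]) -> bytes:
    """Pack True/False/None into one unsigned byte each (1, 0, 255)."""
    return bytes(NULL_CODE if v is None else int(v) for v in values)


def decode_bools(raw: bytes) -> list[Optional[bool]]:
    """Reverse of encode_bools: 255 becomes None, 0 and 1 become bools."""
    return [None if b == NULL_CODE else bool(b) for b in raw]


raw = encode_bools([True, None, False])
print(list(raw))          # [1, 255, 0]
print(decode_bools(raw))  # [True, None, False]
```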
Text & binary¶
| blosc2.string | Fixed-width Unicode string column. |
| blosc2.bytes | Fixed-width bytes column. |
| blosc2.vlstring() | Build a variable-length scalar string schema descriptor. |
| blosc2.vlbytes() | Build a variable-length scalar bytes schema descriptor. |
- class blosc2.string(*, min_length=None, max_length=None, pattern=None, nullable: bool = False, null_value=None)[source]¶
Fixed-width Unicode string column.
- Parameters:
max_length – Maximum number of characters. Determines the NumPy U<n> dtype. Defaults to 32 if not specified.
min_length – Minimum number of characters (validation only, no effect on dtype).
pattern – Regex pattern the value must match (validation only).
nullable – If True and null_value is not set, choose a null sentinel from the current CTable null policy when the schema is compiled.
null_value – Explicit null sentinel. Takes precedence over nullable=True.
Methods
python_type: alias of str
to_metadata_dict(): Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs(): Return kwargs for building a Pydantic field annotation.
- class blosc2.bytes(*, min_length=None, max_length=None, nullable: bool = False, null_value=None)[source]¶
Fixed-width bytes column.
- Parameters:
max_length – Maximum number of bytes. Determines the NumPy S<n> dtype. Defaults to 32 if not specified.
min_length – Minimum number of bytes (validation only, no effect on dtype).
nullable – If True and null_value is not set, choose a null sentinel from the current CTable null policy when the schema is compiled.
null_value – Explicit null sentinel. Takes precedence over nullable=True.
Methods
python_type: alias of bytes
to_metadata_dict(): Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs(): Return kwargs for building a Pydantic field annotation.
- blosc2.vlstring(*, nullable: bool = False, serializer: str = 'msgpack', batch_rows: int | None = 2048, items_per_block: int | None = None) VLStringSpec[source]¶
Build a variable-length scalar string schema descriptor.
Use this as an explicit opt-in when a CTable column holds long or wildly variable-length strings that would waste space in a fixed-width string(max_length=N) column. Must be requested via blosc2.field(blosc2.vlstring()); it is never inferred automatically from plain str annotations.
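A rough way to see when a fixed-width column wastes space is to compare uncompressed footprints. The sketch below relies on the fact that a NumPy U<n> dtype reserves 4 bytes per character slot. The variable-length figure counts only the UTF-8 payload and ignores any per-item framing that a serializer adds. Both helper names are hypothetical, and compression will shrink the padding considerably in practice, so treat the numbers as an upper bound on the difference.

```python
def fixed_width_bytes(values: list[str], max_length: int) -> int:
    """Uncompressed size of a NumPy U<max_length> column (4 bytes per slot)."""
    return len(values) * max_length * 4


def variable_length_bytes(values: list[str]) -> int:
    """Sum of UTF-8 payload sizes, ignoring serializer framing overhead."""
    return sum(len(v.encode("utf-8")) for v in values)


values = ["ok", "warn", "x" * 500]   # one long outlier forces a wide column
width = max(len(v) for v in values)  # smallest max_length that fits every value
fixed = fixed_width_bytes(values, width)
variable = variable_length_bytes(values)
print(fixed, variable)               # 6000 506
```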
- blosc2.vlbytes(*, nullable: bool = False, serializer: str = 'msgpack', batch_rows: int | None = 2048, items_per_block: int | None = None) VLBytesSpec[source]¶
Build a variable-length scalar bytes schema descriptor.
Use this as an explicit opt-in when a CTable column holds long or wildly variable-length byte strings. Must be requested via blosc2.field(blosc2.vlbytes()); it is never inferred automatically from plain bytes annotations.
Array, encoded, and compound specs¶
| blosc2.ndarray() | Build a fixed-shape N-D array descriptor for CTable columns. |
| blosc2.dictionary() | Build a dictionary-encoded string column descriptor. |
| blosc2.struct() | Build a structured schema descriptor for dict-like CTable values. |
| blosc2.list() | Build a list-valued schema descriptor for CTable and ListArray. |
| blosc2.object() | Build a schema-less Python object column descriptor for CTable. |
- blosc2.ndarray(item_shape, dtype=<class 'numpy.float64'>, *, nullable: ~blosc2.schema.bool = False, null_value=None) NDArraySpec[source]¶
Build a fixed-shape N-D array descriptor for CTable columns.
- blosc2.dictionary(*, index_type=None, value_type=None, ordered: bool = False, nullable: bool = True) DictionarySpec[source]¶
Build a dictionary-encoded string column descriptor.
Dictionary columns store repeated string values as compact int32 codes with a separate global dictionary of unique string values. This matches Arrow dictionary encoding and is ideal for low-cardinality string columns such as categories or enumerated values.
- Parameters:
index_type – The physical type for category codes. Must be blosc2.int32() in v1. Defaults to blosc2.int32() when not specified.
value_type – The type of dictionary values. Must be blosc2.vlstring() in v1. Defaults to blosc2.vlstring() when not specified.
ordered – If True, dictionary order is semantically meaningful.
nullable – If True (default), null row values are allowed (stored as code -1).
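The idea behind dictionary encoding can be sketched in plain Python: each distinct string is stored once, and rows hold small integer codes, with -1 standing for null. The function names are hypothetical, and assigning codes in first-seen order is an assumption made for this illustration; the library may order its dictionary differently.

```python
from typing import Optional


def dictionary_encode(values: list[Optional[str]]) -> tuple[list[int], list[str]]:
    """Return (codes, dictionary); None is encoded as code -1."""
    dictionary: list[str] = []
    positions: dict[str, int] = {}
    codes: list[int] = []
    for v in values:
        if v is None:
            codes.append(-1)
            continue
        if v not in positions:
            positions[v] = len(dictionary)  # next free code
            dictionary.append(v)
        codes.append(positions[v])
    return codes, dictionary


def dictionary_decode(codes: list[int], dictionary: list[str]) -> list[Optional[str]]:
    """Map codes back to strings; -1 becomes None."""
    return [None if c == -1 else dictionary[c] for c in codes]


codes, dictionary = dictionary_encode(["red", "blue", "red", None, "blue"])
print(codes)       # [0, 1, 0, -1, 1]
print(dictionary)  # ['red', 'blue']
```

With only a handful of distinct values, the per-row cost is one integer code regardless of how long the strings are.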
- blosc2.struct(fields: dict[str, SchemaSpec], *, nullable: bool = False) StructSpec[source]¶
Build a structured schema descriptor for dict-like CTable values.
Top-level struct columns store one dictionary (or None when nullable) per row. Struct specs may also be nested as list item specs.
- blosc2.list(item_spec: SchemaSpec, *, nullable: bool = False, storage: str = 'batch', serializer: str = 'msgpack', batch_rows: int | None = None, items_per_block: int | None = None) ListSpec[source]¶
Build a list-valued schema descriptor for CTable and ListArray.
- blosc2.object(*, nullable: bool = False, serializer: str = 'msgpack', batch_rows: int | None = 2048, items_per_block: int | None = None) ObjectSpec[source]¶
Build a schema-less Python object column descriptor for CTable.
Values are stored via batched msgpack serialization. Prefer typed specs such as struct(), list(), vlstring(), or vlbytes() when the data has a stable schema; use object for heterogeneous per-row payloads.
Timestamp columns¶
Timestamp columns are declared with blosc2.timestamp and store signed
64-bit epoch offsets together with unit and time-zone metadata. Column reads
return numpy.datetime64 values. Comparisons accept numpy.datetime64 values,
ISO-like strings, or Python datetime objects. Arrow/Parquet import and export
preserve timestamp units and time zones:
from dataclasses import dataclass
import numpy as np
import blosc2 as b2

@dataclass
class Event:
    when: np.datetime64 = b2.field(b2.timestamp(unit="us", nullable=True))
    value: int = b2.field(b2.int64())

table = b2.CTable(Event)
table.append(["2025-01-01T12:00:00", 42])
recent = table[table.when >= np.datetime64("2025-01-01", "us")]
Object columns¶
Schema-less object columns are declared with blosc2.object() and store one
msgpack-serializable Python object (or None when nullable) per row in
batched variable-length storage. Prefer typed specs such as blosc2.struct()
or blosc2.list() when the payload has a stable schema; use object columns
for heterogeneous per-row payloads:
from dataclasses import dataclass
import blosc2 as b2

@dataclass
class Event:
    id: int = b2.field(b2.int64())
    payload: object = b2.field(b2.object(nullable=True))

table = b2.CTable(Event)
table.append([1, {"kind": "click", "xy": [10, 20]}])
table.append([2, ("custom", {"nested": True})])
table.append([3, None])
Object columns have no fixed Arrow type, so CTable.to_arrow() and
CTable.to_parquet() raise for them unless users first convert the payloads
to a typed representation. They are not used as an implicit fallback during
Parquet import; unsupported Arrow/Parquet types still raise unless explicitly
imported through CTable.from_arrow() with object_fallback=True.
Nested fields¶
CTable supports first-class nested struct schemas by physically flattening
struct leaves into independent compressed columns. This keeps analytics fast
(each leaf is an ordinary NDArray), while preserving the
logical nested row shape on read.
Automatic flattening from Arrow / Parquet
When CTable.from_arrow() or CTable.from_parquet() encounters a
top-level struct<…> field, it recursively flattens every scalar leaf into a
dotted column name and stores each leaf as its own physical column:
import pyarrow as pa
import blosc2

trip_type = pa.struct([
    ("begin", pa.struct([("lon", pa.float64()), ("lat", pa.float64())])),
    ("end", pa.struct([("lon", pa.float64()), ("lat", pa.float64())])),
])
schema = pa.schema([pa.field("trip", trip_type),
                    pa.field("fare", pa.float64())])
batch = pa.record_batch(
    [pa.array([{"begin": {"lon": -87.6, "lat": 41.8},
                "end": {"lon": -87.7, "lat": 41.9}}],
              type=trip_type),
     pa.array([12.5])],
    schema=schema,
)
t = blosc2.CTable.from_arrow(schema, [batch])
# t.col_names → ['trip.begin.lon', 'trip.begin.lat',
#                'trip.end.lon', 'trip.end.lat', 'fare']
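The recursive flattening described above can be illustrated without blosc2 or pyarrow. The sketch below walks a nested dict depth first and builds dotted leaf names. The `flatten` helper is hypothetical, operates on plain Python dicts rather than Arrow types, and ignores the escaping of special characters in field names.

```python
def flatten(row: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into {dotted_name: leaf_value}, depth first."""
    flat = {}
    for key, value in row.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))  # descend into the struct
        else:
            flat[name] = value                 # scalar leaf
    return flat


row = {
    "trip": {"begin": {"lon": -87.6, "lat": 41.8},
             "end": {"lon": -87.7, "lat": 41.9}},
    "fare": 12.5,
}
flat = flatten(row)
print(list(flat))
# ['trip.begin.lon', 'trip.begin.lat', 'trip.end.lon', 'trip.end.lat', 'fare']
```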
Column access
Nested leaves are accessed with their dotted logical name or via chained attribute proxies:
t["trip.begin.lon"].mean() # Column object (fast path)
t.trip.begin.lon.max() # attribute proxy, same column
A literal ., /, or \\ inside an Arrow field name is escaped with a
backslash in the logical column name. For example, path segments
("trip.info", "begin/point", "lon.deg") become:
t[r"trip\.info.begin\/point.lon\.deg"]
Such leaves are stored with percent-encoded path segments under _cols; the
example above is stored at _cols/trip%2Einfo/begin%2Fpoint/lon%2Edeg.
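The two encodings in this passage can be reproduced with a short sketch. It reproduces the documented example exactly, but two details are assumptions that the text does not confirm: that a literal `%` is stored as `%25` (needed for the encoding to be reversible) and that a backslash needs no percent-encoding in the storage path. Both function names are hypothetical.

```python
def logical_name(segments):
    """Join path segments with '.', backslash-escaping backslash, dot and slash."""
    def escape(seg):
        out = seg.replace("\\", "\\\\")  # escape backslashes first
        return out.replace(".", "\\.").replace("/", "\\/")
    return ".".join(escape(s) for s in segments)


def storage_path(segments, root="_cols"):
    """Percent-encode '.' and '/' in each segment and join under the root."""
    def encode(seg):
        # Assumption: '%' itself is encoded so that decoding stays unambiguous.
        return seg.replace("%", "%25").replace(".", "%2E").replace("/", "%2F")
    return "/".join([root, *(encode(s) for s in segments)])


segments = ("trip.info", "begin/point", "lon.deg")
print(logical_name(segments))  # trip\.info.begin\/point.lon\.deg
print(storage_path(segments))  # _cols/trip%2Einfo/begin%2Fpoint/lon%2Edeg
```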
Filtering and expressions
Dotted names work everywhere a flat column name would:
t.where("trip.begin.lon > -87.7 and fare > 10")
t.where(t.trip.begin.lon > -87.7)
Select / projection
A struct prefix expands to all descendant leaves:
t.select(["trip.begin"]) # → columns trip.begin.lon, trip.begin.lat
t.select(["trip"]) # → all four trip.* leaves
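Prefix expansion amounts to matching each requested name against the leaf names, either exactly or as a dotted prefix. The sketch below shows that rule on plain lists. `expand_selection` is a hypothetical helper, it ignores escaped dots, and the real `select()` may differ in ordering or error handling.

```python
def expand_selection(requested, col_names):
    """Expand exact names and struct prefixes into the matching leaf columns."""
    selected = []
    for name in requested:
        # Match the column itself or any descendant below "name."
        matches = [c for c in col_names if c == name or c.startswith(name + ".")]
        if not matches:
            raise KeyError(name)
        selected.extend(m for m in matches if m not in selected)
    return selected


col_names = ["trip.begin.lon", "trip.begin.lat",
             "trip.end.lon", "trip.end.lat", "fare"]
print(expand_selection(["trip.begin"], col_names))
# ['trip.begin.lon', 'trip.begin.lat']
```

Requiring the trailing dot keeps a prefix such as `trip.begin` from accidentally matching an unrelated sibling whose name merely starts with the same characters.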
Indexes and aggregates
Scalar leaf columns support all the same operations as flat columns:
t.create_index(col_name="trip.begin.lon")
t.where("trip.begin.lon > -87.7").nrows # uses the index
Row reconstruction
Single-row access reconstructs the original nested dict shape:
row = t[0]
row.trip # → {"begin": {"lon": ..., "lat": ...}, "end": {...}}
row.fare # → 12.5
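Row reconstruction is the inverse of flattening: dotted leaf names are split and regrouped into nested dicts. A minimal sketch on plain Python data follows. The `unflatten` helper is hypothetical and does not handle escaped dots.

```python
def unflatten(flat: dict) -> dict:
    """Rebuild nested dicts from {dotted_name: leaf_value}."""
    row = {}
    for name, value in flat.items():
        *parents, leaf = name.split(".")
        node = row
        for part in parents:
            node = node.setdefault(part, {})  # create intermediate structs
        node[leaf] = value
    return row


flat_row = {
    "trip.begin.lon": -87.6, "trip.begin.lat": 41.8,
    "trip.end.lon": -87.7, "trip.end.lat": 41.9,
    "fare": 12.5,
}
nested = unflatten(flat_row)
print(nested["trip"]["begin"])  # {'lon': -87.6, 'lat': 41.8}
```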
Inserting nested rows
CTable.append() and CTable.extend() accept either the flat dotted
form or the original nested dict / list-of-dicts shape:
# flat dotted keys
t.append({"trip.begin.lon": -87.6, "trip.begin.lat": 41.8,
          "trip.end.lon": -87.7, "trip.end.lat": 41.9, "fare": 12.5})

# original nested dict (auto-flattened)
t.append({"trip": {"begin": {"lon": -87.6, "lat": 41.8},
                   "end": {"lon": -87.7, "lat": 41.9}},
          "fare": 12.5})

# extend with a list of nested dicts
t.extend([
    {"trip": {"begin": {"lon": -87.6, "lat": 41.8},
              "end": {"lon": -87.7, "lat": 41.9}}, "fare": 12.5},
    {"trip": {"begin": {"lon": -87.5, "lat": 41.7},
              "end": {"lon": -87.8, "lat": 41.6}}, "fare": 8.0},
])
Physical storage layout
Leaf columns are stored under a hierarchical path in the backing container:
/_cols/trip/begin/lon, /_cols/trip/begin/lat, etc. Intermediate nodes
are namespaces only; no data is stored at non-leaf levels.
Arrow / Parquet round-trip
CTable.to_parquet() and CTable.to_arrow() reconstruct the original
nested Arrow schema from the stored metadata, so round-trips are lossless:
t.to_parquet("out.parquet") # Arrow schema has top-level "trip" struct
Struct columns¶
Struct columns are declared with blosc2.struct() and store one dictionary
(or None when nullable) per row in batched variable-length storage. They are
also used for top-level Arrow/Parquet struct<...> columns that are imported
without the nested-leaf flattening path described above:
from dataclasses import dataclass
import blosc2 as b2

@dataclass
class Product:
    properties: dict = b2.field(
        b2.struct({"code": b2.int32(), "label": b2.vlstring()}, nullable=True)
    )

table = b2.CTable(Product)
table.append([{"code": 1, "label": "fresh"}])
table.append([None])
List columns¶
List columns are declared with blosc2.list(), for example:
from dataclasses import dataclass
import blosc2 as b2

@dataclass
class Product:
    code: str = b2.field(b2.string(max_length=8))
    tags: list[str] = b2.field(b2.list(b2.string(), nullable=True))
Only whole-cell replacement is supported: mutating a list returned by a read changes a local Python copy, so reassign the modified list to write it back:
row_tags = table.tags[0]
row_tags.append("extra") # local Python list only
table.tags[0] = row_tags # explicit write-back
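Why the write-back is needed can be shown with a toy column that, like any serialized store, hands out a fresh Python list on every read. `ToyListColumn` is a hypothetical stand-in written for this illustration. It uses the standard-library `json` module in place of msgpack and says nothing about how blosc2 stores cells internally.

```python
import json


class ToyListColumn:
    """Hypothetical stand-in: cells are kept serialized, so reads return copies."""

    def __init__(self, cells):
        self._cells = [json.dumps(c) for c in cells]

    def __getitem__(self, i):
        return json.loads(self._cells[i])   # fresh list on every read

    def __setitem__(self, i, value):
        self._cells[i] = json.dumps(value)  # whole-cell replacement


tags = ToyListColumn([["a", "b"]])
row_tags = tags[0]
row_tags.append("extra")  # changes the local copy only
before = tags[0]          # ['a', 'b']
tags[0] = row_tags        # explicit write-back
after = tags[0]           # ['a', 'b', 'extra']
```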