
DataFrame#

Most DataFrame methods are lazy, meaning that they do not execute computation immediately when invoked. Instead, these operations are enqueued in the DataFrame's internal query plan and are only executed when an execution method, such as collect() or show(), is called.
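A minimal sketch of this laziness, using factory and execution methods documented on this page:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> lazy = df.where(df["x"] > 1).select("y")  # only extends the query plan; nothing runs yet
>>> lazy.to_pydict()  # an execution method: runs the plan and materializes the result
{'y': [5, 6]}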

DataFrame #

DataFrame(builder: LogicalPlanBuilder)

A Daft DataFrame is a table of data.

It has columns, where each column has a type and the same number of items (rows) as all other columns.

Constructs a DataFrame according to a given LogicalPlan.

Users are expected instead to call the classmethods on DataFrame to create a DataFrame.

Parameters:

Name Type Description Default
builder LogicalPlanBuilder

LogicalPlan describing the steps required to arrive at this DataFrame

required

Methods:

Name Description
__contains__

Returns whether the column exists in the dataframe.

__getitem__

Gets a column from the DataFrame as an Expression (df["mycol"]).

__iter__

Alias of self.iter_rows() with default arguments for convenient access of data.

__len__

Returns the count of rows when the dataframe is materialized.

agg

Perform aggregations on this DataFrame.

agg_concat

Performs a global list concatenation agg on the DataFrame.

agg_list

Performs a global list agg on the DataFrame.

agg_set

Performs a global set agg on the DataFrame (ignoring nulls).

any_value

Returns an arbitrary value on this DataFrame.

collect

Executes the entire DataFrame and materializes the results.

concat

Concatenates two DataFrames together in a "vertical" concatenation.

count

Performs a global count on the DataFrame.

count_rows

Executes the DataFrame to count the number of rows.

describe

Returns the Schema of the DataFrame, which provides information about each column, as a new DataFrame.

distinct

Computes distinct rows, dropping duplicates.

drop_duplicates

Computes distinct rows, dropping duplicates.

drop_nan

Drops rows that contain NaNs. If cols is None, it will drop rows with any NaN value.

drop_null

Drops rows that contain NaNs or NULLs. If cols is None, it will drop rows with any NULL value.

except_all

Returns the set difference of two DataFrames, considering duplicates.

except_distinct

Returns the set difference of two DataFrames.

exclude

Drops columns from the current DataFrame by name.

explain

Prints the (logical and physical) plans that will be executed to produce this DataFrame.

explode

Explodes a List column, where every element in each row's List becomes its own row, and all other columns in the DataFrame are duplicated across rows.

filter

Filters rows via a predicate expression, similar to SQL WHERE.

groupby

Performs a GroupBy on the DataFrame for aggregation.

intersect

Returns the intersection of two DataFrames.

intersect_all

Returns the intersection of two DataFrames, including duplicates.

into_batches

Splits or coalesces DataFrame to partitions of size batch_size.

into_partitions

Splits or coalesces DataFrame to num partitions. Order is preserved.

iter_partitions

Begin executing this dataframe and return an iterator over the partitions.

iter_rows

Return an iterator of rows for this dataframe.

join

Column-wise join of the current DataFrame with another DataFrame, similar to a SQL JOIN.

limit

Limits the rows in the DataFrame to the first N rows, similar to a SQL LIMIT.

max

Performs a global max on the DataFrame.

mean

Performs a global mean on the DataFrame.

melt

Alias for unpivot.

min

Performs a global min on the DataFrame.

num_partitions

Returns the number of partitions that will be used to execute this DataFrame.

offset

Returns a new DataFrame by skipping the first N rows, similar to a SQL OFFSET.

pipe

Apply the function to this DataFrame.

pivot

Pivots a column of the DataFrame and performs an aggregation on the values.

repartition

Repartitions DataFrame to num partitions.

sample

Samples rows from the DataFrame.

schema

Returns the Schema of the DataFrame, which provides information about each column, as a Python object.

select

Creates a new DataFrame from the provided expressions, similar to a SQL SELECT.

show

Executes enough of the DataFrame in order to display the first n rows.

sort

Sorts DataFrame globally.

stddev

Performs a global standard deviation on the DataFrame.

sum

Performs a global sum on the DataFrame.

summarize

Returns column statistics for the DataFrame.

to_arrow

Converts the current DataFrame to a pyarrow Table.

to_arrow_iter

Return an iterator of pyarrow recordbatches for this dataframe.

to_dask_dataframe

Converts the current Daft DataFrame to a Dask DataFrame.

to_pandas

Converts the current DataFrame to a pandas DataFrame.

to_pydict

Converts the current DataFrame to a Python dictionary. The dictionary contains Python lists of Python objects for each column.

to_pylist

Converts the current DataFrame into a Python list.

to_ray_dataset

Converts the current DataFrame to a Ray Dataset which is useful for running distributed ML model training in Ray.

to_torch_iter_dataset

Convert the current DataFrame into a Torch IterableDataset (https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) for use with PyTorch.

to_torch_map_dataset

Convert the current DataFrame into a map-style Torch Dataset for use with PyTorch.

transform

Apply a function that takes and returns a DataFrame.

union

Returns the distinct union of two DataFrames.

union_all

Returns the union of two DataFrames, including duplicates.

union_all_by_name

Returns the union of two DataFrames, including duplicates, with columns matched by name.

union_by_name

Returns the distinct union by name.

unique

Computes distinct rows, dropping duplicates.

unpivot

Unpivots a DataFrame from wide to long format.

where

Filters rows via a predicate expression, similar to SQL WHERE.

with_column

Adds a column to the current DataFrame with an Expression, equivalent to a select with all current columns and the new one.

with_column_renamed

Renames a column in the current DataFrame.

with_columns

Adds columns to the current DataFrame with Expressions, equivalent to a select with all current columns and the new ones.

with_columns_renamed

Renames multiple columns in the current DataFrame.

write_bigtable

Write a DataFrame into a Google Cloud Bigtable table.

write_clickhouse

Writes the DataFrame to a ClickHouse table.

write_csv

Writes the DataFrame as CSV files, returning a new DataFrame with paths to the files that were written.

write_deltalake

Writes the DataFrame to a Delta Lake table, returning a new DataFrame with the operations that occurred.

write_huggingface

Write a DataFrame into a Hugging Face dataset.

write_iceberg

Writes the DataFrame to an Iceberg table, returning a new DataFrame with the operations that occurred.

write_json

Writes the DataFrame as JSON files, returning a new DataFrame with paths to the files that were written.

write_lance

Writes the DataFrame to a Lance table.

write_parquet

Writes the DataFrame as parquet files, returning a new DataFrame with paths to the files that were written.

write_sink

Writes the DataFrame to the given DataSink.

write_turbopuffer

Writes the DataFrame to a Turbopuffer namespace.

Attributes:

Name Type Description
column_names list[str]

Returns column names of DataFrame as a list of strings.

columns list[Expression]

Returns columns of DataFrame as a list of Expressions.

Source code in daft/dataframe/dataframe.py
def __init__(self, builder: LogicalPlanBuilder) -> None:
    """Constructs a DataFrame according to a given LogicalPlan.

    Users are expected instead to call the classmethods on DataFrame to create a DataFrame.

    Args:
        builder: LogicalPlan describing the steps required to arrive at this DataFrame
    """
    if not isinstance(builder, LogicalPlanBuilder):
        if isinstance(builder, dict):
            raise ValueError(
                "DataFrames should be constructed with a dictionary of columns using `daft.from_pydict`"
            )
        if isinstance(builder, list):
            raise ValueError(
                "DataFrames should be constructed with a list of dictionaries using `daft.from_pylist`"
            )
        raise ValueError(f"Expected DataFrame to be constructed with a LogicalPlanBuilder, received: {builder}")

    self.__builder = builder
    self._result_cache: PartitionCacheEntry | None = None
    self._preview = Preview(partition=None, total_rows=None)
    self._num_preview_rows = get_context().daft_execution_config.num_preview_rows

column_names #

column_names: list[str]

Returns column names of DataFrame as a list of strings.

Returns:

Type Description
list[str]

List[str]: Column names of this DataFrame.
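A quick sketch of typical usage:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2], "y": [3, 4]})
>>> df.column_names
['x', 'y']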

columns #

columns: list[Expression]

Returns columns of DataFrame as a list of Expressions.

Returns:

Type Description
list[Expression]

List[Expression]: Columns of this DataFrame.
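A short sketch; the returned Expressions can be reused in methods such as select():

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2], "y": [3, 4]})
>>> exprs = df.columns  # one Expression per column
>>> df.select(exprs[0]).column_names
['x']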

__contains__ #

__contains__(col_name: str) -> bool

Returns whether the column exists in the dataframe.

Parameters:

Name Type Description Default
col_name str

column name

required

Returns:

Name Type Description
bool bool

whether the column exists in the dataframe.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
>>> "x" in df
True
Source code in daft/dataframe/dataframe.py
def __contains__(self, col_name: str) -> bool:
    """Returns whether the column exists in the dataframe.

    Args:
        col_name (str): column name

    Returns:
        bool: whether the column exists in the dataframe.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
        >>> "x" in df
        True

    """
    return col_name in self.column_names

__getitem__ #

__getitem__(item: int) -> Expression
__getitem__(item: str) -> Expression
__getitem__(item: slice) -> DataFrame
__getitem__(item: Iterable) -> DataFrame
__getitem__(item: int | str | slice | Iterable[str | int]) -> Union[Expression, DataFrame]

Gets a column from the DataFrame as an Expression (df["mycol"]).

Parameters:

Name Type Description Default
item Union[int, str, slice, Iterable[Union[str, int]]]

The column to get. Can be an integer index, a string column name, a slice for multiple columns, or an iterable of column names or indices.

required

Returns:

Type Description
Union[Expression, DataFrame]

If a single column is requested, returns an Expression representing that column. If multiple columns are requested (via a slice or iterable), returns a new DataFrame containing those columns.

Examples:

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
>>> df["a"]  # Get a single column
col(a)
>>> df["b"]  # Get another single column
col(b)
>>> df[0]  # Get the first column by index
col(a)
>>> df[1:3]  # Get a slice of columns
╭───────┬───────╮
│ b     ┆ c     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╰───────┴───────╯
(No data to display: Dataframe not materialized, use .collect() to materialize)
>>> df[["a", "c"]]  # Get multiple columns by name
╭───────┬───────╮
│ a     ┆ c     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╰───────┴───────╯
(No data to display: Dataframe not materialized, use .collect() to materialize)
>>> df[["a", 1]]  # Get multiple columns by name and index
╭───────┬───────╮
│ a     ┆ b     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╰───────┴───────╯
(No data to display: Dataframe not materialized, use .collect() to materialize)
>>> df[0:2]  # Get a slice of columns by index
╭───────┬───────╮
│ a     ┆ b     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╰───────┴───────╯
(No data to display: Dataframe not materialized, use .collect() to materialize)
>>> df[["a", "b", 2]]  # Get a mix of column names and indices
╭───────┬───────┬───────╮
│ a     ┆ b     ┆ c     │
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 │
╰───────┴───────┴───────╯
(No data to display: Dataframe not materialized, use .collect() to materialize)
Source code in daft/dataframe/dataframe.py
def __getitem__(self, item: int | str | slice | Iterable[str | int]) -> Union[Expression, "DataFrame"]:
    """Gets a column from the DataFrame as an Expression (``df["mycol"]``).

    Args:
        item (Union[int, str, slice, Iterable[Union[str, int]]]): The column to get. Can be an integer index, a string column name, a slice for multiple columns, or an iterable of column names or indices.

    Returns:
        Union[Expression, DataFrame]: If a single column is requested, returns an Expression representing that column.
        If multiple columns are requested (via a slice or iterable), returns a new DataFrame containing those columns.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
        >>> df["a"]  # Get a single column
        col(a)
        >>> df["b"]  # Get another single column
        col(b)
        >>> df[0]  # Get the first column by index
        col(a)
        >>> df[1:3]  # Get a slice of columns
        ╭───────┬───────╮
        │ b     ┆ c     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╰───────┴───────╯
        <BLANKLINE>
        (No data to display: Dataframe not materialized, use .collect() to materialize)
        >>> df[["a", "c"]]  # Get multiple columns by name
        ╭───────┬───────╮
        │ a     ┆ c     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╰───────┴───────╯
        <BLANKLINE>
        (No data to display: Dataframe not materialized, use .collect() to materialize)
        >>> df[["a", 1]]  # Get multiple columns by name and index
        ╭───────┬───────╮
        │ a     ┆ b     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╰───────┴───────╯
        <BLANKLINE>
        (No data to display: Dataframe not materialized, use .collect() to materialize)
        >>> df[0:2]  # Get a slice of columns by index
        ╭───────┬───────╮
        │ a     ┆ b     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╰───────┴───────╯
        <BLANKLINE>
        (No data to display: Dataframe not materialized, use .collect() to materialize)
        >>> df[["a", "b", 2]]  # Get a mix of column names and indices
        ╭───────┬───────┬───────╮
        │ a     ┆ b     ┆ c     │
        │ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 │
        ╰───────┴───────┴───────╯
        <BLANKLINE>
        (No data to display: Dataframe not materialized, use .collect() to materialize)

    """
    result: Expression | None

    if isinstance(item, int):
        schema = self._builder.schema()
        if item < -len(schema) or item >= len(schema):
            raise ValueError(f"{item} out of bounds for {schema}")
        result = ExpressionsProjection.from_schema(schema)[item]
        assert result is not None
        return result
    elif isinstance(item, str):
        schema = self._builder.schema()
        if item not in schema.column_names() and item != "*":
            raise ValueError(f"{item} does not exist in schema {schema}")

        return col(item)
    elif isinstance(item, Iterable):
        schema = self._builder.schema()

        columns = []
        for it in item:
            if isinstance(it, str):
                result = col(schema[it].name)
                columns.append(result)
            elif isinstance(it, int):
                if it < -len(schema) or it >= len(schema):
                    raise ValueError(f"{it} out of bounds for {schema}")
                field = list(self._builder.schema())[it]
                columns.append(col(field.name))
            else:
                raise ValueError(f"unknown indexing type: {type(it)}")
        return self.select(*columns)
    elif isinstance(item, slice):
        schema = self._builder.schema()
        columns_exprs: ExpressionsProjection = ExpressionsProjection.from_schema(schema)
        selected_columns = columns_exprs[item]
        return self.select(*[typing.cast("ColumnInputType", c) for c in selected_columns])
    else:
        raise ValueError(f"unknown indexing type: {type(item)}")

__iter__ #

__iter__() -> Iterator[dict[str, Any]]

Alias of self.iter_rows() with default arguments for convenient access of data.

Returns:

Type Description
Iterator[dict[str, Any]]

An iterator over the rows of the DataFrame, where each row is a dictionary mapping column names to values.

Examples:

>>> import daft
>>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})
>>> for row in df:
...     print(row)
{'foo': 1, 'bar': 'a'}
{'foo': 2, 'bar': 'b'}
{'foo': 3, 'bar': 'c'}
Tip

See also df.iter_rows(): iterator over rows with more options

Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def __iter__(self) -> Iterator[dict[str, Any]]:
    """Alias of `self.iter_rows()` with default arguments for convenient access of data.

    Returns:
        Iterator[dict[str, Any]]: An iterator over the rows of the DataFrame, where each row is a dictionary
        mapping column names to values.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})
        >>> for row in df:
        ...     print(row)
        {'foo': 1, 'bar': 'a'}
        {'foo': 2, 'bar': 'b'}
        {'foo': 3, 'bar': 'c'}

    Tip:
        See also [`df.iter_rows()`][daft.DataFrame.iter_rows]: iterator over rows with more options
    """
    return self.iter_rows(results_buffer_size=None)

__len__ #

__len__() -> int

Returns the count of rows when the dataframe is materialized.

If the dataframe is not materialized yet, raises a RuntimeError.

Returns:

Name Type Description
int int

count of rows.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> df = df.collect()
>>> len(df)
3
Source code in daft/dataframe/dataframe.py
def __len__(self) -> int:
    """Returns the count of rows when dataframe is materialized.

    If dataframe is not materialized yet, raises a runtime error.

    Returns:
        int: count of rows.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> df = df.collect()
        >>> len(df)
        3

    """
    if self._result is not None:
        return len(self._result)

    message = (
        "Cannot call len() on an unmaterialized dataframe:"
        " either materialize your dataframe with df.collect() first before calling len(),"
        " or use `df.count_rows()` instead which will calculate the total number of rows."
    )
    raise RuntimeError(message)

agg #

agg(*to_agg: Expression | Iterable[Expression]) -> DataFrame

Perform aggregations on this DataFrame.

Allows for mixed aggregations for multiple columns and will return a single row that aggregates the entire DataFrame.

Parameters:

Name Type Description Default
*to_agg Expression

aggregation expressions

()

Returns:

Name Type Description
DataFrame DataFrame

DataFrame with aggregated results

Examples:

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict(
...     {"student_id": [1, 2, 3, 4], "test1": [0.5, 0.4, 0.6, 0.7], "test2": [0.9, 0.8, 0.7, 1.0]}
... )
>>> agg_df = df.agg(
...     df["test1"].mean(),
...     df["test2"].mean(),
...     ((df["test1"] + df["test2"]) / 2).min().alias("total_min"),
...     ((df["test1"] + df["test2"]) / 2).max().alias("total_max"),
... )
>>> agg_df.show()
╭─────────┬────────────────────┬────────────────────┬───────────╮
│ test1   ┆ test2              ┆ total_min          ┆ total_max │
│ ---     ┆ ---                ┆ ---                ┆ ---       │
│ Float64 ┆ Float64            ┆ Float64            ┆ Float64   │
╞═════════╪════════════════════╪════════════════════╪═══════════╡
│ 0.55    ┆ 0.8500000000000001 ┆ 0.6000000000000001 ┆ 0.85      │
╰─────────┴────────────────────┴────────────────────┴───────────╯
(Showing first 1 of 1 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def agg(self, *to_agg: Expression | Iterable[Expression]) -> "DataFrame":
    """Perform aggregations on this DataFrame.

    Allows for mixed aggregations for multiple columns and will return a single row that aggregated the entire DataFrame.

    Args:
        *to_agg (Expression): aggregation expressions

    Returns:
        DataFrame: DataFrame with aggregated results

    Examples:
        >>> import daft
        >>> from daft import col
        >>> df = daft.from_pydict(
        ...     {"student_id": [1, 2, 3, 4], "test1": [0.5, 0.4, 0.6, 0.7], "test2": [0.9, 0.8, 0.7, 1.0]}
        ... )
        >>> agg_df = df.agg(
        ...     df["test1"].mean(),
        ...     df["test2"].mean(),
        ...     ((df["test1"] + df["test2"]) / 2).min().alias("total_min"),
        ...     ((df["test1"] + df["test2"]) / 2).max().alias("total_max"),
        ... )
        >>> agg_df.show()
        ╭─────────┬────────────────────┬────────────────────┬───────────╮
        │ test1   ┆ test2              ┆ total_min          ┆ total_max │
        │ ---     ┆ ---                ┆ ---                ┆ ---       │
        │ Float64 ┆ Float64            ┆ Float64            ┆ Float64   │
        ╞═════════╪════════════════════╪════════════════════╪═══════════╡
        │ 0.55    ┆ 0.8500000000000001 ┆ 0.6000000000000001 ┆ 0.85      │
        ╰─────────┴────────────────────┴────────────────────┴───────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)

    """
    to_agg_list = (
        list(to_agg[0])
        if (len(to_agg) == 1 and not isinstance(to_agg[0], Expression))
        else list(typing.cast("tuple[Expression]", to_agg))
    )

    for expr in to_agg_list:
        if not isinstance(expr, Expression):
            raise ValueError(f"DataFrame.agg() only accepts expression type, received: {type(expr)}")

    return self._agg(to_agg_list, group_by=None)

agg_concat #

agg_concat(*cols: ColumnInputType) -> DataFrame

Performs a global list concatenation agg on the DataFrame.

Parameters:

Name Type Description Default
*cols Union[str, Expression]

columns that are lists to concatenate

()

Returns:

Name Type Description
DataFrame DataFrame

Globally aggregated list. Should be a single row.

Examples:

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict({"col_a": [[1, 2], [3, 4]]})
>>> df = df.agg_concat("col_a")
>>> df.show()
╭──────────────╮
│ col_a        │
│ ---          │
│ List[Int64]  │
╞══════════════╡
│ [1, 2, 3, 4] │
╰──────────────╯
(Showing first 1 of 1 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def agg_concat(self, *cols: ColumnInputType) -> "DataFrame":
    """Performs a global list concatenation agg on the DataFrame.

    Args:
        *cols (Union[str, Expression]): columns that are lists to concatenate
    Returns:
        DataFrame: Globally aggregated list. Should be a single row.

    Examples:
        >>> import daft
        >>> from daft import col
        >>> df = daft.from_pydict({"col_a": [[1, 2], [3, 4]]})
        >>> df = df.agg_concat("col_a")
        >>> df.show()
        ╭──────────────╮
        │ col_a        │
        │ ---          │
        │ List[Int64]  │
        ╞══════════════╡
        │ [1, 2, 3, 4] │
        ╰──────────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    return self._apply_agg_fn(Expression.string_agg, cols)

agg_list #

agg_list(*cols: ColumnInputType) -> DataFrame

Performs a global list agg on the DataFrame.

Parameters:

Name Type Description Default
*cols Union[str, Expression]

columns to form into a list

()

Returns:

Name Type Description
DataFrame DataFrame

Globally aggregated list. Should be a single row.

Examples:

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict({"col_a": [1, 2, 3]})
>>> df = df.agg_list("col_a")
>>> df.show()
╭─────────────╮
│ col_a       │
│ ---         │
│ List[Int64] │
╞═════════════╡
│ [1, 2, 3]   │
╰─────────────╯
(Showing first 1 of 1 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def agg_list(self, *cols: ColumnInputType) -> "DataFrame":
    """Performs a global list agg on the DataFrame.

    Args:
        *cols (Union[str, Expression]): columns to form into a list
    Returns:
        DataFrame: Globally aggregated list. Should be a single row.

    Examples:
        >>> import daft
        >>> from daft import col
        >>> df = daft.from_pydict({"col_a": [1, 2, 3]})
        >>> df = df.agg_list("col_a")
        >>> df.show()
        ╭─────────────╮
        │ col_a       │
        │ ---         │
        │ List[Int64] │
        ╞═════════════╡
        │ [1, 2, 3]   │
        ╰─────────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    return self._apply_agg_fn(Expression.list_agg, cols)

agg_set #

agg_set(*cols: ColumnInputType) -> DataFrame

Performs a global set agg on the DataFrame (ignoring nulls).

Parameters:

Name Type Description Default
*cols Union[str, Expression]

columns to form into a set

()

Returns:

Name Type Description
DataFrame DataFrame

Globally aggregated set. Should be a single row.

Examples:

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict({"col_a": [1, 2, 2, 3]})
>>> df = df.agg_set("col_a")
>>> df.show()
╭─────────────╮
│ col_a       │
│ ---         │
│ List[Int64] │
╞═════════════╡
│ [1, 2, 3]   │
╰─────────────╯
(Showing first 1 of 1 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def agg_set(self, *cols: ColumnInputType) -> "DataFrame":
    """Performs a global set agg on the DataFrame (ignoring nulls).

    Args:
        *cols (Union[str, Expression]): columns to form into a set

    Returns:
        DataFrame: Globally aggregated set. Should be a single row.

    Examples:
        >>> import daft
        >>> from daft import col
        >>> df = daft.from_pydict({"col_a": [1, 2, 2, 3]})
        >>> df = df.agg_set("col_a")
        >>> df.show()
        ╭─────────────╮
        │ col_a       │
        │ ---         │
        │ List[Int64] │
        ╞═════════════╡
        │ [1, 2, 3]   │
        ╰─────────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    return self._apply_agg_fn(Expression.list_agg_distinct, cols)

any_value #

any_value(*cols: ColumnInputType) -> DataFrame

Returns an arbitrary value on this DataFrame.

Values for each column are not guaranteed to be from the same row.

Parameters:

Name Type Description Default
*cols Union[str, Expression]

columns to get an arbitrary value from

()

Returns:

Name Type Description
DataFrame DataFrame

DataFrame with any values.

Examples:

>>> import daft
>>> df = daft.from_pydict({"col_a": [1, 2, 3]})
>>> df = df.any_value("col_a")
>>> df.show()
╭───────╮
│ col_a │
│ ---   │
│ Int64 │
╞═══════╡
│ 1     │
╰───────╯
(Showing first 1 of 1 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def any_value(self, *cols: ColumnInputType) -> "DataFrame":
    """Returns an arbitrary value on this DataFrame.

    Values for each column are not guaranteed to be from the same row.

    Args:
        *cols (Union[str, Expression]): columns to get an arbitrary value from
    Returns:
        DataFrame: DataFrame with any values.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"col_a": [1, 2, 3]})
        >>> df = df.any_value("col_a")
        >>> df.show()
        ╭───────╮
        │ col_a │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 1     │
        ╰───────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    return self._apply_agg_fn(Expression.any_value, cols)

collect #

collect(num_preview_rows: int | None = 8) -> DataFrame

Executes the entire DataFrame and materializes the results.

Parameters:

Name Type Description Default
num_preview_rows int | None

Number of rows to preview. Defaults to 8.

8

Returns:

Name Type Description
DataFrame DataFrame

DataFrame with materialized results.

Note

This call is blocking and will execute the DataFrame when called

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> df = df.collect()
>>> df.show()
╭───────┬───────╮
│ x     ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
╰───────┴───────╯
(Showing first 3 of 3 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def collect(self, num_preview_rows: int | None = 8) -> "DataFrame":
    """Executes the entire DataFrame and materializes the results.

    Args:
        num_preview_rows: Number of rows to preview. Defaults to 8.

    Returns:
        DataFrame: DataFrame with materialized results.

    Note:
        This call is **blocking** and will execute the DataFrame when called

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> df = df.collect()
        >>> df.show()
        ╭───────┬───────╮
        │ x     ┆ y     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    self._materialize_results()
    assert self._result is not None
    dataframe_len = len(self._result)
    if num_preview_rows is not None:
        self._num_preview_rows = num_preview_rows
    else:
        self._num_preview_rows = dataframe_len
    return self

concat #

concat(other: DataFrame) -> DataFrame

Concatenates two DataFrames together in a "vertical" concatenation.

The resulting DataFrame has number of rows equal to the sum of the number of rows of the input DataFrames.

Parameters:

Name Type Description Default
other DataFrame

other DataFrame to concatenate

required

Returns:

Name Type Description
DataFrame DataFrame

DataFrame with rows from self on top and rows from other at the bottom.

Note

DataFrames being concatenated must have exactly the same schema. You may wish to use the df.select() and expr.cast() methods to ensure schema compatibility before concatenation.

Examples:

>>> import daft
>>> df1 = daft.from_pydict({"a": [1, 2], "b": [3, 4]})
>>> df2 = daft.from_pydict({"a": [5, 6], "b": [7, 8]})
>>> concatenated_df = df1.concat(df2)
>>> concatenated_df.show()
╭───────┬───────╮
│ a     ┆ b     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 3     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5     ┆ 7     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 6     ┆ 8     │
╰───────┴───────╯
(Showing first 4 of 4 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def concat(self, other: "DataFrame") -> "DataFrame":
    """Concatenates two DataFrames together in a "vertical" concatenation.

    The resulting DataFrame has number of rows equal to the sum of the number of rows of the input DataFrames.

    Args:
        other (DataFrame): other DataFrame to concatenate

    Returns:
        DataFrame: DataFrame with rows from `self` on top and rows from `other` at the bottom.

    Note:
        DataFrames being concatenated **must have exactly the same schema**. You may wish to use the
        [df.select()][daft.DataFrame.select] and [expr.cast()][daft.expressions.Expression.cast] methods
        to ensure schema compatibility before concatenation.

    Examples:
        >>> import daft
        >>> df1 = daft.from_pydict({"a": [1, 2], "b": [3, 4]})
        >>> df2 = daft.from_pydict({"a": [5, 6], "b": [7, 8]})
        >>> concatenated_df = df1.concat(df2)
        >>> concatenated_df.show()
        ╭───────┬───────╮
        │ a     ┆ b     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 3     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 5     ┆ 7     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 6     ┆ 8     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 4 of 4 rows)
    """
    if self.schema() != other.schema():
        raise ValueError(
            f"DataFrames must have exactly the same schema for concatenation!\nExpected:\n{self.schema()}\n\nReceived:\n{other.schema()}"
        )
    builder = self._builder.concat(other._builder)
    return DataFrame(builder)
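The Note above suggests aligning schemas with select() and cast() before concatenating; a hedged sketch of that pattern (the column name and target type here are illustrative):

>>> import daft
>>> df1 = daft.from_pydict({"a": [1, 2]})  # column "a" is Int64
>>> df2 = daft.from_pydict({"a": [3.0, 4.0]})  # column "a" is Float64, so the schemas differ
>>> aligned = df2.select(df2["a"].cast(daft.DataType.int64()))  # cast to match df1's schema
>>> df1.concat(aligned).count_rows()
4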

count #

count(*cols: ColumnInputType | int) -> DataFrame

Performs a global count on the DataFrame.

Parameters:

Name Type Description Default
*cols Union[str, Expression, int]

columns to count

()

Returns:

Name Type Description
DataFrame DataFrame

Globally aggregated count. Should be a single row.

Examples:

If no columns are specified (i.e. in the case you call df.count()), or only the literal string "*", this functions very similarly to a COUNT(*) operation in SQL and will return a new dataframe with a single column with the name "count".

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict({"foo": [1, None, None], "bar": [None, 2, 2], "baz": [3, 4, 5]})
>>> df.count().show()  # equivalent to df.count("*").show()
╭────────╮
│ count  │
│ ---    │
│ UInt64 │
╞════════╡
│ 3      │
╰────────╯
(Showing first 1 of 1 rows)

However, specifying some column names would instead change the behavior to count all non-null values, similar to a SQL command for SELECT COUNT(foo), COUNT(bar) FROM df. Also, using df.count(col("*")) will expand out into count() for each column.

>>> df.count("foo", "bar").show()
╭────────┬────────╮
│ foo    ┆ bar    │
│ ---    ┆ ---    │
│ UInt64 ┆ UInt64 │
╞════════╪════════╡
│ 1      ┆ 2      │
╰────────┴────────╯
(Showing first 1 of 1 rows)
>>> df.count(df["*"]).show()
╭────────┬────────┬────────╮
│ foo    ┆ bar    ┆ baz    │
│ ---    ┆ ---    ┆ ---    │
│ UInt64 ┆ UInt64 ┆ UInt64 │
╞════════╪════════╪════════╡
│ 1      ┆ 2      ┆ 3      │
╰────────┴────────┴────────╯
(Showing first 1 of 1 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def count(self, *cols: ColumnInputType | int) -> "DataFrame":
    """Performs a global count on the DataFrame.

    Args:
        *cols (Union[str, Expression, int]): columns to count
    Returns:
        DataFrame: Globally aggregated count. Should be a single row.


    Examples:
        If no columns are specified (i.e. in the case you call `df.count()`), or only the literal string "*",
        this functions very similarly to a COUNT(*) operation in SQL and will return a new dataframe with a
        single column with the name "count".

        >>> import daft
        >>> from daft import col
        >>> df = daft.from_pydict({"foo": [1, None, None], "bar": [None, 2, 2], "baz": [3, 4, 5]})
        >>> df.count().show()  # equivalent to df.count("*").show()
        ╭────────╮
        │ count  │
        │ ---    │
        │ UInt64 │
        ╞════════╡
        │ 3      │
        ╰────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)

        However, specifying some column names would instead change the behavior to count all non-null values,
        similar to a SQL command for `SELECT COUNT(foo), COUNT(bar) FROM df`. Also, using `df.count(col("*"))`
        will expand out into count() for each column.

        >>> df.count("foo", "bar").show()
        ╭────────┬────────╮
        │ foo    ┆ bar    │
        │ ---    ┆ ---    │
        │ UInt64 ┆ UInt64 │
        ╞════════╪════════╡
        │ 1      ┆ 2      │
        ╰────────┴────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)

        >>> df.count(df["*"]).show()
        ╭────────┬────────┬────────╮
        │ foo    ┆ bar    ┆ baz    │
        │ ---    ┆ ---    ┆ ---    │
        │ UInt64 ┆ UInt64 ┆ UInt64 │
        ╞════════╪════════╪════════╡
        │ 1      ┆ 2      ┆ 3      │
        ╰────────┴────────┴────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)

    """
    # Special case: treat this as a COUNT(*) operation which is likely what most people would expect
    # If user passes in "*", also do this behavior (by default it would count each column individually)
    if (
        len(cols) == 0
        or (len(cols) == 1 and isinstance(cols[0], str) and cols[0] == "*")
        or (len(cols) == 1 and isinstance(cols[0], int))
    ):
        builder = self._builder.count()
        return DataFrame(builder)

    if any(isinstance(c, str) and c == "*" for c in cols):
        # we do not support hybrid count-all and count-nonnull
        raise ValueError("Cannot call count() with both * and column names")

    # Otherwise, perform a column-wise count on the specified columns
    return self._apply_agg_fn(Expression.count, typing.cast("tuple[ColumnInputType, ...]", cols))

count_rows #

count_rows() -> int

Executes the DataFrame to count the number of rows.

Returns:

Name Type Description
int int

count of the number of rows in this DataFrame.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
>>> df.count_rows()
3
Note

This will execute the DataFrame and return the number of rows in it.

Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def count_rows(self) -> int:
    """Executes the Dataframe to count the number of rows.

    Returns:
        int: count of the number of rows in this DataFrame.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
        >>> df.count_rows()
        3

    Note:
        This will execute the DataFrame and return the number of rows in it.

    """
    if self._result is not None:
        return len(self._result)
    builder = self._builder.count()
    count_df = DataFrame(builder)
    # Expects builder to produce a single-partition, single-row DataFrame containing
    # a "count" column, where the lone value represents the row count for the DataFrame.
    return count_df.to_pydict()["count"][0]

describe #

describe() -> DataFrame

Returns the Schema of the DataFrame, which provides information about each column, as a new DataFrame.

Returns:

Name Type Description
DataFrame DataFrame

A dataframe where each row is a column name and its corresponding type.

Examples:

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": ["x", "y", "z"]})
>>> df.describe().show()
╭─────────────┬────────╮
│ column_name ┆ type   │
│ ---         ┆ ---    │
│ String      ┆ String │
╞═════════════╪════════╡
│ a           ┆ Int64  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b           ┆ String │
╰─────────────┴────────╯
(Showing first 2 of 2 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def describe(self) -> "DataFrame":
    """Returns the Schema of the DataFrame, which provides information about each column, as a new DataFrame.

    Returns:
        DataFrame: A dataframe where each row is a column name and its corresponding type.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"a": [1, 2, 3], "b": ["x", "y", "z"]})
        >>> df.describe().show()
        ╭─────────────┬────────╮
        │ column_name ┆ type   │
        │ ---         ┆ ---    │
        │ String      ┆ String │
        ╞═════════════╪════════╡
        │ a           ┆ Int64  │
        ├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ b           ┆ String │
        ╰─────────────┴────────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)
    """
    builder = self.__builder.describe()
    return DataFrame(builder)

distinct #

distinct(*on: ColumnInputType) -> DataFrame

Computes distinct rows, dropping duplicates.

Optionally, specify a subset of columns to perform distinct on.

Parameters:

Name Type Description Default
*on Union[str, Expression]

columns to perform distinct on. Defaults to all columns.

()

Returns:

Name Type Description
DataFrame DataFrame

DataFrame that has only distinct rows.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 2], "y": [4, 5, 5], "z": [7, 8, 8]})
>>> distinct_df = df.distinct()
>>> distinct_df = distinct_df.sort("x")
>>> distinct_df.show()
╭───────┬───────┬───────╮
│ x     ┆ y     ┆ z     │
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 │
╞═══════╪═══════╪═══════╡
│ 1     ┆ 4     ┆ 7     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     ┆ 8     │
╰───────┴───────┴───────╯
(Showing first 2 of 2 rows)
>>> # Pass a subset of columns to perform distinct on
>>> # Note that output for z is non-deterministic. Both 8 and 9 are possible.
>>> df = daft.from_pydict({"x": [1, 2, 2], "y": [4, 5, 5], "z": [7, 8, 9]})
>>> df.distinct("x", daft.col("y")).sort("x").show()
╭───────┬───────┬───────╮
│ x     ┆ y     ┆ z     │
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 │
╞═══════╪═══════╪═══════╡
│ 1     ┆ 4     ┆ 7     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     ┆ 8     │
╰───────┴───────┴───────╯
(Showing first 2 of 2 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def distinct(self, *on: ColumnInputType) -> "DataFrame":
    """Computes distinct rows, dropping duplicates.

    Optionally, specify a subset of columns to perform distinct on.

    Args:
        *on (Union[str, Expression]): columns to perform distinct on. Defaults to all columns.

    Returns:
        DataFrame: DataFrame that has only distinct rows.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 2], "y": [4, 5, 5], "z": [7, 8, 8]})
        >>> distinct_df = df.distinct()
        >>> distinct_df = distinct_df.sort("x")
        >>> distinct_df.show()
        ╭───────┬───────┬───────╮
        │ x     ┆ y     ┆ z     │
        │ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 │
        ╞═══════╪═══════╪═══════╡
        │ 1     ┆ 4     ┆ 7     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     ┆ 8     │
        ╰───────┴───────┴───────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)
        >>> # Pass a subset of columns to perform distinct on
        >>> # Note that output for z is non-deterministic. Both 8 and 9 are possible.
        >>> df = daft.from_pydict({"x": [1, 2, 2], "y": [4, 5, 5], "z": [7, 8, 9]})
        >>> df.distinct("x", daft.col("y")).sort("x").show()
        ╭───────┬───────┬───────╮
        │ x     ┆ y     ┆ z     │
        │ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 │
        ╞═══════╪═══════╪═══════╡
        │ 1     ┆ 4     ┆ 7     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     ┆ 8     │
        ╰───────┴───────┴───────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)
    """
    builder = self._builder.distinct(column_inputs_to_expressions(on))
    return DataFrame(builder)

drop_duplicates #

drop_duplicates(*subset: ColumnInputType) -> DataFrame

Computes distinct rows, dropping duplicates.

Alias for DataFrame.distinct.

Parameters:

Name Type Description Default
*subset Union[str, Expression]

columns to perform distinct on. Defaults to all columns.

()

Returns:

Name Type Description
DataFrame DataFrame

DataFrame that has only distinct rows.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 2], "y": [4, 5, 5], "z": [7, 8, 8]})
>>> distinct_df = df.drop_duplicates()
>>> distinct_df = distinct_df.sort("x")
>>> distinct_df.show()
╭───────┬───────┬───────╮
│ x     ┆ y     ┆ z     │
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 │
╞═══════╪═══════╪═══════╡
│ 1     ┆ 4     ┆ 7     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     ┆ 8     │
╰───────┴───────┴───────╯
(Showing first 2 of 2 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def drop_duplicates(self, *subset: ColumnInputType) -> "DataFrame":
    """Computes distinct rows, dropping duplicates.

    Alias for [DataFrame.distinct][daft.DataFrame.distinct].

    Args:
        *subset (Union[str, Expression]): columns to perform distinct on. Defaults to all columns.

    Returns:
        DataFrame: DataFrame that has only distinct rows.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 2], "y": [4, 5, 5], "z": [7, 8, 8]})
        >>> distinct_df = df.drop_duplicates()
        >>> distinct_df = distinct_df.sort("x")
        >>> distinct_df.show()
        ╭───────┬───────┬───────╮
        │ x     ┆ y     ┆ z     │
        │ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 │
        ╞═══════╪═══════╪═══════╡
        │ 1     ┆ 4     ┆ 7     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     ┆ 8     │
        ╰───────┴───────┴───────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)
    """
    return self.distinct(*subset)

drop_nan #

drop_nan(*cols: ColumnInputType) -> DataFrame

Drops rows that contain NaNs. If cols is None, it will drop rows with any NaN value.

If column names are supplied, it will drop only those rows that contain NaNs in one of these columns.

Parameters:

Name Type Description Default
*cols str

column names by which rows containing nans/NULLs should be filtered

()

Returns:

Name Type Description
DataFrame DataFrame

DataFrame without NaNs in specified/all columns

Examples:

>>> import daft
>>> df = daft.from_pydict({"a": [1.0, 2.2, 3.5, float("nan")]})
>>> df.drop_nan().collect()  # drops rows where any column contains NaN values
╭─────────╮
│ a       │
│ ---     │
│ Float64 │
╞═════════╡
│ 1       │
├╌╌╌╌╌╌╌╌╌┤
│ 2.2     │
├╌╌╌╌╌╌╌╌╌┤
│ 3.5     │
╰─────────╯
(Showing first 3 of 3 rows)
>>> import daft
>>> df = daft.from_pydict({"a": [1.6, 2.5, 3.3, float("nan")]})
>>> df.drop_nan("a").collect()  # drops rows where column `a` contains NaN values
╭─────────╮
│ a       │
│ ---     │
│ Float64 │
╞═════════╡
│ 1.6     │
├╌╌╌╌╌╌╌╌╌┤
│ 2.5     │
├╌╌╌╌╌╌╌╌╌┤
│ 3.3     │
╰─────────╯
(Showing first 3 of 3 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def drop_nan(self, *cols: ColumnInputType) -> "DataFrame":
    """Drops rows that contains NaNs. If cols is None it will drop rows with any NaN value.

    If column names are supplied, it will drop only those rows that contains NaNs in one of these columns.

    Args:
        *cols (str): column names by which rows containing nans/NULLs should be filtered

    Returns:
        DataFrame: DataFrame without NaNs in specified/all columns

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"a": [1.0, 2.2, 3.5, float("nan")]})
        >>> df.drop_nan().collect()  # drops rows where any column contains NaN values
        ╭─────────╮
        │ a       │
        │ ---     │
        │ Float64 │
        ╞═════════╡
        │ 1       │
        ├╌╌╌╌╌╌╌╌╌┤
        │ 2.2     │
        ├╌╌╌╌╌╌╌╌╌┤
        │ 3.5     │
        ╰─────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)

        >>> import daft
        >>> df = daft.from_pydict({"a": [1.6, 2.5, 3.3, float("nan")]})
        >>> df.drop_nan("a").collect()  # drops rows where column `a` contains NaN values
        ╭─────────╮
        │ a       │
        │ ---     │
        │ Float64 │
        ╞═════════╡
        │ 1.6     │
        ├╌╌╌╌╌╌╌╌╌┤
        │ 2.5     │
        ├╌╌╌╌╌╌╌╌╌┤
        │ 3.3     │
        ╰─────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)

    """
    if len(cols) == 0:
        columns = column_inputs_to_expressions(self.column_names)
    else:
        columns = column_inputs_to_expressions(cols)
    float_columns = [
        column
        for column in columns
        if (
            column._to_field(self.schema()).dtype == DataType.float32()
            or column._to_field(self.schema()).dtype == DataType.float64()
        )
    ]

    # avoid superfluous .where with empty iterable when nothing to filter.
    if not float_columns:
        return self

    from daft.functions import is_nan, when

    return self.where(
        ~reduce(
            lambda x, y: when(x.is_null(), lit(False)).otherwise(x) | when(y.is_null(), lit(False)).otherwise(y),
            (is_nan(x) for x in float_columns),
        )
    )

drop_null #

drop_null(*cols: ColumnInputType) -> DataFrame

Drops rows that contain NaNs or NULLs. If cols is None, it will drop rows with any NULL value.

If column names are supplied, it will drop only those rows that contain NULLs in one of these columns.

Parameters:

Name Type Description Default
*cols str

column names by which rows containing nans should be filtered

()

Returns:

Name Type Description
DataFrame DataFrame

DataFrame without missing values in specified/all columns

Examples:

>>> import daft
>>> df = daft.from_pydict({"a": [1.6, 2.5, None, float("NaN")]})
>>> df.drop_null("a").collect()
╭─────────╮
│ a       │
│ ---     │
│ Float64 │
╞═════════╡
│ 1.6     │
├╌╌╌╌╌╌╌╌╌┤
│ 2.5     │
├╌╌╌╌╌╌╌╌╌┤
│ NaN     │
╰─────────╯
(Showing first 3 of 3 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def drop_null(self, *cols: ColumnInputType) -> "DataFrame":
    """Drops rows that contains NaNs or NULLs. If cols is None it will drop rows with any NULL value.

    If column names are supplied, it will drop only those rows that contains NULLs in one of these columns.

    Args:
        *cols (str): column names by which rows containing nans should be filtered

    Returns:
        DataFrame: DataFrame without missing values in specified/all columns

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"a": [1.6, 2.5, None, float("NaN")]})
        >>> df.drop_null("a").collect()
        ╭─────────╮
        │ a       │
        │ ---     │
        │ Float64 │
        ╞═════════╡
        │ 1.6     │
        ├╌╌╌╌╌╌╌╌╌┤
        │ 2.5     │
        ├╌╌╌╌╌╌╌╌╌┤
        │ NaN     │
        ╰─────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)


    """
    if len(cols) == 0:
        columns = column_inputs_to_expressions(self.column_names)
    else:
        columns = column_inputs_to_expressions(cols)
    return self.where(~reduce(lambda x, y: x | y, (x.is_null() for x in columns)))

except_all #

except_all(other: DataFrame) -> DataFrame

Returns the set difference of two DataFrames, considering duplicates.

Parameters:

Name Type Description Default
other DataFrame

DataFrame to except with

required

Returns:

Name Type Description
DataFrame DataFrame

DataFrame with the set difference of the two DataFrames, considering duplicates

Examples:

>>> import daft
>>> df1 = daft.from_pydict({"a": [1, 1, 2, 2], "b": [4, 4, 6, 6]})
>>> df2 = daft.from_pydict({"a": [1, 2, 2], "b": [4, 6, 6]})
>>> df1.except_all(df2).collect()
╭───────┬───────╮
│ a     ┆ b     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
╰───────┴───────╯
(Showing first 1 of 1 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def except_all(self, other: "DataFrame") -> "DataFrame":
    """Returns the set difference of two DataFrames, considering duplicates.

    Args:
        other (DataFrame): DataFrame to except with

    Returns:
        DataFrame: DataFrame with the set difference of the two DataFrames, considering duplicates

    Examples:
        >>> import daft
        >>> df1 = daft.from_pydict({"a": [1, 1, 2, 2], "b": [4, 4, 6, 6]})
        >>> df2 = daft.from_pydict({"a": [1, 2, 2], "b": [4, 6, 6]})
        >>> df1.except_all(df2).collect()
        ╭───────┬───────╮
        │ a     ┆ b     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 4     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)

    """
    builder = self._builder.except_all(other._builder)
    return DataFrame(builder)

except_distinct #

except_distinct(other: DataFrame) -> DataFrame

Returns the set difference of two DataFrames.

Parameters:

Name Type Description Default
other DataFrame

DataFrame to except with

required

Returns:

Name Type Description
DataFrame DataFrame

DataFrame with the set difference of the two DataFrames

Examples:

>>> import daft
>>> df1 = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> df2 = daft.from_pydict({"a": [1, 2, 3], "b": [4, 8, 6]})
>>> df1.except_distinct(df2).collect()
╭───────┬───────╮
│ a     ┆ b     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 2     ┆ 5     │
╰───────┴───────╯
(Showing first 1 of 1 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def except_distinct(self, other: "DataFrame") -> "DataFrame":
    """Returns the set difference of two DataFrames.

    Args:
        other (DataFrame): DataFrame to except with

    Returns:
        DataFrame: DataFrame with the set difference of the two DataFrames

    Examples:
        >>> import daft
        >>> df1 = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
        >>> df2 = daft.from_pydict({"a": [1, 2, 3], "b": [4, 8, 6]})
        >>> df1.except_distinct(df2).collect()
        ╭───────┬───────╮
        │ a     ┆ b     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 2     ┆ 5     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)

    """
    builder = self._builder.except_distinct(other._builder)
    return DataFrame(builder)

exclude #

exclude(*names: str) -> DataFrame

Drops columns from the current DataFrame by name.

This is equivalent to performing a select with all the columns except the ones excluded.

Parameters:

Name Type Description Default
*names str

names to exclude

()

Returns:

Name Type Description
DataFrame DataFrame

DataFrame with some columns excluded.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
>>> df_without_x = df.exclude("x")
>>> df_without_x.show()
╭───────┬───────╮
│ y     ┆ z     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 4     ┆ 7     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5     ┆ 8     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 6     ┆ 9     │
╰───────┴───────╯
(Showing first 3 of 3 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def exclude(self, *names: str) -> "DataFrame":
    """Drops columns from the current DataFrame by name.

    This is equivalent to performing a select with all the columns except the ones excluded.

    Args:
        *names (str): names to exclude

    Returns:
        DataFrame: DataFrame with some columns excluded.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
        >>> df_without_x = df.exclude("x")
        >>> df_without_x.show()
        ╭───────┬───────╮
        │ y     ┆ z     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 4     ┆ 7     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 5     ┆ 8     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 6     ┆ 9     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    builder = self._builder.exclude(list(names))
    return DataFrame(builder)

explain #

explain(show_all: bool = False, format: str = 'ascii', simple: bool = False, file: IOBase | None = None) -> Any

Prints the (logical and physical) plans that will be executed to produce this DataFrame.

Defaults to showing the unoptimized logical plan. Use show_all=True to show the unoptimized logical plan, the optimized logical plan, and the physical plan.

Parameters:

Name Type Description Default
show_all bool

Whether to show the optimized logical plan and the physical plan in addition to the unoptimized logical plan.

False
format str

The format to print the plan in. One of 'ascii' or 'mermaid'.

'ascii'
simple bool

Whether to only show the type of op for each node in the plan, rather than showing details of how each op is configured.

False
file Optional[IOBase]

Location to print the output to. Defaults to None, which uses the default output location for print (in Python, sys.stdout).

None

Returns:

Type Description
Any

Union[None, str, MermaidFormatter]: - If format="mermaid" and running in a notebook, returns a MermaidFormatter instance for rich rendering. - If format="mermaid" and not in a notebook, returns a string representation of the plan. - Otherwise, prints the plan(s) to the specified file or stdout and returns None.

Examples:

>>> import daft
>>>
>>> df = daft.from_pydict({"x": [1, 2, 3]})
>>>
>>> def double(df, column: str):
...     return df.select((df[column] * df[column]).alias(column))
>>>
>>> df = df.pipe(double, "x")
>>>
>>> df.explain()
== Unoptimized Logical Plan ==
* Project: col(x) * col(x) as x
|
* Source:
|   Number of partitions = 1
|   Output schema = x#Int64
Set `show_all=True` to also see the Optimized and Physical plans. This will run the query optimizer.
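
To also inspect the optimized logical plan and the physical plan, or to obtain a Mermaid rendering, the same call can be made with different arguments. A minimal sketch follows; the exact output depends on the query optimizer and the active runner, so it is omitted here:

>>> df.explain(show_all=True)  # also prints the Optimized Logical Plan and the Physical Plan
>>> plan = df.explain(format="mermaid")  # outside a notebook this returns the plan as a string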
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def explain(
    self, show_all: bool = False, format: str = "ascii", simple: bool = False, file: io.IOBase | None = None
) -> Any:
    r"""Prints the (logical and physical) plans that will be executed to produce this DataFrame.

    Defaults to showing the unoptimized logical plan. Use `show_all=True` to show the unoptimized logical plan,
    the optimized logical plan, and the physical plan.

    Args:
        show_all (bool): Whether to show the optimized logical plan and the physical plan in addition to the
            unoptimized logical plan.
        format (str): The format to print the plan in. One of 'ascii' or 'mermaid'.
        simple (bool): Whether to only show the type of op for each node in the plan, rather than showing details
            of how each op is configured.

        file (Optional[io.IOBase]): Location to print the output to. Defaults to None, which uses the default
            output location for print (in Python, sys.stdout).

    Returns:
        Union[None, str, MermaidFormatter]:
            - If `format="mermaid"` and running in a notebook, returns a `MermaidFormatter` instance for rich rendering.
            - If `format="mermaid"` and not in a notebook, returns a string representation of the plan.
            - Otherwise, prints the plan(s) to the specified file or stdout and returns `None`.

    Examples:
        >>> import daft
        >>>
        >>> df = daft.from_pydict({"x": [1, 2, 3]})
        >>>
        >>> def double(df, column: str):
        ...     return df.select((df[column] * df[column]).alias(column))
        >>>
        >>> df = df.pipe(double, "x")
        >>>
        >>> df.explain()
        == Unoptimized Logical Plan ==
        <BLANKLINE>
        * Project: col(x) * col(x) as x
        |
        * Source:
        |   Number of partitions = 1
        |   Output schema = x#Int64
        <BLANKLINE>
        <BLANKLINE>
        <BLANKLINE>
        Set `show_all=True` to also see the Optimized and Physical plans. This will run the query optimizer.

    """
    is_cached = self._result_cache is not None
    if format == "mermaid":
        from daft.dataframe.display import MermaidFormatter
        from daft.utils import in_notebook

        instance = MermaidFormatter(self.__builder, show_all, simple, is_cached)
        if file is not None:
            # if we are printing to a file, we print the markdown representation of the plan
            text = instance._repr_markdown_()
            print(text, file=file)
        if in_notebook():
            # if in a notebook, we return the class instance and let jupyter display it
            return instance
        else:
            # if we are not in a notebook, we return the raw markdown instead of the class instance
            return repr(instance)

    print_to_file = partial(print, file=file)

    if self._result_cache is not None:
        print_to_file("Result is cached and will skip computation\n")
        print_to_file(self._builder.pretty_print(simple, format=format))

        print_to_file("However here is the logical plan used to produce this result:\n", file=file)

    builder = self.__builder
    print_to_file("== Unoptimized Logical Plan ==\n")
    print_to_file(builder.pretty_print(simple, format=format))
    if show_all:
        print_to_file("\n== Optimized Logical Plan ==\n")
        execution_config = get_context().daft_execution_config
        builder = builder.optimize(execution_config)
        print_to_file(builder.pretty_print(simple))
        print_to_file("\n== Physical Plan ==\n")
        if get_or_create_runner().name != "native":
            from daft.daft import DistributedPhysicalPlan

            distributed_plan = DistributedPhysicalPlan.from_logical_plan_builder(
                builder._builder, "<tmp>", execution_config
            )
            if format == "ascii":
                print_to_file(distributed_plan.repr_ascii(simple))
            elif format == "mermaid":
                print_to_file(distributed_plan.repr_mermaid(MermaidOptions(simple)))
        else:
            native_executor = NativeExecutor()
            print_to_file(
                native_executor.pretty_print(builder, get_context().daft_execution_config, simple, format=format)
            )
    else:
        print_to_file(
            "\n \nSet `show_all=True` to also see the Optimized and Physical plans. This will run the query optimizer.",
        )
    return None

explode #

explode(*columns: ColumnInputType, index_column: ColumnInputType | None = None) -> DataFrame

Explodes a List column, where every element in each row's List becomes its own row, and all other columns in the DataFrame are duplicated across rows.

If multiple columns are specified, each row must contain the same number of items in each specified column.

Exploding Null values or empty lists will create a single Null entry (see example below).

Parameters:

Name Type Description Default
*columns ColumnInputType

columns to explode

()
index_column ColumnInputType | None

optional name for an index column that tracks the position of each element within its original list

None

Returns:

Name Type Description
DataFrame DataFrame

DataFrame with exploded column

Examples:

>>> import daft
>>> df = daft.from_pydict(
...     {
...         "x": [[1], [2, 3]],
...         "y": [["a"], ["b", "c"]],
...         "z": [
...             [1.0],
...             [2.0, 2.0],
...         ],
...     }
... )
>>> df.collect()
╭─────────────┬──────────────┬───────────────╮
│ x           ┆ y            ┆ z             │
│ ---         ┆ ---          ┆ ---           │
│ List[Int64] ┆ List[String] ┆ List[Float64] │
╞═════════════╪══════════════╪═══════════════╡
│ [1]         ┆ [a]          ┆ [1]           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [2, 3]      ┆ [b, c]       ┆ [2, 2]        │
╰─────────────┴──────────────┴───────────────╯
(Showing first 2 of 2 rows)
>>> df.explode(df["x"], df["y"]).collect()
╭───────┬────────┬───────────────╮
│ x     ┆ y      ┆ z             │
│ ---   ┆ ---    ┆ ---           │
│ Int64 ┆ String ┆ List[Float64] │
╞═══════╪════════╪═══════════════╡
│ 1     ┆ a      ┆ [1]           │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2     ┆ b      ┆ [2, 2]        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3     ┆ c      ┆ [2, 2]        │
╰───────┴────────┴───────────────╯
(Showing first 3 of 3 rows)

Example with Null values and empty lists:

>>> df2 = daft.from_pydict(
...     {"id": [1, 2, 3, 4], "values": [[1, 2], [], None, [3]], "labels": [["a", "b"], [], None, ["c"]]}
... )
>>> df2.collect()
╭───────┬─────────────┬──────────────╮
│ id    ┆ values      ┆ labels       │
│ ---   ┆ ---         ┆ ---          │
│ Int64 ┆ List[Int64] ┆ List[String] │
╞═══════╪═════════════╪══════════════╡
│ 1     ┆ [1, 2]      ┆ [a, b]       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2     ┆ []          ┆ []           │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3     ┆ None        ┆ None         │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4     ┆ [3]         ┆ [c]          │
╰───────┴─────────────┴──────────────╯
(Showing first 4 of 4 rows)
>>> df2.explode(df2["values"], df2["labels"]).collect()
╭───────┬────────┬────────╮
│ id    ┆ values ┆ labels │
│ ---   ┆ ---    ┆ ---    │
│ Int64 ┆ Int64  ┆ String │
╞═══════╪════════╪════════╡
│ 1     ┆ 1      ┆ a      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1     ┆ 2      ┆ b      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2     ┆ None   ┆ None   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3     ┆ None   ┆ None   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4     ┆ 3      ┆ c      │
╰───────┴────────┴────────╯
(Showing first 5 of 5 rows)

Example with index_column to track element positions:

>>> df3 = daft.from_pydict({"a": [[1, 2], [3, 4, 3]]})
>>> df3.explode("a", index_column="idx").collect()
╭───────┬────────╮
│ a     ┆ idx    │
│ ---   ┆ ---    │
│ Int64 ┆ UInt64 │
╞═══════╪════════╡
│ 1     ┆ 0      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2     ┆ 1      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3     ┆ 0      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4     ┆ 1      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3     ┆ 2      │
╰───────┴────────╯
(Showing first 5 of 5 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def explode(self, *columns: ColumnInputType, index_column: ColumnInputType | None = None) -> "DataFrame":
    """Explodes a List column, where every element in each row's List becomes its own row, and all other columns in the DataFrame are duplicated across rows.

    If multiple columns are specified, each row must contain the same number of items in each specified column.

    Exploding Null values or empty lists will create a single Null entry (see example below).

    Args:
        *columns (ColumnInputType): columns to explode
        index_column (ColumnInputType | None): optional name for an index column that tracks the position of each element within its original list

    Returns:
        DataFrame: DataFrame with exploded column

    Examples:
        >>> import daft
        >>> df = daft.from_pydict(
        ...     {
        ...         "x": [[1], [2, 3]],
        ...         "y": [["a"], ["b", "c"]],
        ...         "z": [
        ...             [1.0],
        ...             [2.0, 2.0],
        ...         ],
        ...     }
        ... )
        >>> df.collect()
        ╭─────────────┬──────────────┬───────────────╮
        │ x           ┆ y            ┆ z             │
        │ ---         ┆ ---          ┆ ---           │
        │ List[Int64] ┆ List[String] ┆ List[Float64] │
        ╞═════════════╪══════════════╪═══════════════╡
        │ [1]         ┆ [a]          ┆ [1]           │
        ├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ [2, 3]      ┆ [b, c]       ┆ [2, 2]        │
        ╰─────────────┴──────────────┴───────────────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)
        >>> df.explode(df["x"], df["y"]).collect()
        ╭───────┬────────┬───────────────╮
        │ x     ┆ y      ┆ z             │
        │ ---   ┆ ---    ┆ ---           │
        │ Int64 ┆ String ┆ List[Float64] │
        ╞═══════╪════════╪═══════════════╡
        │ 1     ┆ a      ┆ [1]           │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ 2     ┆ b      ┆ [2, 2]        │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ 3     ┆ c      ┆ [2, 2]        │
        ╰───────┴────────┴───────────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)

        Example with Null values and empty lists:

        >>> df2 = daft.from_pydict(
        ...     {"id": [1, 2, 3, 4], "values": [[1, 2], [], None, [3]], "labels": [["a", "b"], [], None, ["c"]]}
        ... )
        >>> df2.collect()
        ╭───────┬─────────────┬──────────────╮
        │ id    ┆ values      ┆ labels       │
        │ ---   ┆ ---         ┆ ---          │
        │ Int64 ┆ List[Int64] ┆ List[String] │
        ╞═══════╪═════════════╪══════════════╡
        │ 1     ┆ [1, 2]      ┆ [a, b]       │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ 2     ┆ []          ┆ []           │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ 3     ┆ None        ┆ None         │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ 4     ┆ [3]         ┆ [c]          │
        ╰───────┴─────────────┴──────────────╯
        <BLANKLINE>
        (Showing first 4 of 4 rows)
        >>> df2.explode(df2["values"], df2["labels"]).collect()
        ╭───────┬────────┬────────╮
        │ id    ┆ values ┆ labels │
        │ ---   ┆ ---    ┆ ---    │
        │ Int64 ┆ Int64  ┆ String │
        ╞═══════╪════════╪════════╡
        │ 1     ┆ 1      ┆ a      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 1     ┆ 2      ┆ b      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 2     ┆ None   ┆ None   │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 3     ┆ None   ┆ None   │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 4     ┆ 3      ┆ c      │
        ╰───────┴────────┴────────╯
        <BLANKLINE>
        (Showing first 5 of 5 rows)

        Example with index_column to track element positions:

        >>> df3 = daft.from_pydict({"a": [[1, 2], [3, 4, 3]]})
        >>> df3.explode("a", index_column="idx").collect()
        ╭───────┬────────╮
        │ a     ┆ idx    │
        │ ---   ┆ ---    │
        │ Int64 ┆ UInt64 │
        ╞═══════╪════════╡
        │ 1     ┆ 0      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 2     ┆ 1      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 3     ┆ 0      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 4     ┆ 1      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 3     ┆ 2      │
        ╰───────┴────────╯
        <BLANKLINE>
        (Showing first 5 of 5 rows)

    """
    parsed_exprs = column_inputs_to_expressions(columns)
    index_col_name = column_input_to_expression(index_column).name() if index_column is not None else None
    builder = self._builder.explode(parsed_exprs, index_col_name)
    return DataFrame(builder)

filter #

filter(predicate: Expression | str) -> DataFrame

Filters rows via a predicate expression, similar to SQL WHERE.

Alias for daft.DataFrame.where.

Parameters:

Name Type Description Default
predicate Expression

expression that keeps row if evaluates to True.

required

Returns:

Name Type Description
DataFrame DataFrame

Filtered DataFrame.

Tip

See also .where(predicate)

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 6, 6], "z": [7, 8, 9]})
>>> df.filter((df["x"] > 1) & (df["y"] > 1)).collect()
╭───────┬───────┬───────╮
│ x     ┆ y     ┆ z     │
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 │
╞═══════╪═══════╪═══════╡
│ 2     ┆ 6     ┆ 8     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     ┆ 9     │
╰───────┴───────┴───────╯
(Showing first 2 of 2 rows)
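
The predicate can also be given as a string, since the signature accepts Expression | str. A minimal sketch, assuming the string form is parsed as a SQL-like expression equivalent to the Expression form above (output omitted):

>>> df.filter("x > 1")  # intended to behave like df.filter(df["x"] > 1)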
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def filter(self, predicate: Expression | str) -> "DataFrame":
    """Filters rows via a predicate expression, similar to SQL ``WHERE``.

    Alias for [daft.DataFrame.where][daft.DataFrame.where].

    Args:
        predicate (Expression): expression that keeps row if evaluates to True.

    Returns:
        DataFrame: Filtered DataFrame.

    Tip:
        See also [.where(predicate)][daft.DataFrame.where]

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 6, 6], "z": [7, 8, 9]})
        >>> df.filter((df["x"] > 1) & (df["y"] > 1)).collect()
        ╭───────┬───────┬───────╮
        │ x     ┆ y     ┆ z     │
        │ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 │
        ╞═══════╪═══════╪═══════╡
        │ 2     ┆ 6     ┆ 8     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     ┆ 9     │
        ╰───────┴───────┴───────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)

    """
    return self.where(predicate)

groupby #

groupby(*group_by: ManyColumnsInputType) -> GroupedDataFrame

Performs a GroupBy on the DataFrame for aggregation.

Parameters:

Name Type Description Default
*group_by Union[str, Expression]

columns to group by

()

Returns:

Name Type Description
GroupedDataFrame GroupedDataFrame

DataFrame to Aggregate

Examples:

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict(
...     {
...         "pet": ["cat", "dog", "dog", "cat"],
...         "age": [1, 2, 3, 4],
...         "name": ["Alex", "Jordan", "Sam", "Riley"],
...     }
... )
>>> grouped_df = df.groupby("pet").agg(
...     df["age"].min().alias("min_age"),
...     df["age"].max().alias("max_age"),
...     df["pet"].count().alias("count"),
...     df["name"].any_value(),
... )
>>> grouped_df = grouped_df.sort("pet")
>>> grouped_df.show()
╭────────┬─────────┬─────────┬────────┬────────╮
│ pet    ┆ min_age ┆ max_age ┆ count  ┆ name   │
│ ---    ┆ ---     ┆ ---     ┆ ---    ┆ ---    │
│ String ┆ Int64   ┆ Int64   ┆ UInt64 ┆ String │
╞════════╪═════════╪═════════╪════════╪════════╡
│ cat    ┆ 1       ┆ 4       ┆ 2      ┆ Alex   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ dog    ┆ 2       ┆ 3       ┆ 2      ┆ Jordan │
╰────────┴─────────┴─────────┴────────┴────────╯
(Showing first 2 of 2 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def groupby(self, *group_by: ManyColumnsInputType) -> "GroupedDataFrame":
    """Performs a GroupBy on the DataFrame for aggregation.

    Args:
        *group_by (Union[str, Expression]): columns to group by

    Returns:
        GroupedDataFrame: DataFrame to Aggregate

    Examples:
        >>> import daft
        >>> from daft import col
        >>> df = daft.from_pydict(
        ...     {
        ...         "pet": ["cat", "dog", "dog", "cat"],
        ...         "age": [1, 2, 3, 4],
        ...         "name": ["Alex", "Jordan", "Sam", "Riley"],
        ...     }
        ... )
        >>> grouped_df = df.groupby("pet").agg(
        ...     df["age"].min().alias("min_age"),
        ...     df["age"].max().alias("max_age"),
        ...     df["pet"].count().alias("count"),
        ...     df["name"].any_value(),
        ... )
        >>> grouped_df = grouped_df.sort("pet")
        >>> grouped_df.show()
        ╭────────┬─────────┬─────────┬────────┬────────╮
        │ pet    ┆ min_age ┆ max_age ┆ count  ┆ name   │
        │ ---    ┆ ---     ┆ ---     ┆ ---    ┆ ---    │
        │ String ┆ Int64   ┆ Int64   ┆ UInt64 ┆ String │
        ╞════════╪═════════╪═════════╪════════╪════════╡
        │ cat    ┆ 1       ┆ 4       ┆ 2      ┆ Alex   │
        ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ dog    ┆ 2       ┆ 3       ┆ 2      ┆ Jordan │
        ╰────────┴─────────┴─────────┴────────┴────────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)

    """
    return GroupedDataFrame(self, ExpressionsProjection(self._wildcard_inputs_to_expressions(group_by)))

intersect #

intersect(other: DataFrame) -> DataFrame

Returns the intersection of two DataFrames.

Parameters:

Name Type Description Default
other DataFrame

DataFrame to intersect with

required

Returns:

Name Type Description
DataFrame DataFrame

DataFrame with the intersection of the two DataFrames

Examples:

>>> import daft
>>> df1 = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> df2 = daft.from_pydict({"a": [1, 2, 3], "b": [4, 8, 6]})
>>> df = df1.intersect(df2)
>>> df = df.sort("a")
>>> df.show()
╭───────┬───────╮
│ a     ┆ b     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
╰───────┴───────╯
(Showing first 2 of 2 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def intersect(self, other: "DataFrame") -> "DataFrame":
    """Returns the intersection of two DataFrames.

    Args:
        other (DataFrame): DataFrame to intersect with

    Returns:
        DataFrame: DataFrame with the intersection of the two DataFrames

    Examples:
        >>> import daft
        >>> df1 = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
        >>> df2 = daft.from_pydict({"a": [1, 2, 3], "b": [4, 8, 6]})
        >>> df = df1.intersect(df2)
        >>> df = df.sort("a")
        >>> df.show()
        ╭───────┬───────╮
        │ a     ┆ b     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)

    """
    builder = self._builder.intersect(other._builder)
    return DataFrame(builder)

intersect_all #

intersect_all(other: DataFrame) -> DataFrame

Returns the intersection of two DataFrames, including duplicates.

Parameters:

Name Type Description Default
other DataFrame

DataFrame to intersect with

required

Returns:

Name Type Description
DataFrame DataFrame

DataFrame with the intersection of the two DataFrames, including duplicates

Examples:

>>> import daft
>>> df1 = daft.from_pydict({"a": [1, 2, 2], "b": [4, 6, 6]})
>>> df2 = daft.from_pydict({"a": [1, 1, 2, 2], "b": [4, 4, 6, 6]})
>>> df1.intersect_all(df2).sort("a").collect()
╭───────┬───────╮
│ a     ┆ b     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 6     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 6     │
╰───────┴───────╯
(Showing first 3 of 3 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def intersect_all(self, other: "DataFrame") -> "DataFrame":
    """Returns the intersection of two DataFrames, including duplicates.

    Args:
        other (DataFrame): DataFrame to intersect with

    Returns:
        DataFrame: DataFrame with the intersection of the two DataFrames, including duplicates

    Examples:
        >>> import daft
        >>> df1 = daft.from_pydict({"a": [1, 2, 2], "b": [4, 6, 6]})
        >>> df2 = daft.from_pydict({"a": [1, 1, 2, 2], "b": [4, 4, 6, 6]})
        >>> df1.intersect_all(df2).sort("a").collect()
        ╭───────┬───────╮
        │ a     ┆ b     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 6     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 6     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)

    """
    builder = self._builder.intersect_all(other._builder)
    return DataFrame(builder)

into_batches #

into_batches(batch_size: int) -> DataFrame

Splits or coalesces DataFrame to partitions of size batch_size.

Note

Batch sizing is performed on a best-effort basis. The heuristic is to emit a batch when we have enough rows to fill batch_size * 0.8 rows. This approach prioritizes processing efficiency over uniform batch sizes, especially when using the Ray Runner, as batches can be distributed over the cluster. The exception to this is that the last batch will be the remainder of the total number of rows in the DataFrame.

Parameters:

Name Type Description Default
batch_size int

number of target rows per partition.

required

Returns:

Name Type Description
DataFrame DataFrame

Dataframe with batch_size rows per partition.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
>>> df = df.into_batches(2)
>>> for i, block in enumerate(df.to_arrow_iter()):
...     assert len(block) == 2, f"Expected batch size 2, got {len(block)}"
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def into_batches(self, batch_size: int) -> "DataFrame":
    """Splits or coalesces DataFrame to partitions of size ``batch_size``.

    Note:
        Batch sizing is performed on a best-effort basis.
        The heuristic is to emit a batch when we have enough rows to fill `batch_size * 0.8` rows.
        This approach prioritizes processing efficiency over uniform batch sizes, especially when using the Ray Runner, as batches can be distributed over the cluster.
        The exception to this is that the last batch will be the remainder of the total number of rows in the DataFrame.

    Args:
        batch_size (int): number of target rows per partition.

    Returns:
        DataFrame: Dataframe with `batch_size` rows per partition.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
        >>> df = df.into_batches(2)
        >>> for i, block in enumerate(df.to_arrow_iter()):
        ...     assert len(block) == 2, f"Expected batch size 2, got {len(block)}"
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be greater than 0")

    builder = self._builder.into_batches(batch_size)
    return DataFrame(builder)

into_partitions #

into_partitions(num: int) -> DataFrame

Splits or coalesces DataFrame to num partitions. Order is preserved.

This naively splits partitions in a round-robin fashion to hit the targeted number of partitions. The number of rows or size of a given partition is not taken into account during the splitting.

Parameters:

Name Type Description Default
num int

number of target partitions.

required

Returns:

Name Type Description
DataFrame DataFrame

Dataframe with num partitions.
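
Examples:

A minimal sketch; the resulting partition layout depends on the runner (on the NativeRunner this is a no-op, as warned in the source below), so no output is shown:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3, 4, 5, 6]})
>>> df = df.into_partitions(3)  # split or coalesce to 3 partitions, order preserved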

Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def into_partitions(self, num: int) -> "DataFrame":
    """Splits or coalesces DataFrame to ``num`` partitions. Order is preserved.

    This will naively split partitions in a round-robin fashion to hit the targeted number of partitions.
    The number of rows/size in a given partition is not taken into account during the splitting.

    Args:
        num (int): number of target partitions.

    Returns:
        DataFrame: Dataframe with `num` partitions.
    """
    if get_or_create_runner().name == "native":
        warnings.warn(
            "DataFrame.into_partitions not supported on the NativeRunner. This will be a no-op. Please use the RayRunner via `daft.set_runner_ray()` instead if you need to repartition."
        )

    builder = self._builder.into_partitions(num)
    return DataFrame(builder)

iter_partitions #

iter_partitions(results_buffer_size: int | None | Literal['num_cpus'] = 'num_cpus') -> Iterator[Union[MicroPartition, ObjectRef]]

Begin executing this dataframe and return an iterator over the partitions.

Each partition will be returned as a daft.recordbatch object (if using Python runner backend) or a ray ObjectRef (if using Ray runner backend).

Parameters:

Name Type Description Default
results_buffer_size int | None | Literal['num_cpus']

how many partitions to allow in the results buffer (defaults to the total number of CPUs available on the machine).

'num_cpus'
A quick note on configuring asynchronous/parallel execution using results_buffer_size.

The results_buffer_size kwarg controls how many results Daft will allow to be in the buffer while iterating. Once this buffer is filled, Daft will not run any more work until some partition is consumed from the buffer.

  • Increasing this value means the iterator will consume more memory and CPU resources but have higher throughput
  • Decreasing this value means the iterator will consume lower memory and CPU resources, but have lower throughput
  • Setting this value to None means the iterator will consume as much resources as it deems appropriate per-iteration

The default value is the total number of CPUs available on the current machine.

Returns:

Type Description
Iterator[Union[MicroPartition, ObjectRef]]

An iterator over the partitions of the DataFrame. Each partition is a MicroPartition object (if using Python runner backend) or a ray ObjectRef (if using Ray runner backend).

Examples:

>>> import daft
>>>
>>> daft.set_runner_ray()
>>>
>>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]}).into_partitions(2)
>>> for part in df.iter_partitions():
...     print(part)
MicroPartition with 3 rows:
TableState: Loaded. 1 tables
╭───────┬────────╮
│ foo   ┆ bar    │
│ ---   ┆ ---    │
│ Int64 ┆ String │
╞═══════╪════════╡
│ 1     ┆ a      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2     ┆ b      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3     ┆ c      │
╰───────┴────────╯
Statistics: missing
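
To bound how far execution runs ahead of consumption, results_buffer_size can be set explicitly. A minimal sketch; the value 2 is arbitrary:

>>> for part in df.iter_partitions(results_buffer_size=2):  # at most 2 unconsumed partitions buffered
...     pass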
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def iter_partitions(
    self, results_buffer_size: int | None | Literal["num_cpus"] = "num_cpus"
) -> Iterator[Union[MicroPartition, "ray.ObjectRef"]]:
    """Begin executing this dataframe and return an iterator over the partitions.

    Each partition will be returned as a daft.recordbatch object (if using Python runner backend)
    or a ray ObjectRef (if using Ray runner backend).

    Args:
        results_buffer_size: how many partitions to allow in the results buffer (defaults to the total number of CPUs
            available on the machine).

    Note: A quick note on configuring asynchronous/parallel execution using `results_buffer_size`.
        The `results_buffer_size` kwarg controls how many results Daft will allow to be in the buffer while iterating.
        Once this buffer is filled, Daft will not run any more work until some partition is consumed from the buffer.

        * Increasing this value means the iterator will consume more memory and CPU resources but have higher throughput
        * Decreasing this value means the iterator will consume lower memory and CPU resources, but have lower throughput
        * Setting this value to `None` means the iterator will consume as much resources as it deems appropriate per-iteration

        The default value is the total number of CPUs available on the current machine.

    Returns:
        Iterator[Union[MicroPartition, ray.ObjectRef]]: An iterator over the partitions of the DataFrame.
        Each partition is a MicroPartition object (if using Python runner backend) or a ray ObjectRef
        (if using Ray runner backend).

    Examples:
        >>> import daft
        >>>
        >>> daft.set_runner_ray()  # doctest: +SKIP
        >>>
        >>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]}).into_partitions(2)
        >>> for part in df.iter_partitions():
        ...     print(part)  # doctest: +SKIP
        MicroPartition with 3 rows:
        TableState: Loaded. 1 tables
        ╭───────┬────────╮
        │ foo   ┆ bar    │
        │ ---   ┆ ---    │
        │ Int64 ┆ String │
        ╞═══════╪════════╡
        │ 1     ┆ a      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 2     ┆ b      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 3     ┆ c      │
        ╰───────┴────────╯
        <BLANKLINE>
        <BLANKLINE>
        Statistics: missing
    """
    if results_buffer_size == "num_cpus":
        results_buffer_size = multiprocessing.cpu_count()
    elif results_buffer_size is not None and not results_buffer_size > 0:
        raise ValueError(f"Provided `results_buffer_size` value must be > 0, received: {results_buffer_size}")

    results = self._result
    if results is not None:
        # If the dataframe has already finished executing,
        # use the precomputed results.
        for mat_result in results.values():
            yield mat_result.partition()

    else:
        # Execute the dataframe in a streaming fashion.
        results_iter: Iterator[MaterializedResult[Any]] = get_or_create_runner().run_iter(
            self._builder, results_buffer_size=results_buffer_size
        )
        for result in results_iter:
            yield result.partition()

iter_rows #

iter_rows(results_buffer_size: int | None | Literal['num_cpus'] = 'num_cpus', column_format: Literal['python', 'arrow'] = 'python') -> Iterator[dict[str, Any]]

Return an iterator of rows for this dataframe.

Each row will be a Python dictionary of the form { "key" : value, ...}. If you are instead looking to iterate over entire partitions of data, see df.iter_partitions().

By default, Daft will convert the columns to Python lists for easy consumption. Datatypes with Python equivalents will be converted accordingly, e.g. timestamps to datetime, tensors to numpy arrays. For nested data such as List or Struct arrays, however, this can be expensive. You may wish to set column_format to "arrow" such that the nested data is returned as Arrow scalars.

Parameters:

Name Type Description Default
results_buffer_size int | None | Literal['num_cpus']

how many partitions to allow in the results buffer (defaults to the total number of CPUs available on the machine).

'num_cpus'
column_format Literal['python', 'arrow']

the format of the columns to iterate over. One of "python" or "arrow". Defaults to "python".

'python'
A quick note on configuring asynchronous/parallel execution using results_buffer_size.

The results_buffer_size kwarg controls how many results Daft will allow to be in the buffer while iterating. Once this buffer is filled, Daft will not run any more work until some partition is consumed from the buffer.

  • Increasing this value means the iterator will consume more memory and CPU resources but have higher throughput
  • Decreasing this value means the iterator will consume lower memory and CPU resources, but have lower throughput
  • Setting this value to None means the iterator will consume as much resources as it deems appropriate per-iteration

The default value is the total number of CPUs available on the current machine.

Returns:

Type Description
Iterator[dict[str, Any]]

An iterator over the rows of the DataFrame, where each row is a dictionary mapping column names to values.

Examples:

>>> import daft
>>>
>>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})
>>> for row in df.iter_rows():
...     print(row)
{'foo': 1, 'bar': 'a'}
{'foo': 2, 'bar': 'b'}
{'foo': 3, 'bar': 'c'}
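
For nested or large columns, requesting Arrow scalars avoids the Python conversion described above. A minimal sketch (the printed values would be Arrow scalar representations, so output is omitted):

>>> for row in df.iter_rows(column_format="arrow"):
...     pass  # values are Arrow scalars rather than Python objects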
Tip

See also df.iter_partitions(): iterator over entire partitions instead of single rows

Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def iter_rows(
    self,
    results_buffer_size: int | None | Literal["num_cpus"] = "num_cpus",
    column_format: Literal["python", "arrow"] = "python",
) -> Iterator[dict[str, Any]]:
    """Return an iterator of rows for this dataframe.

    Each row will be a Python dictionary of the form `{ "key" : value, ...}`. If you are instead looking to iterate over
    entire partitions of data, see [`df.iter_partitions()`][daft.DataFrame.iter_partitions].

    By default, Daft will convert the columns to Python lists for easy consumption. Datatypes with Python equivalents will be converted accordingly, e.g. timestamps to datetime, tensors to numpy arrays.
    For nested data such as List or Struct arrays, however, this can be expensive. You may wish to set `column_format` to "arrow" such that the nested data is returned as Arrow scalars.

    Args:
        results_buffer_size: how many partitions to allow in the results buffer (defaults to the total number of CPUs
            available on the machine).
        column_format: the format of the columns to iterate over. One of "python" or "arrow". Defaults to "python".

    Note: A quick note on configuring asynchronous/parallel execution using `results_buffer_size`.
        The `results_buffer_size` kwarg controls how many results Daft will allow to be in the buffer while iterating.
        Once this buffer is filled, Daft will not run any more work until some partition is consumed from the buffer.

        * Increasing this value means the iterator will consume more memory and CPU resources but have higher throughput
        * Decreasing this value means the iterator will consume lower memory and CPU resources, but have lower throughput
        * Setting this value to `None` means the iterator will consume as much resources as it deems appropriate per-iteration

        The default value is the total number of CPUs available on the current machine.

    Returns:
        Iterator[dict[str, Any]]: An iterator over the rows of the DataFrame, where each row is a dictionary
        mapping column names to values.

    Examples:
        >>> import daft
        >>>
        >>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})
        >>> for row in df.iter_rows():
        ...     print(row)
        {'foo': 1, 'bar': 'a'}
        {'foo': 2, 'bar': 'b'}
        {'foo': 3, 'bar': 'c'}

    Tip:
        See also [`df.iter_partitions()`][daft.DataFrame.iter_partitions]: iterator over entire partitions instead of single rows
    """
    if results_buffer_size == "num_cpus":
        results_buffer_size = multiprocessing.cpu_count()

    def arrow_iter_rows(table: "pyarrow.Table") -> Iterator[dict[str, Any]]:
        columns = table.columns
        for i in range(len(table)):
            row = {col._name: col[i] for col in columns}
            yield row

    def python_iter_rows(pydict: dict[str, list[Any]], num_rows: int) -> Iterator[dict[str, Any]]:
        for i in range(num_rows):
            row = {key: value[i] for (key, value) in pydict.items()}
            yield row

    if self._result is not None:
        # If the dataframe has already finished executing,
        # use the precomputed results.
        if column_format == "python":
            yield from python_iter_rows(self.to_pydict(), len(self))
        elif column_format == "arrow":
            yield from arrow_iter_rows(self.to_arrow())
        else:
            raise ValueError(
                f"Unsupported column_format: {column_format}, supported formats are 'python' and 'arrow'"
            )
    else:
        # Execute the dataframe in a streaming fashion.
        partitions_iter = get_or_create_runner().run_iter_tables(
            self._builder, results_buffer_size=results_buffer_size
        )

        # Iterate through partitions.
        for partition in partitions_iter:
            if column_format == "python":
                yield from python_iter_rows(partition.to_pydict(), len(partition))
            elif column_format == "arrow":
                yield from arrow_iter_rows(partition.to_arrow())
            else:
                raise ValueError(
                    f"Unsupported column_format: {column_format}, supported formats are 'python' and 'arrow'"
                )

join #

join(other: DataFrame, on: list[ColumnInputType] | ColumnInputType | None = None, left_on: list[ColumnInputType] | ColumnInputType | None = None, right_on: list[ColumnInputType] | ColumnInputType | None = None, how: Literal['inner', 'left', 'right', 'outer', 'anti', 'semi', 'cross'] = 'inner', strategy: Literal['hash', 'sort_merge', 'broadcast'] | None = None, prefix: str | None = None, suffix: str | None = None) -> DataFrame

Column-wise join of the current DataFrame with an other DataFrame, similar to a SQL JOIN.

If the two DataFrames have duplicate non-join key column names, "right." will be prepended to the conflicting right columns. You can change the behavior by passing either (or both) prefix or suffix to the function. If prefix is passed, it will be prepended to the conflicting right columns. If suffix is passed, it will be appended to the conflicting right columns.

Parameters:

Name Type Description Default
other DataFrame

the right DataFrame to join on.

required
on Optional[Union[List[ColumnInputType], ColumnInputType]]

key or keys to join on; use when the key names match on the left and right sides. Defaults to None.

None
left_on Optional[Union[List[ColumnInputType], ColumnInputType]]

key or keys to join on left DataFrame. Defaults to None.

None
right_on Optional[Union[List[ColumnInputType], ColumnInputType]]

key or keys to join on right DataFrame. Defaults to None.

None
how str

what type of join to perform; currently "inner", "left", "right", "outer", "anti", "semi", and "cross" are supported. Defaults to "inner".

'inner'
strategy Optional[str]

The join strategy (algorithm) to use; currently "hash", "sort_merge", "broadcast", and None are supported, where None chooses the join strategy automatically during query optimization. The default is None.

None
suffix Optional[str]

Suffix to add to the column names in case of a name collision. Defaults to "".

None
prefix Optional[str]

Prefix to add to the column names in case of a name collision. Defaults to "right.".

None

Returns:

Name Type Description
DataFrame DataFrame

Joined DataFrame.

Raises:

Type Description
ValueError

if on is passed in and left_on or right_on is not None.

ValueError

if on is None but both left_on and right_on are not defined.

Note

Although self joins are supported, we currently duplicate the logical plan for the right side and recompute the entire tree. Caching for this is on the roadmap.

Examples:

>>> import daft
>>> from daft import col
>>> df1 = daft.from_pydict({"a": ["w", "x", "y"], "b": [1, 2, 3]})
>>> df2 = daft.from_pydict({"a": ["x", "y", "z"], "b": [20, 30, 40]})
>>> joined_df = df1.join(df2, left_on=df1["a"], right_on=df2["a"])
>>> joined_df.show()
╭────────┬───────┬─────────╮
│ a      ┆ b     ┆ right.b │
│ ---    ┆ ---   ┆ ---     │
│ String ┆ Int64 ┆ Int64   │
╞════════╪═══════╪═════════╡
│ x      ┆ 2     ┆ 20      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ y      ┆ 3     ┆ 30      │
╰────────┴───────┴─────────╯
(Showing first 2 of 2 rows)
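
To control how conflicting right-side column names are renamed, prefix and/or suffix can be passed. A minimal sketch based on the renaming rules described above (output omitted; the column names in the comments are the expected results rather than verified output):

>>> joined = df1.join(df2, on="a", suffix="_right")  # the conflicting right column "b" becomes "b_right"
>>> joined = df1.join(df2, on="a", prefix="r_")  # the conflicting right column "b" becomes "r_b"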
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def join(
    self,
    other: "DataFrame",
    on: list[ColumnInputType] | ColumnInputType | None = None,
    left_on: list[ColumnInputType] | ColumnInputType | None = None,
    right_on: list[ColumnInputType] | ColumnInputType | None = None,
    how: Literal["inner", "left", "right", "outer", "anti", "semi", "cross"] = "inner",
    strategy: Literal["hash", "sort_merge", "broadcast"] | None = None,
    prefix: str | None = None,
    suffix: str | None = None,
) -> "DataFrame":
    """Column-wise join of the current DataFrame with an ``other`` DataFrame, similar to a SQL ``JOIN``.

    If the two DataFrames have duplicate non-join key column names, "right." will be prepended to the conflicting right columns. You can change the behavior by passing either (or both) `prefix` or `suffix` to the function.
    If `prefix` is passed, it will be prepended to the conflicting right columns. If `suffix` is passed, it will be appended to the conflicting right columns.

    Args:
        other (DataFrame): the right DataFrame to join on.
        on (Optional[Union[List[ColumnInputType], ColumnInputType]]): key or keys to join on; use when the key names match on the left and right sides. Defaults to None.
        left_on (Optional[Union[List[ColumnInputType], ColumnInputType]], optional): key or keys to join on left DataFrame. Defaults to None.
        right_on (Optional[Union[List[ColumnInputType], ColumnInputType]], optional): key or keys to join on right DataFrame. Defaults to None.
        how (str, optional): what type of join to perform; currently "inner", "left", "right", "outer", "anti", "semi", and "cross" are supported. Defaults to "inner".
        strategy (Optional[str]): The join strategy (algorithm) to use; currently "hash", "sort_merge", "broadcast", and None are supported, where None
            chooses the join strategy automatically during query optimization. The default is None.
        suffix (Optional[str], optional): Suffix to add to the column names in case of a name collision. Defaults to "".
        prefix (Optional[str], optional): Prefix to add to the column names in case of a name collision. Defaults to "right.".

    Returns:
        DataFrame: Joined DataFrame.

    Raises:
        ValueError: if `on` is passed in and `left_on` or `right_on` is not None.
        ValueError: if `on` is None but both `left_on` and `right_on` are not defined.

    Note:
        Although self joins are supported, we currently duplicate the logical plan for the right side
        and recompute the entire tree. Caching for this is on the roadmap.

    Examples:
        >>> import daft
        >>> from daft import col
        >>> df1 = daft.from_pydict({"a": ["w", "x", "y"], "b": [1, 2, 3]})
        >>> df2 = daft.from_pydict({"a": ["x", "y", "z"], "b": [20, 30, 40]})
        >>> joined_df = df1.join(df2, left_on=df1["a"], right_on=df2["a"])
        >>> joined_df.show()
        ╭────────┬───────┬─────────╮
        │ a      ┆ b     ┆ right.b │
        │ ---    ┆ ---   ┆ ---     │
        │ String ┆ Int64 ┆ Int64   │
        ╞════════╪═══════╪═════════╡
        │ x      ┆ 2     ┆ 20      │
        ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
        │ y      ┆ 3     ┆ 30      │
        ╰────────┴───────┴─────────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)
    """
    if how == "cross":
        if any(side_on is not None for side_on in [on, left_on, right_on]):
            raise ValueError("In a cross join, `on`, `left_on`, and `right_on` cannot be set")
        if strategy is not None:
            raise ValueError("In a cross join, `strategy` cannot be set")
        left_on = []
        right_on = []
    elif on is None:
        if left_on is None or right_on is None:
            raise ValueError("If `on` is None then both `left_on` and `right_on` must not be None")
    else:
        if left_on is not None or right_on is not None:
            raise ValueError("If `on` is not None then both `left_on` and `right_on` must be None")
        left_on = on
        right_on = on

    join_type = JoinType.from_join_type_str(how)
    join_strategy = JoinStrategy.from_join_strategy_str(strategy) if strategy is not None else None

    if join_strategy == JoinStrategy.SortMerge and join_type != JoinType.Inner:
        raise ValueError("Sort merge join only supports inner joins")
    elif join_strategy == JoinStrategy.Broadcast and join_type == JoinType.Outer:
        raise ValueError("Broadcast join does not support outer joins")

    left_exprs = column_inputs_to_expressions(tuple(left_on) if isinstance(left_on, list) else (left_on,))
    right_exprs = column_inputs_to_expressions(tuple(right_on) if isinstance(right_on, list) else (right_on,))
    builder = self._builder.join(
        other._builder,
        left_on=left_exprs,
        right_on=right_exprs,
        how=join_type,
        strategy=join_strategy,
        prefix=prefix,
        suffix=suffix,
    )
    return DataFrame(builder)

limit #

limit(num: int) -> DataFrame

Limits the rows in the DataFrame to the first N rows, similar to a SQL LIMIT.

Parameters:

Name Type Description Default
num int

maximum rows to allow.

required

Returns:

Name Type Description
DataFrame DataFrame

Limited DataFrame

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3, 4, 5, 6, 7]})
>>> df_limited = df.limit(5)  # returns 5 rows
>>> df_limited.show()
╭───────╮
│ x     │
│ ---   │
│ Int64 │
╞═══════╡
│ 1     │
├╌╌╌╌╌╌╌┤
│ 2     │
├╌╌╌╌╌╌╌┤
│ 3     │
├╌╌╌╌╌╌╌┤
│ 4     │
├╌╌╌╌╌╌╌┤
│ 5     │
╰───────╯
(Showing first 5 of 5 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def limit(self, num: int) -> "DataFrame":
    """Limits the rows in the DataFrame to the first ``N`` rows, similar to a SQL ``LIMIT``.

    Args:
        num (int): maximum rows to allow.

    Returns:
        DataFrame: Limited DataFrame

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3, 4, 5, 6, 7]})
        >>> df_limited = df.limit(5)  # returns 5 rows
        >>> df_limited.show()
        ╭───────╮
        │ x     │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 1     │
        ├╌╌╌╌╌╌╌┤
        │ 2     │
        ├╌╌╌╌╌╌╌┤
        │ 3     │
        ├╌╌╌╌╌╌╌┤
        │ 4     │
        ├╌╌╌╌╌╌╌┤
        │ 5     │
        ╰───────╯
        <BLANKLINE>
        (Showing first 5 of 5 rows)

    """
    builder = self._builder.limit(num, eager=False)
    return DataFrame(builder)

max #

max(*cols: ColumnInputType) -> DataFrame

Performs a global max on the DataFrame.

Parameters:

Name Type Description Default
*cols Union[str, Expression]

columns to max

()

Returns:

Name Type Description
DataFrame DataFrame

Globally aggregated max. Should be a single row.

Examples:

>>> import daft
>>> df = daft.from_pydict({"col_a": [1, 2, 3]})
>>> df = df.max("col_a")
>>> df.show()
╭───────╮
│ col_a │
│ ---   │
│ Int64 │
╞═══════╡
│ 3     │
╰───────╯
(Showing first 1 of 1 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def max(self, *cols: ColumnInputType) -> "DataFrame":
    """Performs a global max on the DataFrame.

    Args:
        *cols (Union[str, Expression]): columns to max
    Returns:
        DataFrame: Globally aggregated max. Should be a single row.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"col_a": [1, 2, 3]})
        >>> df = df.max("col_a")
        >>> df.show()
        ╭───────╮
        │ col_a │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 3     │
        ╰───────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    return self._apply_agg_fn(Expression.max, cols)

mean #

mean(*cols: ColumnInputType) -> DataFrame

Performs a global mean on the DataFrame.

Parameters:

Name Type Description Default
*cols Union[str, Expression]

columns to mean

()

Returns:

Name Type Description
DataFrame DataFrame

Globally aggregated mean. Should be a single row.

Examples:

>>> import daft
>>> df = daft.from_pydict({"col_a": [1, 2, 3]})
>>> df = df.mean("col_a")
>>> df.show()
╭─────────╮
│ col_a   │
│ ---     │
│ Float64 │
╞═════════╡
│ 2       │
╰─────────╯
(Showing first 1 of 1 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def mean(self, *cols: ColumnInputType) -> "DataFrame":
    """Performs a global mean on the DataFrame.

    Args:
        *cols (Union[str, Expression]): columns to mean
    Returns:
        DataFrame: Globally aggregated mean. Should be a single row.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"col_a": [1, 2, 3]})
        >>> df = df.mean("col_a")
        >>> df.show()
        ╭─────────╮
        │ col_a   │
        │ ---     │
        │ Float64 │
        ╞═════════╡
        │ 2       │
        ╰─────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    return self._apply_agg_fn(Expression.mean, cols)

melt #

melt(ids: ManyColumnsInputType, values: ManyColumnsInputType = [], variable_name: str = 'variable', value_name: str = 'value') -> DataFrame

Alias for unpivot.

Parameters:

Name Type Description Default
ids ManyColumnsInputType

Columns to keep as identifiers

required
values Optional[ManyColumnsInputType]

Columns to unpivot. If not specified, all columns except ids will be unpivoted.

[]
variable_name Optional[str]

Name of the variable column. Defaults to "variable".

'variable'
value_name Optional[str]

Name of the value column. Defaults to "value".

'value'

Returns:

Name Type Description
DataFrame DataFrame

Unpivoted DataFrame

Examples:

>>> import daft
>>> df = daft.from_pydict(
...     {
...         "year": [2020, 2021, 2022],
...         "Jan": [10, 30, 50],
...         "Feb": [20, 40, 60],
...     }
... )
>>> df = df.melt("year", ["Jan", "Feb"], variable_name="month", value_name="inventory")
>>> df = df.sort("year")
>>> df.show()
╭───────┬────────┬───────────╮
│ year  ┆ month  ┆ inventory │
│ ---   ┆ ---    ┆ ---       │
│ Int64 ┆ String ┆ Int64     │
╞═══════╪════════╪═══════════╡
│ 2020  ┆ Jan    ┆ 10        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2020  ┆ Feb    ┆ 20        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021  ┆ Jan    ┆ 30        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021  ┆ Feb    ┆ 40        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022  ┆ Jan    ┆ 50        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022  ┆ Feb    ┆ 60        │
╰───────┴────────┴───────────╯
(Showing first 6 of 6 rows)
Tip

See also unpivot

Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def melt(
    self,
    ids: ManyColumnsInputType,
    values: ManyColumnsInputType = [],
    variable_name: str = "variable",
    value_name: str = "value",
) -> "DataFrame":
    """Alias for unpivot.

    Args:
        ids (ManyColumnsInputType): Columns to keep as identifiers
        values (Optional[ManyColumnsInputType]): Columns to unpivot. If not specified, all columns except ids will be unpivoted.
        variable_name (Optional[str]): Name of the variable column. Defaults to "variable".
        value_name (Optional[str]): Name of the value column. Defaults to "value".

    Returns:
        DataFrame: Unpivoted DataFrame

    Examples:
        >>> import daft
        >>> df = daft.from_pydict(
        ...     {
        ...         "year": [2020, 2021, 2022],
        ...         "Jan": [10, 30, 50],
        ...         "Feb": [20, 40, 60],
        ...     }
        ... )
        >>> df = df.melt("year", ["Jan", "Feb"], variable_name="month", value_name="inventory")
        >>> df = df.sort("year")
        >>> df.show()
        ╭───────┬────────┬───────────╮
        │ year  ┆ month  ┆ inventory │
        │ ---   ┆ ---    ┆ ---       │
        │ Int64 ┆ String ┆ Int64     │
        ╞═══════╪════════╪═══════════╡
        │ 2020  ┆ Jan    ┆ 10        │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
        │ 2020  ┆ Feb    ┆ 20        │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
        │ 2021  ┆ Jan    ┆ 30        │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
        │ 2021  ┆ Feb    ┆ 40        │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
        │ 2022  ┆ Jan    ┆ 50        │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
        │ 2022  ┆ Feb    ┆ 60        │
        ╰───────┴────────┴───────────╯
        <BLANKLINE>
        (Showing first 6 of 6 rows)

    Tip:
        See also [unpivot][daft.DataFrame.unpivot]
    """
    return self.unpivot(ids, values, variable_name, value_name)

min #

min(*cols: ColumnInputType) -> DataFrame

Performs a global min on the DataFrame.

Parameters:

Name Type Description Default
*cols Union[str, Expression]

columns to min

()

Returns: DataFrame: Globally aggregated min. Should be a single row.

Examples:

>>> import daft
>>> df = daft.from_pydict({"col_a": [1, 2, 3]})
>>> df = df.min("col_a")
>>> df.show()
╭───────╮
│ col_a │
│ ---   │
│ Int64 │
╞═══════╡
│ 1     │
╰───────╯
(Showing first 1 of 1 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def min(self, *cols: ColumnInputType) -> "DataFrame":
    """Performs a global min on the DataFrame.

    Args:
        *cols (Union[str, Expression]): columns to min
    Returns:
        DataFrame: Globally aggregated min. Should be a single row.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"col_a": [1, 2, 3]})
        >>> df = df.min("col_a")
        >>> df.show()
        ╭───────╮
        │ col_a │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 1     │
        ╰───────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    return self._apply_agg_fn(Expression.min, cols)

num_partitions #

num_partitions() -> int | None

Returns the number of partitions that will be used to execute this DataFrame.

The query optimizer may change the partitioning strategy. This method runs the optimizer and then inspects the resulting physical plan scheduler to determine how many partitions the execution will use.

Returns:

Name Type Description
int int | None

The number of partitions in the optimized physical execution plan.

Examples:

>>> import daft
>>>
>>> daft.set_runner_ray()
>>>
>>> # Create a DataFrame with 1000 rows
>>> df = daft.from_pydict({"x": list(range(1000))})
>>>
>>> # Partition count may depend on default config or optimizer decisions
>>> df.num_partitions()
1
>>>
>>> # You can repartition manually (if supported), and then inspect again:
>>> df2 = df.repartition(10)
>>> df2.num_partitions()
10
Source code in daft/dataframe/dataframe.py
def num_partitions(self) -> int | None:
    """Returns the number of partitions that will be used to execute this DataFrame.

    The query optimizer may change the partitioning strategy. This method runs the optimizer
    and then inspects the resulting physical plan scheduler to determine how many partitions
    the execution will use.

    Returns:
        int: The number of partitions in the optimized physical execution plan.

    Examples:
        >>> import daft
        >>>
        >>> daft.set_runner_ray()  # doctest: +SKIP
        >>>
        >>> # Create a DataFrame with 1000 rows
        >>> df = daft.from_pydict({"x": list(range(1000))})
        >>>
        >>> # Partition count may depend on default config or optimizer decisions
        >>> df.num_partitions()  # doctest: +SKIP
        1
        >>>
        >>> # You can repartition manually (if supported), and then inspect again:
        >>> df2 = df.repartition(10)  # doctest: +SKIP
        >>> df2.num_partitions()  # doctest: +SKIP
        10
    """
    runner_name = get_or_create_runner().name
    # Native runner does not support num_partitions
    if runner_name == "native":
        return None
    else:
        execution_config = get_context().daft_execution_config
        optimized = self._builder.optimize(execution_config)
        distributed_plan = DistributedPhysicalPlan.from_logical_plan_builder(
            optimized._builder, "<tmp>", execution_config
        )
        return distributed_plan.num_partitions()

offset #

offset(num: int) -> DataFrame

Returns a new DataFrame by skipping the first N rows, similar to a SQL Offset.

Parameters:

Name Type Description Default
num int

the number of rows to skip

required

Returns:

Name Type Description
DataFrame DataFrame

A new DataFrame by skipping the first N rows

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3, 4, 5, 6, 7]})
>>> df = df.offset(1).limit(5)  # skip the first row and return 5 rows
>>> df.show()
╭───────╮
│ x     │
│ ---   │
│ Int64 │
╞═══════╡
│ 2     │
├╌╌╌╌╌╌╌┤
│ 3     │
├╌╌╌╌╌╌╌┤
│ 4     │
├╌╌╌╌╌╌╌┤
│ 5     │
├╌╌╌╌╌╌╌┤
│ 6     │
╰───────╯
(Showing first 5 of 5 rows)
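Paired with limit, offset can also be used for simple pagination. Below is a minimal, illustrative sketch (the page_size and second_page names are only for illustration), assuming a fresh 7-row DataFrame like the one above:

>>> df = daft.from_pydict({"x": [1, 2, 3, 4, 5, 6, 7]})
>>> page_size = 3
>>> second_page = df.offset(1 * page_size).limit(page_size)  # would contain rows 4, 5 and 6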
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def offset(self, num: int) -> "DataFrame":
    """Returns a new DataFrame by skipping the first ``N`` rows, similar to a SQL ``Offset``.

    Args:
        num (int): the number of rows to skip

    Returns:
        DataFrame: A new DataFrame by skipping the first ``N`` rows

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3, 4, 5, 6, 7]})
        >>> df = df.offset(1).limit(5)  # skip the first row and return 5 rows
        >>> df.show()
        ╭───────╮
        │ x     │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 2     │
        ├╌╌╌╌╌╌╌┤
        │ 3     │
        ├╌╌╌╌╌╌╌┤
        │ 4     │
        ├╌╌╌╌╌╌╌┤
        │ 5     │
        ├╌╌╌╌╌╌╌┤
        │ 6     │
        ╰───────╯
        <BLANKLINE>
        (Showing first 5 of 5 rows)

    """
    builder = self._builder.offset(num)
    return DataFrame(builder)

pipe #

pipe(function: Callable[Concatenate[DataFrame, P], T], *args: args, **kwargs: kwargs) -> T

Apply the function to this DataFrame.

Parameters:

Name Type Description Default
function Callable[Concatenate[DataFrame, P], T]

Function to apply.

required
*args args

Positional arguments to pass to the function.

()
**kwargs kwargs

Keyword arguments to pass to the function.

{}

Returns:

Type Description
T

Result of applying the function on this DataFrame.

Examples:

>>> import daft
>>>
>>> df = daft.from_pydict({"x": [1, 2, 3]})
>>>
>>> def square(df, column: str):
...     return df.select((df[column] * df[column]).alias(column))
>>>
>>> df.pipe(square, "x").show()
╭───────╮
│ x     │
│ ---   │
│ Int64 │
╞═══════╡
│ 1     │
├╌╌╌╌╌╌╌┤
│ 4     │
├╌╌╌╌╌╌╌┤
│ 9     │
╰───────╯
(Showing first 3 of 3 rows)
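Extra positional and keyword arguments are forwarded to the function, so pipe can chain parameterized transformations. A minimal sketch reusing the square helper from the example above (add_offset is an illustrative helper, not part of the API, and assumes DataFrame.with_column is available):

>>> def add_offset(df, column: str, *, offset: int = 0):
...     # Illustrative helper: shift a column by a constant before further transforms
...     return df.with_column(column, df[column] + offset)
>>>
>>> shifted_then_squared = df.pipe(add_offset, "x", offset=10).pipe(square, "x")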
Source code in daft/dataframe/dataframe.py
def pipe(
    self,
    function: Callable[Concatenate["DataFrame", P], T],
    *args: P.args,
    **kwargs: P.kwargs,
) -> T:
    """Apply the function to this DataFrame.

    Args:
        function (Callable[Concatenate["DataFrame", P], T]): Function to apply.
        *args (P.args): Positional arguments to pass to the function.
        **kwargs (P.kwargs): Keyword arguments to pass to the function.

    Returns:
        Result of applying the function on this DataFrame.

    Examples:
        >>> import daft
        >>>
        >>> df = daft.from_pydict({"x": [1, 2, 3]})
        >>>
        >>> def square(df, column: str):
        ...     return df.select((df[column] * df[column]).alias(column))
        >>>
        >>> df.pipe(square, "x").show()
        ╭───────╮
        │ x     │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 1     │
        ├╌╌╌╌╌╌╌┤
        │ 4     │
        ├╌╌╌╌╌╌╌┤
        │ 9     │
        ╰───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    return function(self, *args, **kwargs)

pivot #

pivot(group_by: ManyColumnsInputType, pivot_col: ColumnInputType, value_col: ColumnInputType, agg_fn: str, names: list[str] | None = None) -> DataFrame

Pivots a column of the DataFrame and performs an aggregation on the values.

Parameters:

Name Type Description Default
group_by ManyColumnsInputType

columns to group by

required
pivot_col Union[str, Expression]

column to pivot

required
value_col Union[str, Expression]

column to aggregate

required
agg_fn str

aggregation function to apply

required
names Optional[List[str]]

names of the pivoted columns

None

Returns:

Name Type Description
DataFrame DataFrame

DataFrame with pivoted columns

Note

You may wish to provide a list of distinct values to pivot on, which is more efficient as it avoids a distinct operation. Without this list, Daft will perform a distinct operation on the pivot column to determine the unique values to pivot on.

Examples:

>>> import daft
>>> data = {
...     "id": [1, 2, 3, 4],
...     "version": ["3.8", "3.8", "3.9", "3.9"],
...     "platform": ["macos", "macos", "macos", "windows"],
...     "downloads": [100, 200, 150, 250],
... }
>>> df = daft.from_pydict(data)
>>> df = df.pivot("version", "platform", "downloads", "sum")
>>>
>>> df = df.sort("version").select("version", "windows", "macos")
>>> df.show()
╭─────────┬─────────┬───────╮
│ version ┆ windows ┆ macos │
│ ---     ┆ ---     ┆ ---   │
│ String  ┆ Int64   ┆ Int64 │
╞═════════╪═════════╪═══════╡
│ 3.8     ┆ None    ┆ 300   │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3.9     ┆ 250     ┆ 150   │
╰─────────┴─────────┴───────╯
(Showing first 2 of 2 rows)
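As the note above mentions, supplying an explicit names list skips the extra distinct pass over the pivot column. A minimal sketch, assuming the same data as above and that the listed names cover all values present in the pivot column:

>>> df = daft.from_pydict(data)
>>> df = df.pivot("version", "platform", "downloads", "sum", names=["macos", "windows"])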
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def pivot(
    self,
    group_by: ManyColumnsInputType,
    pivot_col: ColumnInputType,
    value_col: ColumnInputType,
    agg_fn: str,
    names: list[str] | None = None,
) -> "DataFrame":
    """Pivots a column of the DataFrame and performs an aggregation on the values.

    Args:
        group_by (ManyColumnsInputType): columns to group by
        pivot_col (Union[str, Expression]): column to pivot
        value_col (Union[str, Expression]): column to aggregate
        agg_fn (str): aggregation function to apply
        names (Optional[List[str]]): names of the pivoted columns

    Returns:
        DataFrame: DataFrame with pivoted columns

    Note:
        You may wish to provide a list of distinct values to pivot on, which is more efficient as it avoids
        a distinct operation. Without this list, Daft will perform a distinct operation on the pivot column to
        determine the unique values to pivot on.

    Examples:
        >>> import daft
        >>> data = {
        ...     "id": [1, 2, 3, 4],
        ...     "version": ["3.8", "3.8", "3.9", "3.9"],
        ...     "platform": ["macos", "macos", "macos", "windows"],
        ...     "downloads": [100, 200, 150, 250],
        ... }
        >>> df = daft.from_pydict(data)
        >>> df = df.pivot("version", "platform", "downloads", "sum")
        >>>
        >>> df = df.sort("version").select("version", "windows", "macos")
        >>> df.show()
        ╭─────────┬─────────┬───────╮
        │ version ┆ windows ┆ macos │
        │ ---     ┆ ---     ┆ ---   │
        │ String  ┆ Int64   ┆ Int64 │
        ╞═════════╪═════════╪═══════╡
        │ 3.8     ┆ None    ┆ 300   │
        ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3.9     ┆ 250     ┆ 150   │
        ╰─────────┴─────────┴───────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)


    """
    group_by_expr = column_inputs_to_expressions(group_by)
    [pivot_col_expr, value_col_expr] = column_inputs_to_expressions([pivot_col, value_col])
    agg_expr = self._map_agg_string_to_expr(value_col_expr, agg_fn)

    if names is None:
        names = (
            self.select(typing.cast("ColumnInputType", pivot_col_expr))
            .distinct()
            .to_pydict()[pivot_col_expr.name()]
        )
        names = [str(x) for x in names]
    builder = self._builder.pivot(group_by_expr, pivot_col_expr, value_col_expr, agg_expr, names)
    return DataFrame(builder)

repartition #

repartition(num: int | None, *partition_by: ColumnInputType) -> DataFrame

Repartitions DataFrame to num partitions.

If columns are passed in, then DataFrame will be repartitioned by those, otherwise random repartitioning will occur.

Parameters:

Name Type Description Default
num Optional[int]

Number of target partitions; if None, the number of partitions will not be changed.

required
*partition_by Union[str, Expression]

Optional columns to partition by.

()

Returns:

Name Type Description
DataFrame DataFrame

Repartitioned DataFrame.

This function will globally shuffle your data, which is potentially a very expensive operation.

If you merely wish to "split" or "coalesce" partitions to obtain a target number of partitions, you may instead wish to consider using DataFrame.into_partitions, which avoids shuffling data in favor of splitting/coalescing adjacent partitions where appropriate.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
>>> repartitioned_df = df.repartition(3)
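Columns can also be passed to repartition by hash instead of randomly shuffling. A minimal sketch, assuming the same DataFrame as above:

>>> repartitioned_by_x = df.repartition(3, "x")  # hash-partition on column "x" into 3 partitions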
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def repartition(self, num: int | None, *partition_by: ColumnInputType) -> "DataFrame":
    """Repartitions DataFrame to ``num`` partitions.

    If columns are passed in, then DataFrame will be repartitioned by those, otherwise
    random repartitioning will occur.

    Args:
        num (Optional[int]): Number of target partitions; if None, the number of partitions will not be changed.
        *partition_by (Union[str, Expression]): Optional columns to partition by.

    Returns:
        DataFrame: Repartitioned DataFrame.

    Note: This function will globally shuffle your data, which is potentially a very expensive operation.
        If instead you merely wish to "split" or "coalesce" partitions to obtain a target number of partitions,
        you may instead wish to consider using [DataFrame.into_partitions][daft.DataFrame.into_partitions] which
        avoids shuffling of data in favor of splitting/coalescing adjacent partitions where appropriate.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
        >>> repartitioned_df = df.repartition(3)

    """
    if get_or_create_runner().name == "native":
        warnings.warn(
            "DataFrame.repartition not supported on the NativeRunner. This will be a no-op. Please use the RayRunner via `daft.set_runner_ray()` instead if you need to repartition."
        )
    if len(partition_by) == 0:
        warnings.warn(
            "No columns specified for repartition, so doing a random shuffle. If you do not require rebalancing of "
            "partitions, you may instead prefer using `df.into_partitions(N)` which is a cheaper operation that "
            "avoids shuffling data."
        )
        builder = self._builder.random_shuffle(num)
    else:
        builder = self._builder.hash_repartition(num, column_inputs_to_expressions(partition_by))
    return DataFrame(builder)

sample #

sample(fraction: float | None = None, size: int | None = None, with_replacement: bool = False, seed: int | None = None) -> DataFrame

Samples rows from the DataFrame.

Parameters:

Name Type Description Default
fraction Optional[float]

fraction of rows to sample (between 0.0 and 1.0). Must specify either fraction or size, but not both. For backward compatibility, can also be passed as a positional argument.

None
size Optional[int]

exact number of rows to sample. Must specify either fraction or size, but not both. If size exceeds the total number of rows:
  • When with_replacement=False: raises ValueError
  • When with_replacement=True: returns size rows (may contain duplicates)
Note: Sample by size only works on the native runner right now.

None
with_replacement bool

whether to sample with replacement. Defaults to False.

False
seed Optional[int]

random seed. Defaults to None.

None

Returns:

Name Type Description
DataFrame DataFrame

DataFrame with sampled rows.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
>>> # Sample by fraction (backward compatible positional argument)
>>> sampled_df = df.sample(0.5)
>>> sampled_df = sampled_df.collect()
>>> # sampled_df.show()
>>> # ╭───────┬───────┬───────╮
>>> # │ x     ┆ y     ┆ z     │
>>> # │ ---   ┆ ---   ┆ ---   │
>>> # │ Int64 ┆ Int64 ┆ Int64 │
>>> # ╞═══════╪═══════╪═══════╡
>>> # │ 3     ┆ 6     ┆ 9     │
>>> # ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
>>> # │ 1     ┆ 4     ┆ 7     │
>>> # ╰───────┴───────┴───────╯
>>> # <BLANKLINE>
>>> # (Showing first 2 of 2 rows)
>>> # Samples will vary from output to output
>>> # here is a sample output
>>> # ╭───────┬───────┬───────╮
>>> # │ x     ┆ y     ┆ z     │
>>> # │ ---   ┆ ---   ┆ ---   │
>>> # │ Int64 ┆ Int64 ┆ Int64 │
>>> # ╞═══════╪═══════╪═══════╡
>>> # │ 2     ┆ 5     ┆ 8     │
>>> # ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
>>> # │ 3     ┆ 6     ┆ 9     │
>>> # ╰───────┴───────┴───────╯
>>> # Sample by exact number of rows
>>> sampled_df = df.sample(size=2)
>>> sampled_df = sampled_df.collect()
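A seed can be supplied to make sampling reproducible, and with_replacement=True allows the sample to contain duplicate rows. A minimal sketch, assuming the same DataFrame as above:

>>> seeded_df = df.sample(fraction=0.5, seed=42)  # the same seed yields the same sample
>>> replaced_df = df.sample(size=5, with_replacement=True)  # may contain duplicate rows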
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def sample(
    self,
    fraction: float | None = None,
    size: int | None = None,
    with_replacement: bool = False,
    seed: int | None = None,
) -> "DataFrame":
    """Samples rows from the DataFrame.

    Args:
        fraction (Optional[float]): fraction of rows to sample (between 0.0 and 1.0).
            Must specify either `fraction` or `size`, but not both.
            For backward compatibility, can also be passed as a positional argument.
        size (Optional[int]): exact number of rows to sample.
            Must specify either `fraction` or `size`, but not both.
            If `size` exceeds the total number of rows:
            - When `with_replacement=False`: raises ValueError
            - When `with_replacement=True`: returns `size` rows (may contain duplicates)
            Note: Sample by size only works on the native runner right now.
        with_replacement (bool, optional): whether to sample with replacement. Defaults to False.
        seed (Optional[int], optional): random seed. Defaults to None.

    Returns:
        DataFrame: DataFrame with sampled rows.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
        >>> # Sample by fraction (backward compatible positional argument)
        >>> sampled_df = df.sample(0.5)
        >>> sampled_df = sampled_df.collect()
        >>> # sampled_df.show()
        >>> # ╭───────┬───────┬───────╮
        >>> # │ x     ┆ y     ┆ z     │
        >>> # │ ---   ┆ ---   ┆ ---   │
        >>> # │ Int64 ┆ Int64 ┆ Int64 │
        >>> # ╞═══════╪═══════╪═══════╡
        >>> # │ 3     ┆ 6     ┆ 9     │
        >>> # ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        >>> # │ 1     ┆ 4     ┆ 7     │
        >>> # ╰───────┴───────┴───────╯
        >>> # <BLANKLINE>
        >>> # (Showing first 2 of 2 rows)
        >>> # Samples will vary from output to output
        >>> # here is a sample output
        >>> # ╭───────┬───────┬───────╮
        >>> # │ x     ┆ y     ┆ z     │
        >>> # │ ---   ┆ ---   ┆ ---   │
        >>> # │ Int64 ┆ Int64 ┆ Int64 │
        >>> # ╞═══════╪═══════╪═══════╡
        >>> # │ 2     ┆ 5     ┆ 8     │
        >>> # ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        >>> # │ 3     ┆ 6     ┆ 9     │
        >>> # ╰───────┴───────┴───────╯
        >>> # Sample by exact number of rows
        >>> sampled_df = df.sample(size=2)
        >>> sampled_df = sampled_df.collect()
    """
    if fraction is not None and size is not None:
        raise ValueError("Must specify either `fraction` or `size`, but not both")
    if fraction is None and size is None:
        raise ValueError("Must specify either `fraction` or `size`")
    if fraction is not None:
        if fraction < 0.0 or fraction > 1.0:
            raise ValueError(f"fraction should be between 0.0 and 1.0, but got {fraction}")
    if size is not None:
        if size < 0:
            raise ValueError(f"size should be non-negative, but got {size}")
        if get_or_create_runner().name == "ray":
            raise ValueError(
                "Sample by size only works on the native runner right now. "
                "Please use `daft.set_runner_native()` to switch to the native runner, "
                "or use `fraction` instead of `size` for sampling."
            )

    builder = self._builder.sample(fraction, size, with_replacement, seed)
    return DataFrame(builder)

schema #

schema() -> Schema

Returns the Schema of the DataFrame, which provides information about each column, as a Python object.

Returns:

Name Type Description
Schema Schema

schema of the DataFrame

Examples:

>>> import daft
>>>
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
>>> df.schema()
╭─────────────┬────────╮
│ column_name ┆ type   │
╞═════════════╪════════╡
│ x           ┆ Int64  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ y           ┆ String │
╰─────────────┴────────╯
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def schema(self) -> Schema:
    """Returns the Schema of the DataFrame, which provides information about each column, as a Python object.

    Returns:
        Schema: schema of the DataFrame

    Examples:
        >>> import daft
        >>>
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
        >>> df.schema()
        ╭─────────────┬────────╮
        │ column_name ┆ type   │
        ╞═════════════╪════════╡
        │ x           ┆ Int64  │
        ├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ y           ┆ String │
        ╰─────────────┴────────╯
        <BLANKLINE>
    """
    return self.__builder.schema()

select #

select(*columns: ColumnInputType, **projections: Expression) -> DataFrame

Creates a new DataFrame from the provided expressions, similar to a SQL SELECT.

Parameters:

Name Type Description Default
*columns Union[str, Expression]

columns to select from the current DataFrame

()
**projections Expression

additional projections in kwarg format.

{}

Returns:

Name Type Description
DataFrame DataFrame

new DataFrame that will select the passed in columns

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
>>> df = df.select("x", daft.col("y"), daft.col("z") + 1)
>>> df.show()
╭───────┬───────┬───────╮
│ x     ┆ y     ┆ z     │
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 │
╞═══════╪═══════╪═══════╡
│ 1     ┆ 4     ┆ 8     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     ┆ 9     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     ┆ 10    │
╰───────┴───────┴───────╯
(Showing first 3 of 3 rows)
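Projections can also be passed as keyword arguments, in which case the keyword becomes the column alias. A minimal sketch, assuming a DataFrame with columns x and z:

>>> df = daft.from_pydict({"x": [1, 2, 3], "z": [7, 8, 9]})
>>> df = df.select("x", z_plus_one=daft.col("z") + 1)  # the kwarg name is used as the alias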
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def select(self, *columns: ColumnInputType, **projections: Expression) -> "DataFrame":
    """Creates a new DataFrame from the provided expressions, similar to a SQL ``SELECT``.

    Args:
        *columns (Union[str, Expression]): columns to select from the current DataFrame
        **projections (Expression): additional projections in kwarg format.

    Returns:
        DataFrame: new DataFrame that will select the passed in columns

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
        >>> df = df.select("x", daft.col("y"), daft.col("z") + 1)
        >>> df.show()
        ╭───────┬───────┬───────╮
        │ x     ┆ y     ┆ z     │
        │ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 │
        ╞═══════╪═══════╪═══════╡
        │ 1     ┆ 4     ┆ 8     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     ┆ 9     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     ┆ 10    │
        ╰───────┴───────┴───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    selection = column_inputs_to_expressions(columns)
    selection += [expr.alias(alias) for (alias, expr) in projections.items()]
    builder = self._builder.select(selection)
    return DataFrame(builder)

show #

show(n: int = 8, format: PreviewFormat | None = None, verbose: bool = False, max_width: int = 30, align: PreviewAlign = 'left', columns: list[PreviewColumn] | None = None) -> None

Executes enough of the DataFrame in order to display the first n rows.

If IPython is installed, this will use IPython's display utility to pretty-print in a notebook/REPL environment. Otherwise, this will fall back onto a naive Python print.

If no format is given, then daft's truncating preview format is used.
  • The output is a 'fancy' table with rounded corners.
  • Headers contain the column's data type.
  • Columns are truncated to 30 characters.
  • The table's overall width is limited to 10 columns.

Parameters:

Name Type Description Default
n int

number of rows to show. Defaults to 8.

8
format PreviewFormat

the box-drawing format e.g. "fancy" or "markdown".

None
verbose bool

verbose will print header info

False
max_width int

global max column width

30
align PreviewAlign

global column align

'left'
columns list[PreviewColumn]

column overrides

None
Note

This call is blocking and will execute the DataFrame when called

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
>>> df.show()
>>> df.show(format="markdown")
>>> df.show(max_width=50)
>>> df.show(align="left")
Usage
  • If columns are given, their length MUST match the schema.
  • If columns are given, their settings override any global settings.
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def show(
    self,
    n: int = 8,
    format: PreviewFormat | None = None,
    verbose: bool = False,
    max_width: int = 30,
    align: PreviewAlign = "left",
    columns: list[PreviewColumn] | None = None,
) -> None:
    """Executes enough of the DataFrame in order to display the first ``n`` rows.

    If IPython is installed, this will use IPython's `display` utility to pretty-print in a
    notebook/REPL environment. Otherwise, this will fall back onto a naive Python `print`.

    If no format is given, then daft's truncating preview format is used.
        - The output is a 'fancy' table with rounded corners.
        - Headers contain the column's data type.
        - Columns are truncated to 30 characters.
        - The table's overall width is limited to 10 columns.

    Args:
        n: number of rows to show. Defaults to 8.
        format (PreviewFormat): the box-drawing format e.g. "fancy" or "markdown".
        verbose (bool): verbose will print header info
        max_width (int): global max column width
        align (PreviewAlign): global column align
        columns (list[PreviewColumn]): column overrides

    Note:
        This call is **blocking** and will execute the DataFrame when called

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
        >>> df.show()  # doctest: +SKIP
        >>> df.show(format="markdown")  # doctest: +SKIP
        >>> df.show(max_width=50)  # doctest: +SKIP
        >>> df.show(align="left")  # doctest: +SKIP

    Tip: Usage
        - If columns are given, their length MUST match the schema.
        - If columns are given, their settings override any global settings.

    """
    schema = self.schema()
    preview = self._construct_show_preview(n)
    preview_formatter = PreviewFormatter(
        preview,
        schema,
        format,
        **{
            "verbose": verbose,
            "max_width": max_width,
            "align": align,
            "columns": columns,
        },
    )

    try:
        from IPython.display import HTML, display

        if in_notebook() and preview.partition is not None:
            try:
                interactive_html = preview_formatter._generate_interactive_html()
                display(HTML(interactive_html), clear=True)
                return None
            except Exception:
                pass

        display(preview_formatter, clear=True)
    except ImportError:
        print(preview_formatter)
    return None

sort #

sort(by: ColumnInputType | list[ColumnInputType], desc: bool | list[bool] = False, nulls_first: bool | list[bool] | None = None) -> DataFrame

Sorts DataFrame globally.

Parameters:

Name Type Description Default
by Union[ColumnInputType, List[ColumnInputType]]

column to sort by. Can be str or expression as well as a list of either.

required
desc Union[bool, List[bool]]

Sort by descending order. Defaults to False.

False
nulls_first Union[bool, List[bool]]

Sort by nulls first. Defaults to nulls being treated as the greatest value.

None

Returns:

Name Type Description
DataFrame DataFrame

Sorted DataFrame.

Note
  • Since this is a global sort, it requires an expensive repartition which can be quite slow.
  • Supports multicolumn sorts and can have unique descending and nulls_first flags per column.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [3, 2, 1], "y": [6, 4, 5]})
>>> sorted_df = df.sort(df["x"] + df["y"])
>>> sorted_df.show()
╭───────┬───────╮
│ x     ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 2     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
╰───────┴───────╯
(Showing first 3 of 3 rows)

You can also sort by multiple columns, and specify the 'descending' flag for each column:

>>> df = daft.from_pydict({"x": [1, 2, 1, 2], "y": [9, 8, 7, 6]})
>>> sorted_df = df.sort(["x", "y"], [True, False])
>>> sorted_df.show()
╭───────┬───────╮
│ x     ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 2     ┆ 6     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 8     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1     ┆ 7     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1     ┆ 9     │
╰───────┴───────╯
(Showing first 4 of 4 rows)

You can also specify null positioning (first/last) for each column

>>> df = daft.from_pydict({"x": [1, 2, 1, 2, None], "y": [9, 8, None, 6, None]})
>>> sorted_df = df.sort(["x", "y"], [True, False], nulls_first=[True, True])
>>> sorted_df.show()
╭───────┬───────╮
│ x     ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ None  ┆ None  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 6     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 8     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1     ┆ None  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1     ┆ 9     │
╰───────┴───────╯
(Showing first 5 of 5 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def sort(
    self,
    by: ColumnInputType | list[ColumnInputType],
    desc: bool | list[bool] = False,
    nulls_first: bool | list[bool] | None = None,
) -> "DataFrame":
    """Sorts DataFrame globally.

    Args:
        by (Union[ColumnInputType, List[ColumnInputType]]): column to sort by. Can be `str` or expression as well as a list of either.
        desc (Union[bool, List[bool]], optional): Sort by descending order. Defaults to False.
        nulls_first (Union[bool, List[bool]], optional): Sort by nulls first. Defaults to nulls being treated as the greatest value.

    Returns:
        DataFrame: Sorted DataFrame.

    Note:
        * Since this is a global sort, it requires an expensive repartition which can be quite slow.
        * Supports multicolumn sorts and can have unique `descending` and `nulls_first` flags per column.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [3, 2, 1], "y": [6, 4, 5]})
        >>> sorted_df = df.sort(df["x"] + df["y"])
        >>> sorted_df.show()
        ╭───────┬───────╮
        │ x     ┆ y     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 2     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 1     ┆ 5     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)

        You can also sort by multiple columns, and specify the 'descending' flag for each column:

        >>> df = daft.from_pydict({"x": [1, 2, 1, 2], "y": [9, 8, 7, 6]})
        >>> sorted_df = df.sort(["x", "y"], [True, False])
        >>> sorted_df.show()
        ╭───────┬───────╮
        │ x     ┆ y     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 2     ┆ 6     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 8     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 1     ┆ 7     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 1     ┆ 9     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 4 of 4 rows)

        You can also specify null positioning (first/last) for each column

        >>> df = daft.from_pydict({"x": [1, 2, 1, 2, None], "y": [9, 8, None, 6, None]})
        >>> sorted_df = df.sort(["x", "y"], [True, False], nulls_first=[True, True])
        >>> sorted_df.show()
        ╭───────┬───────╮
        │ x     ┆ y     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ None  ┆ None  │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 6     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 8     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 1     ┆ None  │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 1     ┆ 9     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 5 of 5 rows)
    """
    if not isinstance(by, list):
        by = [
            by,
        ]

    if nulls_first is None:
        nulls_first = desc

    sort_by = column_inputs_to_expressions(by)

    builder = self._builder.sort(sort_by=sort_by, descending=desc, nulls_first=nulls_first)
    return DataFrame(builder)

stddev #

stddev(*cols: ColumnInputType) -> DataFrame

Performs a global standard deviation on the DataFrame.

Parameters:

Name Type Description Default
*cols Union[str, Expression]

columns to stddev

()

Returns: DataFrame: Globally aggregated standard deviation. Should be a single row.

Examples:

>>> import daft
>>> df = daft.from_pydict({"col_a": [0, 1, 2]})
>>> df = df.stddev("col_a")
>>> df.show()
╭───────────────────╮
│ col_a             │
│ ---               │
│ Float64           │
╞═══════════════════╡
│ 0.816496580927726 │
╰───────────────────╯
(Showing first 1 of 1 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def stddev(self, *cols: ColumnInputType) -> "DataFrame":
    """Performs a global standard deviation on the DataFrame.

    Args:
        *cols (Union[str, Expression]): columns to stddev
    Returns:
        DataFrame: Globally aggregated standard deviation. Should be a single row.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"col_a": [0, 1, 2]})
        >>> df = df.stddev("col_a")
        >>> df.show()
        ╭───────────────────╮
        │ col_a             │
        │ ---               │
        │ Float64           │
        ╞═══════════════════╡
        │ 0.816496580927726 │
        ╰───────────────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)

    """
    return self._apply_agg_fn(Expression.stddev, cols)

sum #

sum(*cols: ManyColumnsInputType) -> DataFrame

Performs a global sum on the DataFrame.

Parameters:

Name Type Description Default
*cols Union[str, Expression]

columns to sum

()

Returns: DataFrame: Globally aggregated sums. Should be a single row.

Examples:

>>> import daft
>>> df = daft.from_pydict({"col_a": [1, 2, 3]})
>>> df = df.sum("col_a")
>>> df.show()
╭───────╮
│ col_a │
│ ---   │
│ Int64 │
╞═══════╡
│ 6     │
╰───────╯
(Showing first 1 of 1 rows)
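Several columns can be summed in a single call, since the method accepts a variable number of column inputs. A minimal sketch:

>>> df = daft.from_pydict({"col_a": [1, 2, 3], "col_b": [4, 5, 6]})
>>> totals = df.sum("col_a", "col_b")  # one row containing the sum of each column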
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def sum(self, *cols: ManyColumnsInputType) -> "DataFrame":
    """Performs a global sum on the DataFrame.

    Args:
        *cols (Union[str, Expression]): columns to sum
    Returns:
        DataFrame: Globally aggregated sums. Should be a single row.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"col_a": [1, 2, 3]})
        >>> df = df.sum("col_a")
        >>> df.show()
        ╭───────╮
        │ col_a │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 6     │
        ╰───────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    return self._apply_agg_fn(Expression.sum, cols)

summarize #

summarize() -> DataFrame

Returns column statistics for the DataFrame.

Returns:

Name Type Description
DataFrame DataFrame

new DataFrame with the computed column statistics.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
>>> df.summarize().show()
╭────────┬────────┬────────┬────────────┬────────┬─────────────┬───────────────────────╮
│ column ┆ type   ┆ min    ┆      …     ┆ count  ┆ count_nulls ┆ approx_count_distinct │
│ ---    ┆ ---    ┆ ---    ┆            ┆ ---    ┆ ---         ┆ ---                   │
│ String ┆ String ┆ String ┆ (1 hidden) ┆ UInt64 ┆ UInt64      ┆ UInt64                │
╞════════╪════════╪════════╪════════════╪════════╪═════════════╪═══════════════════════╡
│ x      ┆ Int64  ┆ 1      ┆ …          ┆ 3      ┆ 0           ┆ 3                     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ y      ┆ Int64  ┆ 4      ┆ …          ┆ 3      ┆ 0           ┆ 3                     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ z      ┆ Int64  ┆ 7      ┆ …          ┆ 3      ┆ 0           ┆ 3                     │
╰────────┴────────┴────────┴────────────┴────────┴─────────────┴───────────────────────╯
(Showing first 3 of 3 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def summarize(self) -> "DataFrame":
    """Returns column statistics for the DataFrame.

    Returns:
        DataFrame: new DataFrame with the computed column statistics.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
        >>> df.summarize().show()  # doctest: +SKIP
        ╭────────┬────────┬────────┬────────────┬────────┬─────────────┬───────────────────────╮
        │ column ┆ type   ┆ min    ┆      …     ┆ count  ┆ count_nulls ┆ approx_count_distinct │
        │ ---    ┆ ---    ┆ ---    ┆            ┆ ---    ┆ ---         ┆ ---                   │
        │ String ┆ String ┆ String ┆ (1 hidden) ┆ UInt64 ┆ UInt64      ┆ UInt64                │
        ╞════════╪════════╪════════╪════════════╪════════╪═════════════╪═══════════════════════╡
        │ x      ┆ Int64  ┆ 1      ┆ …          ┆ 3      ┆ 0           ┆ 3                     │
        ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ y      ┆ Int64  ┆ 4      ┆ …          ┆ 3      ┆ 0           ┆ 3                     │
        ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ z      ┆ Int64  ┆ 7      ┆ …          ┆ 3      ┆ 0           ┆ 3                     │
        ╰────────┴────────┴────────┴────────────┴────────┴─────────────┴───────────────────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    builder = self._builder.summarize()
    return DataFrame(builder)

to_arrow #

to_arrow() -> Table

Converts the current DataFrame to a pyarrow Table.

If results have not been computed yet, collect will be called.

Returns:

Type Description
Table

pyarrow.Table: pyarrow Table converted from a Daft DataFrame

Note

This call is blocking and will execute the DataFrame when called

Examples:

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> arrow_table = df.to_arrow()
>>> print(arrow_table)
pyarrow.Table
a: int64
b: int64
----
a: [[1,2,3]]
b: [[4,5,6]]
Tip

See also DataFrame.to_arrow_iter() for a streaming iterator over the rows of the DataFrame as Arrow RecordBatches.

Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def to_arrow(self) -> "pyarrow.Table":
    """Converts the current DataFrame to a [pyarrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html).

    If results have not been computed yet, collect will be called.

    Returns:
        pyarrow.Table: [pyarrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html) converted from a Daft DataFrame

    Note:
        This call is **blocking** and will execute the DataFrame when called

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
        >>> arrow_table = df.to_arrow()
        >>> print(arrow_table)
        pyarrow.Table
        a: int64
        b: int64
        ----
        a: [[1,2,3]]
        b: [[4,5,6]]

    Tip:
        See also [DataFrame.to_arrow_iter()][daft.DataFrame.to_arrow_iter] for
        a streaming iterator over the rows of the DataFrame as Arrow RecordBatches.
    """
    import pyarrow as pa

    arrow_rb_iter = self.to_arrow_iter(results_buffer_size=None)
    return pa.Table.from_batches(arrow_rb_iter, schema=self.schema().to_pyarrow_schema())

to_arrow_iter #

to_arrow_iter(results_buffer_size: int | None | Literal['num_cpus'] = 'num_cpus') -> Iterator[RecordBatch]

Return an iterator of pyarrow recordbatches for this dataframe.

Parameters:

Name Type Description Default
results_buffer_size int | None | Literal['num_cpus']

how many partitions to allow in the results buffer (defaults to the total number of CPUs available on the machine).

'num_cpus'

Note: A quick note on configuring asynchronous/parallel execution using results_buffer_size. The results_buffer_size kwarg controls how many results Daft will allow to be in the buffer while iterating. Once this buffer is filled, Daft will not run any more work until some partition is consumed from the buffer.
  • Increasing this value means the iterator will consume more memory and CPU resources but have higher throughput
  • Decreasing this value means the iterator will consume lower memory and CPU resources, but have lower throughput
  • Setting this value to None means the iterator will consume as much resources as it deems appropriate per-iteration
The default value is the total number of CPUs available on the current machine.

Returns:

Type Description
Iterator[RecordBatch]

Iterator[pyarrow.RecordBatch]: An iterator over the RecordBatches of the DataFrame.

Examples:

>>> import daft
>>>
>>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})
>>> for batch in df.to_arrow_iter():
...     print(batch)
pyarrow.RecordBatch
foo: int64
bar: large_string
----
foo: [1,2,3]
bar: ["a","b","c"]
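The buffering behaviour described in the note above is controlled via results_buffer_size. A minimal sketch, assuming the same DataFrame as above:

>>> batches = list(df.to_arrow_iter(results_buffer_size=2))  # keep at most 2 partitions buffered while iterating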
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def to_arrow_iter(
    self,
    results_buffer_size: int | None | Literal["num_cpus"] = "num_cpus",
) -> Iterator["pyarrow.RecordBatch"]:
    """Return an iterator of pyarrow recordbatches for this dataframe.

    Args:
        results_buffer_size: how many partitions to allow in the results buffer (defaults to the total number of CPUs
            available on the machine).
    Note: A quick note on configuring asynchronous/parallel execution using `results_buffer_size`.
        The `results_buffer_size` kwarg controls how many results Daft will allow to be in the buffer while iterating.
        Once this buffer is filled, Daft will not run any more work until some partition is consumed from the buffer.
        * Increasing this value means the iterator will consume more memory and CPU resources but have higher throughput
        * Decreasing this value means the iterator will consume lower memory and CPU resources, but have lower throughput
        * Setting this value to `None` means the iterator will consume as much resources as it deems appropriate per-iteration
        The default value is the total number of CPUs available on the current machine.

    Returns:
        Iterator[pyarrow.RecordBatch]: An iterator over the RecordBatches of the DataFrame.

    Examples:
        >>> import daft
        >>>
        >>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})
        >>> for batch in df.to_arrow_iter():
        ...     print(batch)
        pyarrow.RecordBatch
        foo: int64
        bar: large_string
        ----
        foo: [1,2,3]
        bar: ["a","b","c"]
    """
    if results_buffer_size == "num_cpus":
        results_buffer_size = multiprocessing.cpu_count()
    if results_buffer_size is not None and not results_buffer_size > 0:
        raise ValueError(f"Provided `results_buffer_size` value must be > 0, received: {results_buffer_size}")

    results = self._result
    if results is not None:
        # If the dataframe has already finished executing,
        # use the precomputed results.

        for _, result in results.items():
            yield from (result.micropartition().to_arrow().to_batches())
    else:
        # Execute the dataframe in a streaming fashion.
        partitions_iter = get_or_create_runner().run_iter_tables(
            self._builder, results_buffer_size=results_buffer_size
        )

        # Iterate through partitions.
        for partition in partitions_iter:
            yield from partition.to_arrow().to_batches()

to_dask_dataframe #

to_dask_dataframe(meta: Union[DataFrame, Series[Any], dict[str, Any], Iterable[Any], tuple[Any], None] = None) -> DataFrame

Converts the current Daft DataFrame to a Dask DataFrame.

The returned Dask DataFrame will use Dask-on-Ray to execute operations on a Ray cluster.

Parameters:

Name Type Description Default
meta Union[DataFrame, Series[Any], dict[str, Any], Iterable[Any], tuple[Any], None]

An empty pandas DataFrame or Series that matches the dtypes and column names of the stream. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. By default, this will be inferred from the underlying Daft DataFrame schema, with this argument supplying an optional override.

None

Returns:

Type Description
DataFrame

dask.DataFrame: A Dask DataFrame stored on a Ray cluster.

Note

This function can only work if Daft is running using the RayRunner.

Examples:

>>> import daft
>>> daft.set_runner_ray()
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> dask_df = df.to_dask_dataframe()
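The meta override can also be given explicitly, for example as a dict of column names to dtypes; a minimal sketch, assuming the Ray runner is active and that string dtype names are acceptable to Dask:

>>> dask_df = df.to_dask_dataframe(meta={"a": "int64", "b": "int64"})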
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def to_dask_dataframe(
    self,
    meta: Union[
        "pandas.DataFrame",
        "pandas.Series[Any]",
        dict[str, Any],
        Iterable[Any],
        tuple[Any],
        None,
    ] = None,
) -> "dask.DataFrame":
    """Converts the current Daft DataFrame to a Dask DataFrame.

    The returned Dask DataFrame will use [Dask-on-Ray](https://docs.ray.io/en/latest/ray-more-libs/dask-on-ray.html)
    to execute operations on a Ray cluster.

    Args:
        meta: An empty [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) or [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) that matches the dtypes and column
            names of the stream. This metadata is necessary for many algorithms in
            dask dataframe to work. For ease of use, some alternative inputs are
            also available. Instead of a DataFrame, a dict of ``{name: dtype}`` or
            iterable of ``(name, dtype)`` can be provided (note that the order of
            the names should match the order of the columns). Instead of a series, a
            tuple of ``(name, dtype)`` can be used.
            By default, this will be inferred from the underlying Daft DataFrame schema,
            with this argument supplying an optional override.

    Returns:
        dask.DataFrame: A Dask DataFrame stored on a Ray cluster.

    Note:
        This function can only work if Daft is running using the RayRunner.

    Examples:
        >>> import daft
        >>> daft.set_runner_ray()  # doctest: +SKIP
        >>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
        >>> dask_df = df.to_dask_dataframe()  # doctest: +SKIP

    """
    from daft.runners.ray_runner import RayPartitionSet

    self.collect()
    partition_set = self._result
    assert partition_set is not None
    # TODO(Clark): Support Dask DataFrame conversion for the local runner if
    # Dask is using a non-distributed scheduler.
    if not isinstance(partition_set, RayPartitionSet):
        raise ValueError("Cannot convert to Dask DataFrame if not running on Ray backend")
    return partition_set.to_dask_dataframe(meta)

to_pandas #

to_pandas(coerce_temporal_nanoseconds: bool = False) -> DataFrame

Converts the current DataFrame to a pandas DataFrame.

If results have not been computed yet, collect will be called.

Parameters:

Name Type Description Default
coerce_temporal_nanoseconds bool

Whether to coerce temporal columns to nanoseconds. Only applicable to pandas version >= 2.0 and pyarrow version >= 13.0.0. Defaults to False. See pyarrow.Table.to_pandas <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas>__ for more information.

False

Returns:

Type Description
DataFrame

pandas.DataFrame: pandas DataFrame converted from a Daft DataFrame

Note

This call is blocking and will execute the DataFrame when called

Examples:

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> pd_df = df.to_pandas()
>>> print(pd_df)
   a  b
0  1  4
1  2  5
2  3  6
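When the DataFrame holds timestamp columns, coerce_temporal_nanoseconds is forwarded to the underlying pyarrow conversion. A minimal sketch, assuming a DataFrame with a single timestamp column:

>>> import datetime
>>> df_ts = daft.from_pydict({"t": [datetime.datetime(2024, 1, 1)]})
>>> pd_ts = df_ts.to_pandas(coerce_temporal_nanoseconds=True)  # temporal columns coerced to nanoseconds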
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def to_pandas(self, coerce_temporal_nanoseconds: bool = False) -> "pandas.DataFrame":
    """Converts the current DataFrame to a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

    If results have not been computed yet, collect will be called.

    Args:
        coerce_temporal_nanoseconds (bool): Whether to coerce temporal columns to nanoseconds. Only applicable to pandas version >= 2.0 and pyarrow version >= 13.0.0. Defaults to False. See `pyarrow.Table.to_pandas <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas>`__ for more information.

    Returns:
        pandas.DataFrame: [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) converted from a Daft DataFrame

    Note:
        This call is **blocking** and will execute the DataFrame when called

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
        >>> pd_df = df.to_pandas()
        >>> print(pd_df)
           a  b
        0  1  4
        1  2  5
        2  3  6
    """
    self.collect()
    result = self._result
    assert result is not None

    pd_df = result.to_pandas(
        schema=self._builder.schema(),
        coerce_temporal_nanoseconds=coerce_temporal_nanoseconds,
    )
    return pd_df
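
A sketch of the coerce_temporal_nanoseconds flag in action. With pandas >= 2.0, timestamp columns may otherwise come back at non-nanosecond precision; the exact dtype you observe depends on your pandas and pyarrow versions:

>>> import daft
>>> import datetime
>>> df = daft.from_pydict({"ts": [datetime.datetime(2024, 1, 1)]})
>>> # Force nanosecond-precision (datetime64[ns]) timestamps in the result
>>> pd_df = df.to_pandas(coerce_temporal_nanoseconds=True)  # doctest: +SKIP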

to_pydict #

to_pydict() -> dict[str, list[Any]]

Converts the current DataFrame to a python dictionary. The dictionary contains Python lists of Python objects for each column.

If results have not been computed yet, collect will be called.

Returns:

Type Description
dict[str, list[Any]]

dict[str, list[Any]]: python dict converted from a Daft DataFrame

Note

This call is blocking and will execute the DataFrame when called

Examples:

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2, 3, 4], "b": [2, 4, 3, 1]})
>>> print(df.to_pydict())
{'a': [1, 2, 3, 4], 'b': [2, 4, 3, 1]}
Tip

See also DataFrame.to_pylist() for a convenience method that converts the DataFrame to a list of Python dict objects.

Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def to_pydict(self) -> dict[str, list[Any]]:
    """Converts the current DataFrame to a python dictionary. The dictionary contains Python lists of Python objects for each column.

    If results have not been computed yet, collect will be called.

    Returns:
        dict[str, list[Any]]: python dict converted from a Daft DataFrame

    Note:
        This call is **blocking** and will execute the DataFrame when called

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"a": [1, 2, 3, 4], "b": [2, 4, 3, 1]})
        >>> print(df.to_pydict())
        {'a': [1, 2, 3, 4], 'b': [2, 4, 3, 1]}

    Tip:
        See also [DataFrame.to_pylist()][daft.DataFrame.to_pylist] for
        a convenience method that converts the DataFrame to a list of Python dict objects.
    """
    self.collect()
    result = self._result
    assert result is not None
    return result.to_pydict(schema=self.schema())
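
Because the result is column-oriented, turning it into rows takes one extra step. A small illustrative sketch in plain Python (DataFrame.to_pylist() does this directly):

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2], "b": [3, 4]})
>>> cols = df.to_pydict()
>>> # Zip the column lists back together into one dict per row
>>> [dict(zip(cols.keys(), vals)) for vals in zip(*cols.values())]
[{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]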

to_pylist #

to_pylist() -> list[Any]

Converts the current DataFrame into a Python list of dictionaries, one per row.

Returns:

Type Description
list[Any]

List[dict[str, Any]]: List of python dict objects.

Warning

This is a convenience method over DataFrame.iter_rows(). Users should prefer using .iter_rows() directly instead for lower memory utilization if they are streaming rows out of a DataFrame and don't require full materialization of the Python list.

Examples:

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict({"a": [1, 2, 3, 4], "b": [2, 4, 3, 1]})
>>> print(df.to_pylist())
[{'a': 1, 'b': 2}, {'a': 2, 'b': 4}, {'a': 3, 'b': 3}, {'a': 4, 'b': 1}]
See also

df.iter_rows(): streaming iterator over individual rows in a DataFrame

Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def to_pylist(self) -> list[Any]:
    """Converts the current Dataframe into a python list.

    Returns:
        List[dict[str, Any]]: List of python dict objects.

    Warning:
        This is a convenience method over [DataFrame.iter_rows()][daft.DataFrame.iter_rows]. Users should prefer using `.iter_rows()` directly instead for lower memory utilization if they are streaming rows out of a DataFrame and don't require full materialization of the Python list.

    Examples:
        >>> import daft
        >>> from daft import col
        >>> df = daft.from_pydict({"a": [1, 2, 3, 4], "b": [2, 4, 3, 1]})
        >>> print(df.to_pylist())
        [{'a': 1, 'b': 2}, {'a': 2, 'b': 4}, {'a': 3, 'b': 3}, {'a': 4, 'b': 1}]

    Tip: See also
        [df.iter_rows()][daft.DataFrame.iter_rows]: streaming iterator over individual rows in a DataFrame
    """
    return list(self.iter_rows())
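
For the streaming alternative recommended in the warning above, a minimal sketch; process is a hypothetical per-row callback, not a Daft API:

>>> for row in df.iter_rows():  # doctest: +SKIP
...     process(row)  # hypothetical handler for one {"column": value} dict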

to_ray_dataset #

to_ray_dataset() -> DataSet

Converts the current DataFrame to a Ray Dataset which is useful for running distributed ML model training in Ray.

Returns:

Type Description
DataSet

ray.data.dataset.DataSet: Ray dataset

Examples:

>>> import daft
>>> daft.set_runner_ray()
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> ray_dataset = df.to_ray_dataset()
Note

This function can only work if Daft is running using the RayRunner

Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def to_ray_dataset(self) -> "ray.data.dataset.DataSet":
    """Converts the current DataFrame to a [Ray Dataset](https://docs.ray.io/en/latest/data/api/dataset.html#ray.data.Dataset) which is useful for running distributed ML model training in Ray.

    Returns:
        ray.data.dataset.DataSet: [Ray dataset](https://docs.ray.io/en/latest/data/api/dataset.html#ray.data.Dataset)

    Examples:
        >>> import daft
        >>> daft.set_runner_ray()  # doctest: +SKIP
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> ray_dataset = df.to_ray_dataset()  # doctest: +SKIP

    Note:
        This function can only work if Daft is running using the RayRunner
    """
    from daft.runners.ray_runner import RayPartitionSet

    self.collect()
    partition_set = self._result
    assert partition_set is not None
    if not isinstance(partition_set, RayPartitionSet):
        raise ValueError("Cannot convert to Ray Dataset if not running on Ray backend")
    return partition_set.to_ray_dataset()
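
Once converted, the result can be consumed with Ray Data's own API. A sketch, assuming a Ray cluster is available:

>>> ray_dataset = df.to_ray_dataset()  # doctest: +SKIP
>>> # Peek at the first two rows via Ray Data
>>> ray_dataset.take(2)  # doctest: +SKIP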

to_torch_iter_dataset #

to_torch_iter_dataset(shard_strategy: Literal['file'] | None = None, world_size: int | None = None, rank: int | None = None) -> IterableDataset

Convert the current DataFrame into a [Torch IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) for use with PyTorch.

Begins execution of the DataFrame if it is not yet executed.

Items will be returned in pydict format: a dict of {"column name": value} for each row in the data.

Parameters:

Name Type Description Default
shard_strategy Optional[Literal['file']]

Strategy to use for sharding the dataset. Currently only "file" is supported.

None
world_size Optional[int]

Total number of workers for sharding. Required if shard_strategy is specified.

None
rank Optional[int]

Rank of current worker for sharding. Required if shard_strategy is specified.

None

Returns:

Type Description
IterableDataset

torch.utils.data.IterableDataset: A PyTorch IterableDataset containing the data from the DataFrame.

Examples:

>>> import daft
>>> import torch
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> torch_iter_dataset = df.to_torch_iter_dataset()
>>> list(torch.utils.data.DataLoader(torch_iter_dataset))
[{'x': tensor([1]), 'y': tensor([4])}, {'x': tensor([2]), 'y': tensor([5])}, {'x': tensor([3]), 'y': tensor([6])}]
Note

The produced dataset is meant to be used with the single-process DataLoader, and does not support data sharding hooks for multi-process data loading.

Do keep in mind that Daft is already using multithreading or multiprocessing under the hood to compute the data stream that feeds this dataset.

Tip

This method returns results locally. For distributed training, you may want to use DataFrame.to_ray_dataset().

Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def to_torch_iter_dataset(
    self,
    shard_strategy: Literal["file"] | None = None,
    world_size: int | None = None,
    rank: int | None = None,
) -> "torch.utils.data.IterableDataset":
    """Convert the current DataFrame into a `Torch IterableDataset <https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset>`__ for use with PyTorch.

    Begins execution of the DataFrame if it is not yet executed.

    Items will be returned in pydict format: a dict of `{"column name": value}` for each row in the data.

    Args:
        shard_strategy (Optional[Literal["file"]]): Strategy to use for sharding the dataset. Currently only "file" is supported.
        world_size (Optional[int]): Total number of workers for sharding. Required if shard_strategy is specified.
        rank (Optional[int]): Rank of current worker for sharding. Required if shard_strategy is specified.

    Returns:
        torch.utils.data.IterableDataset: A PyTorch IterableDataset containing the data from the DataFrame.

    Examples:
        >>> import daft
        >>> import torch  # doctest: +SKIP
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> torch_iter_dataset = df.to_torch_iter_dataset()  # doctest: +SKIP
        >>> list(torch.utils.data.DataLoader(torch_iter_dataset))  # doctest: +SKIP
        [{'x': tensor([1]), 'y': tensor([4])}, {'x': tensor([2]), 'y': tensor([5])}, {'x': tensor([3]), 'y': tensor([6])}]

    Note:
        The produced dataset is meant to be used with the single-process DataLoader,
        and does not support data sharding hooks for multi-process data loading.

        Do keep in mind that Daft is already using multithreading or multiprocessing under the hood
        to compute the data stream that feeds this dataset.

    Tip:
        This method returns results locally.
        For distributed training, you may want to use [DataFrame.to_ray_dataset()][daft.DataFrame.to_ray_dataset].
    """
    from daft.dataframe.to_torch import DaftTorchIterableDataset

    # TODO(desmond): We need to take in the batch size and number of epochs. So that when we shard, we can ensure that each shard produces
    # the same number of batches without coordination.

    if shard_strategy is not None:
        if world_size is None or rank is None:
            raise ValueError("world_size and rank must be specified when using sharding")
        df = self._shard(shard_strategy, world_size, rank)
    else:
        df = self

    return DaftTorchIterableDataset(df)
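
When sharding for distributed data loading, every worker passes the same world_size and its own rank. A sketch with hypothetical values (a 4-worker job, seen from the worker with rank 0):

>>> dataset = df.to_torch_iter_dataset(  # doctest: +SKIP
...     shard_strategy="file",
...     world_size=4,  # hypothetical: total number of workers
...     rank=0,  # hypothetical: this worker's rank
... )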

to_torch_map_dataset #

to_torch_map_dataset(shard_strategy: Literal['file'] | None = None, world_size: int | None = None, rank: int | None = None) -> Dataset

Convert the current DataFrame into a map-style Torch Dataset for use with PyTorch.

This method will materialize the entire DataFrame and block on completion.

Items will be returned in pydict format: a dict of {"column name": value} for each row in the data.

Note

If you do not need random access, you may get better performance out of an IterableDataset, which streams data items in as soon as they are ready and does not block on full materialization.

Tip

This method returns results locally. For distributed training, you may want to use DataFrame.to_ray_dataset().

Parameters:

Name Type Description Default
shard_strategy Optional[Literal['file']]

Strategy to use for sharding the dataset. Currently only "file" is supported.

None
world_size Optional[int]

Total number of workers for sharding. Required if shard_strategy is specified.

None
rank Optional[int]

Rank of current worker for sharding. Required if shard_strategy is specified.

None

Returns:

Type Description
Dataset

torch.utils.data.Dataset: A PyTorch Dataset containing the data from the DataFrame.

Note

The produced dataset is meant to be used with the single-process DataLoader, and does not support data sharding hooks for multi-process data loading.

Examples:

>>> import daft
>>> import torch
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> torch_dataset = df.to_torch_map_dataset()

Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def to_torch_map_dataset(
    self,
    shard_strategy: Literal["file"] | None = None,
    world_size: int | None = None,
    rank: int | None = None,
) -> "torch.utils.data.Dataset":
    """Convert the current DataFrame into a map-style [Torch Dataset](https://pytorch.org/docs/stable/data.html#map-style-datasets) for use with PyTorch.

    This method will materialize the entire DataFrame and block on completion.

    Items will be returned in pydict format: a dict of `{"column name": value}` for each row in the data.

    Note:
        If you do not need random access, you may get better performance out of an IterableDataset,
        which streams data items in as soon as they are ready and does not block on full materialization.

    Tip:
        This method returns results locally.
        For distributed training, you may want to use [DataFrame.to_ray_dataset()][daft.DataFrame.to_ray_dataset].

    Args:
        shard_strategy (Optional[Literal["file"]]): Strategy to use for sharding the dataset. Currently only "file" is supported.
        world_size (Optional[int]): Total number of workers for sharding. Required if shard_strategy is specified.
        rank (Optional[int]): Rank of current worker for sharding. Required if shard_strategy is specified.

    Returns:
        torch.utils.data.Dataset: A PyTorch Dataset containing the data from the DataFrame.

    Note:
        The produced dataset is meant to be used with the single-process DataLoader,
        and does not support data sharding hooks for multi-process data loading.

    Examples:
        >>> import daft
        >>> import torch  # doctest: +SKIP
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> torch_dataset = df.to_torch_map_dataset()  # doctest: +SKIP

    """
    from daft.dataframe.to_torch import DaftTorchDataset

    if shard_strategy is not None:
        if world_size is None or rank is None:
            raise ValueError("world_size and rank must be specified when using sharding")
        df = self._shard(shard_strategy, world_size, rank)
    else:
        df = self

    return DaftTorchDataset(df.to_pydict(), len(df))
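
Because the map-style dataset is fully materialized and supports random access, it can be paired with a shuffling DataLoader. A sketch, assuming torch is installed:

>>> torch_dataset = df.to_torch_map_dataset()  # doctest: +SKIP
>>> loader = torch.utils.data.DataLoader(torch_dataset, batch_size=2, shuffle=True)  # doctest: +SKIP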

transform #

transform(func: Callable[..., DataFrame], *args: Any, **kwargs: Any) -> DataFrame

Apply a function that takes and returns a DataFrame.

Allow splitting your transformation into different units of work (functions) while preserving the syntax for chaining transformations.

Examples:

>>> import daft
>>> df = daft.from_pydict({"col_a": [1, 2, 3, 4]})
>>> def add_1(df):
...     df = df.select(daft.col("col_a") + 1)
...     return df
>>> def multiply_x(df, x):
...     df = df.select(daft.col("col_a") * x)
...     return df
>>> df = df.transform(add_1).transform(multiply_x, 4)
>>> df.show()
╭───────╮
│ col_a │
│ ---   │
│ Int64 │
╞═══════╡
│ 8     │
├╌╌╌╌╌╌╌┤
│ 12    │
├╌╌╌╌╌╌╌┤
│ 16    │
├╌╌╌╌╌╌╌┤
│ 20    │
╰───────╯
(Showing first 4 of 4 rows)

Parameters:

Name Type Description Default
func Callable[..., DataFrame]

A function that takes and returns a DataFrame.

required
*args Any

Positional arguments to pass to func.

()
**kwargs Any

Keyword arguments to pass to func.

{}

Returns:

Name Type Description
DataFrame DataFrame

Transformed DataFrame.

Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def transform(self, func: Callable[..., "DataFrame"], *args: Any, **kwargs: Any) -> "DataFrame":
    """Apply a function that takes and returns a DataFrame.

    Allow splitting your transformation into different units of work (functions) while preserving the syntax for chaining transformations.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"col_a": [1, 2, 3, 4]})
        >>> def add_1(df):
        ...     df = df.select(daft.col("col_a") + 1)
        ...     return df
        >>> def multiply_x(df, x):
        ...     df = df.select(daft.col("col_a") * x)
        ...     return df
        >>> df = df.transform(add_1).transform(multiply_x, 4)
        >>> df.show()
        ╭───────╮
        │ col_a │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 8     │
        ├╌╌╌╌╌╌╌┤
        │ 12    │
        ├╌╌╌╌╌╌╌┤
        │ 16    │
        ├╌╌╌╌╌╌╌┤
        │ 20    │
        ╰───────╯
        <BLANKLINE>
        (Showing first 4 of 4 rows)

    Args:
        func: A function that takes and returns a DataFrame.
        *args: Positional arguments to pass to func.
        **kwargs: Keyword arguments to pass to func.

    Returns:
        DataFrame: Transformed DataFrame.
    """
    result = func(self, *args, **kwargs)
    assert isinstance(result, DataFrame), (
        f"Func returned an instance of type [{type(result)}], should have been DataFrame."
    )
    return result
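
Arguments after the function are forwarded as-is, so the chained call in the example is equivalent to nesting the calls directly, and keyword arguments work too. A sketch reusing the add_1 and multiply_x helpers defined above:

>>> df2 = multiply_x(add_1(df), 4)  # doctest: +SKIP
>>> df3 = df.transform(multiply_x, x=4)  # doctest: +SKIP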

union #

union(other: DataFrame) -> DataFrame

Returns the distinct union of two DataFrames.

Parameters:

Name Type Description Default
other DataFrame

The DataFrame to union with this one.

required

Returns:

Name Type Description
DataFrame DataFrame

A new DataFrame containing the distinct rows from both DataFrames.

Examples:

>>> import daft
>>> df1 = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> df2 = daft.from_pydict({"x": [3, 4, 5], "y": [6, 7, 8]})
>>> df1.union(df2).sort("x").show()
╭───────┬───────╮
│ x     ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4     ┆ 7     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5     ┆ 8     │
╰───────┴───────╯
(Showing first 5 of 5 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def union(self, other: "DataFrame") -> "DataFrame":
    """Returns the distinct union of two DataFrames.

    Args:
        other (DataFrame): The DataFrame to union with this one.

    Returns:
        DataFrame: A new DataFrame containing the distinct rows from both DataFrames.

    Examples:
        >>> import daft
        >>> df1 = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> df2 = daft.from_pydict({"x": [3, 4, 5], "y": [6, 7, 8]})
        >>> df1.union(df2).sort("x").show()
        ╭───────┬───────╮
        │ x     ┆ y     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 4     ┆ 7     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 5     ┆ 8     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 5 of 5 rows)
    """
    builder = self._builder.union(other._builder)
    return DataFrame(builder)
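
Semantically, union is union_all followed by distinct. A sketch of that equivalence; the sort is only there to make the comparison order-insensitive:

>>> a = df1.union(df2).sort("x").to_pydict()  # doctest: +SKIP
>>> b = df1.union_all(df2).distinct().sort("x").to_pydict()  # doctest: +SKIP
>>> a == b  # doctest: +SKIP
True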

union_all #

union_all(other: DataFrame) -> DataFrame

Returns the union of two DataFrames, including duplicates.

Parameters:

Name Type Description Default
other DataFrame

The DataFrame to union with this one.

required

Returns:

Name Type Description
DataFrame DataFrame

A new DataFrame containing all rows from both DataFrames, including duplicates.

Examples:

>>> import daft
>>> df1 = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> df2 = daft.from_pydict({"x": [3, 2, 1], "y": [6, 5, 4]})
>>> df1.union_all(df2).sort("x").show()
╭───────┬───────╮
│ x     ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
╰───────┴───────╯
(Showing first 6 of 6 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def union_all(self, other: "DataFrame") -> "DataFrame":
    """Returns the union of two DataFrames, including duplicates.

    Args:
        other (DataFrame): The DataFrame to union with this one.

    Returns:
        DataFrame: A new DataFrame containing all rows from both DataFrames, including duplicates.

    Examples:
        >>> import daft
        >>> df1 = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> df2 = daft.from_pydict({"x": [3, 2, 1], "y": [6, 5, 4]})
        >>> df1.union_all(df2).sort("x").show()
        ╭───────┬───────╮
        │ x     ┆ y     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 1     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 6 of 6 rows)
    """
    builder = self._builder.union(other._builder, is_all=True)
    return DataFrame(builder)

union_all_by_name #

union_all_by_name(other: DataFrame) -> DataFrame

Returns the union of two DataFrames, including duplicates, with columns matched by name.

Parameters:

Name Type Description Default
other DataFrame

The DataFrame to union with this one, matching columns by name.

required

Returns:

Name Type Description
DataFrame DataFrame

A new DataFrame containing all rows from both DataFrames, including duplicates, with columns matched by name.

Examples:

>>> import daft
>>> df1 = daft.from_pydict({"x": [1, 2], "y": [4, 5], "w": [9, 10]})
>>> df2 = daft.from_pydict({"y": [6, 6, 7, 7], "z": ["a", "a", "b", "b"]})
>>> df1.union_all_by_name(df2).sort("y").show()
╭───────┬───────┬───────┬────────╮
│ x     ┆ y     ┆ w     ┆ z      │
│ ---   ┆ ---   ┆ ---   ┆ ---    │
│ Int64 ┆ Int64 ┆ Int64 ┆ String │
╞═══════╪═══════╪═══════╪════════╡
│ 1     ┆ 4     ┆ 9     ┆ None   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2     ┆ 5     ┆ 10    ┆ None   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ None  ┆ 6     ┆ None  ┆ a      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ None  ┆ 6     ┆ None  ┆ a      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ None  ┆ 7     ┆ None  ┆ b      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ None  ┆ 7     ┆ None  ┆ b      │
╰───────┴───────┴───────┴────────╯
(Showing first 6 of 6 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def union_all_by_name(self, other: "DataFrame") -> "DataFrame":
    """Returns the union of two DataFrames, including duplicates, with columns matched by name.

    Args:
        other (DataFrame): The DataFrame to union with this one, matching columns by name.

    Returns:
        DataFrame: A new DataFrame containing all rows from both DataFrames, including duplicates, with columns matched by name.

    Examples:
        >>> import daft
        >>> df1 = daft.from_pydict({"x": [1, 2], "y": [4, 5], "w": [9, 10]})
        >>> df2 = daft.from_pydict({"y": [6, 6, 7, 7], "z": ["a", "a", "b", "b"]})
        >>> df1.union_all_by_name(df2).sort("y").show()
        ╭───────┬───────┬───────┬────────╮
        │ x     ┆ y     ┆ w     ┆ z      │
        │ ---   ┆ ---   ┆ ---   ┆ ---    │
        │ Int64 ┆ Int64 ┆ Int64 ┆ String │
        ╞═══════╪═══════╪═══════╪════════╡
        │ 1     ┆ 4     ┆ 9     ┆ None   │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     ┆ 10    ┆ None   │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ None  ┆ 6     ┆ None  ┆ a      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ None  ┆ 6     ┆ None  ┆ a      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ None  ┆ 7     ┆ None  ┆ b      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ None  ┆ 7     ┆ None  ┆ b      │
        ╰───────┴───────┴───────┴────────╯
        <BLANKLINE>
        (Showing first 6 of 6 rows)
    """
    builder = self._builder.union(other._builder, is_all=True, is_by_name=True)
    return DataFrame(builder)

union_by_name #

union_by_name(other: DataFrame) -> DataFrame

Returns the distinct union of two DataFrames, with columns matched by name.

Parameters:

Name Type Description Default
other DataFrame

The DataFrame to union with this one, matching columns by name.

required

Returns:

Name Type Description
DataFrame DataFrame

A new DataFrame containing the distinct rows from both DataFrames, with columns matched by name.

Examples:

>>> import daft
>>> df1 = daft.from_pydict({"x": [1, 2], "y": [4, 5], "w": [9, 10]})
>>> df2 = daft.from_pydict({"y": [6, 7], "z": ["a", "b"]})
>>> df1.union_by_name(df2).sort("y").show()
╭───────┬───────┬───────┬────────╮
│ x     ┆ y     ┆ w     ┆ z      │
│ ---   ┆ ---   ┆ ---   ┆ ---    │
│ Int64 ┆ Int64 ┆ Int64 ┆ String │
╞═══════╪═══════╪═══════╪════════╡
│ 1     ┆ 4     ┆ 9     ┆ None   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2     ┆ 5     ┆ 10    ┆ None   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ None  ┆ 6     ┆ None  ┆ a      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ None  ┆ 7     ┆ None  ┆ b      │
╰───────┴───────┴───────┴────────╯
(Showing first 4 of 4 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def union_by_name(self, other: "DataFrame") -> "DataFrame":
    """Returns the distinct union by name.

    Args:
        other (DataFrame): The DataFrame to union with this one, matching columns by name.

    Returns:
        DataFrame: A new DataFrame containing the distinct rows from both DataFrames, with columns matched by name.

    Examples:
        >>> import daft
        >>> df1 = daft.from_pydict({"x": [1, 2], "y": [4, 5], "w": [9, 10]})
        >>> df2 = daft.from_pydict({"y": [6, 7], "z": ["a", "b"]})
        >>> df1.union_by_name(df2).sort("y").show()
        ╭───────┬───────┬───────┬────────╮
        │ x     ┆ y     ┆ w     ┆ z      │
        │ ---   ┆ ---   ┆ ---   ┆ ---    │
        │ Int64 ┆ Int64 ┆ Int64 ┆ String │
        ╞═══════╪═══════╪═══════╪════════╡
        │ 1     ┆ 4     ┆ 9     ┆ None   │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     ┆ 10    ┆ None   │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ None  ┆ 6     ┆ None  ┆ a      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ None  ┆ 7     ┆ None  ┆ b      │
        ╰───────┴───────┴───────┴────────╯
        <BLANKLINE>
        (Showing first 4 of 4 rows)
    """
    builder = self._builder.union(other._builder, is_all=False, is_by_name=True)
    return DataFrame(builder)

unique #

unique(*by: ColumnInputType) -> DataFrame

Computes distinct rows, dropping duplicates.

Alias for DataFrame.distinct.

Parameters:

Name Type Description Default
*by Union[str, Expression]

columns to perform distinct on. Defaults to all columns.

()

Returns:

Name Type Description
DataFrame DataFrame

DataFrame that has only distinct rows.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 2], "y": [4, 5, 5], "z": [7, 8, 8]})
>>> distinct_df = df.unique()
>>> distinct_df = distinct_df.sort("x")
>>> distinct_df.show()
╭───────┬───────┬───────╮
│ x     ┆ y     ┆ z     │
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 │
╞═══════╪═══════╪═══════╡
│ 1     ┆ 4     ┆ 7     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     ┆ 8     │
╰───────┴───────┴───────╯
(Showing first 2 of 2 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def unique(self, *by: ColumnInputType) -> "DataFrame":
    """Computes distinct rows, dropping duplicates.

    Alias for [DataFrame.distinct][daft.DataFrame.distinct].

    Args:
        *by (Union[str, Expression]): columns to perform distinct on. Defaults to all columns.

    Returns:
        DataFrame: DataFrame that has only distinct rows.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 2], "y": [4, 5, 5], "z": [7, 8, 8]})
        >>> distinct_df = df.unique()
        >>> distinct_df = distinct_df.sort("x")
        >>> distinct_df.show()
        ╭───────┬───────┬───────╮
        │ x     ┆ y     ┆ z     │
        │ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 │
        ╞═══════╪═══════╪═══════╡
        │ 1     ┆ 4     ┆ 7     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     ┆ 8     │
        ╰───────┴───────┴───────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)
    """
    return self.distinct(*by)
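
Passing column names dedupes on just those columns. Which row survives for each duplicated key is unspecified, so this is only a sketch:

>>> df = daft.from_pydict({"x": [1, 2, 2], "y": [4, 5, 6]})
>>> # Keep one row per distinct value of "x"; the surviving "y" is unspecified
>>> df.unique("x").sort("x").show()  # doctest: +SKIP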

unpivot #

unpivot(ids: ManyColumnsInputType, values: ManyColumnsInputType = [], variable_name: str = 'variable', value_name: str = 'value') -> DataFrame

Unpivots a DataFrame from wide to long format.

Parameters:

Name Type Description Default
ids ManyColumnsInputType

Columns to keep as identifiers

required
values Optional[ManyColumnsInputType]

Columns to unpivot. If not specified, all columns except ids will be unpivoted.

[]
variable_name Optional[str]

Name of the variable column. Defaults to "variable".

'variable'
value_name Optional[str]

Name of the value column. Defaults to "value".

'value'

Returns:

Name Type Description
DataFrame DataFrame

Unpivoted DataFrame

Tip

See also melt

Examples:

>>> import daft
>>> df = daft.from_pydict(
...     {
...         "year": [2020, 2021, 2022],
...         "Jan": [10, 30, 50],
...         "Feb": [20, 40, 60],
...     }
... )
>>> df = df.unpivot("year", ["Jan", "Feb"], variable_name="month", value_name="inventory")
>>> df = df.sort("year")
>>> df.show()
╭───────┬────────┬───────────╮
│ year  ┆ month  ┆ inventory │
│ ---   ┆ ---    ┆ ---       │
│ Int64 ┆ String ┆ Int64     │
╞═══════╪════════╪═══════════╡
│ 2020  ┆ Jan    ┆ 10        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2020  ┆ Feb    ┆ 20        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021  ┆ Jan    ┆ 30        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021  ┆ Feb    ┆ 40        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022  ┆ Jan    ┆ 50        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022  ┆ Feb    ┆ 60        │
╰───────┴────────┴───────────╯
(Showing first 6 of 6 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def unpivot(
    self,
    ids: ManyColumnsInputType,
    values: ManyColumnsInputType = [],
    variable_name: str = "variable",
    value_name: str = "value",
) -> "DataFrame":
    """Unpivots a DataFrame from wide to long format.

    Args:
        ids (ManyColumnsInputType): Columns to keep as identifiers
        values (Optional[ManyColumnsInputType]): Columns to unpivot. If not specified, all columns except ids will be unpivoted.
        variable_name (Optional[str]): Name of the variable column. Defaults to "variable".
        value_name (Optional[str]): Name of the value column. Defaults to "value".

    Returns:
        DataFrame: Unpivoted DataFrame

    Tip:
        See also [melt][daft.DataFrame.melt]

    Examples:
        >>> import daft
        >>> df = daft.from_pydict(
        ...     {
        ...         "year": [2020, 2021, 2022],
        ...         "Jan": [10, 30, 50],
        ...         "Feb": [20, 40, 60],
        ...     }
        ... )
        >>> df = df.unpivot("year", ["Jan", "Feb"], variable_name="month", value_name="inventory")
        >>> df = df.sort("year")
        >>> df.show()
        ╭───────┬────────┬───────────╮
        │ year  ┆ month  ┆ inventory │
        │ ---   ┆ ---    ┆ ---       │
        │ Int64 ┆ String ┆ Int64     │
        ╞═══════╪════════╪═══════════╡
        │ 2020  ┆ Jan    ┆ 10        │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
        │ 2020  ┆ Feb    ┆ 20        │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
        │ 2021  ┆ Jan    ┆ 30        │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
        │ 2021  ┆ Feb    ┆ 40        │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
        │ 2022  ┆ Jan    ┆ 50        │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
        │ 2022  ┆ Feb    ┆ 60        │
        ╰───────┴────────┴───────────╯
        <BLANKLINE>
        (Showing first 6 of 6 rows)

    """
    ids_exprs = column_inputs_to_expressions(ids)
    values_exprs = column_inputs_to_expressions(values)

    builder = self._builder.unpivot(ids_exprs, values_exprs, variable_name, value_name)
    return DataFrame(builder)
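
Since melt is an alias for unpivot, the same reshape can be written as below; a sketch assuming the two methods share a signature:

>>> df = daft.from_pydict({"year": [2020, 2021], "Jan": [10, 30], "Feb": [20, 40]})
>>> df = df.melt("year", ["Jan", "Feb"], variable_name="month", value_name="inventory")  # doctest: +SKIP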

where #

where(predicate: Expression | str) -> DataFrame

Filters rows via a predicate expression, similar to SQL WHERE.

Parameters:

Name Type Description Default
predicate Expression | str

expression that keeps a row if it evaluates to True.

required

Returns:

Name Type Description
DataFrame DataFrame

Filtered DataFrame.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 6, 6], "z": [7, 8, 9]})
>>> df.where((df["x"] > 1) & (df["y"] > 1)).collect()
╭───────┬───────┬───────╮
│ x     ┆ y     ┆ z     │
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 │
╞═══════╪═══════╪═══════╡
│ 2     ┆ 6     ┆ 8     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     ┆ 9     │
╰───────┴───────┴───────╯
(Showing first 2 of 2 rows)

You can also use a string expression as a predicate.

Note: this uses the sql_expr method to parse the string into an expression; it may raise an error if the expression is not yet supported by the SQL engine.

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 9, 9]})
>>> df.where("z = 9 AND y > 5").collect()
╭───────┬───────┬───────╮
│ x     ┆ y     ┆ z     │
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 │
╞═══════╪═══════╪═══════╡
│ 3     ┆ 6     ┆ 9     │
╰───────┴───────┴───────╯
(Showing first 1 of 1 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def where(self, predicate: Expression | str) -> "DataFrame":
    """Filters rows via a predicate expression, similar to SQL ``WHERE``.

    Args:
        predicate (Expression | str): expression that keeps a row if it evaluates to True.

    Returns:
        DataFrame: Filtered DataFrame.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 6, 6], "z": [7, 8, 9]})
        >>> df.where((df["x"] > 1) & (df["y"] > 1)).collect()
        ╭───────┬───────┬───────╮
        │ x     ┆ y     ┆ z     │
        │ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 │
        ╞═══════╪═══════╪═══════╡
        │ 2     ┆ 6     ┆ 8     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     ┆ 9     │
        ╰───────┴───────┴───────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)

        You can also use a string expression as a predicate.

        Note: this uses the `sql_expr` method to parse the string into an expression;
        it may raise an error if the expression is not yet supported by the SQL engine.

        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 9, 9]})
        >>> df.where("z = 9 AND y > 5").collect()
        ╭───────┬───────┬───────╮
        │ x     ┆ y     ┆ z     │
        │ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 │
        ╞═══════╪═══════╪═══════╡
        │ 3     ┆ 6     ┆ 9     │
        ╰───────┴───────┴───────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    if isinstance(predicate, str):
        from daft.sql.sql import sql_expr

        predicate = sql_expr(predicate)
    builder = self._builder.filter(predicate)
    return DataFrame(builder)
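
Predicates can also be built with daft.col, which is equivalent to the bracket-indexing form used above:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 6, 6]})
>>> df.where((daft.col("x") > 1) & (daft.col("y") > 5)).sort("x").to_pydict()
{'x': [2, 3], 'y': [6, 6]}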

with_column #

with_column(column_name: str, expr: Expression) -> DataFrame

Adds a column to the current DataFrame with an Expression, equivalent to a select with all current columns and the new one.

Parameters:

Name Type Description Default
column_name str

name of new column

required
expr Expression

expression of the new column.

required

Returns:

Name Type Description
DataFrame DataFrame

DataFrame with new column.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3]})
>>> new_df = df.with_column("x+1", df["x"] + 1)
>>> new_df.show()
╭───────┬───────╮
│ x     ┆ x+1   │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 2     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 3     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 4     │
╰───────┴───────╯
(Showing first 3 of 3 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def with_column(
    self,
    column_name: str,
    expr: Expression,
) -> "DataFrame":
    """Adds a column to the current DataFrame with an Expression, equivalent to a ``select`` with all current columns and the new one.

    Args:
        column_name (str): name of new column
        expr (Expression): expression of the new column.

    Returns:
        DataFrame: DataFrame with new column.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3]})
        >>> new_df = df.with_column("x+1", df["x"] + 1)
        >>> new_df.show()
        ╭───────┬───────╮
        │ x     ┆ x+1   │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 2     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 3     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 4     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    return self.with_columns({column_name: expr})

with_column_renamed #

with_column_renamed(existing: str, new: str) -> DataFrame

Renames a column in the current DataFrame.

If the column in the DataFrame schema does not exist, this will be a no-op.

Parameters:

Name Type Description Default
existing str

name of the existing column to rename

required
new str

new name for the column

required

Returns:

Name Type Description
DataFrame DataFrame

DataFrame with the column renamed.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> df.with_column_renamed("x", "foo").show()
╭───────┬───────╮
│ foo   ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
╰───────┴───────╯
(Showing first 3 of 3 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def with_column_renamed(self, existing: str, new: str) -> "DataFrame":
    """Renames a column in the current DataFrame.

    If the column in the DataFrame schema does not exist, this will be a no-op.

    Args:
        existing (str): name of the existing column to rename
        new (str): new name for the column

    Returns:
        DataFrame: DataFrame with the column renamed.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> df.with_column_renamed("x", "foo").show()
        ╭───────┬───────╮
        │ foo   ┆ y     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    builder = self._builder.with_column_renamed(existing, new)
    return DataFrame(builder)

with_columns #

with_columns(columns: dict[str, Expression]) -> DataFrame

Adds columns to the current DataFrame with Expressions, equivalent to a select with all current columns and the new ones.

Parameters:

Name Type Description Default
columns Dict[str, Expression]

Dictionary of new columns in the format { name: expression }

required

Returns:

Name Type Description
DataFrame DataFrame

DataFrame with new columns.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> new_df = df.with_columns({"foo": df["x"] + 1, "bar": df["y"] - df["x"]})
>>> new_df.show()
╭───────┬───────┬───────┬───────╮
│ x     ┆ y     ┆ foo   ┆ bar   │
│ ---   ┆ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 ┆ Int64 │
╞═══════╪═══════╪═══════╪═══════╡
│ 1     ┆ 4     ┆ 2     ┆ 3     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     ┆ 3     ┆ 3     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     ┆ 4     ┆ 3     │
╰───────┴───────┴───────┴───────╯
(Showing first 3 of 3 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def with_columns(
    self,
    columns: dict[str, Expression],
) -> "DataFrame":
    """Adds columns to the current DataFrame with Expressions, equivalent to a ``select`` with all current columns and the new ones.

    Args:
        columns (Dict[str, Expression]): Dictionary of new columns in the format { name: expression }

    Returns:
        DataFrame: DataFrame with new columns.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> new_df = df.with_columns({"foo": df["x"] + 1, "bar": df["y"] - df["x"]})
        >>> new_df.show()
        ╭───────┬───────┬───────┬───────╮
        │ x     ┆ y     ┆ foo   ┆ bar   │
        │ ---   ┆ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 ┆ Int64 │
        ╞═══════╪═══════╪═══════╪═══════╡
        │ 1     ┆ 4     ┆ 2     ┆ 3     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     ┆ 3     ┆ 3     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     ┆ 4     ┆ 3     │
        ╰───────┴───────┴───────┴───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    new_columns = [col.alias(name) for name, col in columns.items()]

    builder = self._builder.with_columns(new_columns)
    return DataFrame(builder)

with_columns_renamed #

with_columns_renamed(cols_map: dict[str, str]) -> DataFrame

Renames multiple columns in the current DataFrame.

If the columns in the DataFrame schema do not exist, this will be a no-op.

Parameters:

Name Type Description Default
cols_map Dict[str, str]

Dictionary of columns to rename in the format { existing: new }

required

Returns:

Name Type Description
DataFrame DataFrame

DataFrame with the columns renamed.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> df.with_columns_renamed({"x": "foo", "y": "bar"}).show()
╭───────┬───────╮
│ foo   ┆ bar   │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
╰───────┴───────╯
(Showing first 3 of 3 rows)
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def with_columns_renamed(self, cols_map: dict[str, str]) -> "DataFrame":
    """Renames multiple columns in the current DataFrame.

    If the columns in the DataFrame schema do not exist, this will be a no-op.

    Args:
        cols_map (Dict[str, str]): Dictionary of columns to rename in the format { existing: new }

    Returns:
        DataFrame: DataFrame with the columns renamed.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> df.with_columns_renamed({"x": "foo", "y": "bar"}).show()
        ╭───────┬───────╮
        │ foo   ┆ bar   │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    builder = self._builder.with_columns_renamed(cols_map)
    return DataFrame(builder)

write_bigtable #

write_bigtable(project_id: str, instance_id: str, table_id: str, row_key_column: str, column_family_mappings: dict[str, str], client_kwargs: dict[str, Any] | None = None, write_kwargs: dict[str, Any] | None = None, serialize_incompatible_types: bool = True) -> DataFrame

Write a DataFrame into a Google Cloud Bigtable table.

Bigtable only accepts datatypes that can be converted to bytes in cells (for more details, please consult the Bigtable documentation: https://cloud.google.com/bigtable/docs/overview#data-types). By default, write_bigtable automatically serializes incompatible types to JSON. This can be disabled by setting serialize_incompatible_types=False.

This data sink transforms each row of the dataframe into Bigtable rows. A row key is always required. The row_key_column parameter can be used to specify the column name to use for the row key.

Every column must also belong to a column family. The column_family_mappings parameter can be used to specify the column family to use for each column. For example, if you have a column "name" and a column "age", you can specify a "user_data" column family by passing a dictionary like {"name": "user_data", "age": "user_data"}.

EXPERIMENTAL: This feature is early in development and may change.

Parameters:

Name Type Description Default
project_id str

The Google Cloud project ID.

required
instance_id str

The Bigtable instance ID.

required
table_id str

The table to write to.

required
row_key_column str

Column name for the row key.

required
column_family_mappings dict[str, str]

Mapping of column names to column families.

required
client_kwargs dict[str, Any] | None

Optional dictionary of arguments to pass to the Bigtable Client constructor.

None
write_kwargs dict[str, Any] | None

Optional dictionary of arguments to pass to the Bigtable MutationsBatcher.

None
serialize_incompatible_types bool

Whether to automatically convert non-bytes/int values to Bigtable-compatible formats. If False, will raise an error for unsupported types. Defaults to True.

True
Source code in daft/dataframe/dataframe.py
def write_bigtable(
    self,
    project_id: str,
    instance_id: str,
    table_id: str,
    row_key_column: str,
    column_family_mappings: dict[str, str],
    client_kwargs: dict[str, Any] | None = None,
    write_kwargs: dict[str, Any] | None = None,
    serialize_incompatible_types: bool = True,
) -> "DataFrame":
    """Write a DataFrame into a Google Cloud Bigtable table.

    Bigtable only accepts datatypes that can be converted to bytes in cells (for more details, please consult the Bigtable documentation: https://cloud.google.com/bigtable/docs/overview#data-types).
    By default, `write_bigtable` automatically serializes incompatible types to JSON. This can be disabled by setting `serialize_incompatible_types=False`.

    This data sink transforms each row of the dataframe into Bigtable rows.
    A row key is always required. The `row_key_column` parameter can be used to specify the column name to use for the row key.

    Every column must also belong to a column family. The `column_family_mappings` parameter can be used to specify the column family to use for each column.
    For example, if you have a column "name" and a column "age", you can specify a "user_data" column family by passing a dictionary like {"name": "user_data", "age": "user_data"}.

    EXPERIMENTAL: This feature is early in development and may change.

    Args:
        project_id: The Google Cloud project ID.
        instance_id: The Bigtable instance ID.
        table_id: The table to write to.
        row_key_column: Column name for the row key.
        column_family_mappings: Mapping of column names to column families.
        client_kwargs: Optional dictionary of arguments to pass to the Bigtable Client constructor.
        write_kwargs: Optional dictionary of arguments to pass to the Bigtable MutationsBatcher.
        serialize_incompatible_types: Whether to automatically convert non-bytes/int values to Bigtable-compatible formats.
                                      If False, will raise an error for unsupported types. Defaults to True.
    """
    from daft.io.bigtable.bigtable_data_sink import BigtableDataSink

    sink = BigtableDataSink(
        project_id, instance_id, table_id, row_key_column, column_family_mappings, client_kwargs, write_kwargs
    )

    # Preprocess the DataFrame using the sink's validation and preprocessing logic
    df_to_write = sink._preprocess_dataframe(self, serialize_incompatible_types)

    return df_to_write.write_sink(sink)
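
A hedged invocation sketch; the project, instance, and table identifiers below are hypothetical, and the DataFrame is assumed to have user_id, name, and age columns:

>>> import daft
>>> df = daft.from_pydict({"user_id": ["u1", "u2"], "name": ["Ann", "Bo"], "age": [34, 27]})
>>> df.write_bigtable(  # doctest: +SKIP
...     project_id="my-project",  # hypothetical
...     instance_id="my-instance",  # hypothetical
...     table_id="users",  # hypothetical
...     row_key_column="user_id",
...     column_family_mappings={"name": "user_data", "age": "user_data"},
... )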

write_clickhouse #

write_clickhouse(table: str, *, host: str, port: int | None = None, user: str | None = None, password: str | None = None, database: str | None = None, client_kwargs: dict[str, Any] | None = None, write_kwargs: dict[str, Any] | None = None) -> DataFrame

Writes the DataFrame to a ClickHouse table.

Parameters:

Name Type Description Default
table str

Name of the ClickHouse table to write to.

required
host str

ClickHouse host.

required
port int | None

ClickHouse port.

None
user str | None

ClickHouse user.

None
password str | None

ClickHouse password.

None
database str | None

ClickHouse database.

None
client_kwargs dict[str, Any] | None

Optional dictionary of arguments to pass to the ClickHouse client constructor.

None
write_kwargs dict[str, Any] | None

Optional dictionary of arguments to pass to the ClickHouse write() method.

None

Examples:

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2, 3, 4]})
>>> df.write_clickhouse(table="", host="", port=8123, user="", password="")
╭────────────────────┬─────────────────────╮
│ total_written_rows ┆ total_written_bytes │
│ ---                ┆ ---                 │
│ Int64              ┆ Int64               │
╞════════════════════╪═════════════════════╡
│ 4                  ┆ 32                  │
╰────────────────────┴─────────────────────╯
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def write_clickhouse(
    self,
    table: str,
    *,
    host: str,
    port: int | None = None,
    user: str | None = None,
    password: str | None = None,
    database: str | None = None,
    client_kwargs: dict[str, Any] | None = None,
    write_kwargs: dict[str, Any] | None = None,
) -> "DataFrame":
    """Writes the DataFrame to a ClickHouse table.

    Args:
        table: Name of the ClickHouse table to write to.
        host: ClickHouse host.
        port: ClickHouse port.
        user: ClickHouse user.
        password: ClickHouse password.
        database: ClickHouse database.
        client_kwargs: Optional dictionary of arguments to pass to the ClickHouse client constructor.
        write_kwargs: Optional dictionary of arguments to pass to the ClickHouse write() method.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"a": [1, 2, 3, 4]})  # doctest: +SKIP
        >>> df.write_clickhouse(table="", host="", port=8123, user="", password="")  # doctest: +SKIP
        ╭────────────────────┬─────────────────────╮
        │ total_written_rows ┆ total_written_bytes │
        │ ---                ┆ ---                 │
        │ Int64              ┆ Int64               │
        ╞════════════════════╪═════════════════════╡
        │ 4                  ┆ 32                  │
        ╰────────────────────┴─────────────────────╯
    """
    from daft.io.clickhouse.clickhouse_data_sink import ClickHouseDataSink

    sink = ClickHouseDataSink(
        table,
        host=host,
        port=port,
        user=user,
        password=password,
        database=database,
        client_kwargs=client_kwargs,
        write_kwargs=write_kwargs,
    )
    return self.write_sink(sink)

write_csv #

write_csv(root_dir: str | Path, write_mode: Literal['append', 'overwrite', 'overwrite-partitions'] = 'append', partition_cols: list[ColumnInputType] | None = None, io_config: IOConfig | None = None, delimiter: str | None = None, quote: str | None = None, escape: str | None = None, header: bool | None = True) -> DataFrame

Writes the DataFrame as CSV files, returning a new DataFrame with paths to the files that were written.

Files will be written to <root_dir>/* with randomly generated UUIDs as the file names.

Parameters:

Name Type Description Default
root_dir str

root file path to write CSV files to.

required
write_mode str

Operation mode of the write. append will add new data, overwrite will replace the contents of the root directory with new data. overwrite-partitions will replace only the contents in the partitions that are being written to. Defaults to "append".

'append'
partition_cols Optional[List[ColumnInputType]]

How to subpartition each partition further. Defaults to None.

None
io_config Optional[IOConfig]

configurations to use when interacting with remote storage.

None
delimiter Optional[str]

Single-character field delimiter (default ,).

None
quote Optional[str]

Single-character quote used around fields containing delimiters (default ").

None
escape Optional[str]

Single-character escape for special characters (default \\).

None
header Optional[bool]

Whether to write a header row with column names, default True.

True

Returns:

Name Type Description
DataFrame DataFrame

The filenames that were written out as strings.

Note

This call is blocking and will execute the DataFrame when called

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
>>> df.write_csv("output_dir", write_mode="overwrite")
Tip

See also df.write_parquet() and df.write_json() for writing DataFrames in other formats.

Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def write_csv(
    self,
    root_dir: str | pathlib.Path,
    write_mode: Literal["append", "overwrite", "overwrite-partitions"] = "append",
    partition_cols: list[ColumnInputType] | None = None,
    io_config: IOConfig | None = None,
    delimiter: str | None = None,
    quote: str | None = None,
    escape: str | None = None,
    header: bool | None = True,
) -> "DataFrame":
    r"""Writes the DataFrame as CSV files, returning a new DataFrame with paths to the files that were written.

    Files will be written to `<root_dir>/*` with randomly generated UUIDs as the file names.

    Args:
        root_dir (str): root file path to write CSV files to.
        write_mode (str, optional): Operation mode of the write. `append` will add new data, `overwrite` will replace the contents of the root directory with new data. `overwrite-partitions` will replace only the contents in the partitions that are being written to. Defaults to "append".
        partition_cols (Optional[List[ColumnInputType]], optional): How to subpartition each partition further. Defaults to None.
        io_config (Optional[IOConfig], optional): configurations to use when interacting with remote storage.
        delimiter (Optional[str], optional): Single-character field delimiter (default `,`).
        quote (Optional[str], optional): Single-character quote used around fields containing delimiters (default `"`).
        escape (Optional[str], optional): Single-character escape for special characters (default `\\`).
        header (Optional[bool], optional): Whether to write a header row with column names, default True.

    Returns:
        DataFrame: The filenames that were written out as strings.

    Note:
        This call is **blocking** and will execute the DataFrame when called

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
        >>> df.write_csv("output_dir", write_mode="overwrite")  # doctest: +SKIP

    Tip:
        See also [`df.write_parquet()`][daft.DataFrame.write_parquet] and [`df.write_json()`][daft.DataFrame.write_json]
        for other formats for writing DataFrames

    """
    if write_mode not in ["append", "overwrite", "overwrite-partitions"]:
        raise ValueError(
            f"Only support `append`, `overwrite`, or `overwrite-partitions` mode. {write_mode} is unsupported"
        )
    if write_mode == "overwrite-partitions" and partition_cols is None:
        raise ValueError("Partition columns must be specified to use `overwrite-partitions` mode.")

    io_config = get_context().daft_planning_config.default_io_config if io_config is None else io_config

    cols: list[Expression] | None = None
    if partition_cols is not None:
        cols = column_inputs_to_expressions(tuple(partition_cols))

    file_format_option = PyFormatSinkOption.csv(delimiter=delimiter, quote=quote, escape=escape, header=header)
    builder = self._builder.write_tabular(
        root_dir=root_dir,
        partition_cols=cols,
        write_mode=WriteMode.from_str(write_mode),
        file_format=FileFormat.Csv,
        file_format_option=file_format_option,
        io_config=io_config,
    )

    # Block and write, then retrieve data
    write_df = DataFrame(builder)
    write_df.collect()
    assert write_df._result is not None

    # Populate and return a new disconnected DataFrame
    result_df = DataFrame(write_df._builder)
    result_df._result_cache = write_df._result_cache
    result_df._preview = write_df._preview
    return result_df

write_deltalake #

write_deltalake(table: Union[str, Path, DataCatalogTable, DeltaTable, UnityCatalogTable], partition_cols: list[str] | None = None, mode: Literal['append', 'overwrite', 'error', 'ignore'] = 'append', schema_mode: Literal['merge', 'overwrite'] | None = None, name: str | None = None, description: str | None = None, configuration: Mapping[str, str | None] | None = None, custom_metadata: dict[str, str] | None = None, dynamo_table_name: str | None = None, allow_unsafe_rename: bool = False, io_config: IOConfig | None = None) -> DataFrame

Writes the DataFrame to a Delta Lake table, returning a new DataFrame with the operations that occurred.

Parameters:

Name Type Description Default
table Union[str, Path, DataCatalogTable, DeltaTable, UnityCatalogTable]

Destination Delta Lake Table or table URI to write dataframe to.

required
partition_cols List[str]

How to subpartition each partition further. If table exists, expected to match table's existing partitioning scheme, otherwise creates the table with specified partition columns. Defaults to None.

None
mode str

Operation mode of the write. append will add new data, overwrite will replace table with new data, error will raise an error if table already exists, and ignore will not write anything if table already exists. Defaults to append.

'append'
schema_mode str

Schema mode of the write. If set to overwrite, allows replacing the schema of the table when doing mode=overwrite. Schema mode merge is currently not supported.

None
name str

User-provided identifier for this table.

None
description str

User-provided description for this table.

None
configuration Mapping[str, Optional[str]]

A map containing configuration options for the metadata action.

None
custom_metadata Dict[str, str]

Custom metadata to add to the commit info.

None
dynamo_table_name str

Name of the DynamoDB table to be used as the locking provider if writing to S3.

None
allow_unsafe_rename bool

Whether to allow unsafe rename when writing to S3 or local disk. Defaults to False.

False
io_config IOConfig

configurations to use when interacting with remote storage.

None

Returns:

Name Type Description
DataFrame DataFrame

The operations that occurred with this write.

Note

This call is blocking and will execute the DataFrame when called

Examples:

>>> import daft
>>> import deltalake
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
>>> df.write_deltalake("s3://my-bucket/my-deltalake-table")
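A further sketch, writing to an existing deltalake.DeltaTable object with DynamoDB-based locking for S3 (the table URI and the DynamoDB table name delta_log_lock are hypothetical):

>>> import daft
>>> import deltalake
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
>>> dt = deltalake.DeltaTable("s3://my-bucket/my-deltalake-table")  # doctest: +SKIP
>>> # `dynamo_table_name` selects the DynamoDB locking provider instead of
>>> # falling back to unsafe renames on S3.
>>> df.write_deltalake(dt, mode="append", dynamo_table_name="delta_log_lock")  # doctest: +SKIP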
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def write_deltalake(
    self,
    table: Union[str, pathlib.Path, "DataCatalogTable", "deltalake.DeltaTable", "UnityCatalogTable"],
    partition_cols: list[str] | None = None,
    mode: Literal["append", "overwrite", "error", "ignore"] = "append",
    schema_mode: Literal["merge", "overwrite"] | None = None,
    name: str | None = None,
    description: str | None = None,
    configuration: Mapping[str, str | None] | None = None,
    custom_metadata: dict[str, str] | None = None,
    dynamo_table_name: str | None = None,
    allow_unsafe_rename: bool = False,
    io_config: IOConfig | None = None,
) -> "DataFrame":
    """Writes the DataFrame to a [Delta Lake](https://docs.delta.io/latest/index.html) table, returning a new DataFrame with the operations that occurred.

    Args:
        table (Union[str, pathlib.Path, DataCatalogTable, deltalake.DeltaTable, UnityCatalogTable]): Destination [Delta Lake Table](https://delta-io.github.io/delta-rs/api/delta_table/) or table URI to write dataframe to.
        partition_cols (List[str], optional): How to subpartition each partition further. If table exists, expected to match table's existing partitioning scheme, otherwise creates the table with specified partition columns. Defaults to None.
        mode (str, optional): Operation mode of the write. `append` will add new data, `overwrite` will replace table with new data, `error` will raise an error if table already exists, and `ignore` will not write anything if table already exists. Defaults to `append`.
        schema_mode (str, optional): Schema mode of the write. If set to `overwrite`, allows replacing the schema of the table when doing `mode=overwrite`. Schema mode `merge` is currently not supported.
        name (str, optional): User-provided identifier for this table.
        description (str, optional): User-provided description for this table.
        configuration (Mapping[str, Optional[str]], optional): A map containing configuration options for the metadata action.
        custom_metadata (Dict[str, str], optional): Custom metadata to add to the commit info.
        dynamo_table_name (str, optional): Name of the DynamoDB table to be used as the locking provider if writing to S3.
        allow_unsafe_rename (bool, optional): Whether to allow unsafe rename when writing to S3 or local disk. Defaults to False.
        io_config (IOConfig, optional): configurations to use when interacting with remote storage.

    Returns:
        DataFrame: The operations that occurred with this write.

    Note:
        This call is **blocking** and will execute the DataFrame when called

    Examples:
        >>> import daft
        >>> import deltalake
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
        >>> df.write_deltalake("s3://my-bucket/my-deltalake-table")  # doctest: +SKIP
    """
    import json

    import deltalake
    import pyarrow as pa
    from deltalake.exceptions import TableNotFoundError
    from packaging.version import parse

    from daft import from_pydict
    from daft.dependencies import unity_catalog
    from daft.filesystem import get_protocol_from_path
    from daft.io import DataCatalogTable
    from daft.io.delta_lake._deltalake import delta_schema_to_pyarrow
    from daft.io.delta_lake.delta_lake_write import (
        AddAction,
        convert_pa_schema_to_delta,
        create_table_with_add_actions,
    )
    from daft.io.object_store_options import io_config_to_storage_options

    def _create_metadata_param(metadata: dict[str, str] | None) -> Any:
        """From deltalake>=0.20.0 onwards, custom_metadata has to be passed as CommitProperties.

        Args:
            metadata

        Returns:
            DataFrame: metadata for deltalake<0.20.0, otherwise CommitProperties with custom_metadata
        """
        if parse(deltalake.__version__) < parse("0.20.0"):
            return metadata
        else:
            from deltalake import CommitProperties

            return CommitProperties(custom_metadata=metadata)

    if schema_mode == "merge":
        raise ValueError("Schema mode' merge' is not currently supported for write_deltalake.")

    if parse(deltalake.__version__) < parse("0.14.0"):
        raise ValueError(f"Write delta lake is only supported on deltalake>=0.14.0, found {deltalake.__version__}")

    io_config = get_context().daft_planning_config.default_io_config if io_config is None else io_config

    # Retrieve table_uri and storage_options from various backends
    table_uri: str
    storage_options: dict[str, str]

    if isinstance(table, deltalake.DeltaTable):
        table_uri = table.table_uri
        storage_options = table._storage_options or {}
        new_storage_options = io_config_to_storage_options(io_config, table_uri)
        storage_options.update(new_storage_options or {})
    else:
        if isinstance(table, str):
            table_uri = os.path.expanduser(table)
        elif isinstance(table, pathlib.Path):
            table_uri = str(table)
        elif unity_catalog.module_available() and isinstance(table, unity_catalog.UnityCatalogTable):
            table_uri = table.table_uri
            io_config = table.io_config
        elif isinstance(table, DataCatalogTable):
            table_uri = table.table_uri(io_config)
        else:
            raise ValueError(f"Expected table to be a path or a DeltaTable, received: {type(table)}")

        if io_config is None:
            raise ValueError(
                "io_config was not provided to write_deltalake and could not be retrieved from defaults."
            )

        storage_options = io_config_to_storage_options(io_config, table_uri) or {}
        try:
            table = deltalake.DeltaTable(table_uri, storage_options=storage_options)
        except TableNotFoundError:
            table = None

    # see: https://delta-io.github.io/delta-rs/usage/writing/writing-to-s3-with-locking-provider/
    scheme = get_protocol_from_path(table_uri)
    if scheme == "s3" or scheme == "s3a":
        if dynamo_table_name is not None:
            storage_options["AWS_S3_LOCKING_PROVIDER"] = "dynamodb"
            storage_options["DELTA_DYNAMO_TABLE_NAME"] = dynamo_table_name
        else:
            storage_options["AWS_S3_ALLOW_UNSAFE_RENAME"] = "true"

            if not allow_unsafe_rename:
                warnings.warn("No DynamoDB table specified for Delta Lake locking. Defaulting to unsafe writes.")
    elif scheme == "file":
        if allow_unsafe_rename:
            storage_options["MOUNT_ALLOW_UNSAFE_RENAME"] = "true"

    pyarrow_schema = pa.schema((f.name, f.dtype.to_arrow_dtype()) for f in self.schema())

    large_dtypes = True
    delta_schema = convert_pa_schema_to_delta(pyarrow_schema, large_dtypes=large_dtypes)

    if table:
        if partition_cols and partition_cols != table.metadata().partition_columns:
            raise ValueError(
                f"Expected partition columns to match that of the existing table ({table.metadata().partition_columns}), but received: {partition_cols}"
            )
        else:
            partition_cols = table.metadata().partition_columns

        table.update_incremental()

        table_schema = delta_schema_to_pyarrow(table.schema())
        if Schema.from_pyarrow_schema(delta_schema) != Schema.from_pyarrow_schema(table_schema) and not (
            mode == "overwrite" and schema_mode == "overwrite"
        ):
            raise ValueError(
                "Schema of data does not match table schema\n"
                f"Data schema:\n{delta_schema}\nTable Schema:\n{table_schema}"
            )
        if mode == "error":
            raise AssertionError("Delta table already exists, write mode set to error.")
        elif mode == "ignore":
            return from_pydict(
                {
                    "operation": pa.array([], type=pa.string()),
                    "rows": pa.array([], type=pa.int64()),
                    "file_size": pa.array([], type=pa.int64()),
                    "file_name": pa.array([], type=pa.string()),
                }
            )
        version = table.version() + 1
    else:
        version = 0

    if partition_cols is not None:
        for c in partition_cols:
            if self.schema()[c].dtype == DataType.binary():
                raise NotImplementedError("Binary partition columns are not yet supported for Delta Lake writes")

    builder = self._builder.write_deltalake(
        table_uri,
        mode,
        version,
        large_dtypes,
        io_config=io_config,
        partition_cols=partition_cols,
    )
    write_df = DataFrame(builder)
    write_df.collect()

    write_result = write_df.to_pydict()
    assert "add_action" in write_result
    add_actions: list[AddAction] = write_result["add_action"]

    operations = []
    paths = []
    rows = []
    sizes = []

    for add_action in add_actions:
        stats = json.loads(add_action.stats)
        operations.append("ADD")
        paths.append(add_action.path)
        rows.append(stats["numRecords"])
        sizes.append(add_action.size)

    if table is None:
        create_table_with_add_actions(
            table_uri,
            delta_schema,
            add_actions,
            mode,
            partition_cols or [],
            name,
            description,
            configuration,
            storage_options,
            custom_metadata,
        )
    else:
        if mode == "overwrite":
            old_actions = pa.record_batch(table.get_add_actions())
            old_actions_dict = old_actions.to_pydict()
            for i in range(old_actions.num_rows):
                operations.append("DELETE")
                paths.append(old_actions_dict["path"][i])
                rows.append(old_actions_dict["num_records"][i])
                sizes.append(old_actions_dict["size_bytes"][i])

        metadata_param = _create_metadata_param(custom_metadata)
        if parse(deltalake.__version__) < parse("1.0.0"):
            table._table.create_write_transaction(
                add_actions, mode, partition_cols or [], delta_schema, None, metadata_param
            )
        else:
            table._table.create_write_transaction(
                add_actions,
                mode,
                partition_cols or [],
                deltalake.Schema.from_arrow(delta_schema),
                None,
                metadata_param,
            )
        table.update_incremental()

    with_operations = from_pydict(
        {
            "operation": pa.array(operations, type=pa.string()),
            "rows": pa.array(rows, type=pa.int64()),
            "file_size": pa.array(sizes, type=pa.int64()),
            "file_name": pa.array([os.path.basename(fp) for fp in paths], type=pa.string()),
        }
    )

    return with_operations

write_huggingface #

write_huggingface(repo: str, split: str = 'train', data_dir: str = 'data', revision: str = 'main', overwrite: bool = False, commit_message: str = 'Upload dataset using Daft', commit_description: str | None = None, io_config: IOConfig | None = None) -> DataFrame

Write a DataFrame into a Hugging Face dataset.

Parameters:

Name Type Description Default
repo str

The ID of the repository to push to in the following format: <user>/<dataset_name> or <org>/<dataset_name>.

required
split str

The name of the split that will be given to that dataset.

'train'
data_dir str

Directory of the uploaded data files.

'data'
revision str

Branch to push the uploaded files to.

'main'
overwrite bool

Whether to overwrite or append.

False
commit_message str

Message to commit while pushing.

'Upload dataset using Daft'
commit_description str | None

Description of the commit that will be created.

None
io_config IOConfig | None

Configurations to use when interacting with remote storage.

None
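A minimal usage sketch (the repo ID my-org/my-dataset is hypothetical; a Hugging Face token is assumed to be available through the default IOConfig):

>>> import daft
>>> df = daft.from_pydict({"text": ["hello", "world"], "label": [0, 1]})
>>> # Uploads data files under the `data` directory on the `main` branch,
>>> # replacing existing files for the split because overwrite=True.
>>> df.write_huggingface("my-org/my-dataset", split="train", overwrite=True)  # doctest: +SKIP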
Source code in daft/dataframe/dataframe.py
def write_huggingface(
    self,
    repo: str,
    split: str = "train",
    data_dir: str = "data",
    revision: str = "main",
    overwrite: bool = False,
    commit_message: str = "Upload dataset using Daft",
    commit_description: str | None = None,
    io_config: IOConfig | None = None,
) -> "DataFrame":
    """Write a DataFrame into a Hugging Face dataset.

    Args:
        repo: The ID of the repository to push to in the following format: `<user>/<dataset_name>` or `<org>/<dataset_name>`.
        split: The name of the split that will be given to that dataset.
        data_dir: Directory of the uploaded data files.
        revision: Branch to push the uploaded files to.
        overwrite: Whether to overwrite or append.
        commit_message: Message to commit while pushing.
        commit_description: Description of the commit that will be created.
        io_config: Configurations to use when interacting with remote storage.
    """
    from daft.io.huggingface.sink import HuggingFaceSink

    io_config = get_context().daft_planning_config.default_io_config if io_config is None else io_config

    sink = HuggingFaceSink(
        repo, split, data_dir, revision, overwrite, commit_message, commit_description, io_config.hf
    )
    return self.write_sink(sink)

write_iceberg #

write_iceberg(table: Table, mode: str = 'append', io_config: IOConfig | None = None) -> DataFrame

Writes the DataFrame to an Iceberg table, returning a new DataFrame with the operations that occurred.

Can be run in either append or overwrite mode, which will either append the rows in the DataFrame or delete the existing rows and then append the DataFrame rows, respectively.

Parameters:

Name Type Description Default
table Table

Destination PyIceberg Table to write dataframe to.

required
mode str

Operation mode of the write: append to or overwrite the Iceberg Table. Defaults to append.

'append'
io_config IOConfig

A custom IOConfig to use when accessing Iceberg object storage data. If provided, configurations set in table are ignored.

None

Returns:

Name Type Description
DataFrame DataFrame

The operations that occurred with this write.

Note

This call is blocking and will execute the DataFrame when called

Examples:

>>> import pyiceberg
>>> import daft
>>>
>>> table = pyiceberg.Table(...)
>>> df = daft.from_pydict({"user_id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
>>> df = df.write_iceberg(table, mode="overwrite")
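Since PyIceberg tables are usually obtained from a catalog rather than constructed directly, a more concrete sketch might look like the following (the catalog name and table identifier are hypothetical):

>>> import daft
>>> from pyiceberg.catalog import load_catalog
>>> catalog = load_catalog("my_catalog")  # doctest: +SKIP
>>> table = catalog.load_table("db.users")  # doctest: +SKIP
>>> df = daft.from_pydict({"user_id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
>>> # Returns a DataFrame describing the ADD (and, for overwrite, DELETE)
>>> # operations performed by the write.
>>> df.write_iceberg(table, mode="append")  # doctest: +SKIP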
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def write_iceberg(
    self, table: "pyiceberg.table.Table", mode: str = "append", io_config: IOConfig | None = None
) -> "DataFrame":
    """Writes the DataFrame to an [Iceberg](https://iceberg.apache.org/docs/nightly/) table, returning a new DataFrame with the operations that occurred.

    Can be run in either `append` or `overwrite` mode, which will either append the rows in the DataFrame or delete the existing rows and then append the DataFrame rows, respectively.

    Args:
        table (pyiceberg.table.Table): Destination [PyIceberg Table](https://py.iceberg.apache.org/reference/pyiceberg/table/#pyiceberg.table.Table) to write dataframe to.
        mode (str, optional): Operation mode of the write. `append` or `overwrite` Iceberg Table. Defaults to `append`.
        io_config (IOConfig, optional): A custom IOConfig to use when accessing Iceberg object storage data. If provided, configurations set in `table` are ignored.

    Returns:
        DataFrame: The operations that occurred with this write.

    Note:
        This call is **blocking** and will execute the DataFrame when called

    Examples:
        >>> import pyiceberg
        >>> import daft
        >>>
        >>> table = pyiceberg.Table(...)  # doctest: +SKIP
        >>> df = daft.from_pydict({"user_id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
        >>> df = df.write_iceberg(table, mode="overwrite")  # doctest: +SKIP

    """
    import pyarrow as pa
    import pyiceberg
    from packaging.version import parse

    from daft.io.iceberg._iceberg import _convert_iceberg_file_io_properties_to_io_config

    if len(table.spec().fields) > 0 and parse(pyiceberg.__version__) < parse("0.7.0"):
        raise ValueError("pyiceberg>=0.7.0 is required to write to a partitioned table")

    if parse(pyiceberg.__version__) < parse("0.6.0"):
        raise ValueError(f"Write Iceberg is only supported on pyiceberg>=0.6.0, found {pyiceberg.__version__}")

    if parse(pa.__version__) < parse("12.0.1"):
        raise ValueError(
            f"Write Iceberg is only supported on pyarrow>=12.0.1, found {pa.__version__}. See this issue for more information: https://github.com/apache/arrow/issues/37054#issuecomment-1668644887"
        )

    if mode not in ["append", "overwrite"]:
        raise ValueError(f"Only support `append` or `overwrite` mode. {mode} is unsupported")

    io_config = (
        _convert_iceberg_file_io_properties_to_io_config(table.io.properties) if io_config is None else io_config
    )
    io_config = get_context().daft_planning_config.default_io_config if io_config is None else io_config

    operations = []
    path = []
    rows = []
    size = []

    builder = self._builder.write_iceberg(table, io_config)
    write_df = DataFrame(builder)
    write_df.collect()

    write_result = write_df.to_pydict()
    assert "data_file" in write_result
    data_files = write_result["data_file"]

    if mode == "overwrite":
        deleted_files = table.scan().plan_files()
    else:
        deleted_files = []

    schema = table.schema()
    partitioning: dict[str, list[Any]] = {
        schema.find_field(field.source_id).name: [] for field in table.spec().fields
    }

    for data_file in data_files:
        operations.append("ADD")
        path.append(data_file.file_path)
        rows.append(data_file.record_count)
        size.append(data_file.file_size_in_bytes)

        for field in partitioning.keys():
            partitioning[field].append(getattr(data_file.partition, field, None))

    for pf in deleted_files:
        data_file = pf.file
        operations.append("DELETE")
        path.append(data_file.file_path)
        rows.append(data_file.record_count)
        size.append(data_file.file_size_in_bytes)

        for field in partitioning.keys():
            partitioning[field].append(getattr(data_file.partition, field, None))

    if parse(pyiceberg.__version__) >= parse("0.7.0"):
        from pyiceberg.table import ALWAYS_TRUE, TableProperties

        if parse(pyiceberg.__version__) >= parse("0.8.0"):
            from pyiceberg.utils.properties import property_as_bool

            property_as_bool = property_as_bool
        else:
            from pyiceberg.table import PropertyUtil

            property_as_bool = PropertyUtil.property_as_bool

        tx = table.transaction()

        if mode == "overwrite":
            tx.delete(delete_filter=ALWAYS_TRUE)

        update_snapshot = tx.update_snapshot()

        manifest_merge_enabled = mode == "append" and property_as_bool(
            tx.table_metadata.properties,
            TableProperties.MANIFEST_MERGE_ENABLED,
            TableProperties.MANIFEST_MERGE_ENABLED_DEFAULT,
        )

        append_method = update_snapshot.merge_append if manifest_merge_enabled else update_snapshot.fast_append

        with append_method() as append_files:
            for data_file in data_files:
                append_files.append_data_file(data_file)

        tx.commit_transaction()
    else:
        from pyiceberg.table import _MergingSnapshotProducer
        from pyiceberg.table.snapshots import Operation

        operations_map = {
            "append": Operation.APPEND,
            "overwrite": Operation.OVERWRITE,
        }

        merge = _MergingSnapshotProducer(operation=operations_map[mode], table=table)

        for data_file in data_files:
            merge.append_data_file(data_file)

        merge.commit()

    with_operations = {
        "operation": pa.array(operations, type=pa.string()),
        "rows": pa.array(rows, type=pa.int64()),
        "file_size": pa.array(size, type=pa.int64()),
        "file_name": pa.array([fp for fp in path], type=pa.string()),
    }

    if partitioning:
        with_operations["partitioning"] = pa.StructArray.from_arrays(
            partitioning.values(), names=partitioning.keys()
        )

    from daft import from_pydict

    # NOTE: We are losing the history of the plan here.
    # This is due to the fact that the logical plan of the write_iceberg returns datafiles but we want to return the above data
    return from_pydict(with_operations)

write_json #

write_json(root_dir: str | Path, write_mode: Literal['append', 'overwrite', 'overwrite-partitions'] = 'append', partition_cols: list[ColumnInputType] | None = None, io_config: IOConfig | None = None) -> DataFrame

Writes the DataFrame as JSON files, returning a new DataFrame with paths to the files that were written.

Files will be written to <root_dir>/* with randomly generated UUIDs as the file names.

Parameters:

Name Type Description Default
root_dir str

root file path to write JSON files to.

required
write_mode str

Operation mode of the write. append will add new data, overwrite will replace the contents of the root directory with new data. overwrite-partitions will replace only the contents in the partitions that are being written to. Defaults to "append".

'append'
partition_cols Optional[List[ColumnInputType]]

How to subpartition each partition further. Defaults to None.

None
io_config Optional[IOConfig]

configurations to use when interacting with remote storage.

None

Returns:

Name Type Description
DataFrame DataFrame

The filenames that were written out as strings.

Note

This call is blocking and will execute the DataFrame when called

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
>>> df.write_json("output_dir", write_mode="overwrite")
Warning

Currently only supported with the Native runner!

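A sketch of a partitioned JSON write on the Native runner (the column and directory names are hypothetical):

>>> import daft
>>> df = daft.from_pydict({"date": ["2024-01-01", "2024-01-02"], "value": [1, 2]})
>>> # Files are grouped by the `date` partition column under output_dir.
>>> df.write_json("output_dir", partition_cols=["date"])  # doctest: +SKIP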
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def write_json(
    self,
    root_dir: str | pathlib.Path,
    write_mode: Literal["append", "overwrite", "overwrite-partitions"] = "append",
    partition_cols: list[ColumnInputType] | None = None,
    io_config: IOConfig | None = None,
) -> "DataFrame":
    """Writes the DataFrame as JSON files, returning a new DataFrame with paths to the files that were written.

    Files will be written to `<root_dir>/*` with randomly generated UUIDs as the file names.

    Args:
        root_dir (str): root file path to write JSON files to.
        write_mode (str, optional): Operation mode of the write. `append` will add new data, `overwrite` will replace the contents of the root directory with new data. `overwrite-partitions` will replace only the contents in the partitions that are being written to. Defaults to "append".
        partition_cols (Optional[List[ColumnInputType]], optional): How to subpartition each partition further. Defaults to None.
        io_config (Optional[IOConfig], optional): configurations to use when interacting with remote storage.

    Returns:
        DataFrame: The filenames that were written out as strings.

    Note:
        This call is **blocking** and will execute the DataFrame when called

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
        >>> df.write_json("output_dir", write_mode="overwrite")  # doctest: +SKIP

    Warning:
        Currently only supported with the Native runner!
    """
    if write_mode not in ["append", "overwrite", "overwrite-partitions"]:
        raise ValueError(
            f"Only support `append`, `overwrite`, or `overwrite-partitions` mode. {write_mode} is unsupported"
        )
    if write_mode == "overwrite-partitions" and partition_cols is None:
        raise ValueError("Partition columns must be specified to use `overwrite-partitions` mode.")

    io_config = get_context().daft_planning_config.default_io_config if io_config is None else io_config

    cols: list[Expression] | None = None
    if partition_cols is not None:
        cols = column_inputs_to_expressions(tuple(partition_cols))

    builder = self._builder.write_tabular(
        root_dir=root_dir,
        partition_cols=cols,
        write_mode=WriteMode.from_str(write_mode),
        file_format=FileFormat.Json,
        io_config=io_config,
    )
    # Block and write, then retrieve data
    write_df = DataFrame(builder)
    write_df.collect()
    assert write_df._result is not None

    # Populate and return a new disconnected DataFrame
    result_df = DataFrame(write_df._builder)
    result_df._result_cache = write_df._result_cache
    result_df._preview = write_df._preview
    return result_df

write_lance #

write_lance(uri: str | Path, mode: Literal['create', 'append', 'overwrite', 'merge'] = 'create', io_config: IOConfig | None = None, schema: Union[Schema, Schema] | None = None, left_on: str | None = None, right_on: str | None = None, **kwargs: Any) -> DataFrame

Writes the DataFrame to a Lance table.

Parameters:

Name Type Description Default
uri str | Path

The URI of the Lance table to write to

required
mode Literal['create', 'append', 'overwrite', 'merge']

The write mode. One of "create", "append", "overwrite", or "merge".

'create'
io_config IOConfig

configurations to use when interacting with remote storage.

None
schema Schema | Schema

Desired schema to enforce during write.

- If omitted, Daft will use the DataFrame's current schema.
- If a pyarrow.Schema is provided, Daft will enforce the field order, types, and nullability by casting the data to the provided schema prior to write. Table-level (dataset) metadata present on the pyarrow schema is preserved during create/overwrite.
- If the target Lance dataset already exists, the data will be cast to the existing table schema to ensure compatibility unless mode="overwrite".

None
left_on/right_on Optional[str]

Only supported in mode="merge". Specify the join key for aligning rows when merging new columns.

- If omitted, defaults to "_rowaddr".
- If right_on is omitted, it defaults to the value of left_on.
- The DataFrame passed to write_lance(mode="merge") must contain fragment_id and the join key column specified by right_on (or _rowaddr by default).

None
**kwargs Any

Additional keyword arguments to pass to the Lance writer.

{}

Returns:

Name Type Description
DataFrame DataFrame

A DataFrame containing metadata about the written Lance table, such as number of fragments, number of deleted rows, number of small files, and version.

Raises:

Type Description
TypeError

If schema is provided but not a Daft Schema or a pyarrow.Schema

ValueError

When appending and the data schema cannot be cast to the existing table schema

Examples:

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2, 3, 4]})
>>> df.write_lance("/tmp/lance/my_table.lance")
╭───────────────┬──────────────────┬─────────────────┬─────────╮
│ num_fragments ┆ num_deleted_rows ┆ num_small_files ┆ version │
│ ---           ┆ ---              ┆ ---             ┆ ---     │
│ Int64         ┆ Int64            ┆ Int64           ┆ Int64   │
╞═══════════════╪══════════════════╪═════════════════╪═════════╡
│ 1             ┆ 0                ┆ 1               ┆ 1       │
╰───────────────┴──────────────────┴─────────────────┴─────────╯
(Showing first 1 of 1 rows)
>>> daft.read_lance("/tmp/lance/my_table.lance").collect()
╭───────╮
│ a     │
│ ---   │
│ Int64 │
╞═══════╡
│ 1     │
├╌╌╌╌╌╌╌┤
│ 2     │
├╌╌╌╌╌╌╌┤
│ 3     │
├╌╌╌╌╌╌╌┤
│ 4     │
╰───────╯
(Showing first 4 of 4 rows)
>>> # Pass additional keyword arguments to the Lance writer
>>> # All additional keyword arguments are passed to `lance.write_fragments`
>>> df.write_lance("/tmp/lance/my_table.lance", mode="overwrite", max_bytes_per_file=1024)
╭───────────────┬──────────────────┬─────────────────┬─────────╮
│ num_fragments ┆ num_deleted_rows ┆ num_small_files ┆ version │
│ ---           ┆ ---              ┆ ---             ┆ ---     │
│ Int64         ┆ Int64            ┆ Int64           ┆ Int64   │
╞═══════════════╪══════════════════╪═════════════════╪═════════╡
│ 1             ┆ 0                ┆ 1               ┆ 2       │
╰───────────────┴──────────────────┴─────────────────┴─────────╯
(Showing first 1 of 1 rows)
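The merge mode adds new columns to an existing dataset by joining on a key column. A hedged sketch, assuming the dataset above is read back with _rowaddr enabled so that fragment_id and the default join key are present:

>>> import daft
>>> from daft import col
>>> df = daft.read_lance(
...     "/tmp/lance/my_table.lance",
...     default_scan_options={"with_rowaddr": True},
... )  # doctest: +SKIP
>>> df = df.with_column("a_doubled", col("a") * 2)  # doctest: +SKIP
>>> # Aligns rows on `_rowaddr` within each fragment and adds only the new column.
>>> df.write_lance("/tmp/lance/my_table.lance", mode="merge")  # doctest: +SKIP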
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def write_lance(
    self,
    uri: str | pathlib.Path,
    mode: Literal["create", "append", "overwrite", "merge"] = "create",
    io_config: IOConfig | None = None,
    schema: Union[Schema, "pyarrow.Schema"] | None = None,
    left_on: str | None = None,
    right_on: str | None = None,
    **kwargs: Any,
) -> "DataFrame":
    """Writes the DataFrame to a Lance table.

    Args:
      uri: The URI of the Lance table to write to
      mode: The write mode. One of "create", "append", "overwrite", or "merge".
      - "create" will create the dataset if it does not exist, otherwise raise an error.
      - "append" will append to the existing dataset if it exists, otherwise raise an error.
      - "overwrite" will overwrite the existing dataset if it exists, otherwise raise an error.
      - "merge" will add new columns to the existing dataset.
      io_config (IOConfig, optional): configurations to use when interacting with remote storage.
      schema (Schema | pyarrow.Schema, optional): Desired schema to enforce during write.
        - If omitted, Daft will use the DataFrame's current schema.
        - If a pyarrow.Schema is provided, Daft will enforce the field order, types, and nullability
          by casting the data to the provided schema prior to write. Table-level (dataset) metadata present
          on the pyarrow schema is preserved during create/overwrite.
        - If the target Lance dataset already exists, the data will be cast to the existing table schema
          to ensure compatibility unless ``mode="overwrite"``.
      left_on/right_on (Optional[str]): Only supported in ``mode="merge"``. Specify the join key for aligning rows when merging new columns.
          - If omitted, defaults to ``"_rowaddr"``.
          - If ``right_on`` is omitted, it defaults to the value of ``left_on``.
          - The DataFrame passed to ``write_lance(mode="merge")`` must contain ``fragment_id`` and the join key column specified by ``right_on`` (or ``_rowaddr`` by default).
      **kwargs: Additional keyword arguments to pass to the Lance writer.

    Returns:
        DataFrame: A DataFrame containing metadata about the written Lance table, such as number of fragments, number of deleted rows, number of small files, and version.

    Raises:
        TypeError: If ``schema`` is provided but not a Daft Schema or a pyarrow.Schema
        ValueError: When appending and the data schema cannot be cast to the existing table schema

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"a": [1, 2, 3, 4]})
        >>> df.write_lance("/tmp/lance/my_table.lance")  # doctest: +SKIP
        ╭───────────────┬──────────────────┬─────────────────┬─────────╮
        │ num_fragments ┆ num_deleted_rows ┆ num_small_files ┆ version │
        │ ---           ┆ ---              ┆ ---             ┆ ---     │
        │ Int64         ┆ Int64            ┆ Int64           ┆ Int64   │
        ╞═══════════════╪══════════════════╪═════════════════╪═════════╡
        │ 1             ┆ 0                ┆ 1               ┆ 1       │
        ╰───────────────┴──────────────────┴─────────────────┴─────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
        >>> daft.read_lance("/tmp/lance/my_table.lance").collect()  # doctest: +SKIP
        ╭───────╮
        │ a     │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 1     │
        ├╌╌╌╌╌╌╌┤
        │ 2     │
        ├╌╌╌╌╌╌╌┤
        │ 3     │
        ├╌╌╌╌╌╌╌┤
        │ 4     │
        ╰───────╯
        <BLANKLINE>
        (Showing first 4 of 4 rows)
        >>> # Pass additional keyword arguments to the Lance writer
        >>> # All additional keyword arguments are passed to `lance.write_fragments`
        >>> df.write_lance("/tmp/lance/my_table.lance", mode="overwrite", max_bytes_per_file=1024)  # doctest: +SKIP
        ╭───────────────┬──────────────────┬─────────────────┬─────────╮
        │ num_fragments ┆ num_deleted_rows ┆ num_small_files ┆ version │
        │ ---           ┆ ---              ┆ ---             ┆ ---     │
        │ Int64         ┆ Int64            ┆ Int64           ┆ Int64   │
        ╞═══════════════╪══════════════════╪═════════════════╪═════════╡
        │ 1             ┆ 0                ┆ 1               ┆ 2       │
        ╰───────────────┴──────────────────┴─────────────────┴─────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    from daft import context as _context
    from daft.io.lance.lance_data_sink import LanceDataSink
    from daft.io.object_store_options import io_config_to_storage_options

    if schema is None:
        schema = self.schema()

    # Non-merge modes do not support schema evolution or custom join keys
    if mode != "merge":
        sanitized_kwargs = {k: v for k, v in kwargs.items() if k not in ("left_on", "right_on")}
        sink = LanceDataSink(uri, schema, mode, io_config, **sanitized_kwargs)
        return self.write_sink(sink)

    # Merge mode semantics
    try:
        import lance
    except ImportError as e:
        raise ImportError(
            "Unable to import the `lance` package, please ensure that Daft is installed with the lance extra dependency: `pip install daft[lance]`"
        ) from e

    io_config = _context.get_context().daft_planning_config.default_io_config if io_config is None else io_config
    storage_options = io_config_to_storage_options(io_config, str(uri) if isinstance(uri, pathlib.Path) else uri)

    # Attempt to load dataset; if not exists, behave like create
    lance_ds = None
    try:
        lance_ds = lance.dataset(uri, storage_options=storage_options)
    except (ValueError, FileNotFoundError, OSError) as _e:
        lance_ds = None

    if lance_ds is None:
        sanitized_kwargs = {k: v for k, v in kwargs.items() if k not in ("left_on", "right_on")}
        sink = LanceDataSink(uri, schema, "create", io_config, **sanitized_kwargs)
        return self.write_sink(sink)

    # Dataset exists: detect schema evolution by checking new columns in incoming DF
    existing_fields: set[str] = set()
    try:
        existing_fields = {getattr(f, "name", str(f)) for f in lance_ds.schema}
    except Exception:
        names = []
        try:
            names = list(getattr(lance_ds.schema, "names", []))
        except Exception:
            try:
                names = [getattr(f, "name", str(f)) for f in getattr(lance_ds.schema, "fields", [])]
            except Exception:
                names = []
        existing_fields = set(names)

    meta_exclusions = {"fragment_id", "_rowaddr", "_rowid"}
    new_cols = [c for c in self.column_names if c not in existing_fields and c not in meta_exclusions]

    if len(new_cols) == 0:
        # Pure append: no schema evolution. Ensure merge-specific params are not forwarded.
        sanitized_kwargs = {k: v for k, v in kwargs.items() if k not in ("left_on", "right_on")}

        sink = LanceDataSink(uri, schema, "append", io_config, **sanitized_kwargs)
        return self.write_sink(sink)

    # Schema evolution: route to per-fragment merge keyed by provided business key or default '_rowaddr'
    join_left = left_on or "_rowaddr"
    join_right = right_on or join_left
    if "fragment_id" not in self.column_names:
        raise ValueError(
            "DataFrame must contain 'fragment_id' column for per-fragment merge in mode='merge'. Read from Lance to include 'fragment_id'."
        )
    if join_right not in self.column_names:
        hint = (
            " Read from Lance with default_scan_options={'with_rowaddr': True} to include '_rowaddr'."
            if join_right == "_rowaddr"
            else ""
        )
        raise ValueError(
            f"DataFrame must contain join key column '{join_right}' for per-fragment merge in mode='merge'." + hint
        )

    from daft.io.lance.lance_merge_column import merge_columns_from_df

    merge_columns_from_df(
        df=self,
        lance_ds=lance_ds,
        uri=uri,
        left_on=join_left,
        right_on=join_right,
        storage_options=storage_options,
    )

    # Build and return stats DataFrame similar to sink.finalize
    dataset = lance.dataset(uri, storage_options=storage_options)
    stats = dataset.stats.dataset_stats()
    from daft.dependencies import pa as _pa
    from daft.recordbatch import MicroPartition

    return DataFrame._from_micropartitions(
        MicroPartition.from_pydict(
            {
                "num_fragments": _pa.array([stats["num_fragments"]], type=_pa.int64()),
                "num_deleted_rows": _pa.array([stats["num_deleted_rows"]], type=_pa.int64()),
                "num_small_files": _pa.array([stats["num_small_files"]], type=_pa.int64()),
                "version": _pa.array([dataset.version], type=_pa.int64()),
            }
        )
    )

write_parquet #

write_parquet(root_dir: str | Path, compression: str = 'snappy', write_mode: Literal['append', 'overwrite', 'overwrite-partitions'] = 'append', partition_cols: list[ColumnInputType] | None = None, io_config: IOConfig | None = None) -> DataFrame

Writes the DataFrame as parquet files, returning a new DataFrame with paths to the files that were written.

Files will be written to <root_dir>/* with randomly generated UUIDs as the file names.

Parameters:

Name Type Description Default
root_dir str

root file path to write parquet files to.

required
compression str

compression algorithm. Defaults to "snappy".

'snappy'
write_mode str

Operation mode of the write. append will add new data, overwrite will replace the contents of the root directory with new data. overwrite-partitions will replace only the contents in the partitions that are being written to. Defaults to "append".

'append'
partition_cols Optional[List[ColumnInputType]]

How to subpartition each partition further. Defaults to None.

None
io_config Optional[IOConfig]

configurations to use when interacting with remote storage.

None

Returns:

Name Type Description
DataFrame DataFrame

The filenames that were written out as strings.

Note

This call is blocking and will execute the DataFrame when called

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
>>> df.write_parquet("output_dir", write_mode="overwrite")
Tip

See also df.write_csv() and df.write_json() for other formats for writing DataFrames

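A sketch of a partitioned write with an explicit compression codec (the column and directory names are hypothetical, and zstd is assumed to be among the supported codecs):

>>> import daft
>>> df = daft.from_pydict({"year": [2023, 2024], "value": [1.0, 2.0]})
>>> # overwrite-partitions replaces only the `year` partitions present in `df`.
>>> df.write_parquet(
...     "warehouse_dir",
...     compression="zstd",
...     write_mode="overwrite-partitions",
...     partition_cols=["year"],
... )  # doctest: +SKIP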
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def write_parquet(
    self,
    root_dir: str | pathlib.Path,
    compression: str = "snappy",
    write_mode: Literal["append", "overwrite", "overwrite-partitions"] = "append",
    partition_cols: list[ColumnInputType] | None = None,
    io_config: IOConfig | None = None,
) -> "DataFrame":
    """Writes the DataFrame as parquet files, returning a new DataFrame with paths to the files that were written.

    Files will be written to `<root_dir>/*` with randomly generated UUIDs as the file names.

    Args:
        root_dir (str): root file path to write parquet files to.
        compression (str, optional): compression algorithm. Defaults to "snappy".
        write_mode (str, optional): Operation mode of the write. `append` will add new data, `overwrite` will replace the contents of the root directory with new data. `overwrite-partitions` will replace only the contents in the partitions that are being written to. Defaults to "append".
        partition_cols (Optional[List[ColumnInputType]], optional): How to subpartition each partition further. Defaults to None.
        io_config (Optional[IOConfig], optional): configurations to use when interacting with remote storage.

    Returns:
        DataFrame: The filenames that were written out as strings.

    Note:
        This call is **blocking** and will execute the DataFrame when called

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
        >>> df.write_parquet("output_dir", write_mode="overwrite")  # doctest: +SKIP

    Tip:
        See also [`df.write_csv()`][daft.DataFrame.write_csv] and [`df.write_json()`][daft.DataFrame.write_json]
        for other formats for writing DataFrames
    """
    if write_mode not in ["append", "overwrite", "overwrite-partitions"]:
        raise ValueError(
            f"Only support `append`, `overwrite`, or `overwrite-partitions` mode. {write_mode} is unsupported"
        )
    if write_mode == "overwrite-partitions" and partition_cols is None:
        raise ValueError("Partition columns must be specified to use `overwrite-partitions` mode.")

    io_config = get_context().daft_planning_config.default_io_config if io_config is None else io_config

    cols: list[Expression] | None = None
    if partition_cols is not None:
        cols = column_inputs_to_expressions(tuple(partition_cols))

    builder = self._builder.write_tabular(
        root_dir=root_dir,
        partition_cols=cols,
        write_mode=WriteMode.from_str(write_mode),
        file_format=FileFormat.Parquet,
        compression=compression,
        io_config=io_config,
    )
    # Block and write, then retrieve data
    write_df = DataFrame(builder)
    write_df.collect()
    assert write_df._result is not None

    # Populate and return a new disconnected DataFrame
    result_df = DataFrame(write_df._builder)
    result_df._result_cache = write_df._result_cache
    result_df._preview = write_df._preview
    return result_df

write_sink #

write_sink(sink: DataSink[WriteResultType]) -> DataFrame

Writes the DataFrame to the given DataSink.

Parameters:

Name Type Description Default
sink DataSink[WriteResultType]

The DataSink to write to.

required

Returns:

Name Type Description
DataFrame DataFrame

A dataframe from the micropartition returned by the DataSink's .finalize() method.

Note

This call is blocking and will execute the DataFrame when called

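The format-specific writers above are thin wrappers over this method. As a sketch, write_clickhouse is roughly equivalent to constructing the sink yourself (the connection details below are hypothetical):

>>> import daft
>>> from daft.io.clickhouse.clickhouse_data_sink import ClickHouseDataSink  # doctest: +SKIP
>>> df = daft.from_pydict({"a": [1, 2, 3]})
>>> sink = ClickHouseDataSink(
...     "my_table",
...     host="localhost",
...     port=8123,
...     user="default",
...     password="",
...     database="default",
...     client_kwargs=None,
...     write_kwargs=None,
... )  # doctest: +SKIP
>>> df.write_sink(sink)  # doctest: +SKIP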
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def write_sink(self, sink: "DataSink[WriteResultType]") -> "DataFrame":
    """Writes the DataFrame to the given DataSink.

    Args:
        sink: The DataSink to write to.

    Returns:
        DataFrame: A dataframe from the micropartition returned by the DataSink's `.finalize()` method.

    Note:
        This call is **blocking** and will execute the DataFrame when called
    """
    sink.start()

    builder = self._builder.write_datasink(sink.name(), sink)
    write_df = DataFrame(builder)
    write_df.collect()

    results = write_df.to_pydict()
    assert "write_results" in results
    micropartition = sink.finalize(results["write_results"])
    if micropartition.schema() != sink.schema():
        raise ValueError(
            f"Schema mismatch between the data sink's schema and the result's schema:\nSink schema:\n{sink.schema()}\nResult schema:\n{micropartition.schema()}"
        )
    # TODO(desmond): Connect the old and new logical plan builders so that a .explain() shows the
    # plan from the source all the way to the sink to the sink's results. In theory we can do this
    # for all other sinks too.
    return DataFrame._from_micropartitions(micropartition)

write_turbopuffer #

write_turbopuffer(namespace: str | Expression, api_key: str | None = None, region: str | None = None, distance_metric: Literal['cosine_distance', 'euclidean_squared'] | None = None, schema: dict[str, Any] | None = None, id_column: str | None = None, vector_column: str | None = None, client_kwargs: dict[str, Any] | None = None, write_kwargs: dict[str, Any] | None = None) -> DataFrame

Writes the DataFrame to a Turbopuffer namespace.

This method transforms each row of the dataframe into a turbopuffer document. This means that an id column is always required. Optionally, the id_column parameter can be used to specify the column name to use for the id column. Note that the column with the name specified by id_column will be renamed to "id" when written to turbopuffer.

A vector column is required if the namespace has a vector index. Optionally, the vector_column parameter can be used to specify the column name to use for the vector index. Note that the column with the name specified by vector_column will be renamed to "vector" when written to turbopuffer.

All other columns become attributes.

The namespace parameter can be either a string (for a single namespace) or an expression (for multiple namespaces). When using an expression, the data will be partitioned by the computed namespace values and written to each namespace separately.

For more details on parameters, please see the turbopuffer documentation: https://turbopuffer.com/docs/write

Parameters:

Name Type Description Default
namespace str | Expression

The namespace to write to. Can be a string for a single namespace or an expression for multiple namespaces.

required
api_key str | None

Turbopuffer API key.

None
region str | None

Turbopuffer region.

None
distance_metric Literal['cosine_distance', 'euclidean_squared'] | None

Distance metric for vector similarity ("cosine_distance", "euclidean_squared").

None
schema dict[str, Any] | None

Optional manual schema specification.

None
id_column str | None

Optional column name for the id column. The data sink will automatically rename the column to "id" for the id column.

None
vector_column str | None

Optional column name for the vector index column. The data sink will automatically rename the column to "vector" for the vector index.

None
client_kwargs dict[str, Any] | None

Optional dictionary of arguments to pass to the Turbopuffer client constructor. Explicit arguments (api_key, region) will be merged into client_kwargs.

None
write_kwargs dict[str, Any] | None

Optional dictionary of arguments to pass to the namespace.write() method. Explicit arguments (distance_metric, schema) will be merged into write_kwargs.

None
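A usage sketch writing per-tenant namespaces via an expression (the API key, region, and column names are hypothetical):

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict(
...     {
...         "doc_id": [1, 2],
...         "embedding": [[0.1, 0.2], [0.3, 0.4]],
...         "tenant": ["acme", "globex"],
...     }
... )
>>> # `doc_id` is renamed to "id" and `embedding` to "vector"; rows are
>>> # partitioned by the computed namespace values and written per namespace.
>>> df.write_turbopuffer(
...     namespace=col("tenant"),
...     api_key="tpuf_...",
...     region="gcp-us-central1",
...     distance_metric="cosine_distance",
...     id_column="doc_id",
...     vector_column="embedding",
... )  # doctest: +SKIP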
Source code in daft/dataframe/dataframe.py
@DataframePublicAPI
def write_turbopuffer(
    self,
    namespace: str | Expression,
    api_key: str | None = None,
    region: str | None = None,
    distance_metric: Literal["cosine_distance", "euclidean_squared"] | None = None,
    schema: dict[str, Any] | None = None,
    id_column: str | None = None,
    vector_column: str | None = None,
    client_kwargs: dict[str, Any] | None = None,
    write_kwargs: dict[str, Any] | None = None,
) -> "DataFrame":
    """Writes the DataFrame to a Turbopuffer namespace.

    This method transforms each row of the dataframe into a turbopuffer document.
    This means that an `id` column is always required. Optionally, the `id_column` parameter can be used to specify the column name to use for the id column.
    Note that the column with the name specified by `id_column` will be renamed to "id" when written to turbopuffer.

    A `vector` column is required if the namespace has a vector index. Optionally, the `vector_column` parameter can be used to specify the column name to use for the vector index.
    Note that the column with the name specified by `vector_column` will be renamed to "vector" when written to turbopuffer.

    All other columns become attributes.

    The namespace parameter can be either a string (for a single namespace) or an expression (for multiple namespaces).
    When using an expression, the data will be partitioned by the computed namespace values and written to each namespace separately.

    For more details on parameters, please see the turbopuffer documentation: https://turbopuffer.com/docs/write

    Args:
        namespace: The namespace to write to. Can be a string for a single namespace or an expression for multiple namespaces.
        api_key: Turbopuffer API key.
        region: Turbopuffer region.
        distance_metric: Distance metric for vector similarity ("cosine_distance", "euclidean_squared").
        schema: Optional manual schema specification.
        id_column: Optional column name for the id column. The data sink will automatically rename the column to "id" for the id column.
        vector_column: Optional column name for the vector index column. The data sink will automatically rename the column to "vector" for the vector index.
        client_kwargs: Optional dictionary of arguments to pass to the Turbopuffer client constructor.
            Explicit arguments (api_key, region) will be merged into client_kwargs.
        write_kwargs: Optional dictionary of arguments to pass to the namespace.write() method.
            Explicit arguments (distance_metric, schema) will be merged into write_kwargs.
    """
    from daft.io.turbopuffer.turbopuffer_data_sink import TurbopufferDataSink

    sink = TurbopufferDataSink(
        namespace, api_key, region, distance_metric, schema, id_column, vector_column, client_kwargs, write_kwargs
    )
    return self.write_sink(sink)