Description
While working on zen-xu/pyarrow-stubs#197, I learned some new things about `pyarrow.Table.group_by(...).aggregate(...)`.
We could allow a wider range of functions
Currently we allow 10 different `Expr` aggregation methods:

```python
"sum", "mean", "median", "max", "min", "std", "var", "len", "n_unique", "count"
```
For `pyarrow`, these use only 9 of the 21 native functions that are allowed in that context.
How are they mapped over?
```python
class ArrowGroupBy(EagerGroupBy["ArrowDataFrame", "ArrowExpr"]):
    _REMAP_AGGS: ClassVar[Mapping[NarwhalsAggregation, Any]] = {
        "sum": "sum",
        "mean": "mean",
        "median": "approximate_median",
        "max": "max",
        "min": "min",
        "std": "stddev",
        "var": "variance",
        "len": "count",
        "n_unique": "count_distinct",
        "count": "count",
    }
```
Some aggregations also accept options for a small amount of flexibility.
We already use `pc.CountOptions` and `pc.VarianceOptions`, which is good 🙂.
But `pc.ScalarAggregateOptions` and `pc.TDigestOptions` may also be useful.
Projection
Now this was where it got interesting for me.
The docs provide an "annotation" for the `aggregations` parameter:
```python
class TableGroupBy:
    def aggregate(self, aggregations):
        """
        Perform an aggregation over the grouped columns of the table.

        Parameters
        ----------
        aggregations : list[tuple(str, str)] or \
            list[tuple(str, str, FunctionOptions)]
        """
```
Which we have been following in:

```python
aggs: list[tuple[str, str, Any]] = []
```
However, the description in the docs immediately contradicts that by stating:

> The column name can be a string, an empty list or a list of column names

Meaning instead of a single `str` column, we can do:
Updated stubs

```python
UnarySelector: TypeAlias = str  # <---------------- We only use this at the moment
NullarySelector: TypeAlias = tuple[()]
NarySelector: TypeAlias = list[str] | tuple[str, ...]
ColumnSelector: TypeAlias = UnarySelector | NullarySelector | NarySelector
```
Summary
I'm hoping together this could mean:
- We don't need to treat as many cases as complex (for `pyarrow`)
- Expressions that expand to multiple outputs can be performed more efficiently (natively)
Related

- `nw.col`/`Expr` for `.group_by` #1385
- `group_by` keys #2325
- `ExprMetadata` down to compliant Exprs from `narwhals.Expr` #1848
- `unique` in `group_by` context #1076
- `mode` in grouped context #981
- `TableGroupBy.aggregate` (zen-xu/pyarrow-stubs#197)
- `TableGroupBy.aggregate` (zen-xu/pyarrow-stubs#197 (comment))