Description
While working on zen-xu/pyarrow-stubs#197, I learned some new things about `pyarrow.Table.group_by(...).aggregate(...)`.
We could allow a wider range of functions
Currently we allow 10 different `Expr` aggregation methods:

```python
"sum", "mean", "median", "max", "min", "std", "var", "len", "n_unique", "count"
```
For `pyarrow`, these use only 9 of the 21 native functions that are allowed in that context.
How are they mapped over?
```python
class ArrowGroupBy(EagerGroupBy["ArrowDataFrame", "ArrowExpr"]):
    _REMAP_AGGS: ClassVar[Mapping[NarwhalsAggregation, Any]] = {
        "sum": "sum",
        "mean": "mean",
        "median": "approximate_median",
        "max": "max",
        "min": "min",
        "std": "stddev",
        "var": "variance",
        "len": "count",
        "n_unique": "count_distinct",
        "count": "count",
    }
```
Some aggregations also accept options for a small amount of flexibility.
We already use `pc.CountOptions` and `pc.VarianceOptions`, which is good 🙂.
But `pc.ScalarAggregateOptions` and `pc.TDigestOptions` may also be useful.
Projection
Now this was where it got interesting for me.
The docs provide an "annotation" for the `aggregations` parameter:
```python
class TableGroupBy:
    def aggregate(self, aggregations):
        """
        Perform an aggregation over the grouped columns of the table.

        Parameters
        ----------
        aggregations : list[tuple(str, str)] or \
            list[tuple(str, str, FunctionOptions)]
        """
```
Which we have been following in:

```python
aggs: list[tuple[str, str, Any]] = []
```
However, the description in the docs immediately contradicts that by stating:

> The column name can be a string, an empty list or a list of column names

Meaning instead of a single `str` column, we can do:
Updated stubs

```python
UnarySelector: TypeAlias = str  # <---------------- We only use this at the moment
NullarySelector: TypeAlias = tuple[()]
NarySelector: TypeAlias = list[str] | tuple[str, ...]
ColumnSelector: TypeAlias = UnarySelector | NullarySelector | NarySelector
```
Summary
I'm hoping together this could mean:
- We don't need to treat as many cases as complex (for `pyarrow`)
- Expressions that expand to multiple outputs can be performed more efficiently (natively)
Related

- `nw.col`/`Expr` for `.group_by` #1385
- `group_by` keys #2325
- `ExprMetadata` down to compliant Exprs from `narwhals.Expr` #1848
- `unique` in `group_by` context #1076
- `mode` in grouped context #981
- `TableGroupBy.aggregate` (zen-xu/pyarrow-stubs#197)
- `TableGroupBy.aggregate` (zen-xu/pyarrow-stubs#197 (comment))