Before getting into the details of implementing individual aggregation types, we should first put some effort into the abstract API for aggregating data in a multi-threaded setting.
Aggregators must support both hash/group modes (e.g. "group by" in SQL or data frame libraries) and non-group modes.
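To make the discussion concrete, here is a minimal sketch of what such an API could look like. The names (SumState, consume, merge, finalize, aggregate) are hypothetical and are not an existing Arrow interface. The assumption is that each thread builds its own partial state and the partial states are merged at the end, with the non-group mode treated as a single implicit group.

```python
from concurrent.futures import ThreadPoolExecutor


class SumState:
    """Hypothetical partial-aggregation state for SUM; not an Arrow API."""

    def __init__(self):
        # The key None holds the total for the non-group (scalar) mode.
        self.totals = {}

    def consume(self, values, keys=None):
        # Non-group mode: every value belongs to the single implicit group.
        if keys is None:
            keys = [None] * len(values)
        for key, value in zip(keys, values):
            self.totals[key] = self.totals.get(key, 0) + value
        return self

    def merge(self, other):
        # Fold a state built by another thread into this one.
        for key, total in other.totals.items():
            self.totals[key] = self.totals.get(key, 0) + total
        return self

    def finalize(self):
        return dict(self.totals)


def aggregate(chunks, max_workers=4):
    """Consume each (values, keys) chunk on a worker, then merge serially."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(lambda c: SumState().consume(*c), chunks))
    result = SumState()
    for partial in partials:
        result.merge(partial)
    return result.finalize()
```

Because each worker owns its state, the consume step needs no locking; only the final merge is serial. Whether the merge itself should be parallelized (e.g. as a tree) is a separate design question.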
Ideally, aggregations should also support filter pushdown. For example:
select $AGG($EXPR)
from $TABLE
where $PREDICATE
Some systems materialize the filtered (post-predicate) version of $EXPR and then aggregate that; pandas does this, for example. Vectorized performance can be much improved by applying the filter inside the aggregation kernel instead, which avoids the intermediate copy. How the predicate's true/false values are handled may depend on the implementation details of the kernel (e.g. SUM or MEAN will be a bit different from PRODUCT).
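The difference between the two approaches, and between kernels, can be sketched as follows. This is an illustration in plain Python rather than a vectorized implementation, and all function names are hypothetical. The point it shows is that a rejected row has to contribute the identity element of the operation: 0 for SUM, 1 for PRODUCT, and MEAN additionally has to count the selected rows.

```python
def sum_materialized(values, mask):
    # Two steps: build the filtered copy, then aggregate it.
    filtered = [v for v, keep in zip(values, mask) if keep]
    return sum(filtered)


def sum_pushdown(values, mask):
    # One pass, no intermediate copy: rejected rows contribute 0.
    total = 0
    for v, keep in zip(values, mask):
        total += v if keep else 0
    return total


def mean_pushdown(values, mask):
    # Same as SUM, but the kernel must also count the selected rows.
    total = 0
    count = 0
    for v, keep in zip(values, mask):
        total += v if keep else 0
        count += 1 if keep else 0
    return total / count if count else None


def product_pushdown(values, mask):
    # Rejected rows contribute 1, the identity for multiplication.
    result = 1
    for v, keep in zip(values, mask):
        result *= v if keep else 1
    return result
```

With values [2, 3, 4, 5] and mask [True, False, True, True], both SUM variants select 2, 4 and 5. Multiplying the value by the mask bit would also work for SUM on integers, but it propagates NaN from rejected floating-point rows, which is one reason the handling is kernel-specific.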
Reporter: Wes McKinney / @wesm
Assignee: Francois Saint-Jacques / @fsaintjacques
Note: This issue was originally created as ARROW-4124. Please see the migration documentation for further details.