…ation in DecomposeRepeatWithPreAggregation (#60091)
Related PR: #59116 #60045
**Problem Summary:**
There are 2 problems:
1. When `DecomposeRepeatWithPreAggregation` optimization removes the
maximum grouping set, grouping functions (e.g., `grouping()`,
`grouping_id()`) that reference expressions only existing in the removed
grouping set would fail because of index lookup failure: When computing
grouping function values, `GroupingSetShapes.indexOf()` would return -1
for expressions not in `flattenGroupingSetExpression`, causing incorrect
behavior.
Example scenario:
```sql
SELECT a, b, c, grouping(c)
FROM t1
GROUP BY GROUPING SETS ((a, b, c), (a, b), (a), ());
```
After optimization removes the maximum grouping set `(a, b, c)`, the
expression `c` no longer exists in the remaining grouping sets. When
computing `grouping(c)`, the index lookup would fail.
2. The `repeatSlotIdList` calculation was incorrect because it relied on
the assumption that the order of `physicalRepeat.getOutputExpressions()`
matches the order of `outputTuple`. However, this assumption can be
broken during rule rewriting (e.g., in
`DecomposeRepeatWithPreAggregation.constructRepeat()` at line 502-503,
where `replacedRepeatOutputs` is constructed by `new
ArrayList<>(child.getOutput())` and then appends grouping functions,
potentially changing the order). This mismatch causes incorrect slot ID
mapping and wrong query results.
Example scenario:
When `DecomposeRepeatWithPreAggregation` rewrites a repeat plan, the
output expressions order may differ from the output tuple order, leading
to incorrect slot ID mapping in `repeatSlotIdList`.
**Solution:**
1. For problem 1: Modified `Repeat.toShapes()` method to ensure all
expressions referenced by grouping functions are included in
`flattenGroupingSetExpression`, even if they are not in any grouping
set. For expressions that only exist in the removed maximum grouping
set, they are correctly marked as `shouldBeErasedToNull = true` in all
remaining grouping sets, which is the correct SQL semantics.
Key changes:
- `Repeat.toShapes()`: Enhanced to collect all expressions referenced by
grouping functions using `ExpressionUtils.collectToList()` and merge
them into `flattenGroupingSetExpression`
- For expressions not in any grouping set, they are correctly marked as
`shouldBeErasedToNull = true` in all `GroupingSetShape` instances
- This ensures `GroupingSetShapes.indexOf()` always returns a valid
index for grouping function arguments, preventing index lookup failures
2. For problem 2: Modified `Repeat.computeRepeatSlotIdList()` method to
accept an additional `outputSlots` parameter. Instead of relying on the
order of `outputExpressions` matching `outputTuple`, the method now uses
`outputSlots` to correctly map grouping set expressions to their indices
in the output tuple. This ensures correct slot ID calculation even when
rule rewriting changes the order of output expressions.
Key changes:
- `computeRepeatSlotIdList(List<Integer> slotIdList, List<Slot>
outputSlots)`: Added `outputSlots` parameter
- `getGroupingSetsIndexesInOutput(List<Slot> outputSlots)`: Modified to
use `outputSlots` instead of `getOutputExpressions()`
- `indexesOfOutput(List<Slot> outputSlots)`: Modified to build index map
from `outputSlots` instead of `getOutputExpressions()`
- `PhysicalPlanTranslator.visitPhysicalRepeat()`: Updated to pass
`outputSlots` to `computeRepeatSlotIdList()`
3. Another task for this pull request is to set the shuffle key for the
rewritten bottom aggregation. Choosing a suitable shuffle key can
improve the aggregation rate of the top aggregation's local aggregation.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
This PR add a rewrite rule:
The scenario where performance optimization is achieved is when pre-aggregation significantly reduces the amount of data. This rewrite reduces the number of rows generated by the repeat operator, resulting in improved performance.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)