perf: Coalesce page fetches when RowSelection selects all rows#9578
Conversation
|
run benchmark arrow_reader_clickbench |
|
🤖 Arrow criterion benchmark running (GKE) | trigger |
|
Probably won't show up in benchmark, but still would be nice. |
08fbe2c to
97bc46b
Compare
When a predicate selects all rows and there is no prior selection, skip creating a RowSelection entirely. This avoids the allocation in RowSelection::from_filters and keeps the selection as None, which enables coalesced page fetches for subsequent predicates. Merging adjacent ranges in scan_ranges is not feasible because the Sparse column chunk format requires individual page offsets for lookup. Instead, this avoids entering the per-page path altogether. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
97bc46b to
dc5b086
Compare
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Details
Resource Usagebase (merge-base)
branch
|
Should I spend time reviewing that one instead / in addition? I am losing track of all the PRs |
alamb
left a comment
There was a problem hiding this comment.
This seems like a good improvement to me
| }; | ||
| } | ||
|
|
||
| // If the predicate selected all rows and there is no prior selection, |
Perhaps I/we want to first test it in DataFusion? @adriangb also wanted to do aptive filtering which could be similar or work together. |
My intuition (substantiated by some now somewhat outdated benchmarks) is that with morselization and using morsels as the unit of adaptivity we can and probably should do all of the adaptive filter execution in DataFusion. I plan to fast follow the morselization work with the adaptive filtering work because (1) it will work much better with the smaller units of adaptivity, (2) I don't want to bog reviewers down with too many options but (3) I need to make sure that any API changes needed (I expect few to none) can be done before the API changes from morsels are set in stone. |
Summary
RowSelectionselects every row in a row group,fetch_rangesnow treats it as no selection, producing a single whole-column-chunk I/O request instead of N individual page requestsDetails
In
InMemoryRowGroup::fetch_ranges, when both aRowSelectionand anOffsetIndexare present, the code enters a page-level fetch path that usesscan_ranges()to produce individual page ranges. Even when the selection covers all rows, this produces N separate ranges (one per page).The fix: before entering the page-level path, check if the selection's
row_count()equals the row group's total row count. If so, drop the selection and take the simpler whole-column-chunk path.This commonly happens when a multi-predicate
RowFilterhas an early predicate that passes all rows in a row group (e.g.,CounterID = 62on a row group where all rows haveCounterID = 62).Test plan
test_read_multiple_row_filterverifies the coalesced fetch pattern🤖 Generated with Claude Code