Describe the bug
#7850 introduced the cached array reader, which causes multiple data pages to be fetched if their size is less than the batch_size. Depending on the file, workload, and batch size, this might cause regressions in performance (e.g., apache/datafusion#17575).
While setting max_predicate_cache_size to 0 essentially disables the predicate cache, multiple pages are still unnecessarily retrieved.
The sources of this issue are:
- The
InMemoryRowGroup::fetch will use the expanded selection based on the provided cache ProjectionMask.
- The
ArrayReaderBuilder::build_reader will use a CachedArrayReader instead of the regular reader, also based on the cache ProjectionMask.
Thus the most straight forward solution I've found is to return None in the ReaderFactory::compute_cache_projection if the max_predicate_cache_size is 0, causing the reader to fetch only the necessary pages:
fn compute_cache_projection(&self, projection: &ProjectionMask) -> Option<ProjectionMask> {
+ if self.max_predicate_cache_size == 0 {
+ return None;
+ }
...
}
The ReaderFactory::read_row_group remains the same, since it already expects the possibility of compute_cache_projection to return None:
let cache_projection = match self.compute_cache_projection(&projection) {
Some(projection) => projection,
None => ProjectionMask::none(meta.columns().len()),
};
@alamb @XiangpengHao what do you think? Is this the best way to solve the issue? If so I can open a PR.
To Reproduce
See apache/datafusion#17575.
Expected behavior
Retrieve only the minimum required data pages if max_predicate_cache_size is set to 0.
Additional context
Describe the bug
#7850 introduced the cached array reader, which causes multiple data pages to be fetched if their size is less than the
batch_size. Depending on the file, workload, and batch size, this might cause regressions in performance (e.g., apache/datafusion#17575).While setting
max_predicate_cache_sizeto 0 essentially disables the predicate cache, multiple pages are still unnecessarily retrieved.The sources of this issue are:
InMemoryRowGroup::fetchwill use the expanded selection based on the provided cacheProjectionMask.ArrayReaderBuilder::build_readerwill use aCachedArrayReaderinstead of the regular reader, also based on the cacheProjectionMask.Thus the most straight forward solution I've found is to return None in the
ReaderFactory::compute_cache_projectionif themax_predicate_cache_sizeis 0, causing the reader to fetch only the necessary pages:fn compute_cache_projection(&self, projection: &ProjectionMask) -> Option<ProjectionMask> { + if self.max_predicate_cache_size == 0 { + return None; + } ... }The
ReaderFactory::read_row_groupremains the same, since it already expects the possibility ofcompute_cache_projectionto return None:@alamb @XiangpengHao what do you think? Is this the best way to solve the issue? If so I can open a PR.
To Reproduce
See apache/datafusion#17575.
Expected behavior
Retrieve only the minimum required data pages if
max_predicate_cache_sizeis set to 0.Additional context