[Parquet] Avoid fetching multiple pages when max_predicate_cache_size is 0 #8542

@nuno-faria

Describe the bug

#7850 introduced the cached array reader, which causes multiple data pages to be fetched if their size is less than the batch_size. Depending on the file, workload, and batch size, this can cause performance regressions (e.g., apache/datafusion#17575).

While setting max_predicate_cache_size to 0 essentially disables the predicate cache, multiple pages are still unnecessarily retrieved.

The sources of this issue are:

  • The InMemoryRowGroup::fetch will use the expanded selection based on the provided cache ProjectionMask.
  • The ArrayReaderBuilder::build_reader will use a CachedArrayReader instead of the regular reader, also based on the cache ProjectionMask.

Thus, the most straightforward solution I've found is to return None from ReaderFactory::compute_cache_projection when max_predicate_cache_size is 0, so that the reader fetches only the necessary pages:

    fn compute_cache_projection(&self, projection: &ProjectionMask) -> Option<ProjectionMask> {
+       if self.max_predicate_cache_size == 0 {
+           return None;
+       }
       ...
    }

ReaderFactory::read_row_group can remain unchanged, since it already handles the case where compute_cache_projection returns None:

        let cache_projection = match self.compute_cache_projection(&projection) {
            Some(projection) => projection,
            None => ProjectionMask::none(meta.columns().len()),
        };
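To illustrate how the early return and the existing fallback fit together, here is a minimal, self-contained sketch. It does not use the parquet crate: ToyReaderFactory, predicate_columns, and cache_columns are hypothetical names, and plain column indices stand in for ProjectionMask. The rule that only columns shared by the predicates and the output projection are cached is also an assumption made for the sketch; the real compute_cache_projection may differ.

```rust
/// Toy stand-in for the reader factory. Hypothetical type, NOT the parquet crate's.
struct ToyReaderFactory {
    /// Mirrors the setting discussed in this issue; 0 means the cache is disabled.
    max_predicate_cache_size: usize,
    /// Hypothetical: leaf column indices referenced by the filter predicates.
    predicate_columns: Vec<usize>,
}

impl ToyReaderFactory {
    /// Returns the columns worth caching, or `None` when nothing should be cached.
    fn compute_cache_projection(&self, projection: &[usize]) -> Option<Vec<usize>> {
        // Proposed change: a disabled cache never yields a cache projection.
        if self.max_predicate_cache_size == 0 {
            return None;
        }
        // Assumption for this sketch: cache only columns that are both
        // filtered on and part of the output projection.
        let cached: Vec<usize> = projection
            .iter()
            .copied()
            .filter(|c| self.predicate_columns.contains(c))
            .collect();
        if cached.is_empty() {
            None
        } else {
            Some(cached)
        }
    }

    /// Mirrors the fallback in read_row_group: `None` becomes an empty selection,
    /// the toy equivalent of `ProjectionMask::none(..)`.
    fn cache_columns(&self, projection: &[usize]) -> Vec<usize> {
        self.compute_cache_projection(projection).unwrap_or_default()
    }
}
```

In this model, a factory with max_predicate_cache_size set to 0 always produces an empty cache selection, so neither the fetch path nor the reader builder would have any cached column to act on.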

@alamb @XiangpengHao what do you think? Is this the best way to solve the issue? If so I can open a PR.

To Reproduce

See apache/datafusion#17575.

Expected behavior

Retrieve only the minimum required data pages if max_predicate_cache_size is set to 0.

Additional context
