Add get_byte_ranges method to AsyncFileReader trait #2115
tustvold merged 4 commits into apache:master
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master    #2115      +/-   ##
==========================================
- Coverage   83.76%   83.74%    -0.03%
==========================================
  Files         225      225
  Lines       59457    59473       +16
==========================================
- Hits        49806    49805        -1
- Misses       9651     9668       +17
parquet/src/arrow/async_reader.rs (outdated)

    .get_byte_ranges(fetch_ranges)
    .await?
    .into_iter()
    .enumerate()

.zip(update_chunks.iter_mut()) might be cleaner?
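The suggestion replaces index bookkeeping with direct pairing: `.zip` walks the fetched results and their destination slots in lockstep. A minimal synchronous sketch of the difference (the names `fetched` and `update_chunks` are illustrative stand-ins, not the PR's actual bindings):

```rust
fn main() {
    // Chunks fetched for the requested byte ranges, in request order.
    let fetched = vec![vec![1u8, 2], vec![3, 4]];
    // Destination slots to be filled with the fetched data.
    let mut update_chunks: Vec<Option<Vec<u8>>> = vec![None, None];

    // Instead of .enumerate() and indexing back into update_chunks,
    // pair each fetched chunk with its destination slot directly.
    for (data, slot) in fetched.into_iter().zip(update_chunks.iter_mut()) {
        *slot = Some(data);
    }

    assert_eq!(update_chunks[0].as_deref(), Some(&[1u8, 2][..]));
    assert_eq!(update_chunks[1].as_deref(), Some(&[3u8, 4][..]));
}
```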
parquet/src/arrow/async_reader.rs (outdated)

    let mut fetch_ranges =
        Vec::with_capacity(column_chunks.len());

    let mut update_chunks: Vec<(

My gut says that it would be cleaner to just iterate through the column_chunks and use filter_map to extract the ranges, pass this to AsyncFileReader, convert the result to an iterator, and then iterate the column_chunks again, popping the next element from the iterator for each included column.
Not a big deal though
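The suggested two-pass shape can be sketched as follows. This is a synchronous sketch under assumptions: `ColumnChunk` is a hypothetical stand-in for the PR's per-column state, and `get_byte_ranges` here is a plain function standing in for the async trait method.

```rust
use std::ops::Range;

// Hypothetical stand-in for the PR's per-column state: a column is
// "included" when it still has a byte range to fetch.
struct ColumnChunk {
    range: Option<Range<usize>>,
    data: Option<Vec<u8>>,
}

// Synchronous stand-in for AsyncFileReader::get_byte_ranges.
fn get_byte_ranges(file: &[u8], ranges: Vec<Range<usize>>) -> Vec<Vec<u8>> {
    ranges.into_iter().map(|r| file[r].to_vec()).collect()
}

fn main() {
    let file: Vec<u8> = (0..10).collect();
    let mut column_chunks = vec![
        ColumnChunk { range: Some(0..2), data: None },
        ColumnChunk { range: None, data: None }, // excluded column
        ColumnChunk { range: Some(4..7), data: None },
    ];

    // Pass 1: filter_map extracts the ranges of the included columns.
    let fetch_ranges: Vec<_> = column_chunks
        .iter()
        .filter_map(|c| c.range.clone())
        .collect();

    // Fetch, then pass 2: pop the next result for each included column.
    let mut results = get_byte_ranges(&file, fetch_ranges).into_iter();
    for chunk in column_chunks.iter_mut().filter(|c| c.range.is_some()) {
        chunk.data = results.next();
    }

    assert_eq!(column_chunks[0].data.as_deref(), Some(&[0u8, 1][..]));
    assert!(column_chunks[1].data.is_none());
    assert_eq!(column_chunks[2].data.as_deref(), Some(&[4u8, 5, 6][..]));
}
```

The design point is that results arrive in the same order the ranges were requested, so the second pass over the same predicate stays aligned with the results iterator.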
yeah, I think this was cleaner
Benchmark runs are scheduled for baseline = 3096591 and contender = be0d34d. be0d34d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #2110
Rationale for this change
In certain cases it is better from a performance perspective to fetch data in parallel, such as when reading from object storage. This adds a hook into the `AsyncFileReader` trait to allow upstream consumers of this API to do that.
What changes are included in this PR?
Add a `get_byte_ranges(&mut self, ranges: Vec<Range<usize>>)` method to the `AsyncFileReader` trait, with a default implementation that falls back to calling `get_bytes` serially for the provided ranges.
Are there any user-facing changes?
No
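The shape of the default method can be sketched with a synchronous analog. This is an assumption-laden sketch, not the PR's code: the real `AsyncFileReader` methods are async and return boxed futures, and `InMemory` is a hypothetical implementor added here for illustration.

```rust
use std::ops::Range;

// Synchronous analog of the trait change; the real AsyncFileReader
// returns futures, but the default-method shape is the same.
trait FileReader {
    fn get_bytes(&mut self, range: Range<usize>) -> Vec<u8>;

    // Default implementation: fetch each range serially via get_bytes.
    // Implementations backed by object storage can override this to
    // issue the requests in parallel.
    fn get_byte_ranges(&mut self, ranges: Vec<Range<usize>>) -> Vec<Vec<u8>> {
        ranges.into_iter().map(|r| self.get_bytes(r)).collect()
    }
}

// Hypothetical implementor that only provides the required method,
// inheriting the serial fallback.
struct InMemory(Vec<u8>);

impl FileReader for InMemory {
    fn get_bytes(&mut self, range: Range<usize>) -> Vec<u8> {
        self.0[range].to_vec()
    }
}

fn main() {
    let mut reader = InMemory((0..8).collect());
    let chunks = reader.get_byte_ranges(vec![0..2, 5..8]);
    assert_eq!(chunks, vec![vec![0, 1], vec![5, 6, 7]]);
}
```

Because the new method has a default body, existing `AsyncFileReader` implementors keep compiling unchanged, which is why there are no user-facing changes.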