Improve parquet BinaryView / StringView decoder performance (up to -35%)#9236
Improve parquet BinaryView / StringView decoder performance (up to -35%)#9236Dandandan merged 19 commits intoapache:mainfrom
Conversation
|
run benchmark arrow_reader arrow_reader_clickbench |
|
🤖 |
|
run benchmark arrow_reader arrow_reader_clickbench |
| while self.offset < self.buf.len() && read != to_read { | ||
| if self.offset + 4 > self.buf.len() { | ||
| let mut utf8_validation_begin = offset; | ||
| for _ in 0..to_read { |
There was a problem hiding this comment.
Somehow using extend here makes things slower (even though I can see in the profile it now uses the trusted len path instead of Vec:push
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
run benchmark arrow_reader |
| let read = self.decoder.read(len, |keys| { | ||
| output | ||
| .views | ||
| .extend(keys.iter().map(|k| match dict.views.get(*k as usize) { |
There was a problem hiding this comment.
This is the largest optimization.
| } | ||
|
|
||
| if self.validate_utf8 { | ||
| if VALIDATE_UTF8 { |
There was a problem hiding this comment.
I think this is the main improvement for the plain decoder
|
🤖: Benchmark completed Details
|
|
🤖 |
Those are some pretty nice improvements |
|
Also |
|
🤖: Benchmark completed Details
|
|
🤖 |
alamb
left a comment
There was a problem hiding this comment.
I reviewed this carefully and the changes make sense to me -- thank you @Dandandan
| let len = u32::from_le_bytes(len_bytes); | ||
|
|
||
| let start_offset = self.offset + 4; | ||
| let len = u32::from_le_bytes(unsafe { *(buf.as_ptr().add(offset) as *const [u8; 4]) }); |
There was a problem hiding this comment.
The idea here is to save a bounds check for the length, right?
There was a problem hiding this comment.
Yeah... although I probably need to recheck whether it makes a difference.
|
🤖: Benchmark completed Details
|
|
run benchmark arrow_reader arrow_reader_clickbench |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
|
🤖: Benchmark completed Details
|
|
This is so good |
alamb
left a comment
There was a problem hiding this comment.
Looks good other than a very subtle bug related to utf8 validation.
# Which issue does this PR close? - Related to #9236 # Rationale for this change While reviewing #9236 from @Dandandan I found (with codex's help) that the UTF8 validation for long strings was slightly different. This test shows the problem (it passes on main but fails on the current PR) # What changes are included in this PR? Add new test # Are these changes tested? yes only tests # Are there any user-facing changes? No
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
|
🥳 |
Which issue does this PR close?
Rationale for this change
Details
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?