Prevent FixedSizeBinaryArray::value offset truncation#9850
Conversation
e77819a to
7faf57d
Compare
| /// checking for overflow. | ||
| #[inline] | ||
| fn value_offset_at(&self, i: usize) -> i32 { | ||
| self.value_length * i as i32 |
There was a problem hiding this comment.
this arithmetic can overflow if the result is larger than i32::MAX
| /// Caller is responsible for ensuring that the index is within the bounds | ||
| /// of the array | ||
| /// of the array and the resulting byte offset fits in `i32` | ||
| pub unsafe fn value_unchecked(&self, i: usize) -> &[u8] { |
There was a problem hiding this comment.
Would it make sense to add new methods that are alternatives to value_offset and value_offset_at that return usize, so we don't need this limitation? Or at least update this method so it doesn't use them and doesn't suffer from i32 overflow. Because with this change, value now handles when value_offset is greater than max i32, but value_unchecked still doesn't.
I would expect value_unchecked to work correctly for all cases where value doesn't panic.
And existing code that uses value_unchecked might validate the index but not be aware of this hidden safety requirement.
There was a problem hiding this comment.
Short answer is yes. I also spent some more time reviewing the code in FixedSizeBinaryArray and I am now convinced there are several other miuses of i32 <-> usize . I am working on an improvement, though I worry it will be a larger PR
There was a problem hiding this comment.
|
I also added some more docs for FixedSizeBinaryArray that may help reviewers |
|
I have played around with several options for improving this code. I think there are several potential i32 math overflow issues, but fixing them all in a single PR is going to be somewhat hard to review and take some time What I am thinking about is adding a new invariant to the FixedSizeArray constructor that prevents constructing arrays with value buffers larger than 2GB as a temporary workaround in one PR. Then I can fixup the actual arithmetic in a second: |
# Which issue does this PR close? - related to #9850 # Rationale for this change While working on #9850 I noticed that FixedSizeBinaryArray::offset always returns zero. This was confusing to me and so I wanted to document it for future readrs # What changes are included in this PR? Add some comments # Are these changes tested? By CI # Are there any user-facing changes? No, this is internal comments
I took this approach in the following PR: I will create a follow on PR from that one that changes the internals of FixedSizeBinary to use |
# Which issue does this PR close? - related to https://github.com/apache/arrow-rs/pull/9850/changes#r3170266352 # Rationale for this change While working on #9850 I felt it would help to have some better docs about what a FixedSizeBinaryArray actually was, so I made some # What changes are included in this PR? Add some more background / explanatory docs about this array # Are these changes tested? By CI # Are there any user-facing changes? Docs only
# Which issue does this PR close? - Closes #9850 # Rationale for this change `FixedSizeBinaryArray::value_offset_at` works use `i32` arithmetic which can overflow. For offsets beyond `i32::MAX`, that can be bad # What changes are included in this PR? 1. Prevent any FixedSizedBinaryArrays from being constructed where the offset calculation could overflow 2. Add some other overflow checks As @adamreeve [pointed out](#9850 (comment)) on #9850 there are several places where the `i32` arithmetic is problematic in `FixedSizeBinaryArray`. I will fix them for real in a different, follow on PR, by switching to entirely `usize` based arithmetic for offset calculations However, since I hope to backport this PR to older releases, I would like something that is easy to review and has the least potential for unintended consequences. # Are these changes tested? I added unit tests. However, I can't find any way to fully trigger the actual paths short of trying to allocate very large arrays, which I don't think is appropriate for unit tests. # Are there any user-facing changes? Better limit checking
…#9872) - Closes apache#9850 `FixedSizeBinaryArray::value_offset_at` works use `i32` arithmetic which can overflow. For offsets beyond `i32::MAX`, that can be bad 1. Prevent any FixedSizedBinaryArrays from being constructed where the offset calculation could overflow 2. Add some other overflow checks As @adamreeve [pointed out](apache#9850 (comment)) on apache#9850 there are several places where the `i32` arithmetic is problematic in `FixedSizeBinaryArray`. I will fix them for real in a different, follow on PR, by switching to entirely `usize` based arithmetic for offset calculations However, since I hope to backport this PR to older releases, I would like something that is easy to review and has the least potential for unintended consequences. I added unit tests. However, I can't find any way to fully trigger the actual paths short of trying to allocate very large arrays, which I don't think is appropriate for unit tests. Better limit checking
# Which issue does this PR close? - related to apache#9850 # Rationale for this change While working on apache#9850 I noticed that FixedSizeBinaryArray::offset always returns zero. This was confusing to me and so I wanted to document it for future readrs # What changes are included in this PR? Add some comments # Are these changes tested? By CI # Are there any user-facing changes? No, this is internal comments
# Which issue does this PR close? - related to https://github.com/apache/arrow-rs/pull/9850/changes#r3170266352 # Rationale for this change While working on apache#9850 I felt it would help to have some better docs about what a FixedSizeBinaryArray actually was, so I made some # What changes are included in this PR? Add some more background / explanatory docs about this array # Are these changes tested? By CI # Are there any user-facing changes? Docs only
…#9872) # Which issue does this PR close? - Closes apache#9850 # Rationale for this change `FixedSizeBinaryArray::value_offset_at` works use `i32` arithmetic which can overflow. For offsets beyond `i32::MAX`, that can be bad # What changes are included in this PR? 1. Prevent any FixedSizedBinaryArrays from being constructed where the offset calculation could overflow 2. Add some other overflow checks As @adamreeve [pointed out](apache#9850 (comment)) on apache#9850 there are several places where the `i32` arithmetic is problematic in `FixedSizeBinaryArray`. I will fix them for real in a different, follow on PR, by switching to entirely `usize` based arithmetic for offset calculations However, since I hope to backport this PR to older releases, I would like something that is easy to review and has the least potential for unintended consequences. # Are these changes tested? I added unit tests. However, I can't find any way to fully trigger the actual paths short of trying to allocate very large arrays, which I don't think is appropriate for unit tests. # Are there any user-facing changes? Better limit checking
Which issue does this PR close?
Rationale for this change
FixedSizeBinaryArray::value_offset_atcast the requested index toi32before multiplying by the element width. For indexes beyondi32::MAX, that truncation could produce a negative byte offset and causevalue()to read before the start of the value buffer.What changes are included in this PR?
Note I also added some more docs for FixedSizeBinaryArray that may help reviewers
Are these changes tested?
I can't find any way to test this this issue without actually allocating a large array (over 2GB)
Are there any user-facing changes?
Better limit checking