Skip to content

Prevent FixedSizeBinaryArray::value offset truncation#9850

Closed
alamb wants to merge 1 commit into
apache:mainfrom
alamb:codex/fixed-size-binary-offset-overflow
Closed

Prevent FixedSizeBinaryArray::value offset truncation#9850
alamb wants to merge 1 commit into
apache:mainfrom
alamb:codex/fixed-size-binary-offset-overflow

Conversation

@alamb

@alamb alamb commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

  • None.

Rationale for this change

FixedSizeBinaryArray::value_offset_at cast the requested index to i32 before multiplying by the element width. For indexes beyond i32::MAX, that truncation could produce a negative byte offset and cause value() to read before the start of the value buffer.

What changes are included in this PR?

  1. Check for offset overflow
  2. Adds regression tests

Note I also added some more docs for FixedSizeBinaryArray that may help reviewers

Are these changes tested?

I can't find any way to test this this issue without actually allocating a large array (over 2GB)

Are there any user-facing changes?

Better limit checking

@github-actions github-actions Bot added the arrow Changes to the arrow crate label Apr 29, 2026
@alamb alamb force-pushed the codex/fixed-size-binary-offset-overflow branch from e77819a to 7faf57d Compare April 29, 2026 17:42
/// checking for overflow.
#[inline]
fn value_offset_at(&self, i: usize) -> i32 {
self.value_length * i as i32

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this arithmetic can overflow if the result is larger than i32::MAX

@alamb alamb marked this pull request as ready for review April 29, 2026 18:22
/// Caller is responsible for ensuring that the index is within the bounds
/// of the array
/// of the array and the resulting byte offset fits in `i32`
pub unsafe fn value_unchecked(&self, i: usize) -> &[u8] {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to add new methods that are alternatives to value_offset and value_offset_at that return usize, so we don't need this limitation? Or at least update this method so it doesn't use them and doesn't suffer from i32 overflow. Because with this change, value now handles when value_offset is greater than max i32, but value_unchecked still doesn't.

I would expect value_unchecked to work correctly for all cases where value doesn't panic.

And existing code that uses value_unchecked might validate the index but not be aware of this hidden safety requirement.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short answer is yes. I also spent some more time reviewing the code in FixedSizeBinaryArray and I am now convinced there are several other miuses of i32 <-> usize . I am working on an improvement, though I worry it will be a larger PR

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb

alamb commented Apr 30, 2026

Copy link
Copy Markdown
Contributor Author

I also added some more docs for FixedSizeBinaryArray that may help reviewers

@alamb

alamb commented Apr 30, 2026

Copy link
Copy Markdown
Contributor Author

I have played around with several options for improving this code. I think there are several potential i32 math overflow issues, but fixing them all in a single PR is going to be somewhat hard to review and take some time

What I am thinking about is adding a new invariant to the FixedSizeArray constructor that prevents constructing arrays with value buffers larger than 2GB as a temporary workaround in one PR. Then I can fixup the actual arithmetic in a second:

alamb added a commit that referenced this pull request May 1, 2026
# Which issue does this PR close?

- related to #9850

# Rationale for this change

While working on #9850 I noticed
that FixedSizeBinaryArray::offset always returns zero.

This was confusing to me and so I wanted to document it for future
readrs

# What changes are included in this PR?

Add some comments

# Are these changes tested?

By CI

# Are there any user-facing changes?
No, this is internal comments
@alamb alamb marked this pull request as draft May 1, 2026 20:51
@alamb

alamb commented May 1, 2026

Copy link
Copy Markdown
Contributor Author

I have played around with several options for improving this code. I think there are several potential i32 math overflow issues, but fixing them all in a single PR is going to be somewhat hard to review and take some time

What I am thinking about is adding a new invariant to the FixedSizeArray constructor that prevents constructing arrays with value buffers larger than 2GB as a temporary workaround in one PR. Then I can fixup the actual arithmetic in a second:

I took this approach in the following PR:

I will create a follow on PR from that one that changes the internals of FixedSizeBinary to use usize arithmetic everywhere, but that will be a more invasive change

alamb added a commit that referenced this pull request May 1, 2026
# Which issue does this PR close?
- related to
https://github.com/apache/arrow-rs/pull/9850/changes#r3170266352

# Rationale for this change

While working on #9850 I felt it
would help to have some better docs about what a FixedSizeBinaryArray
actually was, so I made some

# What changes are included in this PR?

Add some more background / explanatory docs about this array

# Are these changes tested?

By CI
# Are there any user-facing changes?

Docs only
alamb added a commit that referenced this pull request May 4, 2026
# Which issue does this PR close?

- Closes #9850

# Rationale for this change

`FixedSizeBinaryArray::value_offset_at` works use `i32` arithmetic which
can overflow. For offsets beyond `i32::MAX`, that can be bad

# What changes are included in this PR?

1. Prevent any FixedSizedBinaryArrays from being constructed where the
offset calculation could overflow
2. Add some other overflow checks

As @adamreeve [pointed
out](#9850 (comment))
on #9850 there are several places
where the `i32` arithmetic is problematic in `FixedSizeBinaryArray`. I
will fix them for real in a different, follow on PR, by switching to
entirely `usize` based arithmetic for offset calculations

However, since I hope to backport this PR to older releases, I would
like something that is easy to review and has the least potential for
unintended consequences.

# Are these changes tested?

I added unit tests. However, I can't find any way to fully trigger the
actual paths short of trying to allocate very large arrays, which I
don't think is appropriate for unit tests.

# Are there any user-facing changes?
Better limit checking
@alamb alamb closed this in #9872 May 4, 2026
alamb added a commit to alamb/arrow-rs that referenced this pull request May 5, 2026
…#9872)

- Closes apache#9850

`FixedSizeBinaryArray::value_offset_at` works use `i32` arithmetic which
can overflow. For offsets beyond `i32::MAX`, that can be bad

1. Prevent any FixedSizedBinaryArrays from being constructed where the
offset calculation could overflow
2. Add some other overflow checks

As @adamreeve [pointed
out](apache#9850 (comment))
on apache#9850 there are several places
where the `i32` arithmetic is problematic in `FixedSizeBinaryArray`. I
will fix them for real in a different, follow on PR, by switching to
entirely `usize` based arithmetic for offset calculations

However, since I hope to backport this PR to older releases, I would
like something that is easy to review and has the least potential for
unintended consequences.

I added unit tests. However, I can't find any way to fully trigger the
actual paths short of trying to allocate very large arrays, which I
don't think is appropriate for unit tests.

Better limit checking
alamb added a commit that referenced this pull request May 6, 2026
…9872) (#9928)

- Part of #9858
- Fixes #9898 in 57.x releases

This PR:
- Backports #9872 from @alamb to
the `57_maintenance` line
- Supersedes the earlier closed attempt
#9850 referenced by the issue
alamb added a commit that referenced this pull request May 6, 2026
…9872) (#9917)

- Part of #9857
- Fixes #9898 in 56.x releases

This PR:
- Backports #9872 from @alamb to
the `56_maintenance` line
- Supersedes the earlier closed attempt
#9850 referenced by the issue
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
# Which issue does this PR close?

- related to apache#9850

# Rationale for this change

While working on apache#9850 I noticed
that FixedSizeBinaryArray::offset always returns zero.

This was confusing to me and so I wanted to document it for future
readrs

# What changes are included in this PR?

Add some comments

# Are these changes tested?

By CI

# Are there any user-facing changes?
No, this is internal comments
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
# Which issue does this PR close?
- related to
https://github.com/apache/arrow-rs/pull/9850/changes#r3170266352

# Rationale for this change

While working on apache#9850 I felt it
would help to have some better docs about what a FixedSizeBinaryArray
actually was, so I made some

# What changes are included in this PR?

Add some more background / explanatory docs about this array

# Are these changes tested?

By CI
# Are there any user-facing changes?

Docs only
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
…#9872)

# Which issue does this PR close?

- Closes apache#9850

# Rationale for this change

`FixedSizeBinaryArray::value_offset_at` works use `i32` arithmetic which
can overflow. For offsets beyond `i32::MAX`, that can be bad

# What changes are included in this PR?

1. Prevent any FixedSizedBinaryArrays from being constructed where the
offset calculation could overflow
2. Add some other overflow checks

As @adamreeve [pointed
out](apache#9850 (comment))
on apache#9850 there are several places
where the `i32` arithmetic is problematic in `FixedSizeBinaryArray`. I
will fix them for real in a different, follow on PR, by switching to
entirely `usize` based arithmetic for offset calculations

However, since I hope to backport this PR to older releases, I would
like something that is easy to review and has the least potential for
unintended consequences.

# Are these changes tested?

I added unit tests. However, I can't find any way to fully trigger the
actual paths short of trying to allocate very large arrays, which I
don't think is appropriate for unit tests.

# Are there any user-facing changes?
Better limit checking
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants