fix(arrow-csv): bound RecordDecoder::flush offset accumulation#9886
Merged
Conversation
Closes apache#9885. `RecordDecoder::flush` walks the per-row offsets emitted by `csv_core::Reader` and accumulates them so each end offset is absolute over `self.data` after the loop. The accumulator was a plain `usize` and the loop body did `*x += offset`, which on malformed input that drives `csv_core` to emit row-relative offsets large enough to wrap a `usize` panics with `attempt to add with overflow` in debug builds and silently wraps in release builds. The cargo-fuzz `csv_reader` harness being prototyped for apache#5332 reproduces this in single-digit minutes from an empty corpus with a 72-byte input. After this patch the same input returns `ArrowError::CsvError("CSV record offsets overflowed usize while flushing")` instead of panicking. Includes a regression test in `reader::records::tests` driving the 72-byte fuzz repro through `RecordDecoder::decode` + `flush`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alamb
reviewed
May 5, 2026
Contributor
alamb
left a comment
There was a problem hiding this comment.
Thank you for this @masumi-ryugo
| } | ||
| decoder.clear(); | ||
| } | ||
| // Reaching this assertion at all means we did not panic. |
| } | ||
| input = &input[consumed..]; | ||
| // The buggy version panics inside `flush()`; the patched version | ||
| // either returns rows or surfaces a clean `CsvError`. |
Contributor
There was a problem hiding this comment.
it should always return an error, right? Can we change the test to follow similar pattern to existing tests like
let err = decoder.flush().unwrap_err();
assert_eq!("msg", err.to_to_string());I think that will be more concise and clearer what the expected value is
- Restore the original two-line "what" comment on the offset rebase loop and drop the verbose "add with overflow / debug vs release" exposition; the code is self-explanatory. - Switch back to the existing `try_for_each` chain pattern instead of nested `for` loops, so the structure matches the rest of the file. - Replace the regression test's loop-and-discard structure with a white-box construction that stages the overflow directly. The test now uses the standard `flush().unwrap_err()` + `assert_eq` pattern (matching `test_invalid_fields`) and asserts the specific `Csv error: CSV record offsets overflowed usize while flushing` message rather than just "did not panic".
Rich-T-kid
pushed a commit
to Rich-T-kid/arrow-rs
that referenced
this pull request
Jun 2, 2026
…e#9886) Closes apache#9885. ## What `RecordDecoder::flush` walks the per-row offsets emitted by `csv_core::Reader` and accumulates them so each end offset is absolute over `self.data` after the loop. The accumulator was a plain `usize` and the loop body did `*x += offset`, which on malformed input that drives `csv_core` to emit row-relative offsets large enough to wrap a `usize`: - panics with `attempt to add with overflow` in debug builds (and the cargo-fuzz `csv_reader` harness that found this is built with `--debug-assertions`); - silently wraps to a wildly out-of-bounds index in release builds, which then trips an unrelated `assert!` / `unwrap` somewhere downstream. ## Fix Switch the accumulator to `checked_add` and surface the overflow as `ArrowError::CsvError` instead. The body of the loop becomes a normal `for` loop because `?` doesn't compose with the previous closure form. ```rust let mut row_offset: usize = 0; for row in self.offsets[1..self.offsets_len].chunks_exact_mut(self.num_columns) { let offset = row_offset; for x in row.iter_mut() { *x = x.checked_add(offset).ok_or_else(|| { ArrowError::CsvError( "CSV record offsets overflowed usize while flushing".to_string(), ) })?; row_offset = *x; } } ``` ## Repro The cargo-fuzz `csv_reader` harness from [`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/arrow-rs/tree/fuzz/initial-harnesses) (per apache#5332) reproduces this from an empty corpus in single-digit minutes. The minimized repro is 72 bytes: ``` 0000 2e 22 3f 0a 31 0a 3f 3f 0a 3c 50 50 0a 3f 0a 31 |."?.1.??.<PP.?.1| 0010 0a 3f 38 0a 3c 0a 3f 0a 3c 50 50 0a 3f 0a 31 0a |.?8.<.?.<PP.?.1.| 0020 3f 38 0a 0a 2e 22 3f 0a 31 0a 3f 3f 0a ce ce ce |?8..."?.1.??....| 0030 b1 ce ce ce ce ce ce ce ce 31 0a 3f 38 0a 3c 0a |.........1.?8.<.| 0040 3f 0a 3c 0a 3f 0a 3f 69 |?.<.?.?i| ``` Before this PR (run on `main` HEAD against the cargo-fuzz harness): ``` thread '<unnamed>' panicked at arrow-csv/src/reader/records.rs:207:21: attempt to add with overflow ``` After this PR the same 72 bytes pass through the fuzz target in 40 ms with exit 0; the API now returns `ArrowError::CsvError(...)` for callers to handle. ## Tests Adds `reader::records::tests::test_flush_offset_overflow_does_not_panic`, which feeds the 72-byte fuzz repro through `RecordDecoder::decode` + `flush` and asserts the loop terminates cleanly instead of panicking. The existing 4 tests in that module continue to pass. ## Alternatives considered - **Cap by `self.data_len`**: each emitted offset is supposed to be ≤ `self.data_len`, so an explicit cap would also turn the overflow into a clean error. I went with `checked_add` because it's the more targeted change — it doesn't add a new invariant on `csv_core`'s output, only refuses to compute something that would have been arithmetically nonsensical anyway. - **Use `saturating_add`**: would silently truncate the offset and then mis-slice `self.data`, producing a confusing `Encountered invalid UTF-8 data` error or panic deeper in the call stack. Worse signal. xref apache#5332 apache#9883 apache#9884 --------- Co-authored-by: masumi ryugo <280057467+masumi-ryugo@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #9885.
What
RecordDecoder::flushwalks the per-row offsets emitted bycsv_core::Readerand accumulates them so each end offset is absolute overself.dataafter the loop. The accumulator was a plainusizeand the loop body did*x += offset, which on malformed input that drivescsv_coreto emit row-relative offsets large enough to wrap ausize:attempt to add with overflowin debug builds (and the cargo-fuzzcsv_readerharness that found this is built with--debug-assertions);assert!/unwrapsomewhere downstream.Fix
Switch the accumulator to
checked_addand surface the overflow asArrowError::CsvErrorinstead. The body of the loop becomes a normalforloop because?doesn't compose with the previous closure form.Repro
The cargo-fuzz
csv_readerharness fromfuzz/initial-harnesses(per #5332) reproduces this from an empty corpus in single-digit minutes. The minimized repro is 72 bytes:Before this PR (run on
mainHEAD against the cargo-fuzz harness):After this PR the same 72 bytes pass through the fuzz target in 40 ms with exit 0; the API now returns
ArrowError::CsvError(...)for callers to handle.Tests
Adds
reader::records::tests::test_flush_offset_overflow_does_not_panic, which feeds the 72-byte fuzz repro throughRecordDecoder::decode+flushand asserts the loop terminates cleanly instead of panicking. The existing 4 tests in that module continue to pass.Alternatives considered
self.data_len: each emitted offset is supposed to be ≤self.data_len, so an explicit cap would also turn the overflow into a clean error. I went withchecked_addbecause it's the more targeted change — it doesn't add a new invariant oncsv_core's output, only refuses to compute something that would have been arithmetically nonsensical anyway.saturating_add: would silently truncate the offset and then mis-sliceself.data, producing a confusingEncountered invalid UTF-8 dataerror or panic deeper in the call stack. Worse signal.xref #5332 #9883 #9884