fix(arrow-csv): bound RecordDecoder::flush offset accumulation#9886

masumi-ryugo · 2026-05-03T15:34:49Z

Closes #9885.

What

RecordDecoder::flush walks the per-row offsets emitted by csv_core::Reader and accumulates them so each end offset is absolute over self.data after the loop. The accumulator was a plain usize and the loop body did *x += offset, which on malformed input that drives csv_core to emit row-relative offsets large enough to wrap a usize:

panics with attempt to add with overflow in debug builds (and the cargo-fuzz csv_reader harness that found this is built with --debug-assertions);
silently wraps to a wildly out-of-bounds index in release builds, which then trips an unrelated assert! / unwrap somewhere downstream.

Fix

Switch the accumulator to checked_add and surface the overflow as ArrowError::CsvError instead. The body of the loop becomes a normal for loop because ? doesn't compose with the previous closure form.

let mut row_offset: usize = 0;
for row in self.offsets[1..self.offsets_len].chunks_exact_mut(self.num_columns) {
    let offset = row_offset;
    for x in row.iter_mut() {
        *x = x.checked_add(offset).ok_or_else(|| {
            ArrowError::CsvError(
                "CSV record offsets overflowed usize while flushing".to_string(),
            )
        })?;
        row_offset = *x;
    }
}

Repro

The cargo-fuzz csv_reader harness from fuzz/initial-harnesses (per #5332) reproduces this from an empty corpus in single-digit minutes. The minimized repro is 72 bytes:

0000  2e 22 3f 0a 31 0a 3f 3f  0a 3c 50 50 0a 3f 0a 31  |."?.1.??.<PP.?.1|
0010  0a 3f 38 0a 3c 0a 3f 0a  3c 50 50 0a 3f 0a 31 0a  |.?8.<.?.<PP.?.1.|
0020  3f 38 0a 0a 2e 22 3f 0a  31 0a 3f 3f 0a ce ce ce  |?8..."?.1.??....|
0030  b1 ce ce ce ce ce ce ce  ce 31 0a 3f 38 0a 3c 0a  |.........1.?8.<.|
0040  3f 0a 3c 0a 3f 0a 3f 69                            |?.<.?.?i|

Before this PR (run on main HEAD against the cargo-fuzz harness):

thread '<unnamed>' panicked at arrow-csv/src/reader/records.rs:207:21:
attempt to add with overflow

After this PR the same 72 bytes pass through the fuzz target in 40 ms with exit 0; the API now returns ArrowError::CsvError(...) for callers to handle.

Tests

Adds reader::records::tests::test_flush_offset_overflow_does_not_panic, which feeds the 72-byte fuzz repro through RecordDecoder::decode + flush and asserts the loop terminates cleanly instead of panicking. The existing 4 tests in that module continue to pass.

Alternatives considered

Cap by self.data_len: each emitted offset is supposed to be ≤ self.data_len, so an explicit cap would also turn the overflow into a clean error. I went with checked_add because it's the more targeted change — it doesn't add a new invariant on csv_core's output, only refuses to compute something that would have been arithmetically nonsensical anyway.
Use saturating_add: would silently truncate the offset and then mis-slice self.data, producing a confusing Encountered invalid UTF-8 data error or panic deeper in the call stack. Worse signal.

xref #5332 #9883 #9884

Closes apache#9885. `RecordDecoder::flush` walks the per-row offsets emitted by `csv_core::Reader` and accumulates them so each end offset is absolute over `self.data` after the loop. The accumulator was a plain `usize` and the loop body did `*x += offset`, which on malformed input that drives `csv_core` to emit row-relative offsets large enough to wrap a `usize` panics with `attempt to add with overflow` in debug builds and silently wraps in release builds. The cargo-fuzz `csv_reader` harness being prototyped for apache#5332 reproduces this in single-digit minutes from an empty corpus with a 72-byte input. After this patch the same input returns `ArrowError::CsvError("CSV record offsets overflowed usize while flushing")` instead of panicking. Includes a regression test in `reader::records::tests` driving the 72-byte fuzz repro through `RecordDecoder::decode` + `flush`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

alamb

Thank you for this @masumi-ryugo

alamb · 2026-05-05T14:32:35Z

+            }
+            decoder.clear();
+        }
+        // Reaching this assertion at all means we did not panic.


there is no assertion here 🤔

alamb · 2026-05-05T14:32:44Z

+            }
+            input = &input[consumed..];
+            // The buggy version panics inside `flush()`; the patched version
+            // either returns rows or surfaces a clean `CsvError`.


it should always return an error, right? Can we change the test to follow similar pattern to existing tests like

let err = decoder.flush().unwrap_err(); assert_eq!("msg", err.to_to_string());

I think that will be more concise and clearer what the expected value is

- Restore the original two-line "what" comment on the offset rebase loop and drop the verbose "add with overflow / debug vs release" exposition; the code is self-explanatory. - Switch back to the existing `try_for_each` chain pattern instead of nested `for` loops, so the structure matches the rest of the file. - Replace the regression test's loop-and-discard structure with a white-box construction that stages the overflow directly. The test now uses the standard `flush().unwrap_err()` + `assert_eq` pattern (matching `test_invalid_fields`) and asserts the specific `Csv error: CSV record offsets overflowed usize while flushing` message rather than just "did not panic".

alamb

Thank you @masumi-ryugo

…e#9886) Closes apache#9885. ## What `RecordDecoder::flush` walks the per-row offsets emitted by `csv_core::Reader` and accumulates them so each end offset is absolute over `self.data` after the loop. The accumulator was a plain `usize` and the loop body did `*x += offset`, which on malformed input that drives `csv_core` to emit row-relative offsets large enough to wrap a `usize`: - panics with `attempt to add with overflow` in debug builds (and the cargo-fuzz `csv_reader` harness that found this is built with `--debug-assertions`); - silently wraps to a wildly out-of-bounds index in release builds, which then trips an unrelated `assert!` / `unwrap` somewhere downstream. ## Fix Switch the accumulator to `checked_add` and surface the overflow as `ArrowError::CsvError` instead. The body of the loop becomes a normal `for` loop because `?` doesn't compose with the previous closure form. ```rust let mut row_offset: usize = 0; for row in self.offsets[1..self.offsets_len].chunks_exact_mut(self.num_columns) { let offset = row_offset; for x in row.iter_mut() { *x = x.checked_add(offset).ok_or_else(|| { ArrowError::CsvError( "CSV record offsets overflowed usize while flushing".to_string(), ) })?; row_offset = *x; } } ``` ## Repro The cargo-fuzz `csv_reader` harness from [`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/arrow-rs/tree/fuzz/initial-harnesses) (per apache#5332) reproduces this from an empty corpus in single-digit minutes. The minimized repro is 72 bytes: ``` 0000 2e 22 3f 0a 31 0a 3f 3f 0a 3c 50 50 0a 3f 0a 31 |."?.1.??.<PP.?.1| 0010 0a 3f 38 0a 3c 0a 3f 0a 3c 50 50 0a 3f 0a 31 0a |.?8.<.?.<PP.?.1.| 0020 3f 38 0a 0a 2e 22 3f 0a 31 0a 3f 3f 0a ce ce ce |?8..."?.1.??....| 0030 b1 ce ce ce ce ce ce ce ce 31 0a 3f 38 0a 3c 0a |.........1.?8.<.| 0040 3f 0a 3c 0a 3f 0a 3f 69 |?.<.?.?i| ``` Before this PR (run on `main` HEAD against the cargo-fuzz harness): ``` thread '<unnamed>' panicked at arrow-csv/src/reader/records.rs:207:21: attempt to add with overflow ``` After this PR the same 72 bytes pass through the fuzz target in 40 ms with exit 0; the API now returns `ArrowError::CsvError(...)` for callers to handle. ## Tests Adds `reader::records::tests::test_flush_offset_overflow_does_not_panic`, which feeds the 72-byte fuzz repro through `RecordDecoder::decode` + `flush` and asserts the loop terminates cleanly instead of panicking. The existing 4 tests in that module continue to pass. ## Alternatives considered - **Cap by `self.data_len`**: each emitted offset is supposed to be ≤ `self.data_len`, so an explicit cap would also turn the overflow into a clean error. I went with `checked_add` because it's the more targeted change — it doesn't add a new invariant on `csv_core`'s output, only refuses to compute something that would have been arithmetically nonsensical anyway. - **Use `saturating_add`**: would silently truncate the offset and then mis-slice `self.data`, producing a confusing `Encountered invalid UTF-8 data` error or panic deeper in the call stack. Worse signal. xref apache#5332 apache#9883 apache#9884 --------- Co-authored-by: masumi ryugo <280057467+masumi-ryugo@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot added the arrow Changes to the arrow crate label May 3, 2026

alamb reviewed May 5, 2026

View reviewed changes

alamb approved these changes May 6, 2026

View reviewed changes

alamb merged commit ded985c into apache:main May 6, 2026
23 checks passed

alamb mentioned this pull request May 7, 2026

arrow-csv: integer overflow panic in Reader::records::flush #9885

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(arrow-csv): bound RecordDecoder::flush offset accumulation#9886