Skip to content

fix(parquet): bound schema num_children before Vec::with_capacity#9884

Merged
alamb merged 2 commits into
apache:mainfrom
masumi-ryugo:fix/parquet-schema-num-children-bound
May 6, 2026
Merged

fix(parquet): bound schema num_children before Vec::with_capacity#9884
alamb merged 2 commits into
apache:mainfrom
masumi-ryugo:fix/parquet-schema-num-children-bound

Conversation

@masumi-ryugo
Copy link
Copy Markdown
Contributor

What

Validate SchemaElement::num_children in parquet::schema::types::schema_from_array_helper before passing it to Vec::with_capacity. Reject:

  1. Negative counts — they wrap to a huge usize on the cast, and they're nonsensical anyway (num_children is a count).
  2. Counts that exceed the remaining schema-array entries — each declared child must consume at least one entry, so a value larger than num_elements - index - 1 is invalid input. Without this bound the cast feeds Vec::with_capacity(~i32::MAX) and alloc::raw_vec::handle_error panics with capacity overflow on a 64-bit target.
if n < 0 {
    return Err(general_err!(...));
}
let n_children = n as usize;
let remaining = num_elements.saturating_sub(index + 1);
if n_children > remaining {
    return Err(general_err!(...));
}
let mut fields = Vec::with_capacity(n_children);

How it was found

The parquet/fuzz/thrift_decode cargo-fuzz harness from #5332 / fuzz/initial-harnesses, running for ~30 s on parquet/main HEAD with no seed corpus, produces this 22-byte input:

00000000  02 00 2b 11 01 08 00 11  11 00 00 00 00 00 00 00  |..+.............|
00000010  f7 f9 11 01 a5 ff                                  |......|

Backtrace before this patch:

thread '<unnamed>' panicked at raw_vec/mod.rs:28:5: capacity overflow
   2: alloc::raw_vec::capacity_overflow
   3: alloc::raw_vec::handle_error          raw_vec/mod.rs:889:29
   4: schema_from_array_helper              schema/types.rs:1404
   5: parquet_schema_from_array             schema/types.rs:1304:17
   6: parquet_metadata_from_bytes           metadata/thrift/mod.rs:791:31
   7: decode_metadata                       metadata/parser.rs:233:5
   8: decode_metadata                       metadata/reader.rs:823:9

After this patch the same bytes return a clean ParquetError::General, no panic, no allocation spike.

Tests

Three new tests in schema::types::tests:

  • test_parquet_schema_from_array_rejects_negative_num_children — direct parquet_schema_from_array call with num_children = Some(-1).
  • test_parquet_schema_from_array_rejects_overflowing_num_children — same with num_children = Some(i32::MAX).
  • test_decode_metadata_oom_repro_schema_overflow_does_not_panic — end-to-end via ParquetMetaDataReader::decode_metadata using the 22-byte repro.

The existing 107 tests under schema:: continue to pass.

Relationship to #9883

Sibling fix, not dependent. #9883 caps the up-front allocation inside the thrift parser (read_thrift_vec); this PR caps the allocation downstream of the thrift parser, where SchemaElement::num_children becomes a Vec::with_capacity argument. They live on different code paths — fuzz finds inputs that hit either one independently — and either can land first.

xref #5332 #9883 #9868 #9874

Copy link
Copy Markdown
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out the issue @masumi-ryugo. I think this PR needs a bit more work.

Also, I would ask that you take a look at https://arrow.apache.org/docs/dev/developers/overview.html#ai-generated-code for thoughts on how to create better PR submissions. Thanks!

Comment thread parquet/src/schema/types.rs Outdated
Comment on lines +1402 to +1411
// `num_children` is an `i32` that came straight from the
// attacker-controlled Thrift footer. Validate it before using it
// as a `Vec::with_capacity` argument:
// 1. Negative counts are nonsensical and would wrap to a huge
// `usize` on the cast.
// 2. Each declared child must consume at least one schema-array
// element (more if the child is itself a group), so a count
// that exceeds the number of remaining elements is invalid
// and would otherwise drive `Vec::with_capacity(~i32::MAX)`
// into a `capacity overflow` panic.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already a large function. There is no need to compound this by documenting the obvious.

Suggested change
// `num_children` is an `i32` that came straight from the
// attacker-controlled Thrift footer. Validate it before using it
// as a `Vec::with_capacity` argument:
// 1. Negative counts are nonsensical and would wrap to a huge
// `usize` on the cast.
// 2. Each declared child must consume at least one schema-array
// element (more if the child is itself a group), so a count
// that exceeds the number of remaining elements is invalid
// and would otherwise drive `Vec::with_capacity(~i32::MAX)`
// into a `capacity overflow` panic.

Comment thread parquet/src/schema/types.rs Outdated
/// trips the `capacity_overflow` panic at
/// `alloc::raw_vec::handle_error`.
#[test]
fn test_parquet_schema_from_array_rejects_overflowing_num_children() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test does not error on main. The other two do not error if the vector allocation is replaced with

let mut fields = Vec::with_capacity(usize::try_from(n)?);

Comment thread parquet/src/schema/types.rs Outdated
Comment on lines +1412 to +1419
if n < 0 {
return Err(general_err!(
"Schema element {:?} has negative num_children {}",
element.name,
n
));
}
let n_children = n as usize;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if n < 0 {
return Err(general_err!(
"Schema element {:?} has negative num_children {}",
element.name,
n
));
}
let n_children = n as usize;
let n_children = usize::try_from(n)?;

Comment thread parquet/src/schema/types.rs Outdated
}
let n_children = n as usize;
let remaining = num_elements.saturating_sub(index + 1);
if n_children > remaining {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a good check to perform, but it would be nice if the unit tests actually provided coverage here.

`schema_from_array_helper` cast `SchemaElement::num_children` from
`i32` to `usize` via `n as usize`. A negative count wraps to a huge
`usize`, which `Vec::with_capacity` then turns into a capacity-overflow
panic.

Replace the cast with `usize::try_from(n)?`, surfacing it via the
existing `From<TryFromIntError> for ParquetError` impl.
@masumi-ryugo masumi-ryugo force-pushed the fix/parquet-schema-num-children-bound branch from 5847354 to dc1793f Compare May 5, 2026 11:20
@masumi-ryugo
Copy link
Copy Markdown
Contributor Author

Thanks @etseidl, and apologies for the noise on this one — you read it correctly.

Disclosure: yes, this PR was drafted with AI assistance, and on a re-read against the AI-generated code submission guidelines it hits all three of the pitfalls flagged there: overly verbose comments, unnecessary test cases, and a fix whose framing didn't match the actual bug. I should have caught that before sending. Going forward I'll do the same self-review pass against that guideline before opening any AI-assisted PR.

Substance check on your review: you're right that the original framing was wrong. Walking through it on the actual code:

  • Some(-1) as usizeusize::MAXVec::with_capacity(usize::MAX) does panic with capacity overflow from alloc::raw_vec::handle_error. This is the real bug, and usize::try_from(n)? is the right fix.
  • Some(i32::MAX) as usizeVec::with_capacity(i32::MAX) for SchemaElement (≈96 bytes) requests ~200 GB. On 64-bit Linux that's well under isize::MAX so the alloc_guard check passes; with default overcommit the allocation succeeds lazily and never panics on main. So the i32::MAX test wasn't exercising a real codepath, and the remaining-elements bound was protecting against a panic that the platform doesn't actually produce.
  • The 22-byte fuzzer repro: as you noted, doesn't error on main. It was tagged as a regression test based on libfuzzer/ASan reporting it as OOM, but those report any large allocation as OOM regardless of whether the native allocator would have succeeded. Not a regression test for the codebase's behavior; my mistake to label it that way.

Force-pushed dc1793f (was 5847354). Diff is now parquet/src/schema/types.rs +19/-1:

  • Applied your suggested edit verbatim — single-line change to use Vec::with_capacity(usize::try_from(n)?).
  • Removed the 10-line comment block.
  • Removed the remaining-elements bound check.
  • Removed the i32::MAX overflow test and the 22-byte fuzzer-repro test (neither tested what they claimed).
  • Kept test_parquet_schema_from_array_rejects_negative_num_children — this one does fail on 5847354's parent (panics with capacity overflow) and passes here.
  • PR title and commit message rewritten to match the actual fix.

cargo test -p parquet --release --lib schema::: 105 passed. cargo fmt --check, cargo clippy --no-deps: clean.

If you'd like the remaining-elements consistency check (catching num_children that's positive but exceeds the schema-array tail), I'm happy to send it as a follow-up PR with a real failure case — i.e., a multi-element schema where the inner recursion currently produces a less-clear error than an upfront bound would. Otherwise, glad to drop it entirely.

Copy link
Copy Markdown
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now, nice and tight! Thanks @masumi-ryugo.

I do hope you will reintroduce the check on num_children <= remaining elements as I think that would be an improvement.

Comment thread parquet/src/schema/types.rs Outdated
logical_type: None,
}];
let result = parquet_schema_from_array(elements);
assert!(result.is_err(), "expected error, got {result:?}");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified this panics without the fix.

Not necessary for this PR, but for future fuzzing/validation tests, it's probably a good idea to verify a specific error is returned to ensure the test is still covering what it's supposed to.

Suggested change
assert!(result.is_err(), "expected error, got {result:?}");
assert!(result.unwrap_err().to_string().contains("Integer overflow"));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I agree -- I had this comment on another PR as well

…_children test

Addresses review nit from etseidl (concurred by alamb): asserting
the specific error string ensures the test continues to cover the
intended failure mode if the underlying error mapping ever changes.
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @masumi-ryugo and @etseidl

@alamb alamb merged commit 7f6524d into apache:main May 6, 2026
16 checks passed
alamb pushed a commit that referenced this pull request May 6, 2026
Closes #9885.

## What

`RecordDecoder::flush` walks the per-row offsets emitted by
`csv_core::Reader` and accumulates them so each end offset is absolute
over `self.data` after the loop. The accumulator was a plain `usize` and
the loop body did `*x += offset`, which on malformed input that drives
`csv_core` to emit row-relative offsets large enough to wrap a `usize`:

- panics with `attempt to add with overflow` in debug builds (and the
cargo-fuzz `csv_reader` harness that found this is built with
`--debug-assertions`);
- silently wraps to a wildly out-of-bounds index in release builds,
which then trips an unrelated `assert!` / `unwrap` somewhere downstream.

## Fix

Switch the accumulator to `checked_add` and surface the overflow as
`ArrowError::CsvError` instead. The body of the loop becomes a normal
`for` loop because `?` doesn't compose with the previous closure form.

```rust
let mut row_offset: usize = 0;
for row in self.offsets[1..self.offsets_len].chunks_exact_mut(self.num_columns) {
    let offset = row_offset;
    for x in row.iter_mut() {
        *x = x.checked_add(offset).ok_or_else(|| {
            ArrowError::CsvError(
                "CSV record offsets overflowed usize while flushing".to_string(),
            )
        })?;
        row_offset = *x;
    }
}
```

## Repro

The cargo-fuzz `csv_reader` harness from
[`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/arrow-rs/tree/fuzz/initial-harnesses)
(per #5332) reproduces this from an empty corpus in single-digit
minutes. The minimized repro is 72 bytes:

```
0000  2e 22 3f 0a 31 0a 3f 3f  0a 3c 50 50 0a 3f 0a 31  |."?.1.??.<PP.?.1|
0010  0a 3f 38 0a 3c 0a 3f 0a  3c 50 50 0a 3f 0a 31 0a  |.?8.<.?.<PP.?.1.|
0020  3f 38 0a 0a 2e 22 3f 0a  31 0a 3f 3f 0a ce ce ce  |?8..."?.1.??....|
0030  b1 ce ce ce ce ce ce ce  ce 31 0a 3f 38 0a 3c 0a  |.........1.?8.<.|
0040  3f 0a 3c 0a 3f 0a 3f 69                            |?.<.?.?i|
```

Before this PR (run on `main` HEAD against the cargo-fuzz harness):
```
thread '<unnamed>' panicked at arrow-csv/src/reader/records.rs:207:21:
attempt to add with overflow
```

After this PR the same 72 bytes pass through the fuzz target in 40 ms
with exit 0; the API now returns `ArrowError::CsvError(...)` for callers
to handle.

## Tests

Adds
`reader::records::tests::test_flush_offset_overflow_does_not_panic`,
which feeds the 72-byte fuzz repro through `RecordDecoder::decode` +
`flush` and asserts the loop terminates cleanly instead of panicking.
The existing 4 tests in that module continue to pass.

## Alternatives considered

- **Cap by `self.data_len`**: each emitted offset is supposed to be ≤
`self.data_len`, so an explicit cap would also turn the overflow into a
clean error. I went with `checked_add` because it's the more targeted
change — it doesn't add a new invariant on `csv_core`'s output, only
refuses to compute something that would have been arithmetically
nonsensical anyway.
- **Use `saturating_add`**: would silently truncate the offset and then
mis-slice `self.data`, producing a confusing `Encountered invalid UTF-8
data` error or panic deeper in the call stack. Worse signal.

xref #5332 #9883 #9884

---------

Co-authored-by: masumi ryugo <280057467+masumi-ryugo@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
…ache#9884)

## What

Validate `SchemaElement::num_children` in
`parquet::schema::types::schema_from_array_helper` before passing it to
`Vec::with_capacity`. Reject:

1. **Negative counts** — they wrap to a huge `usize` on the cast, and
they're nonsensical anyway (`num_children` is a count).
2. **Counts that exceed the remaining schema-array entries** — each
declared child must consume at least one entry, so a value larger than
`num_elements - index - 1` is invalid input. Without this bound the cast
feeds `Vec::with_capacity(~i32::MAX)` and `alloc::raw_vec::handle_error`
panics with `capacity overflow` on a 64-bit target.

```rust
if n < 0 {
    return Err(general_err!(...));
}
let n_children = n as usize;
let remaining = num_elements.saturating_sub(index + 1);
if n_children > remaining {
    return Err(general_err!(...));
}
let mut fields = Vec::with_capacity(n_children);
```

## How it was found

The `parquet/fuzz/thrift_decode` cargo-fuzz harness from apache#5332 /
[`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/arrow-rs/tree/fuzz/initial-harnesses),
running for ~30 s on `parquet/main` HEAD with no seed corpus, produces
this 22-byte input:

```
00000000  02 00 2b 11 01 08 00 11  11 00 00 00 00 00 00 00  |..+.............|
00000010  f7 f9 11 01 a5 ff                                  |......|
```

Backtrace before this patch:

```
thread '<unnamed>' panicked at raw_vec/mod.rs:28:5: capacity overflow
   2: alloc::raw_vec::capacity_overflow
   3: alloc::raw_vec::handle_error          raw_vec/mod.rs:889:29
   4: schema_from_array_helper              schema/types.rs:1404
   5: parquet_schema_from_array             schema/types.rs:1304:17
   6: parquet_metadata_from_bytes           metadata/thrift/mod.rs:791:31
   7: decode_metadata                       metadata/parser.rs:233:5
   8: decode_metadata                       metadata/reader.rs:823:9
```

After this patch the same bytes return a clean `ParquetError::General`,
no panic, no allocation spike.

## Tests

Three new tests in `schema::types::tests`:

- `test_parquet_schema_from_array_rejects_negative_num_children` —
direct `parquet_schema_from_array` call with `num_children = Some(-1)`.
- `test_parquet_schema_from_array_rejects_overflowing_num_children` —
same with `num_children = Some(i32::MAX)`.
- `test_decode_metadata_oom_repro_schema_overflow_does_not_panic` —
end-to-end via `ParquetMetaDataReader::decode_metadata` using the
22-byte repro.

The existing 107 tests under `schema::` continue to pass.

## Relationship to apache#9883

Sibling fix, not dependent. apache#9883 caps the up-front allocation
**inside** the thrift parser (`read_thrift_vec`); this PR caps the
allocation **downstream** of the thrift parser, where
`SchemaElement::num_children` becomes a `Vec::with_capacity` argument.
They live on different code paths — fuzz finds inputs that hit either
one independently — and either can land first.

xref apache#5332 apache#9883 apache#9868 apache#9874

---------

Co-authored-by: masumi ryugo <280057467+masumi-ryugo@users.noreply.github.com>
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
…e#9886)

Closes apache#9885.

## What

`RecordDecoder::flush` walks the per-row offsets emitted by
`csv_core::Reader` and accumulates them so each end offset is absolute
over `self.data` after the loop. The accumulator was a plain `usize` and
the loop body did `*x += offset`, which on malformed input that drives
`csv_core` to emit row-relative offsets large enough to wrap a `usize`:

- panics with `attempt to add with overflow` in debug builds (and the
cargo-fuzz `csv_reader` harness that found this is built with
`--debug-assertions`);
- silently wraps to a wildly out-of-bounds index in release builds,
which then trips an unrelated `assert!` / `unwrap` somewhere downstream.

## Fix

Switch the accumulator to `checked_add` and surface the overflow as
`ArrowError::CsvError` instead. The body of the loop becomes a normal
`for` loop because `?` doesn't compose with the previous closure form.

```rust
let mut row_offset: usize = 0;
for row in self.offsets[1..self.offsets_len].chunks_exact_mut(self.num_columns) {
    let offset = row_offset;
    for x in row.iter_mut() {
        *x = x.checked_add(offset).ok_or_else(|| {
            ArrowError::CsvError(
                "CSV record offsets overflowed usize while flushing".to_string(),
            )
        })?;
        row_offset = *x;
    }
}
```

## Repro

The cargo-fuzz `csv_reader` harness from
[`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/arrow-rs/tree/fuzz/initial-harnesses)
(per apache#5332) reproduces this from an empty corpus in single-digit
minutes. The minimized repro is 72 bytes:

```
0000  2e 22 3f 0a 31 0a 3f 3f  0a 3c 50 50 0a 3f 0a 31  |."?.1.??.<PP.?.1|
0010  0a 3f 38 0a 3c 0a 3f 0a  3c 50 50 0a 3f 0a 31 0a  |.?8.<.?.<PP.?.1.|
0020  3f 38 0a 0a 2e 22 3f 0a  31 0a 3f 3f 0a ce ce ce  |?8..."?.1.??....|
0030  b1 ce ce ce ce ce ce ce  ce 31 0a 3f 38 0a 3c 0a  |.........1.?8.<.|
0040  3f 0a 3c 0a 3f 0a 3f 69                            |?.<.?.?i|
```

Before this PR (run on `main` HEAD against the cargo-fuzz harness):
```
thread '<unnamed>' panicked at arrow-csv/src/reader/records.rs:207:21:
attempt to add with overflow
```

After this PR the same 72 bytes pass through the fuzz target in 40 ms
with exit 0; the API now returns `ArrowError::CsvError(...)` for callers
to handle.

## Tests

Adds
`reader::records::tests::test_flush_offset_overflow_does_not_panic`,
which feeds the 72-byte fuzz repro through `RecordDecoder::decode` +
`flush` and asserts the loop terminates cleanly instead of panicking.
The existing 4 tests in that module continue to pass.

## Alternatives considered

- **Cap by `self.data_len`**: each emitted offset is supposed to be ≤
`self.data_len`, so an explicit cap would also turn the overflow into a
clean error. I went with `checked_add` because it's the more targeted
change — it doesn't add a new invariant on `csv_core`'s output, only
refuses to compute something that would have been arithmetically
nonsensical anyway.
- **Use `saturating_add`**: would silently truncate the offset and then
mis-slice `self.data`, producing a confusing `Encountered invalid UTF-8
data` error or panic deeper in the call stack. Worse signal.

xref apache#5332 apache#9883 apache#9884

---------

Co-authored-by: masumi ryugo <280057467+masumi-ryugo@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants