fix(parquet): bound schema num_children before Vec::with_capacity#9884
Conversation
etseidl
left a comment
There was a problem hiding this comment.
Thanks for pointing out the issue @masumi-ryugo. I think this PR needs a bit more work.
Also, I would ask that you take a look at https://arrow.apache.org/docs/dev/developers/overview.html#ai-generated-code for thoughts on how to create better PR submissions. Thanks!
| // `num_children` is an `i32` that came straight from the | ||
| // attacker-controlled Thrift footer. Validate it before using it | ||
| // as a `Vec::with_capacity` argument: | ||
| // 1. Negative counts are nonsensical and would wrap to a huge | ||
| // `usize` on the cast. | ||
| // 2. Each declared child must consume at least one schema-array | ||
| // element (more if the child is itself a group), so a count | ||
| // that exceeds the number of remaining elements is invalid | ||
| // and would otherwise drive `Vec::with_capacity(~i32::MAX)` | ||
| // into a `capacity overflow` panic. |
There was a problem hiding this comment.
This is already a large function. There is no need to compound this by documenting the obvious.
| // `num_children` is an `i32` that came straight from the | |
| // attacker-controlled Thrift footer. Validate it before using it | |
| // as a `Vec::with_capacity` argument: | |
| // 1. Negative counts are nonsensical and would wrap to a huge | |
| // `usize` on the cast. | |
| // 2. Each declared child must consume at least one schema-array | |
| // element (more if the child is itself a group), so a count | |
| // that exceeds the number of remaining elements is invalid | |
| // and would otherwise drive `Vec::with_capacity(~i32::MAX)` | |
| // into a `capacity overflow` panic. |
| /// trips the `capacity_overflow` panic at | ||
| /// `alloc::raw_vec::handle_error`. | ||
| #[test] | ||
| fn test_parquet_schema_from_array_rejects_overflowing_num_children() { |
There was a problem hiding this comment.
This test does not error on main. The other two do not error if the vector allocation is replaced with
let mut fields = Vec::with_capacity(usize::try_from(n)?);| if n < 0 { | ||
| return Err(general_err!( | ||
| "Schema element {:?} has negative num_children {}", | ||
| element.name, | ||
| n | ||
| )); | ||
| } | ||
| let n_children = n as usize; |
There was a problem hiding this comment.
| if n < 0 { | |
| return Err(general_err!( | |
| "Schema element {:?} has negative num_children {}", | |
| element.name, | |
| n | |
| )); | |
| } | |
| let n_children = n as usize; | |
| let n_children = usize::try_from(n)?; |
| } | ||
| let n_children = n as usize; | ||
| let remaining = num_elements.saturating_sub(index + 1); | ||
| if n_children > remaining { |
There was a problem hiding this comment.
This is probably a good check to perform, but it would be nice if the unit tests actually provided coverage here.
`schema_from_array_helper` cast `SchemaElement::num_children` from `i32` to `usize` via `n as usize`. A negative count wraps to a huge `usize`, which `Vec::with_capacity` then turns into a capacity-overflow panic. Replace the cast with `usize::try_from(n)?`, surfacing it via the existing `From<TryFromIntError> for ParquetError` impl.
5847354 to
dc1793f
Compare
|
Thanks @etseidl, and apologies for the noise on this one — you read it correctly. Disclosure: yes, this PR was drafted with AI assistance, and on a re-read against the AI-generated code submission guidelines it hits all three of the pitfalls flagged there: overly verbose comments, unnecessary test cases, and a fix whose framing didn't match the actual bug. I should have caught that before sending. Going forward I'll do the same self-review pass against that guideline before opening any AI-assisted PR. Substance check on your review: you're right that the original framing was wrong. Walking through it on the actual code:
Force-pushed
If you'd like the remaining-elements consistency check (catching |
etseidl
left a comment
There was a problem hiding this comment.
Looks good now, nice and tight! Thanks @masumi-ryugo.
I do hope you will reintroduce the check on num_children <= remaining elements as I think that would be an improvement.
| logical_type: None, | ||
| }]; | ||
| let result = parquet_schema_from_array(elements); | ||
| assert!(result.is_err(), "expected error, got {result:?}"); |
There was a problem hiding this comment.
I verified this panics without the fix.
Not necessary for this PR, but for future fuzzing/validation tests, it's probably a good idea to verify a specific error is returned to ensure the test is still covering what it's supposed to.
| assert!(result.is_err(), "expected error, got {result:?}"); | |
| assert!(result.unwrap_err().to_string().contains("Integer overflow")); |
There was a problem hiding this comment.
yes I agree -- I had this comment on another PR as well
…_children test Addresses review nit from etseidl (concurred by alamb): asserting the specific error string ensures the test continues to cover the intended failure mode if the underlying error mapping ever changes.
alamb
left a comment
There was a problem hiding this comment.
Thank you @masumi-ryugo and @etseidl
Closes #9885. ## What `RecordDecoder::flush` walks the per-row offsets emitted by `csv_core::Reader` and accumulates them so each end offset is absolute over `self.data` after the loop. The accumulator was a plain `usize` and the loop body did `*x += offset`, which on malformed input that drives `csv_core` to emit row-relative offsets large enough to wrap a `usize`: - panics with `attempt to add with overflow` in debug builds (and the cargo-fuzz `csv_reader` harness that found this is built with `--debug-assertions`); - silently wraps to a wildly out-of-bounds index in release builds, which then trips an unrelated `assert!` / `unwrap` somewhere downstream. ## Fix Switch the accumulator to `checked_add` and surface the overflow as `ArrowError::CsvError` instead. The body of the loop becomes a normal `for` loop because `?` doesn't compose with the previous closure form. ```rust let mut row_offset: usize = 0; for row in self.offsets[1..self.offsets_len].chunks_exact_mut(self.num_columns) { let offset = row_offset; for x in row.iter_mut() { *x = x.checked_add(offset).ok_or_else(|| { ArrowError::CsvError( "CSV record offsets overflowed usize while flushing".to_string(), ) })?; row_offset = *x; } } ``` ## Repro The cargo-fuzz `csv_reader` harness from [`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/arrow-rs/tree/fuzz/initial-harnesses) (per #5332) reproduces this from an empty corpus in single-digit minutes. The minimized repro is 72 bytes: ``` 0000 2e 22 3f 0a 31 0a 3f 3f 0a 3c 50 50 0a 3f 0a 31 |."?.1.??.<PP.?.1| 0010 0a 3f 38 0a 3c 0a 3f 0a 3c 50 50 0a 3f 0a 31 0a |.?8.<.?.<PP.?.1.| 0020 3f 38 0a 0a 2e 22 3f 0a 31 0a 3f 3f 0a ce ce ce |?8..."?.1.??....| 0030 b1 ce ce ce ce ce ce ce ce 31 0a 3f 38 0a 3c 0a |.........1.?8.<.| 0040 3f 0a 3c 0a 3f 0a 3f 69 |?.<.?.?i| ``` Before this PR (run on `main` HEAD against the cargo-fuzz harness): ``` thread '<unnamed>' panicked at arrow-csv/src/reader/records.rs:207:21: attempt to add with overflow ``` After this PR the same 72 bytes pass through the fuzz target in 40 ms with exit 0; the API now returns `ArrowError::CsvError(...)` for callers to handle. ## Tests Adds `reader::records::tests::test_flush_offset_overflow_does_not_panic`, which feeds the 72-byte fuzz repro through `RecordDecoder::decode` + `flush` and asserts the loop terminates cleanly instead of panicking. The existing 4 tests in that module continue to pass. ## Alternatives considered - **Cap by `self.data_len`**: each emitted offset is supposed to be ≤ `self.data_len`, so an explicit cap would also turn the overflow into a clean error. I went with `checked_add` because it's the more targeted change — it doesn't add a new invariant on `csv_core`'s output, only refuses to compute something that would have been arithmetically nonsensical anyway. - **Use `saturating_add`**: would silently truncate the offset and then mis-slice `self.data`, producing a confusing `Encountered invalid UTF-8 data` error or panic deeper in the call stack. Worse signal. xref #5332 #9883 #9884 --------- Co-authored-by: masumi ryugo <280057467+masumi-ryugo@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ache#9884) ## What Validate `SchemaElement::num_children` in `parquet::schema::types::schema_from_array_helper` before passing it to `Vec::with_capacity`. Reject: 1. **Negative counts** — they wrap to a huge `usize` on the cast, and they're nonsensical anyway (`num_children` is a count). 2. **Counts that exceed the remaining schema-array entries** — each declared child must consume at least one entry, so a value larger than `num_elements - index - 1` is invalid input. Without this bound the cast feeds `Vec::with_capacity(~i32::MAX)` and `alloc::raw_vec::handle_error` panics with `capacity overflow` on a 64-bit target. ```rust if n < 0 { return Err(general_err!(...)); } let n_children = n as usize; let remaining = num_elements.saturating_sub(index + 1); if n_children > remaining { return Err(general_err!(...)); } let mut fields = Vec::with_capacity(n_children); ``` ## How it was found The `parquet/fuzz/thrift_decode` cargo-fuzz harness from apache#5332 / [`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/arrow-rs/tree/fuzz/initial-harnesses), running for ~30 s on `parquet/main` HEAD with no seed corpus, produces this 22-byte input: ``` 00000000 02 00 2b 11 01 08 00 11 11 00 00 00 00 00 00 00 |..+.............| 00000010 f7 f9 11 01 a5 ff |......| ``` Backtrace before this patch: ``` thread '<unnamed>' panicked at raw_vec/mod.rs:28:5: capacity overflow 2: alloc::raw_vec::capacity_overflow 3: alloc::raw_vec::handle_error raw_vec/mod.rs:889:29 4: schema_from_array_helper schema/types.rs:1404 5: parquet_schema_from_array schema/types.rs:1304:17 6: parquet_metadata_from_bytes metadata/thrift/mod.rs:791:31 7: decode_metadata metadata/parser.rs:233:5 8: decode_metadata metadata/reader.rs:823:9 ``` After this patch the same bytes return a clean `ParquetError::General`, no panic, no allocation spike. ## Tests Three new tests in `schema::types::tests`: - `test_parquet_schema_from_array_rejects_negative_num_children` — direct `parquet_schema_from_array` call with `num_children = Some(-1)`. - `test_parquet_schema_from_array_rejects_overflowing_num_children` — same with `num_children = Some(i32::MAX)`. - `test_decode_metadata_oom_repro_schema_overflow_does_not_panic` — end-to-end via `ParquetMetaDataReader::decode_metadata` using the 22-byte repro. The existing 107 tests under `schema::` continue to pass. ## Relationship to apache#9883 Sibling fix, not dependent. apache#9883 caps the up-front allocation **inside** the thrift parser (`read_thrift_vec`); this PR caps the allocation **downstream** of the thrift parser, where `SchemaElement::num_children` becomes a `Vec::with_capacity` argument. They live on different code paths — fuzz finds inputs that hit either one independently — and either can land first. xref apache#5332 apache#9883 apache#9868 apache#9874 --------- Co-authored-by: masumi ryugo <280057467+masumi-ryugo@users.noreply.github.com>
…e#9886) Closes apache#9885. ## What `RecordDecoder::flush` walks the per-row offsets emitted by `csv_core::Reader` and accumulates them so each end offset is absolute over `self.data` after the loop. The accumulator was a plain `usize` and the loop body did `*x += offset`, which on malformed input that drives `csv_core` to emit row-relative offsets large enough to wrap a `usize`: - panics with `attempt to add with overflow` in debug builds (and the cargo-fuzz `csv_reader` harness that found this is built with `--debug-assertions`); - silently wraps to a wildly out-of-bounds index in release builds, which then trips an unrelated `assert!` / `unwrap` somewhere downstream. ## Fix Switch the accumulator to `checked_add` and surface the overflow as `ArrowError::CsvError` instead. The body of the loop becomes a normal `for` loop because `?` doesn't compose with the previous closure form. ```rust let mut row_offset: usize = 0; for row in self.offsets[1..self.offsets_len].chunks_exact_mut(self.num_columns) { let offset = row_offset; for x in row.iter_mut() { *x = x.checked_add(offset).ok_or_else(|| { ArrowError::CsvError( "CSV record offsets overflowed usize while flushing".to_string(), ) })?; row_offset = *x; } } ``` ## Repro The cargo-fuzz `csv_reader` harness from [`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/arrow-rs/tree/fuzz/initial-harnesses) (per apache#5332) reproduces this from an empty corpus in single-digit minutes. The minimized repro is 72 bytes: ``` 0000 2e 22 3f 0a 31 0a 3f 3f 0a 3c 50 50 0a 3f 0a 31 |."?.1.??.<PP.?.1| 0010 0a 3f 38 0a 3c 0a 3f 0a 3c 50 50 0a 3f 0a 31 0a |.?8.<.?.<PP.?.1.| 0020 3f 38 0a 0a 2e 22 3f 0a 31 0a 3f 3f 0a ce ce ce |?8..."?.1.??....| 0030 b1 ce ce ce ce ce ce ce ce 31 0a 3f 38 0a 3c 0a |.........1.?8.<.| 0040 3f 0a 3c 0a 3f 0a 3f 69 |?.<.?.?i| ``` Before this PR (run on `main` HEAD against the cargo-fuzz harness): ``` thread '<unnamed>' panicked at arrow-csv/src/reader/records.rs:207:21: attempt to add with overflow ``` After this PR the same 72 bytes pass through the fuzz target in 40 ms with exit 0; the API now returns `ArrowError::CsvError(...)` for callers to handle. ## Tests Adds `reader::records::tests::test_flush_offset_overflow_does_not_panic`, which feeds the 72-byte fuzz repro through `RecordDecoder::decode` + `flush` and asserts the loop terminates cleanly instead of panicking. The existing 4 tests in that module continue to pass. ## Alternatives considered - **Cap by `self.data_len`**: each emitted offset is supposed to be ≤ `self.data_len`, so an explicit cap would also turn the overflow into a clean error. I went with `checked_add` because it's the more targeted change — it doesn't add a new invariant on `csv_core`'s output, only refuses to compute something that would have been arithmetically nonsensical anyway. - **Use `saturating_add`**: would silently truncate the offset and then mis-slice `self.data`, producing a confusing `Encountered invalid UTF-8 data` error or panic deeper in the call stack. Worse signal. xref apache#5332 apache#9883 apache#9884 --------- Co-authored-by: masumi ryugo <280057467+masumi-ryugo@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What
Validate
SchemaElement::num_childreninparquet::schema::types::schema_from_array_helperbefore passing it toVec::with_capacity. Reject:usizeon the cast, and they're nonsensical anyway (num_childrenis a count).num_elements - index - 1is invalid input. Without this bound the cast feedsVec::with_capacity(~i32::MAX)andalloc::raw_vec::handle_errorpanics withcapacity overflowon a 64-bit target.How it was found
The
parquet/fuzz/thrift_decodecargo-fuzz harness from #5332 /fuzz/initial-harnesses, running for ~30 s onparquet/mainHEAD with no seed corpus, produces this 22-byte input:Backtrace before this patch:
After this patch the same bytes return a clean
ParquetError::General, no panic, no allocation spike.Tests
Three new tests in
schema::types::tests:test_parquet_schema_from_array_rejects_negative_num_children— directparquet_schema_from_arraycall withnum_children = Some(-1).test_parquet_schema_from_array_rejects_overflowing_num_children— same withnum_children = Some(i32::MAX).test_decode_metadata_oom_repro_schema_overflow_does_not_panic— end-to-end viaParquetMetaDataReader::decode_metadatausing the 22-byte repro.The existing 107 tests under
schema::continue to pass.Relationship to #9883
Sibling fix, not dependent. #9883 caps the up-front allocation inside the thrift parser (
read_thrift_vec); this PR caps the allocation downstream of the thrift parser, whereSchemaElement::num_childrenbecomes aVec::with_capacityargument. They live on different code paths — fuzz finds inputs that hit either one independently — and either can land first.xref #5332 #9883 #9868 #9874