Skip to content

fix: ParquetError when reading corrupt parquet file with truncated data instead of Panic#9725

Merged
alamb merged 4 commits into
apache:mainfrom
xuzifu666:fix_panic
Apr 16, 2026
Merged

fix: ParquetError when reading corrupt parquet file with truncated data instead of Panic#9725
alamb merged 4 commits into
apache:mainfrom
xuzifu666:fix_panic

Conversation

@xuzifu666
Copy link
Copy Markdown
Member

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me -- thank you @xuzifu666

Copy link
Copy Markdown
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix @xuzifu666

@etseidl
Copy link
Copy Markdown
Contributor

etseidl commented Apr 15, 2026

I would feel more comfortable if there were a test for this, but I'm having a hard time reproducing. Probably not worth holding this up for since it's simply replacing asserts with errors. What do you think @alamb?

@xuzifu666
Copy link
Copy Markdown
Member Author

I would feel more comfortable if there were a test for this, but I'm having a hard time reproducing. Probably not worth holding this up for since it's simply replacing asserts with errors. What do you think @alamb?

Yes, this test is indeed difficult to build. I simulated a scenario with an error file as much as possible to verify that a ParquetError is returned when reading the file.

Comment thread parquet/src/file/metadata/reader.rs Outdated
read_and_check(f.as_file(), PageIndexPolicy::Skip).unwrap();
}

/// Test that corrupted parquet files return ParquetError instead of panicking
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test is too specific and will be hard to maintain over the long run as the programatic generation of bad data will be brittle (if we change how the thrift is written, for example, the truncation may go down a different path).

If we want to test this type of error condition I think we should either:

  1. Check in a broken parquet file
  2. (better) implement a fuzzer that breaks the parquet file in random ways (maybe truncates it arbitrarily, etc) and ensures there are no panics

But for this PR if we want coverage I recommend we check in the bad file rather than trying to generate it programatically. We can consider parquet fuzz tsting as a follow on

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your suggestion! Previous test way is not suitable, I had deleted it. Can we resolve this issue in this way for now? Parquet fuzz tsting could be a follow-up task~ @alamb

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed a ticket to track fuzz:

Comment thread parquet/src/file/metadata/reader.rs Outdated
read_and_check(f.as_file(), PageIndexPolicy::Skip).unwrap();
}

/// Test that corrupted parquet files return ParquetError instead of panicking
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed a ticket to track fuzz:

@alamb alamb merged commit 84b3454 into apache:main Apr 16, 2026
16 checks passed
@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 16, 2026

Thanks @xuzifu666 and @etseidl

Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
…ta instead of Panic (apache#9725)

# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Closes apache#9705

# Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

# What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

# Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

# Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Panic when reading corrupt parquet file with truncated data instead of ParquetError

3 participants