Table.add_files fails for Parquet files with DecimalType columns stored as FIXED_LEN_BYTE_ARRAY when precision allows INT32/INT64 #2057

@CaptainEureka

Description

Apache Iceberg version

0.9.1 (latest release)

Please describe the bug 🐞

Table.add_files fails when a column declared as DecimalType in the Iceberg schema is physically stored as FIXED_LEN_BYTE_ARRAY in the Parquet file and the decimal's precision is small enough that Iceberg's preferred Parquet mapping would use INT32 or INT64 instead.

I see in the Iceberg spec that the mapping is correct on write. However, the current behaviour seems overly restrictive about the physical Parquet type of decimals when files are added. I believe this greatly limits the kinds of Parquet files that can be "added" to an Iceberg table this way.

Steps to Reproduce:

  1. Define an Iceberg table schema with a DecimalType column, for example, Decimal(10, 2).
    • Iceberg's preferred Parquet physical type for Decimal(10, 2) would be INT64.
  2. Create a Parquet file where the corresponding column for this Decimal(10, 2) is physically stored as FIXED_LEN_BYTE_ARRAY. The data itself is valid for Decimal(10, 2).
  3. Attempt to add this Parquet file to the Iceberg table using Table.add_files.

Behavior:

The Table.add_files operation fails, with the following error:

ValueError: Unexpected physical type FIXED_LEN_BYTE_ARRAY for DecimalType(10, 2) expected INT32

indicating a mismatch between the expected physical type (e.g., INT64) and the actual physical type (FIXED_LEN_BYTE_ARRAY) found in the Parquet file for the decimal column.
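For reference, the write-side mapping in the Iceberg spec depends only on precision: up to 9 digits maps to INT32, up to 18 digits to INT64, and anything larger to a fixed-length byte array. The helper below is a hypothetical illustration of that rule, not PyIceberg's actual code; under it, precision 10 maps to INT64.

```python
def preferred_parquet_physical_type(precision: int) -> str:
    """Return the physical type the Iceberg spec prefers when writing decimal(P, S).

    Hypothetical helper for illustration; the function name is not part of PyIceberg.
    """
    if not 1 <= precision <= 38:
        raise ValueError(f"Iceberg decimals need 1 <= precision <= 38, got {precision}")
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"


print(preferred_parquet_physical_type(10))
```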

Expected Behavior:

The Table.add_files operation should succeed and correctly read the decimal values from the FIXED_LEN_BYTE_ARRAY physical storage. The Iceberg reader/writer should be lenient about the physical storage format of decimals; failing that, the Table.add_files documentation should state this limitation.
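To illustrate why a lenient reader loses nothing: in Parquet, a decimal stored as FIXED_LEN_BYTE_ARRAY holds the unscaled integer as big-endian two's complement bytes, so it carries the same information as the INT32/INT64 encodings. The sketch below decodes such a value using only the standard library; `decode_fixed_decimal` is a hypothetical name, not a PyIceberg or pyarrow function.

```python
from decimal import Decimal


def decode_fixed_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode a Parquet FIXED_LEN_BYTE_ARRAY decimal value (hypothetical helper)."""
    # The bytes are the unscaled value in big-endian two's complement.
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # Build the Decimal from its parts so no context rounding is applied.
    return Decimal((sign, digits, -scale))


# A decimal(10, 2) value fits in 5 bytes; 1234 unscaled with scale 2 is 12.34.
raw_value = (1234).to_bytes(5, byteorder="big", signed=True)
print(decode_fixed_decimal(raw_value, scale=2))
```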

Environment:

  • Python version: 3.12.9
  • Parquet library and version: pyarrow 20.0.0

P.S. If this is just user error and I shouldn't be trying to do things this way I'd be happy to hear alternatives.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
