[Bug] Error in overwrite(): pyarrow.lib.ArrowInvalid: offset overflow with large dataset (~3M rows) #1491

@sundaresanr

Apache Iceberg version

0.8.1 (latest release)

Please describe the bug 🐞

I encountered the following error while calling overwrite() on a dataset with over 3 million rows (~1 GB as a Parquet file; ~6 GB as a PyArrow table):

pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

Backtrace:

txn.overwrite(pat, overwrite_filter=overwrite_filter)
  File ".../site-packages/pyiceberg/table/__init__.py", line 470, in overwrite
    for data_file in data_files:
  File ".../site-packages/pyiceberg/io/pyarrow.py", line 2636, in _dataframe_to_data_files
    partitions = _determine_partitions(spec=table_metadata.spec(), schema=table_metadata.schema(), arrow_table=df)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/pyiceberg/io/pyarrow.py", line 2726, in _determine_partitions
    arrow_table = arrow_table.take(sort_indices)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 2133, in pyarrow.lib._Tabular.take
  File ".../site-packages/pyarrow/compute.py", line 487, in take
    return call_function('take', [data, indices], options, memory_pool)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_compute.pyx", line 590, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 385, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

Seems to be related to:

apache/arrow#25822
apache/arrow#33049

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
