Applying Filter on Top-Level Struct Columns Throws Error #1778

@srilman

Apache Iceberg version

0.9.0 (latest release)

Please describe the bug 🐞

Applying a filter to a top-level struct column, such as a null or not-null check, raises an error. For example:

from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StructType, IntegerType, StringType
import pyiceberg.expressions as pe
import pyarrow as pa

catalog = SqlCatalog("sql_catalog", uri="sqlite:///:memory:")
catalog.create_namespace("ns")

schema = Schema(
    NestedField(1, "structs", StructType(
        NestedField(2, "id", IntegerType(), required=True),
        NestedField(3, "name", StringType(), required=True),
    )),
)
table = catalog.create_table("ns.struct_table", schema, "/tmp/wh/ns/struct_table")

df = pa.Table.from_pydict({
    "structs": [
        {"id": 1, "name": "a"},
        {"id": 2, "name": "b"},
        {"id": 3, "name": "c"},
    ]
}, schema=schema.as_arrow())
table.append(df)

print(list(table.scan(row_filter=pe.NotNull("structs")).plan_files()))
Traceback (most recent call last):
  File "/Users/slade/bodo/mono/develop/test.py", line 27, in <module>
    print(list(table.scan(row_filter=pe.NotNull("structs")).plan_files()))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slade/bodo/mono/develop/.pixi/envs/default/lib/python3.12/site-packages/pyiceberg/table/__init__.py", line 1697, in plan_files
    if manifest_evaluators[manifest_file.partition_spec_id](manifest_file)
       ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
  File "/Users/slade/bodo/mono/develop/.pixi/envs/default/lib/python3.12/site-packages/pyiceberg/expressions/__init__.py", line 201, in bind
    accessor = schema.accessor_for_field(field.field_id)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slade/bodo/mono/develop/.pixi/envs/default/lib/python3.12/site-packages/pyiceberg/schema.py", line 280, in accessor_for_field
    raise ValueError(f"Could not find accessor for field with id: {field_id}")
ValueError: Could not find accessor for field with id: 1

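The cause appears to be an intentional feature of the field-ID-to-accessor map built for schemas: struct fields themselves get no entry, only their leaf children do. See the docstring of class _BuildPositionAccessors, where field 3 ("location") is absent from the expected result: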

class _BuildPositionAccessors(SchemaVisitor[Dict[Position, Accessor]]):
    """A schema visitor for generating a field ID to accessor index.

    Example:
        >>> from pyiceberg.schema import Schema
        >>> from pyiceberg.types import *
        >>> schema = Schema(
        ...     NestedField(field_id=2, name="id", field_type=IntegerType(), required=False),
        ...     NestedField(field_id=1, name="data", field_type=StringType(), required=True),
        ...     NestedField(
        ...         field_id=3,
        ...         name="location",
        ...         field_type=StructType(
        ...             NestedField(field_id=5, name="latitude", field_type=FloatType(), required=False),
        ...             NestedField(field_id=6, name="longitude", field_type=FloatType(), required=False),
        ...         ),
        ...         required=True,
        ...     ),
        ...     schema_id=1,
        ...     identifier_field_ids=[1],
        ... )
        >>> result = build_position_accessors(schema)
        >>> expected = {
        ...     2: Accessor(position=0, inner=None),
        ...     1: Accessor(position=1, inner=None),
        ...     5: Accessor(position=2, inner=Accessor(position=0, inner=None)),
        ...     6: Accessor(position=2, inner=Accessor(position=1, inner=None))
        ... }
        >>> result == expected
        True
    """

But I'm not sure why it works this way. Looking at all the uses, I don't see a reason why the id_to_accessor map shouldn't include top-level structs. Is there a reason for this, or is it just a bug? If it's just a bug, I think this is a 1-2 line fix in _BuildPositionAccessors.
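
To make the question concrete, here is a minimal stand-alone sketch of the behavior described above. It is not pyiceberg's code: `Field`, `Accessor`, `build_accessors`, and the `include_structs` flag are hypothetical stand-ins that mimic the mapping shown in the docstring, so the difference between the current map and one that also indexes struct fields can be compared side by side.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Accessor:
    """Hypothetical stand-in: a position, optionally chained into a nested struct."""
    position: int
    inner: Optional["Accessor"] = None


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field; non-empty children means a struct."""
    field_id: int
    name: str
    children: Tuple["Field", ...] = ()


def build_accessors(fields: Tuple[Field, ...], include_structs: bool) -> Dict[int, Accessor]:
    """Map field IDs to accessors.

    With include_structs=False, struct fields get no entry of their own,
    matching the docstring example. With include_structs=True, the struct
    field is indexed as well (the change this issue asks about).
    """
    result: Dict[int, Accessor] = {}
    for position, field in enumerate(fields):
        if field.children:
            # Wrap each child accessor with the struct's position.
            inner_map = build_accessors(field.children, include_structs)
            for inner_id, inner in inner_map.items():
                result[inner_id] = Accessor(position, inner)
            if include_structs:
                result[field.field_id] = Accessor(position)
        else:
            result[field.field_id] = Accessor(position)
    return result


# Same shape as the schema in the reproduction above.
schema = (Field(1, "structs", (Field(2, "id"), Field(3, "name"))),)

current = build_accessors(schema, include_structs=False)
proposed = build_accessors(schema, include_structs=True)

print(1 in current)   # False: no accessor for the struct field itself
print(proposed[1])    # Accessor(position=0, inner=None)
```

In this sketch the only difference between the two modes is the single extra assignment for the struct field, which is the kind of small change suggested above. Whether the real visitor can be changed this simply is an assumption that would need checking against the library source.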

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
