### Apache Iceberg version

0.9.0 (latest release)

### Please describe the bug 🐞

Issue:
`table.inspect.entries()` fails when the table is a merge-on-read (MOR) table that contains delete files. The Iceberg MOR table is created via Apache Spark 3.5.0 with Iceberg 1.5.0 and is read via PyIceberg 0.9.0 using `StaticTable.from_metadata()`.
Stacktrace:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 table.inspect.entries()

File ~/Documents/project-repos/git-repos/lakehouse-health-analyzer/venv/lib/python3.12/site-packages/pyiceberg/table/inspect.py:208, in InspectTable.entries(self, snapshot_id)
    188 partition = entry.data_file.partition
    189 partition_record_dict = {
    190     field.name: partition[pos]
    191     for pos, field in enumerate(self.tbl.metadata.specs()[manifest.partition_spec_id].fields)
    192 }
    194 entries.append(
    195     {
    196         "status": entry.status.value,
    197         "snapshot_id": entry.snapshot_id,
    198         "sequence_number": entry.sequence_number,
    199         "file_sequence_number": entry.file_sequence_number,
    200         "data_file": {
    201             "content": entry.data_file.content,
    202             "file_path": entry.data_file.file_path,
    203             "file_format": entry.data_file.file_format,
    204             "partition": partition_record_dict,
    205             "record_count": entry.data_file.record_count,
    206             "file_size_in_bytes": entry.data_file.file_size_in_bytes,
    207             "column_sizes": dict(entry.data_file.column_sizes),
--> 208             "value_counts": dict(entry.data_file.value_counts),
    209             "null_value_counts": dict(entry.data_file.null_value_counts),
    210             "nan_value_counts": dict(entry.data_file.nan_value_counts),
    211             "lower_bounds": entry.data_file.lower_bounds,
    212             "upper_bounds": entry.data_file.upper_bounds,
    213             "key_metadata": entry.data_file.key_metadata,
    214             "split_offsets": entry.data_file.split_offsets,
    215             "equality_ids": entry.data_file.equality_ids,
    216             "sort_order_id": entry.data_file.sort_order_id,
    217             "spec_id": entry.data_file.spec_id,
    218         },
    219         "readable_metrics": readable_metrics,
    220     }
    221 )
    223 return pa.Table.from_pylist(
    224     entries,
    225     schema=entries_schema,
    226 )

TypeError: 'NoneType' object is not iterable
```
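The failure mode can be demonstrated in isolation: for position-delete entries the manifest's `value_counts` map can be null, so `entry.data_file.value_counts` is `None`, and calling `dict(None)` raises exactly this error.

```python
# Standalone illustration of the failure (not PyIceberg code):
# dict() cannot consume None, which is what the delete-file entry
# carries in its value_counts field.
value_counts = None  # what entry.data_file.value_counts holds here

try:
    dict(value_counts)
except TypeError as e:
    print(e)  # -> 'NoneType' object is not iterable
```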
### Replication

This issue can be reproduced with the steps below:
- Create an Iceberg MOR table using Spark 3.5.0 with Iceberg 1.5.0:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, array, rand

DW_PATH = '../warehouse'

spark = SparkSession.builder \
    .master("local[4]") \
    .appName("iceberg-mor-test") \
    .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,org.apache.spark:spark-avro_2.12:3.5.0') \
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .config('spark.sql.catalog.local', 'org.apache.iceberg.spark.SparkCatalog') \
    .config('spark.sql.catalog.local.type', 'hadoop') \
    .config('spark.sql.catalog.local.warehouse', DW_PATH) \
    .getOrCreate()

t1 = spark.range(10000).withColumn("year", lit(2023))
t1 = t1.withColumn(
    "business_vertical",
    array(lit("Retail"), lit("SME"), lit("Cor"), lit("Analytics"))
    .getItem((rand() * 4).cast("int")),
)

t1.coalesce(1).writeTo('local.db.pyic_mor_test').partitionedBy('year').using('iceberg') \
    .tableProperty('format-version', '2') \
    .tableProperty('write.delete.mode', 'merge-on-read') \
    .tableProperty('write.update.mode', 'merge-on-read') \
    .tableProperty('write.merge.mode', 'merge-on-read') \
    .create()
```
- Check the table properties to confirm the table is MOR:

```python
spark.sql("SHOW TBLPROPERTIES local.db.pyic_mor_test").show(truncate=False)
```

```
+-------------------------------+-------------------+
|key                            |value              |
+-------------------------------+-------------------+
|current-snapshot-id            |2543645387796664537|
|format                         |iceberg/parquet    |
|format-version                 |2                  |
|write.delete.mode              |merge-on-read      |
|write.merge.mode               |merge-on-read      |
|write.parquet.compression-codec|zstd               |
|write.update.mode              |merge-on-read      |
+-------------------------------+-------------------+
```
- Run an `UPDATE` statement to generate a delete file:

```python
spark.sql("UPDATE local.db.pyic_mor_test SET business_vertical = 'DataEngineering' WHERE id > 7000")
```
- Check that a delete file was generated:

```python
spark.table("local.db.pyic_mor_test.delete_files").show()
```

```
+-------+--------------------+-----------+-------+---------+------------+------------------+--------------------+------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
|content|           file_path|file_format|spec_id|partition|record_count|file_size_in_bytes|        column_sizes|value_counts|null_value_counts|nan_value_counts|        lower_bounds|        upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|    readable_metrics|
+-------+--------------------+-----------+-------+---------+------------+------------------+--------------------+------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
|      1|/Users/akashdeepg...|    PARQUET|      0|   {2023}|        2999|              4878|{2147483546 -> 21...|        NULL|             NULL|            NULL|{2147483546 -> [2...|{2147483546 -> [2...|        NULL|          [4]|        NULL|         NULL|{{NULL, NULL, NUL...|
+-------+--------------------+-----------+-------+---------+------------+------------------+--------------------+------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
```
- Read the Spark-created table from PyIceberg:

```python
from pyiceberg.table import StaticTable

# Using latest metadata.json path
metadata_path = "./warehouse/db/pyic_mor_test/metadata/v2.metadata.json"
table = StaticTable.from_metadata(metadata_path)

# This will break with the stacktrace provided above
table.inspect.entries()
```
### Issue found after debugging

After some debugging, I found that `inspect.entries()` breaks for MOR tables while reading the `*-delete.parquet` files present in the table.

While reading the delete file, `value_counts` comes back as null. I can see that the `ManifestEntryStatus` is `ADDED` and the `DataFile` content is `DataFileContent.POSITION_DELETES`, which seems correct.

I looked further into the manifest Avro file that holds the entries for the delete Parquet files, and `value_counts` is `NULL` there as well. That is why `entry.data_file.value_counts` comes back as null. The null `value_counts` can also be seen above in the output of the `delete_files` table query.
### Willingness to contribute