Feature Request / Improvement
Hello!
I'm using PyIceberg 0.7.1
I have a use case where I need to count rows matching a certain filter, and I was expecting this to be doable with PyIceberg as a metadata-only operation, given that manifest files record the row count of each data file.
I figured out this code to count rows:
```python
query = "col1 = 'val_X' AND col2 = 'val_Y' AND ..."
scan = table.scan(row_filter=query)
# to_duckdb returns a DuckDB connection with the scan registered as the view "data"
con = scan.to_duckdb("data")
res = con.sql("SELECT count(*) FROM data")
```
but this loads the filtered data (matching the query expression) into memory first and then computes the count.
I couldn't find a way to get the result without first converting to a DuckDB connection or a PyArrow table.
Is there a way to do such an operation without loading data into memory, i.e. as a metadata-only operation?
If not, I believe this would be a good feature to have in PyIceberg.
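As a possible workaround, here is a sketch of a metadata-only count, assuming `Table.scan(...).plan_files()` returns `FileScanTask` objects whose `file.record_count` carries the per-file row count from the manifests (this matches my reading of the PyIceberg API, but please correct me if the attribute names differ):

```python
def count_rows_from_metadata(table, row_filter):
    """Approximate row count from manifest metadata, without reading data files.

    Caveat (hypothetical sketch): this over-counts when a data file only
    partially matches the filter, and it ignores delete files, so it is
    only exact when files match the filter entirely and have no deletes.
    """
    scan = table.scan(row_filter=row_filter)
    # Sum the per-file record counts recorded in the manifests.
    return sum(task.file.record_count for task in scan.plan_files())
```

An exact count would still need to evaluate the residual filter against files that only partially match, so this is a lower-effort approximation rather than a full solution.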
I have also tried Daft, which is supposed to be a fully lazy, optimized query engine on top of PyIceberg tables, but it still seems to load data into memory, even when I do `.limit(1)`.