Closed
Description
Currently we only collect basic statistics about the processing pipeline, such as execution time, total row count, and schema.
The goal is to add column statistics similar to those Parquet provides.
This would significantly improve dataset analysis and speed up building schema definitions.
The stats we should start with are listed below:
All Columns
- distinct count
- nulls count
Int/Float/Date/DateTime
- max / min
String
- length
Map/List
- elements_count
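
To make the list above concrete, here is a minimal sketch of how these stats could be gathered in a single pass over the rows. It is written in Python purely for illustration and is not the project's API: the function `collect_column_stats`, the helpers, and the output keys are hypothetical. It also assumes that rows are plain dictionaries, that each column holds one type, and that "length" and "elements_count" are reported as min/max ranges, which the list above does not specify.

```python
from collections import defaultdict
from datetime import date


def _hashable(value):
    """Convert lists/maps into tuples so they can be stored in a set."""
    if isinstance(value, dict):
        items = sorted(value.items(), key=lambda kv: repr(kv[0]))
        return tuple((k, _hashable(v)) for k, v in items)
    if isinstance(value, (list, tuple)):
        return tuple(_hashable(v) for v in value)
    return value


def _track(col, min_key, max_key, value):
    """Update a running min/max pair stored under the given keys."""
    col[min_key] = value if min_key not in col else min(col[min_key], value)
    col[max_key] = value if max_key not in col else max(col[max_key], value)


def collect_column_stats(rows):
    """Hypothetical helper: per-column stats from an iterable of dict rows."""
    stats = {}
    distinct = defaultdict(set)  # exact distinct values, kept in memory

    for row in rows:
        for name, value in row.items():
            col = stats.setdefault(name, {"nulls_count": 0})
            if value is None:
                col["nulls_count"] += 1
                continue
            distinct[name].add(_hashable(value))

            if isinstance(value, bool):
                continue  # bool is an int subclass; skip min/max for it
            if isinstance(value, (int, float, date)):  # datetime is a date subclass
                _track(col, "min", "max", value)
            elif isinstance(value, str):
                _track(col, "min_length", "max_length", len(value))
            elif isinstance(value, (list, dict)):
                _track(col, "min_elements_count", "max_elements_count", len(value))

    for name, col in stats.items():
        col["distinct_count"] = len(distinct[name])
    return stats


rows = [
    {"id": 1, "name": "alice", "tags": ["a", "b"]},
    {"id": 2, "name": None, "tags": []},
    {"id": 2, "name": "bob", "tags": ["a", "b"]},
]
stats = collect_column_stats(rows)
```

Keeping every distinct value in a set gives an exact count but grows with the data, so a real implementation might prefer an approximate counter for large datasets.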