Right now, when the analyze option is configured and passed to df()->run(), the following statistics are collected for the dataframe:
- execution time (hrtime)
- total processed rows (int)
- schema (optionally collected, when configured)
- column statistics (optionally collected when configured)
Another interesting metric that I believe we can collect next to hrtime is memory consumption, through memory_get_usage(true).
That would simplify monitoring of pipelines.
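
To make the proposal concrete, here is a minimal sketch of how memory could be sampled next to hrtime. It is not the library's implementation: the `measure()` helper and the keys of the array it returns are hypothetical names used only for illustration. Only `hrtime()`, `memory_get_usage()` and `memory_get_peak_usage()` are real PHP functions.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): runs a callable and reports
 * elapsed time together with memory readings taken before and after the run.
 *
 * @return array{result: mixed, duration_ns: int|float, memory_start_bytes: int, memory_end_bytes: int, memory_peak_bytes: int}
 */
function measure(callable $pipeline): array
{
    // true = report memory allocated from the system, not only what emalloc() uses
    $memoryStart = memory_get_usage(true);
    // true = return nanoseconds as a single number instead of a [seconds, nanoseconds] array
    $timeStart = hrtime(true);

    $result = $pipeline();

    $durationNs = hrtime(true) - $timeStart;
    $memoryEnd = memory_get_usage(true);

    return [
        'result' => $result,
        'duration_ns' => $durationNs,
        'memory_start_bytes' => $memoryStart,
        'memory_end_bytes' => $memoryEnd,
        // Peak is tracked for the whole process, so it may include work done before this call
        'memory_peak_bytes' => memory_get_peak_usage(true),
    ];
}

// Usage: the closure stands in for a real pipeline run
$report = measure(static fn (): int => array_sum(range(1, 1000)));

printf(
    "duration: %.3f ms, memory: %d -> %d bytes, peak: %d bytes\n",
    $report['duration_ns'] / 1e6,
    $report['memory_start_bytes'],
    $report['memory_end_bytes'],
    $report['memory_peak_bytes']
);
```

Because memory_get_usage(true) reports memory in allocator-sized chunks, the start and end values can be identical for small pipelines, so the peak reading is probably the more useful number for monitoring.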