@norberttech norberttech commented Dec 27, 2023

Change Log

Added

  • After partitionBy only write or fetch actions are available
  • Moved all scalar functions to the ScalarFunctionChain abstract factory, which is implemented by every scalar function

Fixed

  • Double partitioning

Changed

Removed

  • Partitioning related configuration from Flow Execution Context

Deprecated

Security


Description

Closes: #885

Warning

I'm still not sure this is the best approach. Although it makes DX a bit better, it also makes it less clear whether partition pruning was applied. I'm still considering keeping everything except PartitionPruningOptimization.

Before:

$flow = (new Flow())
    ->read(from_csv(__FLOW_DATA__ . '/partitioned'))
    ->filterPartitions(all(ref('country')->equals(lit('pl')), ref('t_shirt_color')->equals(lit('green'))))
    ->collect()
    ->sortBy(ref('id'))
    ->write(to_output());

After:

$flow = (new Flow())
    ->read(from_csv(__FLOW_DATA__ . '/partitioned'))
    ->filter(all(ref('country')->equals(lit('pl')), ref('t_shirt_color')->equals(lit('green'))))
    ->collect()
    ->sortBy(ref('id'))
    ->write(to_output());

or

$flow = (new Flow())
    ->read(from_csv(__FLOW_DATA__ . '/partitioned'))
    ->filter(ref('country')->equals(lit('pl')))
    ->filter(ref('t_shirt_color')->equals(lit('green')))
    ->collect()
    ->sortBy(ref('id'))
    ->write(to_output());
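
To make the idea behind the examples above concrete, here is a minimal sketch of what partition pruning amounts to: paths whose partition values cannot match the filter are dropped before any file is opened. It is plain PHP, not Flow's implementation. The helper name `prune_paths`, the Hive-style `key=value` directory layout, the `part.csv` file name and the `us`/`red` values are all assumptions made for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of Flow): keeps only the paths whose
 * partition values equal every expected value.
 *
 * @param array<string, array<string, string>> $pathsWithPartitions path => [partition name => value]
 * @param array<string, string> $expected partition name => required value
 *
 * @return list<string>
 */
function prune_paths(array $pathsWithPartitions, array $expected): array
{
    $kept = [];

    foreach ($pathsWithPartitions as $path => $partitions) {
        foreach ($expected as $name => $value) {
            // A missing partition or a different value means the file
            // can be skipped without ever being read.
            if (($partitions[$name] ?? null) !== $value) {
                continue 2;
            }
        }

        $kept[] = $path;
    }

    return $kept;
}

// Assumed layout: one directory level per partition, Hive style.
$paths = [
    'partitioned/country=pl/t_shirt_color=green/part.csv' => ['country' => 'pl', 't_shirt_color' => 'green'],
    'partitioned/country=pl/t_shirt_color=red/part.csv' => ['country' => 'pl', 't_shirt_color' => 'red'],
    'partitioned/country=us/t_shirt_color=green/part.csv' => ['country' => 'us', 't_shirt_color' => 'green'],
];

$pruned = prune_paths($paths, ['country' => 'pl', 't_shirt_color' => 'green']);

var_export($pruned);
```

The sketch only handles equality on partition columns. A filter that mixes partition and non-partition columns, or uses other comparisons, is one of the edge cases that makes automatic detection harder.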

For now, the optimizer that detected whether a filter should be applied to partitions has been reverted: this approach has too many edge cases that I missed. We can come back to it later, but first we should implement:

  • path scanning before extraction, to determine whether there are any partitions at all
  • logger (to better understand what's happening inside)
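
The path-scanning step listed above could look roughly like the following sketch. It assumes Hive-style `key=value` directory segments and works on a plain list of path strings. The function names `partitions_from_path` and `detect_partition_keys` are hypothetical, not part of Flow, and the example paths are made up.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: extracts key=value directory segments from a path.
 *
 * @return array<string, string> partition name => value
 */
function partitions_from_path(string $path): array
{
    $partitions = [];

    // Only directory segments are inspected, the file name is ignored.
    foreach (explode('/', dirname($path)) as $segment) {
        if (preg_match('/^([^=\/]+)=([^=\/]+)$/', $segment, $matches) === 1) {
            $partitions[$matches[1]] = $matches[2];
        }
    }

    return $partitions;
}

/**
 * Hypothetical helper: returns the partition names shared by every path,
 * or an empty list when the dataset does not look partitioned.
 *
 * @param list<string> $paths
 *
 * @return list<string>
 */
function detect_partition_keys(array $paths): array
{
    $common = null;

    foreach ($paths as $path) {
        $keys = array_keys(partitions_from_path($path));
        // Keep only the names that appear in every path seen so far.
        $common = $common === null ? $keys : array_values(array_intersect($common, $keys));
    }

    return $common ?? [];
}

$partitioned = [
    'partitioned/country=pl/t_shirt_color=green/part.csv',
    'partitioned/country=us/t_shirt_color=red/part.csv',
];

var_export(detect_partition_keys($partitioned));
var_export(detect_partition_keys(['flat/file.csv']));
```

Knowing the partition names up front would let a later step decide whether a given filter touches partition columns at all, instead of guessing from the filter expression alone.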

github-actions bot commented Dec 27, 2023

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors
+-----------------------+-------------------+------+-----+------------------+-------------------+-----------------+
| benchmark             | subject           | revs | its | mem_peak         | mode              | rstdev          |
+-----------------------+-------------------+------+-----+------------------+-------------------+-----------------+
| AvroExtractorBench    | bench_extract_10k | 1    | 3   | 35.154mb +0.00%  | 752.422ms +4.24%  | ±1.19% -48.87%  |
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 4.801mb 0.00%    | 346.868ms +13.45% | ±0.59% -38.65%  |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 4.900mb -0.00%   | 980.243ms +5.12%  | ±0.43% +145.75% |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 239.613mb -0.00% | 1.142s -0.08%     | ±0.65% -65.64%  |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 4.676mb -0.18%   | 56.911ms +110.02% | ±0.35% +0.14%   |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 4.678mb -0.17%   | 463.283ms +11.33% | ±1.35% +125.65% |
+-----------------------+-------------------+------+-----+------------------+-------------------+-----------------+
Transformers
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark                   | subject                  | revs | its | mem_peak         | mode            | rstdev         |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 110.405mb -0.01% | 63.803ms +0.90% | ±0.29% -48.00% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode             | rstdev          |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| AvroLoaderBench    | bench_load_10k | 1    | 3   | 94.770mb -0.01%  | 450.967ms +1.03% | ±0.69% -79.45%  |
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 54.849mb -0.01%  | 71.880ms +1.14%  | ±0.48% +552.44% |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 105.431mb -0.07% | 55.177ms -3.83%  | ±0.98% +83.24%  |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 320.641mb -0.02% | 1.246s -0.79%    | ±1.23% +202.79% |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.726mb -0.04%  | 39.796ms -2.52%  | ±0.17% -82.45%  |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
Building Blocks
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark               | subject                    | revs | its | mem_peak         | mode             | rstdev          |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 116.049mb -0.03% | 392.976ms +2.28% | ±0.95% +13.25%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 59.767mb -0.06%  | 194.559ms +0.10% | ±0.37% -63.48%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.843mb +0.08%  | 41.413ms +3.43%  | ±0.79% -60.84%  |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 59.399mb +0.01%  | 325.969ms -1.84% | ±0.33% -75.06%  |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 14.322mb +0.03%  | 65.231ms -1.05%  | ±0.93% -34.52%  |
| RowsBench               | bench_chunk_10_on_10k      | 2    | 3   | 76.464mb +0.01%  | 3.915ms +3.10%   | ±3.20% +285.97% |
| RowsBench               | bench_diff_left_1k_on_10k  | 2    | 3   | 96.256mb +0.00%  | 180.691ms +1.12% | ±0.22% -70.42%  |
| RowsBench               | bench_diff_right_1k_on_10k | 2    | 3   | 74.782mb +0.01%  | 18.142ms +0.92%  | ±0.90% +30.36%  |
| RowsBench               | bench_drop_1k_on_10k       | 2    | 3   | 77.704mb +0.01%  | 1.921ms +12.49%  | ±0.47% -79.77%  |
| RowsBench               | bench_drop_right_1k_on_10k | 2    | 3   | 77.704mb +0.01%  | 1.934ms +15.50%  | ±0.93% +29.36%  |
| RowsBench               | bench_entries_on_10k       | 2    | 3   | 74.815mb +0.01%  | 2.617ms +6.05%   | ±2.84% -4.69%   |
| RowsBench               | bench_filter_on_10k        | 2    | 3   | 75.345mb +0.01%  | 14.119ms +0.88%  | ±1.78% +219.22% |
| RowsBench               | bench_find_on_10k          | 2    | 3   | 75.345mb +0.01%  | 14.627ms +4.35%  | ±2.43% +76.91%  |
| RowsBench               | bench_find_one_on_10k      | 10   | 3   | 73.248mb +0.01%  | 1.700μs +5.85%   | ±0.00% -100.00% |
| RowsBench               | bench_first_on_10k         | 10   | 3   | 73.248mb +0.01%  | 0.300μs 0.00%    | ±0.00% 0.00%    |
| RowsBench               | bench_flat_map_on_1k       | 2    | 3   | 86.870mb +0.01%  | 12.822ms +1.88%  | ±0.45% -74.00%  |
| RowsBench               | bench_map_on_10k           | 2    | 3   | 116.164mb +0.00% | 62.708ms +0.86%  | ±0.21% -85.04%  |
| RowsBench               | bench_merge_1k_on_10k      | 2    | 3   | 75.865mb +0.01%  | 1.446ms +19.50%  | ±1.83% +73.07%  |
| RowsBench               | bench_partition_by_on_10k  | 2    | 3   | 78.138mb +0.01%  | 37.267ms +5.19%  | ±1.54% +77.59%  |
| RowsBench               | bench_remove_on_10k        | 2    | 3   | 77.966mb +0.01%  | 3.875ms +0.16%   | ±2.41% +126.16% |
| RowsBench               | bench_sort_asc_on_1k       | 2    | 3   | 73.393mb +0.01%  | 40.204ms +3.05%  | ±1.63% -5.99%   |
| RowsBench               | bench_sort_by_on_1k        | 2    | 3   | 73.394mb +0.01%  | 39.854ms +1.89%  | ±0.73% +54.81%  |
| RowsBench               | bench_sort_desc_on_1k      | 2    | 3   | 73.393mb +0.01%  | 39.751ms +1.29%  | ±2.45% +71.02%  |
| RowsBench               | bench_sort_entries_on_1k   | 2    | 3   | 75.690mb +0.01%  | 7.534ms +2.16%   | ±0.36% -60.09%  |
| RowsBench               | bench_sort_on_1k           | 2    | 3   | 73.248mb +0.01%  | 29.338ms +1.50%  | ±1.15% +4.17%   |
| RowsBench               | bench_take_1k_on_10k       | 10   | 3   | 73.248mb +0.01%  | 13.520μs +0.10%  | ±1.27% +263.24% |
| RowsBench               | bench_take_right_1k_on_10k | 10   | 3   | 73.248mb +0.01%  | 16.106μs +0.51%  | ±0.29% -75.00%  |
| RowsBench               | bench_unique_on_1k         | 2    | 3   | 96.258mb +0.00%  | 182.588ms -1.87% | ±0.19% -74.87%  |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+

@norberttech norberttech merged commit 55c4d75 into flow-php:1.x Dec 28, 2023
@norberttech norberttech deleted the feature/partition-pruning-optimizer branch December 28, 2023 19:25
Successfully merging this pull request may close these issues.

Partition Pruning - optimizer