Skip to content

Conversation

@norberttech
Copy link
Member

Change Log

Added

Fixed

  • DataFrame::aggregate returns the same instance of dataframe with updated pipeline

Changed

Removed

Deprecated

Security


Description

As much as I hate using reflections I even more hate the idea of exposing Pipelines outside the dataframe. Pipelines are considered to be internal and should not be exposed as it's not something we want to provide BC promise.

I couldn't find a better compromise between keeping the data frame mutable (it's just a big builder for the pipeline) and allowing it to expose partial builders like in this case GroupBy. Since we are just building a pipeline, returning a new instance of its builder is highly problematic since it's already a different object so it can't be referenced anymore.

<?php

$df = df();
$df->read(...);
$df->withEntry(...);
$df->aggregate(...)
// without reflection this would not work anymore
$df->rename(...)
$rows = $df->fetch();

The goal of GroupedDataFrame is to not let users do something like for example DataFrame::groupBy()->run(). That would be highly problematic since it would be a source of memory leaks as grouping without aggregation is just collecting all rows. InnerBuilders for more advanced transformations are also improving DX since IDE will guide developers through all possible methods.

@github-actions
Copy link
Contributor

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
| benchmark             | subject           | revs | its | mem_peak         | mode             | rstdev          |
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
| AvroExtractorBench    | bench_extract_10k | 1    | 3   | 35.290mb +0.00%  | 847.026ms -0.53% | ±0.53% +29.39%  |
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 5.010mb +0.01%   | 344.675ms -0.68% | ±0.63% +23.47%  |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 5.165mb +0.00%   | 1.077s -1.59%    | ±0.13% -87.59%  |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 135.837mb +0.00% | 911.921ms -0.96% | ±0.46% -22.65%  |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 4.925mb +0.01%   | 35.490ms -1.39%  | ±2.58% +785.80% |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 4.931mb +0.01%   | 433.845ms -1.70% | ±0.87% +16.94%  |
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
Transformers
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark                   | subject                  | revs | its | mem_peak         | mode            | rstdev         |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 116.231mb +0.00% | 59.593ms -1.70% | ±0.58% -41.36% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode             | rstdev          |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| AvroLoaderBench    | bench_load_10k | 1    | 3   | 96.676mb +0.00%  | 452.794ms -1.90% | ±0.71% +36.10%  |
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 55.151mb +0.00%  | 67.738ms -2.68%  | ±1.74% +34.47%  |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 107.585mb +0.00% | 50.570ms -3.24%  | ±0.46% -20.00%  |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 227.005mb +0.00% | 1.412s -1.37%    | ±0.78% +222.28% |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.967mb +0.00%  | 37.618ms -4.13%  | ±0.23% -80.19%  |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
Building Blocks
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark               | subject                    | revs | its | mem_peak         | mode             | rstdev          |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| RowsBench               | bench_chunk_10_on_10k      | 2    | 3   | 87.051mb +0.00%  | 3.309ms -12.97%  | ±1.74% +126.82% |
| RowsBench               | bench_diff_left_1k_on_10k  | 2    | 3   | 102.649mb +0.00% | 188.537ms +0.19% | ±1.10% +119.81% |
| RowsBench               | bench_diff_right_1k_on_10k | 2    | 3   | 85.369mb +0.00%  | 18.714ms -2.01%  | ±0.59% -56.83%  |
| RowsBench               | bench_drop_1k_on_10k       | 2    | 3   | 88.291mb +0.00%  | 1.663ms -13.56%  | ±0.77% +123.65% |
| RowsBench               | bench_drop_right_1k_on_10k | 2    | 3   | 88.291mb +0.00%  | 1.632ms -19.57%  | ±1.32% -59.11%  |
| RowsBench               | bench_entries_on_10k       | 2    | 3   | 85.403mb +0.00%  | 2.473ms -15.43%  | ±3.13% +352.03% |
| RowsBench               | bench_filter_on_10k        | 2    | 3   | 85.932mb +0.00%  | 15.949ms -4.59%  | ±0.36% -28.03%  |
| RowsBench               | bench_find_on_10k          | 2    | 3   | 85.932mb +0.00%  | 15.984ms -7.00%  | ±1.08% +20.24%  |
| RowsBench               | bench_find_one_on_10k      | 10   | 3   | 83.836mb +0.00%  | 1.800μs -5.56%   | ±0.00% -100.00% |
| RowsBench               | bench_first_on_10k         | 10   | 3   | 83.836mb +0.00%  | 0.300μs -25.00%  | ±0.00% -100.00% |
| RowsBench               | bench_flat_map_on_1k       | 2    | 3   | 93.186mb +0.00%  | 12.535ms -8.56%  | ±1.69% +280.64% |
| RowsBench               | bench_map_on_10k           | 2    | 3   | 122.557mb +0.00% | 61.649ms -1.76%  | ±0.26% -81.79%  |
| RowsBench               | bench_merge_1k_on_10k      | 2    | 3   | 86.452mb +0.00%  | 1.223ms -22.95%  | ±2.07% -8.62%   |
| RowsBench               | bench_partition_by_on_10k  | 2    | 3   | 89.799mb +0.00%  | 62.461ms -4.68%  | ±0.54% -59.92%  |
| RowsBench               | bench_remove_on_10k        | 2    | 3   | 88.553mb +0.00%  | 3.839ms -4.62%   | ±0.84% -52.79%  |
| RowsBench               | bench_sort_asc_on_1k       | 2    | 3   | 83.914mb +0.00%  | 39.598ms -3.59%  | ±1.20% +49.71%  |
| RowsBench               | bench_sort_by_on_1k        | 2    | 3   | 83.915mb +0.00%  | 40.761ms -0.52%  | ±1.42% -30.34%  |
| RowsBench               | bench_sort_desc_on_1k      | 2    | 3   | 83.914mb +0.00%  | 40.496ms -2.55%  | ±0.28% -67.83%  |
| RowsBench               | bench_sort_entries_on_1k   | 2    | 3   | 86.278mb +0.00%  | 7.362ms -2.85%   | ±0.76% -25.88%  |
| RowsBench               | bench_sort_on_1k           | 2    | 3   | 83.836mb +0.00%  | 28.908ms -5.60%  | ±0.69% +43.27%  |
| RowsBench               | bench_take_1k_on_10k       | 10   | 3   | 83.836mb +0.00%  | 13.100μs -4.93%  | ±0.62% -30.87%  |
| RowsBench               | bench_take_right_1k_on_10k | 10   | 3   | 83.836mb +0.00%  | 15.888μs -3.67%  | ±0.60% +108.00% |
| RowsBench               | bench_unique_on_1k         | 2    | 3   | 102.651mb +0.00% | 191.701ms -0.32% | ±0.82% +41.67%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 116.729mb +0.00% | 519.883ms -6.72% | ±0.80% -71.20%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 60.207mb +0.00%  | 255.641ms -3.98% | ±0.81% +141.91% |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 15.140mb +0.00%  | 54.415ms -5.18%  | ±0.41% -70.00%  |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 59.966mb +0.00%  | 433.937ms -1.68% | ±0.40% +84.71%  |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 14.505mb +0.00%  | 85.211ms -2.81%  | ±0.55% -1.01%   |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+

@norberttech norberttech merged commit 42fe036 into flow-php:1.x Apr 27, 2024
@norberttech norberttech deleted the bug/data-frame-aggregation branch May 9, 2024 08:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant