@norberttech
Member

@norberttech norberttech commented Jul 4, 2025

Resolves: #1755

Change Log


Added

Fixed

  • high memory consumption even when reading small chunks

Changed

Removed

Deprecated

Security

This is how it started: https://blackfire.io/profiles/fe7a4799-56bc-4cab-9171-31c3579d045b/graph
This is where we are now: https://blackfire.io/profiles/74a0e0b5-8d0c-4de3-a8a8-eb280a27c261/graph

@norberttech norberttech force-pushed the bug-parquet-reader-memory-consumption branch 3 times, most recently from 3376095 to 2caf0bb Compare July 4, 2025 15:26
@norberttech
Member Author

This is just a first step toward proper Parquet memory optimization. The next step is to optimize the writer (#1755) and bring back size-based control over row groups and pages.
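To make "size-based control over row groups" concrete, here is a minimal sketch of the general idea: buffer rows and close a group once an estimated byte budget would be exceeded. This is not the Flow PHP writer's implementation. The names `estimate_size` and `row_groups_by_size` and the size heuristic are hypothetical and exist only for illustration.

```python
from typing import Iterable, Iterator, List


def estimate_size(row: dict) -> int:
    """Rough byte estimate of a row (hypothetical heuristic, not a real encoder)."""
    return sum(len(str(key)) + len(str(value)) for key, value in row.items())


def row_groups_by_size(rows: Iterable[dict], max_bytes: int) -> Iterator[List[dict]]:
    """Yield lists of rows whose estimated total size stays within max_bytes.

    A single row larger than max_bytes still forms its own group.
    """
    group: List[dict] = []
    size = 0
    for row in rows:
        row_size = estimate_size(row)
        # Close the current group before it would exceed the budget.
        if group and size + row_size > max_bytes:
            yield group
            group, size = [], 0
        group.append(row)
        size += row_size
    if group:
        yield group


if __name__ == "__main__":
    sample = ({"id": i, "name": "x" * 10} for i in range(5))
    for number, group in enumerate(row_groups_by_size(sample, max_bytes=40)):
        print(number, len(group))
```

Because the function is a generator, only the rows of the group currently being built are held in memory, which is the property a size-based limit is meant to guarantee.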

@codecov

codecov bot commented Jul 4, 2025

Codecov Report

Attention: Patch coverage is 95.30516% with 40 lines in your changes missing coverage. Please review.

Project coverage is 81.58%. Comparing base (c9da137) to head (aa89f20).
Report is 2 commits behind head on 1.x.

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##              1.x    #1757      +/-   ##
==========================================
+ Coverage   81.30%   81.58%   +0.27%     
==========================================
  Files         717      716       -1     
  Lines       19948    20118     +170     
==========================================
+ Hits        16219    16413     +194     
+ Misses       3729     3705      -24     
+-----------------------------+------------------------------+
| Components                  | Coverage Δ                   |
+-----------------------------+------------------------------+
| etl                         | 88.41% <ø> (+0.01%) ⬆️       |
| cli                         | 85.46% <ø> (ø)               |
| lib-array-dot               | 94.56% <ø> (ø)               |
| lib-azure-sdk               | 61.35% <ø> (ø)               |
| lib-doctrine-dbal-bulk      | 93.88% <ø> (ø)               |
| lib-filesystem              | 78.02% <ø> (ø)               |
| lib-types                   | 53.43% <ø> (ø)               |
| lib-parquet                 | 85.46% <95.63%> (+1.28%) ⬆️  |
| lib-parquet-viewer          | 83.11% <ø> (ø)               |
| lib-snappy                  | 90.23% <ø> (-0.94%) ⬇️       |
| bridge-filesystem-async-aws | 90.38% <ø> (ø)               |
| bridge-filesystem-azure     | 89.92% <ø> (ø)               |
| bridge-monolog-http         | 97.04% <ø> (ø)               |
| symfony-http-foundation     | 74.41% <ø> (ø)               |
| adapter-chartjs             | 86.70% <ø> (ø)               |
| adapter-csv                 | 88.85% <ø> (ø)               |
| adapter-doctrine            | 89.89% <ø> (ø)               |
| adapter-elasticsearch       | 97.23% <ø> (ø)               |
| adapter-google-sheet        | 83.87% <ø> (ø)               |
| adapter-http                | 58.10% <ø> (ø)               |
| adapter-json                | 87.98% <ø> (ø)               |
| adapter-logger              | 53.84% <ø> (ø)               |
| adapter-meilisearch         | 97.95% <ø> (ø)               |
| adapter-parquet             | 78.92% <40.00%> (+0.28%) ⬆️  |
| adapter-text                | 84.44% <ø> (ø)               |
| adapter-xml                 | 82.73% <ø> (ø)               |
+-----------------------------+------------------------------+

@norberttech norberttech force-pushed the bug-parquet-reader-memory-consumption branch from bef9187 to 9a5aeed Compare July 8, 2025 02:04
@norberttech norberttech force-pushed the bug-parquet-reader-memory-consumption branch from 8b8aa80 to 609f9fc Compare July 9, 2025 01:38
@flow-php flow-php deleted a comment from github-actions bot Jul 9, 2025
@flow-php flow-php deleted a comment from github-actions bot Jul 9, 2025
@flow-php flow-php deleted a comment from github-actions bot Jul 9, 2025
@github-actions
Contributor

github-actions bot commented Jul 9, 2025

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors
+-----------------------+------------------------+------+-----+------------------+------------------+-----------------+
| benchmark             | subject                | revs | its | mem_peak         | mode             | rstdev          |
+-----------------------+------------------------+------+-----+------------------+------------------+-----------------+
| CSVExtractorBench     | bench_extract_10k      | 1    | 3   | 4.843mb +0.07%   | 438.108ms -0.00% | ±0.34% -45.20%  |
| ExcelExtractorBench   | bench_extract_10k_ods  | 1    | 3   | 65.538mb +0.01%  | 1.048s -2.67%    | ±1.87% +50.63%  |
| ExcelExtractorBench   | bench_extract_10k_xlsx | 1    | 3   | 67.584mb +0.01%  | 1.690s -1.70%    | ±0.58% -76.98%  |
| JsonExtractorBench    | bench_extract_10k      | 1    | 3   | 5.341mb +0.19%   | 1.131s -0.73%    | ±0.53% +341.74% |
| ParquetExtractorBench | bench_extract_10k      | 1    | 3   | 86.396mb -87.72% | 9.253s +921.62%  | ±0.43% -18.10%  |
| TextExtractorBench    | bench_extract_10k      | 1    | 3   | 4.565mb +0.08%   | 42.211ms -0.48%  | ±0.61% -53.92%  |
| XmlExtractorBench     | bench_extract_10k      | 1    | 3   | 4.551mb +0.08%   | 594.989ms -0.65% | ±0.41% -55.20%  |
+-----------------------+------------------------+------+-----+------------------+------------------+-----------------+
Transformers
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark                       | subject                  | revs | its | mem_peak         | mode            | rstdev         |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEachEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 18.562mb +0.02%  | 72.825ms +0.92% | ±2.06% +71.22% |
| RenameEntryTransformerBench     | bench_transform_10k_rows | 1    | 3   | 123.300mb +0.00% | 67.159ms +2.28% | ±0.57% +0.87%  |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+--------------------+------------------+-----------------+
| benchmark          | subject        | revs | its | mem_peak           | mode             | rstdev          |
+--------------------+----------------+------+-----+--------------------+------------------+-----------------+
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 62.504mb +0.01%    | 85.559ms -2.18%  | ±1.95% +82.11%  |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 80.585mb +0.00%    | 100.278ms -3.56% | ±1.38% +501.48% |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 166.296mb +402.30% | 19.055s +846.17% | ±0.10% -70.97%  |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.868mb +0.02%    | 29.937ms -1.11%  | ±0.26% -62.58%  |
+--------------------+----------------+------+-----+--------------------+------------------+-----------------+
Building Blocks
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark         | subject                    | revs | its | mem_peak         | mode             | rstdev          |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| TypeDetectorBench | bench_type_detector        | 1    | 3   | 42.512mb +0.01%  | 405.245ms -0.07% | ±0.40% -51.80%  |
| TypeDetectorBench | bench_type_detector        | 1    | 3   | 11.570mb +0.03%  | 82.029ms +0.13%  | ±1.14% +157.64% |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 105.982mb +0.00% | 655.258ms +0.84% | ±1.08% -56.62%  |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 55.256mb +0.01%  | 332.192ms +1.39% | ±0.88% +96.18%  |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.842mb +0.02%  | 69.412ms -0.38%  | ±1.16% +93.10%  |
| RowsBench         | bench_chunk_10_on_10k      | 2    | 3   | 93.453mb +0.00%  | 3.828ms -6.18%   | ±3.20% +166.33% |
| RowsBench         | bench_diff_left_1k_on_10k  | 2    | 3   | 110.823mb +0.00% | 237.260ms -0.66% | ±0.66% +579.73% |
| RowsBench         | bench_diff_right_1k_on_10k | 2    | 3   | 93.543mb +0.00%  | 23.837ms -0.06%  | ±0.18% -69.29%  |
| RowsBench         | bench_drop_1k_on_10k       | 2    | 3   | 94.327mb +0.00%  | 1.672ms +23.03%  | ±3.46% +251.14% |
| RowsBench         | bench_drop_right_1k_on_10k | 2    | 3   | 94.327mb +0.00%  | 1.745ms +28.62%  | ±1.45% -58.77%  |
| RowsBench         | bench_entries_on_10k       | 2    | 3   | 92.488mb +0.00%  | 3.496ms +2.23%   | ±1.10% +137.41% |
| RowsBench         | bench_filter_on_10k        | 2    | 3   | 93.017mb +0.00%  | 15.327ms -12.79% | ±0.39% -62.36%  |
| RowsBench         | bench_find_on_10k          | 2    | 3   | 93.017mb +0.00%  | 15.650ms -10.69% | ±0.97% +51.42%  |
| RowsBench         | bench_find_one_on_10k      | 10   | 3   | 91.706mb +0.00%  | 1.994μs +5.28%   | ±2.40% -5.08%   |
| RowsBench         | bench_first_on_10k         | 10   | 3   | 91.706mb +0.00%  | 0.400μs 0.00%    | ±0.00% 0.00%    |
| RowsBench         | bench_flat_map_on_1k       | 2    | 3   | 100.766mb +0.00% | 15.287ms +8.29%  | ±2.80% +323.98% |
| RowsBench         | bench_map_on_10k           | 2    | 3   | 130.194mb +0.00% | 68.415ms +4.12%  | ±0.48% +28.82%  |
| RowsBench         | bench_merge_1k_on_10k      | 2    | 3   | 93.537mb +0.00%  | 1.634ms +38.58%  | ±0.96% -19.97%  |
| RowsBench         | bench_partition_by_on_10k  | 2    | 3   | 96.906mb +0.00%  | 61.204ms -1.91%  | ±1.00% -25.08%  |
| RowsBench         | bench_remove_on_10k        | 2    | 3   | 94.590mb +0.00%  | 3.712ms +3.06%   | ±2.30% +33.78%  |
| RowsBench         | bench_sort_asc_on_1k       | 2    | 3   | 92.068mb +0.00%  | 41.904ms +6.67%  | ±2.11% +34.98%  |
| RowsBench         | bench_sort_by_on_1k        | 2    | 3   | 92.068mb +0.00%  | 41.599ms +1.61%  | ±0.48% -75.38%  |
| RowsBench         | bench_sort_desc_on_1k      | 2    | 3   | 92.068mb +0.00%  | 41.153ms +3.34%  | ±1.48% +33.95%  |
| RowsBench         | bench_sort_entries_on_1k   | 2    | 3   | 94.149mb +0.00%  | 7.982ms -0.18%   | ±0.83% -42.16%  |
| RowsBench         | bench_sort_on_1k           | 2    | 3   | 91.899mb +0.00%  | 29.946ms +0.83%  | ±0.38% -76.01%  |
| RowsBench         | bench_take_1k_on_10k       | 10   | 3   | 91.706mb +0.00%  | 14.446μs +0.70%  | ±3.13% +4.03%   |
| RowsBench         | bench_take_right_1k_on_10k | 10   | 3   | 91.706mb +0.00%  | 16.012μs +3.44%  | ±0.59% -26.92%  |
| RowsBench         | bench_unique_on_1k         | 2    | 3   | 110.824mb +0.00% | 243.024ms +0.74% | ±0.87% +10.64%  |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
Parquet Library
+--------------------+---------------------------------+------+-----+----------+-----------+--------+
| benchmark          | subject                         | revs | its | mem_peak | mode      | rstdev |
+--------------------+---------------------------------+------+-----+----------+-----------+--------+
| ParquetReaderBench | bench_page_headers              | 1    | 3   | 6.637mb  | 3.291s    | ±0.87% |
| ParquetReaderBench | bench_read_metadata             | 1    | 3   | 5.322mb  | 17.997ms  | ±0.64% |
| ParquetReaderBench | bench_read_schema               | 1    | 3   | 5.322mb  | 18.137ms  | ±0.20% |
| ParquetReaderBench | bench_read_values_all_columns   | 1    | 3   | 9.042mb  | 5.634s    | ±0.48% |
| ParquetReaderBench | bench_read_values_single_column | 1    | 3   | 6.340mb  | 234.698ms | ±0.57% |
| ParquetReaderBench | bench_read_values_with_limit    | 1    | 3   | 6.857mb  | 28.516ms  | ±0.47% |
| ParquetWriterBench | bench_write_batch               | 1    | 3   | 9.456mb  | 165.481ms | ±0.19% |
| ParquetWriterBench | bench_write_gzip                | 1    | 3   | 9.767mb  | 177.358ms | ±0.87% |
| ParquetWriterBench | bench_write_row_by_row          | 1    | 3   | 9.456mb  | 164.507ms | ±0.54% |
| ParquetWriterBench | bench_write_snappy              | 1    | 3   | 9.456mb  | 164.143ms | ±0.57% |
| ParquetWriterBench | bench_write_uncompressed        | 1    | 3   | 9.582mb  | 163.738ms | ±0.77% |
+--------------------+---------------------------------+------+-----+----------+-----------+--------+
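The tables above show only the PR's values plus a percentage, so the 1.x numbers have to be derived. A small sketch for doing that, assuming each percentage is the relative change from the 1.x baseline, i.e. (current - baseline) / baseline; the helper name `baseline` is made up for this example.

```python
def baseline(current: float, change_percent: float) -> float:
    """Recover the baseline value from the current value and its relative change.

    Assumes change_percent = (current - baseline) / baseline * 100.
    """
    factor = 1 + change_percent / 100
    if factor <= 0:
        raise ValueError("a change of -100% or lower has no recoverable baseline")
    return current / factor


if __name__ == "__main__":
    # ParquetExtractorBench bench_extract_10k, values taken from the Extractors table.
    print(f"mem_peak on 1.x: {baseline(86.396, -87.72):.3f}mb")
    print(f"mode on 1.x:     {baseline(9.253, 921.62):.3f}s")
```

Under that assumption, the Parquet extractor's peak memory on 1.x works out to roughly 703.5mb against 86.396mb in this PR, while its run time on 1.x was roughly 0.9s against 9.253s.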

@norberttech norberttech merged commit d775693 into 1.x Jul 9, 2025
22 checks passed
@norberttech norberttech deleted the bug-parquet-reader-memory-consumption branch July 9, 2025 15:14
Development

Successfully merging this pull request may close these issues.

[Proposal]: Optimize Parquet Write Memory Consumption

3 participants