Improve the CSVExtractor by removing duplicated operations #1665
norberttech merged 1 commit into flow-php:1.x
Flow PHP - Benchmarks
Results of the benchmarks from this PR are compared with the results from the 1.x branch.
Extractors
+-----------------------+------------------------+------+-----+-----------------+-------------------+-----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-----------------------+------------------------+------+-----+-----------------+-------------------+-----------------+
| CSVExtractorBench | bench_extract_10k | 1 | 3 | 4.776mb -0.01% | 417.596ms -28.25% | ±0.57% -7.90% |
| ExcelExtractorBench | bench_extract_10k_ods | 1 | 3 | 65.486mb +0.00% | 1.044s -2.68% | ±0.65% +170.21% |
| ExcelExtractorBench | bench_extract_10k_xlsx | 1 | 3 | 67.532mb +0.00% | 1.670s -0.38% | ±0.20% -58.57% |
| JsonExtractorBench | bench_extract_10k | 1 | 3 | 5.018mb +0.00% | 1.284s +0.31% | ±2.71% +475.46% |
| ParquetExtractorBench | bench_extract_10k | 1 | 3 | 86.321mb +0.00% | 921.218ms -0.33% | ±0.33% -14.05% |
| TextExtractorBench | bench_extract_10k | 1 | 3 | 4.499mb +0.01% | 38.380ms -1.73% | ±0.26% -76.39% |
| XmlExtractorBench | bench_extract_10k | 1 | 3 | 4.494mb +0.01% | 604.170ms -0.01% | ±0.07% -77.08% |
+-----------------------+------------------------+------+-----+-----------------+-------------------+-----------------+
Transformers
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1 | 3 | 123.236mb +0.00% | 66.514ms +1.53% | ±0.81% +4.55% |
| RenameEachEntryTransformerBench | bench_transform_10k_rows | 1 | 3 | 18.498mb +0.00% | 72.976ms -0.43% | ±0.17% -52.31% |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+------------------+-----------------+----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+--------------------+----------------+------+-----+------------------+-----------------+----------------+
| CSVLoaderBench | bench_load_10k | 1 | 3 | 62.435mb -0.00% | 85.168ms -4.23% | ±0.94% +20.56% |
| JsonLoaderBench | bench_load_10k | 1 | 3 | 79.706mb +0.00% | 96.908ms -0.57% | ±1.09% -43.40% |
| ParquetLoaderBench | bench_load_10k | 1 | 3 | 165.387mb +0.00% | 20.705s -0.73% | ±0.10% -69.05% |
| TextLoaderBench | bench_load_10k | 1 | 3 | 17.805mb +0.00% | 30.994ms -2.80% | ±0.31% -0.25% |
+--------------------+----------------+------+-----+------------------+-----------------+----------------+
Building Blocks
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| EntryFactoryBench | bench_entry_factory | 1 | 3 | 101.784mb +0.00% | 648.727ms -1.19% | ±0.67% -43.26% |
| EntryFactoryBench | bench_entry_factory | 1 | 3 | 53.134mb +0.00% | 329.581ms +1.98% | ±0.96% +130.07% |
| EntryFactoryBench | bench_entry_factory | 1 | 3 | 14.384mb +0.00% | 68.681ms -5.04% | ±0.66% -76.11% |
| RowsBench | bench_chunk_10_on_10k | 2 | 3 | 93.389mb +0.00% | 3.516ms -5.99% | ±2.91% -16.13% |
| RowsBench | bench_diff_left_1k_on_10k | 2 | 3 | 110.758mb +0.00% | 235.154ms -0.44% | ±0.34% -67.90% |
| RowsBench | bench_diff_right_1k_on_10k | 2 | 3 | 93.478mb +0.00% | 23.450ms -3.43% | ±0.97% -25.93% |
| RowsBench | bench_drop_1k_on_10k | 2 | 3 | 94.264mb +0.00% | 1.689ms +4.65% | ±3.75% +5.03% |
| RowsBench | bench_drop_right_1k_on_10k | 2 | 3 | 94.264mb +0.00% | 1.583ms -6.10% | ±2.93% +30.87% |
| RowsBench | bench_entries_on_10k | 2 | 3 | 92.424mb +0.00% | 3.431ms -6.01% | ±2.23% -1.16% |
| RowsBench | bench_filter_on_10k | 2 | 3 | 92.953mb +0.00% | 16.304ms -2.26% | ±2.97% +138.88% |
| RowsBench | bench_find_on_10k | 2 | 3 | 92.953mb +0.00% | 15.472ms -3.45% | ±0.51% -23.66% |
| RowsBench | bench_find_one_on_10k | 10 | 3 | 91.642mb +0.00% | 2.000μs +0.30% | ±0.00% -100.00% |
| RowsBench | bench_first_on_10k | 10 | 3 | 91.642mb +0.00% | 0.400μs -20.00% | ±0.00% +0.00% |
| RowsBench | bench_flat_map_on_1k | 2 | 3 | 100.703mb +0.00% | 14.648ms -7.26% | ±0.55% -48.40% |
| RowsBench | bench_map_on_10k | 2 | 3 | 130.130mb +0.00% | 67.425ms -3.83% | ±1.01% -19.47% |
| RowsBench | bench_merge_1k_on_10k | 2 | 3 | 93.473mb +0.00% | 1.526ms +1.88% | ±0.61% -78.41% |
| RowsBench | bench_partition_by_on_10k | 2 | 3 | 96.841mb +0.00% | 62.472ms -1.39% | ±0.14% -82.36% |
| RowsBench | bench_remove_on_10k | 2 | 3 | 94.526mb +0.00% | 4.166ms +8.75% | ±2.86% -18.40% |
| RowsBench | bench_sort_asc_on_1k | 2 | 3 | 92.003mb +0.00% | 39.951ms -2.00% | ±0.82% -73.92% |
| RowsBench | bench_sort_by_on_1k | 2 | 3 | 92.004mb +0.00% | 40.135ms +0.25% | ±1.31% -43.94% |
| RowsBench | bench_sort_desc_on_1k | 2 | 3 | 92.003mb +0.00% | 39.377ms -4.71% | ±1.98% +25.67% |
| RowsBench | bench_sort_entries_on_1k | 2 | 3 | 94.085mb +0.00% | 8.285ms +0.49% | ±0.75% +35.89% |
| RowsBench | bench_sort_on_1k | 2 | 3 | 91.835mb +0.00% | 29.668ms -0.04% | ±2.58% +57.62% |
| RowsBench | bench_take_1k_on_10k | 10 | 3 | 91.642mb +0.00% | 14.382μs -2.03% | ±0.99% -14.08% |
| RowsBench | bench_take_right_1k_on_10k | 10 | 3 | 91.642mb +0.00% | 17.290μs +4.49% | ±2.64% -18.17% |
| RowsBench | bench_unique_on_1k | 2 | 3 | 110.759mb +0.00% | 239.544ms +1.49% | ±1.30% +221.91% |
| TypeDetectorBench | bench_type_detector | 1 | 3 | 42.070mb +0.00% | 430.178ms +2.05% | ±0.44% -33.82% |
| TypeDetectorBench | bench_type_detector | 1 | 3 | 11.448mb +0.00% | 85.267ms -0.21% | ±0.83% +57.51% |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## 1.x #1665 +/- ##
==========================================
- Coverage 82.08% 82.08% -0.01%
==========================================
Files 703 703
Lines 19064 19059 -5
==========================================
- Hits 15649 15644 -5
Misses 3415 3415
What are the performance benefits of this?
60b5542 to d9d65d8
I would say it's worth considering: a ~25-30% performance boost when reading 10k rows.
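To put that percentage in absolute terms, here is a minimal sketch that back-computes the 1.x baseline time from the CSVExtractorBench row in the Extractors table. It assumes the diff column is the relative change against the baseline, i.e. (current - baseline) / baseline. The helper name `baselineFromDiff` is hypothetical and is not part of Flow or the benchmark tooling.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: recover the baseline value from the current value and
 * a relative diff in percent, assuming diff = (current - baseline) / baseline.
 */
function baselineFromDiff(float $current, float $diffPercent): float
{
    return $current / (1.0 + $diffPercent / 100.0);
}

// Values taken from the CSVExtractorBench row: 417.596ms, -28.25%.
$currentMs = 417.596;
$baselineMs = baselineFromDiff($currentMs, -28.25);
$savedMs = $baselineMs - $currentMs;

printf("baseline: %.1f ms, saved: %.1f ms per 10k rows\n", $baselineMs, $savedMs);
```

Under that assumption the baseline comes out at roughly 582 ms, so the change saves on the order of 164 ms per 10k rows in this benchmark.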
Ha! It's indeed fast enough to make a difference! 🎉 So here are the results of the following benchmark.

Code used to generate benchmark dataset

```php
<?php

declare(strict_types=1);

use function Flow\ETL\DSL\from_array;
use function Flow\ETL\DSL\data_frame;
use function Flow\ETL\DSL\overwrite;
use function Flow\ETL\Adapter\CSV\to_csv;
use Faker\Factory;
use Flow\ETL\Rows;

include __DIR__ . '/../../../vendor/autoload.php';

$faker = Factory::create();

// Fixed product catalogue; order items reference these SKUs.
$skus = [
    ['sku' => 'SKU_0001', 'name' => 'Product 1', 'price' => $faker->randomFloat(2, 0, 500)],
    ['sku' => 'SKU_0002', 'name' => 'Product 2', 'price' => $faker->randomFloat(2, 0, 500)],
    ['sku' => 'SKU_0003', 'name' => 'Product 3', 'price' => $faker->randomFloat(2, 0, 500)],
    ['sku' => 'SKU_0004', 'name' => 'Product 4', 'price' => $faker->randomFloat(2, 0, 500)],
    ['sku' => 'SKU_0005', 'name' => 'Product 5', 'price' => $faker->randomFloat(2, 0, 500)],
];

// Lazily yields $count random orders so the whole dataset never sits in memory.
function generateOrders($faker, array $skus, int $count) : \Generator {
    for ($i = 0; $i < $count; $i++) {
        yield [
            'order_id' => $faker->uuid,
            'created_at' => $faker->dateTimeThisYear,
            'updated_at' => \random_int(0, 1) === 1 ? $faker->dateTimeThisMonth : null,
            'discount' => \random_int(0, 1) === 1 ? $faker->randomFloat(2, 0, 50) : null,
            'email' => $faker->email,
            'customer' => $faker->firstName . ' ' . $faker->lastName,
            'address' => [
                'street' => $faker->streetAddress,
                'city' => $faker->city,
                'zip' => $faker->postcode,
                'country' => $faker->country,
            ],
            'notes' => \array_map(
                static fn($i) => $faker->sentence,
                \range(1, $faker->numberBetween(1, 5))
            ),
            'items' => \array_map(
                static fn(int $index) => [
                    'sku' => $skus[$skuIndex = $faker->numberBetween(1, 4)]['sku'],
                    'quantity' => $faker->numberBetween(1, 10),
                    'price' => $skus[$skuIndex]['price']
                ],
                \range(1, $faker->numberBetween(1, 4))
            ),
        ];
    }
}

$ordersSchema = require __DIR__ . '/schema.php';

// Write 1M generated orders to CSV in batches of 10k rows.
data_frame()
    ->read(from_array(generateOrders($faker, $skus, 1_000_000))->withSchema($ordersSchema))
    ->saveMode(overwrite())
    ->write(to_csv(__DIR__ . '/dataset/orders.csv'))
    ->batchSize(10_000)
    ->run(function (Rows $rows) {
        echo "Generated {$rows->count()} rows\n";
    });
```

Schema Code

```php
<?php

use function Flow\ETL\DSL\schema;
use function Flow\ETL\DSL\uuid_schema;
use function Flow\ETL\DSL\datetime_schema;
use function Flow\ETL\DSL\float_schema;
use function Flow\ETL\DSL\str_schema;
use function Flow\ETL\DSL\struct_schema;
use function Flow\Types\DSL\type_structure;
use function Flow\Types\DSL\type_string;
use function Flow\Types\DSL\type_list;
use function Flow\ETL\DSL\list_schema;
use function Flow\Types\DSL\type_integer;
use function Flow\Types\DSL\type_float;

return schema(
    uuid_schema('order_id'),
    datetime_schema('created_at'),
    datetime_schema('updated_at', true),
    float_schema('discount', true),
    str_schema('email'),
    str_schema('customer'),
    struct_schema(
        'address',
        type_structure([
            'street' => type_string(),
            'city' => type_string(),
            'zip' => type_string(),
            'country' => type_string(),
        ])
    ),
    list_schema('notes', type_list(type_string())),
    list_schema('items', type_list(
        type_structure([
            'sku' => type_string(),
            'quantity' => type_integer(),
            'price' => type_float(),
        ])
    ))
);
```

Benchmark Code

```php
<?php

declare(strict_types=1);

use function Flow\ETL\DSL\data_frame;
use function Flow\ETL\Adapter\CSV\from_csv;
use Flow\ETL\Monitoring\Memory\Consumption;

include __DIR__ . '/../../../vendor/autoload.php';

$schema = require __DIR__ . '/schema.php';
$memory = new Consumption();

// Read the generated CSV with the schema applied and sample memory on each batch.
$report = data_frame()
    ->read(from_csv(__DIR__ . '/dataset/orders.csv')->withSchema($schema))
    ->run(function() use ($memory) {
        $memory->current();
    }, analyze: true);

echo "Total rows: " . \number_format($report->statistics()->totalRows()) . "\n";
echo "Processing time : {$report->statistics()->executionTime->highResolutionTime->toString()}\n";
echo "Memory Max usage : {$memory->max()->inMb()}Mb\n";
```

Benchmarks executed in nix-shell (the one from monorepo)
With the following php.ini:

```ini
date.timezone = UTC
max_execution_time = 0
error_reporting = 0
display_errors = Off
log_errors = Off
opcache.enable = 0
opcache.enable_cli = 0
realpath_cache_size = 0
zend.assertions = -1
max_input_time = 3600
max_input_nesting_level = 64
memory_limit = -1
post_max_size = 200M
upload_max_filesize = 150M
file_uploads = On
max_file_uploads = 20
short_open_tag = off
```

To make sure results are stable, I executed each benchmark 3 times on each branch.
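Repeating a benchmark a few times and checking the spread could be scripted roughly as below. This is only a sketch: `timeRuns` and `relativeStdDev` are hypothetical helpers, the `usleep` workload is a stand-in for the actual benchmark script, and the population standard deviation used here may differ from how the benchmark tooling computes its rstdev column.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: run $benchmark $runs times and return the wall-clock
 * timings in milliseconds, measured with the monotonic hrtime() clock.
 *
 * @return list<float>
 */
function timeRuns(callable $benchmark, int $runs = 3): array
{
    $timings = [];
    for ($i = 0; $i < $runs; $i++) {
        $start = hrtime(true); // nanoseconds
        $benchmark();
        $timings[] = (hrtime(true) - $start) / 1e6; // nanoseconds to milliseconds
    }

    return $timings;
}

/**
 * Hypothetical helper: relative standard deviation in percent,
 * using the population standard deviation.
 *
 * @param list<float> $timings
 */
function relativeStdDev(array $timings): float
{
    $mean = array_sum($timings) / count($timings);
    $variance = array_sum(array_map(
        static fn (float $t): float => ($t - $mean) ** 2,
        $timings
    )) / count($timings);

    return $mean > 0.0 ? sqrt($variance) / $mean * 100.0 : 0.0;
}

// Stand-in workload; replace with the code under test.
$timings = timeRuns(static function (): void {
    usleep(1_000);
});

printf(
    "runs: %d, mean: %.3f ms, rstdev: %.2f%%\n",
    count($timings),
    array_sum($timings) / count($timings),
    relativeStdDev($timings)
);
```

A low relative standard deviation across the runs suggests the timings are stable enough to compare between branches.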
Change Log
Added
Fixed
Changed
Removed
Deprecated
Security
Description