Added validator to Parquet Writer #807

Merged: norberttech merged 1 commit into flow-php:1.x from norberttech:feature/parquet-validation on Nov 20, 2023
@norberttech (Member)

Change Log

Added

  • validator to Parquet Writer

Fixed

Changed

Removed

Deprecated

Security


Description

Closes: #757

The main goal of this validator is to confirm that every required column is present in each row and that no required column receives a null value.
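
The two checks described above can be sketched roughly as follows. This is a language-neutral illustration written in Python, not the code added by this PR: `Column`, `ValidationError`, and `validate_row` are hypothetical names, and the sketch assumes a flat schema in which each column is simply marked required or optional.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical schema column; not Flow PHP's actual class."""
    name: str
    required: bool = True


class ValidationError(ValueError):
    """Raised when a row does not satisfy the schema (hypothetical)."""


def validate_row(schema: list[Column], row: dict) -> None:
    """Check one row against the schema before it is written."""
    for column in schema:
        if not column.required:
            # Optional columns may be absent or null.
            continue
        if column.name not in row:
            raise ValidationError(
                f"Column '{column.name}' is required but missing from the row"
            )
        if row[column.name] is None:
            raise ValidationError(
                f"Column '{column.name}' is required and cannot be null"
            )


schema = [Column("id"), Column("name"), Column("email", required=False)]

# Both rows pass: every required column is present and non-null.
validate_row(schema, {"id": 1, "name": "Alice"})
validate_row(schema, {"id": 2, "name": "Bob", "email": None})
```

Failing fast on the first violation keeps the sketch short; a real validator might instead collect all violations for a row, or also handle nested columns.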

@github-actions (Contributor)

Flow PHP - Benchmarks

Benchmark results from this PR are compared with the results from the 1.x branch.

Extractors
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
| benchmark             | subject           | revs | its | mem_peak         | mode             | rstdev          |
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
| AvroExtractorBench    | bench_extract_10k | 1    | 3   | 34.745mb +0.00%  | 1.132s -1.12%    | ±0.93% -22.95%  |
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 4.604mb +0.04%   | 303.457ms -1.55% | ±3.00% -6.45%   |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 4.769mb +0.04%   | 1.392s -0.95%    | ±1.99% +173.51% |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 239.474mb +0.00% | 1.570s -0.81%    | ±1.36% +802.09% |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 4.558mb +0.04%   | 24.300ms -2.86%  | ±1.69% +149.88% |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 4.558mb +0.04%   | 404.153ms +0.34% | ±0.30% +37.60%  |
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
Transformers
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark                   | subject                  | revs | its | mem_peak         | mode            | rstdev         |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 110.245mb +0.00% | 64.868ms +2.56% | ±0.85% -54.59% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode             | rstdev          |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| AvroLoaderBench    | bench_load_10k | 1    | 3   | 94.726mb +0.00%  | 442.100ms +0.88% | ±1.04% +55.52%  |
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 54.710mb +0.00%  | 70.645ms +0.92%  | ±0.42% -44.55%  |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 105.308mb +0.00% | 54.677ms +2.40%  | ±0.36% -62.14%  |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 320.781mb +0.00% | 1.482s +5.67%    | ±0.59% +230.76% |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.588mb +0.01%  | 41.400ms +0.21%  | ±0.42% -39.74%  |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
Building Blocks
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark               | subject                    | revs | its | mem_peak         | mode             | rstdev          |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| RowsBench               | bench_chunk_10_on_10k      | 2    | 3   | 76.292mb +0.00%  | 2.148ms -1.65%   | ±2.14% +39.13%  |
| RowsBench               | bench_diff_left_1k_on_10k  | 2    | 3   | 96.082mb +0.00%  | 182.383ms +1.88% | ±0.47% +23.06%  |
| RowsBench               | bench_diff_right_1k_on_10k | 2    | 3   | 74.608mb +0.00%  | 18.224ms +2.92%  | ±0.39% -30.85%  |
| RowsBench               | bench_drop_1k_on_10k       | 2    | 3   | 75.430mb +0.00%  | 1.822ms +13.74%  | ±2.78% +100.90% |
| RowsBench               | bench_drop_right_1k_on_10k | 2    | 3   | 75.430mb +0.00%  | 1.674ms +2.92%   | ±1.70% -25.68%  |
| RowsBench               | bench_entries_on_10k       | 2    | 3   | 74.644mb +0.00%  | 2.489ms +0.80%   | ±0.70% -58.81%  |
| RowsBench               | bench_filter_on_10k        | 2    | 3   | 75.173mb +0.00%  | 14.294ms +3.23%  | ±1.55% +143.63% |
| RowsBench               | bench_find_on_10k          | 2    | 3   | 75.173mb +0.00%  | 14.219ms +3.70%  | ±0.67% -35.23%  |
| RowsBench               | bench_find_one_on_10k      | 10   | 3   | 73.075mb +0.00%  | 1.606μs 0.00%    | ±2.89% 0.00%    |
| RowsBench               | bench_first_on_10k         | 10   | 3   | 73.075mb +0.00%  | 0.400μs +33.33%  | ±0.00% +0.00%   |
| RowsBench               | bench_flat_map_on_1k       | 2    | 3   | 86.632mb +0.00%  | 12.897ms +1.96%  | ±1.81% +354.26% |
| RowsBench               | bench_map_on_10k           | 2    | 3   | 115.992mb +0.00% | 64.272ms -0.77%  | ±2.46% +38.87%  |
| RowsBench               | bench_merge_1k_on_10k      | 2    | 3   | 75.693mb +0.00%  | 1.788ms +3.89%   | ±2.67% +12.34%  |
| RowsBench               | bench_partition_by_on_10k  | 2    | 3   | 77.961mb +0.00%  | 32.895ms +2.28%  | ±0.59% -56.93%  |
| RowsBench               | bench_remove_on_10k        | 2    | 3   | 77.794mb +0.00%  | 4.466ms -1.99%   | ±1.45% -51.86%  |
| RowsBench               | bench_sort_asc_on_1k       | 2    | 3   | 73.218mb +0.00%  | 38.341ms -0.33%  | ±0.65% +51.90%  |
| RowsBench               | bench_sort_by_on_1k        | 2    | 3   | 73.219mb +0.00%  | 38.693ms +0.45%  | ±0.13% -89.10%  |
| RowsBench               | bench_sort_desc_on_1k      | 2    | 3   | 73.218mb +0.00%  | 39.370ms +3.16%  | ±0.64% -39.74%  |
| RowsBench               | bench_sort_entries_on_1k   | 2    | 3   | 75.518mb +0.00%  | 7.435ms +2.24%   | ±0.78% -20.22%  |
| RowsBench               | bench_sort_on_1k           | 2    | 3   | 73.076mb +0.00%  | 29.032ms +2.62%  | ±0.26% -50.39%  |
| RowsBench               | bench_take_1k_on_10k       | 10   | 3   | 73.075mb +0.00%  | 12.883μs -0.18%  | ±1.66% +355.91% |
| RowsBench               | bench_take_right_1k_on_10k | 10   | 3   | 73.075mb +0.00%  | 16.679μs +5.45%  | ±2.76% +100.24% |
| RowsBench               | bench_unique_on_1k         | 2    | 3   | 96.083mb +0.00%  | 184.627ms +2.19% | ±2.21% +87.32%  |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 98.212mb +0.00%  | 947.471ms -0.18% | ±1.03% -59.84%  |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 21.904mb +0.01%  | 188.805ms -2.75% | ±0.30% +21.50%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 115.836mb +0.00% | 771.782ms -0.68% | ±0.84% -68.17%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 59.554mb +0.00%  | 381.244ms +0.48% | ±0.81% +98.80%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.676mb +0.01%  | 77.913ms +0.64%  | ±1.14% -25.30%  |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+

@norberttech merged commit 2e20054 into flow-php:1.x on Nov 20, 2023
Development

Successfully merging this pull request may close these issues.

Parquet Writer - Schema Validation
