Fixed bitpacking zero values #731

norberttech · 2023-11-05T12:14:55Z

Change Log

Added

Fixed

bitpacking zero values

Changed

Removed

Deprecated

Security

Description

Closes: #727

Here is how it can be reproduced:

$dateTime = new \DateTimeImmutable('2023-01-01 00:00:00 UTC');

(new Flow())
    ->read(From::array([
        ['date_time' => $dateTime],
        ['date_time' => $dateTime],
        ['date_time' => $dateTime],
        ['date_time' => $dateTime],
        ['date_time' => $dateTime],
        ['date_time' => $dateTime],
        ['date_time' => $dateTime],
        ['date_time' => $dateTime],
    ]))
    ->mode(SaveMode::Overwrite)
    ->write(Parquet::to(__DIR__ . '/bytes_offset.parquet'))
    ->run();

(new Flow())
    ->read(Parquet::from(__DIR__ . '/bytes_offset.parquet'))
    ->batchSize(10)
    ->write(To::output())
    ->run();

So what happens here, is that date_time column has very low cardinality ratio (all values are exactly the same) so dictionary encoding is being applied.
The problem starts when we are writing indices to the buffer, in this case we are going to get a dictionary with a single value and index position 0. So we are going to write following index:

[0, 0, 0, 0, 0, 0, 0] // each value points to the first element of the dictionary

Because of that, while bit packing our bitWidth (determines the maximum size of bits needed to pack each element of given array) is 0, and because of that, the algorithm is getting confused about the total number of bytes to read, that's why its reading 0 bytes and that's why Bytes are actually empty.

I double-checked that with python, it seems it's using the same approach as it's reading our parquet files without any warnings.

norberttech · 2023-11-05T12:16:00Z

src/lib/parquet/src/Flow/Parquet/ParquetFile/Data/RLEBitPackedHybrid.php

        }

+        if ($bitWidth === 0) {
+            $output = \array_merge($output, \array_fill(0, \min($numGroups * 8, $maxItems), 0));


we need to take min here, since from time to time we might need to pack less than 8 values and numGroups is always divisible by 8. So if we would pack zero 4 times, this would return us an array of 8 zeroes.

github-actions · 2023-11-05T12:17:53Z

Flow PHP - Benchmarks

^{_{Results of the benchmarks from this PR are compared with the results from 1.x branch.}}

Extractors

+-----------------------+-------------------+------+-----+------------------+-------------------+----------------+
| benchmark             | subject           | revs | its | mem_peak         | mode              | rstdev         |
+-----------------------+-------------------+------+-----+------------------+-------------------+----------------+
| AvroExtractorBench    | bench_extract_10k | 1    | 3   | 35.132mb +0.25%  | 436.329ms +21.66% | ±0.44% +5.11%  |
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 4.789mb +0.01%   | 345.090ms +27.74% | ±0.29% -66.47% |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 4.937mb +0.01%   | 695.094ms +21.92% | ±0.03% -95.12% |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 233.496mb +0.00% | 983.462ms +27.69% | ±0.11% -75.49% |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 4.783mb +0.01%   | 23.131ms +21.13%  | ±0.65% -34.71% |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 4.783mb +0.01%   | 555.088ms +37.51% | ±0.37% -18.98% |
+-----------------------+-------------------+------+-----+------------------+-------------------+----------------+

Transformers

+-----------------------------+--------------------------+------+-----+-----------------+------------------+-----------------+
| benchmark                   | subject                  | revs | its | mem_peak        | mode             | rstdev          |
+-----------------------------+--------------------------+------+-----+-----------------+------------------+-----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 87.050mb +0.00% | 66.200ms +37.21% | ±0.52% +116.71% |
+-----------------------------+--------------------------+------+-----+-----------------+------------------+-----------------+

Loaders

+--------------------+----------------+------+-----+------------------+-------------------+------------------+
| benchmark          | subject        | revs | its | mem_peak         | mode              | rstdev           |
+--------------------+----------------+------+-----+------------------+-------------------+------------------+
| AvroLoaderBench    | bench_load_10k | 1    | 3   | 93.290mb +0.00%  | 712.654ms +26.02% | ±2.10% +212.69%  |
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 46.066mb +0.00%  | 69.274ms +1.11%   | ±3.43% +1646.56% |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 88.568mb +0.00%  | 75.443ms +23.31%  | ±0.84% +35.35%   |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 284.008mb +0.00% | 1.510s +30.17%    | ±0.31% -71.17%   |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 16.560mb +0.00%  | 36.015ms -13.41%  | ±0.88% -65.67%   |
+--------------------+----------------+------+-----+------------------+-------------------+------------------+

Building Blocks

+-------------------------+----------------------------+------+-----+-----------------+-------------------+-----------------+
| benchmark               | subject                    | revs | its | mem_peak        | mode              | rstdev          |
+-------------------------+----------------------------+------+-----+-----------------+-------------------+-----------------+
| RowsBench               | bench_chunk_10_on_10k      | 2    | 3   | 60.659mb +0.00% | 4.002ms +80.67%   | ±1.71% +121.56% |
| RowsBench               | bench_diff_left_1k_on_10k  | 2    | 3   | 80.451mb +0.00% | 177.242ms +16.79% | ±0.96% +246.30% |
| RowsBench               | bench_diff_right_1k_on_10k | 2    | 3   | 58.977mb +0.00% | 18.023ms +19.69%  | ±0.87% -12.14%  |
| RowsBench               | bench_drop_1k_on_10k       | 2    | 3   | 59.798mb +0.00% | 2.847ms +58.07%   | ±1.02% -59.15%  |
| RowsBench               | bench_drop_right_1k_on_10k | 2    | 3   | 59.798mb +0.00% | 2.905ms +62.14%   | ±0.79% +8.74%   |
| RowsBench               | bench_entries_on_10k       | 2    | 3   | 59.011mb +0.00% | 3.855ms +47.40%   | ±0.95% +122.69% |
| RowsBench               | bench_filter_on_10k        | 2    | 3   | 59.540mb +0.00% | 23.054ms +63.80%  | ±0.06% -94.56%  |
| RowsBench               | bench_find_on_10k          | 2    | 3   | 59.539mb +0.00% | 23.250ms +63.94%  | ±1.57% +160.00% |
| RowsBench               | bench_find_one_on_10k      | 10   | 3   | 57.611mb +0.00% | 2.206μs +30.23%   | ±2.11% -25.37%  |
| RowsBench               | bench_first_on_10k         | 10   | 3   | 57.611mb +0.00% | 0.500μs +25.00%   | ±0.00% -100.00% |
| RowsBench               | bench_flat_map_on_1k       | 2    | 3   | 65.844mb +0.00% | 13.534ms +31.05%  | ±0.50% -80.18%  |
| RowsBench               | bench_map_on_10k           | 2    | 3   | 91.364mb +0.00% | 60.316ms +24.51%  | ±0.35% -68.27%  |
| RowsBench               | bench_merge_1k_on_10k      | 2    | 3   | 60.061mb +0.00% | 3.138ms +54.57%   | ±2.05% -7.87%   |
| RowsBench               | bench_partition_by_on_10k  | 2    | 3   | 62.330mb +0.00% | 46.717ms +40.74%  | ±0.89% +552.48% |
| RowsBench               | bench_remove_on_10k        | 2    | 3   | 62.161mb +0.00% | 7.693ms +61.62%   | ±0.85% +336.86% |
| RowsBench               | bench_sort_asc_on_1k       | 2    | 3   | 57.611mb +0.00% | 50.161ms +33.27%  | ±0.12% -92.42%  |
| RowsBench               | bench_sort_by_on_1k        | 2    | 3   | 57.611mb +0.00% | 50.040ms +35.02%  | ±0.19% -82.62%  |
| RowsBench               | bench_sort_desc_on_1k      | 2    | 3   | 57.611mb +0.00% | 50.515ms +34.24%  | ±0.52% +563.76% |
| RowsBench               | bench_sort_entries_on_1k   | 2    | 3   | 59.885mb +0.00% | 9.293ms +27.88%   | ±0.92% +21.30%  |
| RowsBench               | bench_sort_on_1k           | 2    | 3   | 57.610mb +0.00% | 37.381ms +31.59%  | ±0.46% +433.83% |
| RowsBench               | bench_take_1k_on_10k       | 10   | 3   | 57.611mb +0.00% | 20.706μs +61.92%  | ±0.23% -69.29%  |
| RowsBench               | bench_take_right_1k_on_10k | 10   | 3   | 57.611mb +0.00% | 25.921μs +68.38%  | ±0.48% +57.18%  |
| RowsBench               | bench_unique_on_1k         | 2    | 3   | 80.452mb +0.00% | 179.399ms +14.43% | ±0.92% +358.80% |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 91.736mb -0.01% | 146.817ms +26.33% | ±0.61% -35.77%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 47.615mb -0.03% | 74.263ms +26.46%  | ±0.20% -89.98%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 12.386mb +0.01% | 17.721ms +28.67%  | ±0.88% -48.23%  |
+-------------------------+----------------------------+------+-----+-----------------+-------------------+-----------------+

Fixed bitpacking zero values

11b2a59

github-actions bot added lib-parquet size: XS labels Nov 5, 2023

norberttech commented Nov 5, 2023

View reviewed changes

norberttech mentioned this pull request Nov 5, 2023

Convert PHP warnings to test failures #725

Closed

norberttech merged commit 3fef49f into flow-php:1.x Nov 5, 2023

norberttech deleted the bug/parquet-bytes-array branch December 5, 2024 20:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Fixed bitpacking zero values #731

Fixed bitpacking zero values #731

Uh oh!

norberttech commented Nov 5, 2023 •

edited

Loading

Uh oh!

norberttech Nov 5, 2023

Uh oh!

github-actions bot commented Nov 5, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Fixed bitpacking zero values #731

Fixed bitpacking zero values #731

Uh oh!

Conversation

norberttech commented Nov 5, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Change Log

Added

Fixed

Changed

Removed

Deprecated

Security

Description

Uh oh!

norberttech Nov 5, 2023

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Nov 5, 2023

Flow PHP - Benchmarks

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

norberttech commented Nov 5, 2023 •

edited

Loading