-
-
Notifications
You must be signed in to change notification settings - Fork 48
Fixed bitpacking zero values #731
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } | ||
|
|
||
| if ($bitWidth === 0) { | ||
| $output = \array_merge($output, \array_fill(0, \min($numGroups * 8, $maxItems), 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to take min here, since from time to time we might need to pack less than 8 values and numGroups is always divisible by 8. So if we would pack zero 4 times, this would return us an array of 8 zeroes.
Flow PHP - BenchmarksResults of the benchmarks from this PR are compared with the results from 1.x branch. Extractors+-----------------------+-------------------+------+-----+------------------+-------------------+----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-----------------------+-------------------+------+-----+------------------+-------------------+----------------+
| AvroExtractorBench | bench_extract_10k | 1 | 3 | 35.132mb +0.25% | 436.329ms +21.66% | ±0.44% +5.11% |
| CSVExtractorBench | bench_extract_10k | 1 | 3 | 4.789mb +0.01% | 345.090ms +27.74% | ±0.29% -66.47% |
| JsonExtractorBench | bench_extract_10k | 1 | 3 | 4.937mb +0.01% | 695.094ms +21.92% | ±0.03% -95.12% |
| ParquetExtractorBench | bench_extract_10k | 1 | 3 | 233.496mb +0.00% | 983.462ms +27.69% | ±0.11% -75.49% |
| TextExtractorBench | bench_extract_10k | 1 | 3 | 4.783mb +0.01% | 23.131ms +21.13% | ±0.65% -34.71% |
| XmlExtractorBench | bench_extract_10k | 1 | 3 | 4.783mb +0.01% | 555.088ms +37.51% | ±0.37% -18.98% |
+-----------------------+-------------------+------+-----+------------------+-------------------+----------------+
Transformers+-----------------------------+--------------------------+------+-----+-----------------+------------------+-----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-----------------------------+--------------------------+------+-----+-----------------+------------------+-----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1 | 3 | 87.050mb +0.00% | 66.200ms +37.21% | ±0.52% +116.71% |
+-----------------------------+--------------------------+------+-----+-----------------+------------------+-----------------+
Loaders+--------------------+----------------+------+-----+------------------+-------------------+------------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+--------------------+----------------+------+-----+------------------+-------------------+------------------+
| AvroLoaderBench | bench_load_10k | 1 | 3 | 93.290mb +0.00% | 712.654ms +26.02% | ±2.10% +212.69% |
| CSVLoaderBench | bench_load_10k | 1 | 3 | 46.066mb +0.00% | 69.274ms +1.11% | ±3.43% +1646.56% |
| JsonLoaderBench | bench_load_10k | 1 | 3 | 88.568mb +0.00% | 75.443ms +23.31% | ±0.84% +35.35% |
| ParquetLoaderBench | bench_load_10k | 1 | 3 | 284.008mb +0.00% | 1.510s +30.17% | ±0.31% -71.17% |
| TextLoaderBench | bench_load_10k | 1 | 3 | 16.560mb +0.00% | 36.015ms -13.41% | ±0.88% -65.67% |
+--------------------+----------------+------+-----+------------------+-------------------+------------------+
Building Blocks+-------------------------+----------------------------+------+-----+-----------------+-------------------+-----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-------------------------+----------------------------+------+-----+-----------------+-------------------+-----------------+
| RowsBench | bench_chunk_10_on_10k | 2 | 3 | 60.659mb +0.00% | 4.002ms +80.67% | ±1.71% +121.56% |
| RowsBench | bench_diff_left_1k_on_10k | 2 | 3 | 80.451mb +0.00% | 177.242ms +16.79% | ±0.96% +246.30% |
| RowsBench | bench_diff_right_1k_on_10k | 2 | 3 | 58.977mb +0.00% | 18.023ms +19.69% | ±0.87% -12.14% |
| RowsBench | bench_drop_1k_on_10k | 2 | 3 | 59.798mb +0.00% | 2.847ms +58.07% | ±1.02% -59.15% |
| RowsBench | bench_drop_right_1k_on_10k | 2 | 3 | 59.798mb +0.00% | 2.905ms +62.14% | ±0.79% +8.74% |
| RowsBench | bench_entries_on_10k | 2 | 3 | 59.011mb +0.00% | 3.855ms +47.40% | ±0.95% +122.69% |
| RowsBench | bench_filter_on_10k | 2 | 3 | 59.540mb +0.00% | 23.054ms +63.80% | ±0.06% -94.56% |
| RowsBench | bench_find_on_10k | 2 | 3 | 59.539mb +0.00% | 23.250ms +63.94% | ±1.57% +160.00% |
| RowsBench | bench_find_one_on_10k | 10 | 3 | 57.611mb +0.00% | 2.206μs +30.23% | ±2.11% -25.37% |
| RowsBench | bench_first_on_10k | 10 | 3 | 57.611mb +0.00% | 0.500μs +25.00% | ±0.00% -100.00% |
| RowsBench | bench_flat_map_on_1k | 2 | 3 | 65.844mb +0.00% | 13.534ms +31.05% | ±0.50% -80.18% |
| RowsBench | bench_map_on_10k | 2 | 3 | 91.364mb +0.00% | 60.316ms +24.51% | ±0.35% -68.27% |
| RowsBench | bench_merge_1k_on_10k | 2 | 3 | 60.061mb +0.00% | 3.138ms +54.57% | ±2.05% -7.87% |
| RowsBench | bench_partition_by_on_10k | 2 | 3 | 62.330mb +0.00% | 46.717ms +40.74% | ±0.89% +552.48% |
| RowsBench | bench_remove_on_10k | 2 | 3 | 62.161mb +0.00% | 7.693ms +61.62% | ±0.85% +336.86% |
| RowsBench | bench_sort_asc_on_1k | 2 | 3 | 57.611mb +0.00% | 50.161ms +33.27% | ±0.12% -92.42% |
| RowsBench | bench_sort_by_on_1k | 2 | 3 | 57.611mb +0.00% | 50.040ms +35.02% | ±0.19% -82.62% |
| RowsBench | bench_sort_desc_on_1k | 2 | 3 | 57.611mb +0.00% | 50.515ms +34.24% | ±0.52% +563.76% |
| RowsBench | bench_sort_entries_on_1k | 2 | 3 | 59.885mb +0.00% | 9.293ms +27.88% | ±0.92% +21.30% |
| RowsBench | bench_sort_on_1k | 2 | 3 | 57.610mb +0.00% | 37.381ms +31.59% | ±0.46% +433.83% |
| RowsBench | bench_take_1k_on_10k | 10 | 3 | 57.611mb +0.00% | 20.706μs +61.92% | ±0.23% -69.29% |
| RowsBench | bench_take_right_1k_on_10k | 10 | 3 | 57.611mb +0.00% | 25.921μs +68.38% | ±0.48% +57.18% |
| RowsBench | bench_unique_on_1k | 2 | 3 | 80.452mb +0.00% | 179.399ms +14.43% | ±0.92% +358.80% |
| NativeEntryFactoryBench | bench_entry_factory | 1 | 3 | 91.736mb -0.01% | 146.817ms +26.33% | ±0.61% -35.77% |
| NativeEntryFactoryBench | bench_entry_factory | 1 | 3 | 47.615mb -0.03% | 74.263ms +26.46% | ±0.20% -89.98% |
| NativeEntryFactoryBench | bench_entry_factory | 1 | 3 | 12.386mb +0.01% | 17.721ms +28.67% | ±0.88% -48.23% |
+-------------------------+----------------------------+------+-----+-----------------+-------------------+-----------------+
|
Change Log
Added
Fixed
Changed
Removed
Deprecated
Security
Description
Closes: #727
Here is how it can be reproduced:
So what happens here, is that date_time column has very low cardinality ratio (all values are exactly the same) so dictionary encoding is being applied.
The problem starts when we are writing indices to the buffer, in this case we are going to get a dictionary with a single value and index position 0. So we are going to write following index:
Because of that, while bit packing our bitWidth (determines the maximum size of bits needed to pack each element of given array) is 0, and because of that, the algorithm is getting confused about the total number of bytes to read, that's why its reading 0 bytes and that's why Bytes are actually empty.
I double-checked that with python, it seems it's using the same approach as it's reading our parquet files without any warnings.