Skip to content

Conversation

@norberttech
Copy link
Member

Change Log

Added

Fixed

  • Fixed writing column statistics when there is only one row that is null

Changed

Removed

Deprecated

Security


Description

This issue was reported on our Discord channel by user guiguiolol.
Code to reproduce this issue:

<?php

use Flow\Parquet\ParquetFile\Schema;
use Flow\Parquet\ParquetFile\Schema\FlatColumn;
use Flow\Parquet\Writer;

require_once __DIR__ . '/../../../vendor/autoload.php';

$writer = new Writer();

$schema = Schema::with(
    FlatColumn::int32('id'),
);

if (file_exists(__DIR__ . '/test.parquet')) {
    unlink(__DIR__ . '/test.parquet');
}

$writer->write(
    __DIR__ . '/test.parquet',
    $schema,
    [
        [
            'id' => null,
        ]
    ]
);

Parquet files created like this can't be read by other software (confirmed through duckdb).

When min/max buffers are empty we can safely make them null (that's what pyarrow is doing).
Below python code to reproduce the same scenario:

import pyarrow as pa
import pyarrow.parquet as pq
import os

# Define the directory and file path
dir_path = os.path.dirname(os.path.realpath(__file__))
file_path = os.path.join(dir_path, 'test.parquet')

# Define the schema with an int32 'id' column that is nullable
schema = pa.schema([
    pa.field('id', pa.int32(), nullable=True)
])

# Prepare the data with 'id' set to None (null)
data = {
    'id': [None]
}

# Create a PyArrow Table with the data and schema
table = pa.Table.from_pydict(data, schema=schema)

# Remove the Parquet file if it already exists
if os.path.exists(file_path):
    os.remove(file_path)

# Write the table to a Parquet file
pq.write_table(table, file_path)

@github-actions
Copy link
Contributor

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
| benchmark             | subject           | revs | its | mem_peak        | mode             | rstdev          |
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 4.602mb +0.00%  | 514.006ms -0.10% | ±1.00% +141.27% |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 4.690mb +0.00%  | 1.086s +0.38%    | ±0.50% -46.68%  |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 29.145mb +0.00% | 431.769ms -3.16% | ±0.71% +5.71%   |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 4.330mb +0.00%  | 33.308ms -1.47%  | ±0.11% -74.24%  |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 4.312mb +0.00%  | 641.925ms -1.88% | ±0.42% -70.74%  |
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
Transformers
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark                   | subject                  | revs | its | mem_peak         | mode            | rstdev         |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 116.606mb +0.00% | 60.244ms -1.16% | ±0.27% -66.12% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode             | rstdev          |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 54.799mb +0.00%  | 139.343ms -0.81% | ±0.79% +194.33% |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 90.382mb +0.00%  | 114.931ms -1.17% | ±0.19% -70.25%  |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 124.436mb +0.00% | 1.226s -3.35%    | ±0.05% -89.45%  |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.521mb +0.00%  | 43.340ms -4.19%  | ±0.66% +26.13%  |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
Building Blocks
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark               | subject                    | revs | its | mem_peak         | mode             | rstdev          |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| RowsBench               | bench_chunk_10_on_10k      | 2    | 3   | 87.355mb +0.00%  | 3.373ms -15.35%  | ±1.29% +52.64%  |
| RowsBench               | bench_diff_left_1k_on_10k  | 2    | 3   | 102.958mb +0.00% | 189.810ms -2.10% | ±0.50% -36.04%  |
| RowsBench               | bench_diff_right_1k_on_10k | 2    | 3   | 85.678mb +0.00%  | 19.022ms -2.04%  | ±0.54% -2.95%   |
| RowsBench               | bench_drop_1k_on_10k       | 2    | 3   | 88.595mb +0.00%  | 1.620ms -17.63%  | ±1.12% -41.44%  |
| RowsBench               | bench_drop_right_1k_on_10k | 2    | 3   | 88.595mb +0.00%  | 1.634ms -17.02%  | ±0.40% -44.52%  |
| RowsBench               | bench_entries_on_10k       | 2    | 3   | 85.707mb +0.00%  | 2.783ms -14.72%  | ±0.41% -88.21%  |
| RowsBench               | bench_filter_on_10k        | 2    | 3   | 86.236mb +0.00%  | 15.051ms -4.42%  | ±0.26% -78.27%  |
| RowsBench               | bench_find_on_10k          | 2    | 3   | 86.236mb +0.00%  | 15.404ms -0.39%  | ±0.09% -91.81%  |
| RowsBench               | bench_find_one_on_10k      | 10   | 3   | 84.139mb +0.00%  | 1.706μs -14.95%  | ±2.72% +17.31%  |
| RowsBench               | bench_first_on_10k         | 10   | 3   | 84.139mb +0.00%  | 0.400μs 0.00%    | ±0.00% 0.00%    |
| RowsBench               | bench_flat_map_on_1k       | 2    | 3   | 93.490mb +0.00%  | 12.635ms -11.37% | ±0.66% -55.35%  |
| RowsBench               | bench_map_on_10k           | 2    | 3   | 122.861mb +0.00% | 62.015ms -2.36%  | ±1.07% -16.90%  |
| RowsBench               | bench_merge_1k_on_10k      | 2    | 3   | 86.755mb +0.00%  | 1.394ms -20.73%  | ±2.42% -8.61%   |
| RowsBench               | bench_partition_by_on_10k  | 2    | 3   | 90.107mb +0.00%  | 63.424ms -3.03%  | ±1.15% +152.38% |
| RowsBench               | bench_remove_on_10k        | 2    | 3   | 88.857mb +0.00%  | 4.141ms -12.77%  | ±0.12% -94.53%  |
| RowsBench               | bench_sort_asc_on_1k       | 2    | 3   | 84.289mb +0.00%  | 39.993ms -6.52%  | ±1.32% -26.65%  |
| RowsBench               | bench_sort_by_on_1k        | 2    | 3   | 84.289mb +0.00%  | 41.013ms -3.62%  | ±0.78% -64.44%  |
| RowsBench               | bench_sort_desc_on_1k      | 2    | 3   | 84.289mb +0.00%  | 40.439ms -1.79%  | ±0.72% -63.98%  |
| RowsBench               | bench_sort_entries_on_1k   | 2    | 3   | 86.581mb +0.00%  | 7.419ms +0.72%   | ±0.61% -52.51%  |
| RowsBench               | bench_sort_on_1k           | 2    | 3   | 84.139mb +0.00%  | 29.002ms -2.79%  | ±0.95% -33.90%  |
| RowsBench               | bench_take_1k_on_10k       | 10   | 3   | 84.139mb +0.00%  | 13.312μs -4.51%  | ±0.71% -60.88%  |
| RowsBench               | bench_take_right_1k_on_10k | 10   | 3   | 84.139mb +0.00%  | 15.879μs -3.18%  | ±0.78% +0.00%   |
| RowsBench               | bench_unique_on_1k         | 2    | 3   | 102.959mb +0.00% | 192.652ms -0.98% | ±2.63% +212.27% |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 53.181mb +0.00%  | 396.449ms -1.27% | ±0.97% +449.97% |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 13.447mb +0.00%  | 81.240ms +0.22%  | ±0.55% -77.45%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 107.443mb +0.00% | 480.736ms -0.43% | ±0.74% -20.09%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 55.801mb +0.00%  | 242.706ms +0.25% | ±1.00% -36.83%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.639mb +0.00%  | 52.619ms -0.82%  | ±0.29% -79.54%  |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+

@norberttech norberttech merged commit 4602665 into flow-php:1.x Oct 11, 2024
@norberttech norberttech deleted the bug/parquet-writing-single-nullable-row branch December 5, 2024 20:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant