norberttech (Member) commented Nov 7, 2023

Change Log

Added

  • CLI App - Parquet viewer

Fixed

  • missing dependencies in parquet lib

Changed

Removed

Deprecated

Security


Description

Closes: #735

$ bin/parquet.php                 

Flow PHP - Parquet Viewer 1.x-dev

Usage:
  command [options] [arguments]

Options:
  -h, --help            Display help for the given command. When no command is given display help for the list command
  -q, --quiet           Do not output any message
  -V, --version         Display this application version
      --ansi|--no-ansi  Force (or disable --no-ansi) ANSI output
  -n, --no-interaction  Do not ask any interactive question
  -v|vv|vvv, --verbose  Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug

Available commands:
  completion     Dump the shell completion script
  help           Display help for a command
  list           List commands
 read
  read:data      Read data from parquet file
  read:metadata  Read metadata from parquet file
$ bin/parquet.php read:metadata --help

Description:
  Read metadata from parquet file

Usage:
  read:metadata [options] [--] <file>

Arguments:
  file                  path to parquet file

Options:
      --columns         Display column details
      --row-groups      Display row group details
      --column-chunks   Display column chunks details
      --statistics      Display column chunks statistics details
      --page-headers    Display page headers details
  -h, --help            Display help for the given command. When no command is given display help for the list command
  -q, --quiet           Do not output any message
  -V, --version         Display this application version
      --ansi|--no-ansi  Force (or disable --no-ansi) ANSI output
  -n, --no-interaction  Do not ask any interactive question
  -v|vv|vvv, --verbose  Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
$ bin/parquet.php read:data --help    

Description:
  Read data from parquet file

Usage:
  read:data [options] [--] <file>

Arguments:
  file                           path to parquet file

Options:
  -c, --columns[=COLUMNS]        columns to read (multiple values allowed)
  -l, --limit[=LIMIT]            limit number of rows to read [default: 10]
  -b, --batch-size[=BATCH-SIZE]  batch size [default: 1000]
  -t, --truncate[=TRUNCATE]      Truncate values in cells to given length
  -h, --help                     Display help for the given command. When no command is given display help for the list command
  -q, --quiet                    Do not output any message
  -V, --version                  Display this application version
      --ansi|--no-ansi           Force (or disable --no-ansi) ANSI output
  -n, --no-interaction           Do not ask any interactive question
  -v|vv|vvv, --verbose           Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
$ bin/parquet.php read:data flow.parquet -t 10 

+---------+------------+------------+-------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| boolean |      int32 |      int64 | float |     double |    decimal |     string |       date |   datetime |       time | list_of_da | map_of_int | list_of_st | struct_fla |
+---------+------------+------------+-------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
|    true |  601197183 | 1252452733 | 10.25 |     0.3608 |     458.78 | Et itaque  | 2023-08-16 | 2023-07-18 | DateInterv | [{"date":" | {"a":"1517 | ["Ea quia  | {"id":"1", |
|    true | 1898860246 | 8060496734 | 10.25 | 327433.404 |    3284.82 | Et consequ | 2023-09-23 | 2023-02-05 | DateInterv | [{"date":" | {"a":"1533 | ["Debitis  | {"id":"2", |
|   false | 1315048828 | 3129070533 | 10.25 | 2203746.64 |  569329.21 | Aut est op | 2023-05-30 | 2023-07-15 | DateInterv | [{"date":" | {"a":"1749 | ["Non volu | {"id":"3", |
|   false | 1558719417 | 6878707872 | 10.25 |     1.6939 | 25955397.6 | Dolorem ma | 2023-08-13 | 2023-03-24 | DateInterv | [{"date":" | {"a":"1720 | ["Nihil qu | {"id":"4", |
|    true | 1012067503 | 8967249410 | 10.25 | 1750536.83 |       8.25 | Aut velit  | 2023-10-16 | 2023-10-05 | DateInterv | [{"date":" | {"a":"8118 | ["Voluptas | {"id":"5", |
|    true |   28238480 | 3652472020 | 10.25 | 22556.6565 |   41735.05 | Ipsam volu | 2023-05-15 | 2023-06-22 | DateInterv | [{"date":" | {"a":"2124 | ["Ut sed d | {"id":"6", |
|    true | 1294233247 | 7357477648 | 10.25 |     237157 | 185585581. | Voluptas i | 2023-01-09 | 2023-08-09 | DateInterv | [{"date":" | {"a":"4298 | ["Nihil mo | {"id":"7", |
|   false | 1250913218 | 7983709812 | 10.25 | 25667989.4 | 3079052.64 | Minus aute | 2023-07-10 | 2023-08-28 | DateInterv | [{"date":" | {"a":"4122 | ["Consecte | {"id":"8", |
|    true | 2105356816 | 8855288021 | 10.25 |    0.93007 |    2433.84 | Fuga aut r | 2023-08-21 | 2023-03-27 | DateInterv | [{"date":" | {"a":"1268 | ["Labore r | {"id":"9", |
|    true | 1314291690 | 7670636630 | 10.25 | 73.3677088 | 20244591.1 | Voluptate  | 2023-10-07 | 2023-09-15 | DateInterv | [{"date":" | {"a":"9293 | ["Iusto qu | {"id":"10" |
+---------+------------+------------+-------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
10 rows
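
The `-t 10` option in the run above cuts each cell to 10 characters, and the header row (`list_of_da`, `map_of_int`, ...) suggests column names are cut the same way. A minimal sketch of that behaviour, for illustration only: `truncate_cell` is a hypothetical helper, not the viewer's code, and plain character slicing is an assumption.

```python
from typing import List, Optional


def truncate_cell(value: str, length: Optional[int] = None) -> str:
    """Cut value to at most `length` characters; None leaves it untouched."""
    if length is None:
        return value
    return value[:length]


# Column names taken from the read:metadata output for the same file.
headers: List[str] = [
    "boolean",
    "list_of_datetimes",
    "map_of_ints",
    "list_of_strings",
    "struct_flat",
]

# Values shorter than the limit pass through unchanged.
truncated = [truncate_cell(name, 10) for name in headers]
print(truncated)
```

The shortened names line up with the headers printed in the table above.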
$ bin/parquet.php read:metadata flow.parquet --row-groups --page-headers --columns --statistics
├─────────────────┼────────────────── Metadata ─────────────────────────────────────┤
│ file path       │ /Users/norbert/Workspace/flow-php/flow/.scratchpad/flow.parquet │
│ parquet version │ 1                                                               │
│ created by      │ flow-php parquet version 1.x-dev                                │
│ rows            │ 100                                                             │
└─────────────────┴─────────────────────────────────────────────────────────────────┘

┌────────────────────────────────┬────────────────── Flat Columns ─────┬────────────┬────────────────┬────────────────┐
│ path                           │ type                 │ logical type │ repetition │ max repetition │ max definition │
├────────────────────────────────┼──────────────────────┼──────────────┼────────────┼────────────────┼────────────────┤
│ boolean                        │ BOOLEAN              │ -            │ OPTIONAL   │ 0              │ 1              │
│ int32                          │ INT32                │ -            │ OPTIONAL   │ 0              │ 1              │
│ int64                          │ INT64                │ -            │ OPTIONAL   │ 0              │ 1              │
│ float                          │ FLOAT                │ -            │ OPTIONAL   │ 0              │ 1              │
│ double                         │ DOUBLE               │ -            │ OPTIONAL   │ 0              │ 1              │
│ decimal                        │ FIXED_LEN_BYTE_ARRAY │ DECIMAL      │ OPTIONAL   │ 0              │ 1              │
│ string                         │ BYTE_ARRAY           │ STRING       │ OPTIONAL   │ 0              │ 1              │
│ date                           │ INT32                │ DATE         │ OPTIONAL   │ 0              │ 1              │
│ datetime                       │ INT64                │ TIMESTAMP    │ OPTIONAL   │ 0              │ 1              │
│ time                           │ INT64                │ TIME         │ OPTIONAL   │ 0              │ 1              │
│ list_of_datetimes.list.element │ INT64                │ TIMESTAMP    │ OPTIONAL   │ 1              │ 3              │
│ map_of_ints.key_value.key      │ BYTE_ARRAY           │ STRING       │ REQUIRED   │ 1              │ 2              │
│ map_of_ints.key_value.value    │ INT32                │ -            │ OPTIONAL   │ 1              │ 3              │
│ list_of_strings.list.element   │ BYTE_ARRAY           │ STRING       │ OPTIONAL   │ 1              │ 3              │
│ struct_flat.id                 │ INT32                │ -            │ OPTIONAL   │ 0              │ 2              │
│ struct_flat.name               │ BYTE_ARRAY           │ STRING       │ OPTIONAL   │ 0              │ 2              │
└────────────────────────────────┴──────────────────────┴──────────────┴────────────┴────────────────┴────────────────┘
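
The `max repetition` and `max definition` columns follow the usual Parquet (Dremel) rule: walking from the root to the leaf, every node that is not REQUIRED adds one definition level and every REPEATED node adds one repetition level. The sketch below recomputes a few rows. The table only shows the repetition of the leaf, so the repetitions of the intermediate nodes are assumptions based on the standard Parquet LIST and MAP layouts (optional outer group, repeated middle group); `max_levels` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def max_levels(path_repetitions: List[str]) -> Tuple[int, int]:
    """Return (max_repetition, max_definition) for a root-to-leaf path."""
    max_repetition = sum(1 for r in path_repetitions if r == "REPEATED")
    max_definition = sum(1 for r in path_repetitions if r != "REQUIRED")
    return max_repetition, max_definition


# Assumed repetition of every node along the path, root first, leaf last.
paths: Dict[str, List[str]] = {
    "int32": ["OPTIONAL"],
    "list_of_datetimes.list.element": ["OPTIONAL", "REPEATED", "OPTIONAL"],
    "map_of_ints.key_value.key": ["OPTIONAL", "REPEATED", "REQUIRED"],
    "map_of_ints.key_value.value": ["OPTIONAL", "REPEATED", "OPTIONAL"],
    "struct_flat.id": ["OPTIONAL", "OPTIONAL"],
}

levels = {path: max_levels(reps) for path, reps in paths.items()}
for path, (rep, definition) in levels.items():
    print(f"{path}: max repetition={rep}, max definition={definition}")
```

Under these assumptions the required map key stays at definition level 2 while the optional map value reaches 3, which is what the table reports.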

┌──────────┬───── Row Groups ┬───────────────┐
│ num rows │ total byte size │ columns count │
├──────────┼─────────────────┼───────────────┤
│ 100      │ 42,561          │ 16            │
└──────────┴────── Total: 1 ─┴───────────────┘

┌────────────────────────────────┬──────────────────────────────┬───────────────── Column Chunks Statistics ───────────────────┬───────────────────────────────┬────────────┬────────────────┐
│ path                           │ min [deprecated]             │ max [deprecated]              │ min value                    │ max value                     │ null count │ distinct count │
├────────────────────────────────┼──────────────────────────────┼───────────────────────────────┼──────────────────────────────┼───────────────────────────────┼────────────┼────────────────┤
│ Row Group: 0                                                                                                                                                                               │
├────────────────────────────────┼──────────────────────────────┼───────────────────────────────┼──────────────────────────────┼───────────────────────────────┼────────────┼────────────────┤
│ boolean                        │                              │ 1                             │                              │ 1                             │ -          │ 2              │
│ int32                          │ 12041874                     │ 2138721799                    │ 12041874                     │ 2138721799                    │ -          │ 100            │
│ int64                          │ 80604967340828891            │ 9139108325942554382           │ 80604967340828891            │ 9139108325942554382           │ -          │ 100            │
│ float                          │ 10.25                        │ 10.25                         │ 10.25                        │ 10.25                         │ -          │ 1              │
│ double                         │ 0                            │ 515214588.53                  │ 0                            │ 515214588.53                  │ -          │ 100            │
│ decimal                        │ 0                            │ 634690289.94                  │ 0                            │ 634690289.94                  │ -          │ 100            │
│ string                         │ A aut aperiam distinctio...  │ Voluptatem quo dolores...     │ A aut aperiam distinctio...  │ Voluptatem quo dolores...     │ -          │ 100            │
│ date                           │ 19361                        │ 19660                         │ 19361                        │ 19660                         │ -          │ 85             │
│ datetime                       │ 1672586016000000             │ 1698912601000000              │ 1672586016000000             │ 1698912601000000              │ -          │ 100            │
│ time                           │ 7200000000                   │ 7200000001                    │ 7200000000                   │ 7200000001                    │ -          │ 2              │
│ list_of_datetimes.list.element │ 1672577279000000             │ 1699336548000000              │ 1672577279000000             │ 1699336548000000              │ -          │ 300            │
│ map_of_ints.key_value.key      │ a                            │ c                             │ a                            │ c                             │ -          │ 3              │
│ map_of_ints.key_value.value    │ 369858                       │ 2145864542                    │ 369858                       │ 2145864542                    │ -          │ 300            │
│ list_of_strings.list.element   │ A nesciunt autem nesciunt... │ Voluptatum voluptas maxime... │ A nesciunt autem nesciunt... │ Voluptatum voluptas maxime... │ -          │ 691            │
│ struct_flat.id                 │ 1                            │ 100                           │ 1                            │ 100                           │ -          │ 100            │
│ struct_flat.name               │ name_00001                   │ name_00100                    │ name_00001                   │ name_00100                    │ -          │ 100            │
└────────────────────────────────┴──────────────────────────────┴──────────────────────── Total: 16 ───────────────────────────┴───────────────────────────────┴────────────┴────────────────┘
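
The `date`, `datetime` and `time` statistics above are printed as raw physical values. Parquet stores DATE as days since the Unix epoch; the unit of the TIMESTAMP and TIME columns is not shown in the output, and the magnitude of the values suggests microseconds, which is an assumption here, as is reading the timestamps as UTC. A small sketch for converting them by hand (the helper names are hypothetical):

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_date(days: int) -> date:
    """Parquet DATE: number of days since 1970-01-01."""
    return EPOCH_DATE + timedelta(days=days)


def decode_timestamp_micros(micros: int) -> datetime:
    """TIMESTAMP, assuming microsecond precision and UTC."""
    return EPOCH_UTC + timedelta(microseconds=micros)


def decode_time_micros(micros: int) -> timedelta:
    """TIME, assuming microseconds since midnight."""
    return timedelta(microseconds=micros)


# Min/max values copied from the statistics table.
print(decode_date(19361), decode_date(19660))
print(decode_timestamp_micros(1672586016000000))
print(decode_time_micros(7200000000), decode_time_micros(7200000001))
```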

┌────────────────────────────────┬─────────────────┬──────────────── Page Headers ──────┬───────────────────┬───────────────────────┬─────────────────┐
│ path                           │ type            │ encoding         │ compressed size │ uncompressed size │ dictionary num values │ data num values │
├────────────────────────────────┼─────────────────┼──────────────────┼─────────────────┼───────────────────┼───────────────────────┼─────────────────┤
│ boolean                        │ DATA_PAGE       │ PLAIN            │ 20              │ 20                │ -                     │ 100             │
│ int32                          │ DATA_PAGE       │ PLAIN            │ 407             │ 407               │ -                     │ 100             │
│ int64                          │ DATA_PAGE       │ PLAIN            │ 807             │ 807               │ -                     │ 100             │
│ float                          │ DICTIONARY_PAGE │ PLAIN_DICTIONARY │ 4               │ 4                 │ 1                     │ -               │
│ double                         │ DATA_PAGE       │ PLAIN            │ 807             │ 807               │ -                     │ 100             │
│ decimal                        │ DATA_PAGE       │ PLAIN            │ 507             │ 507               │ -                     │ 100             │
│ string                         │ DATA_PAGE       │ PLAIN            │ 4,227           │ 4,227             │ -                     │ 100             │
│ date                           │ DATA_PAGE       │ PLAIN            │ 407             │ 407               │ -                     │ 100             │
│ datetime                       │ DATA_PAGE       │ PLAIN            │ 807             │ 807               │ -                     │ 100             │
│ time                           │ DICTIONARY_PAGE │ PLAIN_DICTIONARY │ 16              │ 16                │ 2                     │ -               │
│ list_of_datetimes.list.element │ DATA_PAGE       │ PLAIN            │ 2,487           │ 2,487             │ -                     │ 300             │
│ map_of_ints.key_value.key      │ DICTIONARY_PAGE │ PLAIN_DICTIONARY │ 15              │ 15                │ 3                     │ -               │
│ map_of_ints.key_value.value    │ DATA_PAGE       │ PLAIN            │ 1,287           │ 1,287             │ -                     │ 300             │
│ list_of_strings.list.element   │ DATA_PAGE       │ PLAIN            │ 28,345          │ 28,345            │ -                     │ 691             │
│ struct_flat.id                 │ DATA_PAGE       │ PLAIN            │ 407             │ 407               │ -                     │ 100             │
│ struct_flat.name               │ DATA_PAGE       │ PLAIN            │ 1,407           │ 1,407             │ -                     │ 100             │
└────────────────────────────────┴─────────────────┴────────────────── Total: 16 ───────┴───────────────────┴───────────────────────┴─────────────────┘
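
The sizes of the flat PLAIN pages can be cross-checked by hand. For example, 100 INT32 values take 400 bytes and the page is reported as 407. The extra 7 bytes are consistent with a data page v1 layout where the definition levels are written as a 4-byte length followed by a single RLE run; whether the writer actually uses this layout is an assumption, not something the output confirms. The sketch also assumes no nulls (one run covers all values) and a 5-byte fixed length for `decimal`, inferred from the reported size. It does not cover the nested columns, which also carry repetition levels.

```python
import math


def uleb128_len(n: int) -> int:
    """Number of bytes needed to encode n as an unsigned LEB128 varint."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def rle_levels_size(num_values: int, bit_width: int) -> int:
    """Levels stored as one RLE run behind a 4-byte length prefix (assumed)."""
    run_header = uleb128_len(num_values << 1)  # run header is (count << 1) as a varint
    run_value = math.ceil(bit_width / 8)       # repeated value padded to whole bytes
    return 4 + run_header + run_value


def plain_page_size(num_values: int, value_bits: int, max_definition: int) -> int:
    """Estimated size of a PLAIN data page for a flat, null-free column."""
    bit_width = max_definition.bit_length()
    values_size = math.ceil(num_values * value_bits / 8)
    return values_size + rle_levels_size(num_values, bit_width)


# (value width in bits, max definition level) per column, 100 values each.
columns = {
    "boolean": (1, 1),
    "int32": (32, 1),
    "int64": (64, 1),
    "decimal": (40, 1),
    "struct_flat.id": (32, 2),
}
sizes = {name: plain_page_size(100, bits, d) for name, (bits, d) in columns.items()}
print(sizes)
```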

github-actions bot commented Nov 7, 2023

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from the 1.x branch.

Extractors
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
| benchmark             | subject           | revs | its | mem_peak         | mode             | rstdev          |
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
| AvroExtractorBench    | bench_extract_10k | 1    | 3   | 35.233mb +0.12%  | 502.327ms -4.44% | ±1.25% +180.62% |
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 4.791mb +0.95%   | 386.567ms -6.92% | ±1.07% -56.21%  |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 4.950mb +0.84%   | 791.841ms -5.30% | ±0.44% +105.03% |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 233.539mb +0.02% | 1.110s -4.98%    | ±1.00% +242.83% |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 4.785mb +0.95%   | 26.591ms -2.99%  | ±0.82% +96.77%  |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 4.785mb +0.95%   | 643.349ms -3.52% | ±1.36% +153.11% |
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
Transformers
+-----------------------------+--------------------------+------+-----+-----------------+-----------------+-----------------+
| benchmark                   | subject                  | revs | its | mem_peak        | mode            | rstdev          |
+-----------------------------+--------------------------+------+-----+-----------------+-----------------+-----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 87.052mb +0.05% | 77.006ms -1.20% | ±2.19% +248.44% |
+-----------------------------+--------------------------+------+-----+-----------------+-----------------+-----------------+
Loaders
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode             | rstdev          |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| AvroLoaderBench    | bench_load_10k | 1    | 3   | 94.423mb +0.04%  | 776.827ms -7.34% | ±2.18% +160.33% |
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 46.068mb +0.09%  | 74.423ms -9.35%  | ±0.09% -83.51%  |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 89.701mb +0.05%  | 83.776ms -7.98%  | ±0.57% +123.22% |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 285.172mb +0.01% | 1.761s -7.84%    | ±0.53% -31.75%  |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 16.562mb +0.25%  | 39.202ms -5.64%  | ±1.66% -1.69%   |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
Building Blocks
+-------------------------+----------------------------+------+-----+-----------------+-------------------+-----------------+
| benchmark               | subject                    | revs | its | mem_peak        | mode              | rstdev          |
+-------------------------+----------------------------+------+-----+-----------------+-------------------+-----------------+
| RowsBench               | bench_chunk_10_on_10k      | 2    | 3   | 60.661mb +0.07% | 4.532ms -20.97%   | ±1.27% -39.60%  |
| RowsBench               | bench_diff_left_1k_on_10k  | 2    | 3   | 80.453mb +0.05% | 194.555ms -9.86%  | ±1.10% +70.61%  |
| RowsBench               | bench_diff_right_1k_on_10k | 2    | 3   | 58.979mb +0.07% | 19.941ms -9.04%   | ±2.15% +192.73% |
| RowsBench               | bench_drop_1k_on_10k       | 2    | 3   | 59.800mb +0.07% | 3.277ms -16.83%   | ±2.98% -12.82%  |
| RowsBench               | bench_drop_right_1k_on_10k | 2    | 3   | 59.800mb +0.07% | 3.369ms -14.61%   | ±2.40% +56.01%  |
| RowsBench               | bench_entries_on_10k       | 2    | 3   | 59.013mb +0.07% | 4.132ms -18.76%   | ±3.24% +69.41%  |
| RowsBench               | bench_filter_on_10k        | 2    | 3   | 59.542mb +0.07% | 23.847ms -11.14%  | ±1.31% +248.51% |
| RowsBench               | bench_find_on_10k          | 2    | 3   | 59.542mb +0.07% | 24.706ms -6.83%   | ±1.80% +245.47% |
| RowsBench               | bench_find_one_on_10k      | 10   | 3   | 57.613mb +0.07% | 2.494μs -4.53%    | ±1.91% -45.95%  |
| RowsBench               | bench_first_on_10k         | 10   | 3   | 57.613mb +0.07% | 0.500μs 0.00%     | ±0.00% 0.00%    |
| RowsBench               | bench_flat_map_on_1k       | 2    | 3   | 65.846mb +0.06% | 14.891ms -11.29%  | ±2.08% +414.44% |
| RowsBench               | bench_map_on_10k           | 2    | 3   | 91.366mb +0.05% | 68.063ms -7.06%   | ±1.53% +301.41% |
| RowsBench               | bench_merge_1k_on_10k      | 2    | 3   | 60.063mb +0.07% | 3.410ms -19.61%   | ±0.69% -51.03%  |
| RowsBench               | bench_partition_by_on_10k  | 2    | 3   | 62.333mb +0.07% | 52.505ms -9.85%   | ±2.38% +446.94% |
| RowsBench               | bench_remove_on_10k        | 2    | 3   | 62.163mb +0.07% | 8.570ms -18.59%   | ±2.49% +238.19% |
| RowsBench               | bench_sort_asc_on_1k       | 2    | 3   | 57.613mb +0.07% | 57.183ms -8.10%   | ±1.18% +104.73% |
| RowsBench               | bench_sort_by_on_1k        | 2    | 3   | 57.613mb +0.07% | 55.497ms -10.37%  | ±1.05% +0.28%   |
| RowsBench               | bench_sort_desc_on_1k      | 2    | 3   | 57.613mb +0.07% | 55.805ms -9.62%   | ±2.92% +684.91% |
| RowsBench               | bench_sort_entries_on_1k   | 2    | 3   | 59.887mb +0.07% | 10.132ms -10.68%  | ±0.56% -50.33%  |
| RowsBench               | bench_sort_on_1k           | 2    | 3   | 57.612mb +0.07% | 41.613ms -11.32%  | ±1.85% +128.30% |
| RowsBench               | bench_take_1k_on_10k       | 10   | 3   | 57.613mb +0.07% | 23.517μs -8.85%   | ±0.92% +191.71% |
| RowsBench               | bench_take_right_1k_on_10k | 10   | 3   | 57.613mb +0.07% | 30.792μs -4.01%   | ±2.66% +402.58% |
| RowsBench               | bench_unique_on_1k         | 2    | 3   | 80.454mb +0.05% | 195.121ms -10.27% | ±0.62% -45.90%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 92.858mb +0.05% | 169.800ms -5.63%  | ±1.99% +411.45% |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 48.174mb +0.09% | 83.226ms -8.78%   | ±1.81% +320.19% |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 12.515mb +0.30% | 20.037ms -7.99%   | ±1.96% +169.14% |
+-------------------------+----------------------------+------+-----+-----------------+-------------------+-----------------+

norberttech (Member, Author) commented

The differences in the output tables come from the fact that Flow internally uses its own ASCIITable implementation, but maybe we could add another implementation that uses Symfony Ascii Table whenever it's available.

@norberttech norberttech merged commit 2efd31f into flow-php:1.x Nov 7, 2023
Development

Successfully merging this pull request may close these issues.

Parquet Viewer
