@norberttech norberttech commented Oct 13, 2023

Change Log

Added

  • Parquet library - reading only for now
  • Implementation of algorithms from Google Dremel paper

Fixed

Changed

  • Simplified composer.json files across all sub repositories

Removed

Deprecated

  • Codename ParquetExtractor

Security


Description

Refs: #506
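
The change log mentions algorithms from the Google Dremel paper, which Parquet uses to flatten nested data into columns with repetition and definition levels. As a rough illustration only (this is not the code from this pull request, and the function name `assemble_lists` is hypothetical), the sketch below rebuilds a single nullable list column from its levels. It assumes the common three-level LIST layout: an optional list with optional elements, so the maximum definition level is 3 and the maximum repetition level is 1.

```python
from typing import Any, Iterable, List, Optional


def assemble_lists(
    rep_levels: Iterable[int],
    def_levels: Iterable[int],
    values: Iterable[Any],
) -> List[Optional[List[Any]]]:
    """Rebuild rows of one optional list column with optional elements.

    Assumed meaning of the levels (hypothetical schema, max def level 3):
      definition 0 -> the list itself is null
      definition 1 -> the list is present but empty
      definition 2 -> an element slot exists but the element is null
      definition 3 -> the element is present; consume one stored value
      repetition 0 -> the entry starts a new row
      repetition 1 -> the entry continues the current row's list
    """
    rows: List[Optional[List[Any]]] = []
    stored = iter(values)  # only non-null elements are stored in the column

    for rep, definition in zip(rep_levels, def_levels):
        if rep == 0:
            if definition == 0:
                rows.append(None)  # null list, nothing else to do
                continue
            rows.append([])  # start a new, non-null list
            if definition == 1:
                continue  # empty list, no element slot
        # At this point definition is 2 or 3: an element slot in the current list.
        current = rows[-1]
        current.append(next(stored) if definition == 3 else None)

    return rows


if __name__ == "__main__":
    # Encodes the rows [[1, 2], None, [], [None, 3]].
    print(assemble_lists([0, 1, 0, 0, 0, 1], [3, 3, 0, 1, 2, 3], [1, 2, 3]))
```

A real reader also has to handle deeper nesting, maps, and structs, where the maximum levels depend on the schema path of each column; the sketch covers only the single-level list case.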

@norberttech norberttech requested a review from stloyd October 13, 2023 09:30
@norberttech norberttech force-pushed the feature/parquet-reader-writer branch from b2d4455 to 7860195 Compare October 13, 2023 09:56
@norberttech norberttech force-pushed the feature/parquet-reader-writer branch from e559f6f to 9adcc1b Compare October 13, 2023 11:54
@norberttech norberttech force-pushed the feature/parquet-reader-writer branch from 9adcc1b to d9d883c Compare October 13, 2023 11:57
Added reading MAP logical types

Added reading LIST type

Fixed nullable lists handling

First implementation of Dremel encoding

Added logging to Dremel algorithm

Reconstruction of nested data structures based on schema definition

Performance optimization

Attempt to read map with values as lists

Rebuilding structures

Extracted rebuilding columns logic to ColumnBuilder class

Reading nested structures

Restored usage of flow array functions

Added dremel/parquet to test suite

Updated github workflows

Avoid calculating remaining length/current position in BinaryReader on the fly

Make DataSize value object mutable

Move reading multiple values from Buffer into BufferReader

Allow reading from stream

Retrieve column chunks as generator

Moved reading flat columns into generics

Read parquet files struct columns through generators

Fixed reading column chunks

Reduced number of iterations over generators

Keep stream offset to avoid generators overlapping

Read all column chunks from a row group at once to avoid dealing with rows split between pages

Added notes for performance optimizations

Added PageByPage ChunkReader implementation

Fixed reading bytes of array when it's not a string

Adjusted schema ddl generation

Allow limiting the number of returned rows

Fixed limit when there is more than one column chunk

Adjusted composer.json files in all subrepos

Added Parquet Reader options - handling INT96 as DateTime, reading byte arrays as strings, convert nanos to micros timestamps

Marked codename parquet extractor as deprecated

Added snappy extension detection

Converted testsuite fixtures into gzip from snappy

Fixed issues related to missing snappy_uncompress function

Added python scripts used to generate test/fixtures data for reader

Added resources folder into gitattributes as export-ignore

Close stream on ParquetFile destructor

Static analysis fixes

Detached Thrift from Flow Parquet Schema in order to reuse objects by writer

CR Fixes
@norberttech norberttech force-pushed the feature/parquet-reader-writer branch from d9d883c to 968b5e3 Compare October 13, 2023 13:57
@norberttech norberttech merged commit 538579e into flow-php:1.x Oct 13, 2023