Memory leak while reading large files #1857

@dimitribergerilg

Hello!

Description

When reading a large Parquet file (1.3 GB) with flow-php/parquet ^0.24.0, I get a PHP fatal error because the memory limit (512MB) is exhausted.

The same code works with flow-php/parquet version 0.7.4.


Steps to Reproduce

  1. Implement the following getReader function:
public function getReader(FileModel $fileModel): mixed
{
    $reader = new Reader();
    try {
        return $reader->read($fileModel->getTmpFileName());
    } catch (Exception $e) {
        throw new FileException(
            sprintf(
                'Can\'t access the temporary file %s %s %s',
                $fileModel->getTmpFileName(),
                $fileModel->getOriginalFileName(),
                $e->getMessage()
            )
        );
    }
}
  2. Try to iterate over the file content:
$fileResource = $this->getReader($fileModel);
foreach ($fileResource->values(["col1", "col2"]) as $row) {
    dump($row);
    exit;
}
  3. Run with a file of size 1.3 GB.

Expected Behavior

The file should be read row by row without exceeding the PHP memory limit.


Actual Behavior

Execution fails after some time with a PHP Fatal error (memory limit 512MB) before entering the foreach loop.


Additional Attempts

I also tried using readStream:

return $reader->readStream(
    NativeLocalSourceStream::open(
        new Path($fileModel->getTmpFileName())
    )
);

But this does not work with:

"flow-php/etl": "^0.24.0",
"flow-php/parquet": "^0.24.0"

It only works with:

"flow-php/parquet": "0.7.4"

Environment

  • PHP version: 8.4
  • OS: bookworm (dockerised)
  • Memory limit: 512MB
  • flow-php/etl: ^0.24.0
  • flow-php/parquet: ^0.24.0
