Improve handling extractors output #534

@norberttech

Currently, rows that come straight out of an extractor have to be unpacked by hand before they can be used. This is how it looks now:

(new Flow())
    ->read(
        From::array(
            [['id' => 1, 'array' => ['a' => 1, 'b' => 2, 'c' => 3]]],
        )
    )
    ->withEntry('row', ref('row')->unpack())
    ->renameAll('row.', '')
    ->drop('row')
    ->withEntry('array', ref('array')->arrayMerge(lit(['d' => 4])))
    ->write(To::memory($memory = new ArrayMemory()))
    ->run();

The following lines are repeated almost always:

    ->withEntry('row', ref('row')->unpack())
    ->renameAll('row.', '')
    ->drop('row')

We should look into this and introduce an expression that does all three of those things, something like:

ref('row')->rowUnpack() 
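
To make the intent concrete, here is a minimal plain-PHP sketch of what the combined unpack, rename, and drop steps amount to for a single row represented as an array. `unpackRow()` is a hypothetical helper written only for illustration; it is not part of Flow and does not reflect how the proposed `rowUnpack()` expression would actually be implemented.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not a Flow API): lifts the entries nested under $key
 * to the top level of the row and removes $key itself.
 */
function unpackRow(array $row, string $key = 'row'): array
{
    // Nothing to unpack when the key is missing or does not hold an array.
    if (!isset($row[$key]) || !is_array($row[$key])) {
        return $row;
    }

    $nested = $row[$key];
    unset($row[$key]);

    // Keep any remaining top-level entries; nested entries win on name collisions.
    return array_replace($row, $nested);
}

$extracted = ['row' => ['id' => 1, 'array' => ['a' => 1, 'b' => 2, 'c' => 3]]];
$unpacked = unpackRow($extracted);

$withMetadata = unpackRow(['input_file_uri' => 'x', 'row' => ['id' => 1]]);
```

Here `$unpacked` holds `id` and `array` as top-level entries, and `$withMetadata` keeps `input_file_uri` next to the unpacked `id`, which is roughly the shape one would want after a single expression call.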

One might ask why extractors expose the entire row under an ArrayEntry.
The reason is the Config::shouldPutInputIntoRows() option.

When this option is set to true, extractors also provide additional data about the input. For example,
file-based extractors with that option enabled will return something like this:

[
   'input_file_uri' => 'string',
   'row' => [...]
]

Or the Http extractor will put headers, the request URL, etc. there.

This is very handy when, for example, our datasets are stored like this:

  • /datasets/sales/2023/january.json
  • /datasets/sales/2023/february.json
  • /datasets/reports/report_type/123456.xml
  • /datasets/reports/report_type/789010.xml

Thanks to input_file_uri, we can also parse the input file path and get the month name or report ID from it.
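
As a rough illustration of that parsing, the sketch below uses only PHP's built-in `pathinfo()` and `basename()`. The function name `describeInputFile()` and the keys it returns are assumptions made for this example, not something Flow provides.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not a Flow API): splits an input file URI into the
 * name of its parent directory, the file name without extension, and the extension.
 */
function describeInputFile(string $uri): array
{
    $info = pathinfo($uri);

    return [
        // Last segment of the directory part, e.g. "2023" or "report_type".
        'parent' => basename($info['dirname'] ?? ''),
        // File name without extension, e.g. "january" or "123456".
        'name' => $info['filename'],
        'extension' => $info['extension'] ?? null,
    ];
}

$sales = describeInputFile('/datasets/sales/2023/january.json');
// ['parent' => '2023', 'name' => 'january', 'extension' => 'json']

$report = describeInputFile('/datasets/reports/report_type/123456.xml');
// ['parent' => 'report_type', 'name' => '123456', 'extension' => 'xml']
```

In a real pipeline the same logic would run against the `input_file_uri` entry of each row, so the month name or report ID could be added as a regular entry before `row` is unpacked.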
