Skip to content

Add regex scalar function to extract values from strings #1159

@norberttech

Description

@norberttech

This is a use case from a question we got on a discord server:

Input:

<rss xmlns:g="http://base.google.com/ns/1.0">
<channel>
    <item>
        <g:price>123 EUR</g:price>
    </item>
    <item>
        <g:price>154.12 USD</g:price>
    </item>
</channel>
</rss>

expected output:

(price and currency in separate columns)

+--------------+--------+----------+
| price_string |  price | currency |
+--------------+--------+----------+
|      123 EUR | 123.00 |      EUR |
+--------------+--------+----------+
1 rows
+--------------+--------+----------+
| price_string |  price | currency |
+--------------+--------+----------+
|   154.12 USD | 154.12 |      USD |
+--------------+--------+----------+

Currently, it can be achieved by replacing currency/price with empty strings:

df()
    ->read(from_xml(__DIR__ . '/file.xml', 'rss/channel/item'))
    ->withEntry('price_string', ref('node')->xpath('g:price')->domElementValue())
    ->withEntry('price', ref('price_string')->regexReplace(lit('/(A-Z{3})/'), lit(''))->cast('float')->round(lit(2)))
    ->withEntry('currency', ref('price_string')->regexReplace(lit('/\d+(?:\.\d+)?/'), lit('')))
    ->drop('node')
    ->write(to_output(false))
    ->run();

Because we don't have a scalar function that would extract a match from regexp.

I think we can add something similar to:

ref('name')->regex(lit("pattern"), match: lit(0))

Regex - ScalarFunction similar to PregMatch function but instead of confirming that the string matches a pattern, it would let us return given match (or multiple matches maybe)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions