Skip to content

[Variant] Support ['fieldName'] in VariantPath parser#9276

Merged
alamb merged 7 commits intoapache:mainfrom
klion26:variant-path-field-in-bracket
Feb 17, 2026
Merged

[Variant] Support ['fieldName'] in VariantPath parser#9276
alamb merged 7 commits intoapache:mainfrom
klion26:variant-path-field-in-bracket

Conversation

@klion26
Copy link
Member

@klion26 klion26 commented Jan 27, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Add [fieldName] support in VariantPath parser, will throw an error if the parser fails.

Also support escaping \ inside brackets. If we force users to use brackets when the field contains special characters, maybe we can also close #8954

Sample behaviors(read more on the code doc)

  • [foo] -> filed foo
  • [2] -> index 2
  • [a.b] -> field a.b
  • [a\]b] -> field a]b
  • [a\xb] -> field axb

Are these changes tested?

Added tests

Are there any user-facing changes?

Yes, there are some user-facing changes, but parquet-variant is still experient for now, so maybe we don't need to wait for a major version.

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Jan 27, 2026
@klion26 klion26 force-pushed the variant-path-field-in-bracket branch from 88e1964 to 3201b0e Compare January 27, 2026 08:50
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea looks good to me -- thank you @klion26

I think other than the term unchecked this PR looks reasonable to me

cc @scovich in case you have other thoughts


/// Parses a path string, panics on invalid input.
/// Only use for tests for known-valid input.
pub fn from_str_unchecked(s: &'a str) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In rust and other parts of this crate the "unchecked" word is often used to denote unsafe code -- specifically that could lead to undefined behavior

I suggest we don't call it "unchecked"

I think just from_str is probably fine ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is definitely mixed messaging on what "unchecked" and "checked" mean in rust. For example, we have a contradiction of sorts in primitive integer operations:

String operations are similar:

Overall, it seems like

  • foo_checked is a fallible version of foo
    • only in cases where foo is infallible and panics on invalid input
    • if foo is already fallible, no foo_checked will exist
  • foo_unchecked is an unsafe version of foo
    • invalid input is UB
    • foo may be fallible or may panic

So yes, this method should be called just from_str with a doc comment that it panics (use TryFrom if you want to handle errors gracefully).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. changed to from_str_or_panic. because clippy discourage to implement a from_str manually, and the std::str::FromStr::from_str returns Result<Self, Self::Err>; which is the same as TryFrom.

 error: method `from_str` can be confused for the standard trait method `std::str::FromStr::from_str

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about just using try_from("foo").unwrap() ?

@scovich
Copy link
Contributor

scovich commented Jan 29, 2026

Also support escaping \ inside brackets. If we force users to use brackets when the field contains special characters, maybe we can also close #8954

Out of curiosity, why would we need to force escaping anything other than ]? Even a \. wouldn't be enough to make splitting on . safe, so we'd just need a simple parser that scans the string looking for matching ] after encountering a [?

@klion26 klion26 force-pushed the variant-path-field-in-bracket branch from d76d9e7 to 8e2dcdb Compare February 9, 2026 12:24
@klion26 klion26 force-pushed the variant-path-field-in-bracket branch from 8e2dcdb to 634f236 Compare February 9, 2026 12:38
Copy link
Member Author

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb @scovich thanks for the review and sorry for the late reply, I've been busy with other things lately.

Also support escaping \ inside brackets. If we force users to use brackets when the field contains special characters, maybe we can also close #8954

Out of curiosity, why would we need to force escaping anything other than ]? Even a \. wouldn't be enough to make splitting on . safe, so we'd just need a simple parser that scans the string looking for matching ] after encountering a [?

@scovich Sorry for the confusion. In the description, I want to say we only support special characters like '.' between '[' and ']', if the user wants to use special characters, he/she nees to use them between '[' and ']', so the path parser is simpler and easier to maintain. If we prefer to support special characters in addition to '[' and ']', we can continue with #8954 and then implement it.


/// Parses a path string, panics on invalid input.
/// Only use for tests for known-valid input.
pub fn from_str_unchecked(s: &'a str) -> Self {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. changed to from_str_or_panic. because clippy discourage to implement a from_str manually, and the std::str::FromStr::from_str returns Result<Self, Self::Err>; which is the same as TryFrom.

 error: method `from_str` can be confused for the standard trait method `std::str::FromStr::from_str

@alamb
Copy link
Contributor

alamb commented Feb 11, 2026

I'll try and review this one over the next day or two to get it into the 58 release

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @klion26 -- this is looking good. I am sorry for the delay in reviewing.

/// let shredding_type = ShreddedSchemaBuilder::default()
/// // store the "time" field as a separate UTC timestamp
/// .with_path("time", (&DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".into())), true))
/// .with_path(VariantPath::from_str_or_panic("time"), (&DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".into())), true))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks so ugly compared to the previous version

It seems if we are going to do this, we should probably also change the signature of with_path because now P: Into<VariantPath<'a>>, is unlikely to match anything other than VariantPath

What if we changed the signature to take a TryFrom and return the correct error?

Then the example could look like

/// let shredding_type = ShreddedSchemaBuilder::default()
///     // store the "time" field as a separate UTC timestamp
///     .with_path("time", &DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".into())), true))?  <-- note the ? here
/// ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, this's indeed a better way.

single_variant_get_test(
r#"{"some_field": 1234}"#,
VariantPath::from("some_field"),
VariantPath::from_str_or_panic("some_field"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be clearer to be honest:

Suggested change
VariantPath::from_str_or_panic("some_field"),
VariantPath::try_from("some_field").unwrap(),

Or maybe a helper like

fn parse_path(p: &str) -> VariantPath<'a> {
    VariantPath::try_from(p).unwrap(),
}

And then

            parse_path("some_field")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to try_from(..).unwrap()

#[test]
fn test_variant_path_empty_str() {
let path = VariantPath::from("");
let path = VariantPath::from_str_or_panic("");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think

        let path = VariantPath::try_from("").unwrap()

Is no less verbose and is easier to understand (as it uses the standard rust traits)

@klion26 klion26 force-pushed the variant-path-field-in-bracket branch from 6e3c4ce to 7a1d447 Compare February 12, 2026 15:58
Copy link
Member Author

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb I've updated the code, please take another look, thanks.

let path: VariantPath<'a> = path.into();
let path: VariantPath<'a> = path
.try_into()
.map_err(|e| ArrowError::InvalidArgumentError(format!("{:?}", e)))?;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use this line and L544, because we'll have VaraintPath -> VariantPath, which is Infallible, and there is no From<Infallible> for ArrowError, ArrowError is located in arrow-schema, not sure if it's appropriate to make the change there. If add From<Infallible> for ArrowError is better, I can change it.

@alamb alamb added the api-change Changes to the arrow API label Feb 14, 2026
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @klion26 -- I think this change looks good to me

FYI @scovich

@alamb alamb merged commit 2f40f78 into apache:main Feb 17, 2026
18 checks passed
@alamb
Copy link
Contributor

alamb commented Feb 17, 2026

Thanks again @klion26

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @scovich

Ugh, I forgot to submit a review ages ago, sorry... flushing what I had.


/// Parses a path string, panics on invalid input.
/// Only use for tests for known-valid input.
pub fn from_str_unchecked(s: &'a str) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is definitely mixed messaging on what "unchecked" and "checked" mean in rust. For example, we have a contradiction of sorts in primitive integer operations:

String operations are similar:

Overall, it seems like

  • foo_checked is a fallible version of foo
    • only in cases where foo is infallible and panics on invalid input
    • if foo is already fallible, no foo_checked will exist
  • foo_unchecked is an unsafe version of foo
    • invalid input is UB
    • foo may be fallible or may panic

So yes, this method should be called just from_str with a doc comment that it panics (use TryFrom if you want to handle errors gracefully).

/// assert_eq!(path[1], VariantPathElement::field("bar"));
/// ```
///
/// # Example: Accessing filed with bracket
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// # Example: Accessing filed with bracket
/// # Example: Accessing field with bracket

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix this in a followup pr

/// # Example: Accessing filed with bracket
/// ```
/// # use parquet_variant::{VariantPath, VariantPathElement};
/// let path = VariantPath::try_from("a[b.c].d[2]").unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Questions:

  1. How does one distinguish between an array index 2 and a field that happens to be named "2"?
  2. What are the rules for brackets vs. periods? is .foo shorthand for ['foo']?

If we follow JSONpath spec then:

  1. Array indexing is [2] and field access is ['2']
  2. .foo IS shorthand for ['foo'], with specific rules for what characters can appear in the shorthand form (the regexp [_a-zA-Z][_a-zA-Z0-9]* matches a subset of the allowable shorthand names)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently, [2] -> Index, and ['2'] is Path, (relies on unescaped.parse() to determine whether it's index)

The code below will print VariantPath([Field { name: "'3'" }]):::VariantPath([Index { index: 3 }])

let path = VariantPath::try_from("[3]").unwrap();
let p = VariantPath::try_from("['3']").unwrap();
println!("{:?}:::{:?}", p, path);

2 a.foo equals to a['foo'], but leading '.'(.foo) will throw an error

@klion26
Copy link
Member Author

klion26 commented Feb 24, 2026

@alamb @scovich thanks for the review and merge! Sorry for the late reply, just came back from the spring festival holiday, and will continue to follow the remaining issues/comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change Changes to the arrow API parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] support .. and ['fieldName'] syntax in the VariantPath parser [Variant] Support escaping mechanism for VariantPath parsing

3 participants