[Variant] Support ['fieldName'] in VariantPath parser#9276
[Variant] Support ['fieldName'] in VariantPath parser#9276alamb merged 7 commits intoapache:mainfrom
['fieldName'] in VariantPath parser#9276Conversation
88e1964 to
3201b0e
Compare
parquet-variant/src/path.rs
Outdated
|
|
||
| /// Parses a path string, panics on invalid input. | ||
| /// Only use for tests for known-valid input. | ||
| pub fn from_str_unchecked(s: &'a str) -> Self { |
There was a problem hiding this comment.
In rust and other parts of this crate the "unchecked" word is often used to denote unsafe code -- specifically that could lead to undefined behavior
I suggest we don't call it "unchecked"
I think just from_str is probably fine ?
There was a problem hiding this comment.
There is definitely mixed messaging on what "unchecked" and "checked" mean in rust. For example, we have a contradiction of sorts in primitive integer operations:
+panics on overflow- checked_add returns None on overflow
- unchecked_add is
unsafeand overflow is UB.
String operations are similar:
- as_ascii returns None on failure
- as_ascii_unchecked is UB on invalid input
- split_at panics on failure
- split_at_checked returns None on failure.
Overall, it seems like
foo_checkedis a fallible version offoo- only in cases where
foois infallible and panics on invalid input - if
foois already fallible, nofoo_checkedwill exist
- only in cases where
foo_uncheckedis an unsafe version offoo- invalid input is UB
foomay be fallible or may panic
So yes, this method should be called just from_str with a doc comment that it panics (use TryFrom if you want to handle errors gracefully).
There was a problem hiding this comment.
Thanks for the feedback. changed to from_str_or_panic. because clippy discourage to implement a from_str manually, and the std::str::FromStr::from_str returns Result<Self, Self::Err>; which is the same as TryFrom.
error: method `from_str` can be confused for the standard trait method `std::str::FromStr::from_str
There was a problem hiding this comment.
How about just using try_from("foo").unwrap() ?
Out of curiosity, why would we need to force escaping anything other than |
d76d9e7 to
8e2dcdb
Compare
8e2dcdb to
634f236
Compare
klion26
left a comment
There was a problem hiding this comment.
@alamb @scovich thanks for the review and sorry for the late reply, I've been busy with other things lately.
Also support escaping
\inside brackets. If we force users to use brackets when the field contains special characters, maybe we can also close #8954Out of curiosity, why would we need to force escaping anything other than
]? Even a\.wouldn't be enough to make splitting on.safe, so we'd just need a simple parser that scans the string looking for matching]after encountering a[?
@scovich Sorry for the confusion. In the description, I want to say we only support special characters like '.' between '[' and ']', if the user wants to use special characters, he/she nees to use them between '[' and ']', so the path parser is simpler and easier to maintain. If we prefer to support special characters in addition to '[' and ']', we can continue with #8954 and then implement it.
parquet-variant/src/path.rs
Outdated
|
|
||
| /// Parses a path string, panics on invalid input. | ||
| /// Only use for tests for known-valid input. | ||
| pub fn from_str_unchecked(s: &'a str) -> Self { |
There was a problem hiding this comment.
Thanks for the feedback. changed to from_str_or_panic. because clippy discourage to implement a from_str manually, and the std::str::FromStr::from_str returns Result<Self, Self::Err>; which is the same as TryFrom.
error: method `from_str` can be confused for the standard trait method `std::str::FromStr::from_str
|
I'll try and review this one over the next day or two to get it into the 58 release |
| /// let shredding_type = ShreddedSchemaBuilder::default() | ||
| /// // store the "time" field as a separate UTC timestamp | ||
| /// .with_path("time", (&DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".into())), true)) | ||
| /// .with_path(VariantPath::from_str_or_panic("time"), (&DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".into())), true)) |
There was a problem hiding this comment.
This looks so ugly compared to the previous version
It seems if we are going to do this, we should probably also change the signature of with_path because now P: Into<VariantPath<'a>>, is unlikely to match anything other than VariantPath
What if we changed the signature to take a TryFrom and return the correct error?
Then the example could look like
/// let shredding_type = ShreddedSchemaBuilder::default()
/// // store the "time" field as a separate UTC timestamp
/// .with_path("time", &DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".into())), true))? <-- note the ? here
/// ...
There was a problem hiding this comment.
Fixed, this's indeed a better way.
| single_variant_get_test( | ||
| r#"{"some_field": 1234}"#, | ||
| VariantPath::from("some_field"), | ||
| VariantPath::from_str_or_panic("some_field"), |
There was a problem hiding this comment.
I think this would be clearer to be honest:
| VariantPath::from_str_or_panic("some_field"), | |
| VariantPath::try_from("some_field").unwrap(), |
Or maybe a helper like
fn parse_path(p: &str) -> VariantPath<'a> {
VariantPath::try_from(p).unwrap(),
}And then
parse_path("some_field")There was a problem hiding this comment.
changed to try_from(..).unwrap()
parquet-variant/src/path.rs
Outdated
| #[test] | ||
| fn test_variant_path_empty_str() { | ||
| let path = VariantPath::from(""); | ||
| let path = VariantPath::from_str_or_panic(""); |
There was a problem hiding this comment.
I think
let path = VariantPath::try_from("").unwrap()Is no less verbose and is easier to understand (as it uses the standard rust traits)
6e3c4ce to
7a1d447
Compare
| let path: VariantPath<'a> = path.into(); | ||
| let path: VariantPath<'a> = path | ||
| .try_into() | ||
| .map_err(|e| ArrowError::InvalidArgumentError(format!("{:?}", e)))?; |
There was a problem hiding this comment.
Use this line and L544, because we'll have VaraintPath -> VariantPath, which is Infallible, and there is no From<Infallible> for ArrowError, ArrowError is located in arrow-schema, not sure if it's appropriate to make the change there. If add From<Infallible> for ArrowError is better, I can change it.
|
Thanks again @klion26 |
parquet-variant/src/path.rs
Outdated
|
|
||
| /// Parses a path string, panics on invalid input. | ||
| /// Only use for tests for known-valid input. | ||
| pub fn from_str_unchecked(s: &'a str) -> Self { |
There was a problem hiding this comment.
There is definitely mixed messaging on what "unchecked" and "checked" mean in rust. For example, we have a contradiction of sorts in primitive integer operations:
+panics on overflow- checked_add returns None on overflow
- unchecked_add is
unsafeand overflow is UB.
String operations are similar:
- as_ascii returns None on failure
- as_ascii_unchecked is UB on invalid input
- split_at panics on failure
- split_at_checked returns None on failure.
Overall, it seems like
foo_checkedis a fallible version offoo- only in cases where
foois infallible and panics on invalid input - if
foois already fallible, nofoo_checkedwill exist
- only in cases where
foo_uncheckedis an unsafe version offoo- invalid input is UB
foomay be fallible or may panic
So yes, this method should be called just from_str with a doc comment that it panics (use TryFrom if you want to handle errors gracefully).
| /// assert_eq!(path[1], VariantPathElement::field("bar")); | ||
| /// ``` | ||
| /// | ||
| /// # Example: Accessing filed with bracket |
There was a problem hiding this comment.
| /// # Example: Accessing filed with bracket | |
| /// # Example: Accessing field with bracket |
There was a problem hiding this comment.
Will fix this in a followup pr
| /// # Example: Accessing filed with bracket | ||
| /// ``` | ||
| /// # use parquet_variant::{VariantPath, VariantPathElement}; | ||
| /// let path = VariantPath::try_from("a[b.c].d[2]").unwrap(); |
There was a problem hiding this comment.
Questions:
- How does one distinguish between an array index
2and a field that happens to be named"2"? - What are the rules for brackets vs. periods? is
.fooshorthand for['foo']?
If we follow JSONpath spec then:
- Array indexing is
[2]and field access is['2'] .fooIS shorthand for['foo'], with specific rules for what characters can appear in the shorthand form (the regexp[_a-zA-Z][_a-zA-Z0-9]*matches a subset of the allowable shorthand names)
There was a problem hiding this comment.
currently, [2] -> Index, and ['2'] is Path, (relies on unescaped.parse() to determine whether it's index)
The code below will print VariantPath([Field { name: "'3'" }]):::VariantPath([Index { index: 3 }])
let path = VariantPath::try_from("[3]").unwrap();
let p = VariantPath::try_from("['3']").unwrap();
println!("{:?}:::{:?}", p, path);
2 a.foo equals to a['foo'], but leading '.'(.foo) will throw an error
Which issue does this PR close?
..and['fieldName']syntax in the VariantPath parser #9050.VariantPathparsing #8954Rationale for this change
What changes are included in this PR?
Add
[fieldName]support inVariantPathparser, will throw an error if the parser fails.Also support escaping
\inside brackets. If we force users to use brackets when the field contains special characters, maybe we can also close #8954Sample behaviors(read more on the code doc)
[foo]-> filedfoo[2]-> index 2[a.b]-> fielda.b[a\]b]-> fielda]b[a\xb]-> fieldaxbAre these changes tested?
Added tests
Are there any user-facing changes?
Yes, there are some user-facing changes, but
parquet-variantis still experient for now, so maybe we don't need to wait for a major version.