Skip to content

[Arrow] Add API to check if Field has a valid ExtensionType#9677

Merged
alamb merged 4 commits into
apache:mainfrom
sdf-jkl:field-extension-type
Apr 13, 2026
Merged

[Arrow] Add API to check if Field has a valid ExtensionType#9677
alamb merged 4 commits into
apache:mainfrom
sdf-jkl:field-extension-type

Conversation

@sdf-jkl
Copy link
Copy Markdown
Contributor

@sdf-jkl sdf-jkl commented Apr 8, 2026

Which issue does this PR close?

Rationale for this change

Check issue

What changes are included in this PR?

  • Added a has_valid_extension_type API to Field
  • Added a unit test to save behavior

Are these changes tested?

  • yes, unit test

Are there any user-facing changes?

@github-actions github-actions Bot added parquet Changes to the parquet crate arrow Changes to the arrow crate parquet-variant parquet-variant* crates labels Apr 8, 2026
@sdf-jkl
Copy link
Copy Markdown
Contributor Author

sdf-jkl commented Apr 8, 2026

@alamb @scovich I was looking at #8474 and the doc update in #8475.

The current recommendation in #8475 is good:

if field.extension_type_name() == Some(MyExtensionType::NAME) {
    if let Ok(extension_type) = field.try_extension_type::<MyExtensionType>() {
        // ...
    }
}

Checking the name first avoids the two name-related error paths in try_new_from_field_metadata (missing/mismatch).

For a full validity check, though, we currently still have to go through try_new_from_field_metadata, which means:

  1. deserialize_metadata(...)
  2. try_new(...)

Both may return Err, and both are part of validation today.

I think a clean follow-up would be a dedicated validate API on ExtensionType, with a default implementation that simply delegates to try_new_from_field_metadata (or equivalent). That gives us a clearer API now without changing behavior, and leaves room for specialized implementations later if needed. For extensions that override it, this could also avoid allocation/work from constructing an ExtensionType when we only need validation.

@sdf-jkl
Copy link
Copy Markdown
Contributor Author

sdf-jkl commented Apr 8, 2026

I could add ExtensionType::validate here

@sdf-jkl
Copy link
Copy Markdown
Contributor Author

sdf-jkl commented Apr 8, 2026

It was easy to do right away here.

Now instead of building the ExtensionType we only follow the simple validation steps from ExtensionType::try_new.

For these types (Bool8, Json, Uuid, Opaque, TimestampWithOffset, VariantType, RowNumber, RowGroupIndex) it's a simple DataType check.

FixedShapeTensor and VariableShapeTensor don't implement validate yet, but they don't really need it since we don't do a boolean check for them anywhere.

Copy link
Copy Markdown
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM, but one design question.

Comment thread arrow-schema/src/extension/canonical/bool8.rs
/// The default implementation delegates to [`Self::try_new`]. Extension
/// types may override this to validate without constructing `Self`.
fn validate(data_type: &DataType, metadata: Self::Metadata) -> Result<(), ArrowError> {
Self::try_new(data_type, metadata).map(|_| ())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fear we have an API clash here:

  • supports_data_type receives &self, in order to allow configurable extension types based on their metadata
  • validate receives metadata but can only use it by instantiating Self (which requires the very allocation we wanted to avoid).

Ideally, supports_data_type should be implemented in terms of validate instead... but I guess that would be a breaking change?

The next best would be to chase down every extension type that actually has metadata, and implement the two methods in the correct direction.

Does any impl in the arrow crate actually use this provided method? Or is it just a safety net for third party impl to avoid a breaking change?

Copy link
Copy Markdown
Contributor Author

@sdf-jkl sdf-jkl Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a safety net now. No extension type is using the default validate impl

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, thanks!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about as a follow on we remove the default impl (and we merge it for the next major breaking API release)-- that will force any implementations to implement and avoid the potential slowdown

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread arrow-schema/src/field.rs
Copy link
Copy Markdown
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I'd appreciate if @tustvold or @alamb could give a quick sanity check on the API design since they've been much more involved with extension types than I have.

/// The default implementation delegates to [`Self::try_new`]. Extension
/// types may override this to validate without constructing `Self`.
fn validate(data_type: &DataType, metadata: Self::Metadata) -> Result<(), ArrowError> {
Self::try_new(data_type, metadata).map(|_| ())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, thanks!

@alamb alamb merged commit dad0be4 into apache:main Apr 13, 2026
31 checks passed
@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 13, 2026

Thank you @scovich and @sdf-jkl

@sdf-jkl sdf-jkl deleted the field-extension-type branch April 13, 2026 19:21
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
…he#9677)

# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Closes apache#8474.

# Rationale for this change
Check issue
<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

# What changes are included in this PR?
- Added a `has_valid_extension_type` API to `Field`
- Added a unit test to save behavior
<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

# Are these changes tested?
- yes, unit test
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

# Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate parquet Changes to the parquet crate parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Arrow] Consider faster way to check if a Field has an extension type

3 participants