Skip to content

Use Metadata struct instead of HashMap<String, String> for metadata: Ordered; cheap to clone#10075

Open
emilk wants to merge 2 commits into
apache:mainfrom
rerun-io:emilk/metadata-newtype
Open

Use Metadata struct instead of HashMap<String, String> for metadata: Ordered; cheap to clone#10075
emilk wants to merge 2 commits into
apache:mainfrom
rerun-io:emilk/metadata-newtype

Conversation

@emilk
Copy link
Copy Markdown
Contributor

@emilk emilk commented Jun 4, 2026

Which issue does this PR close?

Rationale for this change

Field and Schema store their metadata as HashMap<String, String>, which makes cloning them expensive (a deep clone with an allocation per entry). This is in contrast to basically everything else in arrow-rs, which uses Arc for cheap cloning. HashMap also has a non-deterministic iteration order, forcing every consumer that needs determinism (IPC encoding, Display, Hash, Ord) to collect and sort the keys first.

What changes are included in this PR?

Adds a new Metadata type to arrow-schema:

#[derive(Clone, Default, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct Metadata(
    // `None` means empty, so an empty `Metadata` never allocates
    Option<Arc<BTreeMap<String, String>>>,
);
  • Cloning is always cheap (bumps a refcount).
  • Mutation (insert/remove/extend) is copy-on-write via Arc::make_mut: the underlying map is deep-cloned only if it is shared.
  • Iteration order is deterministic (sorted by key) thanks to BTreeMap.
  • The inner map is never empty (None encodes the empty map), keeping the derived PartialEq/Ord/Hash implementations consistent.
  • Familiar map API: get, insert, remove, contains_key, len, iter, keys, values, clear, Index<&str>, plus From/Into conversions for HashMap, BTreeMap, and [(K, V); N] arrays, and FromIterator/Extend/IntoIterator implementations.
  • serde serialization format is identical to the HashMap it replaces (verified by a test asserting byte-for-byte equality with postcard).

Field::metadata, Schema::metadata, and SchemaBuilder::metadata now use Metadata. The setters (with_metadata, set_metadata, Schema::new_with_metadata) take impl Into<Metadata>, so existing callers passing a HashMap keep compiling, and callers can now also write field.with_metadata([("key", "value")]) directly.

Now-redundant sorting has been removed, since Metadata already iterates in sorted key order:

  • Field::cmp, Field::hash, and Schema::hash (which now is derived) no longer collect and sort keys.
  • arrow_ipc::convert::metadata_to_fb no longer sorts keys (IPC, Flight, and Parquet schema encoding funnel through this).
  • The DataType Display impl no longer sorts metadata entries.

Are these changes tested?

Yes:

  • New unit tests for Metadata covering the empty/allocation-free invariant, insert/get/remove, copy-on-write semantics, deterministic iteration order, equality with std maps, Extend, Debug, and serde round-trips (including format compatibility with the maps it replaces).
  • The existing test suites cover the integration (cargo test --workspace --all-features passes, apart from the pre-existing test_shrink_to_fit_after_concat failure that also fails on main with --all-features).

Are there any user-facing changes?

Yes, this is a breaking change to arrow-schema:

  • Field::metadata(), Field::metadata_mut(), Schema::metadata(), SchemaBuilder::metadata(), and SchemaBuilder::metadata_mut() now return &Metadata/&mut Metadata instead of &HashMap<String, String>/&mut HashMap<String, String>. The public field Schema::metadata is now a Metadata.
  • Field::with_metadata, Field::set_metadata, Schema::with_metadata, and Schema::new_with_metadata now take impl Into<Metadata> instead of HashMap<String, String>. Callers passing a HashMap continue to compile, but callers relying on type inference (e.g. .with_metadata(iter.collect()) or .with_metadata(Default::default())) need an explicit type annotation.
  • ExtensionType::try_new_from_field_metadata and arrow_ipc::convert::metadata_to_fb now take &Metadata instead of &HashMap<String, String>.
  • RecordBatch::schema_metadata_mut now returns &mut Metadata.
  • Code that needs an actual HashMap/BTreeMap can convert with .into() or iterate via .iter().

Since Metadata mirrors the map API, most code (e.g. schema.metadata.get("key"), field.metadata().is_empty()) compiles unchanged.

Additionally, metadata iteration order (e.g. in Debug output) is now deterministic (sorted by key), where it previously followed HashMap's arbitrary order.

@github-actions github-actions Bot added parquet Changes to the parquet crate arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Jun 4, 2026
…efault features

- parquet/src/arrow/schema/virtual_type.rs: formatted via the parquet-specific
  rustfmt invocation (the schema module is hidden behind the experimental!
  macro, so plain cargo fmt does not reach it)
- arrow-avro: drop ambiguous .into() in with_metadata calls inside
  #[cfg(not(feature = "avro_custom_types"))] code

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@timsaucer
Copy link
Copy Markdown
Member

As a point of reference, DataFusion downstream is already doing this essentially in some places
https://github.com/apache/datafusion/blob/7b78f0cea0409f60c79a3833781e621344d52ffd/datafusion/common/src/metadata.rs#L179

introduced by apache/datafusion#17986

pinging @paleolimbot who may be interested in this topic

Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read through this and it's clean, excellent, and well tested...it makes constructing metadata more concise and should help with metadata-heavy field users. In GeoArrow our metadata is frequently 4KB or more, and while I haven't personally been able to notice this in SedonaDB benchmarking, it would be great not to worry about these clones. Thanks!

Copy link
Copy Markdown
Member

@timsaucer timsaucer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is great. It follows the same pattern we have in the downstream DataFusion crate, and it looks well tested. It is a breaking change, which is my only concern. Do we have upgrade guides in arrow-rs? I didn't immediately see any when looking around, but I also didn't dig deep.

Comment on lines +5485 to +5492
Field::new("duration_time_nanos", DataType::Int64, false)
.with_metadata([("logicalType", "arrow.duration-nanos")]),
Field::new("duration_time_micros", DataType::Int64, false)
.with_metadata([("logicalType", "arrow.duration-micros")]),
Field::new("duration_time_millis", DataType::Int64, false)
.with_metadata([("logicalType", "arrow.duration-millis")]),
Field::new("duration_time_seconds", DataType::Int64, false)
.with_metadata([("logicalType", "arrow.duration-seconds")]),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very nice cleanup.

Comment thread arrow-schema/src/field.rs
Comment on lines -133 to -154
let mut keys: Vec<&String> =
self.metadata.keys().chain(other.metadata.keys()).collect();
keys.sort();
for k in keys {
match (self.metadata.get(k), other.metadata.get(k)) {
(None, None) => {}
(Some(_), None) => {
return Ordering::Less;
}
(None, Some(_)) => {
return Ordering::Greater;
}
(Some(v1), Some(v2)) => match v1.cmp(v2) {
Ordering::Equal => {}
other => {
return other;
}
},
}
}

Ordering::Equal
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice removal! Especially for unit tests.

@alamb alamb added the api-change Changes to the arrow API label Jun 5, 2026
@alamb alamb changed the title New struct Metadata: ordered, and cheap to clone Use Metadata struct instead of HashMap<String, String> for metadata: Ordered; cheap to clone Jun 5, 2026
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @emilk (and @timsaucer ) I think this is a great change -- and I think it is also well documented, well tested and well implemented 🤗

FYI @paleolimbot as I think you are also very attuned to metadata representations so this might interest you as well

Also FYI @rustyconover, if we go with this PR I think your PR to add per RecordBatch metadata will git significantly simpler

Since this is a breaking API change we will need to wait for the next major release. We are in the final stages of completing 59.0.0 (a major release) #9110

The next major release is currently slated for August (though we often lag behind the target dates it seems)

However, we could potentially accelerate the major release schedule for this one as I think it is quite valuable

.with_metadata([
(
EXTENSION_TYPE_NAME_KEY,
VariableShapeTensor::NAME.to_owned(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may not need the to_owned():

        .with_metadata([
            (
                EXTENSION_TYPE_NAME_KEY,
                VariableShapeTensor::NAME,
            ),
            (
                EXTENSION_TYPE_METADATA_KEY,
                r#"{ "dim_names": [1, null, 3, 4] }"#,
            ),
        ]);

/// let clone = metadata.clone(); // cheap
/// assert_eq!(clone.get("key"), Some(&"value".to_string()));
///
/// // Mutating one does not affect the other:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a follow on PR it might be cool to add a builder style API too:

let metadata = Metadata::builder()
  .with("key3", "value3")
  .build()

That way then we could write

 .with_metadata([
            (
                EXTENSION_TYPE_NAME_KEY,
                VariableShapeTensor::NAME.to_owned(),
            ),
            (
                EXTENSION_TYPE_METADATA_KEY,
                r#"{ "dim_names": [1, null, 3, 4] }"#.to_owned(),
            ),

something like

 .with_metadata(Metadata::builder()
    .with(EXTENSION_TYPE_NAME_KEY, VariableShapeTensor::NAME),
    .with(EXTENSION_TYPE_METADATA_KEY)
    .build()
  )

// Invariant: the inner map is never empty (`None` encodes the empty map).
// This ensures the derived implementations of `PartialEq`, `Ord`, `Hash`, …
// treat an empty `Metadata` consistently, and the empty case never allocates.
Option<Arc<BTreeMap<String, String>>>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also be mentioning again here that it uses a BTreeMap to ensure consistent iteration order

@alamb alamb added the next-major-release the PR has API changes and it waiting on the next major version label Jun 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change Changes to the arrow API arrow Changes to the arrow crate arrow-avro arrow-avro crate next-major-release the PR has API changes and it waiting on the next major version parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ref-counted Metadata type

4 participants