Use Metadata struct instead of HashMap<String, String> for metadata: Ordered; cheap to clone#10075
Use Metadata struct instead of HashMap<String, String> for metadata: Ordered; cheap to clone#10075emilk wants to merge 2 commits into
Metadata struct instead of HashMap<String, String> for metadata: Ordered; cheap to clone#10075Conversation
…efault features - parquet/src/arrow/schema/virtual_type.rs: formatted via the parquet-specific rustfmt invocation (the schema module is hidden behind the experimental! macro, so plain cargo fmt does not reach it) - arrow-avro: drop ambiguous .into() in with_metadata calls inside #[cfg(not(feature = "avro_custom_types"))] code Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
As a point of reference, DataFusion downstream is already doing this essentially in some places introduced by apache/datafusion#17986 pinging @paleolimbot who may be interested in this topic |
paleolimbot
left a comment
There was a problem hiding this comment.
I read through this and it's clean, excellent, and well tested...it makes constructing metadata more concise and should help with metadata-heavy field users. In GeoArrow our metadata is frequently 4KB or more, and while I haven't personally been able to notice this in SedonaDB benchmarking, it would be great not to worry about these clones. Thanks!
timsaucer
left a comment
There was a problem hiding this comment.
I think this is great. It follows the same pattern we have in the downstream DataFusion crate, and it looks well tested. It is a breaking change, which is my only concern. Do we have upgrade guides in arrow-rs? I didn't immediately see any when looking around, but I also didn't dig deep.
| Field::new("duration_time_nanos", DataType::Int64, false) | ||
| .with_metadata([("logicalType", "arrow.duration-nanos")]), | ||
| Field::new("duration_time_micros", DataType::Int64, false) | ||
| .with_metadata([("logicalType", "arrow.duration-micros")]), | ||
| Field::new("duration_time_millis", DataType::Int64, false) | ||
| .with_metadata([("logicalType", "arrow.duration-millis")]), | ||
| Field::new("duration_time_seconds", DataType::Int64, false) | ||
| .with_metadata([("logicalType", "arrow.duration-seconds")]), |
| let mut keys: Vec<&String> = | ||
| self.metadata.keys().chain(other.metadata.keys()).collect(); | ||
| keys.sort(); | ||
| for k in keys { | ||
| match (self.metadata.get(k), other.metadata.get(k)) { | ||
| (None, None) => {} | ||
| (Some(_), None) => { | ||
| return Ordering::Less; | ||
| } | ||
| (None, Some(_)) => { | ||
| return Ordering::Greater; | ||
| } | ||
| (Some(v1), Some(v2)) => match v1.cmp(v2) { | ||
| Ordering::Equal => {} | ||
| other => { | ||
| return other; | ||
| } | ||
| }, | ||
| } | ||
| } | ||
|
|
||
| Ordering::Equal |
There was a problem hiding this comment.
Nice removal! Especially for unit tests.
struct Metadata: ordered, and cheap to cloneMetadata struct instead of HashMap<String, String> for metadata: Ordered; cheap to clone
alamb
left a comment
There was a problem hiding this comment.
Thank you @emilk (and @timsaucer ) I think this is a great change -- and I think it is also well documented, well tested and well implemented 🤗
FYI @paleolimbot as I think you are also very attuned to metadata representations so this might interest you as well
Also FYI @rustyconover, if we go with this PR I think your PR to add per RecordBatch metadata will git significantly simpler
Since this is a breaking API change we will need to wait for the next major release. We are in the final stages of completing 59.0.0 (a major release) #9110
The next major release is currently slated for August (though we often lag behind the target dates it seems)
However, we could potentially accelerate the major release schedule for this one as I think it is quite valuable
| .with_metadata([ | ||
| ( | ||
| EXTENSION_TYPE_NAME_KEY, | ||
| VariableShapeTensor::NAME.to_owned(), |
There was a problem hiding this comment.
You may not need the to_owned():
.with_metadata([
(
EXTENSION_TYPE_NAME_KEY,
VariableShapeTensor::NAME,
),
(
EXTENSION_TYPE_METADATA_KEY,
r#"{ "dim_names": [1, null, 3, 4] }"#,
),
]);| /// let clone = metadata.clone(); // cheap | ||
| /// assert_eq!(clone.get("key"), Some(&"value".to_string())); | ||
| /// | ||
| /// // Mutating one does not affect the other: |
There was a problem hiding this comment.
As a follow on PR it might be cool to add a builder style API too:
let metadata = Metadata::builder()
.with("key3", "value3")
.build()That way then we could write
.with_metadata([
(
EXTENSION_TYPE_NAME_KEY,
VariableShapeTensor::NAME.to_owned(),
),
(
EXTENSION_TYPE_METADATA_KEY,
r#"{ "dim_names": [1, null, 3, 4] }"#.to_owned(),
),something like
.with_metadata(Metadata::builder()
.with(EXTENSION_TYPE_NAME_KEY, VariableShapeTensor::NAME),
.with(EXTENSION_TYPE_METADATA_KEY)
.build()
)| // Invariant: the inner map is never empty (`None` encodes the empty map). | ||
| // This ensures the derived implementations of `PartialEq`, `Ord`, `Hash`, … | ||
| // treat an empty `Metadata` consistently, and the empty case never allocates. | ||
| Option<Arc<BTreeMap<String, String>>>, |
There was a problem hiding this comment.
It might also be mentioning again here that it uses a BTreeMap to ensure consistent iteration order
Which issue does this PR close?
Metadatatype #10069.Rationale for this change
FieldandSchemastore their metadata asHashMap<String, String>, which makes cloning them expensive (a deep clone with an allocation per entry). This is in contrast to basically everything else inarrow-rs, which usesArcfor cheap cloning.HashMapalso has a non-deterministic iteration order, forcing every consumer that needs determinism (IPC encoding,Display,Hash,Ord) to collect and sort the keys first.What changes are included in this PR?
Adds a new
Metadatatype toarrow-schema:insert/remove/extend) is copy-on-write viaArc::make_mut: the underlying map is deep-cloned only if it is shared.BTreeMap.Noneencodes the empty map), keeping the derivedPartialEq/Ord/Hashimplementations consistent.get,insert,remove,contains_key,len,iter,keys,values,clear,Index<&str>, plusFrom/Intoconversions forHashMap,BTreeMap, and[(K, V); N]arrays, andFromIterator/Extend/IntoIteratorimplementations.serdeserialization format is identical to theHashMapit replaces (verified by a test asserting byte-for-byte equality with postcard).Field::metadata,Schema::metadata, andSchemaBuilder::metadatanow useMetadata. The setters (with_metadata,set_metadata,Schema::new_with_metadata) takeimpl Into<Metadata>, so existing callers passing aHashMapkeep compiling, and callers can now also writefield.with_metadata([("key", "value")])directly.Now-redundant sorting has been removed, since
Metadataalready iterates in sorted key order:Field::cmp,Field::hash, andSchema::hash(which now is derived) no longer collect and sort keys.arrow_ipc::convert::metadata_to_fbno longer sorts keys (IPC, Flight, and Parquet schema encoding funnel through this).DataTypeDisplayimpl no longer sorts metadata entries.Are these changes tested?
Yes:
Metadatacovering the empty/allocation-free invariant, insert/get/remove, copy-on-write semantics, deterministic iteration order, equality with std maps,Extend,Debug, and serde round-trips (including format compatibility with the maps it replaces).cargo test --workspace --all-featurespasses, apart from the pre-existingtest_shrink_to_fit_after_concatfailure that also fails onmainwith--all-features).Are there any user-facing changes?
Yes, this is a breaking change to
arrow-schema:Field::metadata(),Field::metadata_mut(),Schema::metadata(),SchemaBuilder::metadata(), andSchemaBuilder::metadata_mut()now return&Metadata/&mut Metadatainstead of&HashMap<String, String>/&mut HashMap<String, String>. The public fieldSchema::metadatais now aMetadata.Field::with_metadata,Field::set_metadata,Schema::with_metadata, andSchema::new_with_metadatanow takeimpl Into<Metadata>instead ofHashMap<String, String>. Callers passing aHashMapcontinue to compile, but callers relying on type inference (e.g..with_metadata(iter.collect())or.with_metadata(Default::default())) need an explicit type annotation.ExtensionType::try_new_from_field_metadataandarrow_ipc::convert::metadata_to_fbnow take&Metadatainstead of&HashMap<String, String>.RecordBatch::schema_metadata_mutnow returns&mut Metadata.HashMap/BTreeMapcan convert with.into()or iterate via.iter().Since
Metadatamirrors the map API, most code (e.g.schema.metadata.get("key"),field.metadata().is_empty()) compiles unchanged.Additionally, metadata iteration order (e.g. in
Debugoutput) is now deterministic (sorted by key), where it previously followedHashMap's arbitrary order.