Improving performance of open_datatree #8994

@TomNicholas

What is your issue?

The implementation of open_datatree works, but is inefficient, because it calls open_dataset once for every group in the file. We should refactor this to improve the performance, which would fix issues like xarray-contrib/datatree#330.

We discussed this in the datatree meeting, and my understanding is that concretely we need to:

  • Create an asv benchmark for open_datatree, probably involving first writing then benchmarking the opening of a special netCDF file that has no data but lots of groups. (tracked in Add benchmark test for open_datatree #9100)
  • Refactor the NetCDF4DataStore class to create only one CachingFileManager object per file, not one per group (see the line manager = CachingFileManager( in the current code).
  • Refactor NetCDF4BackendEntrypoint.open_datatree to use an implementation that goes through NetCDF4DataStore without calling the top-level xr.open_dataset again.
  • Check that calling xr.open_datatree on a netCDF file has actually become faster.
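
To make the refactoring items above concrete, here is a minimal toy sketch of the difference between creating one file manager per group and sharing one manager per file. It does not use xarray or netCDF4 at all: ToyFileManager, open_group, open_tree_per_group, open_tree_shared and count_managers are hypothetical names invented for illustration, and the "file" is just a dict mapping group paths to variable names. It only shows how the number of manager objects scales under each approach, not how the real backend is or should be written.

```python
class ToyFileManager:
    """Hypothetical stand-in for a per-file manager; counts instantiations."""

    instances = 0  # class-level counter, reset by count_managers()

    def __init__(self, contents):
        ToyFileManager.instances += 1
        self._contents = contents

    def groups(self):
        # Group paths in a stable order.
        return sorted(self._contents)

    def read(self, group):
        # Return the variable names stored in one group.
        return list(self._contents[group])


def open_group(contents, group):
    """Mimics one open_dataset-style call: a fresh manager on every call."""
    return ToyFileManager(contents).read(group)


def open_tree_per_group(contents):
    """Pattern described in the issue: one open (and one manager) per group."""
    return {group: open_group(contents, group) for group in sorted(contents)}


def open_tree_shared(contents):
    """Proposed pattern: create one manager, then walk every group with it."""
    manager = ToyFileManager(contents)
    return {group: manager.read(group) for group in manager.groups()}


def count_managers(opener, contents):
    """Run an opener and report how many managers it created."""
    ToyFileManager.instances = 0
    tree = opener(contents)
    return tree, ToyFileManager.instances


if __name__ == "__main__":
    fake_file = {"/": ["time"], "/a": ["temp"], "/a/b": ["wind"], "/c": []}
    _, per_group = count_managers(open_tree_per_group, fake_file)
    _, shared = count_managers(open_tree_shared, fake_file)
    print(per_group, shared)  # 4 managers versus 1 for a file with 4 groups
```

Both openers return the same mapping; only the number of managers differs, growing with the number of groups in the first case and staying at one in the second. A benchmark on a file with many empty groups, as proposed in the first item, would measure exactly this kind of per-group overhead.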

It would be great to get this done soon as part of the datatree integration project. @kmuehlbauer I know you were interested - are you willing / do you have time to take this task on?
