Add chunks='auto' support for cftime datasets #10527
Conversation
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
Would these changes also work for cf timedeltas, or are they still going to cause problems?

If you can find something that's specifically a cftimedelta and run the …
I did some prodding around yesterday and I realised this won't let us do something like

```python
import xarray as xr

cftime_datafile = "/path/to/file.nc"
xr.open_dataset(cftime_datafile, chunks='auto')
```

yet, only stuff along the lines of

```python
import xarray as xr

cftime_datafile = "/path/to/file.nc"
ds = xr.open_dataset(cftime_datafile, chunks=-1)
ds = ds.chunk('auto')
```

I think implementing the former is going to be a bit harder, but I'm starting to clock the code structure a bit more now, so I'll have a decent crack.
Why so? Are we sending …
Yup, this is the call stack:

```
----> 3 xr.open_dataset(
      4     "/Users/u1166368/xarray/tos_Omon_CESM2-WACCM_historical_r2i1p1f1_gr_185001-201412.nc", chunks="auto"

/Users/u1166368/xarray/xarray/backends/api.py(721)open_dataset()
    720     )
--> 721     ds = _dataset_from_backend_dataset(
    722         backend_ds,

/Users/u1166368/xarray/xarray/backends/api.py(418)_dataset_from_backend_dataset()
    417     if chunks is not None:
--> 418         ds = _chunk_ds(
    419             ds,

/Users/u1166368/xarray/xarray/backends/api.py(368)_chunk_ds()
    367     for name, var in backend_ds.variables.items():
--> 368         var_chunks = _get_chunk(var, chunks, chunkmanager)
    369         variables[name] = _maybe_chunk(

/Users/u1166368/xarray/xarray/structure/chunks.py(102)_get_chunk()
    101
--> 102     chunk_shape = chunkmanager.normalize_chunks(
    103         chunk_shape, shape=shape, dtype=var.dtype, previous_chunks=preferred_chunk_shape

> /Users/u1166368/xarray/xarray/namedarray/daskmanager.py(60)normalize_chunks()
```

I've fixed it in the latest commit, but I think the implementation leaves a lot to be desired too. Do I want to refactor to move the changes in …?

Once I've got the structure there cleaned up, I'll work on replacing the …
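For context, here is a small standalone illustration (not from the PR; the shape and limit are made up) of why `normalize_chunks` chokes on object-dtype cftime data, and of the scaled-limit-plus-float64 workaround discussed further down this thread:

```python
# Hypothetical demo, not part of the PR: shape and limit values are arbitrary.
import numpy as np
from dask.array.core import normalize_chunks

shape = (1980, 180, 360)

# Object dtype (what a decoded cftime variable has) cannot be auto-chunked;
# dask has no way to estimate the byte size of arbitrary Python objects.
try:
    normalize_chunks("auto", shape=shape, dtype=np.dtype(object))
except Exception as err:
    print(type(err).__name__, err)

# Telling dask the data is float64, but shrinking the byte limit to account
# for the real per-element cost, gives it something it can work with.
print(normalize_chunks("auto", shape=shape, dtype=np.float64, limit=16_000_000))
```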
xarray/structure/chunks.py (Outdated)

```python
from xarray.namedarray.utils import build_chunkspec
...
target_chunksize = parse_bytes(dask_config.get("array.chunk-size"))
```
How about adding `get_auto_chunk_size` to the `ChunkManager` class, and putting the dask-specific stuff in the `DaskManager`?
cc @TomNicholas
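To make the suggestion concrete, here's a minimal sketch of roughly what that split could look like. The stand-in classes below are illustrative only (the real base class is xarray's `ChunkManagerEntrypoint`); the dask config lookup mirrors the `parse_bytes(dask_config.get(...))` line in the diff above.

```python
# Rough sketch only: the generic hook lives on the chunk-manager base class,
# while the dask-specific config lookup lives in the dask manager.
from dask import config as dask_config
from dask.utils import parse_bytes


class ChunkManagerStandIn:
    """Simplified stand-in for xarray's ChunkManagerEntrypoint."""

    def get_auto_chunk_size(self) -> int:
        # Generic default; concrete chunk managers override this.
        raise NotImplementedError


class DaskManagerStandIn(ChunkManagerStandIn):
    def get_auto_chunk_size(self) -> int:
        # Dask keeps its target chunk size (e.g. "128MiB") in its config.
        return parse_bytes(dask_config.get("array.chunk-size"))
```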
I guess one bit that's confusing here is that the code path for backends and normal variables is different? So let's add a test that reads from disk, and one that works with a DataArray constructed in memory.
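A rough sketch of what those two tests could look like (the fixture, file name, and variable names here are illustrative, not the tests actually added in this PR; it assumes pytest, dask, and a netCDF backend are installed):

```python
import numpy as np
import pytest
import xarray as xr


@pytest.fixture
def cftime_file(tmp_path):
    # A non-standard calendar forces decoding to cftime objects on open.
    times = xr.date_range("2000-01-01", periods=48, freq="MS", calendar="noleap", use_cftime=True)
    ds = xr.Dataset({"tos": ("time", np.random.rand(48))}, coords={"time": times})
    path = tmp_path / "cftime.nc"
    ds.to_netcdf(path)
    return path


def test_open_dataset_auto_chunks(cftime_file):
    # Backend code path: chunks="auto" passed straight to open_dataset.
    ds = xr.open_dataset(cftime_file, chunks="auto")
    assert ds["tos"].chunks is not None


def test_chunk_auto_in_memory():
    # In-memory code path: .chunk("auto") on a variable holding cftime objects.
    times = xr.date_range("2000-01-01", periods=48, freq="MS", calendar="noleap", use_cftime=True)
    da = xr.DataArray(np.array(times), dims="x")
    assert da.chunk("auto").chunks is not None
```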
It looks like the failing test (same one I commented on above) might be flaky? Now only failing for Windows & Python 3.11, not 3.13: https://github.com/pydata/xarray/actions/runs/17962030991/job/51087173288
```python
    raise NotImplementedError("Only chunks='auto' is supported at present.")
    return dask.array.shuffle(x, indexer, axis, chunks="auto")

...

def get_auto_chunk_size(self) -> int:
```
@tomwhite is there an equivalent for cubed? I didn't see it in the docs...
```python
if _contains_cftime_datetimes(data):
    limit, dtype = fake_target_chunksize(data, chunkmanager.get_auto_chunk_size())
else:
    limit = None
    dtype = data.dtype

chunk_shape = chunkmanager.normalize_chunks(
    chunk_shape,
    shape=shape,
    dtype=dtype,
    limit=limit,
    previous_chunks=preferred_chunk_shape,
)
```
Does this seem fine to you @charles-turner-1? I wanted to avoid calling `get_auto_chunk_size` as much as possible.
Yeah, looks good! `fake_target_chunksize` also contains the same `_contains_cftime_datetimes` check & early return if it's false, so we could remove the check in either `fake_target_chunksize` or here without causing issues, if you think that's a good idea?
I'm guessing you meant calling `fake_target_chunksize` in your comment above, in which case we would probably want to either remove it in that function, or leave it in if we want to reuse `fake_target_chunksize` elsewhere?
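For readers following along, here is a standalone sketch of the idea behind `fake_target_chunksize`, pieced together from the diff fragments in this thread. The real helper lives in xarray's internals and differs in detail: the object-dtype test below stands in for `_contains_cftime_datetimes`, and the element lookup stands in for `first_n_items`.

```python
# Illustrative sketch only, not the PR's implementation.
import sys

import numpy as np


def fake_target_chunksize_sketch(data: np.ndarray, target_chunksize: int) -> tuple[int | None, np.dtype]:
    # Stand-in for the _contains_cftime_datetimes check & early return
    # discussed above.
    if data.dtype != object:
        return None, data.dtype

    # Pretend the data is float64 so dask can reason about it...
    output_dtype = np.dtype(np.float64)
    # ...but estimate how much an element really costs in memory
    # (whether sys.getsizeof is the right estimate is debated below).
    nbytes_approx: int = sys.getsizeof(data.flat[0])
    # Shrink the byte limit so that chunks sized for "float64" elements still
    # come out at roughly target_chunksize once the real per-element cost
    # is accounted for.
    limit = int(target_chunksize * output_dtype.itemsize / nbytes_approx)
    return limit, output_dtype
```

The returned `(limit, dtype)` pair then feeds directly into `chunkmanager.normalize_chunks`, as in the diff above.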
Phew, I think this is good to go. It would be good to clean up the types, but this PR has stalled for a long time.
Apologies for the delay (again). I was on vacation.
No worries, thanks for all your help! I'd love to keep getting my feet wet - do you happen to know if there are any other extant issues in roughly the same parts of the codebase, off the top of your head? If not I'll go digging soon!

There's this one #9897 ;) but it's a bit gnarly; high impact though.

🙏 I'll have a crack!

This is great, thanks @charles-turner-1 and @dcherian!

Go Australia 🇦🇺 (AKA @charles-turner-1), pulling our weight! 😉
```python
output_dtype = np.dtype(np.float64)
...
nbytes_approx: int = sys.getsizeof(first_n_items(data, 1))  # type: ignore[no-untyped-call]
```
I just came across this and I'm not quite sure it's the right size. I think `sys.getsizeof` gives the in-memory size, and `dtype.itemsize` the uncompressed disk size. Consider for instance:

```python
import sys

import cftime
import numpy as np

np.dtype(np.float64).itemsize  # 8
sys.getsizeof(np.float64(1.0))  # 32
sys.getsizeof(np.array([1.0], dtype=np.float64))  # 120
sys.getsizeof(cftime.DatetimeGregorian.fromordinal(2450000))  # 112
```
I'm kind of wondering if setting the dtype to `np.dtype(np.float64)` would suffice.
This is for the assumed size as a float64, right? I think what you're saying is true, but the array overhead rapidly becomes unimportant for reasonably large arrays? Very rough & ready analysis below:

```python
# Does this still matter for decently sized arrays?
import sys

import matplotlib.pyplot as plt
import numpy as np

sizes: list[tuple[int, float]] = []
for n in np.logspace(0, 8, num=50, dtype=int):
    arr = np.zeros(n, dtype=np.float64)
    sizes.append((n, sys.getsizeof(arr) / n))

# Plot size per element vs number of elements
plt.figure(figsize=(10, 6))
plt.plot([n for n, _ in sizes], [size for _, size in sizes], marker="o")
plt.xscale("log")
plt.xlabel("Number of elements in array (log scale)")
plt.ylabel("Size per element (bytes)")
# Add 8 byte line
plt.axhline(y=8, color="r", linestyle="--", label="8")
```
Exactly! Calling `sys.getsizeof` on an array containing a single cftime object is not going to be a good representation of the memory consumption of an array of these things. Even if you pop the object out of the array, that is still not really a good representation of the memory consumption. I think you'd do better with just `nbytes_approx: int = 8`.

I made that same plot you did, but with cftimes inside the array:

```python
# Does this still matter for decently sized arrays?
import sys

import cftime
import matplotlib.pyplot as plt
import numpy as np

sizes: list[tuple[int, float]] = []
for n in np.logspace(0, 4, num=50, dtype=int):
    arr = np.array([cftime.DatetimeGregorian.fromordinal(2450000 + i) for i in range(n)])
    sizes.append((n, sys.getsizeof(arr) / n))

# Plot size per element vs number of elements
plt.figure(figsize=(10, 6))
plt.plot([n for n, _ in sizes], [size for _, size in sizes], marker="o")
plt.xscale("log")
plt.xlabel("Number of elements in array (log scale)")
plt.ylabel("Size per element (bytes)")
# Add 8 byte line
plt.axhline(y=8, color="r", linestyle="--", label="8")
```
My bad - I thought we'd popped the first cftime element out of the array and had a look at its size at that point.
It looks like the cftime elements in the array are 8 bytes too - is that what we expect? I would have expected them to be a bit larger due to the extra overhead...
Assuming I'm wrong about that, it would be much simpler to just tell dask that a cftime is 8 bytes and leave the limit unadjusted - the ratio of the two lines should be pretty much 1 for all decently sized arrays.
On my phone right now, but I'll have a proper look when I get to my computer.
So it does look like the size per element in a numpy array is reliably 8 bytes, but I'm really unconvinced this can be correct, tbh:

```python
import sys

import cftime
import matplotlib.pyplot as plt
import numpy as np

cf_sizes: list[float] = []
f64_sizes: list[float] = []
num_elements: list[int] = []
for n in np.logspace(0, 4, num=50, dtype=int):
    cf_arr = np.array([cftime.DatetimeGregorian.fromordinal(2450000 + i) for i in range(n)])
    cf_sizes.append(sys.getsizeof(cf_arr) / n)
    num_elements.append(n)
    arr = np.zeros(n, dtype=np.float64)
    f64_sizes.append(sys.getsizeof(arr) / n)

# Plot size per element vs number of elements
plt.figure(figsize=(10, 6))
ratio = [s_cf / s_f64 for s_cf, s_f64 in zip(cf_sizes, f64_sizes)]
plt.plot(num_elements, cf_sizes, marker="o", label="cftime")
plt.plot(num_elements, f64_sizes, marker="o", label="float64")
plt.plot(num_elements, ratio, marker="o", label="cftime/float64 ratio")
plt.xscale("log")
plt.xlabel("Number of elements in array (log scale)")
plt.ylabel("Size per element (bytes)")
# Add 8 byte line, unit ratio line
plt.axhline(y=8, color="r", linestyle="--", label="8")
plt.axhline(y=1, color="grey", linestyle="--", label="1")
plt.legend()
```

But if we look at the raw element (same as you did above):

```python
>>> t = cftime.DatetimeGregorian.fromordinal(2450000)
>>> sys.getsizeof(t)
112
```

Since it's just not possible that the cftime objects are magically shrinking when we put them in a numpy array, I assume numpy is storing pointers to objects somewhere on the heap.
I've run a couple of more sophisticated (this does not necessarily mean more likely to be right!) tests here:

```python
# Does this still matter for decently sized arrays?
import gc
import sys
import tracemalloc

import cftime
import matplotlib.pyplot as plt
import numpy as np

cf_sizes: list[float] = []
numel: list[int] = []
for n in np.logspace(0, 5, num=50, dtype=int):
    gc.collect()
    tracemalloc.start()
    snap1 = tracemalloc.take_snapshot()
    cf_arr = cftime.DatetimeGregorian.fromordinal(np.arange(2450000, 2450000 + n))
    snap2 = tracemalloc.take_snapshot()
    stats = snap2.compare_to(snap1, "lineno")
    tracemalloc.stop()
    tot = sum(stat.size_diff for stat in stats)
    cf_sizes.append(tot / n)
    numel.append(n)

# Plot size per element vs number of elements
plt.figure(figsize=(10, 6))
plt.plot(numel, cf_sizes, marker="o", label="cftime")
plt.xscale("log")
plt.yscale("log")
plt.xlabel("Number of elements in array (log scale)")
plt.ylabel("Size per element (bytes)")
# Add 8 byte line
plt.axhline(y=8, color="r", linestyle="--", label="8")
plt.legend()
print(f"nbytes asymptotes to {cf_sizes[-1]:.2f} for large arrays")
```

```
nbytes asymptotes to 120.12 for large arrays
```

I'm still not convinced this is the right number, so I'm still digging. But it looks like we (accidentally/serendipitously) might have gotten in the right ballpark?
EDIT: I've done some more playing, and I reckon we're off by approximately a factor of 2-2.5 ish. No real justification yet, just empirical results.



`use_cftime = True` and `chunks = 'auto'` in `xr.open_dataset()` #9834