Add chunks='auto' support for cftime datasets #10527
Conversation
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
Would these changes also work for cf timedeltas, or are they still going to cause problems?

If you can find something that's specifically a cftimedelta and run the …
I did some prodding around yesterday and I realised this won't let us do something like

```python
import xarray as xr

cftime_datafile = "/path/to/file.nc"
xr.open_dataset(cftime_datafile, chunks='auto')
```

yet, only stuff along the lines of

```python
import xarray as xr

cftime_datafile = "/path/to/file.nc"
ds = xr.open_dataset(cftime_datafile, chunks=-1)
ds = ds.chunk('auto')
```

I think implementing the former is going to be a bit harder, but I'm starting to clock the code structure a bit more now, so I'll have a decent crack.
Why so? Are we sending …
Yup, this is the call stack:

```
----> 3 xr.open_dataset(
      4     "/Users/u1166368/xarray/tos_Omon_CESM2-WACCM_historical_r2i1p1f1_gr_185001-201412.nc", chunks="auto"

/Users/u1166368/xarray/xarray/backends/api.py(721)open_dataset()
    720     )
--> 721     ds = _dataset_from_backend_dataset(
    722         backend_ds,

/Users/u1166368/xarray/xarray/backends/api.py(418)_dataset_from_backend_dataset()
    417     if chunks is not None:
--> 418         ds = _chunk_ds(
    419             ds,

/Users/u1166368/xarray/xarray/backends/api.py(368)_chunk_ds()
    367     for name, var in backend_ds.variables.items():
--> 368         var_chunks = _get_chunk(var, chunks, chunkmanager)
    369         variables[name] = _maybe_chunk(

/Users/u1166368/xarray/xarray/structure/chunks.py(102)_get_chunk()
    101
--> 102     chunk_shape = chunkmanager.normalize_chunks(
    103         chunk_shape, shape=shape, dtype=var.dtype, previous_chunks=preferred_chunk_shape

> /Users/u1166368/xarray/xarray/namedarray/daskmanager.py(60)normalize_chunks()
```

I've fixed it in the latest commit, but I think the implementation leaves a lot to be desired too. Do I want to refactor to move the changes in …?

Once I've got the structure there cleaned up, I'll work on replacing the …
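For context, here is a small standalone illustration (not from the PR; the shape and limit are made up) of why `normalize_chunks` chokes on object-dtype cftime data, and of the scaled-limit-plus-float64 workaround discussed further down this thread:

```python
# Hypothetical demo, not part of the PR: shape and limit values are arbitrary.
import numpy as np
from dask.array.core import normalize_chunks

shape = (1980, 180, 360)

# Object dtype (what a decoded cftime variable has) cannot be auto-chunked;
# dask has no way to estimate the byte size of arbitrary Python objects.
try:
    normalize_chunks("auto", shape=shape, dtype=np.dtype(object))
except Exception as err:
    print(type(err).__name__, err)

# Telling dask the data is float64, but shrinking the byte limit to account
# for the real per-element cost, gives it something it can work with.
print(normalize_chunks("auto", shape=shape, dtype=np.float64, limit=16_000_000))
```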
xarray/structure/chunks.py (Outdated)

```python
from xarray.namedarray.utils import build_chunkspec
...
target_chunksize = parse_bytes(dask_config.get("array.chunk-size"))
```
How about adding `get_auto_chunk_size` to the `ChunkManager` class, and putting the dask-specific stuff in the `DaskManager`?
cc @TomNicholas
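To make the suggestion concrete, here's a minimal sketch of roughly what that split could look like. The stand-in classes below are illustrative only (the real base class is xarray's `ChunkManagerEntrypoint`); the dask config lookup mirrors the `parse_bytes(dask_config.get(...))` line in the diff above.

```python
# Rough sketch only: the generic hook lives on the chunk-manager base class,
# while the dask-specific config lookup lives in the dask manager.
from dask import config as dask_config
from dask.utils import parse_bytes


class ChunkManagerStandIn:
    """Simplified stand-in for xarray's ChunkManagerEntrypoint."""

    def get_auto_chunk_size(self) -> int:
        # Generic default; concrete chunk managers override this.
        raise NotImplementedError


class DaskManagerStandIn(ChunkManagerStandIn):
    def get_auto_chunk_size(self) -> int:
        # Dask keeps its target chunk size (e.g. "128MiB") in its config.
        return parse_bytes(dask_config.get("array.chunk-size"))
```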
I guess one bit that's confusing here is that the code path for backends and normal variables is different? So let's add a test that reads from disk, and one that works with a DataArray constructed in memory.
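A rough sketch of what those two tests could look like (the fixture, file name, and variable names here are illustrative, not the tests actually added in this PR; it assumes pytest, dask, and a netCDF backend are installed):

```python
import numpy as np
import pytest
import xarray as xr


@pytest.fixture
def cftime_file(tmp_path):
    # A non-standard calendar forces decoding to cftime objects on open.
    times = xr.date_range("2000-01-01", periods=48, freq="MS", calendar="noleap", use_cftime=True)
    ds = xr.Dataset({"tos": ("time", np.random.rand(48))}, coords={"time": times})
    path = tmp_path / "cftime.nc"
    ds.to_netcdf(path)
    return path


def test_open_dataset_auto_chunks(cftime_file):
    # Backend code path: chunks="auto" passed straight to open_dataset.
    ds = xr.open_dataset(cftime_file, chunks="auto")
    assert ds["tos"].chunks is not None


def test_chunk_auto_in_memory():
    # In-memory code path: .chunk("auto") on a variable holding cftime objects.
    times = xr.date_range("2000-01-01", periods=48, freq="MS", calendar="noleap", use_cftime=True)
    da = xr.DataArray(np.array(times), dims="x")
    assert da.chunk("auto").chunks is not None
```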
It looks like the failing test (same one I commented on above) might be flaky? Now only failing for Windows & Python 3.11, not 3.13: https://github.com/pydata/xarray/actions/runs/17962030991/job/51087173288
```python
    raise NotImplementedError("Only chunks='auto' is supported at present.")
    return dask.array.shuffle(x, indexer, axis, chunks="auto")

...

def get_auto_chunk_size(self) -> int:
```
@tomwhite is there an equivalent for cubed? I didn't see it in the docs...
```python
if _contains_cftime_datetimes(data):
    limit, dtype = fake_target_chunksize(data, chunkmanager.get_auto_chunk_size())
else:
    limit = None
    dtype = data.dtype

chunk_shape = chunkmanager.normalize_chunks(
    chunk_shape,
    shape=shape,
    dtype=dtype,
    limit=limit,
    previous_chunks=preferred_chunk_shape,
)
```
Does this seem fine to you @charles-turner-1? I wanted to avoid calling `get_auto_chunk_size` as much as possible.
Yeah, looks good! `fake_target_chunksize` also contains the same `_contains_cftime_datetimes` check & early return if it's false, so we could remove the check in either `fake_target_chunksize` or here without causing issues, if you think that's a good idea?
I'm guessing you meant calling `fake_target_chunksize` in your comment above, in which case we would probably want to either remove it in that function, or leave it in if we want to reuse `fake_target_chunksize` elsewhere?
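For readers following along, here is a standalone sketch of the idea behind `fake_target_chunksize`, pieced together from the diff fragments in this thread. The real helper lives in xarray's internals and differs in detail: the object-dtype test below stands in for `_contains_cftime_datetimes`, and the element lookup stands in for `first_n_items`.

```python
# Illustrative sketch only, not the PR's implementation.
import sys

import numpy as np


def fake_target_chunksize_sketch(data: np.ndarray, target_chunksize: int) -> tuple[int | None, np.dtype]:
    # Stand-in for the _contains_cftime_datetimes check & early return
    # discussed above.
    if data.dtype != object:
        return None, data.dtype

    # Pretend the data is float64 so dask can reason about it...
    output_dtype = np.dtype(np.float64)
    # ...but estimate how much an element really costs in memory
    # (whether sys.getsizeof is the right estimate is debated below).
    nbytes_approx: int = sys.getsizeof(data.flat[0])
    # Shrink the byte limit so that chunks sized for "float64" elements still
    # come out at roughly target_chunksize once the real per-element cost
    # is accounted for.
    limit = int(target_chunksize * output_dtype.itemsize / nbytes_approx)
    return limit, output_dtype
```

The returned `(limit, dtype)` pair then feeds directly into `chunkmanager.normalize_chunks`, as in the diff above.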
Phew, I think this is good to go. It would be good to clean up the types, but this PR has stalled for a long time.
Apologies for the delay (again). I was on vacation.
No worries, thanks for all your help! I'd love to keep getting my feet wet - do you happen to know if there are any other extant issues in roughly the same parts of the codebase, off the top of your head? If not I'll go digging soon!

There's this one #9897 ;) but it's a bit gnarly; high impact though.

🙏 I'll have a crack!

This is great, thanks @charles-turner-1 and @dcherian!

Go Australia 🇦🇺 (AKA @charles-turner-1), pulling our weight! 😉
```python
output_dtype = np.dtype(np.float64)
...
nbytes_approx: int = sys.getsizeof(first_n_items(data, 1))  # type: ignore[no-untyped-call]
```
I just came across this and I'm not quite sure it's the right size. I think `sys.getsizeof` gives the in-memory size, and `dtype.itemsize` the uncompressed disk size. Consider for instance:

```python
import sys

import cftime
import numpy as np

np.dtype(np.float64).itemsize  # 8
sys.getsizeof(np.float64(1.0))  # 32
sys.getsizeof(np.array([1.0], dtype=np.float64))  # 120
sys.getsizeof(cftime.DatetimeGregorian.fromordinal(2450000))  # 112
```
I'm kind of wondering if setting the dtype to `np.dtype(np.float64)` would suffice.
This is for the assumed size as a float64, right? I think what you're saying is true, but the array overhead rapidly becomes unimportant for reasonably large arrays? Very rough & ready analysis below:

```python
# Does this still matter for decently sized arrays?
import sys

import matplotlib.pyplot as plt
import numpy as np

sizes: list[tuple[int, float]] = []
for n in np.logspace(0, 8, num=50, dtype=int):
    arr = np.zeros(n, dtype=np.float64)
    sizes.append((n, sys.getsizeof(arr) / n))

# Plot size per element vs number of elements
plt.figure(figsize=(10, 6))
plt.plot([n for n, _ in sizes], [size for _, size in sizes], marker="o")
plt.xscale("log")
plt.xlabel("Number of elements in array (log scale)")
plt.ylabel("Size per element (bytes)")
# Add 8 byte line
plt.axhline(y=8, color="r", linestyle="--", label="8")
```
Exactly! Calling `sys.getsizeof` on an array containing a single cftime object is not going to be a good representation of the memory consumption of an array of these things. Even if you pop the object out of the array, that is still not really a good representation of the memory consumption. I think you'd do better with just `nbytes_approx: int = 8`.

I made that same plot you did, but with cftimes inside the array:

```python
# Does this still matter for decently sized arrays?
import sys

import cftime
import matplotlib.pyplot as plt
import numpy as np

sizes: list[tuple[int, float]] = []
for n in np.logspace(0, 4, num=50, dtype=int):
    arr = np.array([cftime.DatetimeGregorian.fromordinal(2450000 + i) for i in range(n)])
    sizes.append((n, sys.getsizeof(arr) / n))

# Plot size per element vs number of elements
plt.figure(figsize=(10, 6))
plt.plot([n for n, _ in sizes], [size for _, size in sizes], marker="o")
plt.xscale("log")
plt.xlabel("Number of elements in array (log scale)")
plt.ylabel("Size per element (bytes)")
# Add 8 byte line
plt.axhline(y=8, color="r", linestyle="--", label="8")
```
My bad - I thought we'd popped the first cftime element out of the array and had a look at its size at that point.
It looks like the cftime elements in the array are 8 bytes too - is that what we expect? I would have expected them to be a bit larger due to the extra overhead...
Assuming I'm wrong about that, it would be much simpler to just tell dask that a cftime is 8 bytes and leave the limit unadjusted - the ratio of the two lines should be pretty much 1 for all decently sized arrays.
On my phone right now, but I'll have a proper look when I get to my computer.
So it does look like the size per element in a numpy array is reliably 8 bytes, but I'm really unconvinced this can be correct, tbh:

```python
import sys

import cftime
import matplotlib.pyplot as plt
import numpy as np

cf_sizes: list[float] = []
f64_sizes: list[float] = []
num_elements: list[int] = []
for n in np.logspace(0, 4, num=50, dtype=int):
    cf_arr = np.array([cftime.DatetimeGregorian.fromordinal(2450000 + i) for i in range(n)])
    cf_sizes.append(sys.getsizeof(cf_arr) / n)
    num_elements.append(n)
    arr = np.zeros(n, dtype=np.float64)
    f64_sizes.append(sys.getsizeof(arr) / n)

# Plot size per element vs number of elements
plt.figure(figsize=(10, 6))
ratio = [s_cf / s_f64 for s_cf, s_f64 in zip(cf_sizes, f64_sizes)]
plt.plot(num_elements, cf_sizes, marker="o", label="cftime")
plt.plot(num_elements, f64_sizes, marker="o", label="float64")
plt.plot(num_elements, ratio, marker="o", label="cftime/float64 ratio")
plt.xscale("log")
plt.xlabel("Number of elements in array (log scale)")
plt.ylabel("Size per element (bytes)")
# Add 8 byte line, unit ratio line
plt.axhline(y=8, color="r", linestyle="--", label="8")
plt.axhline(y=1, color="grey", linestyle="--", label="1")
plt.legend()
```

But if we look at the raw element (same as you did above):

```python
>>> t = cftime.DatetimeGregorian.fromordinal(2450000)
>>> sys.getsizeof(t)
112
```

Since it's just not possible that the cftime objects are magically shrinking when we put them in a numpy array, I assume numpy is storing pointers to objects somewhere on the heap.
I've run a couple of more sophisticated (this does not necessarily mean more likely to be right!) tests here:

```python
# Does this still matter for decently sized arrays?
import gc
import sys
import tracemalloc

import cftime
import matplotlib.pyplot as plt
import numpy as np

cf_sizes: list[float] = []
numel: list[int] = []
for n in np.logspace(0, 5, num=50, dtype=int):
    gc.collect()
    tracemalloc.start()
    snap1 = tracemalloc.take_snapshot()
    cf_arr = cftime.DatetimeGregorian.fromordinal(np.arange(2450000, 2450000 + n))
    snap2 = tracemalloc.take_snapshot()
    stats = snap2.compare_to(snap1, "lineno")
    tracemalloc.stop()
    tot = sum(stat.size_diff for stat in stats)
    cf_sizes.append(tot / n)
    numel.append(n)

# Plot size per element vs number of elements
plt.figure(figsize=(10, 6))
plt.plot(numel, cf_sizes, marker="o", label="cftime")
plt.xscale("log")
plt.yscale("log")
plt.xlabel("Number of elements in array (log scale)")
plt.ylabel("Size per element (bytes)")
# Add 8 byte line
plt.axhline(y=8, color="r", linestyle="--", label="8")
plt.legend()
print(f"nbytes asymptotes to {cf_sizes[-1]:.2f} for large arrays")
```

```
nbytes asymptotes to 120.12 for large arrays
```

I'm still not convinced this is the right number, so I'm still digging. But it looks like we (accidentally/serendipitously) might have gotten in the right ballpark?
EDIT: I've done some more playing, and I reckon we're off by approximately a factor of 2-2.5 ish. No real justification yet, just empirical results.



`use_cftime = True` and `chunks = 'auto'` in `xr.open_dataset()` #9834