10 minutes to flox¶
GroupBy single variable¶
import numpy as np
import xarray as xr
from flox.xarray import xarray_reduce
labels = xr.DataArray(
[1, 2, 3, 1, 2, 3, 0, 0, 0],
dims="x",
name="label",
)
labels
<xarray.DataArray 'label' (x: 9)> Size: 72B
array([1, 2, 3, 1, 2, 3, 0, 0, 0])
Dimensions without coordinates: x
With numpy¶
da = xr.DataArray(
np.ones((9,)), dims="x", name="array"
)
Apply the reduction using flox.xarray.xarray_reduce(), specifying the reduction operation with the func kwarg:
xarray_reduce(da, labels, func="sum")
<xarray.DataArray 'array' (label: 4)> Size: 32B
array([3., 2., 2., 2.])
Coordinates:
  * label    (label) int64 32B 0 1 2 3
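For intuition, a grouped sum over small non-negative integer labels is equivalent to a weighted np.bincount. This is a numpy-only sketch of what the reduction computes, not flox's actual implementation:

```python
import numpy as np

labels = np.array([1, 2, 3, 1, 2, 3, 0, 0, 0])
values = np.ones(9)

# Position i of the output is the sum of `values` where labels == i,
# which is exactly what the grouped sum above produces.
grouped_sum = np.bincount(labels, weights=values)
print(grouped_sum)  # [3. 2. 2. 2.]
```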
With dask¶
Let’s first chunk da and labels
da_chunked = da.chunk(x=2)
labels_chunked = labels.chunk(x=3)
Grouping a dask array by a numpy array is unchanged
xarray_reduce(da_chunked, labels, func="sum")
<xarray.DataArray 'array' (label: 4)> Size: 32B
dask.array<groupby_nansum, shape=(4,), dtype=float64, chunksize=(1,), chunktype=numpy.ndarray>
Coordinates:
  * label    (label) int64 32B 0 1 2 3
When grouping by a dask array, we need to specify the “expected group labels” on the output so we can construct the result DataArray.
Without the expected_groups kwarg, an error is raised
xarray_reduce(da_chunked, labels_chunked, func="sum")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 1
----> 1 xarray_reduce(da_chunked, labels_chunked, func="sum")

ValueError: Please provide expected_groups if not grouping by a numpy array.
Now we specify expected_groups:
dask_result = xarray_reduce(
da_chunked, labels_chunked, func="sum", expected_groups=[0, 1, 2, 3],
)
dask_result
<xarray.DataArray 'array' (label: 4)> Size: 32B
dask.array<groupby_nansum, shape=(4,), dtype=float64, chunksize=(4,), chunktype=numpy.ndarray>
Coordinates:
  * label    (label) int64 32B 0 1 2 3
Note that any group labels not present in expected_groups will be ignored.
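The "unlisted labels are dropped" semantics can be sketched with pandas; this is an illustration of the behaviour, not flox's implementation:

```python
import numpy as np
import pandas as pd

labels = np.array([1, 2, 3, 1, 2, 3, 0, 0, 0])
values = np.ones(9)

# Group, then keep only the "expected" labels; label 0 is silently dropped.
sums = pd.Series(values).groupby(labels).sum()
subset = sums.reindex([1, 2, 3])
print(subset.tolist())  # [2.0, 2.0, 2.0] — the three 0-labelled values are ignored
```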
You can also provide expected_groups for the pure numpy GroupBy.
numpy_result = xarray_reduce(
da, labels, func="sum", expected_groups=[0, 1, 2, 3],
)
numpy_result
<xarray.DataArray 'array' (label: 4)> Size: 32B
array([3., 2., 2., 2.])
Coordinates:
  * label    (label) int64 32B 0 1 2 3
The two are identical:
numpy_result.identical(dask_result)
True
Binning by a single variable¶
For binning, specify the bin edges in expected_groups using pandas.IntervalIndex:
import pandas as pd
xarray_reduce(
da,
labels,
func="sum",
expected_groups=pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6]),
)
<xarray.DataArray 'array' (label_bins: 3)> Size: 24B
array([2., 2., 2.])
Coordinates:
  * label_bins  (label_bins) object 24B (0.5, 1.5] (1.5, 2.5] (2.5, 6.0]
Similarly for dask inputs
xarray_reduce(
da_chunked,
labels_chunked,
func="sum",
expected_groups=pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6]),
)
<xarray.DataArray 'array' (label_bins: 3)> Size: 24B
dask.array<groupby_nansum, shape=(3,), dtype=float64, chunksize=(3,), chunktype=numpy.ndarray>
Coordinates:
  * label_bins  (label_bins) object 24B (0.5, 1.5] (1.5, 2.5] (2.5, 6.0]
For more control over the binning (which edge is closed), pass the appropriate kwarg to pandas.IntervalIndex:
xarray_reduce(
da_chunked,
labels_chunked,
func="sum",
expected_groups=pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6], closed="left"),
)
<xarray.DataArray 'array' (label_bins: 3)> Size: 24B
dask.array<groupby_nansum, shape=(3,), dtype=float64, chunksize=(3,), chunktype=numpy.ndarray>
Coordinates:
  * label_bins  (label_bins) object 24B [0.5, 1.5) [1.5, 2.5) [2.5, 6.0)
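The effect of closed is easiest to see with a value sitting exactly on an interior bin edge. This small pandas-only check is independent of flox:

```python
import pandas as pd

edges = [0.5, 1.5, 2.5, 6]
right_closed = pd.IntervalIndex.from_breaks(edges)                # (0.5, 1.5] (1.5, 2.5] (2.5, 6.0]
left_closed = pd.IntervalIndex.from_breaks(edges, closed="left")  # [0.5, 1.5) [1.5, 2.5) [2.5, 6.0)

# A value exactly on the interior edge 2.5 lands in different bins:
print(right_closed.get_indexer([2.5]))  # [1] — 2.5 is in (1.5, 2.5]
print(left_closed.get_indexer([2.5]))   # [2] — 2.5 is in [2.5, 6.0)
```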
Grouping by multiple variables¶
arr = np.ones((4, 12))
labels1 = np.array(["a", "a", "c", "c", "c", "b", "b", "c", "c", "b", "b", "f"])
labels2 = np.array([1, 2, 2, 1])
da = xr.DataArray(
    arr, dims=("x", "y"), coords={"labels2": ("x", labels2), "labels1": ("y", labels1)}
)
da
<xarray.DataArray (x: 4, y: 12)> Size: 384B
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
Coordinates:
    labels2  (x) int64 32B 1 2 2 1
    labels1  (y) <U1 48B 'a' 'a' 'c' 'c' 'c' 'b' 'b' 'c' 'c' 'b' 'b' 'f'
Dimensions without coordinates: x, y

To group by multiple variables, simply pass them as *args:
xarray_reduce(da, "labels1", "labels2", func="sum")
<xarray.DataArray (labels1: 4, labels2: 2)> Size: 64B
array([[ 4., 4.],
[ 8., 8.],
[10., 10.],
[ 2., 2.]])
Coordinates:
* labels1 (labels1) object 32B 'a' 'b' 'c' 'f'
  * labels2  (labels2) int64 16B 1 2

Histogramming (Binning by multiple variables)¶
An unweighted histogram is simply a groupby over multiple variables with func="count".
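As a numpy-only cross-check of that idea (not part of flox), the same 2D count can be built with np.histogram2d, using the label values constructed in this section:

```python
import numpy as np

labels1 = np.linspace(0, 10, 12)  # varies along y
labels2 = np.array([1, 2, 2, 1])  # varies along x

# Pair up the two label arrays on the 4 x 12 grid — one (labels1, labels2)
# pair per array element — then count pairs per 2D bin.
l1 = np.broadcast_to(labels1, (4, 12)).ravel()
l2 = np.broadcast_to(labels2[:, None], (4, 12)).ravel()

# Note: np.histogram2d bins are closed on the left, while pandas.IntervalIndex
# defaults to closed on the right; no value here sits exactly on an edge, so
# the conventions give the same counts.
counts, _, _ = np.histogram2d(l1, l2, bins=[[-0.5, 4.5, 6.5, 8.9], [0.5, 1.5, 1.9]])
print(counts)
```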
arr = np.ones((4, 12))
labels1 = np.array(np.linspace(0, 10, 12))
labels2 = np.array([1, 2, 2, 1])
da = xr.DataArray(
arr, dims=("x", "y"), coords={"labels2": ("x", labels2), "labels1": ("y", labels1)}
)
da
<xarray.DataArray (x: 4, y: 12)> Size: 384B
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
Coordinates:
labels2 (x) int64 32B 1 2 2 1
labels1 (y) float64 96B 0.0 0.9091 1.818 2.727 ... 7.273 8.182 9.091 10.0
Dimensions without coordinates: x, y

Specify bins in expected_groups:
xarray_reduce(
da,
"labels1",
"labels2",
func="count",
expected_groups=(
pd.IntervalIndex.from_breaks([-0.5, 4.5, 6.5, 8.9]), # labels1
pd.IntervalIndex.from_breaks([0.5, 1.5, 1.9]), # labels2
),
)
<xarray.DataArray (labels1_bins: 3, labels2_bins: 2)> Size: 48B
array([[10, 0],
[ 6, 0],
[ 4, 0]])
Coordinates:
* labels1_bins (labels1_bins) object 24B (-0.5, 4.5] (4.5, 6.5] (6.5, 8.9]
  * labels2_bins  (labels2_bins) object 16B (0.5, 1.5] (1.5, 1.9]

Resampling¶
Use the xarray interface, i.e. da.resample(time="M").mean().
Optionally pass method="blockwise": da.resample(time="M").mean(method="blockwise")
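Resampling follows the same bin-then-reduce pattern as a groupby over time bins. A pandas sketch of a monthly mean, illustrative only (the month-start "MS" frequency is a choice here; for xarray use da.resample as above):

```python
import numpy as np
import pandas as pd

# Assign each timestamp to a month, then reduce within each month.
times = pd.date_range("2021-01-01", periods=60, freq="D")
series = pd.Series(np.arange(60.0), index=times)
monthly = series.resample("MS").mean()  # "MS": bins labelled by month start
print(monthly.tolist())  # [15.0, 44.5, 59.0] — Jan, Feb, and a 1-day Mar bin
```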