4,634 questions
0 votes · 0 answers · 41 views
TokenizationError when loading h5py dataset as dask dataframe
My goal is to process (via an sklearn Pipeline) a large HDF file that doesn't fit into RAM.
The core data is an irregular multivariate time-series (a very long 2D array). It could be split columnwise to fit ...
2 votes · 0 answers · 59 views
Task works locally but errors on a Dask cluster: "SystemError: error return without exception set"
I have the following code that passes an array to the task and submits it to a Dask cluster. The Dask cluster runs in Docker with several Dask workers. Docker starts with:
scheduler:
docker run -d \
-...
2 votes · 0 answers · 67 views
How to optimize NetCDF files and dask for processing long-term climatological indices with xclim (e.g. SPI using a 30-day rolling window)?
I am trying to analyze the 30-day standardized precipitation index for a multi-state range of the southeastern US for the year 2016. I'm using xclim to process a direct pull of gridded daily ...
0 votes · 0 answers · 25 views
Introducing new dimension in xarray apply_ufunc
There has been at least one other question regarding the introduction of new dimensions in the output of xarray.apply_ufunc; I have two problems with this answer: First, I feel like the answer avoids ...
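For context, the standard way to introduce a brand-new dimension with `apply_ufunc` is `output_core_dims` plus `dask_gufunc_kwargs={"output_sizes": ...}`; the function must return the new dimension as its last axis. A minimal sketch with hypothetical data (quantiles over time):

```python
import numpy as np
import xarray as xr

def per_quantile(arr, q):
    # reduce the last ("time") axis; move the new quantile axis to the end,
    # since output core dims are expected as trailing axes
    return np.moveaxis(np.quantile(arr, q, axis=-1), 0, -1)

da = xr.DataArray(
    np.arange(500.0).reshape(5, 100), dims=("space", "time")
).chunk({"space": 2})
q = [0.25, 0.5, 0.75]

out = xr.apply_ufunc(
    per_quantile,
    da,
    input_core_dims=[["time"]],        # dimension consumed by the function
    output_core_dims=[["quantile"]],   # dimension newly introduced
    kwargs={"q": q},
    dask="parallelized",
    dask_gufunc_kwargs={"output_sizes": {"quantile": len(q)}},
    output_dtypes=[float],
).assign_coords(quantile=q)
```

`output_sizes` is what lets dask build the output graph without running the function first.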
0 votes · 0 answers · 43 views
Dask distributed stores an old version of my code
I am analysing data using Dask distributed on a SLURM cluster, working from a Jupyter notebook. I change my codebase frequently and rerun jobs; recently, a lot of my jobs started to crash....
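A hedged note on the usual remedies: long-lived workers keep imported modules cached, so edited code is not picked up until the workers are replaced. `Client.restart()` (or `Client.upload_file` for a single module) addresses this; the sketch below uses a throwaway `LocalCluster` in place of the SLURM deployment.

```python
from dask.distributed import Client, LocalCluster

# a throwaway local cluster standing in for the SLURM deployment
cluster = LocalCluster(n_workers=1, threads_per_worker=1,
                       dashboard_address=None)
client = Client(cluster)

# restart() replaces every worker process, so freshly edited modules are
# re-imported instead of being served from a stale cache
client.restart()

# alternatively, push one edited file to all current workers:
# client.upload_file("my_module.py")

check = client.submit(lambda x: x + 1, 41).result()
client.close()
cluster.close()
```

Restarting the Jupyter kernel fixes the client-side copy of the module, but only a worker restart fixes the cluster-side copy.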
3 votes · 0 answers · 84 views
How to drop rows with a boolean mask in xarray/dask without .compute() blowing up memory?
I’m trying to subset a large xarray.Dataset backed by Dask and save it back to Zarr, but I’m running into a major memory problem when attempting to drop rows with a boolean mask.
Here’s a minimal ...
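One common pattern worth noting: the boolean mask along a single dimension is tiny compared with the data, so computing only the mask and then doing a lazy integer `isel` avoids materialising the dataset. A minimal sketch with a hypothetical dataset and dimension names:

```python
import numpy as np
import xarray as xr

# hypothetical Dask-backed dataset with a "row" dimension to filter
ds = xr.Dataset(
    {"temp": (("row", "col"), np.arange(20.0).reshape(10, 2))}
).chunk({"row": 5})

# the 1-D mask is cheap to materialise; the data selection stays lazy
mask = (ds["temp"].isel(col=0) > 5).compute()
subset = ds.isel(row=np.flatnonzero(mask.values))
# subset.to_zarr("out.zarr") would then stream the result chunk by chunk
```

The key point is that only `mask` is computed eagerly; `subset` remains a lazy view until written.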
0 votes · 1 answer · 54 views
How to connect to Dask Gateway Server from inside a Docker container?
I have a method that connects my app to a Dask Gateway Server
def set_up_dask(dashboard=False, num_workers=4, min_workers=4, max_workers=50):
    gateway = Gateway("http://127.0.0.1:8000")
    ...
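For context: inside a container, `127.0.0.1` resolves to the container itself, not the host where the Gateway server is listening. A hedged sketch, assuming the Gateway runs on the host and Docker 20.10+ on Linux (Docker Desktop provides `host.docker.internal` out of the box):

```shell
# give the container a route back to the host, then point the Gateway
# client at host.docker.internal:8000 instead of 127.0.0.1:8000
docker run --add-host=host.docker.internal:host-gateway my-app
```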
0 votes · 0 answers · 49 views
How to properly use joblib files in Dask?
from joblib import load
ntrees_16_model = load(r"ntrees_quantile_16_model_watermask.joblib")
ntrees_50_model = load(r"ntrees_quantile_50_model_watermask.joblib")
ntrees_84_model = ...
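A hedged sketch of the usual pattern for large loaded models in Dask: `Client.scatter(..., broadcast=True)` ships the object to the workers once, and tasks then receive a cheap handle instead of re-serialising the model with every submission. `TinyModel` below is a hypothetical stand-in for the joblib-loaded quantile forests.

```python
from dask.distributed import Client, LocalCluster

class TinyModel:
    # hypothetical stand-in for a joblib-loaded quantile forest
    def predict(self, xs):
        return [v * 2 for v in xs]

cluster = LocalCluster(n_workers=2, threads_per_worker=1,
                       processes=False, dashboard_address=None)
client = Client(cluster)

# scatter the model once; each task then receives a lightweight reference
model_ref = client.scatter(TinyModel(), broadcast=True)
preds = client.submit(lambda m, xs: m.predict(xs),
                      model_ref, [1, 2, 3]).result()

client.close()
cluster.close()
```

With real joblib files, `load(...)` would replace the `TinyModel()` construction; everything downstream is unchanged.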
0 votes · 0 answers · 66 views
Why does XGBoost training (with DMatrix) write heavily to disk instead of using RAM?
I am training an XGBoost model in Python on a dataset with approximately 20k features and 30M records.
The features are sparse, and I am using xgboost.DMatrix for training.
Problem
During training, ...
0 votes · 2 answers · 69 views
Issues getting PyCaret/Fugue to work with a Dask backend
I am trying to use PyCaret with Fugue on a Dask backend and I'm running into an issue.
Using the following:
pycaret 3.3.2
fugue 0.9.1
dask ...
1 vote · 1 answer · 80 views
How to reduce xarray.coarsen with majority vote?
I'm currently trying to resample a large geotiff file to a coarser resolution. This file contains classes of tree species (indicated by integer values) at each pixel, so I want to resample each block (...
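A hedged sketch of one way to do this: `coarsen(...).reduce(func)` accepts any callable of the form `func(arr, axis)`, where `axis` is the tuple of window axes, so a custom majority vote can be plugged in. The class raster below is a tiny hypothetical stand-in; labels are assumed to be small non-negative integers (typical for species codes), which lets `np.bincount` serve as the mode.

```python
import numpy as np
import xarray as xr

def majority(arr, axis):
    # move the window axes to the end, flatten them, then take the most
    # frequent value in each window
    arr = np.moveaxis(arr, axis, tuple(range(-len(axis), 0)))
    flat = arr.reshape(arr.shape[:-len(axis)] + (-1,))
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), -1, flat)

# hypothetical 4x4 raster of integer class labels
classes = xr.DataArray(
    np.array([[1, 1, 2, 2],
              [1, 3, 2, 2],
              [4, 4, 5, 5],
              [4, 4, 5, 6]]),
    dims=("y", "x"),
)
coarse = classes.coarsen(y=2, x=2).reduce(majority)
```

Ties resolve to the lowest label here (`argmax` on `bincount`); a different tie-break would need an explicit rule.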
0 votes · 1 answer · 66 views
High RAM usage when using Datashader with a Dask-backed xarray
I have a Dask-backed xarray which is about 150k x 90k with a chunk size of 8192 x 8192. I am working on a Windows virtual machine with 100 GB of RAM and 16 cores.
I want to plot it using the Datashader ...
0 votes · 0 answers · 30 views
Is it possible to use Dask distributed for pandas apply instead of multiprocessing?
I need some advice.
Right now I do some computation with the pandas library.
The program uses multiprocessing and df.apply.
The simple example showing my idea is here:
import multiprocessing
import ...
0 votes · 0 answers · 52 views
Combining two .nc files with different dimensions using Icechunk, VirtualiZarr, and Xarray
My overall goal is to set up a virtual dataset of ERA5 data using Icechunk. As a smaller test example, I'm trying to pull all the data located in the 194001 ERA5 folder. I've been mostly able to ...
0 votes · 1 answer · 70 views
Dask large outer join with gzip files
I'm working with an omics dataset (1000+ files): a folder of roughly 1 GB of tab-separated .txt.gz files. Each looks roughly like this for a patient ABC:
pos	ABC_count1	ABC_count2	...