4,630 questions
1 vote · 0 answers · 40 views
Modin + Dask distributed: AttributeError: type object 'ABCMeta' has no attribute 'deploy_axis_func'
I'm trying to use Modin with a Dask LocalCluster to parallelize pandas DataFrame operations in a Django application (Python 3.13). Even with processes=False (thread-based workers, same process), the ...
0 votes · 0 answers · 77 views
Sentence Transformer Stuck at Loading (Google Cloud Instance)
I use this code to load a Sentence Transformer on a GCP VM instance (no GPU). It is a Dask plugin that runs on a Dask worker:
class NLPSetup(WorkerPlugin):
    def __init__(self, bucket_uri):
        self....
1 vote · 1 answer · 47 views
How can I get dask to schedule these tasks on different specialized workers?
I have a flow based on a dictionary mapping tasks to their dependencies. I loop over the tasks whose dependencies have all been submitted and submit those, which eventually exhausts the task list. ...
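The submit-when-ready loop described above can be sketched with the standard library's `graphlib` before adding Dask into the mix; task names here are hypothetical, and the commented `client.submit(..., workers=...)` call shows where Dask's worker-pinning keyword would go:

```python
from graphlib import TopologicalSorter

# deps maps each task to the set of tasks it depends on, mirroring the
# question's dictionary (task names are hypothetical).
deps = {
    "load": set(),
    "clean": {"load"},
    "train": {"clean"},
    "report": {"train", "clean"},
}

ts = TopologicalSorter(deps)
ts.prepare()
submitted = []
while ts.is_active():
    for task in ts.get_ready():   # every task whose deps are satisfied
        # With Dask you would submit here, pinning to a specialized worker:
        #   future = client.submit(run, task, workers=["tcp://worker-a:8786"])
        submitted.append(task)
        ts.done(task)             # with Dask, mark done from a completion callback
```

In a real cluster, `ts.done(task)` would be driven by future completion (e.g. `as_completed`) rather than called immediately.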
Best practices
0 votes · 1 reply · 18 views
scatter(x, broadcast=True) vs replicate(x)
I am trying to understand the difference in Dask between
scatter(x, broadcast=True) and replicate(x).
Both seem to provide a way to ensure copies of the data are available on all nodes.
Are they actually ...
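A minimal sketch of the distinction, using a throwaway in-process cluster (all names are real `dask.distributed` API; the cluster setup is just for illustration): `scatter(..., broadcast=True)` uploads local data to every worker in one step, while `replicate(...)` copies data that is already on the cluster to more workers after the fact.

```python
from dask.distributed import Client, LocalCluster

# Tiny in-process cluster purely for illustration (2 threaded workers).
cluster = LocalCluster(n_workers=2, processes=False, dashboard_address=None)
client = Client(cluster)

data = {"payload": list(range(10))}

# scatter(..., broadcast=True): moves LOCAL data onto the cluster and
# copies it to every worker in a single step, returning futures.
futures = client.scatter(data, broadcast=True)
fut = futures["payload"]

# replicate(...): takes data ALREADY on the cluster (futures) and
# copies it to additional workers after the fact.
client.replicate([fut])

# who_has shows which workers hold a copy of each key.
placement = client.who_has([fut])

client.close()
cluster.close()
```

So the practical difference is where the data starts: local process (`scatter`) versus already-distributed futures (`replicate`).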
0 votes · 1 answer · 74 views
How should I be using Dask .compute() to perform relatively simple operations
I'm trying to use Dask to do some relatively simple computations and operations that I was doing with Pandas but on a larger dataset. I have approximately 1500 .csv files that range in size from 1KB ...
3 votes · 1 answer · 91 views
Dask client connects successfully but no workers are available [closed]
I am using Dask for some processing. The client starts successfully, but I am seeing zero workers.
This is how I am creating the client:
client = Client("tls://localhost:xxxx")
This is the ...
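The connected-but-idle symptom can be reproduced locally: a client connects to a scheduler even when no workers have registered, and submitted work simply queues. A sketch using an in-process cluster (the cluster setup is illustrative; `scheduler_info` and `wait_for_workers` are real client methods useful for diagnosing this):

```python
from dask.distributed import Client, LocalCluster

# Start with zero workers to reproduce the symptom: the client
# connects fine, but nothing can run yet.
cluster = LocalCluster(n_workers=0, processes=False, dashboard_address=None)
client = Client(cluster)

n_before = len(client.scheduler_info()["workers"])  # 0: connected, no workers

cluster.scale(2)              # ask the cluster for two workers
client.wait_for_workers(2)    # block until they have registered
n_after = len(client.scheduler_info()["workers"])

client.close()
cluster.close()
```

With a remote TLS scheduler the same checks apply; zero workers usually means the worker processes failed to start or cannot reach the scheduler with matching TLS credentials.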
0 votes · 0 answers · 61 views
TokenizationError when loading h5py dataset as dask dataframe
My goal is to process (sklearn Pipeline) a large HDF file that doesn't fit into RAM.
The core data is an irregular multivariate time-series (a very long 2D array). It could be split columnwise to fit ...
3 votes · 1 answer · 75 views
task works on local, but errors on Dask cluster: "SystemError: error return without exception set"
I have the following code, which passes an array to a task and submits it to a Dask cluster. The Dask cluster runs in Docker with several Dask workers. Docker starts with:
scheduler:
docker run -d \
-...
3 votes · 0 answers · 91 views
How to optimize NetCDF files and dask for processing long-term climatological indices with xclim (e.g. SPI using a 30-day rolling window)?
I am trying to analyze the 30 day standardized precipitation index for a multi-state range of the southeastern US for the year 2016. I'm using xclim to process a direct pull of gridded daily ...
0 votes · 0 answers · 30 views
Introducing new dimension in xarray apply_ufunc
There has been at least one other question regarding the introduction of new dimensions in the output of xarray.apply_ufunc; I have two problems with this answer: First, I feel like the answer avoids ...
0 votes · 0 answers · 56 views
Dask distributed stores old version of my code
I am analysing some data using dask distributed on a SLURM cluster. I am also using jupyter notebook. I am changing my codebase frequently and running jobs. Recently, a lot of my jobs started to crash....
2 votes · 0 answers · 90 views
How to drop rows with a boolean mask in xarray/dask without .compute() blowing up memory?
I’m trying to subset a large xarray.Dataset backed by Dask and save it back to Zarr, but I’m running into a major memory problem when attempting to drop rows with a boolean mask.
Here’s a minimal ...
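The memory blow-up typically comes from materializing the whole masked result at once. The underlying idea of a chunk-at-a-time boolean filter can be sketched in plain NumPy (the function and array here are hypothetical stand-ins for the Dask-backed Dataset; with xarray the analogous lazy operation is `ds.isel(dim=mask)` followed by `to_zarr`):

```python
import numpy as np

# Stand-in for a large chunked array: apply a boolean row mask chunk by
# chunk, so only one chunk is ever dense in memory at a time.
def filter_rows_chunked(arr, mask, chunk=4):
    for start in range(0, arr.shape[0], chunk):
        sl = slice(start, start + chunk)
        yield arr[sl][mask[sl]]   # each piece is at most `chunk` rows

arr = np.arange(20).reshape(10, 2)   # rows 0..9, two columns
mask = arr[:, 0] % 4 == 0            # keep rows whose first column is 0, 4, 8, 12, 16
out = np.concatenate(list(filter_rows_chunked(arr, mask)))
```

Streaming each filtered chunk straight to the Zarr store, instead of concatenating, keeps peak memory at roughly one chunk.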
-1 votes · 1 answer · 61 views
How to connect to Dask Gateway Server from inside a Docker container?
I have a method that connects my app to a Dask Gateway Server
def set_up_dask(dashboard=False, num_workers=4, min_workers=4, max_workers=50):
    gateway = Gateway("http://127.0.0.1:8000")
    ...
0 votes · 0 answers · 60 views
How to properly use joblib files in Dask?
from joblib import load
ntrees_16_model = load(r"ntrees_quantile_16_model_watermask.joblib")
ntrees_50_model = load(r"ntrees_quantile_50_model_watermask.joblib")
ntrees_84_model = ...
0 votes · 0 answers · 86 views
Why does XGBoost training (with DMatrix) write heavily to disk instead of using RAM?
I am training an XGBoost model in Python on a dataset with approximately 20k features and 30M records.
The features are sparse, and I am using xgboost.DMatrix for training.
Problem
During training, ...