availableCores: add LSF/OpenLava #360
Conversation
LSF (and the no longer available OpenLava) use `LSB_`-prefixed environment variables. The scheduling information is in these variables:

- `LSB_MCPU_HOSTS` as `[<hostname> <ncpu>]...` tuples, e.g. `node-1 32 node-2 16`
- `LSB_DJOB_NUMPROC` gives the number of processes on the current host
- `LSB_HOSTS` as `[<hostname>]...`, where each host is listed multiple times if multiple slots are allocated
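For illustration, here is a minimal sketch (not the PR's actual implementation) of how these variables could be read and parsed in R:

```r
## Minimal sketch, assuming the variable formats described above.

## Slots allocated on the current host
ncores <- as.integer(Sys.getenv("LSB_DJOB_NUMPROC", NA_character_))

## Per-host slot counts from the "<hostname> <ncpu> ..." tuples,
## e.g. "node-1 32 node-2 16"
mcpu <- Sys.getenv("LSB_MCPU_HOSTS")
if (nzchar(mcpu)) {
  fields <- strsplit(trimws(mcpu), split = "[[:space:]]+")[[1]]
  hosts  <- fields[seq(1, length(fields), by = 2)]
  slots  <- as.integer(fields[seq(2, length(fields), by = 2)])
  names(slots) <- hosts
}
```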
Are the macOS / Windows failures to be expected?
Thanks for your work here - it'll be a while before I have a chance to dive into your contribution/comments - a bit swamped.
Yes, the macOS errors (https://travis-ci.org/HenrikBengtsson/future) are expected because they're currently tracking an error triggered on CRAN (#356). You can ignore any errors on GitHub Actions - I just added/prototyped support for GA the other week and haven't had time to trace those errors - they might be false positives.
This one is totally minor. I've just added the variable for LSF/OpenLava, mirroring SGE/Slurm/Torque.
Thanks for this. Please see my inline review comments. Importantly, do you have access to an LSF/OpenLava scheduler where you can validate that this addition works? This can be done by, for instance, submitting a job requesting a different number of cores and that calls `availableCores()`.
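For example, a small validation script along these lines could be submitted with different core requests and its output compared against the allocation (a sketch; the script name and submission details are hypothetical and site-dependent):

```r
## validate-cores.R - hypothetical validation script (not part of the PR)
library(future)
cat("availableCores(): ", availableCores(), "\n")
cat("LSB_DJOB_NUMPROC: ", Sys.getenv("LSB_DJOB_NUMPROC"), "\n")
cat("LSB_MCPU_HOSTS:   ", Sys.getenv("LSB_MCPU_HOSTS"), "\n")
```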
I don't see any inline comments yet ... Yes, I do have access to a cluster running OpenLava. I will double check that it works as expected.
Ah, I think I've made this mistake before. You have to "submit" a code review after having entered the comments. I just went back to this PR and the comments were there waiting for me to submit them :/
That is, I've submitted - see if you can see them now.
@epruesse is there anything I can do to help get this PR ready? This will be very useful for our users at HBS.
Adding this info here: I just stumbled upon https://grid.rcs.hbs.org/parallel-matlab which mentions ...
Codecov Report
    @@            Coverage Diff             @@
    ##           develop     #360      +/-   ##
    ===========================================
    - Coverage    78.60%   78.55%   -0.06%
    ===========================================
      Files           60       60
      Lines         4277     4267      -10
    ===========================================
    - Hits          3362     3352      -10
      Misses         915      915
Continue to review full report at Codecov.
Sorry - I was just too busy to finish this.
At least on OpenLava, `LSB_MAX_NUM_PROCESSORS` is the total number of slots allocated, `LSB_HOSTS` lists the host names, `LSB_MCPU_HOSTS` lists the number of processes for each host, and `LSB_DJOB_NUMPROC` gives the "slot" count allocated on the current host. So if someone were to submit (for whichever reason) a parallel job spanning multiple nodes, `LSB_MAX_NUM_PROCESSORS` would be too high.
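To illustrate with made-up values for a job spanning two nodes with four slots each (the values are hypothetical):

```r
## Hypothetical environment for a two-node job, 4 slots per host:
##   LSB_MCPU_HOSTS         = "node-1 4 node-2 4"
##   LSB_MAX_NUM_PROCESSORS = "8"   (total slots across all hosts)
##   LSB_DJOB_NUMPROC       = "4"   (slots on the current host)
total_slots <- as.integer(Sys.getenv("LSB_MAX_NUM_PROCESSORS", NA_character_))
local_slots <- as.integer(Sys.getenv("LSB_DJOB_NUMPROC", NA_character_))
## For availableCores(), the per-host count is the relevant one;
## reporting total_slots would oversubscribe the current host.
```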
Of course I don't know whether these are the same on all LSF types out there. This is from OpenLava 4.0. Perhaps you can confirm, @izahn?
I'd like to get this into the next release so that `availableCores()` picks up LSF/OpenLava settings out of the box. For multi-node jobs, I assume the worker host names can be obtained from `LSB_HOSTS`, e.g.

```r
hosts <- Sys.getenv("LSB_HOSTS")
hosts <- strsplit(hosts, split = ",", fixed = TRUE)[[1]]
hosts <- gsub("(^[ ]+|[ ]+$)", "", hosts)
```

Is that a correct assumption?
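For what it's worth, if `LSB_HOSTS` turns out to be space-separated rather than comma-separated (as the PR description suggests), the parsing might instead look like this (a sketch, not validated against a live scheduler):

```r
hosts <- Sys.getenv("LSB_HOSTS", NA_character_)
if (!is.na(hosts)) {
  ## Split on runs of whitespace; each hostname appears once per allocated slot
  workers <- strsplit(trimws(hosts), split = "[[:space:]]+")[[1]]
}
```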
Possible; maybe it would be better to have an ...
I've updated ...
> LSF (and the no longer available OpenLava) use `LSB_`-prefixed environment variables. The scheduling information is in these variables:
>
> - `LSB_MCPU_HOSTS` as `[<hostname> <ncpu>]...` tuples, e.g. `node-1 32 node-2 16`
> - `LSB_DJOB_NUMPROC` gives the number of processes on the current host
> - `LSB_HOSTS` as `[<hostname>]...`, where each host is listed multiple times if multiple slots are allocated

Note: should `availableCores()` not detect a present cluster engine automatically and default to this? I.e., if `NSLOTS` is set (and perhaps another, more SGE-specific variable), then use that value over everything else. And the same for the other cluster engines.
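A rough sketch of what such automatic detection could look like (the precedence order and any variable names beyond those discussed here are assumptions, not what the package actually implements):

```r
## Hypothetical sketch: use the first scheduler-specific variable that is set.
detect_scheduler_cores <- function() {
  candidates <- c(
    SGE   = "NSLOTS",
    Slurm = "SLURM_CPUS_PER_TASK",
    LSF   = "LSB_DJOB_NUMPROC",
    PBS   = "PBS_NUM_PPN"
  )
  for (scheduler in names(candidates)) {
    value <- Sys.getenv(candidates[[scheduler]], NA_character_)
    if (!is.na(value)) return(as.integer(value))
  }
  NA_integer_  ## no scheduler detected; fall back to other heuristics
}
```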