Conversation

@epruesse
Contributor

LSF (and the no longer available OpenLava) use `LSB_`-prefixed environment variables. The scheduling information is in these variables:

 - `LSB_MCPU_HOSTS` as `[<hostname> <ncpu>]...` tuples, e.g. `node-1 32 node-2 16`
 - `LSB_DJOB_NUMPROC` gives the number of processes on the current host
 - `LSB_HOSTS` as `[<hostname>]...`, where each host is listed multiple times if multiple slots are allocated.

Note: should `availableCores` not detect a present cluster engine automatically and default to this? I.e., if `NSLOTS` is set (and perhaps another, more SGE-specific variable), then use that value over everything else. And the same for the other cluster engines.
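The `LSB_MCPU_HOSTS` format described above can be parsed with base R alone. A minimal sketch (the function name is illustrative, not taken from the package):

```r
# Parse LSB_MCPU_HOSTS, e.g. "node-1 32 node-2 16", into a named
# integer vector of CPU counts per host. Returns integer(0) when unset.
parse_lsb_mcpu_hosts <- function(x = Sys.getenv("LSB_MCPU_HOSTS")) {
  parts <- scan(text = x, what = character(), quiet = TRUE)
  if (length(parts) == 0L) return(integer(0L))
  hosts <- parts[c(TRUE, FALSE)]            # odd elements: hostnames
  ncpus <- as.integer(parts[c(FALSE, TRUE)]) # even elements: CPU counts
  names(ncpus) <- hosts
  ncpus
}

parse_lsb_mcpu_hosts("node-1 32 node-2 16")
#> node-1 node-2
#>     32     16
```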
@epruesse
Contributor Author

Are the macOS / Windows failures to be expected?

@HenrikBengtsson
Collaborator

Thanks for your work here - it'll be a while before I have a chance to dive into your contribution/comments - a bit swamped.

> Are the macOS / Windows failures to be expected?

Yes, the macOS errors are expected (https://travis-ci.org/HenrikBengtsson/future) because they're currently tracking an error triggered on CRAN (#356).

You can ignore any errors on GitHub Actions - I just added/prototyped support for GA the other week and I haven't had time to trace those errors - they might be false positives.

@epruesse
Contributor Author

This one is totally minor. I've just added the variable for LSF/OpenLava mirroring SGE/Slurm/Torque.

@HenrikBengtsson
Collaborator

Thanks for this. Please see my inline review comments.

Importantly, do you have access to an LSF/OpenLava scheduler where you can validate that this addition works? This can be done by, for instance, submitting a job that requests a different number of cores and calls `Rscript -e "future::availableCores()"` and/or `Rscript -e "future::availableCores(which = 'all')"`. The output should show an element named `LSF`. However simple the fix looks, I prefer to not push things that haven't been validated by at least one person.

@epruesse
Contributor Author

> Please see my inline review comments.

I don't see any ...

Yes, I do have access to a cluster running OpenLava. I will double check that it works as expected.

@HenrikBengtsson
Collaborator

> Please see my inline review comments.

> I don't see any ...

Ah, I think I've made this mistake before. You have to "submit" a code review after having entered the comments. I just went back to this PR and the comments were there, waiting for me to submit them :/

@HenrikBengtsson
Collaborator

That is, I've submitted - see if you can see them now.

@izahn

izahn commented Apr 20, 2020

@epruesse is there anything I can do to help get this PR ready? This will be very useful for our users at HBS.

@HenrikBengtsson
Collaborator

Adding this info here: I just stumbled upon https://grid.rcs.hbs.org/parallel-matlab which mentions `LSB_MAX_NUM_PROCESSORS` on their LSF cluster.

@codecov-commenter

codecov-commenter commented Aug 18, 2020

Codecov Report

Merging #360 into develop will decrease coverage by 0.05%.
The diff coverage is 100.00%.

Impacted file tree graph

```
@@             Coverage Diff             @@
##           develop     #360      +/-   ##
===========================================
- Coverage    78.60%   78.55%   -0.06%
===========================================
  Files           60       60
  Lines         4277     4267      -10
===========================================
- Hits          3362     3352      -10
  Misses         915      915
```

| Impacted Files | Coverage Δ |
| --- | --- |
| R/availableCores.R | 88.73% <100.00%> (+0.32%) ⬆️ |
| R/MulticoreFuture-class.R | 69.84% <0.00%> (-0.48%) ⬇️ |
| R/UniprocessFuture-class.R | 88.88% <0.00%> (-0.21%) ⬇️ |
| R/globals.R | 95.23% <0.00%> (-0.18%) ⬇️ |
| R/makeClusterPSOCK.R | 66.55% <0.00%> (-0.06%) ⬇️ |
| R/Future-class.R | 84.34% <0.00%> (-0.05%) ⬇️ |
| R/resolved.R | 100.00% <0.00%> (ø) |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 84a35c9...fa8b597. Read the comment docs.

@epruesse
Contributor Author

Sorry - I was just too busy to finish this.

@epruesse
Contributor Author

At least on OpenLava, the command `echo env | bsub -o out.env` yields these variables (among others, of course):

```
HOSTTYPE=linux
LSB_JOBINDEX=0
LSB_JOBEXIT_STAT=0
LSB_JOBNAME=env
LSB_JOBFILENAME=/PATH/TO/SPOOL/
LSB_OUTPUTFILE=env.out
LSB_MCPU_HOSTS=node-01 1
LSFUSER=USERNAME
LSB_JOBID=12345
LSB_JOB_EXECUSER=USERNAME
LSB_DJOB_NUMPROC=1
LSB_HOSTS=node-01
LSB_SUB_HOST=headnode-01
LSB_QUEUE=normal
LSB_MAX_NUM_PROCESSORS=1
```

`LSB_MAX_NUM_PROCESSORS` is the total number of slots allocated, `LSB_HOSTS` lists the host names, `LSB_MCPU_HOSTS` lists the number of processes for each host, and `LSB_DJOB_NUMPROC` gives the "slot" count allocated on the current host. So if someone were to submit (for whichever reason) a parallel job spanning multiple nodes, `LSB_MAX_NUM_PROCESSORS` would be too high.
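Under those semantics, a per-host core lookup would prefer `LSB_DJOB_NUMPROC` and only fall back to the job-wide `LSB_MAX_NUM_PROCESSORS`. A hedged sketch of that logic — not the package's actual implementation:

```r
# Sketch: number of slots allotted on the *current* host under LSF/OpenLava.
# LSB_DJOB_NUMPROC is per-host; LSB_MAX_NUM_PROCESSORS is the job total,
# which over-counts for multi-node jobs. The `env` argument exists only so
# the sketch can be tested without touching the real environment.
lsf_cores <- function(env = Sys.getenv) {
  n <- env("LSB_DJOB_NUMPROC")
  if (nzchar(n)) return(as.integer(n))
  n <- env("LSB_MAX_NUM_PROCESSORS")
  if (nzchar(n)) return(as.integer(n))
  NA_integer_  # no LSF variables set
}
```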

@epruesse
Contributor Author

Of course I don't know whether these are the same on all LSF types out there. This is from OpenLava 4.0. Perhaps you can confirm, @izahn?

@HenrikBengtsson
Collaborator

I'd like to get this into the next release so that `availableCores()` will be useful for querying the current host for the number of allotted cores. It sounds like `LSB_DJOB_NUMPROC` can be used for this regardless of how many nodes (hosts) were requested? Is that correct?

For multi-node jobs, `availableCores()` should still only tell us the number of cores we are allowed to use for that job on that node. If we want that job to launch workers on the multiple nodes allotted, then it's a task for `availableWorkers()` to figure that out, e.g. by parsing `LSB_HOSTS`. From your top comment, it sounds like we can do that by just:

```r
hosts <- Sys.getenv("LSB_HOSTS")
hosts <- strsplit(hosts, split = ",", fixed = TRUE)[[1]]
hosts <- gsub("(^[ ]+|[ ]+$)", "", hosts)
```

Is that a correct assumption?
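If hostnames are space-separated, as the top comment suggests, `LSB_MCPU_HOSTS` can also be expanded into one entry per allotted slot — the shape a worker list wants. A minimal sketch (the function name is illustrative, not the package's actual code):

```r
# Expand "node-1 2 node-2 1" into c("node-1", "node-1", "node-2"),
# i.e. one element per allotted slot, mirroring LSB_HOSTS.
expand_lsb_mcpu_hosts <- function(x = Sys.getenv("LSB_MCPU_HOSTS")) {
  parts <- scan(text = x, what = character(), quiet = TRUE)
  if (length(parts) == 0L) return(character(0L))
  hosts <- parts[c(TRUE, FALSE)]             # hostnames
  ncpus <- as.integer(parts[c(FALSE, TRUE)]) # slots per host
  rep(hosts, times = ncpus)
}

expand_lsb_mcpu_hosts("node-1 2 node-2 1")
#> [1] "node-1" "node-1" "node-2"
```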

@HenrikBengtsson HenrikBengtsson added this to the 1.19.0 milestone Sep 18, 2020
@HenrikBengtsson HenrikBengtsson merged commit ccc1926 into futureverse:develop Sep 18, 2020
@HenrikBengtsson
Collaborator

> Note: should `availableCores` not detect a present cluster engine automatically and default to this? I.e., if `NSLOTS` is set (and perhaps another, more SGE-specific variable), then use that value over everything else. And the same for the other cluster engines.

Possibly; maybe it would be better to have an `inferScheduler()` function that uses various ad-hoc tricks to infer which scheduler is in place. However, since I don't know exactly what those rules should be (e.g. I might check 5-10 known env vars, but then if the scheduler drops one in a future release it might all break), or whether it can be done reliably (e.g. a user might have a conflicting manual setting), I decided to run through all possible alternatives instead.
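Such an `inferScheduler()` could be sketched as a first-match lookup over signature environment variables. This is entirely hypothetical — no such function exists in the package, and the variable-to-scheduler mapping below is illustrative, subject to exactly the fragility described above:

```r
# Hypothetical inferScheduler(): guess the active scheduler from
# well-known environment variables. First match wins; NA if none match.
inferScheduler <- function(env = Sys.getenv) {
  signatures <- c(
    LSB_JOBID    = "LSF",
    SLURM_JOB_ID = "Slurm",
    PBS_JOBID    = "Torque/PBS",
    JOB_ID       = "SGE"
  )
  for (var in names(signatures)) {
    if (nzchar(env(var))) return(signatures[[var]])
  }
  NA_character_
}
```

Note the ordering matters when a site sets more than one of these, which is one reason such inference is hard to do reliably.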

@HenrikBengtsson
Collaborator

I've updated `availableWorkers()` to support `LSB_HOSTS` (based on the assumption that hostnames are separated by SPACEs).
