Conversation

@epruesse
Contributor

LSF (and the no longer available OpenLava) use `LSB_`-prefixed environment variables. The scheduling information is in these variables:

 - `LSB_MCPU_HOSTS` as `[<hostname> <ncpu>]...` tuples, e.g. `node-1 32 node-2 16`
 - `LSB_DJOB_NUMPROC` gives the number of processes on the current host
 - `LSB_HOSTS` as `[<hostname>]...`, where each host is listed multiple times if multiple slots are allocated.

Note: should `availableCores` not detect a present cluster engine automatically and default to this? I.e., if `NSLOTS` is set (and perhaps another, more SGE-specific variable), then use that value over everything else. And the same for the other cluster engines.
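The `LSB_MCPU_HOSTS` format described above can be parsed with base R alone. A minimal sketch (the function name is illustrative, not taken from the package):

```r
# Parse LSB_MCPU_HOSTS, e.g. "node-1 32 node-2 16", into a named
# integer vector of CPU counts per host. Returns integer(0) when unset.
parse_lsb_mcpu_hosts <- function(x = Sys.getenv("LSB_MCPU_HOSTS")) {
  parts <- scan(text = x, what = character(), quiet = TRUE)
  if (length(parts) == 0L) return(integer(0L))
  hosts <- parts[c(TRUE, FALSE)]            # odd elements: hostnames
  ncpus <- as.integer(parts[c(FALSE, TRUE)]) # even elements: CPU counts
  names(ncpus) <- hosts
  ncpus
}

parse_lsb_mcpu_hosts("node-1 32 node-2 16")
#> node-1 node-2
#>     32     16
```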
@epruesse
Contributor Author

Are the macOS / Windows failures to be expected?

@HenrikBengtsson
Collaborator

Thanks for your work here - it'll be a while before I have a chance to dive into your contribution/comments - a bit swamped.

> Are the macOS / Windows failures to be expected?

Yes, the macOS errors are expected (https://travis-ci.org/HenrikBengtsson/future) because they're currently tracking an error triggered on CRAN (#356).

You can ignore any errors on GitHub Actions - I just added/prototyped support for GA the other week and I haven't had time to trace those errors - they might be false positives.

@epruesse
Contributor Author

This one is totally minor. I've just added the variable for LSF/OpenLava mirroring SGE/Slurm/Torque.

@HenrikBengtsson
Collaborator

Thanks for this. Please see my inline review comments.

Importantly, do you have access to an LSF/OpenLava scheduler where you can validate that this addition works? This can be done by, for instance, submitting a job that requests a different number of cores and calls `Rscript -e "future::availableCores()"` and/or `Rscript -e "future::availableCores(which = 'all')"`. The output should show an element named `LSF`. However simple the fix looks, I prefer to not push things that haven't been validated by at least one person.

@epruesse
Contributor Author

> Please see my inline review comments.

I don't see any ...

Yes, I do have access to a cluster running OpenLava. I will double check that it works as expected.

@HenrikBengtsson
Collaborator

> Please see my inline review comments.

> I don't see any ...

Ah, I think I've made this mistake before. You have to "submit" a code review after having entered the comments. I just went back to this PR and the comments were there, waiting for me to submit them :/

@HenrikBengtsson
Collaborator

That is, I've submitted - see if you can see them now.

@izahn

izahn commented Apr 20, 2020

@epruesse is there anything I can do to help get this PR ready? This will be very useful for our users at HBS.

@HenrikBengtsson
Collaborator

Adding this info here: I just stumbled upon https://grid.rcs.hbs.org/parallel-matlab which mentions `LSB_MAX_NUM_PROCESSORS` on their LSF cluster.

@codecov-commenter

codecov-commenter commented Aug 18, 2020

Codecov Report

Merging #360 into develop will decrease coverage by 0.05%.
The diff coverage is 100.00%.

Impacted file tree graph

```
@@             Coverage Diff             @@
##           develop     #360      +/-   ##
===========================================
- Coverage    78.60%   78.55%   -0.06%
===========================================
  Files           60       60
  Lines         4277     4267      -10
===========================================
- Hits          3362     3352      -10
  Misses         915      915
```

| Impacted Files | Coverage Δ |
| --- | --- |
| R/availableCores.R | 88.73% <100.00%> (+0.32%) ⬆️ |
| R/MulticoreFuture-class.R | 69.84% <0.00%> (-0.48%) ⬇️ |
| R/UniprocessFuture-class.R | 88.88% <0.00%> (-0.21%) ⬇️ |
| R/globals.R | 95.23% <0.00%> (-0.18%) ⬇️ |
| R/makeClusterPSOCK.R | 66.55% <0.00%> (-0.06%) ⬇️ |
| R/Future-class.R | 84.34% <0.00%> (-0.05%) ⬇️ |
| R/resolved.R | 100.00% <0.00%> (ø) |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 84a35c9...fa8b597. Read the comment docs.

@epruesse
Contributor Author

Sorry - I was just too busy to finish this.

@epruesse
Contributor Author

At least on OpenLava, the command `echo env | bsub -o out.env` yields these variables (among others, of course):

```
HOSTTYPE=linux
LSB_JOBINDEX=0
LSB_JOBEXIT_STAT=0
LSB_JOBNAME=env
LSB_JOBFILENAME=/PATH/TO/SPOOL/
LSB_OUTPUTFILE=env.out
LSB_MCPU_HOSTS=node-01 1
LSFUSER=USERNAME
LSB_JOBID=12345
LSB_JOB_EXECUSER=USERNAME
LSB_DJOB_NUMPROC=1
LSB_HOSTS=node-01
LSB_SUB_HOST=headnode-01
LSB_QUEUE=normal
LSB_MAX_NUM_PROCESSORS=1
```

`LSB_MAX_NUM_PROCESSORS` is the total number of slots allocated, `LSB_HOSTS` lists the host names, `LSB_MCPU_HOSTS` lists the number of processes for each host, and `LSB_DJOB_NUMPROC` gives the "slot" count allocated on the current host. So if someone were to submit (for whichever reason) a parallel job spanning multiple nodes, `LSB_MAX_NUM_PROCESSORS` would be too high.
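Under those semantics, a per-host core lookup would prefer `LSB_DJOB_NUMPROC` and only fall back to the job-wide `LSB_MAX_NUM_PROCESSORS`. A hedged sketch of that logic — not the package's actual implementation:

```r
# Sketch: number of slots allotted on the *current* host under LSF/OpenLava.
# LSB_DJOB_NUMPROC is per-host; LSB_MAX_NUM_PROCESSORS is the job total,
# which over-counts for multi-node jobs. The `env` argument exists only so
# the sketch can be tested without touching the real environment.
lsf_cores <- function(env = Sys.getenv) {
  n <- env("LSB_DJOB_NUMPROC")
  if (nzchar(n)) return(as.integer(n))
  n <- env("LSB_MAX_NUM_PROCESSORS")
  if (nzchar(n)) return(as.integer(n))
  NA_integer_  # no LSF variables set
}
```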

@epruesse
Contributor Author

Of course I don't know whether these are the same on all LSF types out there. This is from OpenLava 4.0. Perhaps you can confirm, @izahn?

@HenrikBengtsson
Collaborator

I'd like to get this into the next release so that `availableCores()` will be useful for querying the current host for the number of allotted cores. It sounds like `LSB_DJOB_NUMPROC` can be used for this regardless of how many nodes (hosts) were requested? Is that correct?

For multi-node jobs, `availableCores()` should still only tell us the number of cores we are allowed to use for that job on that node. If we want that job to launch workers on the multiple nodes allotted, then it's a task for `availableWorkers()` to figure that out, e.g. by parsing `LSB_HOSTS`. From your top comment, it sounds like we can do that by just:

```r
hosts <- Sys.getenv("LSB_HOSTS")
hosts <- strsplit(hosts, split = ",", fixed = TRUE)[[1]]
hosts <- gsub("(^[ ]+|[ ]+$)", "", hosts)
```

Is that a correct assumption?
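If hostnames are space-separated, as the top comment suggests, `LSB_MCPU_HOSTS` can also be expanded into one entry per allotted slot — the shape a worker list wants. A minimal sketch (the function name is illustrative, not the package's actual code):

```r
# Expand "node-1 2 node-2 1" into c("node-1", "node-1", "node-2"),
# i.e. one element per allotted slot, mirroring LSB_HOSTS.
expand_lsb_mcpu_hosts <- function(x = Sys.getenv("LSB_MCPU_HOSTS")) {
  parts <- scan(text = x, what = character(), quiet = TRUE)
  if (length(parts) == 0L) return(character(0L))
  hosts <- parts[c(TRUE, FALSE)]             # hostnames
  ncpus <- as.integer(parts[c(FALSE, TRUE)]) # slots per host
  rep(hosts, times = ncpus)
}

expand_lsb_mcpu_hosts("node-1 2 node-2 1")
#> [1] "node-1" "node-1" "node-2"
```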

@HenrikBengtsson HenrikBengtsson added this to the 1.19.0 milestone Sep 18, 2020
@HenrikBengtsson HenrikBengtsson merged commit ccc1926 into futureverse:develop Sep 18, 2020
@HenrikBengtsson
Collaborator

> Note: should `availableCores` not detect a present cluster engine automatically and default to this? I.e., if `NSLOTS` is set (and perhaps another, more SGE-specific variable), then use that value over everything else. And the same for the other cluster engines.

Possibly; maybe it would be better to have an `inferScheduler()` function that uses various ad-hoc tricks to infer which scheduler is in place. However, since I don't know exactly what those rules should be (e.g. I might check 5-10 known env vars, but then if the scheduler drops one in a future release it might all break), or whether it can be done reliably (e.g. a user might have a conflicting manual setting), I decided to run through all possible alternatives instead.
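Such an `inferScheduler()` could be sketched as a first-match lookup over signature environment variables. This is entirely hypothetical — no such function exists in the package, and the variable-to-scheduler mapping below is illustrative, subject to exactly the fragility described above:

```r
# Hypothetical inferScheduler(): guess the active scheduler from
# well-known environment variables. First match wins; NA if none match.
inferScheduler <- function(env = Sys.getenv) {
  signatures <- c(
    LSB_JOBID    = "LSF",
    SLURM_JOB_ID = "Slurm",
    PBS_JOBID    = "Torque/PBS",
    JOB_ID       = "SGE"
  )
  for (var in names(signatures)) {
    if (nzchar(env(var))) return(signatures[[var]])
  }
  NA_character_
}
```

Note the ordering matters when a site sets more than one of these, which is one reason such inference is hard to do reliably.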

@HenrikBengtsson
Collaborator

I've updated `availableWorkers()` to support `LSB_HOSTS` (based on the assumption that hostnames are separated by SPACEs).
