
HELP WANTED: availableWorkers() #16

@HenrikBengtsson

Description


Background

When submitting a job to TORQUE / PBS using something like:

qsub -l nodes=3:ppn=2 myjob.sh

the scheduler will allocate 3 nodes with 2 cores each (= 6 cores total) to myjob.sh when it is launched. Exactly which 3 nodes were allocated is only known to myjob.sh at run time. This information is available in the file $PBS_NODEFILE written by TORQUE / PBS, e.g.

$ cat $PBS_NODEFILE
n1
n1
n8
n8
n9
n9
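
For illustration, the per-node core count ("ppn") can be recovered from such a file by counting duplicate lines; here nodes.txt stands in for $PBS_NODEFILE, which only exists inside a running job:

```shell
# Stand-in for $PBS_NODEFILE (only available inside a running job)
printf 'n1\nn1\nn8\nn8\nn9\nn9\n' > nodes.txt

# Each node name appears once per allocated core, so counting
# duplicates gives the per-node core count ("ppn")
sort nodes.txt | uniq -c
```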

Other HPC job schedulers use other files / environment variables for this.

Actions

Add an availableNodes() function that searches for common scheduler environment variables and returns a character vector of node names, e.g.

> availableNodes()
[1] "n1" "n1" "n8" "n8" "n9" "n9"

If no known environment variables are found, the default fallback could be to return rep("localhost", times = availableCores()).

The above would allow us to make workers = availableNodes() the new default for cluster futures (currently workers = availableCores()).
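A minimal sketch of what such a function could look like, handling only the PBS case plus the localhost fallback (the name availableNodes() and the use of future::availableCores() for the fallback follow the proposal above; other schedulers would get analogous branches):

```r
## Sketch of the proposed availableNodes(); only PBS_NODEFILE is
## handled here - other schedulers would need their own branches.
availableNodes <- function() {
  nodefile <- Sys.getenv("PBS_NODEFILE", "")
  if (nzchar(nodefile) && file.exists(nodefile)) {
    ## One node name per line, each repeated "ppn" times
    return(readLines(nodefile))
  }
  ## Fallback: assume all available cores are on the local machine
  rep("localhost", times = future::availableCores())
}
```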

Identify these settings for the following schedulers:

  • PBS (Portable Batch System): Environment variable PBS_NODEFILE (the name of a file containing one node per line where each node is repeated "ppn" times).
  • Oracle Grid Engine (aka Sun Grid Engine, CODINE, GRD). Environment variable PE_HOSTFILE (a file, format unclear), cf. https://www.ace-net.ca/wiki/Sun_Grid_Engine
  • Slurm (Simple Linux Utility for Resource Management). Environment variable SLURM_JOB_NODELIST (a list of nodes in a compressed format, e.g. instead of "tux1,tux3,tux4" it is stored as "tux[1,3-4]". Note that multiple "compressions" may exist, e.g. "compute-[0-6]-[0-15]". The number of nodes can be verified via SLURM_JOB_NUM_NODES, and the "ppn" information is stored in SLURM_TASKS_PER_NODE).
  • LSF/OpenLava (Platform Load Sharing Facility).
    • LSB_HOSTS
  • Spark
  • OAR
  • HTCondor
  • Moab
  • PJM (https://staff.cs.manchester.ac.uk/~fumie/internal/Job_Operation_Software_en.pdf)
    • PJM_O_NODEINF - "Path of the allocated node list file. For a job to which virtual nodes are allocated, the IP addresses of the nodes where the virtual nodes are placed are written one per line."
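
For the Slurm case, expanding the compressed SLURM_JOB_NODELIST could look roughly like the sketch below. expandNodeList() is a hypothetical helper; it handles only a single bracket group per entry, so nested cases like "compute-[0-6]-[0-15]" are out of scope here:

```r
## Rough sketch: expand a Slurm-style compressed node list such as
## "tux[1,3-4]" into c("tux1", "tux3", "tux4"). Only one [...] group
## per entry is supported.
expandNodeList <- function(nodelist) {
  out <- character(0)
  ## Split on commas that are not inside brackets
  parts <- regmatches(nodelist, gregexpr("[^,\\[]+(\\[[^]]*\\])?", nodelist))[[1]]
  for (part in parts) {
    m <- regmatches(part, regexec("^(.*)\\[([^]]*)\\](.*)$", part))[[1]]
    if (length(m) == 0) {
      ## No bracket group: plain node name
      out <- c(out, part)
      next
    }
    prefix <- m[2]; spec <- m[3]; suffix <- m[4]
    ## Each comma-separated piece is either an index or a lo-hi range
    for (range in strsplit(spec, ",", fixed = TRUE)[[1]]) {
      lims <- as.integer(strsplit(range, "-", fixed = TRUE)[[1]])
      idx <- if (length(lims) == 2) lims[1]:lims[2] else lims
      out <- c(out, paste0(prefix, idx, suffix))
    }
  }
  out
}
```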

Labels: enhancement (New feature or request), help wanted (Extra attention is needed)
