Torque/Moab cannot handle large numbers of short jobs and workflows involving such jobs must take measures to reduce the number of jobs to avoid overwhelming the scheduler. For the simple cases of running either a bag-of-jobs set of many serial tasks or running multiple MPI applications at the same time, we provide some helper utilities to simplify these common tasks. For more complex workflows you may consider an actual workflow manager such as SWIFT.
Use mutiple, concurrent apruns only when you have a small number of apruns per job. Too many apruns running on a single MOM node can impact the MOM node and other users running jobs that share the same MOM node.
Sample PBS script to launch multiple apruns, each with 32 MPI tasks per node. Changes to aprun options can be made to bundle multiple 2-node jobs or single task per node jobs.
#!/bin/bash #PBS -j oe #PBS -l nodes=4:ppn=32:xe #PBS -l walltime=00:02:00 #PBS -l flags=commlocal:commtolerant source /opt/modules/default/init/bash cd $PBS_O_WORKDIR aprun -n 32 -N 32 ./app.exe > app0.oe 2>&1 & aprun -n 32 -N 32 ./app.exe > app1.oe 2>&1 & aprun -n 32 -N 32 ./app.exe > app2.oe 2>&1 & aprun -n 32 -N 32 ./app.exe > app3.oe 2>&1 & ... wait
For more information
- man aprun
Single node tasking
Use xargs for a simple bag-of-jobs type processing if you only want to use a single node's worth of CPU cores.
Sample PBS script to launch serial tasks, up to 32 running concurently on a single node. Any number of tasks can be given, xargs will start the next task as soon as one of the currently running tasks finishes.
#!/bin/bash #PBS -j oe #PBS -l nodes=1:ppn=32:xe #PBS -l walltime=0:42:00 #PBS -l flags=commlocal:commtolerant source /opt/modules/default/init/bash cd $PBS_O_WORKDIR aprun -n 1 -N 1 -d 32 xargs -d '\n' -I cmd -P 32 /bin/bash -c 'cmd' <<EOF ./app1 opt1 opt2 ./app2 opt1 opt2 opt3 ... ./app2657891 opt1 EOF
For more information
Scheduler.x is a Fortran MPI code that is designed to place multiple independent non-MPI applications on a single or multiple compute nodes in a form of a single batch job.
The usual scenario of using Scheduler.x is to run a large number of single-core applications by fully packing the multi-core compute nodes, and utilizing as many compute nodes as needed. The other scenario is running multiple OpenMP applications that can either share the compute node or use all cores in the node by single application in a multi-node batch job.
Scheduler.x works by starting MPI processes on compute nodes. The master process reads the list of applications to run from the joblist file, and assigns the job to the first available child process. Upon completion of the assigned work the child process notifies the master process that it is available to run a new job. The master process takes a next job from the joblist file and assigns it to the child process to run. In this way, the batch job may run more jobs than the number of nodes/cores it requests. The excess jobs wait for their turn to be executed. The batch job completes when there are no more jobs in the joblist file to run.
scheduler.x joblist fullPathToExecutable [-nostdout] [-noexit]
- scheduler.x is the name of job launcher
- joblist is a mandatory first argument. The joblist file contains the list of jobs to be executed. See joblist format for the details.
- fullPathToExecutable is the name of application executable to run. This executable executes jobs from the joblist. It can be a real application executable or interpreter (e.g. bash shell)
- -nostdout is an optional program argument. When present it will instruct the job launcher to redirect the job stdout to /dev/null. If this argument is not present, the job stdout will be saved in .slog file.
- -noexit is another optional program argument that instructs the job launcher to ignore the individual job failure and continue with the other jobs in joblist. If this argument is not present, job launcher will treat the job failure as a critical error and abnormally terminate.
Scheduler.x treats its first argument as a name of the joblist file. Use the name joblist or give it a different name. Each line in joblist file defines a single job. The number of jobs to run is the number of lines in the joblist file. Each lines carries two fields separate by blank space. First filed specifies directory name where the job will be executed. The directory name may be given with absolute path, e.g. /u/scratch/01dir/, if desired, or relatively to the current working directory. Second filed contains a comman-line argument of the application to be submitted. Only one argument is allowed. Use bash as the application executable to run applications that require multiple arguments.
Depending on the application need, Scheduler.x may be told to run N number of MPI processes per node whereas the node may have M cores, where N is less than or equal to M. The possible range of N is from 1 to M. Using N=1 is convenient for the purpose of running a single OpenMP application per node that can utilize all available resources of the node when submitting multiple such applications in a batch job with help of Scheduler.x. Using N=M is appropriate with serial single-core applications that do not need much RAM. Using N in between 1 and M is appropriate when packing the node with several OpenMP applications, or when the available RAM on the node can fit only N single-core or OpenMP jobs.
Scheduler.x treats the requested number of child processes per node, N as the number of schedulable units. This number multiplied by the number of nodes in the batch job give the total number of schedulable units in the batch session. Each job from the joblist file will occupy one schedulable unit. In case when N < M the remaining cores in the node cannot host individual jobs but can be accessed as threads by the scheduled jobs.
The master process created by Scheduler.x constantly listens for messages to arrive from child processes and supplies them with new jobs. It is important to know that the master process occupies a single schedulable unit, and it does not participate in running the applications. This schedulable unit may be a small as a single core or occupy the entire node depending on configuration of the batch job. When schedulable unit is set to entire node, the master process of Scheduler.x will still run on a single core in the node, and the remaining cores will become unaccessible for the application jobs. The loss of one node will be unnoticeable when simultaneously executing a large number of single-node applications.
When configuring a number of schedulable units in PBS script, one unit should be reserved for the master process of Scheduler.x. In the example bellow, two single-core applications jobs are going to be executed on a single compute node. Therefore, aprun should be started with option -n 3 -N 3 that requests 3 cores per node in PBS script.
- Cannot run MPI applications.
- All bundled applications will use the same core granularization - one application per one schedulable unit.
- The master process of Scheduler.x occupies a single schedulable unit, and does not participate in running applications.
- The entire batch job will wait for the last job in the bundle to finish execution before exiting despite some of the compute nodes may be idling at that time.
How to get the code
Scheduler.x is an open-source project that can be downloaded from https://github.com/ncsa/Scheduler
Sample PBS script to launch multiple applications:
#!/bin/bash #PBS -j oe #PBS -l nodes=1:ppn=32:xe #PBS -l walltime=00:02:00 #PBS -l flags=commlocal:commtolerant cd $PBS_O_WORKDIR aprun -n 3 -N 3 ./scheduler.x joblist /bin/bash > log
Sample joblist file
01dir job1.sh 02dir job2.sh
For more information
wraprun is a python MPI based utility for placing tasks on compute nodes.
Sample PBS script to launch multiple apruns:
#!/bin/bash #PBS -j oe #PBS -l nodes=4:ppn=32:xe #PBS -l walltime=00:02:00 #PBS -l flags=commlocal:commtolerant source /opt/modules/default/init/bash module load bwpy/0.3.0 cd $PBS_O_WORKDIR wraprun ...