Torque/Moab cannot handle large numbers of short jobs, so workflows involving such jobs must take measures to reduce the job count and avoid overwhelming the scheduler. For the simple cases of running a bag-of-jobs set of many serial tasks, or running multiple MPI applications at the same time, we provide helper utilities that simplify these common tasks. For more complex workflows, consider a full workflow manager such as SWIFT.
Use multiple, concurrent apruns only when you have a small number of apruns per job. Running too many apruns on a single MOM node can degrade the MOM node and impact other users whose jobs share it.
Reminder: only one aprun command can be run on a compute node at a time.
Sample PBS script to launch multiple apruns, each with 32 MPI tasks per node. The aprun options can be adjusted to bundle multiple 2-node jobs or single-task-per-node jobs.
```bash
#!/bin/bash
#PBS -j oe
#PBS -l nodes=4:ppn=32:xe
#PBS -l walltime=00:02:00
#PBS -l flags=commlocal:commtolerant

source /opt/modules/default/init/bash
cd $PBS_O_WORKDIR

aprun -n 32 -N 32 ./app.exe > app0.oe 2>&1 &
aprun -n 32 -N 32 ./app.exe > app1.oe 2>&1 &
aprun -n 32 -N 32 ./app.exe > app2.oe 2>&1 &
aprun -n 32 -N 32 ./app.exe > app3.oe 2>&1 &
...
wait
```
For more information
Single node tasking
Use xargs for a simple bag-of-jobs type processing if you only want to use a single node's worth of CPU cores. Your application should be built without the darshan module (module unload darshan ; make ... ).
Sample PBS script to launch serial tasks, up to 32 running concurrently on a single node. Any number of tasks can be given; xargs starts the next task as soon as one of the currently running tasks finishes.
```bash
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1:ppn=32:xe
#PBS -l walltime=0:42:00
#PBS -l flags=commlocal:commtolerant

source /opt/modules/default/init/bash
cd $PBS_O_WORKDIR

aprun -n 1 -N 1 -d 32 xargs -d '\n' -I cmd -P 32 /bin/bash -c 'cmd' <<EOF
./app1 opt1 opt2
./app2 opt1 opt2 opt3
...
./app2657891 opt1
EOF
```
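Rather than typing the heredoc by hand, the newline-delimited command list can be generated from a set of input files. A minimal sketch, where the inputs/ directory and the ./app name are placeholders:

```shell
#!/bin/bash
# Hypothetical demo setup: a directory of input files to process.
mkdir -p inputs && touch inputs/a.dat inputs/b.dat

# Emit one command per line, the format xargs -d '\n' expects.
for f in inputs/*.dat; do
    echo "./app \"$f\""
done > commands.txt

cat commands.txt
```

The generated file can then be redirected into the same xargs invocation with `< commands.txt` in place of the heredoc.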
For more information
Blue Waters' bwpy-mpi module contains the class MPICommExecutor, which can be used to spread tasks among MPI ranks, reserving one rank to orchestrate the tasks. A simple example of how to use it is shown below:
```python
#!/usr/bin/env python
from mpi4py import MPI
from mpi4py.futures import MPICommExecutor

def fun(x):
    print("on %s print %g" % (MPI.COMM_WORLD.rank, x))

with MPICommExecutor(MPI.COMM_WORLD, root=0) as executor:
    jobs = range(100)
    if executor is not None:
        executor.map(fun, jobs)
```
which calls the function fun on each element of the list, parallelizing the function calls among all available ranks. A sample submit script would look like this:
```bash
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1:ppn=32:xe
#PBS -l walltime=00:05:00

cd $PBS_O_WORKDIR
module load bwpy
module load bwpy-mpi

NRANKS=$(wc -l <$PBS_NODEFILE)
aprun -n $NRANKS -b python ./mpicommexecutor.py
```
Scheduler.x is an MPI code designed to place multiple independent non-MPI applications on one or more compute nodes within a single batch job.
The usual scenario for Scheduler.x is running a large number of single-core applications, fully packing the multi-core compute nodes and using as many compute nodes as needed. The other scenario is a multi-node batch job running multiple OpenMP applications, where each application either shares a compute node or uses all cores of a node by itself.
How to get the code
Scheduler.x is an open-source project that can be downloaded from https://github.com/ncsa/Scheduler
Scheduler.x works by starting MPI processes on compute nodes. The master process reads the list of applications to run from the joblist file, and assigns the job to the first available child process. Upon completion of the assigned work, the child process notifies the master process that it is available to run a new job. The master process takes the next job from the joblist file and assigns it to the child process to run. In this way, the batch job may run more jobs than the number of nodes/cores it requests. The excess jobs wait for their turn to be executed. The batch job completes when there are no more jobs in the joblist file to run.
```
scheduler.x joblist fullPathToExecutable [-nostdout] [-noexit]
```
Scheduler.x treats its first argument as the name of the joblist file. Use the name joblist or give it a different name. Each line in the joblist file defines a single job, so the number of jobs to run is the number of lines in the file. Each line carries two fields separated by blank space. The first field specifies the directory in which the job will be executed; the directory name may be given as an absolute path, e.g. /u/scratch/01dir/, or relative to the current working directory. The second field contains a command-line argument for the application to be submitted. Only one argument is allowed; use bash as the application executable to run applications that require multiple arguments.
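As a sketch of that bash trick (the script and application names here are hypothetical), a joblist line such as `01dir job1.sh` hands a single script name to /bin/bash, and the script itself carries the full multi-argument command:

```shell
#!/bin/bash
# job1.sh -- hypothetical wrapper script. Scheduler.x passes only this one
# script name as the argument to /bin/bash; the multi-argument command
# lives inside the script. The echo stands in for actually executing
# the command, which a real wrapper would do directly.
cmd="./my_app --input run1.in --output run1.out"
echo "launching: $cmd"
```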
Depending on the application's needs, Scheduler.x may be told to run N MPI processes per node, where the node has M cores and N ranges from 1 to M. Using N=1 is convenient for running a single OpenMP application per node that utilizes all of the node's resources, when submitting multiple such applications in one batch job with the help of Scheduler.x. Using N=M is appropriate for serial single-core applications that do not need much RAM. Using N between 1 and M is appropriate when packing a node with several OpenMP applications, or when the available RAM on the node can fit only N single-core or OpenMP jobs.
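On a 32-core XE node (M=32), the three scenarios above would correspond to aprun invocations roughly along these lines; this is a sketch for a hypothetical 4-node batch job, and the node, process, and thread counts are illustrative only:

```
# N = 1:  one 32-thread OpenMP application per node
aprun -n 4 -N 1 -d 32 ./scheduler.x joblist /bin/bash

# N = M = 32:  serial single-core jobs, fully packed nodes
aprun -n 128 -N 32 ./scheduler.x joblist /bin/bash

# 1 < N < M:  four 8-thread OpenMP applications per node
aprun -n 16 -N 4 -d 8 ./scheduler.x joblist /bin/bash
```

In each case one of the requested processes becomes the master, as described below, so the number of concurrently running application jobs is one less than -n.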
Scheduler.x treats the requested number of child processes per node, N, as the number of schedulable units. This number multiplied by the number of nodes in the batch job gives the total number of schedulable units in the batch session. Each job from the joblist file occupies one schedulable unit. When N < M, the remaining cores in the node cannot host individual jobs but can be used as threads by the scheduled jobs.
The master process created by Scheduler.x constantly listens for messages from child processes and supplies them with new jobs. It is important to know that the master process occupies one schedulable unit and does not participate in running the applications. This unit may be as small as a single core or as large as an entire node, depending on the configuration of the batch job. When the schedulable unit is an entire node, the master process still runs on a single core of that node, and the remaining cores become inaccessible to the application jobs. The loss of one node is negligible when simultaneously executing a large number of single-node applications.
When configuring the number of schedulable units in the PBS script, reserve one unit for the master process of Scheduler.x. In the example below, two single-core application jobs are executed on a single compute node; therefore, aprun is started with the options -n 3 -N 3, requesting 3 cores on the node.
Sample PBS script to launch multiple applications:
```bash
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1:ppn=32:xe
#PBS -l walltime=00:02:00
#PBS -l flags=commlocal:commtolerant

cd $PBS_O_WORKDIR
aprun -n 3 -N 3 ./scheduler.x joblist /bin/bash > log
```
Sample joblist file
```
01dir job1.sh
02dir job2.sh
```
For more information
wraprun is a Python- and MPI-based utility for placing tasks on compute nodes.
Sample PBS script to launch multiple apruns:
```bash
#!/bin/bash
#PBS -j oe
#PBS -l nodes=4:ppn=32:xe
#PBS -l walltime=00:02:00
#PBS -l flags=commlocal:commtolerant

source /opt/modules/default/init/bash
module load bwpy/0.3.0
cd $PBS_O_WORKDIR

wraprun ...
```
For more information