Job Bundling

Introduction

Torque/Moab cannot handle large numbers of short jobs, so workflows involving such jobs must take measures to reduce the job count and avoid overwhelming the scheduler. For the simple cases of running a bag-of-jobs set of many serial tasks or running multiple MPI applications at the same time, we provide helper utilities to simplify these common patterns. For more complex workflows, consider an actual workflow manager such as SWIFT.

Bundling Approaches

  • multiple, concurrent apruns in a single job (for small node-count jobs)
  • single node tasking
  • scheduler.x - MPI-based utility to run work per task
  • wraprun - ORNL Python-based utility
  • other utilities

Multiple apruns

Introduction

Use multiple, concurrent apruns only when you have a small number of apruns per job. Too many apruns running on a single MOM node can impact the MOM node and other users whose jobs share the same MOM node.

Reminder: only one aprun command can be run on a compute node at a time.

Sample script

Sample PBS script to launch multiple apruns, each running 32 MPI tasks on one node. The aprun options can be adjusted to instead bundle multiple 2-node jobs or single-task-per-node jobs.

#!/bin/bash
#PBS -j oe
#PBS -l nodes=4:ppn=32:xe
#PBS -l walltime=00:02:00
#PBS -l flags=commlocal:commtolerant

source /opt/modules/default/init/bash

cd $PBS_O_WORKDIR

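# each aprun below occupies one full node (32 ranks, 32 per node);
# start the apruns in the background and wait for all of them to finish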
aprun -n 32 -N 32 ./app.exe > app0.oe 2>&1 &
aprun -n 32 -N 32 ./app.exe > app1.oe 2>&1 &
aprun -n 32 -N 32 ./app.exe > app2.oe 2>&1 &
aprun -n 32 -N 32 ./app.exe > app3.oe 2>&1 &
...
wait

Using aprun with serial codes

aprun can be used to start multiple copies of a non-MPI (serial) code for job-bundling purposes, for example if more than one job is to run on each compute node. For each started process, aprun sets the environment variable ALPS_APP_PE, which provides the MPI rank (starting from 0) that would be assigned to the process; non-MPI codes can use it to decide which job to run. For example:

aprun -n 16 -N 8 -d 4 bash -c 'echo "$ALPS_APP_PE on host $(hostname)"'

will start 16 copies of bash, with 8 copies per node, and each copy having access to 4 cores.
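
A small wrapper script can use ALPS_APP_PE in the same way to decide which command each copy should run. A minimal sketch, assuming a hypothetical wrapper runtask.sh and a task list tasks.txt with one shell command per line:

#!/bin/bash
# runtask.sh (hypothetical): run the task that matches this copy's rank.
# ALPS_APP_PE is 0-based; sed line numbers are 1-based.
TASK=$(sed -n "$((ALPS_APP_PE + 1))p" tasks.txt)
eval "$TASK"

It could then be launched with, for example, aprun -n 16 -N 8 ./runtask.sh.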

For more information

  • man aprun

Single node tasking

Introduction

Use xargs for simple bag-of-jobs processing if you only want to use a single node's worth of CPU cores. Your application should be built without the darshan module (module unload darshan ; make ... ).

Sample script

Sample PBS script to launch serial tasks, with up to 32 running concurrently on a single node. Any number of tasks can be given; xargs will start the next task as soon as one of the currently running tasks finishes.

#!/bin/bash
#PBS -j oe
#PBS -l nodes=1:ppn=32:xe
#PBS -l walltime=0:42:00
#PBS -l flags=commlocal:commtolerant

source /opt/modules/default/init/bash

cd $PBS_O_WORKDIR

# if encountering SEGFAULTS from MPI set
# export MPICH_GNI_FORK_MODE=FULLCOPY
# see "man mpi"

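# -cc none lets the spawned processes float over the node's cores;
# xargs options: -d '\n' treats each input line as one task, -I cmd substitutes
# the line for 'cmd', and -P 32 keeps up to 32 tasks running at once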
aprun -cc none -n 1 -N 1 -d 32 xargs -d '\n' -I cmd -P 32 /bin/bash -c 'cmd'  <<EOF
./app1 opt1 opt2
./app2 opt1 opt2 opt3
...
./app2657891 opt1
EOF
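
The same aprun/xargs command also works with a task list kept in a separate file instead of a here-document, for example (tasklist.txt is a hypothetical file with one command per line):

aprun -cc none -n 1 -N 1 -d 32 xargs -d '\n' -I cmd -P 32 /bin/bash -c 'cmd' < tasklist.txt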

For more information

Using multiple nodes and python

Blue Waters' bwpy-mpi module provides the MPICommExecutor class, which can be used to spread tasks among MPI ranks, reserving one rank to orchestrate the tasks. A simple example of how to use it is shown below:

#!/usr/bin/env python

from mpi4py import MPI
from mpi4py.futures import MPICommExecutor
import os

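# each task runs a shell command on the worker rank that receives it
# and returns the command's exit code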
def fun(x):
    exitcode = os.system('/bin/echo "on %s print %g"' % (MPI.COMM_WORLD.rank,x))
    return exitcode

with MPICommExecutor(MPI.COMM_WORLD, root=0) as executor:
    jobs = [47,18,99,10,72,25,7,29,74,2,81,96,26,67,15,17,51,64,38,29,48]
    if executor is not None:
        # collect the results while the executor is still active
        rcs = list(executor.map(fun, jobs))

if MPI.COMM_WORLD.rank == 0:
    for (jobnum, rc) in enumerate(rcs):
        print("Job %d returned %d" % (jobnum, rc))

which calls the function fun on each element of the list, parallelizing the calls among all available ranks. A sample submit script would look like this:

#!/bin/bash
#PBS -j oe
#PBS -l nodes=2:ppn=32:xe
#PBS -l walltime=00:05:00

cd $PBS_O_WORKDIR

module load bwpy
module load bwpy-mpi

NRANKS=$(wc -l <$PBS_NODEFILE)

# if encountering SEGFAULTS from MPI set
# export MPICH_GNI_FORK_MODE=FULLCOPY
# see "man mpi" 

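# -b bypasses copying the launched binary to the compute nodes
# (python is already available there via the bwpy module)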
aprun -n $NRANKS -b python ./mpicommexecutor.py 

Scheduler.x

Introduction

Scheduler.x is an MPI code designed to place multiple independent non-MPI applications on one or more compute nodes in the form of a single batch job.

The usual scenario for using Scheduler.x is to run a large number of single-core applications, fully packing the multi-core compute nodes and utilizing as many compute nodes as needed. The other scenario is running multiple OpenMP applications in a multi-node batch job, where the applications either share a compute node or each use all cores of a node.

How to get the code

Scheduler.x is an open-source project that can be downloaded from https://github.com/ncsa/Scheduler

Description

Scheduler.x works by starting MPI processes on the compute nodes. The master process reads the list of applications to run from the joblist file and assigns each job to the first available child process. Upon completing its assigned work, a child process notifies the master process that it is available to run a new job; the master process then takes the next job from the joblist file and assigns it to that child process. In this way, the batch job may run more jobs than the number of nodes/cores it requests; the excess jobs wait for their turn to be executed. The batch job completes when there are no more jobs in the joblist file to run.

Command-line arguments

scheduler.x joblist fullPathToExecutable [-nostdout] [-noexit]

  • scheduler.x is the name of the job launcher.
  • joblist is a mandatory first argument. The joblist file contains the list of jobs to be executed. See the joblist format for details.
  • fullPathToExecutable is the name of the application executable to run. This executable executes the jobs from the joblist. It can be a real application executable or an interpreter (e.g. the bash shell).
  • -nostdout is an optional argument. When present, it instructs the job launcher to redirect each job's stdout to /dev/null. If this argument is not present, the job's stdout is saved in a .slog file.
  • -noexit is another optional argument that instructs the job launcher to ignore individual job failures and continue with the other jobs in the joblist. If this argument is not present, the job launcher treats a job failure as a critical error and terminates abnormally.

Joblist format

Scheduler.x treats its first argument as the name of the joblist file. Use the name joblist or give it a different name. Each line in the joblist file defines a single job, so the number of jobs to run is the number of lines in the file. Each line carries two fields separated by blank space. The first field specifies the directory in which the job will be executed. The directory may be given with an absolute path, e.g. /u/scratch/01dir/, or relative to the current working directory. The second field contains a command-line argument for the application to be submitted. Only one argument is allowed; use bash as the application executable to run applications that require multiple arguments.

Schedulable unit

Depending on the application's needs, Scheduler.x may be told to run N MPI processes per node, where the node has M cores and N can range from 1 to M. Using N=1 is convenient for running a single OpenMP application per node that can utilize all available resources of the node, when submitting multiple such applications in a batch job with the help of Scheduler.x. Using N=M is appropriate for serial single-core applications that do not need much RAM. Using N between 1 and M is appropriate when packing the node with several OpenMP applications, or when the available RAM on the node can only fit N single-core or OpenMP jobs.

Scheduler.x treats the requested number of child processes per node, N, as the number of schedulable units per node. This number multiplied by the number of nodes in the batch job gives the total number of schedulable units in the batch session. Each job from the joblist file occupies one schedulable unit; for example, a 4-node job with N=32 provides 128 schedulable units, of which 127 can run jobs at any one time because one unit is taken by the master process (see below). When N < M, the remaining cores in the node cannot host individual jobs, but they can be accessed as threads by the scheduled jobs.

The master process created by Scheduler.x constantly listens for messages from the child processes and supplies them with new jobs. It is important to know that the master process occupies a single schedulable unit and does not participate in running the applications. This schedulable unit may be as small as a single core or as large as an entire node, depending on the configuration of the batch job. When the schedulable unit is an entire node, the master process of Scheduler.x still runs on a single core of that node, and the remaining cores become inaccessible to the application jobs. The loss of one node is usually unnoticeable when simultaneously executing a large number of single-node applications.

When configuring the number of schedulable units in the PBS script, one unit should be reserved for the master process of Scheduler.x. In the example below, two single-core application jobs are executed on a single compute node; therefore, aprun is started with the options -n 3 -N 3, which request 3 cores per node in the PBS script.

Limitations

  • Cannot run MPI applications.
  • All bundled applications use the same core granularity - one application per schedulable unit.
  • The master process of Scheduler.x occupies a single schedulable unit, and does not participate in running applications.
  • The entire batch job waits for the last job in the bundle to finish before exiting, even though some of the compute nodes may be idle at that time.

Sample script

Sample PBS script to launch multiple applications:

#!/bin/bash
#PBS -j oe
#PBS -l nodes=1:ppn=32:xe
#PBS -l walltime=00:02:00
#PBS -l flags=commlocal:commtolerant

cd $PBS_O_WORKDIR
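# -n 3 -N 3: one schedulable unit for the Scheduler.x master process
# plus two units for the two application jobs in the joblist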
aprun -n 3 -N 3 ./scheduler.x joblist /bin/bash > log

Sample joblist file

01dir job1.sh
02dir job2.sh
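
Each entry is executed by passing its second field to /bin/bash in the listed directory. A minimal sketch of what such a wrapper script might contain (job1.sh and serial_app are hypothetical names):

#!/bin/bash
# job1.sh (hypothetical): run one serial application from this job's directory
./serial_app input1.dat > job1.log 2>&1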

For more information

See https://github.com/ncsa/Scheduler

wraprun

Introduction

wraprun is a Python, MPI-based utility from ORNL for placing tasks on compute nodes.

Sample script

Sample PBS script to launch multiple tasks with wraprun:

#!/bin/bash
#PBS -j oe
#PBS -l nodes=4:ppn=32:xe
#PBS -l walltime=00:02:00
#PBS -l flags=commlocal:commtolerant

source /opt/modules/default/init/bash
module load bwpy/0.3.0

cd $PBS_O_WORKDIR

wraprun ...

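The wraprun invocation itself depends on the applications being bundled. A rough, hypothetical sketch, assuming wraprun accepts aprun-style, colon-separated task groups:

# hypothetical: bundle two independent 32-rank runs of ./app.exe into one job
wraprun -n 32 ./app.exe input1 : -n 32 ./app.exe input2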

For more information