Monitoring Jobs

System Commands

qstat: Shows the status of PBS batch jobs.
- qstat -a lists jobs in submission order.
- qstat -f <jobid> produces a full/detailed report for the job, including its working directory (init_work_dir).
qpeek: This command is deprecated.
- Job stdout and stderr files are accessible in the submission directory of the job as ${PBS_JOBID}.OU and ${PBS_JOBID}.ER while the job is running.
qdel: Deletes a job from the queue, or ends a job that is already running.
apstat: Shows the number of up nodes and idle nodes and a list of current pending and running jobs.
- apstat -r displays all the node reservations.
- apstat -c adds info on each partition to the "Compute node summary" section at the beginning of the output (XT = whole system, 32 = XE nodes, 16 = XK nodes).
showq: Lists jobs in priority order in three categories: active jobs, eligible jobs, and blocked jobs.
- showq -r lists details of all running/active jobs.
- showq -i lists details of all eligible jobs, including their priorities.
- showq -b lists details of all blocked jobs.
qs: Another utility that shows detailed information on queued and running jobs, including a column for the type of nodes used by each job (xe or xk).
showstart <jobid>: Takes a jobid as its argument and displays an estimated start time for the job based on current reservations.
checkjob <jobid>: Takes a jobid as its argument and displays the current job state and whether nodes are available to run the job.
xtnodestat: shows the current allocation and status of the system's nodes and gives information about each running job. The output displays the position of each node in the network.
- xtnodestat -m prints only the mesh display.
- xtnodestat -j prints only the job display.

For more information on the above commands, see the corresponding man pages.

Scripts

The following scripts have been written to combine, simplify, and beautify some of the functionality of the system commands above. These scripts are automatically available upon logging in; no modules need to be loaded.

apstat_system.pl: A Perl script that displays the system status by partition (it wraps "apstat -c" and adds some other info).
qstat.pl: A Perl script that displays queue info similar to the default qstat output, with the addition of the node type and count.
showqgpu.pl: A Perl script that displays only XK jobs in a format similar to the default showq output.
showqxe.pl: A Perl script that displays only XE jobs in a format similar to the default showq output.
xkqueue.pl: A Perl script that shows queued XK jobs in order of their priority.
xequeue.pl: A Perl script that shows queued XE jobs in order of their priority.

Monitoring memory usage

Compile and link into your application a routine that calls getrusage(), similar to the following, and call it from Fortran or C at the points in your code where you would like to monitor memory usage. To limit the performance impact and the volume of output, you may want to leave the MPI_Barrier() disabled and/or call the routine only from selected ranks. See "man getrusage" for more information about the struct returned; it also lets you monitor user and system time and various other metrics provided by the OS kernel.

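    #include <stdio.h>
    #include <mpi.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    /* Print the peak resident set size of the calling process. */
    void memtrack(const char *message, int *myrank)
    {
        struct rusage myrusage;

        // MPI_Barrier(MPI_COMM_WORLD); // optional, depending on your use case
        getrusage(RUSAGE_SELF, &myrusage);

        /* On Linux, ru_maxrss is reported in kilobytes. */
        printf("%d: %s: maxrss=%.1fMB\n",
               *myrank, message, myrusage.ru_maxrss/1024.0);
    }

    /* Fortran-callable wrapper: many Fortran compilers append a trailing
       underscore to external names. It passes an empty message. */
    void memtrack_(int *myrank)
    {
        memtrack("", myrank);
    }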

Sample calls from Fortran:

    print *, 'in main after MPI setup'
    call memtrack(rank)

Sample call from C:

    memtrack("in main after MPI setup", &rank);

Sample output:

    0: in main after MPI setup: maxrss=18.0MB
    2: in main after MPI setup: maxrss=18.0MB
    3: in main after MPI setup: maxrss=18.0MB
    1: in main after MPI setup: maxrss=18.0MB
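
The text above mentions two refinements: reporting only from selected ranks, and reading user and system time from the same struct. The following is a minimal sketch of both, not part of the original routine. The names memtrack_selected(), memtrack_times(), and tv_seconds() are hypothetical, and the sketch assumes Linux, where ru_maxrss is reported in kilobytes. It has no MPI dependency; the rank is passed in by the caller.

```c
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Convert a struct timeval to seconds as a double. */
static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1.0e6;
}

/* Hypothetical helper: returns 1 if this rank should report, i.e. every
   stride-th rank starting from rank 0. A stride <= 1 selects all ranks. */
int memtrack_selected(int myrank, int stride)
{
    if (stride <= 1)
        return 1;
    return (myrank % stride) == 0;
}

/* Hypothetical variant of memtrack(): formats maxrss plus user and system
   CPU time into buf. Returns the snprintf() result, or -1 if getrusage()
   fails. */
int memtrack_times(char *buf, size_t len, const char *message, int myrank)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;

    /* Assumption: ru_maxrss is in kilobytes (Linux behavior). */
    return snprintf(buf, len, "%d: %s: maxrss=%.1fMB utime=%.2fs stime=%.2fs",
                    myrank, message, ru.ru_maxrss / 1024.0,
                    tv_seconds(ru.ru_utime), tv_seconds(ru.ru_stime));
}
```

In an MPI code, one possible use is to guard the call with memtrack_selected(rank, 64) and then print buf with puts(), so that only every 64th rank writes a line. The stride of 64 is an arbitrary illustration; choose a value that suits your job size.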