Monitoring Jobs

System Commands

  • qstat: shows the status of PBS batch jobs.
    • qstat -a  lists jobs in submission order. 
    • qstat -f <jobid>  produces a full/detailed report for the job, including its working directory (init_work_dir).
  • qpeek: this command is deprecated.
    • Job stdout and stderr files are accessible in the submission directory of the job as ${PBS_JOBID}.OU and ${PBS_JOBID}.ER while the job is running.
  • qdel: deletes a job from the queue, or ends a job that's already running.
  • apstat: shows the number of up nodes and idle nodes and a list of current pending and running jobs.
    • apstat -r  displays all the node reservations.
    • apstat -c  adds info on each partition to the "Compute node summary" section at the beginning of the output (XT = whole system, 32 = XE nodes, 16 = XK nodes)
  • showq: lists jobs in priority order in three categories: active jobs, eligible jobs and blocked jobs.
    • showq -r  lists details of all running/active jobs.
    • showq -i  lists details of all eligible jobs, including their priorities.
    • showq -b  lists details of all blocked jobs.
  • qs: another utility that shows detailed information on queued and running jobs, including a column for the type of nodes used by each job (xe or xk).
  • showstart <jobid>: takes a jobid as its argument and displays an estimated start time for the job based on current reservations.
  • checkjob <jobid>: takes a jobid as its argument and displays the current job state and whether nodes are available to run the job.
  • xtnodestat: shows the current allocation and status of the system's nodes and gives information about each running job. The output displays the position of each node in the network.  
    • xtnodestat -m  prints only the mesh display.
    • xtnodestat -j  prints only the job display.

For more information on the above commands, see the corresponding man pages.

Scripts

The following scripts have been written to combine, simplify and beautify some of the functionality of the system commands above.  These scripts are automatically available upon logging in; no modules need to be loaded.

  • apstat_system.pl: A perl script that displays the system status by partition (it basically wraps "apstat -c" and adds some other info)
  • qstat.pl: A perl script that displays queue info similar to the default qstat output with the addition of the node type and count
  • showqgpu.pl: A perl script that displays only XK jobs in a format similar to the default showq output
  • showqxe.pl: A perl script that displays only XE jobs in a format similar to the default showq output
  • xkqueue.pl: A perl script showing queued XK jobs in order of their priority
  • xequeue.pl: A perl script showing queued XE jobs in order of their priority

Monitoring memory usage

Compile and link into your application a routine that calls getrusage(), similar to the following, and call it from Fortran or C at the points in your code where you would like to monitor memory usage.  You may want to disable the MPI_Barrier() for performance and/or call the routine only from selected ranks to limit the output.  See "man getrusage" for more information about the struct returned; you can also monitor user and system time and various other metrics provided by the OS kernel.

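#include <stdio.h>
#include <mpi.h>
#include <sys/time.h>
#include <sys/resource.h>
 
/* Print the peak resident set size of the calling rank. */
void memtrack(const char *message, int *myrank)
{
        struct rusage myrusage;
 
        /* Synchronize ranks so the reports line up; remove if too costly. */
        MPI_Barrier(MPI_COMM_WORLD);
        getrusage(RUSAGE_SELF, &myrusage);
        /* On Linux, ru_maxrss is reported in kilobytes. */
        printf("%d: %s: maxrss=%.1fMB\n",
                *myrank, message, myrusage.ru_maxrss/1024.0);
}
 
/* Fortran entry point (trailing underscore); passes an empty message. */
void memtrack_(int *myrank)
{
        memtrack("", myrank);
}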

 

Sample call from Fortran:

print *, 'in main after MPI setup'
call memtrack(rank)

Sample call from C:

memtrack("in main after MPI setup", &rank);

Sample output:

0: in main after MPI setup: maxrss=18.0MB
2: in main after MPI setup: maxrss=18.0MB
3: in main after MPI setup: maxrss=18.0MB
1: in main after MPI setup: maxrss=18.0MB
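
As noted above, the struct filled in by getrusage() also carries user and system CPU time, and output can be limited by reporting from selected ranks only.  The following is a minimal sketch of both ideas, not a replacement for memtrack().  The names rusage_report, tv_seconds and the stride parameter are hypothetical and introduced here for illustration.  The sketch takes the rank as a plain argument and omits MPI_Barrier(), so it does not need MPI to compile.

```c
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Convert a struct timeval to seconds as a double. */
static double tv_seconds(struct timeval tv)
{
        return (double)tv.tv_sec + (double)tv.tv_usec / 1.0e6;
}

/*
 * Hypothetical helper (an assumption, not part of the original routine):
 * reports peak memory plus user and system CPU time, but only from ranks
 * that are a multiple of stride.
 * Returns 1 if a line was printed, 0 if this rank was skipped,
 * and -1 if getrusage() failed.
 */
int rusage_report(const char *message, int myrank, int stride)
{
        struct rusage ru;

        if (stride < 1)
                stride = 1;             /* treat bad input as "every rank" */
        if (myrank % stride != 0)
                return 0;               /* this rank stays silent */

        if (getrusage(RUSAGE_SELF, &ru) != 0) {
                perror("getrusage");
                return -1;
        }

        /* On Linux, ru_maxrss is reported in kilobytes. */
        printf("%d: %s: maxrss=%.1fMB utime=%.3fs stime=%.3fs\n",
               myrank, message, ru.ru_maxrss / 1024.0,
               tv_seconds(ru.ru_utime), tv_seconds(ru.ru_stime));
        return 1;
}
```

For example, calling rusage_report("in main after MPI setup", rank, 64) would print one line from every 64th rank and nothing from the others.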