ATP : Abnormal Termination Processing
Abnormal Termination Processing (ATP) is Cray's utility for debugging applications that terminate abnormally. If an application takes a system trap, ATP analyzes the dying application.
How to use ATP
Here is the sequence to follow:
1. Add to the job script:
export ATP_ENABLED=1 # or setenv ATP_ENABLED 1
ulimit -c unlimited # or limit coredumpsize unlimited
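A forgotten setting only becomes apparent after a crash that leaves no backtrace or core file, so it can help to check the two settings above before launching the application. The following is a minimal sketch, assuming a bash job script; check_atp_env is a hypothetical helper name and is not part of ATP.

```shell
#!/bin/bash
# Hypothetical pre-flight helper (not part of ATP): verifies the two
# settings from step 1 and reports anything that is missing on stderr.
check_atp_env() {
    local status=0
    # ATP only acts when ATP_ENABLED is set to 1 in the job environment.
    if [ "${ATP_ENABLED:-}" != "1" ]; then
        echo "ATP_ENABLED is not set to 1" >&2
        status=1
    fi
    # Core files are only written when the core size limit allows it.
    if [ "$(ulimit -c)" != "unlimited" ]; then
        echo "core file size limit is $(ulimit -c), not unlimited" >&2
        status=1
    fi
    return "$status"
}
```

In a job script, a line such as "check_atp_env || exit 1" placed before the application launch would stop the job early if either setting is missing.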
If the application crashes on its own, the stack backtrace of the first process to die is sent to stderr along with the associated signal. Here's an example of an invalid floating-point operation in a code compiled with the "-g -Ktrap=fp" flags (Cray or PGI):
If the job hangs, use apkill to send the signal correctly by taking the following steps. 'qsig -s SIGSEGV' does terminate the job, but ATP does not catch the signal properly.
a. Get the apstat id (apid) of the job by:
To look at file atpMergedBT.dot:
$ module load stat
$ stat-view atpMergedBT.dot
Reference the man page for stat-view for more information.
To look at core files:
$ gdb /path/to/executable core.atp.apid.X
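When a run leaves several core files, it can be convenient to generate one gdb command per file instead of typing each by hand. The sketch below assumes the core files follow the core.atp.apid.X naming shown above and sit in a single directory; atp_gdb_commands is a hypothetical helper name. It only prints the commands (a dry run), using gdb's -batch and -ex options to request a backtrace non-interactively.

```shell
#!/bin/bash
# Hypothetical helper: print a gdb backtrace command for every ATP core
# file in a directory.
# Usage: atp_gdb_commands /path/to/executable [directory]
atp_gdb_commands() {
    local exe="$1"
    local dir="${2:-.}"
    local core
    for core in "$dir"/core.atp.*; do
        # Skip the unexpanded pattern when no core files exist.
        [ -e "$core" ] || continue
        # -batch makes gdb exit after its commands; -ex bt prints a backtrace.
        echo "gdb -batch -ex bt $exe $core"
    done
}
```

The printed lines can be reviewed first and then run individually, or redirected to a file to keep the backtraces from all processes together.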
Here's an example gdb session from the core produced by the nan.f program below. Note that to emit core files, you must set the core file size limit to unlimited for your shell in the batch script (ulimit -c unlimited). By default, core files are not enabled.
This simple Fortran program produces a NaN (not a number) result by dividing 0.0 by 0.0. Add the compiler flags "-g -Ktrap=fp" to trap the floating-point exception raised by the invalid operation near line 11.
      program main
      include 'mpif.h'
      real xd
      real yd
      integer ierror
      call MPI_INIT(ierror)
      xd = 0.0
      yd = 0.0
      print *, xd/yd
      call MPI_FINALIZE(ierror)
      end
Additional Information / References