ATP : Abnormal Termination Processing
Abnormal Termination Processing (ATP) is Cray's utility for debugging applications that terminate abnormally. If an application takes a system trap, ATP analyzes the dying application and reports a stack backtrace.
How to use ATP
Here is the sequence to follow:
1. Add to the job script:
export ATP_ENABLED=1 # csh/tcsh: setenv ATP_ENABLED 1
ulimit -c unlimited # csh/tcsh: limit coredumpsize unlimited
2. Submit your job.
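Steps 1 and 2 might look like the following in a bash batch script. This is only a minimal sketch: the scheduler directives are left out because they are site-specific, and the launch line "aprun -n 1 ./nan" (one rank of a hypothetical executable named nan) is an assumption to be replaced with your own.

```shell
#!/bin/bash
# Scheduler directives (#PBS ...) are site-specific and omitted here.

# Step 1: enable ATP and allow core files to be written.
export ATP_ENABLED=1
ulimit -c unlimited || echo "warning: could not raise the core file size limit" >&2

# Launch the application. The aprun line is an assumed example
# (one rank of a hypothetical executable named nan); use your own.
if command -v aprun >/dev/null 2>&1; then
    aprun -n 1 ./nan
else
    echo "aprun not found; run this script inside a batch job" >&2
fi
```

Submit the script with your site's usual command (for example, qsub under PBS).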
If the application crashes on its own, the stack backtrace for the first process to die is sent to stderr along with the associated signal. Here is an example of an invalid floating-point operation in a code compiled with the "-g -Ktrap=fp" flags (Cray or PGI):
Application 286448 is crashing. ATP analysis proceeding...
Stack walkback for Rank 0 starting:
Stack walkback for Rank 0 done
Process died with signal 8: 'Floating point exception'
Forcing core dump of rank 0
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat
_pmiu_daemon(SIGCHLD): [NID 05712] [c8-1c0s7n2] [Thu Dec 20 14:26:13 2012] PE RANK 0 exit signal Floating point exception
If the job hangs, use apkill to send the signal by taking the following steps. 'qsig -s SIGSEGV' does terminate the job, but ATP does not properly catch the signal.
a. Get the apstat ID (apid) of the job:
$ apstat | grep $USER
The apid is the first column.
b. Use apkill to send the signal correctly, for example:
$ apkill apid
ATP will produce core files for some (not all) tasks and a merged call tree across all tasks in a 'dot' file. See the apkill man page for more information on the command.
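Steps a and b can be combined in a small helper. The sketch below relies only on what is stated above, namely that the apid is the first column of the apstat line containing your user name. The function name apids_for_user is hypothetical, and the function reads apstat-style text on standard input so it can be tried without a Cray system.

```shell
# Print the first column (the apid) of every input line that
# contains the given user name, like `apstat | grep $USER`.
apids_for_user() {
    awk -v user="$1" 'index($0, user) { print $1 }'
}

# Intended use on the Cray system (not executed here):
#   apid=$(apstat | apids_for_user "$USER")
#   apkill "$apid"
```

If you have more than one application running, the helper prints one apid per line, so check the output before passing it to apkill.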
To look at file atpMergedBT.dot:
$ module load stat
$ stat-view atpMergedBT.dot
See the stat-view man page for more information.
To look at core files:
$ gdb /path/to/executable core.atp.apid.X
Here is an example gdb session using the core file produced by the nan.f program below. Core files are not enabled by default: to emit them, set the core file size limit to unlimited for your shell in the batch script (ulimit -c unlimited).
~/debug> gdb ./nan core.atp.286584.0
GNU gdb (GDB) SUSE (7.3-0.6.1)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
For bug reporting instructions, please see:
Reading symbols from /mnt/a/u/staff/arnoldg/debug/nan...done.
[New LWP 21394]
[Thread debugging using libthread_db enabled]
Core was generated by `./nan'.
Program terminated with signal 8, Arithmetic exception.
#0 0x0000000020000abf in main () at nan.f:11
11 print *, xd/yd
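When ATP leaves several core files, a loop can print a backtrace from each one without opening an interactive session. The following is a minimal sketch that assumes the core.atp.apid.X naming shown above. The function name atp_backtraces is hypothetical; -batch and -ex are standard gdb options.

```shell
# Print a backtrace for every ATP core file of one application.
# Usage: atp_backtraces /path/to/executable apid
atp_backtraces() {
    exe=$1
    apid=$2
    for core in core.atp."$apid".*; do
        [ -e "$core" ] || continue   # the glob matched no files
        echo "== $core =="
        # Run gdb non-interactively: execute "bt", then exit.
        gdb -batch -ex bt "$exe" "$core"
    done
}
```

For the session above, this would be called as: atp_backtraces ./nan 286584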
This simple Fortran program produces a NaN (not a number) result. Add the compiler flags "-g -Ktrap=fp" to trap the floating-point exception raised by the invalid operation near line 11.
xd = 0.0
yd = 0.0
print *, xd/yd