ATP : Abnormal Termination Processing
Abnormal Termination Processing (ATP) is Cray's utility for debugging applications that terminate abnormally. If an application takes a system trap, ATP analyzes the dying application.
How to use ATP
Here is the sequence to follow:
If the job hangs, use apkill to send the signal correctly by taking the following steps. 'qsig -s SIGSEGV' does terminate the job, but ATP does not catch the signal properly.
ATP will produce some core files (not all) and a merged call tree across all tasks in a 'dot' file. See the man page for apkill for more information on the command.
To look at file atpMergedBT.dot:
$ module load stat
$ stat-view atpMergedBT.dot
See the man page for stat-view for more information.
To look at core files:
$ gdb /path/to/executable core.atp.apid.X
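When ATP leaves several core files, it can be convenient to print a backtrace from each one without opening an interactive session. The following is a minimal sketch, not part of ATP: the function name atp_backtraces and the GDB environment override are hypothetical, and it assumes the core files follow the core.atp.apid.X naming shown above. It relies only on gdb's standard -batch and -ex options.

```shell
#!/bin/bash
# Print a backtrace from every ATP core file in a directory.
# Usage: atp_backtraces /path/to/executable [core_dir]
# The GDB environment variable is a hypothetical override, useful for dry runs.
atp_backtraces() {
    local exe="$1"
    local dir="${2:-.}"
    local gdb_cmd="${GDB:-gdb}"
    local core
    local found=0
    for core in "$dir"/core.atp.*; do
        # Skip the unexpanded pattern when no core files exist.
        [ -e "$core" ] || continue
        found=1
        echo "=== $core ==="
        # -batch exits after running the commands; -ex bt prints the call stack.
        "$gdb_cmd" -batch -ex bt "$exe" "$core"
    done
    if [ "$found" -eq 0 ]; then
        echo "no core.atp.* files found in $dir" >&2
        return 1
    fi
}
```

For example, 'atp_backtraces ./nan .' would walk the core files in the current directory. Setting GDB=echo prints the gdb command lines instead of running them.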
Here's an example gdb session from the core produced by the nan.f program below. Note that core files are not enabled by default: to emit them, set the core file size limit to unlimited for your shell in the batch script ( ulimit -c unlimited ).
~/debug> gdb ./nan core.atp.286584.0
GNU gdb (GDB) SUSE (7.3-0.6.1)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /mnt/a/u/staff/arnoldg/debug/nan...done.
[New LWP 21394]
[Thread debugging using libthread_db enabled]
Core was generated by `./nan'.
Program terminated with signal 8, Arithmetic exception.
#0  0x0000000020000abf in main () at nan.f:11
11            print *, xd/yd
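The core file limit mentioned above can be raised and checked near the top of the batch script, before the application is launched. This is a minimal sketch: the function name enable_core_files is hypothetical, and whether the limit can actually be raised depends on the hard limit configured on the system.

```shell
#!/bin/bash
# Raise the soft core-file size limit for this shell and its child processes.
# enable_core_files is a hypothetical helper name, not an ATP command.
enable_core_files() {
    if ulimit -c unlimited 2>/dev/null; then
        echo "core file size limit: $(ulimit -c)"
    else
        # The hard limit may prevent raising the soft limit.
        echo "could not raise core file size limit; current value: $(ulimit -c)" >&2
        return 1
    fi
}
```

Calling enable_core_files before the application launch line makes the setting visible in the job output, which helps when no core files appear.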
This simple Fortran program produces a NaN (not a number) result. Compile it with the flags "-g -Ktrap=fp" to trap the floating-point exception raised by the invalid operation near line 11.
      program main
      include 'mpif.h'

      real xd
      real yd
      integer ierror

      call MPI_INIT(ierror)
      xd = 0.0
      yd = 0.0
      print *, xd/yd
      call MPI_FINALIZE(ierror)
      end
Additional Information / References
Please see the man page for atp.