ATP : Abnormal Termination Processing

Description

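Abnormal Termination Processing (ATP) is a Cray debugging utility. If an application takes a system trap, ATP analyzes the dying application: it reports a stack backtrace for the first process to die and can produce core files and a merged backtrace tree across tasks, as described below.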

How to use ATP

Here is the sequence to follow:

Ensure that the atp module is loaded when you link your application (it is loaded by default).

Add to the job script:

export ATP_ENABLED=1 # or setenv ATP_ENABLED 1
ulimit -c unlimited  # or limit coredumpsize unlimited

Submit your job
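
The job-script step above can be sketched as a slightly fuller preamble. This is a minimal illustration in bash syntax, not a site-provided template. The commented-out aprun line, its rank count, and the executable name ./nan are assumptions for illustration only. The script reports, rather than aborts, if the core file size limit cannot be raised.

```shell
#!/bin/bash
# Sketch of the ATP-related part of a batch script (bash syntax).

# Ask ATP to watch the application launched later in this script.
export ATP_ENABLED=1

# Raising the core limit fails if the hard limit is lower, so warn instead of exiting.
if ! ulimit -c unlimited 2>/dev/null; then
    echo "warning: could not raise core file size limit (current: $(ulimit -c))" >&2
fi

# Record the settings in the job output so they can be checked after a crash.
echo "ATP_ENABLED=${ATP_ENABLED} core_limit=$(ulimit -c)"

# The application launch goes here. Hypothetical example (rank count and name assumed):
# aprun -n 16 ./nan
```

Printing the two settings at the top of the job output makes it easy to confirm later that ATP was enabled for a run that produced no core files.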

If the application crashes on its own, ATP sends the stack backtrace for the first process to die, along with the associated signal, to stderr. Here is an example of an invalid floating-point operation in a code compiled with the "-g -Ktrap=fp" flags (Cray or PGI):

Application 286448 is crashing. ATP analysis proceeding...

Stack walkback for Rank 0 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:226
  main@nan.f:11
Stack walkback for Rank 0 done
Process died with signal 8: 'Floating point exception'
Forcing core dump of rank 0
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat
_pmiu_daemon(SIGCHLD): [NID 05712] [c8-1c0s7n2] [Thu Dec 20 14:26:13 2012] PE RANK 0 exit signal Floating point exception

If the job hangs, use apkill to send the signal correctly by taking the following steps. 'qsig -s SIGSEGV' does terminate the job, but ATP does not properly catch a signal sent that way.

Get the apstat id (apid) of the job:
```
$ apstat | grep $USER
```
The apid is the first column of the output.
Use apkill to send the signal correctly, for example:
```
$ apkill apid
```
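
The two steps above can be combined in a small helper. This is a sketch only: the function name first_column_for_user is hypothetical, and it relies solely on the statement above that the apid is the first column of the matching apstat lines. If you have several applications running, it prints one apid per line, so check the result before passing it to apkill.

```shell
#!/bin/bash
# first_column_for_user USER
# Reads apstat-style text on stdin and prints the first whitespace-separated
# column of every line that contains USER (plain substring match, like grep).
first_column_for_user() {
    awk -v user="$1" 'index($0, user) { print $1 }'
}

# Intended use on the system (not executed here):
#   apid=$(apstat | first_column_for_user "$USER")
#   apkill "$apid"
```

Because the match is a plain substring test, a user name that is contained in another user's name would also match; inspect the apstat output directly when in doubt.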

ATP will produce core files for some tasks (not all) and a merged call tree across all tasks in a 'dot' file. See the apkill man page for more information on the command.

To look at file atpMergedBT.dot:

$ module load stat
$ stat-view atpMergedBT.dot

See the stat-view man page for more information.

To look at core files:

$ gdb /path/to/executable core.atp.apid.X

Here is an example gdb session using the core file produced by the nan.f program shown under Examples. Core files are not enabled by default: to have them written, set the core file size limit to unlimited for your shell in the batch script (ulimit -c unlimited).

~/debug> gdb ./nan core.atp.286584.0
GNU gdb (GDB) SUSE (7.3-0.6.1)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /mnt/a/u/staff/arnoldg/debug/nan...done.
[New LWP 21394]
[Thread debugging using libthread_db enabled]
Core was generated by `./nan'.
Program terminated with signal 8, Arithmetic exception.
#0  0x0000000020000abf in main () at nan.f:11
11             print *, xd/yd
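
When ATP leaves several core files, the same inspection can be scripted instead of opening each core interactively. The following is a minimal sketch; the function name print_core_backtraces is hypothetical, and it uses gdb's standard -batch and -ex options to print a backtrace and exit.

```shell
#!/bin/bash
# print_core_backtraces EXE CORE...
# Runs gdb non-interactively on each core file and prints its backtrace.
print_core_backtraces() {
    local exe="$1"
    shift
    local core
    for core in "$@"; do
        # Skip arguments that are not files, e.g. an unmatched glob pattern.
        [ -f "$core" ] || { echo "skipping $core (not a file)" >&2; continue; }
        echo "=== $core ==="
        # -batch exits after the commands run; -ex bt prints the backtrace.
        gdb -batch -ex bt "$exe" "$core"
    done
}

# Example, using the file naming shown above:
#   print_core_backtraces ./nan core.atp.*
```

Redirecting the output to a file gives a single text record of where each dumped task stopped, which can be compared with the merged tree shown by stat-view.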

Examples

This simple Fortran program produces a NaN (not a number) result. Compile it with the "-g -Ktrap=fp" flags to trap the floating-point exception raised by the invalid operation at line 11.

      program main
      include 'mpif.h'

      real xd
      real yd
      integer ierror

      call MPI_INIT(ierror)
      xd = 0.0
      yd = 0.0
      print *, xd/yd   ! line 11: 0.0/0.0 is an invalid operation
      call MPI_FINALIZE(ierror)
      end

Additional Information / References

Please see the atp man page (man atp).