Overlapping Computation & Communication with MPI Non-blocking Calls

The following recommendations were verified with cray-mpich/7.2.0 on CLE 5.2 UP02. They may help overlap computation and communication when using MPI non-blocking calls, and have been verified to show improvement with MPI_Iallreduce.
Enable core specialization with aprun
aprun -r 1 # Enable core specialization.
Enable asynchronous progress engine & set MPI thread safety level
export MPICH_NEMESIS_ASYNC_PROGRESS=SC # Enable async progress engine
export MPICH_MAX_THREAD_SAFETY=multiple # MPICH thread safety
Enable MPI shared memory optimizations
export MPICH_SHARED_MEM_COLL_OPT=1 # Enable the optimized shared-memory design for collective operations. Currently supported collectives: MPI_Allreduce, MPI_Iallreduce, and MPI_Bcast. A value of 1 enables all of them.
export MPICH_SMP_SINGLE_COPY_SIZE=1024 # Specifies the minimum message size in bytes to consider for single-copy transfers for on-node messages. This applies only to the SMP (on-node shared memory) device.
Link with the DMAPP library using the following link flags
Static linking:
-Wl,--whole-archive,-ldmapp,--no-whole-archive
Dynamic linking:
-ldmapp
Enable optimized DMAPP collectives
export MPICH_USE_DMAPP_COLL=1 # Attempt to use the highly optimized GHAL-based DMAPP collective algorithms, if available.
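
The settings above can be collected into a single job-script fragment. This is only a sketch: the variable values are the ones listed above, while the rank count and the combined aprun line are assumptions added for illustration (the launch line is left commented out so the fragment can be sourced anywhere).

```shell
# Sketch of a job-script fragment gathering the settings listed above.

# Asynchronous progress engine and MPI thread safety level
export MPICH_NEMESIS_ASYNC_PROGRESS=SC
export MPICH_MAX_THREAD_SAFETY=multiple

# Shared-memory optimizations for on-node communication
export MPICH_SHARED_MEM_COLL_OPT=1
export MPICH_SMP_SINGLE_COPY_SIZE=1024

# Optimized DMAPP collectives (the binary must be linked with -ldmapp)
export MPICH_USE_DMAPP_COLL=1

# Launch with core specialization (-r 1).
# Assumption: the rank count and this exact command line are illustrative only.
# aprun -r 1 -n 512 ./IMB-NBC Iallreduce Iallreduce_pure
```

Setting MPICH_ENV_DISPLAY=1 as well is a simple way to confirm at startup that the job actually picked up these values.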

Debugging:
export MPICH_GNI_ASYNC_PROGRESS_STATS=enabled   # Generates a detailed log. May result in a large stderr file.
export MPICH_ENV_DISPLAY=1    # Rank 0 displays all MPICH environment variables and their current settings at MPI initialization time.
export MPICH_VERSION_DISPLAY=1 # Displays the Cray MPICH version number and build date information.
Results:
The following results show the overlap percentage for various message sizes and rank counts:
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V4.0.0, MPI-NBC part
#---------------------------------------------------
# Date                  : Wed Jun 10 13:22:27 2015
# Machine               : x86_64
# System                : Linux
# Release               : 3.0.101-0.31.1_1.0502.8394-cray_gem_c
# Version               : #1 SMP Mon Dec 22 19:59:41 UTC 2014
# MPI Version           : 3.0
# MPI Thread Environment:

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time



# Calling sequence was:

# ./IMB-NBC Iallreduce Iallreduce_pure

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# Iallreduce

#-----------------------------------------------------------------------------
# Benchmarking Iallreduce
# #processes = 2
# ( 510 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
       #bytes #repetitions t_ovrl[usec] t_pure[usec]  t_CPU[usec]   overlap[%]
            0         1000         1.90         1.00         0.76         0.00
            4         1000        14.90         8.23         8.29        19.64
            8         1000        14.34         8.13         8.34        25.63
           16         1000        14.28         8.11         8.31        25.77
           32         1000         5.18         2.42         2.19         0.00
           64         1000         5.30         2.54         2.30         0.00
          128         1000         5.35         2.62         2.31         0.00
          256         1000         6.23         2.71         3.01         0.00
          512         1000         6.58         3.01         3.02         0.00
         1024         1000        25.13         9.17         9.25         0.00
         2048         1000        26.93         9.73         9.75         0.00
         4096         1000        39.37        14.86        15.90         0.00
         8192         1000        45.68        23.98        25.35        14.42
        16384         1000        56.51        29.26        30.36        10.27
        32768         1000        71.82        39.03        41.71        21.41
        65536          640       130.56        64.68        68.07         3.22
       131072          320       192.12       128.23       137.54        53.55
       262144          160       456.01       333.18       355.34        65.43
       524288           80       877.91       648.10       694.36        66.90
      1048576           40      1792.55      1319.45      1403.14        66.28
      2097152           20      3643.30      2618.75      2707.99        62.17
      4194304           10      7861.61      6294.99      6681.63        76.55

#-----------------------------------------------------------------------------
# Benchmarking Iallreduce
# #processes = 4
# ( 508 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
       #bytes #repetitions t_ovrl[usec] t_pure[usec]  t_CPU[usec]   overlap[%]
            0         1000         1.96         1.03         0.76         0.00
            4         1000        14.77         8.21         8.35        21.36
            8         1000        14.84         8.32         8.33        21.81
           16         1000        14.91         8.40         8.39        22.38
           32         1000         8.02         3.74         3.58         0.00
           64         1000         8.80         4.03         4.32         0.00
          128         1000         8.88         4.03         4.30         0.00
          256         1000         9.09         4.34         4.30         0.00
          512         1000        10.23         5.12         5.00         0.00
         1024         1000        49.12        22.65        22.98         0.00
         2048         1000        51.45        19.53        20.89         0.00
         4096         1000        85.29        44.52        47.02        13.30
         8192         1000        97.68        57.76        61.46        35.04
        16384         1000       100.50        48.92        51.10         0.00
        32768         1000       133.16        75.88        80.46        28.81
        65536          640       208.02       125.01       135.63        38.80
       131072          320       320.20       203.18       216.79        46.02
       262144          160       552.87       387.50       418.35        60.47
       524288           80      1083.17       784.31       837.23        64.30
      1048576           40      2667.68      1831.70      1957.51        57.29
      2097152           20      5108.64      3597.90      3820.80        60.46
      4194304           10     10313.20      8468.91      9063.70        79.65

#-----------------------------------------------------------------------------
# Benchmarking Iallreduce
# #processes = 8
# ( 504 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
       #bytes #repetitions t_ovrl[usec] t_pure[usec]  t_CPU[usec]   overlap[%]
            0         1000         1.98         1.08         0.79         0.00
            4         1000        16.31         9.35         8.99        21.74
            8         1000        16.34         9.35         8.99        21.34
           16         1000        17.11         9.50         9.67        21.35
           32         1000        10.83         5.23         4.93         0.00
           64         1000        11.79         5.46         5.78         0.00
          128         1000        12.00         5.60         5.79         0.00
          256         1000        13.05         6.05         5.68         0.00
          512         1000        15.45         7.46         7.79         0.00
         1024         1000        88.62        54.16        56.41        38.92
         2048         1000        91.67        56.92        59.94        42.03
         4096         1000       117.08        68.63        72.45        33.13
         8192         1000       155.92       100.22       105.75        47.34
        16384         1000       169.91       105.55       111.33        42.19
        32768         1000       206.10       124.65       132.13        38.36
        65536          640       282.44       175.18       188.25        43.02
       131072          320       451.45       293.52       318.08        50.35
       262144          160       814.90       586.73       637.70        64.22
       524288           80      1525.50      1116.74      1217.77        66.43
      1048576           40      2812.62      2110.67      2290.47        69.35
      2097152           20      5562.60      4368.10      4736.86        74.78
      4194304           10     11493.40      9449.10     10221.51        80.00

#-----------------------------------------------------------------------------
# Benchmarking Iallreduce
# #processes = 16
# ( 496 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
       #bytes #repetitions t_ovrl[usec] t_pure[usec]  t_CPU[usec]   overlap[%]
            0         1000         1.97         0.99         0.78         0.00
            4         1000        23.98        13.42        13.79        23.40
            8         1000        23.31        13.23        13.12        23.03
           16         1000        23.58        13.33        13.14        21.64
           32         1000        17.37         8.55         8.40         0.00
           64         1000        18.41         9.06         9.36         0.07
          128         1000        19.42         9.59         9.79         0.00
          256         1000        20.89        10.66        10.55         3.04
          512         1000        25.39        13.18        13.29         8.11
         1024         1000       104.53        64.62        68.77        41.98
         2048         1000       111.75        69.24        73.65        42.28
         4096         1000       153.60        88.38        90.89        28.24
         8192         1000       204.74       119.83       124.31        31.69
        16384         1000       231.00       133.90       140.69        30.98
        32768         1000       283.16       163.01       172.74        30.45
        65536          640       401.26       234.56       247.10        32.54
       131072          320       614.08       399.35       427.43        49.76
       262144          160      1542.62      1202.69      1307.54        74.00
       524288           80      1799.06      1249.39      1350.63        59.30
      1048576           40      3540.18      2597.12      2869.92        67.14
      2097152           20      7238.65      5102.35      5448.03        60.79
      4194304           10     14614.80     11747.81     12562.39        77.18

#-----------------------------------------------------------------------------
# Benchmarking Iallreduce
# #processes = 32
# ( 480 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
       #bytes #repetitions t_ovrl[usec] t_pure[usec]  t_CPU[usec]   overlap[%]
            0         1000         2.02         1.11         0.75         0.00
            4         1000        26.12        14.94        15.33        27.10
            8         1000        26.55        15.09        15.36        25.44
           16         1000        26.55        15.13        15.47        26.21
           32         1000        56.41        30.47        31.51        17.71
           64         1000        59.09        31.82        32.48        16.05
          128         1000        60.26        31.61        32.79        12.63
          256         1000        62.16        32.38        33.32        10.61
          512         1000        66.90        35.50        36.33        13.57
         1024         1000       132.19        87.74        92.36        51.88
         2048         1000       143.53        95.89       101.80        53.20
         4096         1000       233.79       134.12       141.19        29.41
         8192         1000       296.50       175.57       179.41        32.59
        16384         1000       319.48       187.27       196.92        32.86
        32768         1000       368.47       213.33       221.82        30.06
        65536          640       492.14       291.76       309.23        35.20
       131072          320       684.80       447.48       480.06        50.57
       262144          160      1126.11       755.92       811.77        54.40
       524288           80      2924.45      2255.89      2435.89        72.55
      1048576           40      4575.87      3269.03      3582.63        63.52
      2097152           20      8135.84      6039.50      6651.46        68.48
      4194304           10     15998.10     12681.91     13849.74        76.06

#-----------------------------------------------------------------------------
# Benchmarking Iallreduce
# #processes = 64
# ( 448 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
       #bytes #repetitions t_ovrl[usec] t_pure[usec]  t_CPU[usec]   overlap[%]
            0         1000         2.04         1.15         0.76         0.00
            4         1000        29.01        17.49        17.86        35.52
            8         1000        29.03        17.67        17.89        36.53
           16         1000        30.20        17.80        18.35        32.39
           32         1000        73.38        38.22        39.07        10.02
           64         1000        74.97        39.25        39.98        10.66
          128         1000        76.75        39.92        41.26        10.76
          256         1000        78.64        40.89        43.21        12.64
          512         1000        83.92        43.38        45.37        10.64
         1024         1000       154.92       105.69       111.82        55.97
         2048         1000       172.74       118.62       124.87        56.66
         4096         1000       274.13       156.72       164.95        28.82
         8192         1000       347.89       206.35       214.72        34.08
        16384         1000       370.66       217.45       225.99        32.21
        32768         1000       413.57       237.15       251.72        29.91
        65536          640       545.39       319.63       337.08        33.02
       131072          320       753.33       488.38       519.44        48.99
       262144          160      1737.74      1251.31      1299.16        62.56
       524288           80      2977.36      2254.06      2367.65        69.45
      1048576           40      4748.08      3493.15      3702.74        66.11
      2097152           20      8336.81      6419.29      6859.48        72.05
      4194304           10     16571.00     13475.70     14752.53        79.02

#-----------------------------------------------------------------------------
# Benchmarking Iallreduce
# #processes = 128
# ( 384 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
       #bytes #repetitions t_ovrl[usec] t_pure[usec]  t_CPU[usec]   overlap[%]
            0         1000         2.04         1.09         0.78         0.00
            4         1000        30.45        19.34        19.24        41.99
            8         1000        31.35        19.46        19.93        40.37
           16         1000        31.67        19.67        20.53        41.52
           32         1000        87.47        44.39        46.79         7.95
           64         1000        88.55        45.19        46.93         7.60
          128         1000        90.29        46.80        48.85        10.96
          256         1000        93.93        47.70        48.93         5.52
          512         1000        99.08        51.59        53.88        11.87
         1024         1000       195.37       139.19       146.23        61.58
         2048         1000       208.54       142.09       150.35        55.80
         4096         1000       309.73       173.60       182.41        25.37
         8192         1000       390.39       230.71       242.45        34.14
        16384         1000       407.32       236.79       248.74        31.45
        32768         1000       450.14       256.92       271.70        28.89
        65536          640       585.12       341.30       359.84        32.24
       131072          320       825.88       513.12       549.32        43.06
       262144          160      1953.49      1313.58      1388.67        53.92
       524288           80      3207.60      2371.51      2500.62        66.56
      1048576           40      5129.15      3738.55      4064.95        65.79
      2097152           20      9047.15      6680.85      6916.02        65.79
      4194304           10     18619.20     13818.79     15136.36        68.29

#-----------------------------------------------------------------------------
# Benchmarking Iallreduce
# #processes = 256
# ( 256 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
       #bytes #repetitions t_ovrl[usec] t_pure[usec]  t_CPU[usec]   overlap[%]
            0         1000         2.07         1.13         0.78         0.00
            4         1000        31.49        19.95        19.94        42.12
            8         1000        32.18        20.12        20.58        41.37
           16         1000        32.56        20.31        21.06        41.81
           32         1000       112.31        60.96        61.83        16.96
           64         1000       111.23        59.58        61.60        16.16
          128         1000       109.10        55.40        57.23         6.17
          256         1000       115.20        60.47        62.23        12.04
          512         1000       118.41        60.85        62.80         8.36
         1024         1000       209.41       142.46       148.86        55.03
         2048         1000       237.63       163.81       172.69        57.25
         4096         1000       344.82       192.93       198.02        23.30
         8192         1000       427.11       250.26       262.61        32.66
        16384         1000       446.48       258.61       267.91        29.87
        32768         1000       490.09       280.44       295.29        29.00
        65536          640       622.08       363.85       383.41        32.65
       131072          320       892.98       532.36       558.42        35.42
       262144          160      2272.04      1694.26      1788.40        67.69
       524288           80      3432.19      2461.62      2507.14        61.29
      1048576           40      5334.58      3799.22      4057.78        62.16
      2097152           20      9105.75      6808.46      7253.48        68.33
      4194304           10     17817.90     14020.30     15345.43        75.25

#-----------------------------------------------------------------------------
# Benchmarking Iallreduce
# #processes = 512
#-----------------------------------------------------------------------------
       #bytes #repetitions t_ovrl[usec] t_pure[usec]  t_CPU[usec]   overlap[%]
            0         1000         2.14         1.09         0.77         0.00
            4         1000        33.53        21.75        21.94        46.32
            8         1000        34.28        21.93        21.92        43.66
           16         1000        35.75        22.59        22.60        41.77
           32         1000       132.41        70.23        73.37        15.25
           64         1000       129.62        67.00        68.07         8.01
          128         1000       130.41        67.86        69.37         9.83
          256         1000       131.01        66.39        69.32         6.78
          512         1000       138.63        73.31        77.14        15.31
         1024         1000       323.16       253.45       263.74        73.57
         2048         1000       342.64       263.96       279.01        71.80
         4096         1000       424.28       242.21       253.95        28.30
         8192         1000       504.31       309.63       321.41        39.43
        16384         1000       518.99       312.61       327.53        36.99
        32768         1000       560.26       333.12       347.17        34.57
        65536          640       687.74       412.09       435.21        36.66
       131072          320       994.97       611.65       644.15        40.49
       262144          160      2419.55      1789.48      1893.83        66.73
       524288           80      3542.32      2525.26      2685.30        62.12
      1048576           40      5559.43      3904.75      4200.27        60.61
      2097152           20      9645.30      7030.95      7644.80        65.80
      4194304           10     19036.89     14211.01     15473.82        68.81
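
To help read the overlap[%] column in the tables above, the sketch below recomputes an overlap estimate from the three timing columns of a single row. It assumes the estimate described in the Intel MPI Benchmarks documentation for IMB-NBC, overlap = 100 * max(0, min(1, (t_pure + t_CPU - t_ovrl) / min(t_pure, t_CPU))). The helper name overlap_pct is hypothetical and is not part of the benchmark.

```shell
# Hypothetical helper (not part of IMB): estimate overlap[%] for one table row.
# Assumed formula:
#   overlap = 100 * max(0, min(1, (t_pure + t_CPU - t_ovrl) / min(t_pure, t_CPU)))
# Usage: overlap_pct <t_ovrl> <t_pure> <t_CPU>   (all in microseconds)
overlap_pct() {
    awk -v ovrl="$1" -v pure="$2" -v cpu="$3" 'BEGIN {
        # Normalize by the shorter of the two phases
        denom = (pure < cpu) ? pure : cpu
        if (denom <= 0) { printf "%.2f\n", 0; exit }
        # Time saved by overlapping, as a fraction of the shorter phase
        frac = (pure + cpu - ovrl) / denom
        # Clamp to the range [0, 1]
        if (frac < 0) frac = 0
        if (frac > 1) frac = 1
        printf "%.2f\n", 100 * frac
    }'
}

# 32-byte row of the 2-process table: t_ovrl exceeds t_pure + t_CPU, so no overlap
overlap_pct 5.18 2.42 2.19
```

A value recomputed this way need not match the printed column exactly, presumably because the benchmark derives it from unrounded timings rather than from the rounded figures shown in the table.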