
Topology Aware Scheduling

(In production starting January 13)

Introduction

Topology-aware scheduling is a new method for scheduling compute jobs on Blue Waters that improves application performance by assigning a contiguous prism of nodes in the Blue Waters torus to each job. The standard algorithm for scheduling compute jobs on Blue Waters attempts to maximize processor utilization by assigning disconnected groups of nodes to a job when contiguous blocks of nodes are not available, so that the job may start as soon as possible.

Blue Waters Topology

Blue Waters uses the Cray Gemini interconnect network, a 3-D torus, for communicating data between compute nodes as well as for moving I/O data from compute nodes to service nodes. Under the standard scheduling algorithm, data moving between compute nodes in two different jobs may share particular communication links, causing both jobs to compete for link bandwidth. By scheduling jobs onto nodes so that every job is contained within a sufficiently small convex prism, all communication traffic (but not I/O traffic) for each job is completely contained within its prism, so traffic from one job cannot interfere with traffic from another. Topology-aware scheduling can improve performance for many types of jobs, and it reduces run-to-run performance variation. However, the constraints on job placement may cause jobs to wait longer in the queue, and overall system utilization may decrease. The figure below is a snapshot of the job node assignments on Blue Waters, showing a typical arrangement of ten large jobs.

[Figure: Topology-aware scheduler torus view]

Specifying Topology Aware Scheduling Properties

Do Nothing

If no special flags are specified, the topology-aware scheduler optimizes communication performance by placing the job into a convex cuboid node allocation. 

Specify Application Communication Properties

qsub -l flags=commintolerant

The commintolerant flag prevents the scheduler from placing the job adjacent to a large job. When a job's prism is larger than half of any torus dimension, the shortest path between some of its node pairs routes traffic outside the job's own allocation, where it can appear on a neighboring job's links. If a job is extremely sensitive to interference from outside communication traffic, the commintolerant flag causes the scheduler to choose a placement not adjacent to any large job, eliminating such interference. Because the flag limits the set of candidate placements, jobs using it may experience longer queue times.
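The flag can also be set in a job script's PBS header instead of on the qsub command line. A minimal sketch, assuming the nodes=N:ppn=32 resource convention; the node count, walltime, and application name are illustrative, not taken from this document:

```shell
#!/bin/bash
#PBS -l nodes=64:ppn=32:xe     # illustrative node request (assumption)
#PBS -l flags=commintolerant   # do not place this job adjacent to large jobs
#PBS -l walltime=01:00:00      # illustrative walltime

aprun -n 2048 ./my_app         # my_app is a placeholder executable
```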

qsub -l flags=commtolerant

The commtolerant flag is now set by default. The job will still receive a convex cuboid node allocation, but the commtolerant flag does allow the scheduler to place the job adjacent to large jobs. If a job is sensitive to outside communication interference, adjacent large jobs may affect performance.

qsub -l flags=commlocal

The commlocal flag (also now set by default) allows the job to be placed next to other jobs even when its prism dimensions exceed half of a torus dimension, a situation in which some of its traffic could otherwise route through neighboring allocations. With this flag, task placement keeps most communicating pairs within half a torus dimension of each other, so the shortest path between most pairs stays inside the job's own placement prism, limiting the interference imposed on neighbors.

qsub -l flags=commlocal:commintolerant

Combines the two flags: the job keeps most of its communication local (commlocal) and will not be placed adjacent to large jobs (commintolerant).

Specify Allocated Node Shape (Prism) and Origin

To specify a particular shape and optionally a location/origin in the torus:

Shape syntax: XxYxZ

Location/origin syntax: x.y.z

All dimensions are expressed in geminis, each of which contains 2 compute nodes; each dimension has a valid range of 1-24. A 2x2x2 shape, for example, contains 8 geminis, or 16 nodes.
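Since each gemini holds two compute nodes, the node count of a shape follows directly from its dimensions. A small shell sketch of the conversion (the shape_nodes helper is hypothetical, shown only to illustrate the arithmetic):

```shell
# Convert a gemini shape string (XxYxZ) into a compute-node count.
# Each gemini contains 2 compute nodes, as described above.
shape_nodes() {
  x=${1%%x*}; rest=${1#*x}
  y=${rest%%x*}; z=${rest#*x}
  echo $(( 2 * x * y * z ))
}

shape_nodes 2x2x2   # 16 nodes (8 geminis)
shape_nodes 3x3x3   # 54 nodes (27 geminis)
```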

Valid syntax for the -l geometry resource:

geometry=shape
geometry=shape@location
geometry=@location

qsub -l geometry=3x3x3
qsub -l geometry=4x1x4@12.0.0
qsub -l geometry=@0.4.0
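A geometry request can likewise go in the job script header, combined with the communication flags described earlier. A sketch, assuming the nodes=N:ppn=32 convention; the node count and application name are assumptions (54 nodes matches the 27 geminis of a 3x3x3 shape):

```shell
#!/bin/bash
#PBS -l nodes=54:ppn=32:xe     # 54 nodes = 27 geminis, matching the 3x3x3 shape
#PBS -l geometry=3x3x3         # request a 3x3x3 gemini prism
#PBS -l flags=commintolerant   # optionally avoid placement next to large jobs

cd $PBS_O_WORKDIR
aprun -n 1728 ./my_app         # my_app is a placeholder (54 nodes x 32 tasks)
```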

Submitting jobs

qsub PBS header changes

Example 1: a 1024-node XE job. The job's shape is 11x2x24, with start location 3.6.0. The shape includes 28 service nodes (2.65%) and 3 down nodes (visible via checkjob).

Example 2: an 8x6x9 shape starting at 16.10.8, with 20 service nodes and 1 down node.
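Using the geometry syntax from the previous section, the two placements described above could be requested explicitly. A sketch; job.pbs is a placeholder script name:

```shell
# Example 1: an 11x2x24 prism starting at gemini 3.6.0
qsub -l geometry=11x2x24@3.6.0 job.pbs

# Example 2: an 8x6x9 prism starting at gemini 16.10.8
qsub -l geometry=8x6x9@16.10.8 job.pbs
```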

Host list syntax:

qsub -l hostlist=host+host+...

qsub -l hostlist=host%tasks+host%tasks+...

Example (hosts 12, 13, 14, and 15, with 32 tasks each):

qsub -l hostlist=12%32+13%32+14%32+15%32
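A hostlist string like the one above can be generated with a small helper (make_hostlist is a hypothetical function, shown only to illustrate the host%tasks+host%tasks format):

```shell
# Build a hostlist value: each host gets the same task count,
# joined with '+' in the host%tasks format shown above.
make_hostlist() {
  tasks=$1; shift
  out=""
  for h in "$@"; do
    out="${out:+${out}+}${h}%${tasks}"
  done
  echo "$out"
}

make_hostlist 32 12 13 14 15   # prints 12%32+13%32+14%32+15%32
```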

Warning about using host lists with the Topology Aware Scheduler:

If a hostlist is specified without an explicit geometry, the scheduler treats the job as if the bounding box of the specified nodes had been requested as the geometry. If a sparse list is provided, the bounding box will likely include more idle nodes than the allowed threshold, and the job state will be set to "batchhold".

 

Querying Job States

qstat - no change

showq - no change

checkjob TAS aware

checkjob <jobid> now includes topology-related information. The following is an example of a 40-node XK job that received a reservation.

Total Requested Tasks: 40

Req[0]  TaskCount: 40  Partition: ALL
Opsys: ---  Arch: ---  Features: xk

Reserved Nodes:  (14:35:27 -> 1:14:35:27  Duration: 1:00:00:00)
[9402:1][9403:1][9400:1][9401:1][9398:1][9399:1]
[9396:1][9397:1][9394:1][9395:1][9028:1][9029:1]
[9030:1][9031:1][9032:1][9033:1][9034:1][9035:1]
[9036:1][9037:1][7098:1][7099:1][7096:1][7097:1]
[7094:1][7095:1][7092:1][7093:1][7090:1][7091:1]
[6724:1][6725:1][6726:1][6727:1][6728:1][6729:1]
[6730:1][6731:1][6732:1][6733:1]
QOS Flags:        ALLOWCOMMFLAGS,ALLOWGEOMETRY,ALLOWWRAP
Candidate Shapes: 4x1x5 5x1x4
Placement:        4x1x5@16.3.2
Internal Frag:    0/40 (0.00%)
Down Nodes:       0
Service Nodes:    0 (0.00%)
Ideal App Cost:   2.28
Actual App Cost:  2.28
Total Cost:       4.95
Dateline Zones:   Job crosses zone boundaries: Z+ Z-

showres - no change

Backfill

showbf

showbf reports the largest generated shape that fits each backfill window. showbf -G can be used to limit the display of backfill windows to those that can accommodate the given geometry. For example, showbf -G 3x3x3 shows information about each backfill window that can accommodate a job in a 3x3x3 shape.
 

showbf -p nid11293
Partition      Tasks  Nodes      Duration   StartOffset       StartDate  Geometry
---------     ------  -----  ------------  ------------  --------------  --------
nid11293        3584    224      23:44:33      00:00:00  22:50:05_10/29  8x2x7
nid11293        1920    120    1:15:44:33      00:00:00  22:50:05_10/29  5x2x6
nid11293        1600    100      INFINITY      00:00:00  22:50:05_10/29  5x2x5

showbf -n 200
Partition      Tasks  Nodes      Duration   StartOffset       StartDate  Geometry
---------     ------  -----  ------------  ------------  --------------  --------
ALL             3584    224      23:57:59      00:00:00  22:50:41_10/29  8x2x7
nid11293        3584    224      23:57:59      00:00:00  22:50:41_10/29  8x2x7

 

showbf -G 3x3x3
Partition      Tasks  Nodes      Duration   StartOffset       StartDate  Geometry
---------     ------  -----  ------------  ------------  --------------  --------
ALL              864     54      23:56:03      00:00:00  22:52:37_10/29  3x3x3
nid11293         864     54      23:56:03      00:00:00  22:52:37_10/29  3x3x3

Find xk nodes via showbf:


showbf -f xk -p nid11293
Partition      Tasks  Nodes      Duration   StartOffset       StartDate  Geometry
---------     ------  -----  ------------  ------------  --------------  --------
nid11293        9088    568      14:49:02      00:00:00  14:10:58_11/05  9x4x8