Queue, Scheduling and Charging Policies
To target specific node types, we have implemented three features: xe, xk and x. The default feature is xe, so a job runs on XE nodes unless you specify xk for XK nodes, both xe and xk for a multi-req job (one that uses both XE and XK nodes), or the x feature, which spans both XE and XK nodes without specifying how many of each to use.
A small number of XE and XK nodes (96 of each) offer double the usual amount of memory: 128 GB for XE and 64 GB for XK. To target these nodes for a job, append himem to the xe or xk feature in the #PBS -l nodes=... line. See below for examples.
Examples of using features:
For XE node specification: #PBS -l nodes=1024:ppn=32:xe
For XK node specification: #PBS -l nodes=1024:ppn=16:xk
For both XE and XK: #PBS -l nodes=1024:ppn=32:xe+1024:ppn=16:xk
For XE/XK-non-specific (X-feature) node specification: #PBS -l nodes=1024:ppn=16:x
For XE large memory nodes: #PBS -l nodes=64:ppn=32:xehimem
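The feature syntax shown in the examples above can be sketched as a small helper that composes the #PBS -l nodes=... line. This is only an illustration, not a Blue Waters tool: the function names node_request and pbs_nodes_directive are hypothetical, and the helper formats the string without checking node counts or ppn values against the actual system.

```python
def node_request(count, ppn, feature="xe", himem=False):
    """Format one node request such as '1024:ppn=32:xe'.

    feature is one of 'xe', 'xk' or 'x'; 'xe' is the default, as in the text.
    """
    if feature not in ("xe", "xk", "x"):
        raise ValueError(f"unknown feature: {feature!r}")
    if himem and feature == "x":
        # The text only describes appending himem to the xe or xk feature.
        raise ValueError("himem applies only to the xe or xk feature")
    suffix = "himem" if himem else ""
    return f"{count}:ppn={ppn}:{feature}{suffix}"


def pbs_nodes_directive(*requests):
    """Join one or more node requests with '+' into a '#PBS -l nodes=' line."""
    if not requests:
        raise ValueError("at least one node request is required")
    return "#PBS -l nodes=" + "+".join(requests)


if __name__ == "__main__":
    # XE only
    print(pbs_nodes_directive(node_request(1024, 32, "xe")))
    # Multi-req: both XE and XK
    print(pbs_nodes_directive(node_request(1024, 32, "xe"),
                              node_request(1024, 16, "xk")))
    # XE large memory nodes
    print(pbs_nodes_directive(node_request(64, 32, "xe", himem=True)))
```

The three calls reproduce the XE, XE plus XK, and XE large memory examples listed above.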
A queue-based system is used to establish initial job priorities and charging.
To specify a queue: #PBS -q queue_name
** - The debug queue is limited to one job per user either running or eligible to run. If a user has a job running in the debug queue then all of that user's other jobs in the debug queue will be in the blocked state as shown by the showq command. If a user does not have a job running in the debug queue then only the user's largest job in the debug queue will be eligible for scheduling and the user's smaller debug jobs will be blocked. Blocked jobs will not run even if appropriate nodes are available. For this reason, when running a set of tests at different node counts it is best to submit only the largest job to the debug queue and the smaller jobs to the high queue with a short walltime limit to allow backfilling as nodes are cleared for the debug job.
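The debug queue rule described above can be modeled in a few lines. This is a simplified sketch of the policy as stated, not the scheduler's actual implementation: the function name classify_debug_jobs and the job dictionary layout are hypothetical, "largest" is assumed to mean the highest node count, and ties are assumed to go to the job listed first.

```python
def classify_debug_jobs(jobs):
    """Classify ONE user's debug-queue jobs as running, eligible or blocked.

    jobs: list of dicts with keys 'id', 'nodes' and 'running' (bool).
    Returns a dict mapping job id to its state.
    """
    has_running = any(job["running"] for job in jobs)
    queued = [job for job in jobs if not job["running"]]

    # Only when nothing is running can a queued job be eligible, and then
    # only the largest one (assumption: largest = most nodes, first wins ties).
    largest = None
    if not has_running and queued:
        largest = max(queued, key=lambda job: job["nodes"])

    states = {}
    for job in jobs:
        if job["running"]:
            states[job["id"]] = "running"
        elif job is largest:
            states[job["id"]] = "eligible"
        else:
            states[job["id"]] = "blocked"
    return states


if __name__ == "__main__":
    tests = [
        {"id": "small", "nodes": 16, "running": False},
        {"id": "large", "nodes": 256, "running": False},
        {"id": "medium", "nodes": 64, "running": False},
    ]
    print(classify_debug_jobs(tests))
```

In this model only the 256-node job is eligible and the other two are blocked, which is why the smaller tests are better submitted to the high queue.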
§ - The noalloc queue is intended for users in projects that have run out of node-hours in their allocation and/or are in their grace period. Jobs in the noalloc queue have effectively no priority and will run only when nodes are idle and available for the requested wallclock time. The jobs are eligible for preemption after 1 hour. The use of minwclimit may allow larger jobs to start sooner, but there is no guarantee that the job will run longer than the 1 hour preemption time. Projects will not be "charged" for accounting purposes, but users can look at their Past Jobs to determine usage.
Moving Jobs Among Queues:
After being queued, a job may be moved from one queue to a different one by its submitter. You might do this if you realize you put the job in the wrong queue, or you need the job to run sooner. The command to do this is "qmove". Find out about its features using "man qmove".
Schedule Configuration (How do I make my jobs more likely to run?)
The Blue Waters project doesn't publish the exact configuration of our scheduling system. We change it from time to time, so we don't want to guarantee any specific feature. However, here is a list of general considerations for choosing your job parameters.
Larger jobs generally get priority over smaller jobs.
Wall time of a job no longer factors into priority calculations on Blue Waters. For both size and wall-time considerations, see the "Why isn't my job running?" page in this section and its discussion on backfill; smaller and shorter jobs fit into backfill better.
Jobs accumulate priority when they're in the eligible state in the queue. So if you have a job that isn't running, it's better to leave it there than to re-submit.
Jobs submitted to the "high" or "debug" queues have higher starting priority than jobs in the normal queue with the same parameters; see above for tradeoffs for using those queues.
Job Scheduling Limits
There is a limit on the total node count that one user can have in the queue; this limit is larger than the total node count of Blue Waters. There is also a limit on the total queued nodes per project, which is more than the per-user limit but less than double it, so a single user cannot prevent other users in the project from having eligible jobs, but two users in the same project can. There is also a very large upper limit on running jobs per allocation, but most groups will not hit this limit unless their jobs are very small.
Charging is based on the aggregate node-hours for a job scaled by the charging factor for the queue used by the job. The normal queue will have a charging factor of 1. Other queues will have a higher or lower factor depending on variables like priority or preemptibility.
Compute nodes are allocated in an exclusive manner; jobs do not share nodes. The use of one node for one hour has a usage of one node-hour scaled by the queue charging factor for the job to which the node is allocated. The number of PEs (processing elements) or number of threads on the node is not a factor in usage.
The usage command will report aggregate node-hours taking into account the queue charging factor for each contributing job. The portal provides charge information on individual past jobs as well.
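The charging rule described above amounts to nodes times hours times the queue charging factor, summed over jobs. The sketch below illustrates that arithmetic only; it is not the usage command. The function names are hypothetical, and apart from the factor of 1 for the normal queue, the factor values used here are made-up placeholders rather than published Blue Waters values.

```python
def job_charge(nodes, elapsed_hours, charge_factor=1.0):
    """Charge for one job in node-hours.

    The number of PEs or threads per node is deliberately not a parameter:
    nodes are allocated exclusively, so only whole nodes and time count.
    """
    if nodes < 0 or elapsed_hours < 0 or charge_factor < 0:
        raise ValueError("nodes, elapsed_hours and charge_factor must be non-negative")
    return nodes * elapsed_hours * charge_factor


def aggregate_charge(jobs):
    """Sum the charges of (nodes, elapsed_hours, charge_factor) tuples."""
    return sum(job_charge(n, h, f) for n, h, f in jobs)


if __name__ == "__main__":
    # Normal queue: the charging factor is 1 per the text.
    print(job_charge(1024, 2.0))        # 2048.0 node-hours
    # Placeholder factor of 2.0 for some higher-priority queue (assumption).
    print(job_charge(1024, 2.0, 2.0))   # 4096.0 node-hours
    print(aggregate_charge([(1024, 2.0, 1.0), (64, 0.5, 2.0)]))  # 2112.0
```

A job that uses one core per node is charged the same as one that uses every core, since the node count alone enters the calculation.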
As discussed in the User Guide overview and the System Summary, there are 16 cores (AMD Bulldozer compute cores) per XE node and 8 cores per XK node.
The current policy on job refunds is that, in regular operations, it is impractical to address refund requests on a system of this size because of the time it takes to determine the cause of a job termination. We strongly recommend that users implement an efficient checkpoint strategy in their application and use the recommended checkpoint interval calculator to determine the time between checkpoints based on node count and the time to write a checkpoint. In extraordinary cases refunds might be considered. Send email to firstname.lastname@example.org for more information.