
Blue Waters Data Sets

For questions, email: help+bw@ncsa.illinois.edu


Overview

The Blue Waters data is the result of scientific data processing since 2012 on the Petascale Computing Facility, sponsored by the National Science Foundation. The Blue Waters data is publicly available for viewing and downloading. The data has been anonymized; that is, any personnel or account names associated with the data have been removed.

General Description of Collected Data

OVIS, an open-source software suite designed for monitoring the performance, health, and efficiency of large-scale computing systems, was used for most of the data collection. OVIS gathers this data through an API and network protocol called the Lightweight Distributed Metric Service (LDMS).

Blue Waters data comprises statistics compiled on various computer hardware and software activities; a few examples are:

  • I/O data on various components such as NICs
  • statistics on node usage
  • memory allocations
  • CPU and GPU performance
  • reads, writes, caching, file opens and closes
  • file transfers and data calls
  • communication link status

For each data set, a detailed description of the data set or its data elements is provided, or a link is given to obtain more information.

How to Get Access to the Data

Access to the datasets is provided via https://www.globus.org/.  You may log in with an existing institutional account or create a new account at that site.

The collection name is Blue Waters System Monitoring Data Set; it can be located by that name within Globus.

Data Types Available

1. Node metric, compute and service node (time series) data

     A. Cray system sampler data

     B. Model Specific Registers (MSR) data

2. Syslogs Data

3. Resource Manager data (Torque)

4. System Environment Data Collections

5. Darshan data (I/O data)

6. Lustre User Experience Metrics

Each data type is described in the Explanation of Data Types section further down this page.


Explanation of Data Types

 

1. Node metric, compute and service node (time series) data

1 A.  Cray System Sampler Data

The following is a description of the node and time series data contents.

Values with units of B are raw byte counts at the time of the sample. Most other values are also raw counts at the sample time. The few rate data points are denoted B/s (bytes per second).

To parse the data files, use the appropriate header to determine the position of each field within the comma-separated data file. Because the data format has changed slightly over time, files named HEADER.<date range> denote the format for each period. A minimal parsing sketch follows.
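The sketch below, in Python, pairs a HEADER file with its data files so that fields can be referenced by name. It assumes the HEADER file contains a single comma-separated line of column names matching the data files for that period; the file names used here are hypothetical.

```python
# Minimal parsing sketch (assumptions noted in the text above).
import csv

def load_header(header_path):
    # The HEADER.<date range> file is assumed to hold one comma-separated
    # line of column names matching the data files for that period.
    with open(header_path) as f:
        return [name.strip() for name in f.readline().split(",")]

def iter_records(data_path, columns):
    with open(data_path) as f:
        for row in csv.reader(f):
            if not row or row[0].lstrip().startswith("#"):
                continue  # skip any embedded header or comment lines
            yield dict(zip(columns, (value.strip() for value in row)))

if __name__ == "__main__":
    columns = load_header("HEADER.20161201-20170301")                   # hypothetical name
    for record in iter_records("node_metrics.20161206.csv", columns):   # hypothetical name
        # "#Time" is epoch seconds (GMT); "Time_usec" is the fractional part in microseconds.
        print(record["CompId"], record["#Time"], record.get("current_freemem"))
        break
```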

Each entry below lists the file data name followed by its definition.

#Time Time in epoch seconds (GMT)
Time_usec Fractional part of the sample time, in microseconds
CompId Node ID
Tesla_K20X.gpu_util_rate GPU utilization reported by NVIDIA at the time of sample (see attached NVIDIA documentation for more info)
Tesla_K20X.gpu_agg_dbl_ecc_total_errors Total GPU double-bit ECC errors (see attached NVIDIA documentation for more info)
Tesla_K20X.gpu_agg_dbl_ecc_texture_memory GPU double-bit ECC errors for texture memory (see attached NVIDIA documentation for more info)
Tesla_K20X.gpu_agg_dbl_ecc_register_file GPU double-bit ECC errors for the register file (see attached NVIDIA documentation for more info)
Tesla_K20X.gpu_agg_dbl_ecc_device_memory GPU double-bit ECC errors for device memory (see attached NVIDIA documentation for more info)
Tesla_K20X.gpu_agg_dbl_ecc_l2_cache GPU double-bit ECC errors for the level 2 cache (see attached NVIDIA documentation for more info)
Tesla_K20X.gpu_agg_dbl_ecc_l1_cache GPU double-bit ECC errors for the level 1 cache (see attached NVIDIA documentation for more info)
Tesla_K20X.gpu_memory_used GPU memory in use, in KB (see attached NVIDIA documentation for more info)
Tesla_K20X.gpu_temp GPU temperature in Celsius
Tesla_K20X.gpu_pstate Power management state (see attached NVIDIA documentation for more info)
Tesla_K20X.gpu_power_limit Power limit (maximum) in milliwatts
Tesla_K20X.gpu_power_usage GPU power consumption in milliwatts
ipogif0_tx_bytes Bytes transmitted with TCP/IP over the Gemini interface
ipogif0_rx_bytes Bytes received with TCP/IP over the Gemini interface
RDMA_rx_bytes Remote Direct Memory Access (RDMA) received bytes
RDMA_nrx RDMA cumulative number of receives
RDMA_tx_bytes RDMA cumulative transmit bytes
RDMA_ntx RDMA cumulative number of transfers
SMSG_rx_bytes Cumulative bytes received via the short message protocol (refer to Cray documentation)
SMSG_nrx Cumulative number of short message receives (refer to Cray documentation)
SMSG_tx_bytes Cumulative short message transmit bytes (refer to Cray documentation)
SMSG_ntx Cumulative number of short message transmits (refer to Cray documentation)
current_freemem Unallocated memory in KB
loadavg_total_processes Unix load of all processes ready to run, average x 100
loadavg_running_processes Unix load of processes in the running state, average x 100
loadavg_5min(x100) Unix load 5-minute average x 100
loadavg_latest(x100) Current Unix load x 100
nr_writeback Count of pages scheduled for writeback but not yet completed
nr_dirty Count of dirty pages waiting to be scheduled to an output device
lockless_write_bytes#stats.snx11001 Cumulative lockless write bytes. This is a special kind of I/O where clients do not take any locks but instead instruct the server to take the locks on the client's behalf
lockless_read_bytes#stats.snx11001 Cumulative lockless read bytes. This is a special kind of I/O where clients do not take any locks but instead instruct the server to take the locks on the client's behalf
direct_write#stats.snx11001 Cumulative number of direct writes to storage
direct_read#stats.snx11001 Cumulative number of direct reads from storage
inode_permission#stats.snx11001 Cumulative number of checks for access rights to a given inode
removexattr#stats.snx11001 Cumulative number of remove attributes.  Command removes the extended attribute identified by name and associated with the given path in the filesystem.
listxattr#stats.snx11001 Cumulative number of listattr.  Command retrieves the list of extended attribute names associated with the given path in the filesystem.   
getxattr#stats.snx11001 Cumulative number of times the operation has occurred to retrieve the value of the extended attribute identified by name and associated with the given path in the filesystem.
setxattr#stats.snx11001 Cumulative number of calls to set extended attributes 
alloc_inode#stats.snx11001 Cumulative number of inode allocations; the system will allocate another inode as needed.
statfs#stats.snx11001 Cumulative number of calls to stat fs
getattr#stats.snx11001 Cumulative number of get attribute calls
flock#stats.snx11001 Cumulative count of file locks. The flock utility manages flock locks from within shell scripts or from the command line.
lockless_truncate#stats.snx11001 Cumulative number of file truncates without locking a file.
truncate#stats.snx11001 The cumulative number of events to shrink (or extend) the size of a file to the specified size
setattr#stats.snx11001 The cumulative number of times setattr was called.  This command sets the value of given attribute of an object
fsync#stats.snx11001 The cumulative number of fsync transfers.  fsync transfers ("flushes") all modified in-core data of (i.e.,modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted.  This includes writing through or flushing a disk cache if present.  The call blocks until the device reports that the transfer has completed.  As well as flushing the file data, fsync() also flushes the metadata information associated with the file.
seek#stats.snx11001 Cumulative File seeks
mmap#stats.snx11001 Cumulative number of new mapping in the virtual address space of the calling process.  The starting address for the new mapping is specified in addr.
close#stats.snx11001 Cumulative File Closes
open#stats.snx11001 Cumulative File Opens
ioctl#stats.snx11001 Cumulative  Input/Output control calls
brw_write#stats.snx11001 Cumulative bulk I/O writes to storage
brw_read#stats.snx11001 Cumulative bulk I/O reads from storage
write_bytes#stats.snx11001 Cumulative Writes to storage in bytes
read_bytes#stats.snx11001 Cumulative Reads to storage in bytes
writeback_failed_pages#stats.snx11001 Cumulative number of writeback failed pages.  Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem
writeback_ok_pages#stats.snx11001 Cumulative number of writeback success pages.Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem
writeback_from_pressure#stats.snx11001 Cumulative number of writeback from pressure. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem
writeback_from_writepage#stats.snx11001 Cumulative number of writeback from writepages.  Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem
dirty_pages_misses#stats.snx11001 Cumulative number of Dirty page misses;  Dirty pages are the pages in memory (page cache) that have been updated and therefore have changed from what is currently stored on disk.
dirty_pages_hits#stats.snx11001 Cumulative number of Dirty page hits;  Dirty pages are the pages in memory (page cache) that have been updated and therefore have changed from what is currently stored on disk.
lockless_write_bytes#stats.snx11002 Cumulative lockless write bytes. This is a special kind of I/O where clients do not take any locks but instead instruct the server to take the locks on the client's behalf
lockless_read_bytes#stats.snx11002 Cumulative lockless read bytes. This is a special kind of I/O where clients do not take any locks but instead instruct the server to take the locks on the client's behalf
direct_write#stats.snx11002 Cumulative number of direct writes to storage
direct_read#stats.snx11002 Cumulative number of direct reads from storage
inode_permission#stats.snx11002 Cumulative number of checks for access rights to a given inode
removexattr#stats.snx11002 Cumulative number of remove attributes.  Command removes the extended attribute identified by name and associated with the given path in the filesystem.
listxattr#stats.snx11002 Cumulative number of listattr.  Command retrieves the list of extended attribute names associated with the given path in the filesystem.   
getxattr#stats.snx11002 Cumulative number of times the operation has occurred to retrieve the value of the extended attribute identified by name and associated with the given path in the filesystem.
setxattr#stats.snx11002 Cumulative number of calls to set extended attributes 
alloc_inode#stats.snx11002 Cumulative number of inode allocations; the system will allocate another inode as needed.
statfs#stats.snx11002 Cumulative number of calls to stat fs
getattr#stats.snx11002 Cumulative number of get attribute calls
flock#stats.snx11002 Cumulative count of file locks. The flock utility manages flock locks from within shell scripts or from the command line.
lockless_truncate#stats.snx11002 Cumulative number of file truncates without locking a file.
truncate#stats.snx11002 The cumulative number of events to shrink (or extend) the size of a file to the specified size
setattr#stats.snx11002 The cumulative number of times setattr was called.  This command sets the value of given attribute of an object
fsync#stats.snx11002 The cumulative number of fsync transfers.  fsync transfers ("flushes") all modified in-core data of (i.e.,modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted.  This includes writing through or flushing a disk cache if present.  The call blocks until the device reports that the transfer has completed.  As well as flushing the file data, fsync() also flushes the metadata information associated with the file.
seek#stats.snx11002 Cumulative File seeks
mmap#stats.snx11002 Cumulative number of new mapping in the virtual address space of the calling process.  The starting address for the new mapping is specified in addr.
close#stats.snx11002 Cumulative File Closes
open#stats.snx11002 Cumulative File Opens
ioctl#stats.snx11002 Cumulative  Input/Output control calls
brw_write#stats.snx11002 Cumulative bulk I/O writes to storage
brw_read#stats.snx11002 Cumulative bulk I/O reads from storage
write_bytes#stats.snx11002 Cumulative Writes to storage in bytes
read_bytes#stats.snx11002 Cumulative Reads to storage in bytes
writeback_failed_pages#stats.snx11002 Cumulative number of writeback failed pages.  Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem
writeback_ok_pages#stats.snx11002 Cumulative number of writeback success pages. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem
writeback_from_pressure#stats.snx11002 Cumulative number of writeback from pressure. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem
writeback_from_writepage#stats.snx11002 Cumulative number of writeback from writepages.  Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem
dirty_pages_misses#stats.snx11002 Cumulative number of Dirty page misses;  Dirty pages are the pages in memory (page cache) that have been updated and therefore have changed from what is currently stored on disk.
dirty_pages_hits#stats.snx11002 Cumulative number of Dirty page hits;  Dirty pages are the pages in memory (page cache) that have been updated and therefore have changed from what is currently stored on disk.
lockless_write_bytes#stats.snx11003 Cumulative lockless write bytes. This is a special kind of I/O where clients do not take any locks but instead instruct the server to take the locks on the client's behalf
lockless_read_bytes#stats.snx11003 Cumulative lockless read bytes. This is a special kind of I/O where clients do not take any locks but instead instruct the server to take the locks on the client's behalf
direct_write#stats.snx11003 Cumulative number of direct writes to storage
direct_read#stats.snx11003 Cumulative number of direct reads from storage
inode_permission#stats.snx11003 Cumulative number of checks for access rights to a given inode
removexattr#stats.snx11003 Cumulative number of remove attributes.  Command removes the extended attribute identified by name and associated with the given path in the filesystem.
listxattr#stats.snx11003 Cumulative number of listattr.  Command retrieves the list of extended attribute names associated with the given path in the filesystem.   
getxattr#stats.snx11003 Cumulative number of times the operation has occurred to retrieve the value of the extended attribute identified by name and associated with the given path in the filesystem.
setxattr#stats.snx11003 Cumulative number of calls to set extended attributes 
alloc_inode#stats.snx11003 Cumulative number of inode allocations; the system will allocate another inode as needed.
statfs#stats.snx11003 Cumulative number of calls to stat fs
getattr#stats.snx11003 Cumulative number of get attribute calls
flock#stats.snx11003 Cumulative count of file locks. The flock utility manages flock locks from within shell scripts or from the command line.
lockless_truncate#stats.snx11003 Cumulative number of file truncates without locking a file.
truncate#stats.snx11003 The cumulative number of events to shrink (or extend) the size of a file to the specified size
setattr#stats.snx11003 The cumulative number of times setattr was called.  This command sets the value of given attribute of an object
fsync#stats.snx11003 The cumulative number of fsync transfers.  fsync transfers ("flushes") all modified in-core data of (i.e.,modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted.  This includes writing through or flushing a disk cache if present.  The call blocks until the device reports that the transfer has completed.  As well as flushing the file data, fsync() also flushes the metadata information associated with the file.
seek#stats.snx11003 Cumulative File seeks
mmap#stats.snx11003 Cumulative number of new mapping in the virtual address space of the calling process.  The starting address for the new mapping is specified in addr.
close#stats.snx11003 Cumulative File Closes
open#stats.snx11003 Cumulative File Opens
ioctl#stats.snx11003 Cumulative  Input/Output control calls
brw_write#stats.snx11003 Cumulative bulk I/O writes to storage
brw_read#stats.snx11003 Cumulative bulk I/O reads from storage
write_bytes#stats.snx11003 Cumulative Writes to storage in bytes
read_bytes#stats.snx11003 Cumulative Reads to storage in bytes
writeback_failed_pages#stats.snx11003 Cumulative number of writeback failed pages.  Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem
writeback_ok_pages#stats.snx11003 Cumulative number of writeback success pages. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem
writeback_from_pressure#stats.snx11003 Cumulative number of writeback from pressure. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem
writeback_from_writepage#stats.snx11003 Cumulative number of writeback from writepages.  Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem
dirty_pages_misses#stats.snx11003 Cumulative number of Dirty page misses;  Dirty pages are the pages in memory (page cache) that have been updated and therefore have changed from what is currently stored on disk.
dirty_pages_hits#stats.snx11003 Cumulative number of Dirty page hits;  Dirty pages are the pages in memory (page cache) that have been updated and therefore have changed from what is currently stored on disk.
The following NIC metrics share a common caveat. The fundamental issue is that some of the performance counters count data that doesn't actually make it onto the HSN. There are overhead flits counted as part of some Get transactions, and demarcation packets within some transactions that are entirely generated by and consumed by the local Gemini. There isn't enough information available to compensate exactly for them. Option A takes a simplistic approach and ignores the issue; the extra bytes are counted as if they were message payload. Option B is preferred. It makes two assumptions we believe are reasonable: (1) packets that are part of BTE Puts will mostly be max-sized, and (2) the majority of Get requests will be BTE, not FMA. We believe this matches MPI's use. The BTE is used for large transfers; only the first and last packets of a transfer may be less than max-sized, and because the transfers are large, most packets will be neither. Option A may be more accurate if actual use does not match these assumptions.

SAMPLE_totaloutput_optB (B/s) NIC metric (see caveat above)
SAMPLE_bteout_optB (B/s) NIC metric (see caveat above)
SAMPLE_bteout_optA (B/s) NIC metric (see caveat above)
SAMPLE_fmaout (B/s) NIC metric (see caveat above)
SAMPLE_totalinput (B/s) NIC metric (see caveat above)
SAMPLE_totaloutput_optA (B/s) NIC metric (see caveat above)
totaloutput_optB NIC metric (see caveat above)
bteout_optB NIC metric (see caveat above)
bteout_optA NIC metric (see caveat above)
fmaout Cumulative number of Fast Memory Accesses (small transfers) by the node's NIC
totalinput Sum of total bytes for the node's NIC
totaloutput_optA NIC metric (see caveat above)
Z-_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) Percentage of time the Z negative link was in a credit stall state (link aggregated Gemini output stalls)
Z+_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) Percentage of time the Z positive link was in a credit stall state (link aggregated Gemini output stalls)
Y-_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) Percentage of time the Y negative link was in a credit stall state (link aggregated Gemini output stalls)
Y+_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) Percentage of time the Y positive link was in a credit stall state (link aggregated Gemini output stalls)
X-_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) Percentage of time the X negative link was in a credit stall state (link aggregated Gemini output stalls)
X+_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) Percentage of time the X positive link was in a credit stall state (link aggregated Gemini output stalls)
Z-_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) Percentage of time spent in the input queue stall state for the Z negative link
Z+_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) Percentage of time spent in the input queue stall state for the Z positive link
Y-_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) Percentage of time spent in the input queue stall state for the Y negative link
Y+_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) Percentage of time spent in the input queue stall state for the Y positive link
X-_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) Percentage of time spent in the input queue stall state for the X negative link
X+_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) Percentage of time spent in the input queue stall state for the X positive link
Z-_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) Average packet size for the Z negative link
Z+_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) Average packet size for the Z positive link
Y-_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) Average packet size for the Y negative link
Y+_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) Average packet size for the Y positive link
X-_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) Average packet size for the X negative link
X+_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) Average packet size for the X positive link
Z-_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) % of used bandwidth for the Z negative link
Z+_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) % of used bandwidth for the Z positive link
Y-_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) % of used bandwidth for the Y negative link
Y+_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) % of used bandwidth for the Y positive link
X-_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) % of used bandwidth for the X negative link
X+_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) % of used bandwidth for the X positive link
Z-_SAMPLE_GEMINI_LINK_BW (B/s) Z negative total transfer rate
Z+_SAMPLE_GEMINI_LINK_BW (B/s) Z positive total transfer rate
Y-_SAMPLE_GEMINI_LINK_BW (B/s) Y negative total transfer rate
Y+_SAMPLE_GEMINI_LINK_BW (B/s) Y positive total transfer rate
X-_SAMPLE_GEMINI_LINK_BW (B/s) X negative total transfer rate
X+_SAMPLE_GEMINI_LINK_BW (B/s) X positive total transfer rate
Z-_recvlinkstatus (1) Link status, listed as the number of communication lanes functioning; a fully functional Gemini for a complete torus will have 24 send and receive lanes
Z+_recvlinkstatus (1) Link status, listed as the number of communication lanes functioning; a fully functional Gemini for a complete torus will have 24 send and receive lanes
Y-_recvlinkstatus (1) Link status, listed as the number of communication lanes functioning; a fully functional Gemini for a complete torus will have 12 send and receive lanes
Y+_recvlinkstatus (1) Link status, listed as the number of communication lanes functioning; a fully functional Gemini for a complete torus will have 12 send and receive lanes
X-_recvlinkstatus (1) Link status, listed as the number of communication lanes functioning; a fully functional Gemini for a complete torus will have 24 send and receive lanes
X+_recvlinkstatus (1) Link status, listed as the number of communication lanes functioning; a fully functional Gemini for a complete torus will have 24 send and receive lanes
Z-_sendlinkstatus (1) Link status, listed as the number of communication lanes functioning; a fully functional Gemini for a complete torus will have 24 send and receive lanes
Z+_sendlinkstatus (1) Link status, listed as the number of communication lanes functioning; a fully functional Gemini for a complete torus will have 24 send and receive lanes
Y-_sendlinkstatus (1) Link status, listed as the number of communication lanes functioning; a fully functional Gemini for a complete torus will have 12 send and receive lanes
Y+_sendlinkstatus (1) Link status, listed as the number of communication lanes functioning; a fully functional Gemini for a complete torus will have 12 send and receive lanes
X-_sendlinkstatus (1) Link status, listed as the number of communication lanes functioning; a fully functional Gemini for a complete torus will have 24 send and receive lanes
X+_sendlinkstatus (1) Link status, listed as the number of communication lanes functioning; a fully functional Gemini for a complete torus will have 24 send and receive lanes
Z-_credit_stall (ns) Wait time between devices when the first device is waiting for a signal from the second device to send more data.
Z+_credit_stall (ns) Wait time between devices when the first device is waiting for a signal from the second device to send more data.
Y-_credit_stall (ns) Wait time between devices when the first device is waiting for a signal from the second device to send more data.
Y+_credit_stall (ns) Wait time between devices when the first device is waiting for a signal from the second device to send more data.
X-_credit_stall (ns) Wait time between devices when the first device is waiting for a signal from the second device to send more data.
X+_credit_stall (ns) Wait time between devices when the first device is waiting for a signal from the second device to send more data.
Z-_inq_stall (ns) Input queue stall time in nanoseconds for the Z negative link
Z+_inq_stall (ns) Input queue stall time in nanoseconds for the Z positive link
Y-_inq_stall (ns) Input queue stall time in nanoseconds for the Y negative link
Y+_inq_stall (ns) Input queue stall time in nanoseconds for the Y positive link
X-_inq_stall (ns) Input queue stall time in nanoseconds for the X negative link
X+_inq_stall (ns) Input queue stall time in nanoseconds for the X positive link
Z-_packets (1) Cumulative number of packets for Z negative
Z+_packets (1) Cumulative number of packets for Z positive
Y-_packets (1) Cumulative number of packets for Y negative
Y+_packets (1) Cumulative number of packets for Y positive
X-_packets (1) Cumulative number of packets for X negative
X+_packets (1) Cumulative number of packets for X positive
Z-_traffic (B) Cumulative traffic in bytes for Z negative
Z+_traffic (B) Cumulative traffic in bytes for Z positive
Y-_traffic (B) Cumulative traffic in bytes for Y negative
Y+_traffic (B) Cumulative traffic in bytes for Y positive
X-_traffic (B) Cumulative traffic in bytes for X negative
X+_traffic (B) Cumulative traffic in bytes for X positive
nettopo_mesh_coord_Z Z position in the torus
nettopo_mesh_coord_Y Y position in the torus
nettopo_mesh_coord_X X position in the torus
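Because most of the fields above are cumulative counters, converting them to rates requires differencing consecutive samples for the same node. The sketch below illustrates this under the assumption that records are dictionaries keyed by the HEADER column names, as produced by the parsing sketch earlier; the metric name and node ID shown in the usage comment are illustrative.

```python
# Minimal sketch: derive per-interval rates from a cumulative counter for one node.
# Assumes records are dicts keyed by HEADER column names (see the parsing sketch above).
def rate_series(records, metric, node_id):
    prev_time, prev_value = None, None
    for record in records:
        if record["CompId"] != node_id:
            continue
        t = float(record["#Time"])
        v = float(record[metric])
        if prev_time is not None and t > prev_time and v >= prev_value:
            # Intervals where the counter decreased (e.g. after a node reboot) are skipped.
            yield t, (v - prev_value) / (t - prev_time)
        prev_time, prev_value = t, v

# Example usage (illustrative metric and node ID):
#   for t, bytes_per_sec in rate_series(records, "write_bytes#stats.snx11003", "8672"):
#       print(t, bytes_per_sec)
```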

 

1 B. MSR Data File

The MSR data file contains a header followed by the associated data elements; those elements are explained later in this document. The header for each MSR comma-separated value (CSV) data file is formatted as follows:

#Time, Time_usec, CompId, Ctr0, Ctr0_c00, Ctr0_c08, Ctr0_c16, Ctr0_c24, Ctr1, Ctr1_c00, Ctr1_c08, Ctr1_c16, Ctr1_c24, Ctr2, Ctr2_c00, Ctr2_c08, Ctr2_c16, Ctr2_c24, Ctr3, Ctr3_c00, Ctr3_c08, Ctr3_c16, Ctr3_c24, Ctr4, Ctr4_c00, Ctr4_c01, Ctr4_c02, Ctr4_c03, Ctr4_c04, Ctr4_c05, Ctr4_c06, Ctr4_c07, Ctr4_c08, Ctr4_c09, Ctr4_c10, Ctr4_c11, Ctr4_c12, Ctr4_c13, Ctr4_c14, Ctr4_c15, Ctr4_c16, Ctr4_c17, Ctr4_c18, Ctr4_c19, Ctr4_c20, Ctr4_c21, Ctr4_c22, Ctr4_c23, Ctr4_c24, Ctr4_c25, Ctr4_c26, Ctr4_c27, Ctr4_c28, Ctr4_c29, Ctr4_c30, Ctr4_c31, Ctr5, Ctr5_c00, Ctr5_c01, Ctr5_c02, Ctr5_c03, Ctr5_c04, Ctr5_c05, Ctr5_c06, Ctr5_c07, Ctr5_c08, Ctr5_c09, Ctr5_c10, Ctr5_c11, Ctr5_c12, Ctr5_c13, Ctr5_c14, Ctr5_c15, Ctr5_c16, Ctr5_c17, Ctr5_c18, Ctr5_c19, Ctr5_c20, Ctr5_c21, Ctr5_c22, Ctr5_c23, Ctr5_c24, Ctr5_c25, Ctr5_c26, Ctr5_c27, Ctr5_c28, Ctr5_c29, Ctr5_c30, Ctr5_c31, Ctr6, Ctr6_c00, Ctr6_c01, Ctr6_c02, Ctr6_c03, Ctr6_c04, Ctr6_c05, Ctr6_c06, Ctr6_c07, Ctr6_c08, Ctr6_c09, Ctr6_c10, Ctr6_c11, Ctr6_c12, Ctr6_c13, Ctr6_c14, Ctr6_c15, Ctr6_c16, Ctr6_c17, Ctr6_c18, Ctr6_c19, Ctr6_c20, Ctr6_c21, Ctr6_c22, Ctr6_c23, Ctr6_c24, Ctr6_c25, Ctr6_c26, Ctr6_c27, Ctr6_c28, Ctr6_c29, Ctr6_c30, Ctr6_c31, Ctr7, Ctr7_c00, Ctr7_c01, Ctr7_c02, Ctr7_c03, Ctr7_c04, Ctr7_c05, Ctr7_c06, Ctr7_c07, Ctr7_c08, Ctr7_c09, Ctr7_c10, Ctr7_c11, Ctr7_c12, Ctr7_c13, Ctr7_c14, Ctr7_c15, Ctr7_c16, Ctr7_c17, Ctr7_c18, Ctr7_c19, Ctr7_c20, Ctr7_c21, Ctr7_c22, Ctr7_c23, Ctr7_c24, Ctr7_c25, Ctr7_c26, Ctr7_c27, Ctr7_c28, Ctr7_c29, Ctr7_c30, Ctr7_c31, Ctr8, Ctr8_c00, Ctr8_c01, Ctr8_c02, Ctr8_c03, Ctr8_c04, Ctr8_c05, Ctr8_c06, Ctr8_c07, Ctr8_c08, Ctr8_c09, Ctr8_c10, Ctr8_c11, Ctr8_c12, Ctr8_c13, Ctr8_c14, Ctr8_c15, Ctr8_c16, Ctr8_c17, Ctr8_c18, Ctr8_c19, Ctr8_c20, Ctr8_c21, Ctr8_c22, Ctr8_c23, Ctr8_c24, Ctr8_c25, Ctr8_c26, Ctr8_c27, Ctr8_c28, Ctr8_c29, Ctr8_c30, Ctr8_c31, Ctr9, Ctr9_c00, Ctr9_c01, Ctr9_c02, Ctr9_c03, Ctr9_c04, Ctr9_c05, Ctr9_c06, Ctr9_c07, Ctr9_c08, Ctr9_c09, Ctr9_c10, Ctr9_c11, Ctr9_c12, Ctr9_c13, Ctr9_c14, Ctr9_c15, Ctr9_c16, Ctr9_c17, Ctr9_c18, Ctr9_c19, Ctr9_c20, Ctr9_c21, Ctr9_c22, Ctr9_c23, Ctr9_c24, Ctr9_c25, Ctr9_c26, Ctr9_c27, Ctr9_c28, Ctr9_c29, Ctr9_c30, Ctr9_c31

Future data files may differ, so reference the corresponding header file. Refer to Table 1 for information on the meaning of the counters (Ctr0-9).

                                                 Table 1 – Header Details

Counter | MSR Counter Definition | What is being measured | Validation Number
Ctr0 | L3_CACHE_MISSES per NUMA domain (4 counters) | Memory Controller Counts | 85903603681
Ctr1 | DCT_PREFETCH per NUMA domain (4 counters) | Memory Controller Counts | 73018664176
Ctr2 | DCT_RD_TOT per NUMA domain, for each controller (4 counters) | Memory Controller Counts | 730186636664
Ctr3 | DCT_WRT per NUMA domain (4 counters) | Memory Controller Counts | 73018644976
Ctr4 | TOT-CYC per core (32 counters) | Total processor cycles for each core | 4391030
Ctr5 | TOT INS per core (32 counters) | Total instructions for each core | 4391104
Ctr6 | L1_DCM per core (32 counters) | L1 data cache misses for each core | 4391233
Ctr7 | Retired flops per core, all types (32 counters) | Number of retired floating-point operations per core | 4456195
Ctr8 | Operation counts per core (32 counters) | Vector unit instructions per core | 4392139
Ctr9 | Translation Lookaside Buffer (TLB) data misses per core (32 counters) | TLB data misses per core | 4392774

How to Read the MSR Data File

Reference this sample line from an MSR comma-separated value (CSV) data file:

1480996440.004670, 4670, 8672, 85903603681, 1075675482, 957589463, 717738766, 744067220, 73018664176, 412116844, 369559710, 125781222, 119420227, 73018663664, 1703147611, 1424989459, 813186830, 824852929, 73018644976, 941771093, 910328449, 393929432, 383752296, 4391030, 571110602344, 562415965217, 556924102961, 554761201724, 552701273182, 551182100824, 551820818084, 550270895655, 560016645494, 553663646637, 549783500782, 549381437004, 539166673211, 539742737722, 540313267024, 539874939820, 150688025199, 148035712784, 162794766413, 158869157833, 163899518848, 161418031801, 164150890354, 162961337921, 147052930419, 144195801312, 146327145247, 144225245105, 166696559177, 164632544561, 193190727930, 214273693014, 4391104, 503840210303, 640187476527, 597059695706, 596812465922, 594595322971, 595837373660, 591893240209, 592532355039, 596434671942, 592577910304, 593366025588, 591517987927, 590791274450, 592685890076, 591352966301, 591685943383, 157896260541, 155033869123, 159496313138, 155325793581, 158132877662, 156241919544, 156052879923, 156028880128, 161016004873, 160206105921, 157705581605, 154634583769, 157143534884, 156100936718, 158578417504, 163794562360, 4391233, 3350723806, 617055349, 543869780, 471503058, 461166615, 435289712, 465346004, 443590008, 503874718, 438612212, 452681008, 463562281, 416589356, 423156695, 429588594, 421988887, 295675170, 289542436, 313223710, 308085728, 325639270, 315319292, 365086035, 314506675, 287651138, 275625001, 271477379, 271835640, 335907639, 323885751, 341101565, 521244927, 4456195, 30779324559, 30779336977, 30779328042, 30779334057, 30779323889, 30779323032, 30779323435, 30779323120, 30779323916, 30779323364, 30779324117, 30779323774, 30779322949, 30779322839, 30779322762, 30779322922, 30778881320, 30778881309, 30778881499, 30778881574, 30778881320, 30778881385, 30778881216, 30778881349, 30778881632, 30778881297, 30778884573, 30778881471, 30778881316, 30778881237, 30778881391, 30778881787, 4392139, 14761832555, 14582318338, 14589458999, 14583243493, 14582424396, 14581571672, 14582586746, 14581792888, 14603781577, 14582028453, 14583226586, 14581883168, 14582393085, 14581535539, 14581592416, 14581549938, 14514853391, 14513979629, 14514014497, 14513973145, 14513995833, 14513974554, 14514007272, 14513980413, 14524818983, 14513954446, 14514343883, 14513937916, 14514059869, 14514016689, 14514188159, 14514218075, 4392774, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

How to interpret the numbers above

Ten counters are used, with 4 values each for counters 0-3 and 32 values each for counters 4-9. Each sample for each of the 26,868 Blue Waters compute nodes appears on a separate line in the CSV file. Each file contains a full day of data for every compute node at 1-minute samples, so each file should have approximately 38.7 million lines (26,868 nodes x 1,440 samples). Each data block starts with three leading elements before the Ctr0-9 data.

Therefore, in the sample CSV data file above, the three leading comma-separated elements correspond to #Time (epoch time), Time_usec, and CompId. Each of the 10 counters is then represented by its validation number followed by its associated values. The validation number in each data block for Ctr0-9 must equal the validation number listed in the tables for the associated data to be considered valid.
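The following sketch shows one way to split an MSR line into its counter blocks and apply the validation check described above. The counter widths follow the header layout (4 values for Ctr0-3, 32 for Ctr4-9) and the validation numbers are transcribed from Table 1; file handling around it is omitted.

```python
# Minimal sketch: split one MSR CSV line into counter blocks and check each
# block's validation number against the values transcribed from Table 1.
VALIDATION = {0: 85903603681, 1: 73018664176, 2: 730186636664, 3: 73018644976,
              4: 4391030, 5: 4391104, 6: 4391233, 7: 4456195, 8: 4392139, 9: 4392774}
WIDTHS = {ctr: (4 if ctr < 4 else 32) for ctr in range(10)}

def parse_msr_line(line):
    fields = [f.strip() for f in line.split(",")]
    record = {"time": float(fields[0]),      # #Time, epoch seconds
              "time_usec": int(fields[1]),   # Time_usec
              "comp_id": int(fields[2])}     # CompId (node ID)
    position = 3
    for ctr in range(10):
        validation = int(fields[position])
        values = [int(v) for v in fields[position + 1: position + 1 + WIDTHS[ctr]]]
        record[f"ctr{ctr}"] = {"valid": validation == VALIDATION[ctr], "values": values}
        position += 1 + WIDTHS[ctr]
    return record
```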

Table 2 below shows the three leading elements of the counter data set plus the 10 counters. Compare Table 2 with the sample CSV file above to see the relationship between the counter headers, validation numbers, and their associated data.

                                                  Table 2 – Counter Examples

MSR Counter Definition | Validation Number | Counter Values
#Time |  | 1480996440.004670
Time_Usec |  | 4670
CompID |  | 8672
Ctr0 x 4: L3_CACHE_MISSES per NUMA domain | 85903603681 | 1075675482, 957589463, 717738766, 744067220
Ctr1 x 4: DCT_PREFETCH per NUMA domain | 73018664176 | 412116844, 369559710, 125781222, 119420227
Ctr2 x 4: DCT_RD_TOT per NUMA domain | 730186636664 | 1703147611, 1424989459, 813186830, 824852929
Ctr3 x 4: DCT_WRT per NUMA domain | 73018644976 | 941771093, 910328449, 393929432, 383752296
Ctr4 x 32: TOT-CYC per core | 4391030 | 571110602344, 562415965217, 556924102961, 554761201724, 552701273182, 551182100824, 551820818084, 550270895655, 560016645494, 553663646637, 549783500782, 549381437004, 539166673211, 539742737722, 540313267024, 539874939820, 150688025199, 148035712784, 162794766413, 158869157833, 163899518848, 161418031801, 164150890354, 162961337921, 147052930419, 144195801312, 146327145247, 144225245105, 166696559177, 164632544561, 193190727930, 214273693014
Ctr5 x 32: TOT INS per core | 4391104 | the 32 values following 4391104
Ctr6 x 32: L1 DCM per core | 4391233 | the next 32 values
Ctr7 x 32: retired flops per core (all types of flops) | 4456195 | the next 32 values
Ctr8 x 32: vector instructions per core | 4392139 | the next 32 values
Ctr9 x 32: TLB DM per core | 4392774 | the next 32 values


2.  Syslogs Data

Syslog is a standard for sending and receiving notification messages, in a particular format, from various network devices. The messages include time stamps, event messages, severity, host IP addresses, diagnostics, and more. Syslog was designed to monitor network devices and systems and to send out notification messages if there are any issues with functioning; it also sends out alerts for pre-notified events and monitors suspicious activity via the change log/event log of participating network devices.

The posted logs have been anonymized to replace usernames and to remove ssh and sudo lines.

3. Resource Manager data (Torque)

Please see Chapter 10: Accounting Records on the Adaptive Computing website for background information on job log and accounting data.

The standard accounting logs are posted with the username and project/group name fields anonymized.

4.    System Environment Data Collections

Cabinet and chassis data is separated into four types of files:

L1_ENV_DATA

CSV with the following fields:    

service_id, datetime, PCB_TEMP, INLET_TEMP, XDP_AIRTEMP, CAB_KILOWATTS, FANSPEED

L1_XT5_STATUS

CSV with the following fields:            

service_id,datetime,L1_S_XT5_FWLEVEL,L1_H_XT5_PWRSTATUS,L1_H_XT5_CABHEALTH,L1_S_XT5_FANSPEED,L1_S_XT5_FANMODE,L1_S_XT5_VFD_REG,L1_S_XT5_DOORSTAT,L1_H_XT5_CAGE0VRMSTAT,L1_H_XT5_CAGE1VRMSTAT,L1_H_XT5_CAGE2VRMSTAT,L1_H_XT5_VALERE_SH0_SL0,L1_H_XT5_VALERE_SH0_SL1,L1_H_XT5_VALERE_SH0_SL2,L1_H_XT5_VALERE_SH1_SL0,L1_H_XT5_VALERE_SH1_SL1,L1_H_XT5_VALERE_SH1_SL2,L1_H_XT5_VALERE_SH2_SL0,L1_H_XT5_VALERE_SH2_SL1,L1_H_XT5_VALERE_SH2_SL2,L1_S_XT5_VALERE_SHAREFAULTS,L1_H_XT5_XDPALARM

L1_XT5_TEMPS        

CSV with the following fields:    

service_id,datetime,L1_T_XT5_PCBTEMP,L1_T_XT5_INLETTEMP,L1_T_XT5_XDPAIRTEMP,L1_T_XT5_XDPSTARTTEMP,L1_T_XT5_VALERE_FET_SH0_SL0,L1_T_XT5_VALERE_FET_SH0_SL1,L1_T_XT5_VALERE_FET_SH0_SL2,L1_T_XT5_VALERE_FET_SH1_SL0,L1_T_XT5_VALERE_FET_SH1_SL1,L1_T_XT5_VALERE_FET_SH1_SL2,L1_T_XT5_VALERE_FET_SH2_SL0,L1_T_XT5_VALERE_FET_SH2_SL1,L1_T_XT5_VALERE_FET_SH2_SL2

L1_XT5_VOLTS

CSV with the following fields:    

service_id,datetime,L1_V_XT5_PCB5VA,L1_V_XT5_PCB5VB,L1_V_XT5_PCB3V,L1_V_XT5_PCB2V,L1_V_XT5_VALERE_SH0_SL0,L1_V_XT5_VALERE_SH0_SL1,L1_V_XT5_VALERE_SH0_SL2,L1_V_XT5_VALERE_SH1_SL0,L1_V_XT5_VALERE_SH1_SL1,L1_V_XT5_VALERE_SH1_SL2,L1_V_XT5_VALERE_SH2_SL0,L1_V_XT5_VALERE_SH2_SL1,L1_V_XT5_VALERE_SH2_SL2,L1_I_XT5_VALERE_SH0_SL0,L1_I_XT5_VALERE_SH0_SL1,L1_I_XT5_VALERE_SH0_SL2,L1_I_XT5_VALERE_SH1_SL0,L1_I_XT5_VALERE_SH1_SL1,L1_I_XT5_VALERE_SH1_SL2,L1_I_XT5_VALERE_SH2_SL0,L1_I_XT5_VALERE_SH2_SL1,L1_I_XT5_VALERE_SH2_SL2,L1_P_XT5_CABKILOWATTS

 

Definitions for all parameters are not available, but the following can be provided at this time:

Parameter Definitions

PCB_TEMP Printed circuit board temperature; this is the L1 controller PCB.
INLET_TEMP Incoming air temperature at the bottom of each cabinet, in centigrade. The cabinet shuts down at 32 degrees.
XDP_AIRTEMP XDP is the cooling unit that circulates coolant through the cabinets. This temperature is measured above each cabinet.
CAB_KILOWATTS Total DC power being used by the cabinet. Does not include the cooling fan.
FANSPEED Speed of the 7.5 HP motor driving the circulating fan. Maximum speed is 75.
VALERE The manufacturer of the power supplies used in the cabinets. Each cabinet has 7 of them; there are 9 slots (3 rows, 3 high), so 2 are always empty. Most of the VALERE outputs reference temperatures.
FET The field effect transistors, which are the components that actually perform the voltage regulation.
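As an illustration of working with the L1_ENV_DATA files, the sketch below reads the fields listed above and flags cabinets whose inlet temperature approaches the 32-degree shutdown point mentioned in the table. The file name and the assumption that the files contain no header row are hypothetical.

```python
# Minimal sketch: scan an L1_ENV_DATA CSV for cabinets with a high inlet
# temperature (the table above states cabinets shut down at 32 degrees C).
import csv

FIELDS = ["service_id", "datetime", "PCB_TEMP", "INLET_TEMP",
          "XDP_AIRTEMP", "CAB_KILOWATTS", "FANSPEED"]

def hot_cabinets(path, threshold_c=30.0):
    with open(path) as f:
        for row in csv.reader(f):
            record = dict(zip(FIELDS, (value.strip() for value in row)))
            try:
                inlet = float(record["INLET_TEMP"])
            except (KeyError, ValueError):
                continue  # skip malformed or header rows
            if inlet >= threshold_c:
                yield record["service_id"], record["datetime"], inlet

if __name__ == "__main__":
    for service_id, when, temp in hot_cabinets("L1_ENV_DATA.csv"):   # hypothetical name
        print(f"{service_id} {when} inlet={temp:.1f} C")
```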

 

5.   Darshan Data

Darshan is a lightweight and scalable I/O profiling tool. Darshan is able to collect profile information for POSIX, HDF5, NetCDF, and MPI-IO calls. Darshan profile data can be used to investigate and tune the I/O behavior of MPI applications. Darshan can be used only with MPI applications; the application, at minimum, must call MPI_Init and MPI_Finalize.

For information on Darshan Data, go to:  https://www.mcs.anl.gov/research/projects/darshan/

The available tar files contain an anonymized version of the Darshan output for each Blue Waters job; not all Blue Waters jobs used Darshan, however. The anonymization step used the same user map as the other data above and obfuscates the username, uid, and path (keeping the base file system path intact). The resulting files should be readable by the normal Darshan analysis tools. The anonymization was performed with a modified version of the darshan-convert utility that leaves more information (such as the application name) intact, along with a Python driver.

6.  Lustre User Experience Metrics

Active probing of filesystem components

A Lustre HPC file system is complex, with activity influenced by multiple users and subsystems, so abnormal behavior can be difficult to identify. To provide better insight into file system activity, the Integrated System Console (ISC), an active monitoring tool for server storage data, was used. At its core, ISC processes logs and job metadata, stores this data, and provides a mechanism for viewing the collected data through a web interface.

The component data probes actively monitor server storage and metadata by writing to every component of the server storage file system to measure performance. This active probing of the file system differs from the Cray sampler data in that the latter is passive, essentially counting operations and data flows from each compute node.

How data is collected

Data was collected from three file system client types: the MOM, login, and import/export nodes. Problems typically associated with these components are:

  • a large number of metadata operations
  • a very large I/O to a striped file
  • a moderate amount of I/O to an unstriped file

Service Nodes - Service node is a general term for a non-compute node. The service nodes that launch jobs are more specifically called "MOM" nodes. The server storage hosts within the main computer system that launch jobs (MOM) are used to represent file system interactions using the same clients as the compute nodes and using LNET routers to access the file system data.

Import/Export (aka DTN) Nodes - Data is collected to measure access via InfiniBand.

Login Nodes - Login nodes are used for administrative tasks like copying, editing, and transferring files. For example, if a user connects via an SSH client, they are connecting to a login node. Login nodes are also used when a user compiles code and submits jobs to the batch scheduler. Login nodes are measured to represent users' impact on each other's behavior. The login nodes require collection from each host, as user interference can be unique to a host.

How to Understand the Data

The data files are comma-separated in the following format:

Collection host, time, operation, filesystem, ost ID, measurement time

Collection host is one of the following:

  • H2ologin[1-4]: collections from the login nodes, which have multi-user access, via InfiniBand
  • Mom[1-64]: machines inside the high-speed network that use LNET routers
  • Ie[01-28]: import/export nodes without login access, using InfiniBand

Time: in epoch

Operation:

Operation | Test ID
create | 1
write | 2
rmdir | 3
end | 4
single file create | 5
single file delete | 6

Filesystem:

Filesystem name | Before 2016-02-22 | After 2016-02-22
home | snx11001 | snx11002
projects | snx11002 | snx11001
scratch | snx11003 | snx11003

Ost ID:  Lustre node number for the filesystem server

Measurement time:   Time in milliseconds to perform the operation
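A small sketch for decoding these rows, using the field order, test IDs, and filesystem name mapping given above, is shown below. It assumes the operation column holds the numeric test ID and the filesystem column holds the snx name; verify both against the actual files, and note that the file name used here is hypothetical.

```python
# Minimal sketch: decode rows of the Lustre user-experience metrics files.
import csv
from datetime import datetime, timezone

OPERATIONS = {1: "create", 2: "write", 3: "rmdir", 4: "end",
              5: "single file create", 6: "single file delete"}
CUTOVER = datetime(2016, 2, 22, tzinfo=timezone.utc)
FS_BEFORE = {"snx11001": "home", "snx11002": "projects", "snx11003": "scratch"}
FS_AFTER = {"snx11002": "home", "snx11001": "projects", "snx11003": "scratch"}

def decode_rows(path):
    with open(path) as f:
        for row in csv.reader(f):
            if len(row) != 6:
                continue  # skip malformed rows
            host, epoch, operation, filesystem, ost_id, measurement_ms = (v.strip() for v in row)
            when = datetime.fromtimestamp(float(epoch), tz=timezone.utc)
            mapping = FS_BEFORE if when < CUTOVER else FS_AFTER
            yield {"host": host,
                   "time": when,
                   "operation": OPERATIONS.get(int(operation), operation) if operation.isdigit() else operation,
                   "filesystem": mapping.get(filesystem, filesystem),
                   "ost_id": ost_id,
                   "latency_ms": float(measurement_ms)}

if __name__ == "__main__":
    for record in decode_rows("lustre_user_experience.csv"):   # hypothetical name
        print(record)
        break
```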

 
