Blue Waters Data Sets
For questions email: help+bw@ncsa.illinois.edu
Overview
The Blue Waters data set is the result of scientific data processing since 2012 at the Petascale Computing Facility, sponsored by the National Science Foundation. Blue Waters data is publicly available for viewing and downloading. The data has been anonymized; that is, any personnel/account names associated with the data have been removed.
General Description of Collected Data
OVIS, an open-source software suite designed for monitoring the performance, health, and efficiency of large-scale computing systems, was used for most of the data collection. OVIS gathers this data through an API and network protocol called the Lightweight Distributed Metric Service (LDMS).
Blue Waters data comprises statistics compiled on various computer hardware and software activities; a few examples are:
- I/O data on various components such as NICs
- statistics on node usage
- memory allocations
- CPU or GPU performance
- reads, writes, caching, and file opens and closes
- file transfers and data calls
- communication link status
For each data set, a detailed description of the actual data set or data elements is provided, or a link is given for obtaining more information.
How to Get Access to the Data
Access to the datasets is provided via https://www.globus.org/. You may login with an existing institutional account or create a new account at that site.
The collection name is Blue Waters System Monitoring Data Set and can be found by searching for that name within Globus.
Data Types Available
Each data type below is described in more detail further down the page.
1. Node metric, compute and service node (time series) data
A. Cray system sampler data
B. Model Specific Registers (MSR) data
2. Syslogs data
3. Resource Manager data (Torque)
4. System Environment Data Collections
5. Darshan data (I/O data)
6. Lustre User Experience Metrics
Explanation of Data Types
1. Node metric, compute and service node (time series) data
1 A. Cray System Sampler Data
The following is a description of the node and time series data contents.
Fields with units of B are raw byte counts at the time of the sample. Most other values are also raw counts at the sample time. The few rate data points are denoted as B/s (bytes per second). A short sketch for deriving per-interval rates from the cumulative counters is given after the table.
To parse the data files, the appropriate header should be used to determine the position of each field within the comma-separated data file. Because the data format has changed slightly over time, there are files named HEADER.<date range> that denote the format for each date range.
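As an illustration of that parsing step, here is a minimal Python sketch; the header and data file names are hypothetical placeholders, and the columns referenced at the end (#Time, CompId, current_freemem) are taken from the table below.

```python
import csv

# Hypothetical file names; use the HEADER.<date range> file whose range
# covers the dates of the data file being read.
header_path = "HEADER.20130101-20131231"
data_path = "node_metrics_20130615.csv"

# The HEADER file holds the comma-separated column names for that date range.
with open(header_path) as f:
    columns = [c.strip() for c in next(csv.reader(f))]

# Pair each sample's values with the column names from the header.
with open(data_path) as f:
    for row in csv.reader(f):
        sample = dict(zip(columns, (v.strip() for v in row)))
        # Example: free memory reported by each node at each sample time.
        print(sample["#Time"], sample["CompId"], sample["current_freemem"])
```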
File Data Name | File Data Definition |
#Time | Time in epoch (GMT) |
Time_usec | partial second time in microseconds to the right of the decimal point |
CompId | Node ID |
Tesla_K20X.gpu_util_rate | Utilization reported by NVIDIA at the time of sample (see attached NVIDIA documentation for more info) |
Tesla_K20X.gpu_agg_dbl_ecc_total_errors | GPU double-bit ECC errors (see attached NVIDIA documentation for more info) |
Tesla_K20X.gpu_agg_dbl_ecc_texture_memory | GPU double-bit ECC errors for the texture memory (see attached NVIDIA documentation for more info) |
Tesla_K20X.gpu_agg_dbl_ecc_register_file | GPU double-bit ECC errors for the register file (see attached NVIDIA documentation for more info) |
Tesla_K20X.gpu_agg_dbl_ecc_device_memory | GPU double-bit ECC errors for device memory (see attached NVIDIA documentation for more info) |
Tesla_K20X.gpu_agg_dbl_ecc_l2_cache | GPU double-bit ECC errors for the Level 2 cache (see attached NVIDIA documentation for more info) |
Tesla_K20X.gpu_agg_dbl_ecc_l1_cache | GPU double-bit ECC errors for the Level 1 cache (see attached NVIDIA documentation for more info) |
Tesla_K20X.gpu_memory_used | GPU memory in use, in KB (see attached NVIDIA documentation for more info) |
Tesla_K20X.gpu_temp | GPU temperature in Celsius |
Tesla_K20X.gpu_pstate | Power management state (see attached NVIDIA documentation for more info) |
Tesla_K20X.gpu_power_limit | Power limit (maximum) in milliwatts |
Tesla_K20X.gpu_power_usage | GPU power consumption in milliwatts |
ipogif0_tx_bytes | Bytes transmitted with TCP/IP over the Gemini interface |
ipogif0_rx_bytes | Bytes received with TCP/IP over the Gemini interface |
RDMA_rx_bytes | Remote Direct Memory Access (RDMA) received bytes |
RDMA_nrx | RDMA number of cumulative receives |
RDMA_tx_bytes | RDMA cumulative transmit bytes |
RDMA_ntx | RDMA cumulative number of transfers |
SMSG_rx_bytes | Cumulative bytes received via the Short Message protocol (refer to Cray documentation) |
SMSG_nrx | Cumulative number of Short Message receives (refer to Cray documentation) |
SMSG_tx_bytes | Short Message transmit bytes (refer to Cray documentation) |
SMSG_ntx | Short Message number of transmits (refer to Cray documentation) |
current_freemem | Unallocated memory in KB |
loadavg_total_processes | Unix load of all processes ready to run, average x 100 |
loadavg_running_processes | Unix load of processes in the running state, average x 100 |
loadavg_5min(x100) | Unix load 5-minute average x 100 |
loadavg_latest(x100) | Current Unix load x 100 |
nr_writeback | Number of pages scheduled for writeback but not yet completed |
nr_dirty | Number of pages waiting to be scheduled to the output device |
lockless_write_bytes#stats.snx11001 | Cumulative number of lockless write I/O bytes. This is a special kind of I/O where clients do not get any locks but instead instruct the server to take the locks on the client’s behalf |
lockless_read_bytes#stats.snx11001 | Cumulative number of lockless read I/O bytes. This is a special kind of I/O where clients do not get any locks but instead instruct the server to take the locks on the client’s behalf |
direct_write#stats.snx11001 | Cumulative number of writes to storage |
direct_read#stats.snx11001 | Cumulative number of reads to storage |
inode_permission#stats.snx11001 | Cumulative number of checks for access rights to a given inode |
removexattr#stats.snx11001 | Cumulative number of remove attributes. Command removes the extended attribute identified by name and associated with the given path in the filesystem. |
listxattr#stats.snx11001 | Cumulative number of listattr. Command retrieves the list of extended attribute names associated with the given path in the filesystem. |
getxattr#stats.snx11001 | Cumulative number of times operation has occurred to retrieve the value of the extended attribute identified by name and associated with the given path in the filesystem. |
setxattr#stats.snx11001 | Cumulative number of calls to set extended attributes |
alloc_inode#stats.snx11001 | Cumulative number of Fragmentations. System will allocate another inode as needed. |
statfs#stats.snx11001 | Cumulative number of calls to stat fs |
getattr#stats.snx11001 | Cumulative number of get attribute calls |
flock#stats.snx11001 | Cumulative count of file locks (flock calls). The flock utility manages flock locks from within shell scripts or from the command line. |
lockless_truncate#stats.snx11001 | Cumulative number of file truncates without locking a file. |
truncate#stats.snx11001 | The cumulative number of events to shrink (or extend) the size of a file to the specified size |
setattr#stats.snx11001 | The cumulative number of times setattr was called. This command sets the value of given attribute of an object |
fsync#stats.snx11001 | The cumulative number of fsync calls. fsync transfers ("flushes") all modified in-core data (i.e., modified buffer cache pages) of the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. As well as flushing the file data, fsync() also flushes the metadata information associated with the file. |
seek#stats.snx11001 | Cumulative File seeks |
mmap#stats.snx11001 | Cumulative number of new mappings in the virtual address space of the calling process. The starting address for a new mapping is specified in addr. |
close#stats.snx11001 | Cumulative File Closes |
open#stats.snx11001 | Cumulative File Opens |
ioctl#stats.snx11001 | Cumulative Input/Output control calls |
brw_write#stats.snx11001 | Cumulative bulk (brw) writes to storage |
brw_read#stats.snx11001 | Cumulative bulk (brw) reads to storage |
write_bytes#stats.snx11001 | Cumulative Writes to storage in bytes |
read_bytes#stats.snx11001 | Cumulative Reads to storage in bytes |
writeback_failed_pages#stats.snx11001 | Cumulative number of writeback failed pages. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem |
writeback_ok_pages#stats.snx11001 | Cumulative number of writeback success pages. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem |
writeback_from_pressure#stats.snx11001 | Cumulative number of writeback from pressure. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem |
writeback_from_writepage#stats.snx11001 | Cumulative number of writeback from writepages. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem |
dirty_pages_misses#stats.snx11001 | Cumulative number of Dirty page misses; Dirty pages are the pages in memory (page cache) that have been updated and therefore have changed from what is currently stored on disk. |
dirty_pages_hits#stats.snx11001 | Cumulative number of Dirty page hits; Dirty pages are the pages in memory (page cache) that have been updated and therefore have changed from what is currently stored on disk. |
lockless_write_bytes#stats.snx11002 | Cumulative number of lockless write I/O bytes. This is a special kind of I/O where clients do not get any locks but instead instruct the server to take the locks on the client’s behalf |
lockless_read_bytes#stats.snx11002 | Cumulative number of lockless read I/O bytes. This is a special kind of I/O where clients do not get any locks but instead instruct the server to take the locks on the client’s behalf |
direct_write#stats.snx11002 | Cumulative number of writes to storage |
direct_read#stats.snx11002 | Cumulative number of reads to storage |
inode_permission#stats.snx11002 | Cumulative number of checks for access rights to a given inode |
removexattr#stats.snx11002 | Cumulative number of remove attributes. Command removes the extended attribute identified by name and associated with the given path in the filesystem. |
listxattr#stats.snx11002 | Cumulative number of listattr. Command retrieves the list of extended attribute names associated with the given path in the filesystem. |
getxattr#stats.snx11002 | Cumulative number of times operation has occurred to retrieve the value of the extended attribute identified by name and associated with the given path in the filesystem. |
setxattr#stats.snx11002 | Cumulative number of calls to set extended attributes |
alloc_inode#stats.snx11002 | Cumulative number of Fragmentations. System will allocate another inode as needed. |
statfs#stats.snx11002 | Cumulative number of calls to stat fs |
getattr#stats.snx11002 | Cumulative number of get attribute calls |
flock#stats.snx11002 | Cumulative count of file locks (flock calls). The flock utility manages flock locks from within shell scripts or from the command line. |
lockless_truncate#stats.snx11002 | Cumulative number of file truncates without locking a file. |
truncate#stats.snx11002 | The cumulative number of events to shrink (or extend) the size of a file to the specified size |
setattr#stats.snx11002 | The cumulative number of times setattr was called. This command sets the value of given attribute of an object |
fsync#stats.snx11002 | The cumulative number of fsync calls. fsync transfers ("flushes") all modified in-core data (i.e., modified buffer cache pages) of the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. As well as flushing the file data, fsync() also flushes the metadata information associated with the file. |
seek#stats.snx11002 | Cumulative File seeks |
mmap#stats.snx11002 | Cumulative number of new mappings in the virtual address space of the calling process. The starting address for a new mapping is specified in addr. |
close#stats.snx11002 | Cumulative File Closes |
open#stats.snx11002 | Cumulative File Opens |
ioctl#stats.snx11002 | Cumulative Input/Output control calls |
brw_write#stats.snx11002 | Cumulative bulk (brw) writes to storage |
brw_read#stats.snx11002 | Cumulative bulk (brw) reads to storage |
write_bytes#stats.snx11002 | Cumulative Writes to storage in bytes |
read_bytes#stats.snx11002 | Cumulative Reads to storage in bytes |
writeback_failed_pages#stats.snx11002 | Cumulative number of writeback failed pages. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem |
writeback_ok_pages#stats.snx11002 | Cumulative number of writeback success pages. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem |
writeback_from_pressure#stats.snx11002 | Cumulative number of writeback from pressure. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem |
writeback_from_writepage#stats.snx11002 | Cumulative number of writeback from writepages. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem |
dirty_pages_misses#stats.snx11002 | Cumulative number of Dirty page misses; Dirty pages are the pages in memory (page cache) that have been updated and therefore have changed from what is currently stored on disk. |
dirty_pages_hits#stats.snx11002 | Cumulative number of Dirty page hits; Dirty pages are the pages in memory (page cache) that have been updated and therefore have changed from what is currently stored on disk. |
lockless_write_bytes#stats.snx11003 | Cumulative number of lockless write I/O bytes. This is a special kind of I/O where clients do not get any locks but instead instruct the server to take the locks on the client’s behalf |
lockless_read_bytes#stats.snx11003 | Cumulative number of lockless read I/O bytes. This is a special kind of I/O where clients do not get any locks but instead instruct the server to take the locks on the client’s behalf |
direct_write#stats.snx11003 | Cumulative number of writes to storage |
direct_read#stats.snx11003 | Cumulative number of reads to storage |
inode_permission#stats.snx11003 | Cumulative number of checks for access rights to a given inode |
removexattr#stats.snx11003 | Cumulative number of remove attributes. Command removes the extended attribute identified by name and associated with the given path in the filesystem. |
listxattr#stats.snx11003 | Cumulative number of listattr. Command retrieves the list of extended attribute names associated with the given path in the filesystem. |
getxattr#stats.snx11003 | Cumulative number of times operation has occurred to retrieve the value of the extended attribute identified by name and associated with the given path in the filesystem. |
setxattr#stats.snx11003 | Cumulative number of calls to set extended attributes |
alloc_inode#stats.snx11003 | Cumulative number of Fragmentations. System will allocate another inode as needed. |
statfs#stats.snx11003 | Cumulative number of calls to stat fs |
getattr#stats.snx11003 | Cumulative number of get attribute calls |
flock#stats.snx11003 | Cumulative count of file locks (flock calls). The flock utility manages flock locks from within shell scripts or from the command line. |
lockless_truncate#stats.snx11003 | Cumulative number of file truncates without locking a file. |
truncate#stats.snx11003 | The cumulative number of events to shrink (or extend) the size of a file to the specified size |
setattr#stats.snx11003 | The cumulative number of times setattr was called. This command sets the value of given attribute of an object |
fsync#stats.snx11003 | The cumulative number of fsync calls. fsync transfers ("flushes") all modified in-core data (i.e., modified buffer cache pages) of the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. As well as flushing the file data, fsync() also flushes the metadata information associated with the file. |
seek#stats.snx11003 | Cumulative File seeks |
mmap#stats.snx11003 | Cumulative number of new mappings in the virtual address space of the calling process. The starting address for a new mapping is specified in addr. |
close#stats.snx11003 | Cumulative File Closes |
open#stats.snx11003 | Cumulative File Opens |
ioctl#stats.snx11003 | Cumulative Input/Output control calls |
brw_write#stats.snx11003 | Cumulative bulk (brw) writes to storage |
brw_read#stats.snx11003 | Cumulative bulk (brw) reads to storage |
write_bytes#stats.snx11003 | Cumulative Writes to storage in bytes |
read_bytes#stats.snx11003 | Cumulative Reads to storage in bytes |
writeback_failed_pages#stats.snx11003 | Cumulative number of writeback failed pages. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem |
writeback_ok_pages#stats.snx11003 | Cumulative number of writeback success pages. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem |
writeback_from_pressure#stats.snx11003 | Cumulative number of writeback from pressure. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem |
writeback_from_writepage#stats.snx11003 | Cumulative number of writeback from writepages. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem |
dirty_pages_misses#stats.snx11003 | Cumulative number of Dirty page misses; Dirty pages are the pages in memory (page cache) that have been updated and therefore have changed from what is currently stored on disk. |
dirty_pages_hits#stats.snx11003 | Cumulative number of Dirty page hits; Dirty pages are the pages in memory (page cache) that have been updated and therefore have changed from what is currently stored on disk. |
Note on the NIC metrics below: the fundamental issue is that some of the performance counters count data that doesn't actually make it onto the HSN. There are overhead flits counted as parts of some Get transactions, and demarcation packets within some transactions that are entirely generated by and consumed by the local Gemini. There isn't enough information available to compensate exactly for them. Option A takes a simplistic approach and ignores the issue; the extra bytes are counted as if they were message payload. Option B is preferred. It makes two assumptions we believe are reasonable: 1. Packets that are part of BTE Puts will mostly be max-sized. 2. The majority of Get requests will be BTE, not FMA. We believe this matches MPI's use: the BTE is used for large transfers, and only the first and last packets of a transfer may be less than max-sized, so as the transfers are large, most packets will not be the first or last packet. Option A may be more accurate if actual use doesn't match these assumptions. The SAMPLE_* variants are rates in bytes per second; the corresponding metrics without the SAMPLE_ prefix are the raw counters.
SAMPLE_totaloutput_optB (B/s) | NIC metric: total output rate, Option B (see note above) |
SAMPLE_bteout_optB (B/s) | NIC metric: BTE output rate, Option B (see note above) |
SAMPLE_bteout_optA (B/s) | NIC metric: BTE output rate, Option A (see note above) |
SAMPLE_fmaout (B/s) | NIC metric: FMA output rate (see note above) |
SAMPLE_totalinput (B/s) | NIC metric: total input rate (see note above) |
SAMPLE_totaloutput_optA (B/s) | NIC metric: total output rate, Option A (see note above) |
totaloutput_optB | NIC metric: total output counter, Option B (see note above) |
bteout_optB | NIC metric: BTE output counter, Option B (see note above) |
bteout_optA | NIC metric: BTE output counter, Option A (see note above) |
fmaout | Cumulative number of Fast Memory Accesses (small transfers) by the node's NIC |
totalinput | Sum of total bytes for the node's NIC |
totaloutput_optA | NIC metric: total output counter, Option A (see note above) |
Z-_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) | Percentage of time that the Z negative link was in a credit-stalled state (link-aggregated Gemini output stalls) |
Z+_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) | Link-aggregated Gemini output stalls for the Z positive link |
Y-_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) | Link-aggregated Gemini output stalls for the Y negative link |
Y+_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) | Link-aggregated Gemini output stalls for the Y positive link |
X-_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) | Link-aggregated Gemini output stalls for the X negative link |
X+_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) | Link-aggregated Gemini output stalls for the X positive link |
Z-_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) | % of time spent in Input Queue Stall state for the Z negative link |
Z+_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) | % of time spent in Input Queue Stall state for the Z positive link |
Y-_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) | % of time spent in Input Queue Stall state for the Y negative link |
Y+_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) | % of time spent in Input Queue Stall state for the Y positive link |
X-_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) | % of time spent in Input Queue Stall state for the X negative link |
X+_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) | % of time spent in Input Queue Stall state for the X positive link |
Z-_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) | Average packet size for the Z negative link |
Z+_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) | Average packet size for the Z positive link |
Y-_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) | Average packet size for the Y negative link |
Y+_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) | Average packet size for the Y positive link |
X-_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) | Average packet size for the X negative link |
X+_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) | Average packet size for the X positive link |
Z-_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) | % of used bandwidth for the Z negative link |
Z+_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) | % of used bandwidth for the Z positive link |
Y-_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) | % of used bandwidth for the Y negative link |
Y+_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) | % of used bandwidth for the Y positive link |
X-_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) | % of used bandwidth for the X negative link |
X+_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) | % of used bandwidth for the X positive link |
Z-_SAMPLE_GEMINI_LINK_BW (B/s) | Z negative total transfer rate |
Z+_SAMPLE_GEMINI_LINK_BW (B/s) | Z positive total transfer rate |
Y-_SAMPLE_GEMINI_LINK_BW (B/s) | Y negative total transfer rate |
Y+_SAMPLE_GEMINI_LINK_BW (B/s) | Y positive total transfer rate |
X-_SAMPLE_GEMINI_LINK_BW (B/s) | X negative total transfer rate |
X+_SAMPLE_GEMINI_LINK_BW (B/s) | X positive total transfer rate |
Z-_recvlinkstatus (1) | Link status, given as the number of functioning communication lanes; a fully functional Gemini for a complete torus will have 24 send and receive lanes |
Z+_recvlinkstatus (1) | Link status, given as the number of functioning communication lanes; a fully functional Gemini for a complete torus will have 24 send and receive lanes |
Y-_recvlinkstatus (1) | Link status, given as the number of functioning communication lanes; a fully functional Gemini for a complete torus will have 12 send and receive lanes |
Y+_recvlinkstatus (1) | Link status, given as the number of functioning communication lanes; a fully functional Gemini for a complete torus will have 12 send and receive lanes |
X-_recvlinkstatus (1) | Link status, given as the number of functioning communication lanes; a fully functional Gemini for a complete torus will have 24 send and receive lanes |
X+_recvlinkstatus (1) | Link status, given as the number of functioning communication lanes; a fully functional Gemini for a complete torus will have 24 send and receive lanes |
Z-_sendlinkstatus (1) | Link status, given as the number of functioning communication lanes; a fully functional Gemini for a complete torus will have 24 send and receive lanes |
Z+_sendlinkstatus (1) | Link status, given as the number of functioning communication lanes; a fully functional Gemini for a complete torus will have 24 send and receive lanes |
Y-_sendlinkstatus (1) | Link status, given as the number of functioning communication lanes; a fully functional Gemini for a complete torus will have 12 send and receive lanes |
Y+_sendlinkstatus (1) | Link status, given as the number of functioning communication lanes; a fully functional Gemini for a complete torus will have 12 send and receive lanes |
X-_sendlinkstatus (1) | Link status, given as the number of functioning communication lanes; a fully functional Gemini for a complete torus will have 24 send and receive lanes |
X+_sendlinkstatus (1) | Link status, given as the number of functioning communication lanes; a fully functional Gemini for a complete torus will have 24 send and receive lanes |
Z-_credit_stall (ns) | Wait time between devices when the first device is waiting for a signal from the second device to send more data. |
Z+_credit_stall (ns) | Wait time between devices when the first device is waiting for a signal from the second device to send more data. |
Y-_credit_stall (ns) | Wait time between devices when the first device is waiting for a signal from the second device to send more data. |
Y+_credit_stall (ns) | Wait time between devices when the first device is waiting for a signal from the second device to send more data. |
X-_credit_stall (ns) | Wait time between devices when the first device is waiting for a signal from the second device to send more data. |
X+_credit_stall (ns) | Wait time between devices when the first device is waiting for a signal from the second device to send more data. |
Z-_inq_stall (ns) | Input queue stalled in nanoseconds for Z negative |
Z+_inq_stall (ns) | Input queue stalled in nanoseconds for Z positive |
Y-_inq_stall (ns) | Input queue stalled in nanoseconds for Y negative |
Y+_inq_stall (ns) | Input queue stalled in nanoseconds for Y positive |
X-_inq_stall (ns) | Input queue stalled in nanoseconds for X negative |
X+_inq_stall (ns) | Input queue stalled in nanoseconds for X positive |
Z-_packets (1) | Cumulative number of packets for Z negative |
Z+_packets (1) | Cumulative number of packets for Z positive |
Y-_packets (1) | Cumulative number of packets for Y negative |
Y+_packets (1) | Cumulative number of packets for Y positive |
X-_packets (1) | Cumulative number of packets for X negative |
X+_packets (1) | Cumulative number of packets for X positive |
Z-_traffic (B) | Cumulative traffic in bytes for Z negative |
Z+_traffic (B) | Cumulative traffic in bytes for Z positive |
Y-_traffic (B) | Cumulative traffic in bytes for Y negative |
Y+_traffic (B) | Cumulative traffic in bytes for Y positive |
X-_traffic (B) | Cumulative traffic in bytes for X negative |
X+_traffic (B) | Cumulative traffic in bytes for X positive |
nettopo_mesh_coord_Z | Z position in the torus |
nettopo_mesh_coord_Y | Y position in the torus |
nettopo_mesh_coord_X | X position in the torus |
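Because most of the fields above are cumulative counters, a common first step is to difference consecutive samples for the same node to obtain per-interval activity. The following is a minimal sketch under that assumption; it expects rows already parsed into dictionaries (as in the earlier sketch) and grouped per node in time order, and the field named in the usage comment is just one example.

```python
def counter_deltas(samples, field):
    """Yield (time, delta) pairs for a cumulative counter field.

    `samples` is an iterable of per-sample dicts for a single node, in time
    order. Counter resets (for example after a node reboot) appear as
    negative deltas and are skipped here.
    """
    prev = None
    for s in samples:
        t, v = float(s["#Time"]), int(s[field])
        if prev is not None and v >= prev:
            yield t, v - prev
        prev = v

# Example: bytes written to the snx11001 file system between samples.
# deltas = list(counter_deltas(node_samples, "write_bytes#stats.snx11001"))
```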
1 B. MSR Data File
The MSR data file contains headers followed by the associated data elements; those elements are explained later in this document. The headers for each MSR comma-separated value (CSV) data file are formatted as follows:
#Time, Time_usec, CompId, Ctr0, Ctr0_c00, Ctr0_c08, Ctr0_c16, Ctr0_c24, Ctr1, Ctr1_c00, Ctr1_c08, Ctr1_c16, Ctr1_c24, Ctr2, Ctr2_c00, Ctr2_c08, Ctr2_c16, Ctr2_c24, Ctr3, Ctr3_c00, Ctr3_c08, Ctr3_c16, Ctr3_c24, Ctr4, Ctr4_c00, Ctr4_c01, Ctr4_c02, Ctr4_c03, Ctr4_c04, Ctr4_c05, Ctr4_c06, Ctr4_c07, Ctr4_c08, Ctr4_c09, Ctr4_c10, Ctr4_c11, Ctr4_c12, Ctr4_c13, Ctr4_c14, Ctr4_c15, Ctr4_c16, Ctr4_c17, Ctr4_c18, Ctr4_c19, Ctr4_c20, Ctr4_c21, Ctr4_c22, Ctr4_c23, Ctr4_c24, Ctr4_c25, Ctr4_c26, Ctr4_c27, Ctr4_c28, Ctr4_c29, Ctr4_c30, Ctr4_c31, Ctr5, Ctr5_c00, Ctr5_c01, Ctr5_c02, Ctr5_c03, Ctr5_c04, Ctr5_c05, Ctr5_c06, Ctr5_c07, Ctr5_c08, Ctr5_c09, Ctr5_c10, Ctr5_c11, Ctr5_c12, Ctr5_c13, Ctr5_c14, Ctr5_c15, Ctr5_c16, Ctr5_c17, Ctr5_c18, Ctr5_c19, Ctr5_c20, Ctr5_c21, Ctr5_c22, Ctr5_c23, Ctr5_c24, Ctr5_c25, Ctr5_c26, Ctr5_c27, Ctr5_c28, Ctr5_c29, Ctr5_c30, Ctr5_c31, Ctr6, Ctr6_c00, Ctr6_c01, Ctr6_c02, Ctr6_c03, Ctr6_c04, Ctr6_c05, Ctr6_c06, Ctr6_c07, Ctr6_c08, Ctr6_c09, Ctr6_c10, Ctr6_c11, Ctr6_c12, Ctr6_c13, Ctr6_c14, Ctr6_c15, Ctr6_c16, Ctr6_c17, Ctr6_c18, Ctr6_c19, Ctr6_c20, Ctr6_c21, Ctr6_c22, Ctr6_c23, Ctr6_c24, Ctr6_c25, Ctr6_c26, Ctr6_c27, Ctr6_c28, Ctr6_c29, Ctr6_c30, Ctr6_c31, Ctr7, Ctr7_c00, Ctr7_c01, Ctr7_c02, Ctr7_c03, Ctr7_c04, Ctr7_c05, Ctr7_c06, Ctr7_c07, Ctr7_c08, Ctr7_c09, Ctr7_c10, Ctr7_c11, Ctr7_c12, Ctr7_c13, Ctr7_c14, Ctr7_c15, Ctr7_c16, Ctr7_c17, Ctr7_c18, Ctr7_c19, Ctr7_c20, Ctr7_c21, Ctr7_c22, Ctr7_c23, Ctr7_c24, Ctr7_c25, Ctr7_c26, Ctr7_c27, Ctr7_c28, Ctr7_c29, Ctr7_c30, Ctr7_c31, Ctr8, Ctr8_c00, Ctr8_c01, Ctr8_c02, Ctr8_c03, Ctr8_c04, Ctr8_c05, Ctr8_c06, Ctr8_c07, Ctr8_c08, Ctr8_c09, Ctr8_c10, Ctr8_c11, Ctr8_c12, Ctr8_c13, Ctr8_c14, Ctr8_c15, Ctr8_c16, Ctr8_c17, Ctr8_c18, Ctr8_c19, Ctr8_c20, Ctr8_c21, Ctr8_c22, Ctr8_c23, Ctr8_c24, Ctr8_c25, Ctr8_c26, Ctr8_c27, Ctr8_c28, Ctr8_c29, Ctr8_c30, Ctr8_c31, Ctr9, Ctr9_c00, Ctr9_c01, Ctr9_c02, Ctr9_c03, Ctr9_c04, Ctr9_c05, Ctr9_c06, Ctr9_c07, Ctr9_c08, Ctr9_c09, Ctr9_c10, Ctr9_c11, Ctr9_c12, Ctr9_c13, Ctr9_c14, Ctr9_c15, Ctr9_c16, Ctr9_c17, Ctr9_c18, Ctr9_c19, Ctr9_c20, Ctr9_c21, Ctr9_c22, Ctr9_c23, Ctr9_c24, Ctr9_c25, Ctr9_c26, Ctr9_c27, Ctr9_c28, Ctr9_c29, Ctr9_c30, Ctr9_c31
Future data files may differ, so reference the header file for the relevant date range. Refer to Table 1 for information on the meaning of the counters (Ctr0-9).
Table 1 – Header Details
Counter | MSR Counter Definitions | What is being measured | Validation Number |
Ctr0 | L3_CACHE_MISSES per NUMA domain (4 counters) | Memory Controller Counts | 85903603681 |
Ctr1 | DCT_PREFETCH per NUMA domain (4 counters) | Memory Controller Counts | 73018664176 |
Ctr2 | DCT_RD_TOT per NUMA domain for each controller (4 counters) | Memory Controller Counts | 730186636664 |
Ctr3 | DCT_WRT per NUMA domain (4 counters) | Memory Controller Counts | 73018644976 |
Ctr4 | TOT-CYC per core (32 counters) | Total processor cycles for each core | 4391030 |
Ctr5 | TOT INS per core (32 counters) | Total instructions for each core | 4391104 |
Ctr6 | L1_DCM per core (32 counters) | L1 data cache misses for each core | 4391233 |
Ctr7 | Retired flops per core, all types of flops (32 counters) | Number of retired floating-point operations per core | 4456195 |
Ctr8 | Operation counts per core (32 counters) | Vector unit instructions per core | 4392139 |
Ctr9 | Translation Lookaside Buffer (TLB) data misses per core (32 counters) | TLB data misses per core | 4392774 |
How to Read the MSR Data File
Reference this sample MSR comma-separated value (CSV) data file:
1480996440.004670, 4670, 8672, 85903603681, 1075675482, 957589463, 717738766, 744067220, 73018664176, 412116844, 369559710, 125781222, 119420227, 73018663664, 1703147611, 1424989459, 813186830, 824852929, 73018644976, 941771093, 910328449, 393929432, 383752296, 4391030, 571110602344, 562415965217, 556924102961, 554761201724, 552701273182, 551182100824, 551820818084, 550270895655, 560016645494, 553663646637, 549783500782, 549381437004, 539166673211, 539742737722, 540313267024, 539874939820, 150688025199, 148035712784, 162794766413, 158869157833, 163899518848, 161418031801, 164150890354, 162961337921, 147052930419, 144195801312, 146327145247, 144225245105, 166696559177, 164632544561, 193190727930, 214273693014, 4391104, 503840210303, 640187476527, 597059695706, 596812465922, 594595322971, 595837373660, 591893240209, 592532355039, 596434671942, 592577910304, 593366025588, 591517987927, 590791274450, 592685890076, 591352966301, 591685943383, 157896260541, 155033869123, 159496313138, 155325793581, 158132877662, 156241919544, 156052879923, 156028880128, 161016004873, 160206105921, 157705581605, 154634583769, 157143534884, 156100936718, 158578417504, 163794562360, 4391233, 3350723806, 617055349, 543869780, 471503058, 461166615, 435289712, 465346004, 443590008, 503874718, 438612212, 452681008, 463562281, 416589356, 423156695, 429588594, 421988887, 295675170, 289542436, 313223710, 308085728, 325639270, 315319292, 365086035, 314506675, 287651138, 275625001, 271477379, 271835640, 335907639, 323885751, 341101565, 521244927, 4456195, 30779324559, 30779336977, 30779328042, 30779334057, 30779323889, 30779323032, 30779323435, 30779323120, 30779323916, 30779323364, 30779324117, 30779323774, 30779322949, 30779322839, 30779322762, 30779322922, 30778881320, 30778881309, 30778881499, 30778881574, 30778881320, 30778881385, 30778881216, 30778881349, 30778881632, 30778881297, 30778884573, 30778881471, 30778881316, 30778881237, 30778881391, 30778881787, 4392139, 14761832555, 14582318338, 14589458999, 14583243493, 14582424396, 14581571672, 14582586746, 14581792888, 14603781577, 14582028453, 14583226586, 14581883168, 14582393085, 14581535539, 14581592416, 14581549938, 14514853391, 14513979629, 14514014497, 14513973145, 14513995833, 14513974554, 14514007272, 14513980413, 14524818983, 14513954446, 14514343883, 14513937916, 14514059869, 14514016689, 14514188159, 14514218075, 4392774, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
How to Understand What the Numbers Above Mean
Ten counters are used, with 4 values for counters 0-3 and 32 values for counters 4-9. Each of the 26,868 Blue Waters nodes is represented on a separate line in the CSV file. Each file contains a full day of data for every node at one-minute samples, so each file should have approximately 38.7 million lines. Each data block starts with three leading elements before the Ctr0-9 data.
Therefore, in the sample CSV data file above, the first three comma-separated elements correspond to #Time (epoch time), Time_Usec, and CompID. Each of the 10 counters is then given as a leading value followed by its associated data values. The leading values are known as VALIDATION numbers: the validation number in EACH data block for Ctr0-9 must be equal to the listed validation number in the tables for the associated data to be valid. A parsing sketch that applies this check follows Table 2.
Table 2 below shows the three leading elements for the counter data set, plus the 10 counters. Compare Table 2 to the above sample CSV file in order to see the relationship between the counter headers, validation number, and their associated data.
Table 2 – Counter Examples
MSR Counter Definitions | Validation Number | Counter Values |
#Time | | 1480996440.004670 |
Time_Usec | | 4670 |
CompID | | 8672 |
Ctr0 * 4 L3_CACHE_MISSES per NUMA domain | 85903603681 | 1075675482, 957589463, 717738766, 744067220 |
Ctr1 * 4 DCT_PREFETCH per NUMA domain | 73018664176 | 412116844, 369559710, 125781222, 119420227 |
Ctr2 * 4 DCT_RD_TOT per NUMA domain | 730186636664 | 1703147611, 1424989459, 813186830, 824852929 |
Ctr3 * 4 DCT_WRT per NUMA domain | 73018644976 | 941771093, 910328449, 393929432, 383752296 |
Ctr4 * 32 TOT-CYC per core | 4391030 | 571110602344, 562415965217, 556924102961, 554761201724, 552701273182, 551182100824, 551820818084, 550270895655, 560016645494, 553663646637, 549783500782, 549381437004, 539166673211, 539742737722, 540313267024, 539874939820, 150688025199, 148035712784, 162794766413, 158869157833, 163899518848, 161418031801, 164150890354, 162961337921, 147052930419, 144195801312, 146327145247, 144225245105, 166696559177, 164632544561, 193190727930, 214273693014 |
Ctr5 * 32 TOT INS per core | 4391104 | 32 counters… after 4391104 |
Ctr6 * 32 L1 DCM per core | 4391233 | 32 counters… |
Ctr7 * 32 retired flops per core (all types of flops) | 4456195 | 32 counters… |
Ctr8 * 32 vector instructions per core | 4392139 | 32 counters… |
Ctr9 * 32 TLB DM per core | 4392774 | 32 counters… |
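As an illustration of the validation check described above, here is a minimal Python sketch that splits one MSR CSV line into its counter groups; the counter widths follow the header layout and the validation numbers are transcribed from Table 1.

```python
# Layout from the MSR header: three leading fields, then for each counter a
# validation number followed by its per-NUMA-domain or per-core values.
COUNTER_WIDTHS = [4, 4, 4, 4, 32, 32, 32, 32, 32, 32]   # Ctr0..Ctr9

# Validation numbers as listed in Table 1.
VALIDATION = [85903603681, 73018664176, 730186636664, 73018644976,
              4391030, 4391104, 4391233, 4456195, 4392139, 4392774]

def parse_msr_line(line):
    """Return (time, time_usec, comp_id, counters) for one MSR CSV line.

    `counters` maps "Ctr0".."Ctr9" to a list of values; a counter whose
    leading validation number does not match Table 1 is set to None,
    meaning its data should not be trusted.
    """
    fields = [f.strip() for f in line.split(",")]
    time, time_usec, comp_id = float(fields[0]), int(fields[1]), int(fields[2])
    counters, pos = {}, 3
    for i, width in enumerate(COUNTER_WIDTHS):
        valid = int(fields[pos]) == VALIDATION[i]
        values = [int(v) for v in fields[pos + 1:pos + 1 + width]]
        counters[f"Ctr{i}"] = values if valid else None
        pos += 1 + width
    return time, time_usec, comp_id, counters
```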
2. Syslogs Data
Syslog is a standard for sending and receiving notification messages, in a particular format, from various network devices. The messages include time stamps, event messages, severity, host IP addresses, diagnostics, and more. Syslog was designed to monitor network devices and systems and to send out notification messages if there are any issues with their functioning; it also sends out alerts for pre-notified events and monitors suspicious activity via the change log/event log of participating network devices.
The posted logs have been anonymized to replace usernames and to remove ssh and sudo lines.
3. Resource Manager data (Torque)
Please go to Chapter 10: Accounting Records of the Adaptive Computing (Torque) website for background information on job log and accounting data.
The standard accounting logs are posted with the username and project/group name fields anonymized.
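For illustration only, a minimal parsing sketch follows. It assumes the standard Torque accounting-record layout (a semicolon-delimited timestamp, record type, and job ID, followed by space-separated key=value pairs); the file name and attribute names in the usage comment are hypothetical and may differ in the posted logs.

```python
def parse_torque_record(line):
    """Parse one Torque accounting record into its four parts.

    Assumed layout: timestamp;record_type;job_id;key=value key=value ...
    Common record types are Q (queued), S (started), E (ended), D (deleted).
    """
    timestamp, record_type, job_id, message = line.rstrip("\n").split(";", 3)
    attrs = {}
    for token in message.split():
        if "=" in token:
            key, value = token.split("=", 1)
            attrs[key] = value
    return timestamp, record_type, job_id, attrs

# Hypothetical usage: collect the anonymized group name of every ended job.
# with open("accounting_log") as f:
#     ended = [parse_torque_record(l) for l in f if ";E;" in l]
#     groups = [attrs.get("group") for _, _, _, attrs in ended]
```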
4. System Environment Data Collections
Cabinet and chassis data is separated into four types of files:
L1_ENV_DATA
CSV with the following fields:
service_id, datetime, PCB_TEMP, INLET_TEMP, XDP_AIRTEMP, CAB_KILOWATTS, FANSPEED
L1_XT5_STATUS
CSV with the following fields:
service_id,datetime,L1_S_XT5_FWLEVEL,L1_H_XT5_PWRSTATUS,L1_H_XT5_CABHEALTH,L1_S_XT5_FANSPEED,L1_S_XT5_FANMODE,L1_S_XT5_VFD_REG,L1_S_XT5_DOORSTAT,L1_H_XT5_CAGE0VRMSTAT,L1_H_XT5_CAGE1VRMSTAT,L1_H_XT5_CAGE2VRMSTAT,L1_H_XT5_VALERE_SH0_SL0,L1_H_XT5_VALERE_SH0_SL1,L1_H_XT5_VALERE_SH0_SL2,L1_H_XT5_VALERE_SH1_SL0,L1_H_XT5_VALERE_SH1_SL1,L1_H_XT5_VALERE_SH1_SL2,L1_H_XT5_VALERE_SH2_SL0,L1_H_XT5_VALERE_SH2_SL1,L1_H_XT5_VALERE_SH2_SL2,L1_S_XT5_VALERE_SHAREFAULTS,L1_H_XT5_XDPALARM
L1_XT5_TEMPS
CSV with the following fields:
service_id,datetime,L1_T_XT5_PCBTEMP,L1_T_XT5_INLETTEMP,L1_T_XT5_XDPAIRTEMP,L1_T_XT5_XDPSTARTTEMP,L1_T_XT5_VALERE_FET_SH0_SL0,L1_T_XT5_VALERE_FET_SH0_SL1,L1_T_XT5_VALERE_FET_SH0_SL2,L1_T_XT5_VALERE_FET_SH1_SL0,L1_T_XT5_VALERE_FET_SH1_SL1,L1_T_XT5_VALERE_FET_SH1_SL2,L1_T_XT5_VALERE_FET_SH2_SL0,L1_T_XT5_VALERE_FET_SH2_SL1,L1_T_XT5_VALERE_FET_SH2_SL2
L1_XT5_VOLTS
CSV with the following fields:
service_id,datetime,L1_V_XT5_PCB5VA,L1_V_XT5_PCB5VB,L1_V_XT5_PCB3V,L1_V_XT5_PCB2V,L1_V_XT5_VALERE_SH0_SL0,L1_V_XT5_VALERE_SH0_SL1,L1_V_XT5_VALERE_SH0_SL2,L1_V_XT5_VALERE_SH1_SL0,L1_V_XT5_VALERE_SH1_SL1,L1_V_XT5_VALERE_SH1_SL2,L1_V_XT5_VALERE_SH2_SL0,L1_V_XT5_VALERE_SH2_SL1,L1_V_XT5_VALERE_SH2_SL2,L1_I_XT5_VALERE_SH0_SL0,L1_I_XT5_VALERE_SH0_SL1,L1_I_XT5_VALERE_SH0_SL2,L1_I_XT5_VALERE_SH1_SL0,L1_I_XT5_VALERE_SH1_SL1,L1_I_XT5_VALERE_SH1_SL2,L1_I_XT5_VALERE_SH2_SL0,L1_I_XT5_VALERE_SH2_SL1,L1_I_XT5_VALERE_SH2_SL2,L1_P_XT5_CABKILOWATTS
Definitions of all parameters are not available, but this is what we can provide at this time:
Parameter | Definition |
PCB_TEMP | Printed circuit board temperature, measured on the L1 controller PCB |
INLET_TEMP | Incoming air temperature at the bottom of each cabinet, in degrees Celsius. At 32 degrees the cabinet shuts down. |
XDP_AIRTEMP | XDP is the cooling unit that circulates coolant through the cabinets. This temperature is measured above each cabinet. |
CAB_KILOWATTS | Total power being used by the cabinet. Does not include the cooling fan. DC power consumption. |
FANSPEED | A 7.5 HP motor driving the circulating fan. Max speed is 75. |
VALERE | The manufacturer of the power supplies used in the cabinets. Each cabinet has 7 of them; there are 9 slots (3 rows of 3), so 2 are always empty. Most of the VALERE outputs are referencing temperatures. |
FET | The field effect transistors which are the components that actually do the voltage regulation. |
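A minimal sketch for reading an L1_ENV_DATA file with the fields listed above is shown below; the file name is a hypothetical placeholder, and the sketch assumes the files carry no header row (the column names are supplied explicitly).

```python
import csv

# Field names as listed above for L1_ENV_DATA files.
ENV_FIELDS = ["service_id", "datetime", "PCB_TEMP", "INLET_TEMP",
              "XDP_AIRTEMP", "CAB_KILOWATTS", "FANSPEED"]

with open("L1_ENV_DATA.csv") as f:                 # hypothetical file name
    for row in csv.DictReader(f, fieldnames=ENV_FIELDS):
        # Flag cabinets approaching the 32-degree inlet-temperature shutdown limit.
        if float(row["INLET_TEMP"]) >= 30.0:
            print(row["service_id"], row["datetime"], row["INLET_TEMP"])
```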
5. Darshan Data
Darshan is a lightweight, scalable I/O profiling tool. Darshan is able to collect profile information for POSIX, HDF5, NetCDF, and MPI-IO calls. Darshan profile data can be used to investigate and tune the I/O behavior of MPI applications. Darshan can be used only with MPI applications; the application, at minimum, must call MPI_Init and MPI_Finalize.
For information on Darshan Data, go to: https://www.mcs.anl.gov/research/projects/darshan/
The available tar files contain an anonymized version of the Darshan output for each Blue Waters job; not all Blue Waters jobs used Darshan, however. The anonymization step used the same user map as the other data above and obfuscates the username, uid, and path (keeping the base file system path intact). The resulting files should be readable by the normal Darshan analysis tools. The anonymization was performed with a modified version of the darshan-convert utility that leaves more information intact (such as the application name), along with a Python driver.
6. Lustre User Experience Metrics
Active probing of filesystem components
The behavior of a Lustre HPC file system is complex, with activity influenced by multiple users and subsystems, so abnormal behavior can be difficult to identify. To provide better insight into file system activity, the Integrated System Console (ISC) was used; ISC is an active monitoring tool for server storage data. At its core, ISC processes logs and job metadata, stores this data, and provides a mechanism for viewing the collected data through a web interface.
The component data probes actively monitor server storage and metadata by writing to every component of the server storage file system to measure performance. This active probing of the file system differs from the Cray sampler data in that the latter is passive and is essentially a counter of operations and data flows from each compute node.
How data is collected
Data was collected from three file system components: the MOM, login, and import/export nodes. Problems typically associated with these components are:
- a large number of metadata operations
- a very large I/O to a striped file
- a moderate amount of I/O to an unstriped file
Service Nodes - "Service nodes" is a general term for non-compute nodes. The service nodes that launch jobs are more specifically called "MOM" nodes. The server storage hosts within the main computer system that launch jobs (MOM) are used to represent file system interactions, using the same clients as the compute nodes and using LNET routers to access the file system data.
Import/Export (aka DTN) Nodes – Data is collected to measure access via InfiniBand.
Login Nodes - Login nodes are used for administrative tasks like copying, editing, and transferring files. For example, if a user connects via an SSH client, they are connecting to a login node. Login nodes are used when a user compiles code and submits jobs to the batch scheduler. Login nodes are measured to represent users' impact on each other's behavior. The login nodes require collection from each host, as the user interference can be unique to a host.
How to Understand the Data
The data files are comma-separated in the following format:
Collection host, time, operation, filesystem, OST ID, measurement time
Collection host is one of the following:
- H2ologin[1-4]: collections from the login nodes, which have multi-user access via InfiniBand
- Mom[1-64]: machines inside the high-speed network that use LNET routers
- Ie[01-28]: Import/Export nodes without login access, using InfiniBand
Time: in epoch
Operation:
Operation | Test ID |
create | 1 |
write | 2 |
rmdir | 3 |
end | 4 |
single file create | 5 |
single file delete | 6 |
Filesystem:
Filesystem name | Before 2016-02-22 | After 2016-02-22 |
home | snx11001 | snx11002 |
projects | snx11002 | snx11001 |
scratch | snx11003 | snx11003 |
OST ID: Lustre node number for the filesystem server
Measurement time: Time in milliseconds to perform the operation
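Putting the format description above together, here is a minimal Python sketch for reading these records; the file name is a hypothetical placeholder, and because it is not stated whether the operation column stores the operation name or its numeric test ID, the sketch accepts either.

```python
import csv

# Test IDs from the Operation table above.
OPERATIONS = {"1": "create", "2": "write", "3": "rmdir", "4": "end",
              "5": "single file create", "6": "single file delete"}

FIELDS = ["collection_host", "time", "operation", "filesystem",
          "ost_id", "measurement_time_ms"]

with open("lustre_user_experience.csv") as f:      # hypothetical file name
    for row in csv.DictReader(f, fieldnames=FIELDS):
        # Map a numeric test ID to its operation name; pass names through as-is.
        op = OPERATIONS.get(row["operation"], row["operation"])
        # measurement_time_ms: time in milliseconds to perform the operation.
        print(row["collection_host"], row["filesystem"], op,
              row["measurement_time_ms"])
```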
How to Get Access to the Data
Access to the datasets is provided via https://www.globus.org/. You may login with an existing institutional account or create a new account at that site.
The collection name is Blue Waters System Monitoring Data Set and can be found by searching for that name within Globus.