Blue Waters Discounts Available: For details, see: https://bluewaters.ncsa.illinois.edu/manage-news/-/blogs/blue-waters-discounts-available
On Monday, April 4th, the current altd module in the user environment will be replaced by its successor, xalt. No action is required on your part.
For more information on xalt please see the SC14 HUST paper "User environment tracking and problem detection with XALT" at http://dl.acm.org/citation.cfm?id=2691140.
Since 11:30 on 4/25/2016, file operations on Home/Projects have been experiencing delays. We expect responsiveness to improve soon.
We are investigating an issue causing poor interactive command response that is likely filesystem related. We hope to have response times at expected levels soon.
We believe we have identified the source of the filesystem responsiveness issues. Work is underway to correct the problem.
All jobs that end on or between May 1 – June 17 2016 and meet the
following job characteristics are eligible for half off (50% reduction)
of the charge factor for any queue:
*** Node Count >= 1024
*** Used Wall clock time >= 4:00:00
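The eligibility rule above can be sketched as a small check. This is an illustration only, not the accounting system's actual logic: the function name, its parameters, and the assumption that the 50% reduction is a simple multiplication of the queue charge factor are all hypothetical.

```python
from datetime import date, timedelta

# Values taken from the announcement above.
DISCOUNT_START = date(2016, 5, 1)
DISCOUNT_END = date(2016, 6, 17)
MIN_NODES = 1024
MIN_WALLTIME = timedelta(hours=4)


def discounted_charge_factor(charge_factor, node_count, used_walltime, end_date):
    """Return the charge factor after applying the discount, if eligible.

    Hypothetical helper: assumes the discount halves the queue's
    charge factor and that the date window is inclusive on both ends.
    """
    eligible = (
        DISCOUNT_START <= end_date <= DISCOUNT_END
        and node_count >= MIN_NODES
        and used_walltime >= MIN_WALLTIME
    )
    return charge_factor * 0.5 if eligible else charge_factor
```

For example, a 2048-node job that used 6 hours of wall clock time and ended on June 1 2016 would be charged at half the normal factor under these assumptions, while a 512-node job would not.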
Blue Waters will have a scheduled maintenance on May 25th, starting at 5pm, extending to May 26th 11:59pm. This maintenance will complete a milestone in the planned filesystem upgrade process discussed earlier. Home and Projects filesystems will be hosted by an upgraded filesystem when the system returns, but will still be combined on the same physical resource until the next upgrade milestone is reached. Any changes during this maintenance period should be transparent when the system returns to service, with the exception of stripe width.
Stripe width change:
The maximum file stripe width on Home and Projects will change from 144 to 36. Files previously striped wider than 36 will be striped at 36 wide. Scripts that request a stripe wider than 36 will get a 36 wide stripe.
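The effect of the new limit on a requested stripe width can be sketched as a simple clamp. This is an illustration of the behavior described above, not filesystem code: the function name is hypothetical, and the sketch only considers explicit positive stripe widths.

```python
# New maximum stripe width for Home and Projects, from the announcement.
MAX_STRIPE_WIDTH = 36


def effective_stripe_width(requested):
    """Return the stripe width a file would get for a requested width.

    Hypothetical helper: requests wider than the maximum are reduced
    to the maximum; narrower requests are left unchanged.
    """
    if requested < 1:
        raise ValueError("this sketch only handles explicit positive widths")
    return min(requested, MAX_STRIPE_WIDTH)
```

Under this sketch a script that previously asked for a width of 144 would get 36, and a request for 8 would still get 8.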
Maintenance has been extended to Friday 5/27 5pm CDT. This has been done to allow extra measures of data integrity assurance as filesystems are migrated for the update process.
The maintenance period ended May 27th 11:45 PM and all system resources are returned to service. We apologize for the extended maintenance period but this was necessary in order to accommodate an extra integrity assurance step in the filesystem migration to upgraded servers.
The Blue Waters scratch file system currently is experiencing a metadata issue that is causing operations to hang. The issue is currently being diagnosed and a follow up message will be sent when the system is returned to production. The scheduler is paused while the issue is corrected.
Blue Waters will have a scheduled maintenance on Monday June 20th, starting at 8pm, extending to Wednesday June 22nd, 11:59pm. This maintenance will complete the next milestone in the planned filesystem upgrade process. Any changes during this maintenance period should be transparent.
The cabinet that failed has been restored and the communication fabric is whole again.
A migration of data to the new home file system in preparation for the maintenance period starting June 20th may impact the performance and responsiveness of home/project.
Blue Waters has been returned to service with another milestone achieved in the planned summer filesystem upgrade process. Please note the change in CUDA default to version 7.5 (previously 7.0).
Blue Waters Admins
We recently started a migration of data to the upgraded scratch file system. This may impact the performance and responsiveness of scratch.
We continue to experience instability in the login node ecosystem. We expect further interruptions while we work toward a resolution for the known Lustre bug. As a result your connection may drop. Please log in again to access the available healthy login nodes. We apologize for the inconvenience. Blue Waters Admin Team
We have taken steps to increase the stability of the login node ecosystem until a Lustre patch is released for the external logins. We apologize for the recent instability.
Blue Waters will have a scheduled maintenance on Friday August 26th, starting at 6am, extending to Saturday, August 27th, 8pm. This outage is the final step in the file system upgrade process and will return the scratch file system to its full size and performance.
Blue Waters maintenance is complete. The system was returned to service Aug 26th 8:45PM. The final step in the file system upgrade process is complete and has returned the scratch file system to its full size and performance. File system quotas are now enabled and hard quotas enforced on all Lustre file systems. Please note: users over hard quota will not be able to write to the file system. The Wide Striping Feature on the scratch file system will now support a maximum file stripe width of 360.
Login access to the Blue Waters portal and the NCSA Jira ticket system has been restored.
We apologize for the interruption.
Wide Striping Feature on the scratch file system was reset to a 160 maximum file stripe width. A bug was discovered with our monitoring software and wide striping. When the bug is resolved we will again allow a wide stripe of 360 on the scratch file system.
Wide Striping Feature on scratch file system will support 360 maximum file stripe width. We have reset the maximum file stripe width to 360.
We are investigating an ongoing issue with dependency jobs.
Nearline maintenance has been completed. At this point all Blue Waters services are available.
We had an Emergency Power Off of one rack at 12:23 PM. We have paused the scheduler until we can assess the health of the system.
The down rack was brought back online and the scheduler was resumed at 2PM. Jobs that spanned nodes in the down cabinet between 12:30PM and 2PM may have been lost.
The Blue Waters login node user process watcher settings will be adjusted to ensure responsive login nodes and limit the impact of certain use case scenarios. Targeted, IO intensive processes will be terminated after 1 hour. Other processes will be terminated after 4 hours of CPU time if the CPU utilization is greater than 20%. Processes using more than 25% of a login node’s physical memory will also be terminated.
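The termination rules above can be sketched as a single decision function. This is an illustration under stated assumptions, not the watcher's actual implementation: the `ProcessSample` fields and the function name are hypothetical, the 1 hour limit for IO intensive processes is assumed to be elapsed time, and the memory rule is assumed to apply to every process regardless of type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessSample:
    """Hypothetical snapshot of one login node process."""
    io_intensive: bool       # flagged as a targeted, IO intensive process
    elapsed_hours: float     # time since the process started (assumed meaning)
    cpu_hours: float         # accumulated CPU time
    cpu_utilization: float   # fraction of one CPU, 0.0 to 1.0
    memory_fraction: float   # share of the node's physical memory, 0.0 to 1.0


def termination_reason(p: ProcessSample) -> Optional[str]:
    """Return why the process would be terminated, or None if it may continue."""
    # More than 25% of physical memory: terminated regardless of type.
    if p.memory_fraction > 0.25:
        return "memory"
    # Targeted, IO intensive processes: terminated after 1 hour.
    if p.io_intensive:
        return "io" if p.elapsed_hours >= 1.0 else None
    # Other processes: 4 hours of CPU time, only if utilization exceeds 20%.
    if p.cpu_hours >= 4.0 and p.cpu_utilization > 0.20:
        return "cpu"
    return None
```

Under this sketch, a long-running but mostly idle process (for example, 5 CPU hours at 10% utilization) is left alone, while the same process at 50% utilization is not.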
We are having intermittent instability on the login nodes. We will be rotating the login nodes into & out of service until the issue is isolated and resolved. Please access the available logins by ssh to bw or h2ologin.
Blue Waters experienced a power blip beginning Feb 11th at 7:00 PM CT that affected the Lustre storage subsystems for the home, projects and scratch file systems. We have paused the scheduler while we attempt to bring the file systems back online. Most of the file systems have recovered except for a small portion of scratch. Blue Waters resumed normal operations Feb 11th 11:30PM and the scheduler was resumed.
Blue Waters has been returned to service at 2:27am on 3/29/2017.
We have been alerted to an issue causing jobs to fail at runtime and preventing applications from linking dynamically after the maintenance.
Examples of the issues are:
error while loading shared libraries: libudreg.so.0: cannot open shared object file: No such file or directory
libxpmem.so.0 => not found
libudreg.so.0 => not found
The scheduler has been paused while we investigate the issue.
Update: This issue has been resolved. No changes to your code are necessary. Please report any other issues to email@example.com.
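The "=> not found" lines quoted above are in the form that ldd prints for a shared library it cannot resolve. To check whether an executable shows the same symptom, the ldd output can be scanned for such lines. The following is a minimal sketch; the function name is hypothetical and it only parses text that you supply, for example the saved output of running ldd on your executable.

```python
def missing_libraries(ldd_output):
    """Return the library names that ldd reported as 'not found'.

    Hypothetical helper: expects the plain text printed by ldd,
    one dependency per line, and ignores lines without '=>'.
    """
    missing = []
    for line in ldd_output.splitlines():
        name, sep, target = line.partition("=>")
        if sep and target.strip() == "not found":
            missing.append(name.strip())
    return missing


# Sample text modeled on the messages quoted in the announcement.
sample = (
    "\tlibxpmem.so.0 => not found\n"
    "\tlibudreg.so.0 => not found\n"
    "\tlibc.so.6 => /lib64/libc.so.6 (0x00007f0000000000)\n"
)
print(missing_libraries(sample))
```

An empty list means every dependency listed in the supplied text was resolved.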
Blue Waters is experiencing a component failure and HSN instability that began around 09:30 AM CT. System support staff are evaluating and attempting to restore normal service. Job scheduling is paused until the issue is resolved.
We have identified and isolated the failing components on Blue Waters. Job scheduling has been resumed.
The failing components on Blue Waters have been repaired and returned to service.
The maintenance was a success and Blue Waters was returned to service at 11:55pm, with login nodes returned earlier by 10pm. Though critical security patches warrant such short notice interruptions, we apologize for any inconvenience this may have caused. Have a safe and happy Independence Day.