MESSAGE OF THE DAY ARCHIVE
On Monday, April 4th, a change will be made to the user environment where the current altd module will be replaced by its followup, xalt. No action is required on your part.
For more information on xalt please see the SC14 HUST paper "User environment tracking and problem detection with XALT" at http://dl.acm.org/citation.cfm?id=2691140.
At 11:30 on 4/25/2016, we began experiencing some delays for file operations on Home/Projects. We expect responsiveness to improve soon.
We are investigating an issue causing poor interactive command response that is likely filesystem related. We hope to have response times at expected levels soon.
We believe we have identified the source of the filesystem responsiveness issues. Work is underway to correct the problem.
All jobs that end between May 1 and June 17, 2016 (inclusive) and meet the
following job characteristics are eligible for half off (50% reduction)
of the charge factor for any queue:
*** Node Count >= 1024
*** Used Wall clock time >= 4:00:00
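The eligibility rule above can be sketched as a small check. This is only an illustration, not the accounting code used on Blue Waters: the function name and its parameters are hypothetical, and it assumes both criteria must hold together and that the discount simply halves whatever charge factor the queue already has.

```python
from datetime import date, timedelta

# Values taken from the announcement above.
WINDOW_START = date(2016, 5, 1)
WINDOW_END = date(2016, 6, 17)
MIN_NODE_COUNT = 1024
MIN_USED_WALLTIME = timedelta(hours=4)
DISCOUNT = 0.5  # "half off (50% reduction)"


def discounted_charge_factor(charge_factor, end_date, node_count, used_walltime):
    """Return the charge factor after applying the discount, if the job is eligible.

    Hypothetical helper: assumes both job characteristics must be met.
    """
    eligible = (
        WINDOW_START <= end_date <= WINDOW_END
        and node_count >= MIN_NODE_COUNT
        and used_walltime >= MIN_USED_WALLTIME
    )
    return charge_factor * DISCOUNT if eligible else charge_factor
```

For example, under these assumptions a 2048-node job that ends on May 20, 2016 after 5 hours of wall clock time in a queue with charge factor 1.0 would be charged at 0.5.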
Blue Waters will have a scheduled maintenance on May 25th, starting at 5pm, extending to May 26th 11:59pm. This maintenance will complete a milestone in the planned filesystem upgrade process discussed earlier. Home and Projects filesystems will be hosted by an upgraded filesystem when the system returns, but will still be combined on the same physical resource until the next upgrade milestone is reached. Any changes during this maintenance period should be transparent when the system returns to service, with the exception of stripe width.
Stripe width change:
The maximum file stripe width on Home and Projects will change from 144 to 36. Files previously striped wider than 36 will be striped at 36 wide. Scripts attempting to set a wider stripe than 36 will result in a 36 wide stripe.
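The effect on scripts that request a wide stripe can be sketched as a simple clamp. This is a minimal illustration with a hypothetical function name; it models only explicit positive stripe widths and does not model any special values the filesystem tools may accept.

```python
# New maximum stripe width on Home and Projects, from the announcement above.
MAX_STRIPE_WIDTH = 36


def effective_stripe_width(requested):
    """Return the stripe width a file would get for an explicit positive request."""
    if requested < 1:
        raise ValueError("this sketch only models explicit positive stripe widths")
    # Requests wider than the maximum are reduced to the maximum.
    return min(requested, MAX_STRIPE_WIDTH)
```

Under this sketch, a script that previously requested the old maximum of 144 would get 36, while a request for 8 is unchanged.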
Maintenance has been extended to Friday 5/27 5pm CDT. This has been done to allow extra measures of data integrity assurance as filesystems are migrated for the update process.
The maintenance period ended May 27th at 11:45 PM, and all system resources have been returned to service. We apologize for the extended maintenance period, but it was necessary to accommodate an extra integrity assurance step in the filesystem migration to the upgraded servers.
The Blue Waters scratch file system is currently experiencing a metadata issue that is causing operations to hang. The issue is being diagnosed, and a follow-up message will be sent when the system is returned to production. The scheduler is paused while the issue is corrected.
Blue Waters will have a scheduled maintenance on Monday June 20th, starting at 8pm, extending to Wednesday June 22nd, 11:59pm. This maintenance will complete the next milestone in the planned filesystem upgrade process. Any changes during this maintenance period should be transparent.
The cabinet that failed has been restored and the communication fabric is whole again.
A migration of data to the new home file system in preparation for the maintenance period starting June 20th may impact the performance and responsiveness of home/project.
Blue Waters has been returned to service with another milestone achieved in the planned summer filesystem upgrade process. Please note the change in CUDA default to version 7.5 (previously 7.0).
Blue Waters Admins
We recently started a migration of data to the upgraded scratch file system. This may impact the performance and responsiveness of scratch.
We continue to experience instability in the login node ecosystem. We expect further interruptions while we work toward a resolution for the known Lustre bug. As a result, your connection may drop. Please log in again to access the available healthy login nodes. We apologize for the inconvenience. Blue Waters Admin Team
We have taken steps to increase the stability of the login node ecosystem until a Lustre patch is released for the external logins. We apologize for the recent instability.
Blue Waters will have a scheduled maintenance on Friday, August 26th, starting at 6am, extending to Saturday, August 27th, 8pm. This outage is the final step in the file system upgrade process and will return the scratch file system to its full size and performance.
Blue Waters maintenance is complete. The system was returned to service Aug 26th 8:45PM. The final step in the file system upgrade process is complete and has returned the scratch file system to its full size and performance. File system quotas are now enabled and hard quotas are enforced on all Lustre file systems. Please note: users over their hard quota will not be able to write to the file system. The Wide Striping Feature on the scratch file system now supports a maximum file stripe width of 360.
Login access to the Blue Waters portal and the NCSA Jira ticket system has been restored.
We apologize for the interruption.
Wide Striping Feature on the scratch file system was reset to a 160 maximum file stripe width. A bug was discovered with our monitoring software and wide striping. When the bug is resolved we will again allow a wide stripe of 360 on the scratch file system.
Wide Striping Feature on scratch file system will support 360 maximum file stripe width. We have reset the maximum file stripe width to 360.
We are investigating an ongoing issue with dependency jobs.
Nearline maintenance has been completed. At this point all Blue Waters services are available.
We had an Emergency Power Off of one rack at 12:23 PM. We have paused the scheduler until we can assess the health of the system.
The down rack was brought back online and the scheduler was resumed at 2PM. Some jobs that spanned the nodes in the down cabinet between 12:30PM and 2PM may have been lost.
The Blue Waters login node user process watcher settings will be adjusted to ensure responsive login nodes and limit the impact of certain use case scenarios. Targeted, IO intensive processes will be terminated after 1 hour. Other processes will be terminated after 4 hours of CPU time if the CPU utilization is greater than 20%. Processes using more than 25% of a login node’s physical memory will also be terminated.
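The three rules above can be read as a small decision function. This is a sketch of one possible interpretation, not the actual watcher: the class, field, and function names are hypothetical, and it assumes the 1 hour limit for targeted IO intensive processes is elapsed run time, that limits are reached at exactly 1 and 4 hours, and that the memory rule applies to every process.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessSample:
    """Hypothetical snapshot of one login node process."""
    io_intensive: bool       # flagged as a targeted, IO intensive process
    elapsed_hours: float     # assumed: wall clock time since the process started
    cpu_hours: float         # accumulated CPU time
    cpu_utilization: float   # percent, 0 to 100
    memory_fraction: float   # share of the node's physical memory, 0.0 to 1.0


def termination_reason(p: ProcessSample) -> Optional[str]:
    """Return why the process would be terminated, or None if it may keep running."""
    if p.memory_fraction > 0.25:
        return "memory"
    if p.io_intensive:
        return "io" if p.elapsed_hours >= 1.0 else None
    if p.cpu_hours >= 4.0 and p.cpu_utilization > 20.0:
        return "cpu"
    return None
```

Under these assumptions, a long-running process that stays at or below 20% CPU utilization and 25% of memory is left alone, while a flagged IO intensive copy that has run for 90 minutes is terminated.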
We are having intermittent instability on the login nodes. We will be rotating the login nodes into & out of service until the issue is isolated and resolved. Please access the available logins by ssh to bw or h2ologin.
Blue Waters experienced a power blip that affected the Lustre storage subsystems for the home, projects and scratch file systems, beginning Feb 11th at 7:00 PM CT. We have paused the scheduler while we attempt to bring the file systems back online. Most of the file systems have recovered except for a small portion of scratch. Blue Waters resumed normal operations Feb 11th 11:30PM and the scheduler was resumed.
Blue Waters has been returned to service at 2:27am on 3/29/2017.
We have been alerted to an issue causing jobs to fail at runtime and preventing applications from linking dynamically after the maintenance.
Examples of the issues are:
error while loading shared libraries: libudreg.so.0: cannot open shared object file: No such file or directory
libxpmem.so.0 => not found
libudreg.so.0 => not found
The scheduler has been paused while we investigate the issue.
Update: This issue has been resolved. No changes to your code are necessary. Please report any other issues to firstname.lastname@example.org.
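For reference, the "=> not found" lines in the examples above follow the format printed by ldd for unresolved shared libraries. A minimal sketch of pulling the missing library names out of such output follows; the function name is hypothetical, and the sample text in the usage note is illustrative rather than captured from Blue Waters.

```python
def missing_libraries(ldd_output):
    """Return the library names that an ldd-style listing reports as not found."""
    missing = []
    for line in ldd_output.splitlines():
        # Resolved and unresolved entries both use "name => target".
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing
```

Given text containing the two "not found" lines shown above plus a resolved entry such as `libc.so.6 => /lib64/libc.so.6`, the function returns `["libxpmem.so.0", "libudreg.so.0"]`.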
Blue Waters is experiencing a component failure and HSN instability that began around 09:30 AM CT. System support staff are evaluating and attempting to restore normal service. Job scheduling is paused until the issue is resolved.
We have identified and isolated the failing components on Blue Waters. Job scheduling has been resumed.
The failing components on Blue Waters have been repaired and returned to service.
The maintenance was a success and Blue Waters was returned to service at 11:55pm, with login nodes returned earlier by 10pm. Though critical security patches warrant such short notice interruptions, we apologize for any inconvenience this may have caused. Have a safe and happy Independence Day.
Blue Waters will undergo emergency maintenance tomorrow (Oct 25th), beginning at 6AM. Access to Blue Waters resources will be restricted at 6AM and all running jobs will be terminated at the beginning of the maintenance. The compute system and scheduler will be unavailable for the entire duration; the login nodes should be available within three hours. The Lustre filesystems and Globus transfers will remain unaffected. A Return to Service email will be sent when the maintenance is complete. Interim updates will be posted on the Blue Waters Message of the Day: https://bluewaters.ncsa.illinois.edu/motd
Blue Waters is down for Emergency Maintenance that started at 6AM.
Blue Waters Emergency Maintenance is complete and the system was returned to service at 2:05PM, with the login nodes returned by 9AM. Though critical security patches warrant such a short notice interruption, we apologize for any inconvenience this may have caused.
A compute node cabinet has become unavailable. This has resulted in the loss of some jobs and the communications have been rerouted.
Blue Waters experienced an outage of 1/36 of the home file system from ~5PM 1/4 - ~8PM 1/5. Jobs requiring access to files on the affected storage would have hung and possibly timed out. After careful repair work by our vendor, the file system has been restored to full health and we believe with no loss of data. If you believe you have a suspect file on home please let us know by submitting a help request to email@example.com as soon as possible.
The compute nodes maintenance was a success and Blue Waters has returned to full service at 1:05PM today. We apologize for any inconvenience this may have caused.
Maintenance on the login nodes is complete and all are back in service. We recommend that you use bw.ncsa.illinois.edu or h2ologin.ncsa.illinois.edu to access the available login nodes.
Blue Waters will be unavailable during scheduled maintenance Monday (February 26th) beginning at 6 AM for a duration of 24hrs. All Blue Waters resources will be unavailable including the Globus online endpoints and login nodes. Interim updates will be posted on the Blue Waters Message of the Day: https://bluewaters.ncsa.illinois.edu/motd
Blue Waters has started maintenance at 6AM and will be unavailable for up to 24hrs.
The maintenance was a success and Blue Waters was returned to service February 27 at 1:35AM, with ncsa#Nearline endpoint (HPSS) returned earlier February 26 at 10PM.
Blue Waters is experiencing a full system issue that began at 6:30 AM CT. System support staff are evaluating and attempting to restore normal service but may require a full system reboot. Job scheduling is paused until the issue is resolved.
Full system reboot in progress. All running jobs were lost and will require re-submission from the latest checkpoint file. The Lustre filesystems and the GO endpoints BlueWaters and Nearline will remain up, and the login nodes will remain accessible.
Blue Waters has returned to full service at 2:14 PM CT.
Thunderstorms have resulted in a power interruption of the Blue Waters system. A full system reboot is currently in progress. Return to service is estimated at approximately 10 am Central time.
Blue Waters has returned to full service operations; new jobs will launch shortly.
Part of the scratch filesystem was unavailable from 7:44PM - 8PM. ost168-179 were unavailable for 15 minutes. Jobs that ended during this time may have been impacted. Jobs that remain running should have paused their I/O and waited for the OST targets to return. Note: this brief outage was for a small portion of the scratch filesystem.
We have had some issues with the login nodes hanging, which have required some reboots. We are analyzing the problem and hope to resolve it shortly.
Blue Waters will be unavailable during maintenance Monday July 16 9AM - 10PM
The logins and lustre file systems will remain available. Nearline storage will be unavailable from 9AM - Noon. New jobs will not be accepted as the scheduler will be down during the maintenance period.
Blue Waters Returned to Service at 7:30PM CT. Please review the Programming Environment Update Document on the user portal. Report any issues via email to firstname.lastname@example.org.
Blue Waters returned to service after a required system restart.
A power outage impacted Blue Waters service on 12/27 from 1:16pm to 11:40pm. All running jobs had to be terminated for recovery. We apologize for any inconvenience this may have caused.
The Lustre scratch file system experienced an issue at 8AM. We have paused the scheduler while we address the issue.
We have determined the high speed network is in a bad state and a full system reboot is required to recover.
Blue Waters was returned to service at 1PM CT.
The Blue Waters Nearline storage system will undergo maintenance starting at 5AM on Thursday, February 7th, possibly extending through February 8th.
The system will receive a significant software and hardware update that is expected to enhance performance and function. All other Blue Waters subsystems will remain operational and uninterrupted.