Predicting Performance Degradation and Failures of Applications Through System Activity Monitoring
Large-scale systems suffer from an increasingly high number of errors and failures. Protecting applications from system errors is a key factor in ensuring valid outputs and high system utilization. By analyzing three years of Blue Waters system error data, we determined the cause of application failures and found that a large-scale application is 20X more vulnerable to system errors than a small-scale application. From this empirical study, we surmise that fault/error containment in a large-scale system is no longer an option: applications need to compute in the presence of errors while ensuring the correctness of their results and high utilization of the system. In this proposal, we therefore plan to extend our previous study to build an intelligent fault monitoring and management system based on machine learning models trained on three years of data (~100 TB). Specifically, the system will learn features from performance and system logs to detect and predict application performance degradation and failures caused by system errors. Blue Waters would serve as the enabling platform for large-scale analytics and testing.