Predicting Performance Degradation and Failures of Applications Through System Activity Monitoring

Ravishankar Iyer, University of Illinois at Urbana-Champaign

Usage Details

Ravishankar Iyer, Subho Banerjee, Saurabh Jha, Valerio Formicola, Matthew Krafczyk, Lavin Devnani, Sharon Tang, Mikel Hernaez

Large-scale systems suffer from an increasingly high number of errors and failures. Protecting applications from system errors is key to ensuring valid outputs and high system utilization. Based on an analysis of three years of Blue Waters system error data, we determined the cause of application failures and found that a large-scale application is 20X more vulnerable to system errors than a small-scale application. From this empirical study, we surmise that fault/error containment in a large-scale system is no longer an option: applications need to compute in the presence of errors while ensuring the correctness of results and high utilization of the system. In this proposal, we therefore plan to extend our previous study and build an intelligent fault monitoring and management system using machine learning models trained on three years of data (~100 TB). Specifically, the system will learn features from performance and system logs to detect and predict application performance degradation and failures caused by system errors. Blue Waters would be the enabling platform for both the large-scale analytics and the testing.