Performance Analysis of Large Scale Deep Learning Systems
In recent years, there has been substantial growth of interest in neural networks with many layers, usually referred to as deep learning. It has been observed that increasing the amount of training data and the number of model parameters can improve accuracy significantly. This observation has driven strong interest in large-scale training of these models. However, existing distributed implementations of the deep learning training process lack efficiency across large sets of machines, limiting their scalability. The efficiency loss is caused by the high overhead of message passing between machines as well as CPU/GPU data transfer within nodes. In this project, we plan to model analytically and benchmark existing distributed deep learning frameworks to identify their main bottlenecks. This work will lead to guidelines for designing highly scalable deep learning frameworks, as well as for resolving the issues of existing ones. A large number of nodes equipped with powerful GPUs, fast shared parallel storage, and a fast interconnection network make Blue Waters extremely appealing and well suited for this research project.
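To illustrate the kind of analytical modeling proposed above, the following is a minimal sketch of a latency-bandwidth (alpha-beta) scaling model for synchronous data-parallel training. All parameter values (compute time, latency, bandwidth, model size) are illustrative assumptions, not measurements from any framework or from Blue Waters.

```python
def step_time(n_nodes, t_compute=1.0, alpha=1e-4, beta=5e-10,
              model_bytes=250e6):
    """Estimated time per training step on n_nodes (all values are assumptions).

    t_compute:   single-node compute time per step (seconds)
    alpha:       per-message latency (seconds)
    beta:        inverse bandwidth (seconds per byte)
    model_bytes: size of the gradients exchanged each step
    """
    if n_nodes == 1:
        return t_compute
    # A ring all-reduce moves ~2*(n-1)/n of the model per node per step
    # and incurs 2*(n-1) message latencies.
    comm = (2 * (n_nodes - 1) / n_nodes * model_bytes * beta
            + 2 * (n_nodes - 1) * alpha)
    # Compute shrinks with the node count; communication does not.
    return t_compute / n_nodes + comm


def parallel_efficiency(n_nodes, **kw):
    """Speedup over one node divided by node count (1.0 = perfect scaling)."""
    return step_time(1, **kw) / (n_nodes * step_time(n_nodes, **kw))


if __name__ == "__main__":
    for n in (1, 4, 16, 64, 256):
        print(f"{n:4d} nodes: parallel efficiency {parallel_efficiency(n):.2f}")
```

Even this simple model shows the efficiency loss described above: as the node count grows, the fixed communication term dominates the shrinking per-node compute term, which is exactly the behavior the proposed benchmarks would quantify on real frameworks.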