Deep Learning (DL) algorithms have become the {\em de facto} Machine Learning
(ML) approach for large-scale data analysis. DL algorithms are
computationally expensive -- even distributed DL implementations that use MPI
require days of training (model learning) time on commonly studied datasets.
Long-running DL applications are thus susceptible to faults, which requires the
development of fault tolerant system infrastructure in addition to fault
tolerant DL algorithms. This raises an important question: {\em What is needed
from MPI for designing fault tolerant DL implementations?} In this paper, we
address this problem for permanent faults. We motivate the need for a fault
tolerant MPI specification by an in-depth consideration of recent innovations
in DL algorithms and their properties, which drive the need for specific fault
tolerance features. We present a detailed discussion of the suitability of
different parallelism types (model, data, and hybrid); the need (or lack
thereof) for checkpointing of critical data structures; and, most importantly,
the applicability of several MPI fault tolerance proposals (user-level fault
mitigation (ULFM), Reinit) to fault tolerant DL implementations. We leverage a
distributed memory implementation of Caffe,
currently available under the Machine Learning Toolkit for Extreme Scale
(MaTEx). We implement our approach by extending MaTEx-Caffe with a ULFM-based
implementation. Our evaluation using the ImageNet dataset and the AlexNet
neural network topology demonstrates the effectiveness of the proposed fault
tolerant DL implementation using ULFM-enabled Open MPI.
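As a concrete illustration of the recovery pattern this implies, the following
minimal sketch (assuming a ULFM-enabled MPI; it is not taken from MaTEx-Caffe,
and the helper name \texttt{allreduce\_gradients} is hypothetical) shows how a
data-parallel gradient allreduce might detect a failed rank, revoke and shrink
the communicator with the ULFM primitives \texttt{MPIX\_Comm\_revoke} and
\texttt{MPIX\_Comm\_shrink}, and continue training on the surviving ranks.

\begin{verbatim}
/* Hedged sketch: ULFM-style recovery around a data-parallel gradient
 * allreduce.  Assumes a ULFM-enabled MPI (mpi-ext.h provides MPIX_*)
 * and that MPI_ERRORS_RETURN was set on the communicator via
 * MPI_Comm_set_errhandler, so failures return instead of aborting.
 * The helper name and buffer handling are illustrative only. */
#include <mpi.h>
#include <mpi-ext.h>   /* ULFM extensions: MPIX_Comm_revoke, MPIX_Comm_shrink */
#include <stdlib.h>

/* Returns 0 on success, 1 if the communicator was shrunk and the
 * current training iteration should be retried on surviving ranks. */
int allreduce_gradients(MPI_Comm *comm, double *grads, int n)
{
    double *sum = (double *) malloc(n * sizeof(double));
    int rc = MPI_Allreduce(grads, sum, n, MPI_DOUBLE, MPI_SUM, *comm);

    int eclass = MPI_SUCCESS;
    if (rc != MPI_SUCCESS) MPI_Error_class(rc, &eclass);

    if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED) {
        /* A rank failed: make the failure globally visible, then
         * rebuild the communicator from the survivors and retry. */
        MPIX_Comm_revoke(*comm);
        MPI_Comm shrunk;
        MPIX_Comm_shrink(*comm, &shrunk);
        MPI_Comm_free(comm);
        *comm = shrunk;
        free(sum);
        return 1;
    }

    int size;
    MPI_Comm_size(*comm, &size);
    for (int i = 0; i < n; ++i)
        grads[i] = sum[i] / size;  /* average over surviving ranks */
    free(sum);
    return 0;
}
\end{verbatim}

The design choice sketched here is continued execution on the shrunken
communicator rather than a global restart; ULFM leaves this policy to the
application, which is the kind of flexibility whose applicability to DL the
paper examines.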
Revised: December 28, 2017 | Published: September 25, 2017

Citation: Amatya V.C., A. Vishnu, C.M. Siegel, and J.A. Daily. 2017. What does fault tolerant Deep Learning need from MPI? In Proceedings of the 24th European MPI Users' Group Meeting, September 25-28, 2017, Chicago, Illinois, Paper No. 13. New York, New York: ACM. PNNL-SA-127971. doi:10.1145/3127024.3127037