November 16, 2017
Conference Paper

Deep Learning on Operational Facility Data Related to Large-Scale Distributed Area Scientific Workflows

Abstract

Distributed computing platforms provide a robust mechanism for performing large-scale computations by splitting tasks and data across multiple locations, possibly thousands of miles apart. This distribution of resources can yield benefits such as data redundancy and engagement of scientific teams whose domain experts are spread around the globe, allowing scientists to share their expertise and data and to gain insights from the collective intelligence of the entire team. However, distributed computing also brings problems of its own: rampant duplication of file transfers that increases congestion, long job completion times, unexpected site crashes, suboptimal data transfer rates, unpredictable reliability over a given time range, congestion spikes, and suboptimal usage of storage elements. In addition, each sub-system becomes a potential failure node that can trigger system-wide disruptions.

Revised: June 28, 2019 | Published: November 16, 2017

Citation

Singh, A., E. G. Stephan, M. Schram, and I. Altintas. 2017. "Deep Learning on Operational Facility Data Related to Large-Scale Distributed Area Scientific Workflows." In IEEE 13th International Conference on e-Science (e-Science 2017), October 24-27, 2017, Auckland, New Zealand, 586-591. Los Alamitos, California: IEEE Computer Society. PNNL-SA-136096. doi:10.1109/eScience.2017.94