December 9, 2019
Conference Paper

TAZeR: Hiding the Cost of Remote I/O in Distributed Scientific Workflows

Abstract

A perennial bottleneck in distributed workflow analytics is long access latencies for remote data. We ask the question: assuming that data must be accessed remotely, can latencies be hidden? We present TAZeR, a framework that reduces data access latency while increasing data reuse. TAZeR transparently converts POSIX I/O into operations that interleave application work with data transfer, i.e., reads, prefetching, and write stage-out. TAZeR ensures read data moves directly to application memory without synchronous intervention (soft zero-copy). TAZeR uses distributed bandwidth-aware staging to exploit reuse across application tasks and to manage the capacity constraints of fast hierarchical storage. We evaluate TAZeR on a High Energy Physics workflow that requests remote data at 48 Gb/s (over two 1 Gb/s WAN links) using complex access patterns. TAZeR is 12× and 22× faster than XRootD (state-of-the-art) and file copies (current approach), respectively, and within 7% of optimal. We discuss the conditions under which TAZeR can hide I/O accesses and evaluate performance as effective staging sizes change.

Revised: August 27, 2020 | Published: December 9, 2019

Citation

Suetterlein J.D., R.D. Friese, N.R. Tallent, and M. Schram. 2019. TAZeR: Hiding the Cost of Remote I/O in Distributed Scientific Workflows. In IEEE International Conference on Big Data (Big Data 2019), December 9-12, 2019, Los Angeles, CA, 383-394. Piscataway, New Jersey: IEEE. PNNL-SA-148879. doi:10.1109/BigData47090.2019.9006418