May 19, 2013
Conference Paper

Integrating Online Compression to Accelerate Large-Scale Data Analytics Applications

Abstract

With growing dataset sizes, and as computing cycles are increasing faster than storage and wide-area bandwidths, compression appears to be a promising approach for improving the performance of large-scale data analytics applications. In this context, this paper makes the following contributions. First, we develop a new compression methodology, which exploits the similarities between spatial and/or temporal neighbors in a simulation dataset, and enables high compression ratios and low decompression costs. Second, we have developed a framework that can be used to incorporate a variety of compression and decompression algorithms. This framework also supports a simple API to allow integration with an existing application or data processing middleware. Once a compression algorithm is implemented, this framework can allow multi-threaded retrieval, multi-threaded data decompression, and use of informed prefetching and caching. By integrating this framework with a data-intensive middleware, we have applied our compression methodology and framework to three applications over two datasets, including a GCRM climate model dataset. We obtained an average compression ratio of 51.68%, and up to 53.27% improvement in execution time of data analysis applications.

Revised: September 2, 2013 | Published: May 19, 2013

Citation

Bicer T., J. Yin, D. Chiu, G. Agrawal, and K.L. Schuchardt. 2013. Integrating Online Compression to Accelerate Large-Scale Data Analytics Applications. In 27th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013), May 20-24, 2013, Boston, MA, 1205-1216. Piscataway, New Jersey: IEEE. PNNL-SA-93019. doi:10.1109/IPDPS.2013.81