May 15, 2025
Conference Paper
Custom Accessors: Enabling Scalable Data Ingestion, (Re-)Organization, and Analysis on Distributed Systems
Abstract
The emerging class of high-velocity, high-volume data analytics workflows comprises interwoven data ingestion, organization, and processing stages, with the ingestion and organization steps often incurring computational costs comparable to, or even higher than, those of the actual processing steps. Since complex workflows consist of a variety of phases that view and use data differently, being able to construct efficient, scalable, distributed data structures (arrays, vectors, sets, maps, and multi-maps) is essential and requires custom methods to extend and shrink containers, analyze and position data, and maintain globally consistent metadata. In this paper, we propose a novel data-structure access paradigm based on the concept of Accessors. At a high level, accessors are customizable callable objects that can modify the behavior of insert, read, update, and delete operations for distributed containers while preserving atomicity guarantees. Accessors provide a clean and natural way to implement a variety of programming patterns, e.g., conditional insertion/deletion and cascading computations, which would otherwise be hard (or even impossible) to express in parallel and distributed settings without using locks. We demonstrate the practicality and usefulness of our approach with two representative use cases and study the performance of these applications on a distributed High-Performance Computing system. Our analysis highlights that the proposed abstraction enables effective overlapping and concurrent execution of different workflow steps (e.g., data ingestion and analysis), which in a conventional analytics pipeline would execute sequentially, contributing cumulatively to the overall latency.
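To make the accessor idea concrete, the following is a minimal single-process sketch in Python, not the paper's implementation or API. It assumes a hypothetical `ShardedMap` whose shards stand in for the partitions of a distributed map, and a hypothetical `apply` method that runs a user-supplied callable atomically on the shard owning a key. The per-shard lock is internal and only emulates the atomicity a runtime would provide; caller code never takes a lock. All names (`ShardedMap`, `apply`, `insert_if_absent`, `add_to_count`, `read`, `count_words_concurrently`) are illustrative assumptions.

```python
import threading
from typing import Any, Callable, Hashable, Iterable, List

# Hypothetical accessor signature: (shard, key, argument) -> result.
Accessor = Callable[[dict, Hashable, Any], Any]


class ShardedMap:
    """Single-process stand-in for a distributed map split into shards."""

    def __init__(self, num_shards: int = 4) -> None:
        self._shards: List[dict] = [dict() for _ in range(num_shards)]
        # Internal locks emulate the per-entry atomicity of a real runtime.
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def _shard_of(self, key: Hashable) -> int:
        return hash(key) % len(self._shards)

    def apply(self, key: Hashable, accessor: Accessor, arg: Any = None) -> Any:
        """Run `accessor` on the shard that owns `key`, atomically for that shard."""
        i = self._shard_of(key)
        with self._locks[i]:
            return accessor(self._shards[i], key, arg)


def insert_if_absent(shard: dict, key: Hashable, value: Any) -> bool:
    """Conditional insertion: store `value` only if `key` is not yet present."""
    if key in shard:
        return False
    shard[key] = value
    return True


def add_to_count(shard: dict, key: Hashable, delta: int) -> int:
    """Read-modify-write update: create the counter on first use, then add."""
    shard[key] = shard.get(key, 0) + delta
    return shard[key]


def read(shard: dict, key: Hashable, _unused: Any) -> Any:
    """Plain read; returns None when the key is absent."""
    return shard.get(key)


def count_words_concurrently(chunks: Iterable[Iterable[str]]) -> ShardedMap:
    """Ingest each chunk on its own thread while counts are updated in place."""
    counts = ShardedMap()

    def ingest(chunk: Iterable[str]) -> None:
        for word in chunk:
            counts.apply(word, add_to_count, 1)

    threads = [threading.Thread(target=ingest, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

In this sketch the check-then-insert and read-modify-write patterns are each expressed as one callable passed to the container, so the caller does not coordinate threads explicitly. A distributed version would instead ship the callable to the node that owns the key, a detail this sketch leaves out.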