September 30, 2025
Feature

Massive Datasets Meet Their Match

Award-winning poly-streaming approach processes massive datasets faster while using limited memory 


The poly-streaming approach allows for massive datasets to be analyzed while using small amounts of memory. 

(Illustration by Nathan Johnson | Pacific Northwest National Laboratory)

Just as streaming services have replaced CDs and DVDs by letting people watch or listen without downloading the content first, data streaming lets scientists analyze raw data from tools like microscopes and drones without needing to save the entire dataset beforehand.

Recent award-winning research by S. M. Ferdous at Pacific Northwest National Laboratory (PNNL) and his collaborators Ahammed Ullah and Alex Pothen at Purdue University makes analyzing large data streams significantly faster, and in the process, can make extreme-scale data AI-ready. By combining streaming with parallel computing, the team developed an algorithm that speeds up data analysis by nearly two orders of magnitude. 

Powering a poly-streaming model with parallel computing

For algorithm design and analysis, researchers use models of computation: agreed-upon rules for what operations an algorithm can perform and what resources it can use. Different models capture different settings. The classic random access memory (RAM) model does not impose a strict memory limit, but in a streaming model, memory is limited.

A streaming algorithm processes a large dataset sequentially, often in one or a few passes, while maintaining a compact summary that fits in its limited memory. These summaries are designed to recover a high-quality solution for the entire input.
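To illustrate the idea, here is a generic sketch (not the team's algorithm): it reads a stream of numbers one at a time and keeps only the k largest values seen so far. The small heap is the compact summary, so memory use never grows with the length of the stream.

```python
import heapq

def top_k(stream, k):
    """Single-pass streaming sketch: keep only the k largest values seen.

    The min-heap `summary` never holds more than k items, so memory stays
    bounded no matter how long the stream is.
    """
    summary = []  # min-heap of at most k values
    for value in stream:
        if len(summary) < k:
            heapq.heappush(summary, value)
        elif value > summary[0]:
            heapq.heapreplace(summary, value)  # evict the current smallest
    return sorted(summary, reverse=True)

# Usage: the "stream" could be a generator reading from a file or an
# instrument, so the full dataset is never held in memory at once.
print(top_k(iter([5, 1, 9, 3, 7, 8, 2]), k=3))  # -> [9, 8, 7]
```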

The poly-streaming approach allows for massive datasets to be analyzed while using small amounts of memory. (Animation by Nathan Johnson | Pacific Northwest National Laboratory)

“The poly-streaming model generalizes streaming to many processors and streams,” said Ullah. “Each processor maintains a small local summary of what it sees. Processors communicate as needed, which helps them choose summaries of good quality while limiting the number of passes. With suitably designed algorithms, the combined summaries suffice to obtain a high-quality solution.”

Ullah formulated the poly-streaming model as part of his PhD thesis in collaboration with Ferdous and Pothen. Within this framework, algorithms can jointly optimize time via parallel computing and space via data summarization. The researchers demonstrated its effectiveness using the maximum weight matching problem in graphs, which is a classical optimization problem with many applications.
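The simplified sketch below suggests how several streams and local summaries might fit together. It is an illustrative toy, not the published poly-streaming matching algorithm: the example streams, the greedy rule, and the merge step are assumptions made for the illustration. Each worker builds a matching from its own stream of weighted edges, keeping at most one edge per vertex, and the small local summaries are then combined into a final matching.

```python
from concurrent.futures import ThreadPoolExecutor

def greedy_local_matching(edge_stream):
    """One worker's single pass over its own edge stream.
    An edge is kept only if neither endpoint is already matched locally,
    so the summary holds at most one edge per vertex."""
    matched, summary = set(), []
    for u, v, w in edge_stream:
        if u not in matched and v not in matched:
            matched.update((u, v))
            summary.append((u, v, w))
    return summary

def merge_summaries(summaries):
    """Combine the small local summaries into one matching,
    preferring heavier edges."""
    edges = sorted((e for s in summaries for e in s), key=lambda e: -e[2])
    matched, matching = set(), []
    for u, v, w in edges:
        if u not in matched and v not in matched:
            matched.update((u, v))
            matching.append((u, v, w))
    return matching

# Hypothetical setup: four streams of (u, v, weight) edges, one per worker.
streams = [
    [(1, 2, 5.0), (3, 4, 1.0)],
    [(2, 3, 4.0), (4, 5, 2.0)],
    [(1, 5, 3.0)],
    [(5, 6, 6.0)],
]
with ThreadPoolExecutor() as pool:
    local_summaries = list(pool.map(greedy_local_matching, streams))
print(merge_summaries(local_summaries))
```

A real poly-streaming algorithm coordinates the processors and handles edge weights more carefully to guarantee both solution quality and a small number of passes.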

Making large datasets manageable

“The size of data is getting larger and larger,” said Ferdous, a staff scientist and past Linus Pauling Fellow at PNNL. “When the datasets get too large, we can’t easily store them on a computer. At the same time, we need to solve larger and larger problems involving these datasets.”

One solution has been to use supercomputers, such as the exascale machines developed by the Department of Energy (DOE). However, some problems are too large even for supercomputers, and the large number of memory accesses increases the time needed to solve them. Streaming the datasets sidesteps these storage issues: only a small summary of the data is kept, so far less memory is needed to analyze the dataset.

“While this doesn’t give an exact solution, we can prove that the approximations are accurate; they are a factor of two off the best solution, in the worst case,” said Pothen, professor of computer science at Purdue University and Ullah’s PhD advisor.
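To see what a factor-of-two gap looks like, consider a toy path graph and a simple one-pass greedy rule (again, an illustration rather than the team's actual algorithm): if the middle edge of the path arrives first, greedy keeps a single edge, while the best matching uses the two outer edges.

```python
def greedy_matching(edge_stream):
    """Simple one-pass greedy: accept an edge if neither endpoint is matched."""
    matched, matching = set(), []
    for u, v in edge_stream:
        if u not in matched and v not in matched:
            matched.update((u, v))
            matching.append((u, v))
    return matching

# A path a-b-c-d. The middle edge arrives first, so greedy keeps only it,
# while the best matching, {("a", "b"), ("c", "d")}, has two edges.
print(greedy_matching([("b", "c"), ("a", "b"), ("c", "d")]))  # [("b", "c")]
```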

Making extreme-scale data ready for AI

Optimization problems such as maximum weight matching have many applications. One of them is in AI, where data may need to be denoised and reduced in size before it can be analyzed. Maximum weight matching can play a crucial role in preparing data for AI tasks by identifying significant subsets of the data. This preprocessing step makes the data more relevant and improves accuracy in reasoning tasks.

Making large datasets “AI-ready” can be a challenge. Taking raw data and running it through an AI model without first denoising the data or reducing its size may lead to inaccurate results or make the computations infeasible.  

“The poly-streaming model has the ability to process extreme-scale data,” said Ferdous. “Our model can act as the mediator between the raw data and the AI model by processing and making sense of the data before the AI model analyzes it further.”

Looking ahead, the research team sees their model as especially well suited to processing the large volumes of data produced by DOE’s scientific user facilities and preparing them for AI analysis, bridging the gap between AI and instrumentation.

The theoretical contributions, practical performance, and applicability of the poly-streaming model were recognized with the best paper award at the recent European Symposium on Algorithms, held September 15–17, 2025, in Warsaw, Poland. This work was supported by the Advanced Scientific Computing Research program of the DOE Office of Science and by PNNL’s Linus Pauling Distinguished Postdoctoral Fellowship.

###

About PNNL

Pacific Northwest National Laboratory draws on its distinguishing strengths in chemistry, Earth sciences, biology and data science to advance scientific knowledge and address challenges in energy resiliency and national security. Founded in 1965, PNNL is operated by Battelle and supported by the Office of Science of the U.S. Department of Energy. The Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, visit the DOE Office of Science website. For more information on PNNL, visit PNNL's News Center. Follow us on Twitter, Facebook, LinkedIn and Instagram.