May 16, 2023
Try This to Integrate Your Multi-Omics Data with Missing Values

Methods review suggests ways to use artificial intelligence and machine learning to handle missing data when integrating two or more omics approaches

An image of an open laptop computer.

(Photo by Andrea Starr | Pacific Northwest National Laboratory)

Complex interactions among biological molecules are the basis of functioning living systems. Whole disciplines such as genomics, proteomics, and metabolomics have emerged in pursuit of understanding these biomolecules. As it becomes cheaper and easier to obtain multiple types of omics data, it is sometimes assumed that all possible data have been collected from a sample or system. Even when that is the case, integrating the data to provide a complete view of the system under study is challenging. Several factors make integration difficult in large data sets: the ratio of measured molecules to samples, the uneven distribution of measurements (not all molecules are measured equally), and the inherent complexity of biological data.

Artificial intelligence (AI) and machine learning (ML) can help integrate these various types of data, but missing data must also be considered. Recently, a group of Pacific Northwest National Laboratory (PNNL) scientists published a paper describing a variety of data integration approaches that simultaneously address the challenge of missing data. Most other AI/ML integration approaches, by contrast, simply discard missing data and consider only samples for which all omics of interest are completely observed.
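To make the cost of discarding incomplete samples concrete, here is a minimal Python sketch of that "complete samples only" filtering. The sample IDs, the three omics layers, and the `complete_cases` helper are all hypothetical and invented for illustration; this is not code or data from the PNNL review.

```python
# Hypothetical toy data: which samples were measured in each omics layer.
# Sample IDs and layer contents are invented for illustration only.
measured = {
    "proteomics": {"s1", "s2", "s3", "s5"},
    "metabolomics": {"s1", "s3", "s4", "s5", "s6"},
    "transcriptomics": {"s1", "s2", "s3", "s4", "s6"},
}


def complete_cases(layers):
    """Return the samples that were observed in every omics layer."""
    return set.intersection(*layers.values())


all_samples = set.union(*measured.values())
kept = complete_cases(measured)
dropped = all_samples - kept

print(f"kept {len(kept)} of {len(all_samples)} samples: {sorted(kept)}")
print(f"discarded: {sorted(dropped)}")
```

In this toy case only two of six samples survive the filter, even though every discarded sample was measured in at least one layer. That loss is the motivation for integration methods that can use partially observed samples.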

The authors provide context and suggestions for data treatment and describe how AI and ML are increasingly being used to address the problems of integration, especially when integrating datasets with missing omics data. The authors are a group of data scientists, computational biologists, and biomedical scientists, and they undertook the review out of necessity.

According to PNNL data scientist Lisa Bramer, “AI and ML methods show great potential to aid in overcoming data integration challenges. However, unlike many genomics data types, data that comes from mass spectrometry instruments, such as proteomics, often contain missing values. These missing values are a result of biological and non-biological processes, so removing missing values results in throwing away potentially useful data, and the ability to handle missing data is essential.”

They also note that, even with AI and ML, the methods described have limitations. AI/ML approaches typically require large sample sizes, which are often not available in multi-omics studies. Not all molecules are measured, and not all methods capture all molecules. In short, the paper presents multiple solutions together, providing a decision matrix for researchers struggling to integrate omics data while addressing missing data.

“Data integration is a rapidly expanding area of research,” said PNNL statistician Javier Flores, “and sifting through the ever-expanding list of methods for approaches that specifically address one of the biggest challenges when integrating data—missing data—is difficult. With this review we are providing a snapshot of the current state of research, conveniently describing methods for integration that address the missing data problem, while still providing readers a sense of their limitations and avenues for future improvements.”

Multi-omics data are a powerful tool for examining biomolecules. Until measurements and methods capture all molecules, AI and ML can help integrate complex data sets that have missing data, for a better understanding of whole systems.

Published: May 16, 2023

Flores J.E., D.M. Claborne, Z.D. Weller, B.M. Webb-Robertson, K.M. Waters, and L.M. Bramer. 2023. "Missing Data in Multi-Omics Integration: Recent Advances Through Artificial Intelligence." Frontiers in Artificial Intelligence 6. PNNL-SA-179732. DOI:10.3389/frai.2023.1098308