Using Artificial Intelligence to Mine Protein Interactions Within Biological Literature
Computing and artificial intelligence aid scientists who seek to understand function-driven design and control of biological systems
The Science
The volume of biological scientific literature has expanded beyond our ability to digest it in its entirety. Using artificial intelligence (AI), scientists can pull information from large bodies of publications and generate summary tables that can suggest answers to specific research questions. In this project, computational scientists compared several AI tools to evaluate their ability to identify related proteins, as described in the literature. This is a significant step as researchers learn more about phenotypes and how to use them to our advantage in health and biomanufacturing.
The Impact
Over the years, scientists have identified roughly 245 million proteins. Of particular interest to researchers are protein-protein interactions that result in complex biological functions. For those who want to identify all protein interactions relevant to their specific experiment, the task is insurmountable by hand. However, AI can help find these interactions, homing in on protein-protein “needles” in a “haystack” of proteins. But as the number of AI tools continues to expand, scientists need guidelines that help them decide which ones are best suited to their experiment.
In this paper, the authors provide such guidelines by finding and describing consistent patterns among types of AI tools for mining protein-protein interactions in literature. Researchers can use this information to find real molecular targets to manipulate and create phenotypes, which is a goal of Pacific Northwest National Laboratory’s Predictive Phenomics Initiative. Mining the literature with AI is also much faster than conducting many manual experiments.
Summary
Understanding how proteins interact with other proteins is crucial to understanding how they influence functions and to making other scientific discoveries. For example, the study of interactions between viral and human proteins contributed to the development of vaccines to stimulate a phenotype in humans: the ability to combat SARS-CoV-2. While some information about protein-protein interactions is stored in publicly accessible databases, significant amounts of useful information are contained within the continuously growing body of literature, which may include novel, high-impact findings. To uncover this information, researchers can harness the power of AI, guided by prior knowledge of which tool is best suited for their needs.
The number of AI models is growing, but testing each one is time-consuming. Instead, this project team tested groups of text mining models, including large language models (e.g., those in the ChatGPT family) and other methods. The researchers identified common patterns across these models and made recommendations based on the available context (titles and abstracts only, or full papers) and project goals. For example, this study found that large language model methods are well-suited for discovery-based studies in smaller pools of literature. Overall, the framework outlined in the paper can guide scientists in applying AI models to extract biological interactions from literature.
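To give a sense of what mining protein-protein interactions from text can look like at its simplest, the sketch below counts protein names that appear together in the same sentence. This is an illustrative co-occurrence baseline only, not the method or any of the tools evaluated in the paper. The protein lexicon, the example text, and the function names are all hypothetical and chosen for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical protein lexicon (assumption); a real study would rely on
# a curated resource rather than a hand-written set of names.
PROTEINS = {"ACE2", "TMPRSS2", "SPIKE"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def cooccurring_pairs(text, lexicon=PROTEINS):
    """Count unordered protein pairs mentioned in the same sentence."""
    counts = Counter()
    for sentence in split_sentences(text):
        # Uppercase the tokens so matching against the lexicon is
        # case-insensitive.
        tokens = {t.upper() for t in re.findall(r"[A-Za-z0-9]+", sentence)}
        found = sorted(tokens & lexicon)
        for pair in combinations(found, 2):
            counts[pair] += 1
    return counts


# Made-up example text standing in for a title-and-abstract input.
abstract = (
    "The spike protein binds ACE2 on the host cell. "
    "TMPRSS2 then primes spike for membrane fusion. "
    "ACE2 expression varies across tissues."
)

pairs = cooccurring_pairs(abstract)
for (a, b), n in sorted(pairs.items()):
    print(a, b, n)
```

A co-occurrence count like this only flags candidate pairs. It cannot tell whether a sentence actually asserts an interaction, which is part of why comparing tool types against the available context and the project goals matters.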
Contact
Lisa Bramer, lisa.bramer@pnnl.gov, PNNL
Funding
The research described in this paper is part of the Predictive Phenomics Initiative at Pacific Northwest National Laboratory and is conducted under the Laboratory Directed Research and Development Program. Pacific Northwest National Laboratory is a multiprogram national laboratory operated by Battelle for the U.S. Department of Energy under Contract No. DE-AC05-76RL01830.
Published: November 19, 2024
Degnan D.J., C.W. Strauch, M.Y. Obiri, E.D. VonKaenel, S.J. Kim, J. Kershaw, D.L. Novelli, K. Pazdernik, and L.M. Bramer. 2024. "Protein-Protein Interaction Networks Derived from Classical and Machine Learning-Based Natural Language Processing Tools." Journal of Proteome Research. https://doi.org/10.1021/acs.jproteome.4c00535