Novel Computing Tool Learns the Language of Chemistry
ChemReasoner suggests novel catalysts by combining large language models and deep structural knowledge
Scientists continually strive to develop newer, better catalysts to carry out chemical reactions, typically through a trial-and-error process. Researchers from Pacific Northwest National Laboratory (PNNL), the University of Illinois at Urbana-Champaign, and Microsoft created ChemReasoner to make this process easier. Combining the power of generative artificial intelligence (AI) with computational chemistry, the researchers taught a large language model (LLM) the “language” of chemistry to help identify promising strategies for catalyst development by quickly synthesizing decades of chemical knowledge. Lead author Henry Sprueill, a data scientist at PNNL, presented this research at the International Conference on Machine Learning on July 25 in Vienna, Austria, and at the 2024 Accelerate Conference on August 8 in Vancouver, Canada.
Catalysts have transformed many industries, from enabling the upcycling of plastics to producing sustainable aviation fuel. The overarching goal of ChemReasoner is to enable the discovery of new catalyst materials and energy-efficient processes.
“For that goal, we need to be able to design a material that is selective toward specific chemical reactions,” said Sprueill. “And this means that we have to make very in-depth decisions about what kinds of elements we’d want to include in that structure—and that’s a very difficult problem.”
Learning the language of chemistry
When the project began a year and a half ago, the team sought to replicate the experimental process that scientists use on a daily basis.
“I started by talking with catalysis researchers to get an understanding of their challenges when designing experiments,” said Sutanay Choudhury, primary architect of ChemReasoner.
Through those conversations, Choudhury learned that PNNL catalysis experts Mariefel Olarte and Udishnu Sanyal scoured the scientific literature to find information about a new catalyst, then ran experiments to see how novel catalysts would carry out a reaction of interest.
Choudhury then recruited computer scientist Khushbu Agarwal and Sprueill to help automate the literature search process and combine it with experimental knowledge in the form of an LLM. The challenge, however, was that generic LLMs don’t speak the same “language” as catalysis researchers.
“Today’s commercially available LLMs do not understand chemistry well enough to propose new catalyst structures or steer reactions toward preferred pathways,” said Choudhury. “When we started this project, we got a lot of pushback from scientists saying that output from LLMs, such as ChatGPT, sounds more like content from a science encyclopedia, and their reasoning lacked the depth that is necessary to enable the discovery of the next generation of catalysts.”
Instead of giving up, the ChemReasoner team—which includes experts in catalysis, LLMs, graph neural networks, and computational chemistry—saw this challenge as an opportunity.
Strengthening outputs with simulations
While LLMs can derive information from published literature, without extra guardrails the accuracy of their outputs can vary widely, and some outputs can even be completely made up.
“The real challenge for AI in chemistry is that the training of these models on scientific data is very limited,” said Johannes Lercher, PNNL Battelle Fellow and chemist. “This is even more of a challenge in catalysis, where, even with the most high-throughput methods, less data is produced compared to other scientific disciplines.”
The ChemReasoner team needed to compensate for the LLM’s deficiency in chemistry reasoning. General LLMs, like ChatGPT, are tuned to adjust their outputs based on human input, a technique called “Reinforcement Learning from Human Feedback.”
“Our idea was, why don’t we invent a system with ‘Learning from Simulation Feedback’?” said Choudhury.
The team designed ChemReasoner so that the language model proposes new designs, then receives feedback from quantum chemical simulations to keep its accuracy in check. They paired their LLM with a graph neural network trained on chemical simulation data to create a machine learning feedback loop. The tool uses adsorption energies calculated in simulations to score different catalysts that fit within user-defined parameters. The team found that ChemReasoner outperformed GPT-4, the top LLM currently available, in catalyst queries.
Their results show that incorporating catalysis-specific concepts, such as adsorption energies and reaction energy barriers, steers the AI system toward energetically favorable, high-efficiency catalysts.
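The propose-then-score feedback loop described above can be illustrated with a minimal Python sketch. This is not ChemReasoner's code: the function names (`toy_proposer`, `toy_scorer`, `search`), the candidate labels, and the numeric scores are all hypothetical placeholders. The proposer stands in for the language model, the scorer stands in for the graph neural network trained on simulation data, and the sketch assumes, purely for illustration, that a lower score is better.

```python
from typing import Callable, List, Tuple

# Placeholder scores in arbitrary units. These are NOT real adsorption
# energies; they exist only so the loop has something to rank.
TOY_SCORES = {"cand_A": -0.2, "cand_B": -0.9, "cand_C": -0.5, "cand_D": -1.4}

History = List[Tuple[str, float]]


def toy_proposer(history: History) -> List[str]:
    """Stand-in for the language model: suggest up to two unseen candidates.

    A real proposer would read the scored history and reason about it;
    this toy only avoids repeating candidates that were already scored.
    """
    seen = {name for name, _ in history}
    return [name for name in TOY_SCORES if name not in seen][:2]


def toy_scorer(candidate: str) -> float:
    """Stand-in for the surrogate model: look up a placeholder score."""
    return TOY_SCORES[candidate]


def search(
    proposer: Callable[[History], List[str]],
    scorer: Callable[[str], float],
    rounds: int,
) -> History:
    """Alternate between proposing candidates and scoring them.

    The scored history is passed back to the proposer each round, which is
    the feedback step. Returns all scored candidates, best (lowest) first.
    """
    history: History = []
    for _ in range(rounds):
        for candidate in proposer(history):
            history.append((candidate, scorer(candidate)))
    return sorted(history, key=lambda pair: pair[1])


ranked = search(toy_proposer, toy_scorer, rounds=2)
for name, score in ranked:
    print(f"{name}: {score}")
```

Passing the proposer and scorer in as functions keeps the loop independent of any particular model, so either stand-in could be swapped for a real component without changing `search`.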
Toward autonomous science
Though ChemReasoner already shows promise, the development team has plans to expand its capabilities. The scientists are now working to experimentally validate the ChemReasoner-selected catalysts for efficiency in converting carbon dioxide to methanol.
ChemReasoner was supported by the PNNL Laboratory Directed Research and Development program’s Generative AI initiative, and through Microsoft’s Accelerate Foundation Models Research initiative and partnership with Azure Quantum Elements.
PNNL supports many different initiatives in computational chemistry and AI. The Computational and Theoretical Chemistry Institute (CTCI) at PNNL accelerates chemistry software and methods development to solve critical challenges in mission areas such as scientific discovery for sustainable energy. By expediting the integration of chemistry software development with computer science efforts, quantum computing, novel datasets, and data science tools like AI and machine learning, the CTCI is advancing the development of next-generation molecular modeling capabilities. The Center for AI @PNNL advances the frontiers of artificial intelligence to pioneer solutions that transform science, security, and energy.
Published: August 22, 2024