Foundation Models of Scientific Knowledge for Chemistry: Opportunities, Challenges and Lessons Learned

October 13, 2022

Conference Paper

Abstract

Foundation models pre-trained on large corpora demonstrate significant gains across many natural language processing tasks and domains, e.g., law, healthcare, and education. However, only limited efforts have investigated the opportunities and limitations of applying these powerful models to science and security applications. In this work, we develop foundation models of scientific knowledge for chemistry to augment scientists with the advanced ability to perceive and reason at a scale previously unimagined. Specifically, we build large-scale (1.47B parameter) general-purpose models for chemistry that can be effectively used to perform a wide range of in-domain and out-of-domain tasks. Evaluating these models in a zero-shot setting, we analyze the effect of model and data scaling, knowledge depth, and temporality on model performance in the context of model training efficiency. Our novel findings demonstrate that (1) model size significantly contributes to task performance when evaluated in a zero-shot setting; (2) data quality (aka diversity) affects model performance more than data quantity; (3) similarly, unlike previous work (Luu et al., 2021), temporal order of the documents in the corpus boosts model performance only for specific tasks, e.g., SciQ; and (4) models pre-trained from scratch perform better on in-domain tasks than those tuned from general-purpose models like OpenAI’s GPT-2.

Published: October 13, 2022

Citation

Horawalavithana Y.S., E.M. Ayton, S. Sharma, S.A. Howland, M. Subramanian, S.W. Vasquez, and R.J. Cosbey, et al. 2022. Foundation Models of Scientific Knowledge for Chemistry: Opportunities, Challenges and Lessons Learned. In Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models, May 2022, Virtual and Dublin, Ireland, 160–172. Stroudsburg, Pennsylvania: Association for Computational Linguistics. PNNL-SA-171279. doi:10.18653/v1/2022.bigscience-1.12

Research topics

Chemistry

Artificial Intelligence
