Journal Article

NukeLM: Pre-Trained and Fine-Tuned Language Models for the Nuclear and Energy Domains

Abstract

Natural language processing (NLP) tasks (text classification, named entity recognition, etc.) have seen substantial improvements over the last few years. These gains are largely due to models such as BERT, which achieve deep knowledge transfer by pre-training a large model on general text and then fine-tuning it on specific tasks. The BERT architecture has shown even better performance on domain-specific tasks when the model is pre-trained using domain-relevant texts. Inspired by these recent advancements, we developed NukeLM, a nuclear-domain BERT model pre-trained on 1.5 million abstracts from the DOE Office of Scientific and Technical Information (OSTI) database. The NukeLM model is then fine-tuned to classify research articles either into binary classes (related to the nuclear fuel cycle (NFC) or not) or into multiple categories related to the subject of the article. We show that continued pre-training of a BERT-style architecture prior to fine-tuning improves performance on both article classification tasks. This capability is critical for properly triaging manuscripts, a necessary step toward better understanding the citation networks of researchers publishing in the nuclear space and uncovering new areas of research in the nuclear (or nuclear-relevant) domain.
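To make the two-stage workflow described above concrete, the following is a minimal sketch of continued masked-language-model pre-training on unlabeled domain abstracts followed by fine-tuning the adapted encoder for article classification, written against the Hugging Face transformers and datasets APIs. The base checkpoint, directory names, toy data, and hyperparameters are illustrative placeholders, not the authors' released code or the settings used to train NukeLM.

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE = "bert-base-uncased"  # placeholder base checkpoint, not necessarily the one used for NukeLM
tokenizer = AutoTokenizer.from_pretrained(BASE)

# --- Stage 1: continued pre-training on unlabeled domain abstracts (MLM objective) ---
abstracts = Dataset.from_dict(
    {"text": ["Example OSTI abstract about uranium enrichment ...",
              "Example OSTI abstract about reactor fuel fabrication ..."]}
)
tokenized = abstracts.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)
mlm_model = AutoModelForMaskedLM.from_pretrained(BASE)
mlm_trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="nukelm-pretrain", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
mlm_trainer.train()
mlm_trainer.save_model("nukelm-pretrain")

# --- Stage 2: fine-tune the domain-adapted encoder for article classification ---
# Binary NFC / non-NFC labels shown here; the multi-category task only changes num_labels.
labeled = Dataset.from_dict(
    {"text": ["Abstract related to the nuclear fuel cycle ...",
              "Abstract on an unrelated energy topic ..."],
     "label": [1, 0]}
)
labeled = labeled.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "nukelm-pretrain", num_labels=2  # classification head is newly initialized
)
clf_trainer = Trainer(
    model=clf_model,
    args=TrainingArguments(output_dir="nukelm-classifier", num_train_epochs=1),
    train_dataset=labeled,
    tokenizer=tokenizer,  # enables dynamic padding of variable-length batches
)
clf_trainer.train()

In this sketch, the classification model is loaded from the directory produced by the pre-training stage so that the fine-tuned classifier starts from the domain-adapted weights rather than the generic base checkpoint, mirroring the continued pre-training strategy the abstract evaluates.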

Published: April 25, 2025

Citation

Burke L.M., K. Pazdernik, D.C. Fortin, B.A. Wilson, R. Goychayev, and J. Mattingly. 2021. NukeLM: Pre-Trained and Fine-Tuned Language Models for the Nuclear and Energy Domains. ESARDA Bulletin 2021, no. 63:30-40. PNNL-SA-159410. doi:10.3011/ESARDA.IJNSNP.2021.9