Machine learning-based prediction of material properties is often hampered by the lack of sufficiently large training datasets. Most property measurement data are embedded in the scientific literature, and the ability to extract these data automatically is essential to support the development of reliable property prediction methods. In this work, we describe a methodology for an automatic property extraction framework using material solubility as the target property. We create an annotated dataset containing tags for solubility-related entities using a combination of regular expressions and manual tagging. We then compare five entity recognition models leveraging both token-level and span-level architectures on the task of classifying solute names, solubility values, and solubility units. Additionally, we explore a novel pretraining approach that leverages automated chemical name and quantity extraction tools to generate large datasets that do not rely on intensive manual effort. Finally, we perform an analysis to identify the causes of classification errors.
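As a rough illustration of the regular-expression side of the annotation step, the sketch below tags solubility values and units in a sentence. It is not the authors' implementation: the pattern, the unit list, the label names, and the function name `tag_solubility` are all assumptions made for this example. Solute names are left out because they would typically need manual tagging or a dedicated chemical name recognizer.

```python
import re

# Hypothetical pattern: a number followed by a concentration unit.
# The unit list is an illustrative assumption, not the paper's actual rules.
SOLUBILITY_RE = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mg/mL|g/L|g/100\s?mL|mol/L|mM|M)\b"
)


def tag_solubility(sentence):
    """Return (label, text, start, end) spans for matched values and units."""
    spans = []
    for m in SOLUBILITY_RE.finditer(sentence):
        spans.append(("VALUE", m.group("value"), m.start("value"), m.end("value")))
        spans.append(("UNIT", m.group("unit"), m.start("unit"), m.end("unit")))
    return spans


if __name__ == "__main__":
    example = "The solubility of caffeine in water is 21.6 mg/mL at 25 C."
    for label, text, start, end in tag_solubility(example):
        print(label, repr(text), start, end)
```

Character offsets are returned alongside each label so that the spans could later be aligned with tokens for a token-level model or used directly by a span-level one. A rule this simple will also match numbers that are not solubilities, which is one reason a manual tagging pass remains necessary.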
Published: April 27, 2022
Citation
Panapitiya G.U., F.C. Parks, J.P. Sepulveda, and E.G. Saldanha. 2021. Extracting Material Property Measurement Data from Scientific Articles. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), November 7-11, 2021, Online and Punta Cana, Dominican Republic, 5393-5402. Stroudsburg, Pennsylvania: Association for Computational Linguistics. PNNL-SA-162563. doi:10.18653/v1/2021.emnlp-main.438