August 7, 2024
Conference Paper

GOLEM: GOld standard for Learning and Evaluation of Motifs

Abstract

Motifs are distinctive, recurring, widely used idiom-like words or phrases, often originating in folklore, whose meaning is anchored in a narrative and which serve as communicative devices across a wide range of media, including news, literature, and propaganda. Many motifs concisely imply a large constellation of culturally relevant information, and their broad usage suggests their cognitive importance as touchstones of cultural knowledge. As such, their detection is a step towards culturally aware natural language processing. We present GOLEM (GOld standard for Learning and Evaluation of Motifs), a dataset of English news articles, opinion pieces, and broadcast transcripts annotated for motific information. The dataset identifies 25,737 motif candidates across 34 motif types drawn from three cultural or national groups: Jewish, Irish, and Puerto Rican. The dataset contains 2,024,141 words split into 25,737 text snippets drawn from 8,073 articles. Each motif candidate is labeled according to a scheme that identifies the type of usage (motific, referential, eponymic, or unrelated), resulting in 1,743 actual motific instances in the data. Annotation was performed by individuals identifying as members of each group and achieved a Fleiss' kappa (κ) greater than 0.55. In addition to the data, we demonstrate that classifying the candidate type is a challenging task for Large Language Models (LLMs) using a few-shot approach: recent models such as T5, FLAN-T5, GPT-2, and Llama 2 (7B) achieved at best 41% accuracy, against a majority-class accuracy of 41% and an average chance accuracy of 27%. These data will support the development of new models and approaches for detecting (and reasoning about) motific information in text.
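
For readers unfamiliar with the agreement statistic quoted in the abstract, the following is a minimal sketch of how Fleiss' kappa can be computed for a labeling scheme like the one described (motific, referential, eponymic, unrelated). It is not the authors' code: the function name fleiss_kappa, the input layout (one list of labels per candidate, with the same number of annotators for every candidate), and the toy ratings are all assumptions made for illustration and are not drawn from GOLEM.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    `ratings` is a list of lists; ratings[i] holds the labels that the raters
    assigned to item i. This layout is an assumption for this sketch.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    labels = {label for item in ratings for label in item}

    # Observed agreement: mean over items of the fraction of agreeing rater pairs.
    per_item = [
        (sum(c[label] ** 2 for label in labels) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement: sum of squared overall label proportions.
    total = n_items * n_raters
    p_expected = sum((sum(c[label] for c in counts) / total) ** 2 for label in labels)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one label is ever used")
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Made-up ratings from three hypothetical annotators; not GOLEM data.
    toy = [
        ["motific", "motific", "motific"],
        ["referential", "referential", "unrelated"],
        ["eponymic", "eponymic", "eponymic"],
        ["unrelated", "unrelated", "referential"],
    ]
    print(f"Fleiss' kappa on toy data: {fleiss_kappa(toy):.3f}")
```

Kappa corrects raw agreement for the agreement expected if annotators chose labels at random according to the overall label frequencies, which is why it is reported rather than a plain percentage.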

Published: August 7, 2024

Citation

Yarlott W.H., A. Acharya, D. Castro Estrada, D. Gomez, and M.A. Finlayson. 2024. GOLEM: GOld standard for Learning and Evaluation of Motifs. In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 20-24, 2024, Torino, Italy, edited by N. Calzolari, et al., 7801–7813. Kerrville, Texas: Association for Computational Linguistics. PNNL-SA-191514.
