August 20, 2018
Conference Paper

RuSentiment: An Enriched Sentiment Analysis Dataset for Social Media in Russian

Abstract

This paper presents RuSentiment, a new dataset for sentiment analysis of social media posts in Russian, and a new set of comprehensive annotation guidelines that are extensible to other languages. RuSentiment is currently the largest in its class for Russian, with 30,521 posts annotated with Cohen’s kappa of 0.58 (3 annotations per post). To diversify the dataset, 6,749 posts were pre-selected with an active learning-style strategy. We report baseline classification results, and release the best-performing embeddings trained on 3.2B tokens in Russian VKontakte posts.

Revised: June 28, 2019 | Published: August 20, 2018

Citation

Rogers A., A. Romanov, A. Rumshisky, S. Volkova, M. Gronas, and A. Gribov. 2018. RuSentiment: An Enriched Sentiment Analysis Dataset for Social Media in Russian. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018), August, 2018, Santa Fe, NM, edited by E.M. Bender, L. Derczynski, P. Isabelle, 755–763. Stroudsburg, Pennsylvania: Association for Computational Linguistics. PNNL-SA-134041.