Predictive models for tweet deletion have been a relatively unexplored area of Twitter-related computational research. We first approach the deletion of tweets as a spam detection problem, applying a small set of handcrafted features to improve upon the current state of the art in predicting deleted tweets. Next, we apply our approach to a dataset of deleted tweets that better reflects the current deletion rate. Since tweets are deleted for reasons beyond just the presence of spam, we apply topic modeling and text embeddings to capture the semantic content that can lead to tweet deletion. Our goal is to create an effective model that has a low-dimensional feature space and is also language-independent. A lean model would be computationally advantageous when processing high volumes of Twitter data, which can reach 9,885 tweets per second. Our results show that a small set of spam-related features, combined with word topics and character-level text embeddings, yields the best F1 score when trained with a random forest model. The highest precision for the deleted-tweet class is achieved by a modification of paragraph2vec that captures author identity.
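As a rough illustration of what a small, language-independent feature space could look like, the sketch below combines a few handcrafted spam-style counts with hashed character trigram counts. It rests on assumptions throughout: the abstract does not list the actual features, so the four counts in `spam_features` are hypothetical, and the hashed trigrams in `char_ngram_vector` are a simple stand-in for, not an implementation of, the character-level embeddings and topic features described above. All function names are invented for this example.

```python
import re
import zlib

URL_RE = re.compile(r"https?://\S+")


def spam_features(text):
    """Hypothetical spam-style counts; not the feature set used in the paper."""
    tokens = text.split()
    return [
        len(URL_RE.findall(text)),               # number of URLs
        sum(t.startswith("#") for t in tokens),  # number of hashtags
        sum(t.startswith("@") for t in tokens),  # number of mentions
        len(text),                               # length in characters
    ]


def char_ngram_vector(text, n=3, dim=16):
    """Hashed character n-gram counts (a stand-in for learned embeddings)."""
    vec = [0] * dim
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1
    return vec


def featurize(text, dim=16):
    """Concatenate both groups into one fixed-length feature vector."""
    return spam_features(text) + char_ngram_vector(text, dim=dim)


example = "Win a free phone! http://example.com #giveaway @user"
features = featurize(example)
```

Because both feature groups operate on raw characters and whitespace-separated tokens, nothing here depends on a language-specific tokenizer or vocabulary, and the vector length stays fixed regardless of corpus size. A vector like this could then be passed to any standard classifier, such as a random forest.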
Revised: December 2, 2016
Published: February 29, 2016
Citation
Potash P.J., E.B. Bell, and J.J. Harrison. 2016. Using Topic Modeling and Text Embeddings to Predict Deleted Tweets. In AAAI Workshop on Incentive and Trust in E-Communities (WIT-EC'16), February 12–17, 2016, Phoenix, Arizona. Palo Alto, California: Association for the Advancement of Artificial Intelligence. PNNL-SA-114673.