The emergence and increasing prevalence of social media, such as internet forums, weblogs (blogs), and wikis, have created a new opportunity to measure public opinion, attitudes, and social structures. A major challenge in leveraging this information is isolating the content and metadata in weblogs, because there is no standard, universally supported, machine-readable format for presenting them. We present two algorithms for isolating this information. The first uses web block classification, in which each node in the Document Object Model (DOM) for a page is classified according to one of several predefined attributes from a common blog schema. The second uses a set of heuristics to select web blocks. Both algorithms perform at a level suitable for initial use, validating this approach for isolating content and metadata from blogs. The resulting data serve as a starting point for analytical work on the content and substance of collections of weblog pages.
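To make the second, rule-based idea concrete, the following is a minimal sketch of heuristic web block selection. It is not the authors' algorithm: the block tags, the `HINTS` keywords, the link-text penalty, and the bonus weight are all assumptions chosen for illustration, and the sketch assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Assumed for illustration only; the paper's actual rules are not given here.
BLOCK_TAGS = {"div", "article", "section", "td"}
HINTS = ("post", "entry", "content")  # hypothetical id/class keywords


class BlockScorer(HTMLParser):
    """Collects text per innermost block element, tracking link text."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open blocks, innermost last
        self.blocks = []      # blocks that have been closed
        self.link_depth = 0   # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag in BLOCK_TAGS:
            a = dict(attrs)
            label = f"{a.get('id') or ''} {a.get('class') or ''}"
            self.stack.append(
                {"label": label.strip().lower(), "text": [], "link_chars": 0}
            )

    def handle_endtag(self, tag):
        if tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in BLOCK_TAGS and self.stack:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        block = self.stack[-1]  # credit text to the innermost open block
        block["text"].append(text)
        if self.link_depth:
            block["link_chars"] += len(text)


def score(block):
    """Favor long text, penalize link-heavy blocks, reward hinted labels."""
    chars = sum(len(t) for t in block["text"])
    bonus = 50 if any(h in block["label"] for h in HINTS) else 0
    return chars - 2 * block["link_chars"] + bonus


def select_content_block(html):
    """Return the highest-scoring block, or None if no block was found."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    return max(parser.blocks, key=score, default=None)
```

Calling `select_content_block(page_html)` returns a dictionary whose `text` list holds the candidate post content. A classification-based variant would replace the hand-written `score` function with a trained model applied to per-node features.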
Revised: February 3, 2014
Published: September 4, 2011
Citation
Marshall E.J., and E.B. Bell. 2011. ISOLATING CONTENT AND METADATA FROM WEBLOGS USING CLASSIFICATION AND RULE-BASED APPROACHES. In Proceedings of the IADIS International Conferences: Web Based Communities and Social Media 2011, Collaborative Technologies 2011 and Internet Applications and Research 2011, July 22-24, 2011, Rome, Italy, edited by P Kommers, N Bessis and P Isaias, 187-191. Internet Resource: IADIS Press. PNNL-SA-79748.