Toward Parallel Document Clustering

September 1, 2011

Conference Paper

Abstract

A key challenge in the automated clustering of documents in large text corpora is the high cost of comparing documents in a multimillion-dimensional document space. The Anchors Hierarchy is a fast data structure and algorithm for localizing data based on a distance metric that obeys the triangle inequality. The algorithm strives to minimize the number of distance calculations needed to cluster the documents into “anchors” around reference documents called “pivots”. We extend the original algorithm to increase the amount of available parallelism and consider two implementations: a complex data structure that affords efficient searching, and a simple data structure that requires repeated sorting. The sorting implementation is integrated with a text corpora “Bag of Words” program, and initial performance results of an end-to-end document processing workflow are reported.
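To illustrate how the triangle inequality can reduce the number of distance calculations, the following is a minimal serial sketch of anchor construction. It is not the parallel extension described in the paper. It assumes Euclidean distance on small tuples in place of document vectors, and the names `build_anchors`, `owned`, and `half_gap` are hypothetical. Each anchor keeps its points sorted farthest-first from its pivot. When a new pivot is chosen, the scan of an existing anchor stops as soon as a point lies closer to its current pivot than half the distance between the two pivots, because neither that point nor any later one can be nearer to the new pivot.

```python
import math


def build_anchors(points, k, dist=math.dist):
    """Greedy anchor construction with triangle-inequality pruning (sketch)."""
    calls = 0  # number of distance evaluations actually performed

    def d(i, j):
        nonlocal calls
        calls += 1
        return dist(points[i], points[j])

    # First anchor: point 0 is the pivot and owns every other point.
    # Owned lists hold (distance_to_pivot, index), sorted farthest first.
    owned = sorted(((d(0, i), i) for i in range(1, len(points))), reverse=True)
    anchors = [{"pivot": 0, "owned": owned}]

    while len(anchors) < k:
        # The new pivot is the farthest point of the widest anchor.
        donor = max(anchors, key=lambda a: a["owned"][0][0] if a["owned"] else -1.0)
        if not donor["owned"]:
            break  # no points left to promote to pivots
        _, new_pivot = donor["owned"].pop(0)

        stolen = []
        for a in anchors:
            half_gap = d(new_pivot, a["pivot"]) / 2.0
            kept = []
            for pos, (dx, i) in enumerate(a["owned"]):
                if dx < half_gap:
                    # Triangle inequality: d(new, x) >= d(new, pivot) - dx > dx,
                    # so this point and all closer ones stay; stop scanning.
                    kept.extend(a["owned"][pos:])
                    break
                dn = d(new_pivot, i)
                if dn < dx:
                    stolen.append((dn, i))
                else:
                    kept.append((dx, i))
            a["owned"] = kept

        stolen.sort(reverse=True)
        anchors.append({"pivot": new_pivot, "owned": stolen})

    return anchors, calls


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
anchors, calls = build_anchors(points, k=2)
for a in anchors:
    print(a["pivot"], sorted(i for _, i in a["owned"]))
```

In this sketch the farthest-first ordering must be restored each time a new anchor collects its stolen points, which loosely corresponds to the "repeated sorting" trade-off mentioned in the abstract; that mapping is an interpretation, not a statement of the paper's implementation.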

Revised: September 9, 2011 | Published: September 1, 2011

Citation

Mogill J.A., and D.J. Haglin. 2011. Toward Parallel Document Clustering. In Proceedings of the 25th IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW 2011), May 16-20, 2011, Anchorage, Alaska, 1700-1709. Los Alamitos, California: IEEE Computer Society. PNNL-SA-77367.