Graph mining is an important data analysis methodology, but it struggles as the input graph size increases. The scalability and usability challenges posed by such large graphs make it imperative to sample the input graph and reduce its size. The critical challenge in sampling is to identify the appropriate algorithm to ensure that the resulting analysis does not suffer heavily from the data reduction. Predicting the expected performance degradation for a given graph and sampling algorithm is also useful. In this paper, we present different sampling approaches for graph mining applications such as Frequent Subgraph Mining (FSM) and Community Detection (CD). We explore graph metrics such as PageRank, Triangles, and Diversity to sample a graph and conclude that, for heterogeneous graphs, Triangles and Diversity perform better than degree-based metrics.
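To make the metric-based approach concrete, the sketch below ranks nodes by a chosen per-node metric and keeps the top-ranked fraction as the sample. It is a minimal illustration in NetworkX under assumed parameters (the metric_sample helper, the keep_frac retention fraction, and the specific metric choices are ours for illustration), not the algorithms evaluated in the paper.

import networkx as nx

def metric_sample(G, metric="pagerank", keep_frac=0.3):
    # Keep the top `keep_frac` of nodes ranked by a per-node metric and
    # return the induced subgraph. Sketch only; the paper's actual
    # sampling algorithms may differ.
    if metric == "pagerank":
        scores = nx.pagerank(G)          # importance of each node
    elif metric == "triangles":
        scores = nx.triangles(G)         # triangles each node participates in
    elif metric == "degree":
        scores = dict(G.degree())        # baseline degree-based metric
    else:
        raise ValueError(f"unknown metric: {metric}")

    k = max(1, int(keep_frac * G.number_of_nodes()))
    top_nodes = sorted(scores, key=scores.get, reverse=True)[:k]
    return G.subgraph(top_nodes).copy()

# Example usage on a synthetic graph
G = nx.barabasi_albert_graph(1000, 3)
sample = metric_sample(G, metric="triangles", keep_frac=0.2)
print(sample.number_of_nodes(), sample.number_of_edges())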
We also present two new sampling variations for targeted graph mining applications. We present empirical results showing that knowledge of the target application, along with input graph properties, can be used to select the best sampling algorithm. We also conclude that performance degradation is an abrupt, rather than gradual, phenomenon as the sample size decreases. We present empirical results showing that the performance degradation follows a logistic function.
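For reference, a logistic degradation curve in the sample size s can be written in its standard form (the symbols L, k, and s_0 below are a generic parameterization, not notation taken from the paper):

\[ Q(s) = \frac{L}{1 + e^{-k\,(s - s_0)}} \]

Here L is the quality achieved on the full graph, s_0 is the sample size at which quality falls to half of L, and k governs how sharp the transition is; an abrupt, rather than gradual, drop in quality corresponds to a large k.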
Revised: April 18, 2018 | Published: December 11, 2017
Citation
Purohit S., S. Choudhury, and L.B. Holder. 2017. Application-Specific Graph Sampling for Frequent Subgraph Mining and Community Detection. In IEEE International Conference on Big Data (Big Data 2017), December 11-14, 2017, Boston, Massachusetts, 1000-1005. Piscataway, New Jersey: IEEE. PNNL-SA-128679. doi:10.1109/BigData.2017.8258022