December 1, 2018
Journal Article

Fractal Dimension Calculation for Big Data Using Box Locality Index

Abstract

The box-counting approach for fractal dimension calculation is scaled up for big data using a data structure named the box locality index (BLI). The BLI is constructed in key-value pair style, with the key (an integer array) indexing the location of a “box” (i.e., a grid cell in the multi-dimensional space embedding a given dataset) and the value counting the number of data points inside the box (called “box occupancy”). Compared with the traditionally used E-dim tree, the BLI avoids a complex hierarchical structure and encodes only the information required by the box-counting approach for fractal dimension calculation. Moreover, the key-value pair nature of the BLI, together with the fact that the box occupancy is aggregatable, grants the box-counting approach the scalability needed for fractal dimension calculation of big data using distributed computing techniques (e.g., MapReduce and Spark). Taking advantage of the BLI, MapReduce and Spark methods for fractal dimension calculation of big data are developed, which conduct box-counting for each grid level as a cascade of MapReduce/Spark jobs in a bottom-up fashion. In an empirical validation, the MapReduce and Spark methods demonstrated good effectiveness and efficiency in fractal dimension calculation of a big synthetic dataset. In summary, this work provides an efficient solution for estimating the intrinsic dimension of big data, which is essential for many machine learning methods and data analytics.
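
To make the idea in the abstract concrete, the following is a minimal single-machine sketch in Python, not the authors' MapReduce or Spark implementation. It assumes a uniform grid whose cell size doubles at each coarser level, and it estimates the box-counting dimension as the least-squares slope of log N(r) against log(1/r), where N(r) is the number of occupied boxes of size r. The function names (`build_bli`, `coarsen`, `box_counts`, `box_counting_dimension`) and the doubling scheme are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def build_bli(points, cell_size):
    """Finest-level index: box key (tuple of ints) -> box occupancy."""
    bli = Counter()
    for p in points:
        # The integer key locates the grid cell that contains the point.
        key = tuple(math.floor(x / cell_size) for x in p)
        bli[key] += 1
    return bli


def coarsen(bli):
    """Build the next coarser level by summing child-box occupancies.

    Assumption: the cell size doubles per level, so the parent key is
    the child key floor-divided by 2 in every coordinate.
    """
    parent = Counter()
    for key, occupancy in bli.items():
        parent[tuple(k // 2 for k in key)] += occupancy
    return parent


def box_counts(points, finest_size, levels):
    """Return [(cell_size, occupied_boxes)] from the finest level upward."""
    bli = build_bli(points, finest_size)
    size = finest_size
    counts = []
    for _ in range(levels):
        counts.append((size, len(bli)))
        bli = coarsen(bli)  # bottom-up: reuse the finer level, not the raw data
        size *= 2
    return counts


def box_counting_dimension(counts):
    """Least-squares slope of log N(r) versus log(1/r)."""
    xs = [math.log(1.0 / r) for r, _ in counts]
    ys = [math.log(n) for _, n in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Usage: points on a diagonal line segment embedded in 2-D space.
line = [(i / 1024, i / 1024) for i in range(1024)]
estimate = box_counting_dimension(box_counts(line, 1 / 256, 5))
```

Because occupancies are simply summed per key, `build_bli` corresponds loosely to a map step that emits (key, 1) followed by a sum-by-key reduce, and `coarsen` to a map step that rewrites each key to its parent key followed by the same reduce. This correspondence is an interpretation of the abstract, not a description of the published code.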

Revised: August 14, 2019 | Published: December 1, 2018

Citation

Liu R., R.J. Rallo Moya, and Y. Cohen. 2018. Fractal Dimension Calculation for Big Data Using Box Locality Index. Annals of Data Science 5, no. 4:549–563. PNNL-SA-127675. doi:10.1007/s40745-018-0152-5