Identifying similarities between datasets is a fundamental task in data mining and has
become an integral part of modern scientific investigation. Whether the task is to
identify co-expressed genes in large-scale expression surveys or to predict combinations
of gene knockouts which would elicit a similar phenotype, the underlying computational
task is often a multi-dimensional similarity test. As datasets continue to grow,
improvements to the efficiency, sensitivity or specificity of such computations will have
broad impact, because they allow scientists to explore the wealth of scientific
data more completely. A significant practical drawback of large-scale data mining is that the vast majority of
pairwise comparisons are unlikely to be relevant, meaning that the datasets being compared do not share a
signature of interest. It is therefore essential to identify these unproductive
comparisons as rapidly as possible and exclude them from more time-intensive similarity
calculations. The Blazing Signature Filter (BSF) is a highly efficient pairwise similarity
algorithm which enables extensive data mining within a reasonable amount of time.
The algorithm transforms datasets into binary metrics, allowing it to use
computationally efficient bit operators and provide a coarse measure of similarity. As a
result, the BSF can scale to high dimensionality and rapidly filter unproductive pairwise
comparisons. Two bioinformatics applications of the tool are presented to demonstrate
the ability to scale to billions of pairwise comparisons and the usefulness of this
approach.
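
The filtering idea described above can be illustrated with a minimal Python sketch. It is not the BSF library's API or its actual implementation. It assumes each dataset is binarized with a simple threshold and packed into an integer bitmask, so that a bitwise AND followed by a bit count gives a coarse count of shared features. The names `binarize`, `shared_count`, `filter_pairs`, the threshold of 1.5, and the toy profiles are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

def binarize(values, threshold):
    """Pack a numeric vector into an int bitmask: bit i is set when values[i] >= threshold."""
    bits = 0
    for i, v in enumerate(values):
        if v >= threshold:
            bits |= 1 << i
    return bits

def shared_count(a, b):
    """Coarse similarity: number of positions set in both bitmasks (AND, then count bits)."""
    return bin(a & b).count("1")

def filter_pairs(signatures, min_shared):
    """Keep only the pairs that share at least min_shared set bits."""
    names = sorted(signatures)
    return [
        (x, y)
        for x, y in combinations(names, 2)
        if shared_count(signatures[x], signatures[y]) >= min_shared
    ]

# Toy data (hypothetical): three expression-like profiles over four conditions.
profiles = {
    "geneA": [2.1, 0.1, 3.0, 1.8],
    "geneB": [1.9, 0.2, 2.7, 0.0],
    "geneC": [0.0, 2.5, 0.1, 0.3],
}
signatures = {name: binarize(vals, threshold=1.5) for name, vals in profiles.items()}
candidates = filter_pairs(signatures, min_shared=2)
print(candidates)
```

Only the pairs that survive this cheap filter would then be passed to a more time-intensive similarity calculation. A production implementation would more likely use fixed-width machine words and hardware bit-count instructions than Python integers.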
Revised: April 13, 2020
Published: June 11, 2018
Citation
Lee J., G.M. Fujimoto, R.E. Wilson, H. Wiley, and S.H. Payne. 2018. Blazing Signature Filter: a library for fast pairwise similarity comparisons. BMC Bioinformatics 19. PNNL-SA-126956. doi:10.1186/s12859-018-2210-6