June 9, 2012
Conference Paper

An Efficient Machine Learning Approach To Low-Complexity Filtering In Biological Sequences

Abstract

Biological sequences contain low-complexity regions (LCRs) which produce superfluous matches in homology searches, and lead to slow execution of database search algorithms such as BLAST. These regions are efficiently identified by low-complexity filtering algorithms such as SDUST and SEG, which are included in the BLAST tool-suite. These algorithms target differing notions of complexity, so an algorithm which combines their sensitivities is pursued. A variety of features are derived from these algorithms, as well as a new filtering algorithm based on Lempel-Ziv complexity. Artificial sequences with known LCRs are used to train and evaluate an SVM classifier, which significantly outperforms the standalone filtering algorithms.
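
To make the idea of a Lempel-Ziv based filter concrete, the following is a minimal sketch, not the algorithm from the paper. It assumes an LZ78-style phrase count as the complexity measure and a fixed sliding window; the function names, the window length, and the threshold are all hypothetical choices made for illustration. Windows that parse into few phrases are treated as low-complexity and soft-masked in lowercase.

```python
def lz78_phrase_count(seq: str) -> int:
    """Count the phrases in an LZ78-style parse of seq.

    Assumption: LZ78-style parsing stands in for "Lempel-Ziv complexity";
    the paper may use a different variant.
    """
    phrases = set()
    current = ""
    count = 0
    for ch in seq:
        current += ch
        if current not in phrases:
            # A phrase ends at the first extension not seen before.
            phrases.add(current)
            count += 1
            current = ""
    if current:
        count += 1  # trailing phrase that repeats an earlier one
    return count


def mask_low_complexity(seq: str, window: int = 12, threshold: int = 5) -> str:
    """Lowercase every residue covered by a window with few LZ78 phrases.

    window and threshold are hypothetical values, not taken from the paper.
    """
    masked = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if lz78_phrase_count(seq[start:start + window]) <= threshold:
            for i in range(start, start + window):
                masked[i] = True
    return "".join(c.lower() if m else c for c, m in zip(seq, masked))
```

A repetitive window such as twelve consecutive A residues parses into few phrases and is masked, while a varied window of the same length parses into more phrases and is left unchanged. In the setting the abstract describes, a score of this kind would be one feature among several supplied to the classifier rather than a standalone decision rule.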

Revised: November 6, 2012 | Published: June 9, 2012

Citation

Barber C.A., and C.S. Oehmen. 2012. An Efficient Machine Learning Approach To Low-Complexity Filtering In Biological Sequences. In IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), May 9-12, 2012, San Diego, CA, 237-243. Piscataway, New Jersey: IEEE. PNNL-SA-84473. doi:10.1109/CIBCB.2012.6217236