February 23, 2007
Journal Article

Integrating Subcellular Location for Improving Machine Learning Models of Remote Homology Detection in Eukaryotic Organisms

Abstract

Motivation: At the center of bioinformatics, genomics, and proteomics is the need for highly accurate genome annotations. Producing high-quality, reliable annotations depends on identifying evolutionarily related sequences (homologs) from which function can be inferred. Homology detection is one of the oldest tasks in bioinformatics; however, most approaches still fail when presented with sequences that have low residue similarity despite a distant evolutionary relationship (remote homology). Recently, discriminative approaches such as support vector machines (SVMs) have demonstrated a vast improvement in sensitivity for remote homology detection. These methods, however, have focused on only one aspect of the sequence at a time, e.g., sequence similarity or motif-based scores. Supplementary information, such as the subcellular location of a protein, would give further clues as to possible homologous pairs, while also eliminating false relationships, inferred from simple functional roles, that cannot exist because of location. We have developed a method, SVM-SimLoc, that integrates subcellular location with sequence similarity information into a protein family classifier, and compared it to one of the most accurate sequence-based SVM approaches, SVM-Pairwise.

Results: The SCOP 1.53 benchmark data set was used to assess the performance of SVM-SimLoc. Because cellular location prediction depends on the type of sequence (eukaryotic or prokaryotic), the analysis is restricted to the 2630 eukaryotic sequences in the benchmark data set, evaluating a total of 27 protein families. We demonstrate that integrating sequence similarity and subcellular location yields notably more accurate results than using sequence similarity alone, at a significance level of 0.006.

Revised: July 12, 2007 | Published: February 23, 2007

Citation

Shah A.R., C.S. Oehmen, J.K. Harper, and B.M. Webb-Robertson. 2007. Integrating Subcellular Location for Improving Machine Learning Models of Remote Homology Detection in Eukaryotic Organisms. Computational Biology and Chemistry 31, no. 2:138-142. PNNL-SA-48402. doi:10.1016/j.compbiolchem.2007.02.012