Assigning Proteins ID Cards
New approach identifies proteins with confidence, dropping false identifications to near zero
Results: Just as a sequence of numbers on a Social Security card uniquely identifies a person, so do unique sequences of amino acids unambiguously identify proteins. Scientists at the Pacific Northwest National Laboratory have found a new approach that combines these unique sequences with precise data and conventional database searches to correctly identify the thousands of proteins produced by a single cell.
In conventional methods, a protein's identity is inferred from a number of factors, and researchers then rank their confidence in each identification. However, this ranking unavoidably yields false identifications for about five percent of the proteins and leaves many other ambiguities. With the new approach, researchers have identified proteins and their modifications, including unknown ones, with nearly zero false or ambiguous identifications.
Why it matters: Because many techniques for early disease detection and environmental monitoring rely on the identification of proteins, correct identifications play a fundamental role in such scientific research. Probing unknown protein modifications will also be one of the most important tasks in the next generation of proteomics.
Methods: The new approach combined precise data, conventional database searches, and, finally, unique sequence tags, or UStags for short. The process began with high-quality measurements of the masses of fragment sequences of proteins from the proteome, a cell's entire repertoire of proteins. The measurements were obtained with a high-powered mass spectrometer located in the Department of Energy's Environmental Molecular Sciences Laboratory, a national scientific user facility at PNNL.
The resulting mass measurements, or mass spectra, were searched twice against the yeast sequence database. The first search used very narrow parameters and provided a list of possible peptide candidates. The second search used far wider tolerances so that it would include potential amino acid substitutions and modifications. Amino acids are the 20 building blocks used to construct proteins, and they may be added or dropped in a protein.
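The two-pass idea can be illustrated with a minimal sketch. It is not the team's search software: the candidate peptides, the function names, and both tolerance values are hypothetical, and only the residue masses are standard monoisotopic values. It shows how a tight mass window keeps a single candidate while a wide window also admits peptides that differ by one substituted amino acid.

```python
# Monoisotopic residue masses (Da) for a few amino acids; standard values.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "K": 128.09496, "E": 129.04259,
}
WATER = 18.01056  # added once per peptide for the terminal H and OH


def peptide_mass(seq):
    """Neutral monoisotopic mass of a peptide sequence."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def search(observed_mass, candidates, tol_da):
    """Return candidates whose mass lies within tol_da of the observed mass."""
    return [p for p in candidates
            if abs(peptide_mass(p) - observed_mass) <= tol_da]


# Hypothetical candidates and tolerances, chosen only for illustration.
candidates = ["GASVK", "GAEVK", "GALVK"]
observed = peptide_mass("GASVK")

narrow_hits = search(observed, candidates, tol_da=0.01)  # tight window
wide_hits = search(observed, candidates, tol_da=50.0)    # admits substitutions

print(narrow_hits)
print(wide_hits)
```

In this toy case the narrow pass returns only the exact match, while the wide pass also returns the two peptides in which serine is replaced by glutamic acid or leucine. Those extra candidates are the kind a later filtering step would need to accept or reject.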
Next, the scientists counted each amino acid obtainable from the spectrum for the top 10 proposed candidates, constructed sequences, and filtered the resulting sequences from the narrow and broad searches through their residue replacement filter. After considering possible amino acid substitutions and potential modifications, the filter rejected all ambiguous sequences. The resulting set of amino acid sequences, or UStags, which ranged from 5 to 45 amino acids in length, was then used to identify proteins and assign modifications.
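The uniqueness requirement behind a sequence tag can be sketched in a few lines. This is a simplified stand-in, not the residue replacement filter itself: the protein names, sequences, and function names are invented for illustration, and the check ignores substitutions and modifications. A tag is kept only if it occurs in exactly one protein of the database.

```python
def proteins_containing(tag, database):
    """Names of all proteins whose sequence contains the tag."""
    return sorted(name for name, seq in database.items() if tag in seq)


def unique_tags(tags, database):
    """Map each tag found in exactly one protein to that protein's name."""
    assigned = {}
    for tag in tags:
        hits = proteins_containing(tag, database)
        if len(hits) == 1:  # any other count means missing or ambiguous
            assigned[tag] = hits[0]
    return assigned


# Hypothetical miniature database and candidate tags.
database = {
    "protA": "MKTAYIAKQRGASVKLE",
    "protB": "MSSGASVKTTWQ",
    "protC": "MKTAYPLNE",
}
tags = ["GASVK", "AYIAKQR", "MKTAY"]

assigned = unique_tags(tags, database)
print(assigned)
```

Here "GASVK" and "MKTAY" each occur in two proteins and are rejected as ambiguous, so only "AYIAKQR" is kept and assigned to protA.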
What's next? Having proven the principle of UStag, the team is beginning to explore various projects, including the degradation of proteins inside a cell, protein-cleaving enzymes, amino acid mutations, and protein interactions with larger masses. In addition, they are using UStag to look for the unexpected in the proteome.
Acknowledgments: The National Institutes of Health's National Institute of Allergy and Infectious Diseases and National Center for Research Resources along with the EMSL's 2007 Intramural Program funded this research.
The research was conducted by Yufeng Shen, Nikola Tolic, Kim K. Hixson, Samuel O. Purvine, Ljiljana Paša-Tolic, Wei-Jun Qian, Joshua N. Adkins, Ronald J. Moore, and Richard D. Smith, all of Pacific Northwest National Laboratory. This work is part of PNNL's contributions to developing and deploying transformational tools and techniques for the biological, chemical, environmental, and physical sciences.
Citation: Shen Y, N Tolic, KK Hixson, SO Purvine, L Paša-Tolic, WJ Qian, JN Adkins, RJ Moore, and RD Smith. 2008. "Proteome-wide Identification of Proteins and Their Modifications with Decreased Ambiguities and Improved False Discovery Rates Using Unique Sequence Tags." Analytical Chemistry 80(6):1871-1882. DOI: 10.1021/ac702328x S0003-2700(70)02328-4