Sponsors: National Institutes of Health, PNNL Laboratory Research and Development, and DOE Office of Biological and Environmental Research
Contact: Gordon Anderson.
The large volumes of LC-MS data, often encompassing many datasets, must be processed to extract the useful information from the raw data files (generally a series of mass spectra) produced by the MS analyzer. Proteomic analysis of just one organism may include hundreds to thousands of LC-MS analyses, with each dataset containing thousands of mass spectra.
The first step in our current proteomics data analysis pipeline involves reducing the MS raw data to a form compatible with downstream analysis algorithms. In the case of LC-MS/MS analyses, the raw MS/MS fragmentation spectra are used to search databases of possible peptide sequences (e.g., using SEQUEST) and generate tentative peptide identifications. In the case of LC-FTICR data, the raw spectra are analyzed and converted to tables of masses and spectrum numbers (or elution times) that represent the individual species detected in each spectrum. The process of converting isotopic distributions to tables of masses is referred to as de-isotoping or mass transformation, and it is performed by in-house developed software called ICR-2LS. Once a spectrum is processed and the peak information extracted, a table is generated that contains the masses of the detected species, their intensities, and quality information. This information is then used in later stages of the proteomics data analysis pipeline.
The ICR-2LS software developed at PNNL converts isotopic distributions from FTICR analyses to tables of masses.
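As a rough illustration of what mass transformation involves, the sketch below infers the charge state from the spacing between adjacent isotopic peaks and converts the first peak to a neutral monoisotopic mass. It is not the ICR-2LS algorithm, whose internals are not described here. The function names, the 1.00235 Da average isotope spacing, and the assumption that the first peak in the list is the monoisotopic peak are all illustrative.

```python
PROTON_MASS = 1.007276     # Da
ISOTOPE_SPACING = 1.00235  # Da, approximate average spacing for peptides (assumption)


def estimate_charge(mz_peaks, max_charge=10):
    """Estimate the charge from the mean spacing of sorted isotopic m/z peaks."""
    if len(mz_peaks) < 2:
        raise ValueError("need at least two isotopic peaks")
    spacings = [b - a for a, b in zip(mz_peaks, mz_peaks[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    charge = round(ISOTOPE_SPACING / mean_spacing)
    # Clamp to a plausible range of charge states.
    return max(1, min(max_charge, charge))


def monoisotopic_mass(mz_peaks):
    """Return (neutral mass, charge), assuming mz_peaks[0] is the monoisotopic peak."""
    z = estimate_charge(mz_peaks)
    return z * (mz_peaks[0] - PROTON_MASS), z


if __name__ == "__main__":
    # Hypothetical doubly charged isotopic envelope.
    peaks = [501.007276, 501.508451, 502.009626]
    print(monoisotopic_mass(peaks))
```

A production tool would also fit the observed envelope against a theoretical isotopic distribution to locate the monoisotopic peak and to produce the quality information mentioned above; this sketch omits that step.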
For comparative studies, the peak finding algorithm has been enhanced to allow distinctive isotopic signatures to be detected. For example, mixtures of peptides from organisms grown on normal media and on isotopically enriched media can be analyzed and the type of isotopic distribution identified for each detected species.
After the mass spectra for a given analysis have been reduced to tables of monoisotopic masses, the masses are processed by another in-house developed software package called VIPER. This software processes each analysis in an automated fashion to 1) load and filter the data, 2) find mass and elution time features, 3) regress the observed elution times against the normalized elution times of putative mass and time tags, 4) further refine the mass calibration, 5) match the mass and elution time features to the mass and time tags, 6) export the results to a database, and 7) generate initial 2D plots and chromatograms for data evaluation.
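Step 3 can be pictured as a regression that maps observed elution times (or scan numbers) onto the normalized elution time (NET) scale. The sketch below uses an ordinary least-squares straight line. The linear model and the function names are assumptions made for illustration; the form of the regression VIPER actually uses is not stated here.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def to_net(scan, slope, intercept):
    """Convert an observed scan number to normalized elution time."""
    return slope * scan + intercept


if __name__ == "__main__":
    # Hypothetical features putatively matched to tags with known NET values.
    scans = [100, 200, 300]
    tag_nets = [0.1, 0.2, 0.3]
    slope, intercept = fit_line(scans, tag_nets)
    print(to_net(250, slope, intercept))
```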
Peak matching and UMC assignment
The process of matching detected mass and elution time features to mass and time tags in the database is conceptually straightforward: the mass and normalized elution time of each feature need only be compared, within a given tolerance, to the mass and normalized elution time of each mass and time tag to arrive at an identification. However, making reliable assignments with a well-defined measure of confidence is challenging. When a detected mass and elution time feature can potentially be assigned to two or more mass and time tags, the best match can be determined on the basis of the mass and normalized elution time deviations.
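The matching logic described above can be sketched as a tolerance filter followed by a choice of the candidate with the smallest combined deviation. The `Tag` class, the default tolerances (5 ppm and 0.02 NET), and the sum-of-squares score are hypothetical choices for illustration, not values or scoring taken from VIPER.

```python
from dataclasses import dataclass


@dataclass
class Tag:
    name: str
    mass: float  # monoisotopic mass, Da
    net: float   # normalized elution time


def match_feature(mass, net, tags, ppm_tol=5.0, net_tol=0.02):
    """Return the tag within tolerance with the smallest scaled deviation, or None."""
    best, best_score = None, None
    for tag in tags:
        ppm_err = (mass - tag.mass) / tag.mass * 1e6
        net_err = net - tag.net
        if abs(ppm_err) > ppm_tol or abs(net_err) > net_tol:
            continue  # outside the matching window
        # Scale each deviation by its tolerance so mass and NET are comparable.
        score = (ppm_err / ppm_tol) ** 2 + (net_err / net_tol) ** 2
        if best_score is None or score < best_score:
            best, best_score = tag, score
    return best


if __name__ == "__main__":
    tags = [Tag("A", 1000.000, 0.30), Tag("B", 1000.003, 0.31), Tag("C", 1500.0, 0.30)]
    print(match_feature(1000.0025, 0.309, tags))
```

Scaling by the tolerances is one simple way to combine two quantities with different units; a statistically grounded confidence measure would instead use the observed error distributions.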
For this purpose, an algorithm has been developed to estimate the closeness of the match. Because a large number of features are typically detected during analysis of complex proteomics samples, the set of mass errors associated with the identifications (that is, the assignments of detected features to matching mass and time tags) can be computed and used to refine the mass calibration for the data. This step is accomplished by computing the mass errors for each measurement that contributes to a given mass and elution time feature.
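One simple way to use a set of mass errors for calibration refinement is to estimate the systematic offset in parts per million and remove it from every observed mass. The sketch below uses the median error as that estimate. This is an assumption chosen for illustration, as are the function names; the actual refinement procedure may model the errors differently.

```python
from statistics import median


def ppm_error(observed, reference):
    """Mass error of an observation relative to its matched tag, in ppm."""
    return (observed - reference) / reference * 1e6


def recalibrate(observed_masses, matched_pairs):
    """Remove the median ppm error of matched (observed, reference) pairs."""
    errors = [ppm_error(obs, ref) for obs, ref in matched_pairs]
    shift = median(errors)
    corrected = [m / (1 + shift * 1e-6) for m in observed_masses]
    return corrected, shift


if __name__ == "__main__":
    # Hypothetical matches that all read about 2 ppm high.
    pairs = [(1000.002, 1000.0), (1500.003, 1500.0), (2000.004, 2000.0)]
    corrected, shift = recalibrate([obs for obs, _ in pairs], pairs)
    print(shift, corrected)
```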
The detected features and mass and time tags for each analyzed dataset are exported to a database. Queries can then be applied to roll up the results for replicate analyses and to export the list of features and/or identified peptides for further informatics analyses and characterization. During automated analysis, several 2D displays and chromatograms are created to show the data both before and after filtering and before and after searching the database. These plots can be used to gauge sample complexity, the fraction of unique features identified, and the overall quality of the analysis. Additionally, these 2D plots can reveal the presence of contaminants such as surfactants or polymers.
LC elution time normalization
To combine and interpret data from multiple LC-MS analyses, it is important to effectively group the same peptides from separate LC-MS analyses. However, the elution time of the peptides from one analysis is generally somewhat distorted or shifted in comparison to the retention time for the same peptides in another analysis. We use an advanced approach that involves aligning the datasets by stretching the time axis before "fine tuning" the mass and time locations of individual features in the two datasets to obtain optimal overlap. This process takes advantage of the fact that relative peptide elution times are generally maintained between analyses. Thus, to align two LC-MS chromatograms, the analyses are first broken down into smaller segments and then the similarity between subsections is compared to uncover retention time shifts between the two analyses.
While peptide abundances can be compared between samples because the same peptide sequence will typically ionize with a similar efficiency, it is useful to normalize sample intensities since co-eluting species can introduce suppression effects. We employ the simplest form of normalization, in which peptides common to both samples are identified and the intensity of each in one sample is plotted against its intensity in the other. Assuming the majority of the peptides in common remain unchanged, the slope of the fitted line determines the correction factor to apply to the second sample. Following normalization, the ratio of the observed intensities for each peptide can be computed, and the ratios for peptides belonging to the same protein can be averaged to obtain a measure of protein change. In the example plot, peptide intensities observed in one condition are plotted against the peptide intensities observed in the second condition. A linear fit with a slope of 0.907 suggests that the abundances in the second condition should be multiplied by 1.103 (that is, 1/0.907) to normalize them to the abundances in the first condition. While the distribution indicates that there are substantial variations in relative peptide abundances between samples, replicate analyses are required to evaluate experimental contributions and to enable the development of an error model to estimate the significance of the observed variation (i.e., how likely it is that the observation is real).
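The normalization and protein-level averaging described above can be sketched as follows. The fit is constrained through the origin and takes the first condition as the x axis, both of which are assumptions, as are the function names and the example intensities; the correction factor is the reciprocal of the slope, consistent with the 0.907 and 1.103 figures quoted in the text.

```python
def normalization_factor(first, second):
    """Fit second = slope * first through the origin; return 1 / slope."""
    slope = sum(a * b for a, b in zip(first, second)) / sum(a * a for a in first)
    return 1.0 / slope


def protein_ratios(peptides, factor):
    """Average normalized second/first intensity ratios per protein.

    peptides: iterable of (protein, intensity_first, intensity_second).
    """
    ratios = {}
    for protein, i_first, i_second in peptides:
        ratios.setdefault(protein, []).append(i_second * factor / i_first)
    return {p: sum(r) / len(r) for p, r in ratios.items()}


if __name__ == "__main__":
    # Hypothetical peptides common to both conditions.
    first = [100.0, 200.0, 400.0]
    second = [90.7, 181.4, 362.8]
    factor = normalization_factor(first, second)
    peptides = [("P1", 100.0, 90.7), ("P1", 200.0, 181.4), ("P2", 400.0, 725.6)]
    print(factor, protein_ratios(peptides, factor))
```

Averaging ratios directly is the simplest choice; averaging log ratios is a common alternative because it treats increases and decreases symmetrically.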