Because most genome annotation pipelines rely on automated gene finding, they lack experimental validation of primary structure and must depend on DNA-centric sources of data. By analyzing proteomics mass spectrometry data, our protocol improves existing annotations by discovering novel genes and post-translational modifications (PTMs) and by correcting erroneous primary sequence annotations. The PGP pipeline is designed to run in a wide range of parallel Linux computing environments to address the high computational cost of proteomics data processing. It has already been used to improve the annotation of 46 genomes across the prokaryotic tree of life. Availability and Implementation: Source code is freely available from https://bitbucket.org/andreyto/proteogenomics under the GPL license. It is implemented in Python and C++ and bundles the Makeflow engine to execute the workflows.
Revised: May 22, 2014
Published: January 27, 2014
Citation
Tovchigrechko A., P. Venepally, and S.H. Payne. 2014. PGP: Parallel Prokaryotic Proteogenomics Pipeline for MPI clusters, high-throughput batch clusters and multicore workstations. Bioinformatics 30, no. 10:1469-1470. PNWD-SA-10240. doi:10.1093/bioinformatics/btu051