To define the network structure of a cell, network biology researchers use proteomic, genomic, and metabolomic data as well as computational capabilities to identify and map cell signaling networks. Data generated by scientists at PNNL and Metacore software were used to generate a representative signaling network stimulated by epidermal growth factor (EGF) in human mammary epithelial cells (HMEC).
In the era of post-genomic biology, scientists have proposed that cells can be understood in terms of their network structure (1, 2). To define the network structure of a cell, network biology researchers combine data from experimental methods, including high-throughput proteomic, genomic, and metabolomic data, with computational capabilities to identify and map cell signaling networks. Identifying the most highly connected proteins has been shown to allow the overall network topology to be inferred (3). These interaction networks can be defined in terms of the protein complexes within cells and their relationships to each other.
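The notion of "highly connected proteins" can be illustrated with a small sketch. The code below is not the method of reference (3); it simply builds an undirected interaction network from a list of pairwise interactions and ranks proteins by their number of partners (their degree). The protein names P1 to P6 and the interactions between them are made up for illustration.

```python
from collections import defaultdict

# Toy, made-up undirected interactions between hypothetical proteins P1..P6.
interactions = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
    ("P2", "P3"), ("P4", "P5"), ("P5", "P6"),
]

def build_network(pairs):
    """Map each protein to the set of proteins it interacts with."""
    network = defaultdict(set)
    for a, b in pairs:
        network[a].add(b)
        network[b].add(a)
    return dict(network)

def rank_by_degree(network):
    """Return proteins sorted by number of partners, most connected first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    return sorted(network, key=lambda p: (-len(network[p]), p))

network = build_network(interactions)
hubs = rank_by_degree(network)

for protein in hubs:
    print(protein, len(network[protein]))
```

In this toy network P1 has the most partners and would be the first candidate hub to examine. Real interaction maps are far larger and noisier, so degree alone is only a starting point.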
Viewing cells in terms of their underlying network structure is a very powerful concept. All networks share common characteristics, and mathematical treatments have been developed to understand their structure and how they can be regulated. Thus, organizing biological information in the context of networks is fundamental to applying systems-level approaches to understanding biological function. By defining protein complexes as dynamic structures and quantifying the extent and quality of interactions of the partners, information relating to their function can be inferred.
At Pacific Northwest National Laboratory (PNNL), our multidisciplinary teams of biologists, computational biologists, bioinformaticists, proteomic experts, and genomic experts collaborate to conduct network biology research. PNNL’s integrated research environment and our unique combination of in-house laboratory facilities and computational capabilities lend strength to our network biology efforts. All aspects of a network biology study ranging from in vitro experiments to computational modeling can be performed in one laboratory setting. PNNL’s network biology research focuses on four key components: data collection, data analysis, data storage, and data sharing.
Data Collection
Microarray analysis has proven to be a very effective and reliable method for studying the transcriptome of cells. Several methods have been developed for defining the proteome and protein interactions, and proteome studies commonly use mass spectrometry. No single method can identify all of the protein interactions in a cell, but work in our laboratories and by other groups has shown that affinity-based isolation of complexes followed by identification by mass spectrometry is the best high-throughput approach. Affinity-isolation approaches are usually implemented either by genetically modifying a targeted protein (referred to as the bait) with an epitope tag or by using a specific affinity reagent (e.g., an antibody) to isolate the target protein together with its associated proteins. The isolated proteins can then be identified by analytical techniques, most commonly mass spectrometry.
Data Analysis
Data analysis for high-throughput data sets is challenging. For example, the first hurdle in analyzing mass spectrometry data on affinity-isolated proteins is to determine which proteins are truly interacting and which are present only as nonspecific components. This problem is complicated by differences in the technical details of individual isolation procedures and by the fact that not all peptides are detected equally well by mass spectrometry. Rigorous statistical methods are necessary to derive meaningful analyses from raw, high-throughput data sets. The bar is raised even higher when disparate sets of data, such as microarray and proteomics information obtained from the same experiment, must be integrated and analyzed. One very powerful approach to validating a protein interaction is to compare results with existing data from other organisms (4). It is therefore desirable for computational analysis techniques to allow cross-referencing against other available databases. This approach will increase in importance as the amount of data on different microbial and mammalian systems expands. If cell-wide protein complex interaction maps are to be as useful to biology as DNA sequences, a combination of data from many organisms will be needed to interpret the functional significance of network topologies.
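To make the first hurdle concrete, the sketch below separates likely specific interactors from nonspecific background by comparing how often each protein is observed in a bait isolation versus a control isolation. This is a deliberately simplified stand-in for the rigorous statistical methods mentioned above, not the analysis used at PNNL. The protein names, the spectral counts, the pseudocount, and the threshold of 3.0 are all assumptions chosen for illustration.

```python
def enrichment(bait_counts, control_counts, pseudocount=1.0):
    """Ratio of (bait count + pseudocount) to (control count + pseudocount).

    The pseudocount avoids division by zero for proteins that are
    absent from the control isolation.
    """
    proteins = set(bait_counts) | set(control_counts)
    return {
        p: (bait_counts.get(p, 0) + pseudocount)
        / (control_counts.get(p, 0) + pseudocount)
        for p in proteins
    }

def likely_specific(bait_counts, control_counts, min_ratio=3.0):
    """Names of proteins enriched at least min_ratio-fold over control."""
    scores = enrichment(bait_counts, control_counts)
    return sorted(p for p, ratio in scores.items() if ratio >= min_ratio)

# Hypothetical spectral counts from one bait isolation and one control.
bait = {"PreyA": 24, "PreyB": 11, "StickyC": 30, "StickyD": 4}
control = {"PreyB": 1, "StickyC": 28, "StickyD": 5}

print(likely_specific(bait, control))
```

A fixed fold-change cutoff ignores replicate variability and the unequal detectability of peptides, which is why a statistical model fitted across many isolations is preferable in practice.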
Data Storage and Data Sharing
Biological data should be stored in a form accessible to the biological research community, whether that community is defined as a small group of researchers in the same laboratory, people at collaborating institutions, or any interested party with access to the web. These databases should also be able to integrate data from other sources, such as publicly available websites with genomic, proteomic, and metabolomic information. To derive knowledge from high-throughput data sets, especially disparate data sets, the data must be stored in a common framework, there must be ways to visualize the data to aid interpretation, and there should be methods to query the data.
PNNL’s Network Biology Research Efforts to Collect, Analyze, Store, and Share Biological Data Are Supported by a Variety of Projects
Tools and methods for data collection, analysis, storage, and sharing are being integrated at PNNL to help researchers understand cells in terms of their network structure.
Center for Molecular and Cellular Systems - A central challenge posed by the Department of Energy’s (DOE’s) Genomics:GTL program is to understand how the information held in a microbe’s DNA sequence gives rise to the myriad molecular processes that allow the organism to function. To address this challenge, researchers at PNNL are developing and implementing high-throughput approaches for mapping protein interactions in microbes under the Genomics:GTL Center for Molecular and Cellular Systems (CMCS).
Bioinformatics Resource Manager - Researchers at PNNL are developing tools to automate molecular profiling data analysis. These computational tools are designed for automated data gathering, interfacing between data storage and analysis programs, and linking data between applications.
Data Integration and Pattern Recognition – New bioinformatic approaches for integrating and analyzing high-throughput data, such as microarray and proteomics datasets, are being developed at PNNL. Also, novel visualization tools are being developed to facilitate comprehension of extremely large data sets.
Crosstalk Among Receptor Signaling Pathways - Partnering with ongoing experimental efforts at PNNL, our research teams are developing a signaling pathway model for the insulin-like growth factor-1 receptor (IGF-1R) and connecting it to our epidermal growth factor receptor (EGFR) model to form a unified signaling network model in which different receptor signaling pathways can transmodulate each other’s properties. We are also building a network model for the tumor necrosis factor receptor (TNFR), and we will later develop a signaling network model for the G-protein coupled receptor (GPCR) system.
Complex Queries – Researchers at PNNL are building a system to support data acquisition, metadata tracking, data storage, data retrieval, and analysis capabilities in a structured framework. As part of this effort, three tools are being integrated: Complex Queries (CQ, a database of biological information that can be asked complex questions and return meaningful answers), Integrated Database for Experiment and Analysis (IDEA, a centralized framework for managing projects, defining and designing experiments, cataloging resources, tracking samples sent for different analytical techniques, and storing results), and the Computational Cell Environment (CCE, a problem-solving environment that provides user-friendly access to an extensible set of data sources).
Integrated Data Structures for Mapping Cellular Networks – Scientists at PNNL are building the most comprehensive, multifaceted RNA and protein database for a human cell line to date. A method is being developed to store, organize, and manage the large and divergent data in this database while providing links to the bioinformatic and computational tools needed to interrogate the data and integrate results across multiple experiments and experimental approaches.
Collective Analysis of Biological Interaction Networks - The Collective Analysis of Biological Interaction Networks (CABIN) is a plugin to Cytoscape, which is an open source network visualization and analysis tool. CABIN promotes analytical reasoning for integrating evidence of interaction data from multiple sources by the use of interactive visual interfaces. Its functionalities maximize human perception and understanding of uncertain and complex data, facilitating high-quality human judgment with limited investment of the user's time.
Software Environment for BIological Network Inference (SEBINI) - PNNL's SEBINI project team has created a software platform for the inference of (1) genetic regulatory networks from high-throughput microarray, messenger RNA (mRNA) expression data; (2) protein regulatory networks from high-throughput, protein abundance data; and (3) protein signaling networks from protein activation state data. The algorithms within SEBINI use correlations in gene expression, protein abundance, and protein activation state to infer direct regulatory connections between genes or proteins. With these tools, scientists are able to rapidly reconstruct biological regulatory networks with greater ease and accuracy.
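The general idea of inferring connections from correlated measurements can be sketched in a few lines. The code below is not one of SEBINI's algorithms; it is a minimal illustration, assuming that a candidate edge is drawn between two genes whenever the absolute Pearson correlation of their expression profiles meets a threshold. The gene names, expression values, and the threshold of 0.9 are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def infer_edges(profiles, threshold=0.9):
    """Return (gene1, gene2, r) for every pair with |r| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges

# Hypothetical expression levels measured under five conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],  # rises with geneA
    "geneC": [5.0, 4.1, 2.9, 2.0, 1.1],  # falls as geneA rises
    "geneD": [3.0, 1.0, 4.0, 1.0, 3.0],  # no clear relationship
}

for g1, g2, r in infer_edges(profiles):
    print(f"{g1} -- {g2}  r = {r:+.3f}")
```

Correlation alone cannot tell which gene regulates which, nor separate direct from indirect effects, so edges found this way are only candidates for further analysis.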
SVM-HUSTLE - As the amount of biological sequence data continues to grow exponentially, we face the increasing challenge of assigning function to this enormous molecular parts list. The most popular approaches to this challenge rely on the simplifying assumption that functionally similar molecules, or proteins, sometimes have similar composition, or sequence. However, these algorithms often fail to identify remote homologs (proteins with similar function but dissimilar sequence), which often make up a significant fraction of the total homolog collection for a given sequence. Scientists at Pacific Northwest National Laboratory (PNNL) have developed a Support Vector Machine (SVM)-based tool to detect Homology Using Semi-supervised iTerative LEarning (SVM-HUSTLE) that identifies significantly more remote homologs than current state-of-the-art sequence or cluster-based methods.
The following project is supported by the National Institutes of Health.
Center for Genomic Experimentation and Computation – A collaborative project team led by Roger Brent at the Molecular Sciences Institute is combining functional genomic and computational research to model a prototype signal transduction pathway.
(1) Ideker T. 2004. “A Systems Approach to Discovering Signaling and Regulatory Pathways – or, How to Digest Large Interaction Networks Into Relevant Pieces.” Advances in Experimental Medicine and Biology. 547:21-30.
(2) Xia Y, H Yu, R Jansen, M Seringhaus, S Baxter, D Greenbaum, H Zhao, and M Gerstein. 2004. “Analyzing Cellular Biochemistry in Terms of Molecular Networks.” Annual Review of Biochemistry. 73:1051-1087.
(3) Lappe M, and L Holm. 2004. “Unraveling Protein Interaction Networks with Near-Optimal Efficiency.” Nature Biotechnology. 22:98-103.
(4) Kelley BP, R Sharan, RM Karp, T Sittler, DE Root, BR Stockwell, and T Ideker. 2003. “Conserved Pathways Within Bacteria and Yeast as Revealed by Global Protein Network Alignment.” Proceedings of the National Academy of Sciences of the United States of America. 100:11394-11399.