# Graph Analytics

## Introduction to Graph Analytics

Graph analytics is the evaluation of information that has been organized as objects and their connections. The purpose of graph analytics is to understand how the objects relate or could relate. The objects are commonly referred to as *nodes* because they are points at which connections intersect. The collection of nodes and their connections is called a *graph*. (Graph analytics is also known as network analysis.)

Your relationships with your family and coworkers, along with any relationships among them that are independent of you, could be represented as a graph in which each person is a node and each connection is categorized as personal or professional. Applying graph analytics to this collection of relationships could show the average number of relationships for an individual or how many "degrees of separation" connect two individuals.
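As a rough illustration of those two measurements, the sketch below stores a small relationship graph and computes the average number of relationships per person and the degrees of separation between two people. The names, the relationship list, and the function names are all invented for this example; it is one simple way to do the calculation, not a reference implementation.

```python
from collections import deque

# Hypothetical people and relationships, invented for illustration.
relationships = [
    ("you", "sister", "personal"),
    ("you", "manager", "professional"),
    ("sister", "cousin", "personal"),
    ("manager", "colleague", "professional"),
    ("colleague", "cousin", "personal"),
]

def build_adjacency(edges):
    """Map each person to the set of people they are directly connected to."""
    adjacency = {}
    for a, b, _kind in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def average_relationships(adjacency):
    """Average number of direct connections per person."""
    return sum(len(neighbors) for neighbors in adjacency.values()) / len(adjacency)

def degrees_of_separation(adjacency, start, goal):
    """Fewest connections linking start to goal (breadth-first search)."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        for neighbor in adjacency.get(person, ()):
            if neighbor == goal:
                return hops + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None  # the two people are not connected at all

adjacency = build_adjacency(relationships)
print(average_relationships(adjacency))                   # 2.0
print(degrees_of_separation(adjacency, "you", "cousin"))  # 2
```

Breadth-first search is used here because it visits people in order of distance, so the first time it meets the target it has found the smallest number of hops.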

A graph can also capture the *strength* of the relationship between nodes (such as how often you speak with your family or coworkers) and the *direction* of the relationship (are you always the one who starts text conversations with your best friend?). Not every node in a graph has to be of the same type; for example, a talent-related graph could include companies, people, and work skills all as nodes.
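One way to picture strength, direction, and mixed node types together is a small data structure like the sketch below. The class names, field names, and every value in it are assumptions made for illustration; real graph software will have its own representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # "person", "company", or "skill" in this sketch

@dataclass(frozen=True)
class Edge:
    source: Node     # the relationship runs from source ...
    target: Node     # ... to target
    relation: str
    strength: float  # larger means a stronger relationship

ana = Node("Ana", "person")
ben = Node("Ben", "person")
acme = Node("Acme", "company")
welding = Node("welding", "skill")

# Hypothetical edges: Ana starts far more conversations with Ben than Ben does.
edges = [
    Edge(ana, ben, "starts_chat_with", strength=12.0),
    Edge(ben, ana, "starts_chat_with", strength=3.0),
    Edge(ana, acme, "works_at", strength=1.0),
    Edge(ana, welding, "has_skill", strength=0.8),
    Edge(acme, welding, "needs_skill", strength=0.6),
]

def outgoing(node, edges):
    """Edges that start at the given node."""
    return [e for e in edges if e.source == node]

def incoming(node, edges):
    """Edges that end at the given node."""
    return [e for e in edges if e.target == node]

def strength_between(source, target, edges):
    """Total strength of the edges pointing from source to target."""
    return sum(e.strength for e in edges if e.source == source and e.target == target)

print(strength_between(ana, ben, edges))  # 12.0
print(strength_between(ben, ana, edges))  # 3.0
```

Because each edge records a source and a target, the same pair of people can have different strengths in each direction, and the `kind` field lets people, companies, and skills share one graph.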

Graph analytics differs from numeric analysis by focusing on the relationships between nodes. An example of numeric analysis is calculating the average of a list of high temperatures. Graph analytics provides a way to organize and store the types of information where the value comes from the relationships. Researchers are interested in graph analytics because it allows them to determine how important a single node is to the whole group, detect communities within the group, and determine the shortest path between two nodes, among other characteristics.
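The contrast can be sketched in a few lines. The first part averages a list of temperatures, where the values are independent of one another. The second part asks two graph questions of a small friendship network: which node is most connected, and which groups exist. The data and function names are hypothetical, and both measures are deliberately simple stand-ins: degree centrality is only one of many importance measures, and connected components are the crudest form of community detection.

```python
from statistics import mean

# Numeric analysis: one number summarizes a list of independent values.
high_temperatures = [21.0, 24.5, 19.0, 23.5]  # hypothetical readings
average_high = mean(high_temperatures)

# Graph analysis: the answers depend on who is connected to whom.
friendships = {
    "Ana": {"Ben", "Cy"},
    "Ben": {"Ana", "Cy"},
    "Cy": {"Ana", "Ben", "Dee"},
    "Dee": {"Cy"},
    "Eli": {"Fay"},
    "Fay": {"Eli"},
}

def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    others = len(graph) - 1
    return {node: len(neighbors) / others for node, neighbors in graph.items()}

def connected_components(graph):
    """Groups of nodes that can reach one another through some chain of edges."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups

centrality = degree_centrality(friendships)
groups = connected_components(friendships)
print(average_high)                          # 22.0
print(max(centrality, key=centrality.get))   # Cy
print(len(groups))                           # 2
```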

Understanding graphs and how to evaluate them allows you to investigate relationships in topics as varied as internet searching, shipping optimization, and neuroscience. How does a credit card company detect fraudulent charges? They can analyze the relationship between people and purchases. What is the best route for a ride-hailing driver to take to transport multiple riders? Route-determining software can use the relationship between locations. How do your entertainment services decide what to recommend to you? They analyze the relationship between media you have enjoyed and all media available through the service.

## History of Graph Analytics

The first discussion recorded in Europe that used analysis of a graph was Leonhard Euler's paper on the Seven Bridges of Königsberg, published in 1736. Euler posed the question of how to travel between the islands and mainland of Königsberg via its seven bridges while crossing each bridge exactly once. This problem can be represented as a graph whose nodes stand in for the four pieces of land and whose connections stand in for the bridges between them.
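Euler's answer, that no such walk exists, comes down to counting how many bridges touch each piece of land: a connected graph can be walked using every connection exactly once only if zero or two nodes have an odd number of connections. The sketch below applies that test to the standard layout of the puzzle. The labels A through D and the function names are assumptions for this example, and the check assumes the graph is connected rather than verifying it.

```python
from collections import Counter

# The four land masses are labeled A-D for this sketch.
# A is the central island, which five of the seven bridges touch.
bridges = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"),
    ("B", "D"),
    ("C", "D"),
]

def odd_degree_nodes(edges):
    """Nodes touched by an odd number of edges."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(node for node, count in degree.items() if count % 2 == 1)

def has_euler_walk(edges):
    """True if a connected graph can be walked using every edge exactly once."""
    return len(odd_degree_nodes(edges)) in (0, 2)

print(odd_degree_nodes(bridges))  # ['A', 'B', 'C', 'D']
print(has_euler_walk(bridges))    # False
```

All four land masses have an odd number of bridges, so the walk is impossible. Removing the single bridge between A and D in this model leaves only B and C odd, and the same function then returns `True`.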

Although graph theory and topology developed from Euler's paper in the field of mathematics, most applications of graphs during the twentieth century were in the social sciences. One notable example is the work of psychologist Jacob Moreno, who used a type of graph he called a sociogram to represent social relations between schoolchildren; news of this work garnered a 1933 headline in *The New York Times*: "Emotions Mapped by New Geography."

At the turn of the twenty-first century, both computational power and computer accessibility had sufficiently increased to allow biologists and physicists to apply graph analytics to the pressing big-data problems of their fields, such as molecular networks within a cell and the structure of the internet. The analysis of graphs spread from more academic questions into areas of inquiry such as social media in the twenty-first century.

## Importance and Applications of Graph Analytics

Graph analytics is important because it is the most widely used method for evaluating information whose value derives from the relationships between nodes. Because graphs are a useful way of storing a wide variety of information, from your media preferences to data from the sensors that make up an electrical grid, the ability to analyze graph data allows researchers to investigate questions in many fields.

Graph analytics is best suited to evaluate objects whose importance is in their relationships. More and more types of information are being thought of as nodes with relationships, including in industries like health care and fossil fuel distribution.

One use of graph analytics is to detect fraudulent credit card activity. In an example by Todd Blaschka and Gaurav Deshpande, the nodes in the graph correspond to people, banks, and the devices through which a person could initiate a payment or other fund transfer, such as cell phones or email accounts. The graph in this example shows which people relate to which phone numbers, which email accounts, and which banks, as well as other people through payment activity. By using graph analytics, the fraud detection software can examine far more of the connections between people, phone numbers, and payments than numerical analysis would allow. In this example, a payment sent from one person to another seems legitimate until the fraud detection software evaluates more of the nodes connected through *any* relationship to the people involved in the transaction and finds that the recipient has at one time been associated with a phone number that was previously used for fraud.
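The idea of following *any* relationship outward from the people in a transaction can be sketched as a breadth-first search with a hop limit. Everything below is hypothetical: the records, the identifier format, and the function names are invented for illustration and are not taken from the Blaschka and Deshpande example or from any real fraud detection product.

```python
from collections import deque

# Hypothetical records; every name and number below is invented.
links = [
    ("person:alice", "person:bob", "payment"),
    ("person:bob", "phone:555-0100", "has_used"),
    ("person:bob", "bank:first", "account_at"),
    ("person:alice", "email:alice@example.com", "has_used"),
    ("person:carol", "phone:555-0100", "has_used"),
]
known_fraud = {"phone:555-0100"}

def neighbors_of(links):
    """Treat every relationship type as a connection worth following."""
    adjacency = {}
    for a, b, _relation in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def flagged_within(adjacency, start, flagged, max_hops):
    """Return {flagged node: hop count} for flagged nodes within max_hops of start."""
    hits = {}
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node in flagged:
            hits[node] = distance[node]
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adjacency.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return hits

adjacency = neighbors_of(links)
print(flagged_within(adjacency, "person:alice", known_fraud, 1))  # {}
print(flagged_within(adjacency, "person:alice", known_fraud, 2))  # {'phone:555-0100': 2}
```

Looking only at the sender's direct connections finds nothing suspicious; widening the search by one hop reaches the recipient's phone number, which is the pattern the paragraph above describes.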

In the previous example, it is the existence of a relationship between nodes that allows the fraud detection software to determine that some part of a financial transaction is connected to a bad actor. But the relationships between nodes can tell more than simply that a relationship exists. In many graphs, the connections between nodes are assigned values and directions. When two internet pages link to each other, the connection between them can be stored as bidirectional. For two pages where only one links to the other, the connection would be one-way. The best-known page-ranking algorithm, Google's PageRank, uses the direction of these connections to assign a value to each web page. These values are calculated based on the number of links a web page receives as well as the importance of the web pages that link to it. Once values and directions exist, common graph analytics algorithms such as clustering and shortest-path calculations can be used to derive information from the graph.
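A simplified, textbook form of PageRank can be written as a repeated redistribution of scores along links. The sketch below is an assumption-laden illustration, not Google's production system: the damping factor of 0.85 is the commonly quoted default, the iteration count is arbitrary, and the four-page web is invented.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: direction matters, since nothing links to "orphan".
web = {
    "home": ["docs", "blog"],
    "docs": ["home"],
    "blog": ["home", "docs"],
    "orphan": ["home"],
}

ranks = pagerank(web)
print(sorted(ranks, key=ranks.get, reverse=True))  # ['home', 'docs', 'blog', 'orphan']
```

The page that receives links from every other page ends up on top, and the page that receives none ends up at the bottom even though it links out, which shows why one-way and two-way connections have to be stored differently.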

## Graph Analytics: Benefits, Strengths, and Challenges

Because graphs are well suited to representing information whose importance derives from its relationships, especially in large datasets, the main benefit of graph analytics is the ability to characterize, evaluate, and predict the relationships represented by the graph.

One strength of graph analytics is the ease with which information can be added to a graph. Many of the most commonly used systems for working with connected data, such as relational database management systems, require a complete understanding of the data and its relationships *before* storage and investigation can occur. But adding new nodes and connections to a graph doesn't invalidate existing data processing or relationships. This means that you don't need to understand everything about your data to begin storing and investigating it. This also means that maintenance of a graph database involves less risk because there is no need to modify data models as the dataset grows.

Another strength of graph analytics is its use for predictive or discovery types of analysis. An example of this type of analysis is media recommendation engines, which examine both the existing relations between people and media and the similarities between people based on their media choices to generate recommendations.

Graph analytics faces many of the same challenges as other connected data systems, such as computer processing time for querying the data. However, the characteristics of graphs themselves can also create longer query times or require more hardware because of the complexity of the type of graph and its randomness. (The randomness of a graph results from there being fewer constraints on a graph than on other connected data systems when adding new data.)

## The Limitations of Graph Analytics

Although graphs are the ideal way to store information whose value derives from its relationships, the storage and analysis of graphs has limitations based on the software and hardware architecture chosen to implement it.

Information can be modeled as a graph and stored in *graph database* software, or it can be modeled using existing relational database techniques and manipulated in tabular form. One limitation to using a graph database is that relational databases are currently more widely used and better understood by computer professionals. Additionally, it may take more time to retrieve information from a graph database if you're also using it to store connected data that relates through more conventional relational database methodologies.

Another limitation of a graph database is the amount of time required to traverse through the graph to respond to a query. The time required depends on what fraction of the graph the query looks at in order to return its results. Recall the earlier discussion of credit card fraud detection. In that example, fraud was detected because representing the data as a graph made it simpler for the software to examine a larger set of nodes. However, the query to examine this larger set of nodes would take longer to return results than a query that restricted the amount of the graph it examined. As another example, a query asking for all the people connected to you through multiple layers of acquaintance would take longer than a query asking for all the people connected directly to you.
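How quickly the work grows with the number of layers can be seen by counting the nodes a query has to touch. The sketch below builds an artificial acquaintance network in which every person knows ten new people, then counts how many nodes fall within one, two, and three hops. The network shape, the fanout of ten, and the function names are assumptions chosen to keep the arithmetic easy to follow; real networks are far less regular.

```python
def acquaintance_tree(fanout, depth):
    """Build a hypothetical network where every person knows `fanout` new people."""
    adjacency, level, next_id = {0: set()}, [0], 1
    for _layer in range(depth):
        new_level = []
        for person in level:
            for _slot in range(fanout):
                adjacency[person].add(next_id)
                adjacency[next_id] = {person}
                new_level.append(next_id)
                next_id += 1
        level = new_level
    return adjacency

def nodes_examined(adjacency, start, max_hops):
    """Count the nodes a breadth-first query touches within max_hops of start."""
    frontier, seen = {start}, {start}
    for _hop in range(max_hops):
        frontier = {n for node in frontier for n in adjacency.get(node, ())} - seen
        if not frontier:
            break
        seen |= frontier
    return len(seen)

network = acquaintance_tree(fanout=10, depth=3)
for hops in (1, 2, 3):
    print(hops, nodes_examined(network, 0, hops))
# 1 11
# 2 111
# 3 1111
```

Each extra layer multiplies the number of nodes examined by roughly the fanout, which is why a query limited to direct connections returns much sooner than one that follows several layers.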

## New Developments in Graph Analytics at Pacific Northwest National Laboratory

Many projects at Pacific Northwest National Laboratory (PNNL) leverage graph analytics in areas as varied as transportation, molecular biology, and energy sensor networks. ExaGraph, a software effort carried out in collaboration with other national laboratories, addresses an issue common to many graph analytics projects: long computation time. ExaGraph identifies the difficult mathematical problems at the heart of graph analytics and works to decrease their processing time. Because ExaGraph speeds up basic algorithms, such as graph partitioning and graph clustering (random walk), its benefits can be applied to any project.

Ongoing efforts (2021) at PNNL touch on many different aspects of graph analytics. PNNL continues to expand its application of graph analytics to challenging problems in computational chemistry, bioinformatics, high-energy physics, and power engineering. PNNL also continues existing work developing graph algorithms specifically for new graphics processing units and other hardware accelerators as this hardware becomes available to researchers, continuing to decrease the processing time required to analyze graph data. On the more theoretical end, the National Security Directorate will continue to apply hypergraphs (an abstraction of graphs in which a single relationship can connect more than two nodes at once) to national security concerns. Several groups within the Advanced Computing, Mathematics and Data Division continue their efforts on a scalable framework for high-level shared-memory programming that incorporates data structures created specifically for high-data-rate processing.
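To make the hypergraph idea concrete, the sketch below stores each relationship as a named set of nodes, so one relationship can join three or more nodes at once. The meetings, people, and function names are invented for illustration and say nothing about how PNNL represents hypergraphs in its own work.

```python
from itertools import combinations

# Hypothetical hyperedges: each one connects every node in its set at once.
hyperedges = {
    "meeting-1": {"Ana", "Ben", "Cy"},
    "meeting-2": {"Cy", "Dee"},
    "paper-1": {"Ana", "Cy", "Dee", "Eli"},
}

def memberships(hyperedges):
    """Invert the mapping: which hyperedges does each node belong to?"""
    result = {}
    for name, members in hyperedges.items():
        for node in members:
            result.setdefault(node, set()).add(name)
    return result

def shared_hyperedges(hyperedges, a, b):
    """Names of the hyperedges that contain both a and b."""
    return {name for name, members in hyperedges.items() if a in members and b in members}

def to_pairwise(hyperedges):
    """Flatten to an ordinary graph of pairs; the grouping information is lost."""
    pairs = set()
    for members in hyperedges.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

print(sorted(shared_hyperedges(hyperedges, "Ana", "Cy")))  # ['meeting-1', 'paper-1']
print(len(to_pairwise(hyperedges)))                        # 8
```

Flattening to pairs keeps the fact that Ana and Cy are connected but discards whether they met as part of a group of three or a group of four, which is the information a hypergraph preserves.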

In addition to fine-tuning and broadening the application of graph analytics in fields where it is already used, PNNL plans to emphasize research in the emerging areas of graph representation learning and geometric deep learning. Both are categories of machine learning: graph representation learning algorithms determine the characteristic features of a dataset, and geometric deep learning algorithms leverage artificial neural networks to build complex relationships from simple ones in a layered approach. Early successes working with these machine learning algorithms include an investigation into the energies of different water molecule clusters.

## Graph Analytics in Use at Pacific Northwest National Laboratory

PNNL performs basic science research with the aim of innovating sustainable energy technology. This includes collaborating with academia and industry to share the benefits of their research results as widely as possible.

Computer scientists at PNNL have developed Ripples, a software tool that combines social network analysis methods with parallel computing to produce graph analytics results nearly in real time. At the heart of the software tool is the issue of influence maximization, which addresses the questions of which nodes in a graph have the largest ability to affect the remaining nodes and how the connectivity of the graph as a whole allows influence to spread. Ripples has been applied to areas including air traffic control disruptions and tracking infectious diseases, and showcases PNNL’s collaborative efforts: mitigating air traffic control disruptions involved team members from Northeastern University and the Department of Transportation, while the development of Ripples' core algorithms included team members from Washington State University at Pullman.
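The influence maximization question can be illustrated with a heavily simplified model. In the sketch below, influence always spreads along a directed edge, and seeds are chosen greedily: each round adds the node that reaches the most nodes not yet reached. This is not the algorithm Ripples uses; practical tools work with probabilistic spread and parallel computation. The graph and function names are hypothetical.

```python
def reach(graph, seeds):
    """All nodes reachable from the seeds by following directed edges."""
    reached, stack = set(), list(seeds)
    while stack:
        node = stack.pop()
        if node in reached:
            continue
        reached.add(node)
        stack.extend(graph.get(node, ()))
    return reached

def greedy_seeds(graph, k):
    """Pick up to k seeds, each time adding the node that enlarges the reach most."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    seeds = []
    for _ in range(min(k, len(nodes))):
        covered = reach(graph, seeds)
        best = max(
            (n for n in nodes if n not in seeds),
            key=lambda n: len(reach(graph, [n]) - covered),
        )
        seeds.append(best)
    return seeds

# Hypothetical "who influences whom" graph.
influence = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "e": ["f"],
    "g": [],
}

seeds = greedy_seeds(influence, 2)
print(seeds)                         # ['a', 'e']
print(len(reach(influence, seeds)))  # 6
```

The second seed is `e` rather than `b` or `c`, because everything those two can reach is already covered by `a`. The choice depends on the connectivity of the whole graph, not just on how many direct connections a node has.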

Teams at PNNL are also applying graph analytics to assess car travel times and reduce traffic congestion using the Uber Movement dataset. One team analyzed the time required to travel by ride-sharing between traffic analysis zones in Los Angeles. By organizing the travel data from the ride-sharing company Uber as a graph, the team was able to apply graph analytics techniques to identify patterns in travel times. Because the Uber Movement dataset includes only some of the traffic zones, the team also used graph analytics to predict travel times between the complete set of traffic zones in Los Angeles. A second team created analysis software that leverages graph analytics to alleviate traffic congestion such as bottlenecks and chokepoints. This software, called TranSEC, utilizes high-performance computing resources, allowing traffic engineers to see results in minutes rather than hours. Unlike other congestion modeling software, TranSEC begins with partial information and uses machine learning techniques to fill in gaps.
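The source does not describe how either team computed its results, so the following is only a generic illustration of treating zones as nodes and travel times as connection values. It uses Dijkstra's shortest-path algorithm to estimate the travel time between two zones that have no direct observation by chaining observed links. The zone names, the times, and the assumption that travel times are the same in both directions are all invented for this example.

```python
import heapq

# Hypothetical observed travel times in minutes between traffic zones.
observed_minutes = {
    ("Z1", "Z2"): 7.0,
    ("Z2", "Z3"): 5.0,
    ("Z1", "Z4"): 4.0,
    ("Z4", "Z3"): 10.0,
    ("Z3", "Z5"): 6.0,
}

def build_graph(times):
    """Store each observation in both directions (an assumption of this sketch)."""
    graph = {}
    for (a, b), minutes in times.items():
        graph.setdefault(a, {})[b] = minutes
        graph.setdefault(b, {})[a] = minutes
    return graph

def estimate_minutes(graph, origin, destination):
    """Dijkstra's algorithm: smallest total time along observed links."""
    best = {origin: 0.0}
    queue = [(0.0, origin)]
    while queue:
        elapsed, zone = heapq.heappop(queue)
        if zone == destination:
            return elapsed
        if elapsed > best.get(zone, float("inf")):
            continue  # a faster way to this zone was already found
        for nxt, minutes in graph.get(zone, {}).items():
            candidate = elapsed + minutes
            if candidate < best.get(nxt, float("inf")):
                best[nxt] = candidate
                heapq.heappush(queue, (candidate, nxt))
    return None  # no chain of observed links connects the two zones

zones = build_graph(observed_minutes)
print(estimate_minutes(zones, "Z1", "Z5"))  # 18.0
```

There is no direct observation between `Z1` and `Z5`, so the estimate comes from the fastest chain of observed links, `Z1` to `Z2` to `Z3` to `Z5`.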