Technology Overview
Analysts in business and government face millions of pieces of data to process for insights. Visual analytics tools help these analysts dig into large data collections, such as volumes of reports, social media posts, and news articles. But to be successful, users must understand the tools deeply and control them explicitly. In other words, they must know what to query, and how, to find the needle in the haystack.
Researchers at Pacific Northwest National Laboratory (PNNL) are experts in data visualization. Tools such as IN-SPIRE™ and the copyrighted Scalable Reasoning System allow analysts to begin to understand their data. Now, PNNL researchers have gone a step further by developing a system that infers the user's reasoning from the user's interactions with the tool and uses that inference to query, process, and visualize information more effectively.
TexTonic is a visual analytic system for interactive exploration of large text datasets in a single, multi-scale spatial layout. The system visualizes data at multiple levels of aggregation (terms, phrases, snippets, and full documents) in a spatial layout like a map, where the distance between terms represents the relative similarity between terms. Users can interactively explore the data by directly manipulating information on the map. For example, users can drag and move two terms closer together to increase their relative similarity (and underlying value in the text model). They can also arrange the layout and enlarge or shrink terms to give them greater or lesser weight in the analysis. The system then infers the user’s analytical reasoning and steers the underlying data model and representative visual model. For example, the model might reduce the number of dimensions, change the weighting of information, and retrieve different information, based on user interactions.
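The drag-to-increase-similarity interaction described above can be illustrated with a small sketch. This is not TexTonic's implementation, which the overview does not specify; it assumes, purely for illustration, that each term is a short feature vector and that the model is a set of per-dimension weights. The function names, the example vectors, and the `rate` parameter are all hypothetical.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two term vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def pull_together(a, b, weights, rate=0.5):
    """Reweight dimensions after the user drags terms a and b closer.

    Dimensions on which the two terms agree are boosted and dimensions on
    which they differ are damped. The weights are renormalized to sum to 1.
    `rate` (hypothetical, must stay below 1) controls the update strength.
    """
    diffs = [abs(x - y) for x, y in zip(a, b)]
    max_diff = max(diffs) or 1.0  # avoid dividing by zero for identical terms
    raw = [
        w * (1.0 + rate * (1.0 - 2.0 * d / max_diff))
        for w, d in zip(weights, diffs)
    ]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical three-dimensional term vectors (illustrative values only).
flu = [0.9, 0.1, 0.8]
virus = [0.8, 0.7, 0.7]
weights = [1 / 3, 1 / 3, 1 / 3]

before = weighted_distance(flu, virus, weights)
new_weights = pull_together(flu, virus, weights)
after = weighted_distance(flu, virus, new_weights)

print(f"distance before drag: {before:.3f}")
print(f"distance after drag:  {after:.3f}")
```

In this toy model, the second dimension, where the two terms disagree most, loses weight, so the terms end up closer under the updated model. A real system would also re-project every other term with the new weights, which is how a single drag can reshape the whole map.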
TexTonic has been field-tested. For example, a set of users employed TexTonic to analyze all of English-language Wikipedia (more than 4 million documents). One of the results was a Wikipedia world map that represents the collection as geographic terrain, where peaks and valleys represent the amount of information about a topic. TexTonic's ingest pipeline processed all of these documents in 13 hours on a single workstation with 48 GB of RAM and 24 cores (six 4-core Xeon processors). After processing, the user can interactively explore the entire dataset in real time using TexTonic's visualization interface.
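As a back-of-the-envelope check, the ingest figures above imply the following average throughput. The calculation assumes exactly 4 million documents and even use of all 24 cores, neither of which the overview confirms, so treat the result as a rough lower bound rather than a benchmark.

```python
# Figures quoted in the text; the document count is rounded down to 4 million.
documents = 4_000_000
hours = 13
cores = 24

seconds = hours * 3600
docs_per_second = documents / seconds
docs_per_core_second = docs_per_second / cores

print(f"{docs_per_second:.1f} documents/s overall")
print(f"{docs_per_core_second:.2f} documents/s per core")
```

That works out to roughly 85 documents per second for the workstation as a whole.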
The approach engages the perceptual and cognitive processes of the user to detect patterns, relationships, and other informal insights about the information. Users can also test hypotheses and assertions. In addition, TexTonic scales to extremely large data sets that can stymie other visualization systems.
Applicability
TexTonic can be used in any analytical setting where voluminous text data must be evaluated. Examples include epidemic tracking, law enforcement and intelligence monitoring, scientific research, and new technology development.
Advantages
- Learns from the user to teach the user how best to query and analyze voluminous data
- Can handle millions of documents, with relatively short ingest times (for example, 4 million documents in 13 hours)
- Leads to unique insights, increasing analyst efficiency and decreasing time to results