Systems and Process for Identifying Features and Determining Feature Associations in Groups of Documents

Battelle Number: 30187

Technology Overview

A central problem in information analysis is that analysts rarely have the time or resources to review large volumes of information, and traditional approaches such as lists, tables, and simple graphs often make such a review impractical. Many traditional text analysis techniques focus on selecting features that distinguish documents from one another within a document group. However, these techniques may fail to select features that characterize the majority of documents in a group, or a minor subset of them. In addition, when the information is streaming or updated over time, the group is dynamic and can change significantly. Most current tools only let information consumers interact with snapshots of an information space that is often continually changing. Tools are needed that automatically identify the themes, topics, and trends within these large volumes of information and help users understand them.

To meet this need, PNNL scientists have developed a method that selects features according to how well they predict themselves, and measures the association between arbitrary pairs of features according to how well they predict each other. The computation of predictive ability leverages a) automatic feature extraction algorithms, such as the Rapid Automatic Keyword Extraction (RAKE) algorithm, which identify the features expressed within individual objects; and b) search functions that identify all objects in a collection in which an arbitrary feature occurs. The PNNL-developed method describes how the feature-object information generated by feature extraction and search functions can be combined to measure the predictive ability of features for themselves and for each other, thereby improving analytic capabilities that rely on insight into features and feature associations within object collections.
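The following is a minimal Python sketch of how feature-object information from extraction and search might be combined into a predictive-ability score. It is an illustration only, not the patented method: the scoring ratio, the function names `build_indexes` and `predictive_ability`, the toy documents, and the hand-written extraction output are all assumptions. A naive substring match stands in for a real search function, and the extracted features are supplied directly rather than produced by RAKE.

```python
from collections import defaultdict


def build_indexes(extracted_by_doc, text_by_doc):
    """Build two feature -> document-id maps (hypothetical helper).

    extracted_in: documents from which the feature was extracted.
    occurs_in:    documents in which the feature occurs anywhere in the text.
    """
    features = {f for feats in extracted_by_doc.values() for f in feats}
    extracted_in = defaultdict(set)
    occurs_in = defaultdict(set)
    for doc_id, feats in extracted_by_doc.items():
        for feature in feats:
            extracted_in[feature].add(doc_id)
    for doc_id, text in text_by_doc.items():
        lowered = text.lower()
        for feature in features:
            # Assumption: substring matching stands in for a search function.
            if feature in lowered:
                occurs_in[feature].add(doc_id)
    return extracted_in, occurs_in


def predictive_ability(a, b, extracted_in, occurs_in):
    """Assumed score: of the documents in which `b` occurs, the fraction
    from which `a` was extracted. With a == b this scores how well a
    feature predicts itself."""
    docs_with_b = occurs_in.get(b, set())
    if not docs_with_b:
        return 0.0
    return len(extracted_in.get(a, set()) & docs_with_b) / len(docs_with_b)


# Toy collection; the extracted features are written by hand for illustration.
texts = {
    "d1": "Keyword extraction with RAKE supports text analysis.",
    "d2": "Text analysis of streaming news.",
    "d3": "Keyword extraction for streaming news.",
}
extracted = {
    "d1": {"keyword extraction", "text analysis"},
    "d2": {"streaming news"},
    "d3": {"keyword extraction", "streaming news"},
}

extracted_in, occurs_in = build_indexes(extracted, texts)
self_score = predictive_ability(
    "text analysis", "text analysis", extracted_in, occurs_in
)
pair_score = predictive_ability(
    "keyword extraction", "streaming news", extracted_in, occurs_in
)
print(self_score, pair_score)
```

Under this assumed ratio, a feature that is extracted from every document in which it occurs scores 1.0 as a predictor of itself, while a feature that often appears in passing without being extracted scores lower. Other scoring functions could be substituted without changing how the two indexes are built.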

Advantages

Improves analytic capabilities that rely on insight into features and feature associations within information collections
Identifies topics and trends within large volumes of information

Availability

Available for licensing in all fields

Keywords

Information Analysis, Keyword Extraction, Text Analysis Techniques, RAKE

IP files

Patent #: 9,235,563

Portfolio

DS-Visualization

Market Sectors

Data Sciences