
PNNL

IN-SPIRE™ Visual Document Analysis

Frequently Asked Questions

1. What is IN-SPIRE™?

2. What does IN-SPIRE do?

3. What types of documents can it process?

4. What do I have to tell it about the format of my documents?

5. How do I get my data into IN-SPIRE?

6. How long does it take to process a set of documents?

7. How does IN-SPIRE work?

8. How do I install the software?

9. Is technical support available?

10. Can IN-SPIRE be integrated with my database?

11. What is the Galaxy visualization?

12. What are the blue shaded areas in the Galaxy?

13. What is the ThemeView™ Visualization?

14. What do the ThemeView peak height and color mean?

15. How are the ThemeView peak labels related to the cluster labels?

16. What if some text isn't in English?

What is IN-SPIRE™?

IN-SPIRE provides tools for exploring textual data, including Boolean and “topical” queries, term gisting, and time/trend analysis. This suite of tools allows the user to rapidly discover hidden information relationships by reading only pertinent documents. IN-SPIRE has been used to explore technical and patent literature, marketing and business documents, web data, accident and safety reports, newswire feeds, and more. It has applications in many areas, including information analysis, strategic planning, and medical research.

IN-SPIRE has the following goals: 

  • Quickly create meaningful visualizations of text documents. 
  • Provide effective ways to explore and understand large collections of text without reading every document.

What does IN-SPIRE do?

IN-SPIRE’s strength is in its ability to quickly scan thousands of documents, determine the topical content of those documents, and then present the documents in an interactive visual context. Since it requires almost no advanced knowledge of the information being processed, IN-SPIRE is a great tool for identifying information hidden in documents and understanding its “topical landscape.” IN-SPIRE provides several query and display tools to support deeper analysis and interrogation of the information space.

What types of documents can it process?

IN-SPIRE organizes and visualizes the topical content of multiple types of text files. These files may come from web pages, databases, optical character recognition (OCR) output, message traffic, or other sources. IN-SPIRE supports the ASCII, UTF-8, and UTF-16 encodings. It also ingests most types of PDF, MS-Word, MS-Excel, and RTF files, as well as emails and spreadsheets. IN-SPIRE can ingest documents formatted in XML or JSON and can read text in various web formats, such as HTML and RSS/XML. It can retrieve HTML directly from the web or from local file systems and strips the markup before analysis.

What do I have to tell IN-SPIRE about the format of my documents?

The only information that IN-SPIRE needs to analyze a collection of documents is the starting point of each document. For example, if a user were to provide IN-SPIRE with 1,000 news articles that were each stored in a file, they would need to identify the files for IN-SPIRE and specify the string of characters listed at the beginning of each document. If the documents contain structured fields, such as titles or dates, the user may identify them so that IN-SPIRE can query them separately from other document content during analysis. 
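As an illustration of what "the starting point of each document" means, here is a minimal Python sketch that splits one block of raw text into documents wherever a line begins with a marker string. The function name `split_documents`, the `<DOC>` marker, and the `TITLE:` field are assumptions made up for this example; they are not IN-SPIRE settings or file formats.

```python
def split_documents(text, start_marker):
    """Split raw text into documents.

    A new document begins at every line that starts with start_marker.
    Lines before the first marker are ignored.
    """
    docs = []
    current = None
    for line in text.splitlines():
        if line.startswith(start_marker):
            if current is not None:
                docs.append("\n".join(current))
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        docs.append("\n".join(current))
    return docs


# Hypothetical sample: two articles, each starting with "<DOC>".
sample = (
    "<DOC>\nTITLE: Storm closes port\nHeavy rain fell overnight.\n"
    "<DOC>\nTITLE: Markets rally\nStocks rose on Tuesday."
)
docs = split_documents(sample, "<DOC>")

# A structured field such as a title can then be pulled out separately.
titles = [
    line[len("TITLE:"):].strip()
    for doc in docs
    for line in doc.splitlines()
    if line.startswith("TITLE:")
]
```

Once the documents are separated this way, a field like the title can be handled apart from the body text, which is the idea behind identifying structured fields.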

How do I get my data into IN-SPIRE?

Create a dataset by specifying a data source, such as local files, folders, or a remote web site. If desired, specify additional text processing and formatting parameters. IN-SPIRE’s dataset editor provides a step-by-step walkthrough of the process.

How long does it take to process a set of documents?

Although this is largely dependent upon the speed and capacity of the computer, IN-SPIRE will process a typical dataset of 3,000 documents in under a minute. The software is capable of processing upward of 100,000 one-page documents in minutes on newer desktop computer configurations. Although there are no theoretical limits for IN-SPIRE’s dataset size or number of documents, the practical upper limit for the number of documents IN-SPIRE can process while maintaining responsive interactions with visualizations ranges from 30,000 to 60,000 documents.

How does IN-SPIRE work?

In brief, IN-SPIRE creates mathematical representations of the documents, which are then organized into clusters and visualized into "maps" that can be interrogated for analysis.

More specifically, IN-SPIRE performs the following steps:

  1. The text engine scans through the document collection and automatically determines the distinguishing words or "topics" within the collection, based upon statistical measurements of word distribution, frequency, and co-occurrence with other words. Distinguishing words are those that help describe how each document in the dataset is different from any other document. For example, the word "and" would not be considered a distinguishing word, because it is expected to occur frequently in every document. In a dataset where every document mentions Iraq, "Iraq" wouldn't be a distinguishing word either.
  2. The text engine uses these distinguishing words to create a mathematical signature for each document in the collection. Then it does a rough similarity comparison of all the signatures to create cluster groupings.
  3. IN-SPIRE compares the clusters against each other for similarity, and then arranges them in high-dimensional space (about 200 axes) so that similar clusters are located close together. The clusters can be thought of as a mass of bubbles, but in 200-dimensional space instead of just three.
  4. That high-dimensional arrangement of clusters is then flattened down to a comprehensible two dimensions, trying to preserve a picture where similar clusters are located close to each other and dissimilar clusters are located far apart. Finally, the documents are added to the picture by arranging each within the invisible “bubble” of its respective cluster. All of this information is then mapped onto the Galaxy and ThemeView™ visualizations that convey the document and topical relationships of the information.
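Steps 1 and 2 above can be sketched in a few lines of Python. This is a simplified illustration, not IN-SPIRE's actual algorithm: it assumes a word is "distinguishing" if it does not occur in every document, uses raw term counts as the signature, compares signatures with cosine similarity, and groups documents with a greedy threshold rule. All function names and the 0.5 threshold are hypothetical. The high-dimensional layout and 2D projection of steps 3 and 4 are not covered.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    words = (w.strip('.,;:!?"()').lower() for w in text.split())
    return [w for w in words if w]


def distinguishing_terms(docs_tokens):
    """Assumed rule: keep terms that do NOT appear in every document."""
    n = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    return sorted(t for t, count in doc_freq.items() if count < n)


def signature(tokens, terms):
    """Vector of term counts over the distinguishing terms."""
    counts = Counter(tokens)
    return [counts[t] for t in terms]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rough_clusters(signatures, threshold=0.5):
    """Greedy grouping: join the first cluster whose seed is similar enough."""
    seeds, labels = [], []
    for sig in signatures:
        for label, seed in enumerate(seeds):
            if cosine(sig, seed) >= threshold:
                labels.append(label)
                break
        else:
            seeds.append(sig)
            labels.append(len(seeds) - 1)
    return labels


docs = [
    "Iraq and oil exports rise",
    "Oil exports and prices in Iraq",
    "Iraq election and voters",
    "Voters and the election in Iraq",
]
tokens = [tokenize(d) for d in docs]
terms = distinguishing_terms(tokens)
labels = rough_clusters([signature(t, terms) for t in tokens])
```

In this toy collection every document mentions "Iraq" and "and", so neither ends up in `terms`, and the two oil documents land in one group while the two election documents land in another.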

How do I install the software?

Visit Get a Copy to learn how to download the software.

Please note: Almost all versions of IN-SPIRE are copy-protected and require input of an unlock code before the software will operate. Unlock codes are sent via email and are based on information obtained from the activation program installed with IN-SPIRE.

Is technical support available?

Video tutorials are available here. Most users will benefit from a short training session that covers the key aspects of using the tool. Training sessions usually consist of a 4–6-hour, hands-on class that covers the general capabilities of the system along with tips and techniques for data import and analysis. Classes are usually held at the user’s site.

In some cases, an organization may have greater support needs, such as datasets that require some level of preprocessing. Pacific Northwest National Laboratory can assist in these cases as well, on a time and materials basis. Contact us for more information.

Can IN-SPIRE be integrated with my database?

Some installations of IN-SPIRE process information exclusively from a database interface. IN-SPIRE can be configured to interface with most database systems that support the HTTP or HTTPS protocols. Installation of a database interface involves some level of software customization.

What is the Galaxy visualization?

In the Galaxy visualization, individual documents are represented as gray dots. With this visualization, the goal is to give the user a view of the dataset where closely related documents are generally located close to one another and dissimilar documents are far apart. It is not a perfect representation of the document relationships due to the squeezing that occurs in reducing high-dimensional space down to 2D space, but it gives a good starting point and general overview to work with. 

What are the blue shaded areas in the Galaxy?

The shaded areas on the Galaxy are "ThemeClouds," which are analogous to ThemeView peaks. ThemeClouds provide a 2D representation of theme strength. Areas with higher thematic content and/or document density are more intensely colored in blue. Areas with less document density and thematic content are more lightly colored.

What is the ThemeView visualization?

The ThemeView visualization is the fastest way to get an overview of your document collection. It translates the Galaxy into a 3D “landscape” of the information space.

Think of the Galaxy as the “flat” sea-level foundation for a ThemeView. Each document that has content related to a major theme in the overall document collection will add height to the peak in that location (how much it adds will depend on the strength of that theme's relevance to that document). If a document is not related to that theme, it won't add any height to the layer. Repeating this layer-building process for all 200 or so major themes (i.e., topics) in the dataset, stacking the layers on top of each other and smoothing the results, creates the thematic summary view—ThemeView.
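The layer-building idea can be sketched as follows. This is a rough illustration under stated assumptions, not IN-SPIRE's implementation: documents sit at integer grid cells, each carries a hypothetical dictionary of theme strengths, one layer is built per theme, the layers are summed, and a simple box average stands in for whatever smoothing the real system uses.

```python
def smooth(grid, radius=1):
    """Box-average each cell with its neighbors (assumed smoothing method)."""
    n = len(grid)
    out = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            cells = [
                grid[rr][cc]
                for rr in range(max(0, r - radius), min(n, r + radius + 1))
                for cc in range(max(0, c - radius), min(n, c + radius + 1))
            ]
            out[r][c] = sum(cells) / len(cells)
    return out


def themeview_heights(docs, grid_size, radius=1):
    """docs: list of (x, y, {theme: strength}) with x, y as grid cells.

    Builds one layer per theme, stacks the layers, and smooths the result.
    """
    themes = sorted({theme for _, _, weights in docs for theme in weights})
    total = [[0.0] * grid_size for _ in range(grid_size)]
    for theme in themes:
        layer = [[0.0] * grid_size for _ in range(grid_size)]
        for x, y, weights in docs:
            # A document unrelated to this theme adds no height to the layer.
            layer[y][x] += weights.get(theme, 0.0)
        for r in range(grid_size):
            for c in range(grid_size):
                total[r][c] += layer[r][c]
    return smooth(total, radius)


# Hypothetical Galaxy positions and theme strengths.
docs = [
    (1, 1, {"oil": 0.9}),
    (1, 1, {"oil": 0.6, "election": 0.1}),
    (4, 4, {"election": 0.2}),
]
heights = themeview_heights(docs, grid_size=6)
```

Here the two strongly themed documents at cell (1, 1) produce a taller peak than the single weakly themed document at (4, 4), and cells far from any document stay at height zero.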

What do the ThemeView peak height and color mean?

The labels flagging the peaks reveal what the strongest themes are under those peaks. Areas of documents with very similar thematic content contain tall peaks, while areas of documents with weaker thematic relationships never rise above sea level. The coloring of a ThemeView allows the user to know how far above sea level a region is—yellow being the highest. If the documents in a region are practically void of any thematic content, they are represented at sea level height on the ThemeView. If there are only one or two documents in a region that are unusually packed full of topical content, they are represented as tall peaks on the ThemeView.

How are the ThemeView peak labels related to the cluster labels?

The ThemeView landscape is created by piling up the topicality of individual documents, so users will generally see higher peaks in areas of high document density. The number, placement, and height of peaks correlate only indirectly with the clusters. Because the peaks are based strictly on the Galaxy documents underneath, not on the cluster groupings, an area under a peak may, and often does, include documents from multiple clusters.

In addition, the words used to label the cluster centroids are terms with the highest frequency count, whereas the ThemeView labels are words with the highest topical content in the region. These factors help explain why the ThemeView peak labels often differ from cluster centroid labels.

What if some text isn't in English?

IN-SPIRE visualizations are language-independent, although the use of system or custom stop words is recommended for optimal visualizations. For some languages, such as Chinese, preprocessing with a segmentation tool may be necessary. If the data contain text in multiple languages, the documents from one language may use very different terms than documents from another, and visualizations will naturally show this division.
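A small Python sketch shows why stop words matter and why mixed-language data tends to split apart. The stop word lists and the function name `content_terms` are assumptions for this example only; they are not IN-SPIRE's system stop words or API.

```python
from collections import Counter

# Tiny illustrative stop word lists (a real list would be much longer).
ENGLISH_STOP_WORDS = {"the", "and", "of", "in", "a"}
SPANISH_STOP_WORDS = {"el", "la", "y", "de", "en"}


def content_terms(text, stop_words):
    """Count the words that remain after removing stop words."""
    words = (w.strip('.,;:!?"()').lower() for w in text.split())
    return Counter(w for w in words if w and w not in stop_words)


stop_words = ENGLISH_STOP_WORDS | SPANISH_STOP_WORDS
english = content_terms("The price of oil and the price of gas", stop_words)
spanish = content_terms("El precio de la gasolina y el petróleo", stop_words)

# Terms the two documents have in common after stop word removal.
shared = set(english) & set(spanish)
```

After filtering, the English and Spanish sentences share no terms even though they discuss the same subject, which is why documents in different languages naturally end up in separate regions of a visualization.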

IN-SPIRE does support some third-party language detection and machine-translation software. If a user is working with documents in a language they cannot read, they can translate document titles and text on demand or translate queries from their native language into the language used in the documents.

Pacific Northwest National Laboratory (PNNL) is managed and operated by Battelle for the Department of Energy