PNNL-WSU Data Day

Meeting/Workshop

PNNL-WSU Data Day is a celebration of collaboration around data!

Image by Chris DeGraaf | Pacific Northwest National Laboratory

November 10, 8:30 am - 5:00 pm

Discovery Hall
Pacific Northwest National Laboratory
650 Horn Rapids Road
Richland, WA 99354

Researchers from PNNL and WSU will be featured throughout the day with oral presentations, flash talks, and a poster session. The goal of the event is to learn about the work each institution is doing in the exploration and analysis of data and form new connections and collaborations.

Abstracts

Session 1.1

Keynote

Deep Reinforcement Learning for Cyber System Defense under Dynamic Adversarial Uncertainties

Sam Chatterjee (PNNL)

Developing autonomous cyber system defense strategies and action recommendations in the real world is challenging and requires characterizing system state uncertainties and attack-defense dynamics. In this talk, we present a data-driven deep reinforcement learning (DRL) framework to learn proactive, context-aware defense countermeasures that dynamically adapt to evolving adversarial behaviors while minimizing loss of cyber system operations. A dynamic defense optimization problem is formulated with multiple protective postures against different types of adversaries with varying levels of skill and persistence. A custom simulation environment was developed, and experiments were devised to systematically evaluate the performance of four model-free DRL algorithms against realistic, multi-stage attack sequences. Our results suggest the efficacy of DRL algorithms for proactive cyber defense under multi-stage attack profiles and system uncertainties.

Dynamic Parametric Modelling of Large-scale Data via Change Points

Abhishek Kaul (WSU)

This talk will serve as an expository note on the versatility of change point models as an approach to modelling non-stationarity and/or non-linearity of data-generating processes, with an emphasis on the case of a high-dimensional or diverging number of parameters. Recent modelling frameworks, including first-order (mean shift), second-order (covariance shift), and evolving network-type models, will be discussed along with some applications. General-purpose algorithms used to implement these models, their available statistical properties, and currently open research directions will be highlighted.
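As a minimal illustration of the first-order (mean shift) case named in this abstract, the sketch below estimates a single change point in a univariate sequence by least squares. It is a textbook low-dimensional example, not the speaker's method; the function name `estimate_mean_shift` and the toy data are assumptions made for illustration.

```python
from typing import List, Tuple


def estimate_mean_shift(x: List[float]) -> Tuple[int, float, float]:
    """Least-squares estimate of a single mean-shift change point.

    Returns (tau, mean_before, mean_after), where x[:tau] and x[tau:] are
    the two segments. Hypothetical helper, for illustration only.
    """
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(x)
    total_sq = sum(v * v for v in x)
    best_tau, best_cost = 1, float("inf")
    left_sum = 0.0
    for tau in range(1, n):
        left_sum += x[tau - 1]
        right_sum = total - left_sum
        # Residual sum of squares when each segment is fitted by its own mean.
        cost = total_sq - left_sum ** 2 / tau - right_sum ** 2 / (n - tau)
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    mean_before = sum(x[:best_tau]) / best_tau
    mean_after = sum(x[best_tau:]) / (n - best_tau)
    return best_tau, mean_before, mean_after


# Toy data (assumed): the mean jumps from about 0 to about 5
# after the fourth observation.
series = [0.1, -0.2, 0.0, 0.3, 5.1, 4.8, 5.2, 4.9]
tau, mean_before, mean_after = estimate_mean_shift(series)
```

The scan costs linear time because the segment sums are updated incrementally. The high-dimensional and multiple-change-point settings emphasized in the talk need more machinery than this sketch provides.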

Data Science and Modern Power Systems

Jason Fuller (PNNL)

What’s the role of data science in a modernized, decarbonized power grid? Since the advent of our current power delivery system, data has been a key to success. Sensor networks have been used to monitor and control the operational system, ensuring reliability and efficiency. Historical data has been used for planning new investments, forecasting system behaviors, and predicting potential threats. Developments in advanced sensor networks, new modeling and simulation techniques, and the availability and applicability of external datasets have led to a rapid adoption of data science techniques within the power system domain as these tools become more mainstream. This talk will discuss some of the ways PNNL has been applying data science tools and methods to real-world power system challenges.

Session 1.2

Data-Driven Challenges in Multi-Omics Applications and Integration

Lisa Bramer (PNNL)

Multi-omic experiments are at the forefront of understanding complex biological samples holistically, offering detailed pictures of the organism in question through the measurement of multiple types of biomolecules. The generation of multiple datasets at different scales has created a need for integration approaches that can capture the complex, often non-linear, interactions that define these biological systems and that are adapted to the challenges of combining heterogeneous data across ‘omic views. However, multiple analytic platforms are available, each with unique capabilities and differences in run-time variability, and therefore varying data analysis requirements. Further, a major challenge to multi-omic integration is missing data: not all biomolecules are measured in all samples, owing to factors such as instrument sensitivity, differences in the physicochemical properties of biomolecules, or other experimental factors. Recent methodological developments in artificial intelligence and statistical learning have greatly facilitated the analysis of multi-omics data; however, many of these techniques assume access to completely observed data and sample sizes larger than those in typical biological experiments. Here, we discuss potential avenues for further development of artificial intelligence and machine learning integration methods, as well as available datasets for these investigations.

Estimating Signal Proportion by Integral Equations

Xiongzhi Chen (WSU)

The “signal proportion” is an important quantity in statistical modelling and inference based on the two-component mixture model and its extensions, and in the control and estimation of the false discovery rate and false non-discovery rate. Most existing estimators of this proportion threshold p-values, deconvolve the mixture model under constraints on its components, or depend heavily on the location-shift property of distributions. Hence, they are usually inconsistent, or inapplicable to non-location-shift distributions or to discrete statistics or p-values. To eliminate these shortcomings, we construct consistent estimators of the proportion as solutions to Lebesgue-Stieltjes integral equations. In particular, we provide such estimators respectively for random variables whose distributions have Riemann-Lebesgue type characteristic functions, form discrete natural exponential families with infinite supports, or form natural exponential families with separable moment sequences.
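For context, the following sketch shows one of the existing p-value thresholding estimators that this abstract contrasts with: Storey's estimator of the null proportion, from which the signal proportion is obtained as the complement. It is not the integral-equation estimator proposed in the talk; the function name `storey_signal_proportion`, the threshold `lam = 0.5`, and the toy p-values are assumptions.

```python
from typing import Sequence


def storey_signal_proportion(p_values: Sequence[float], lam: float = 0.5) -> float:
    """Thresholding estimate of the signal proportion (1 minus null proportion).

    Uses pi0(lam) = #{p > lam} / (m * (1 - lam)), capped at 1.
    Hypothetical helper, for illustration only.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    m = len(p_values)
    if m == 0:
        raise ValueError("need at least one p-value")
    # Under the null, p-values are uniform, so those above lam are mostly nulls.
    null_fraction = sum(1 for p in p_values if p > lam) / (m * (1.0 - lam))
    null_fraction = min(1.0, null_fraction)
    return 1.0 - null_fraction


# Toy p-values (assumed): four of ten exceed 0.5, so pi0 = 4 / (10 * 0.5) = 0.8.
p_values = [0.001, 0.002, 0.01, 0.03, 0.2, 0.45, 0.6, 0.7, 0.85, 0.95]
signal_proportion = storey_signal_proportion(p_values)
```

The estimator relies on null p-values being uniform, which is the kind of assumption that fails for discrete statistics, as the abstract notes.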

Session 2.1

Keynote

The Data Program at WSU: A Synopsis of the Current State and Plans for the Future

Jan Dasgupta and Jonathan Male (WSU)

In this talk we will discuss the current state of Data Analytics at WSU: what we have learned six years into our BS degree and what we plan for the graduate degree. We will focus on the current state of the field and on why it is crucial for all scientists and researchers to have a basic understanding of data science. We will delve into algorithmic bias and data ethics and describe what we are doing to train responsible data scientists, and why we need to think beyond the coding and the math and look at the questions being asked.

Improving Data Representations: Applications in Molecular Property Prediction and Social Media Analysis

Emily Saldanha (PNNL)

The use of structured and unstructured data for downstream machine learning applications requires careful choices about how to best represent the data to encode relevant information for the task. In this talk, I will discuss representation learning approaches for two application areas. The first is related to the representation of molecular structures to support molecular property prediction. I will discuss the effect of representation choices on model accuracy and present analysis of what chemical knowledge is encoded into learned representations of molecular structures. In the second application, I will discuss representation learning methods to improve embeddings of social media text. Existing methods for text embeddings typically cluster documents by topic but struggle to differentiate by different user viewpoints towards those topics. I will present a novel weakly supervised learning method to leverage proxy signals to address this challenge.

KoPA: Automated Kronecker Product Approximation

Chencheng Cai (WSU)

We propose to approximate a given matrix by the sum of a few Kronecker products of matrices, which we refer to as the Kronecker product approximation (KoPA). Compared with low-rank matrix approximation, KoPA offers greater flexibility, since it allows the user to choose the configuration, that is, the dimensions of the two smaller matrices forming the Kronecker product. On the other hand, the configuration to be used is usually unknown and needs to be determined from the data to achieve the optimal balance between accuracy and parsimony. We propose to use extended information criteria to select the configuration. Under the paradigm of high-dimensional analysis, we show that the proposed procedure can select the true configuration with probability tending to one, under suitable conditions on the signal-to-noise ratio. We demonstrate the superiority of KoPA over low-rank approximations through numerical studies and several benchmark image examples.
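To make the idea of a Kronecker product approximation concrete, the sketch below fits a single Kronecker product B ⊗ C to a matrix for one fixed configuration. It uses the classical rearrangement trick (Van Loan and Pitsianis): after the entries are reordered, B ⊗ C becomes a rank-one matrix, which is fitted here by power iteration. This is an illustrative sketch only. It covers one term and does not perform the information-criterion configuration selection described in the abstract, and all function names are assumptions.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def kron(b: Matrix, c: Matrix) -> Matrix:
    """Kronecker product of b (m1 x n1) and c (m2 x n2)."""
    m2, n2 = len(c), len(c[0])
    return [[b[i // m2][j // n2] * c[i % m2][j % n2]
             for j in range(len(b[0]) * n2)]
            for i in range(len(b) * m2)]


def rearrange(a: Matrix, m1: int, n1: int, m2: int, n2: int) -> Matrix:
    """Reorder a so that kron(b, c) maps to the outer product vec(b) vec(c)^T."""
    return [[a[i1 * m2 + i2][j1 * n2 + j2]
             for i2 in range(m2) for j2 in range(n2)]
            for i1 in range(m1) for j1 in range(n1)]


def nearest_kronecker(a: Matrix, m1: int, n1: int, m2: int, n2: int,
                      iters: int = 200) -> Tuple[Matrix, Matrix]:
    """Fit one Kronecker product to a for the configuration (m1, n1, m2, n2)."""
    r = rearrange(a, m1, n1, m2, n2)
    u = [0.0] * (m1 * n1)
    v = [1.0] * (m2 * n2)  # assumed start; must not be orthogonal to the solution
    for _ in range(iters):
        u = [sum(rij * vj for rij, vj in zip(row, v)) for row in r]
        norm_u = math.sqrt(sum(x * x for x in u))
        if norm_u == 0.0:
            break
        u = [x / norm_u for x in u]
        # v absorbs the scale: v = R^T u, so u v^T is the rank-one fit.
        v = [sum(r[i][j] * u[i] for i in range(len(r))) for j in range(len(v))]
    b = [[u[i1 * n1 + j1] for j1 in range(n1)] for i1 in range(m1)]
    c = [[v[i2 * n2 + j2] for j2 in range(n2)] for i2 in range(m2)]
    return b, c


# Toy check (assumed data): an exact Kronecker product is recovered.
b0 = [[1.0, 2.0], [3.0, 4.0]]
c0 = [[0.0, 1.0], [1.0, 1.0]]
a = kron(b0, c0)
b_hat, c_hat = nearest_kronecker(a, 2, 2, 2, 2)
a_hat = kron(b_hat, c_hat)
```

The factors are identified only up to a scalar (B scaled by s and C by 1/s give the same product), so the check compares `a_hat` with `a` rather than the factors themselves.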

Session 2.2

Sampling and Multi-Objective Optimization for Computational Redistricting Problems

Daryl DeFord (WSU)

Tools from discrete optimization have become increasingly important for analyzing graph-based formulations of redistricting, which requires both operationalizing legislative text and exploring complex Pareto frontiers. In this talk I will discuss recent theoretical results and applications of these methods to data derived from the 2020 census, including detecting racial gerrymandering, evaluating nonpartisan justifications, and balancing multiple population constraints to address within-cycle vote dilution. This final topic includes recently published joint work with a WSU Data Analytics major and includes theoretical bounds on worst-case behavior of partitioning methods and empirical optimization.

Detecting Distribution Shifts via Foundation Models

Tony Chiang (PNNL)

Distribution shifts between training and test datasets obscure our ability to understand the generalization capacity of neural network models. This topic is especially relevant given the success of pre-trained foundation models as starting points for transfer learning (TL) across tasks and contexts. We present a case study of TL from a pre-trained GPT-2 model to the Sentiment140 dataset for sentiment classification. We show that Sentiment140's test dataset (M) is not sampled from the same distribution as the training dataset (P), and hence training on (P) and measuring performance on (M) does not actually account for the model's generalization on sentiment classification.

For more information, contact:

TONY CHIANG, Data Scientist
Pacific Northwest National Laboratory | tony.chiang@pnnl.gov

NAIRANJANA (JAN) DASGUPTA, Boeing Distinguished Professor, Department of Mathematics and Statistics
Washington State University | dasgupta@wsu.edu

The symposium is sponsored in part by PNNL's Mathematics for Artificial Reasoning in Science initiative.

Research topics

National Security

Data Science & Computing

Graph and Data Analytics

Computational Mathematics & Statistics
