Los Alamos National Laboratory
Information Science and Technology Institute (ISTI)
Implementing and fostering collaborative research, workforce and program development, and technical exchange

2021 Project Descriptions

Creates next-generation leaders in Machine Learning for Scientific Applications

 

 

Contacts  

  • Program Lead: Nick Lubbers
  • Program Co-Lead: Youzuo Lin
  • Program Co-Lead: Diane Oyen
  • Program Co-Lead: Lissa Moore
  • Administrative Assistant: Brittney Vigil


Research Focus Areas for 2021:

 Focus 1: Interpretability and Explainability in Machine Learning

There is an urgent need in the national security space for interpretable machine learning models, particularly in very high-risk applications. Unfortunately, the majority of widely accepted interpretability techniques focus specifically on image data and supervised classification tasks. Many applications of national importance at Los Alamos and elsewhere use time-series, text, or numeric data, and require unsupervised models for tasks such as knowledge discovery or anomaly detection. Summer 2021 AML projects under this focus area will involve developing and/or evaluating interpretability techniques for models that use text and/or time-series data for national security applications.

 Focus 2: Physics-informed Machine Learning

Physical systems underpin the complexity of the world, and in recent years Machine Learning has proven a useful tool for addressing this complexity. Modeling of physical systems presents both opportunities and challenges. For example, in many areas, data can be collected automatically via traditional simulation techniques -- but the computational expense may be significant, and simulated data may be misaligned with the real world. As another example, modeling of physical systems benefits greatly from the incorporation of hard constraints such as symmetries and conservation laws. Designing models that obey these principles is an ongoing endeavor.

 Focus 3: Uncertainty Quantification in Machine Learning

Scientific applications often require not only an answer, but also error bars. Although Statistical Learning Theory is grounded in probability theory, much work is still needed to develop algorithms that naturally produce useful probabilities describing the true likelihood of predictions. Research in UQ has the potential to impact other areas of ML such as Active Learning and Interpretability, and also has the potential to lessen the amount of education required for ML to be productively applied to new data.
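As a minimal illustration of what "an answer with error bars" can look like, the sketch below uses a bootstrap ensemble, which is one common generic approach and not a method prescribed by this focus area. The toy data (y = 2x + 1 plus noise), the function names, and all parameter values are hypothetical and chosen only for illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bootstrap_predict(xs, ys, x_new, n_models=200, seed=0):
    """Return (mean, standard deviation) of predictions at x_new from an
    ensemble of lines, each fitted to a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:  # one distinct x cannot define a line
            continue
        slope, intercept = fit_line(bx, by)
        predictions.append(slope * x_new + intercept)
    return statistics.fmean(predictions), statistics.stdev(predictions)


def make_data(n=20, noise=0.5, seed=1):
    """Hypothetical toy data: y = 2x + 1 plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [float(i) for i in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


if __name__ == "__main__":
    xs, ys = make_data()
    for x_new in (10.0, 100.0):  # inside vs. far outside the training range
        mean, spread = bootstrap_predict(xs, ys, x_new)
        print(f"x={x_new}: prediction {mean:.2f} +/- {spread:.2f}")
```

The spread grows when the query point lies far from the training data, which is the qualitative behavior one wants from an error bar. Note that a bootstrap spread reflects only the variability due to finite data; it does not account for a wrong model form, which is part of why calibrated uncertainty remains an open research problem.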

 Focus 4: Machine Learning for Earth Sciences

Recent advances in computer science and data analytics have brought machine learning techniques to the forefront of Earth sciences. As a result, old questions are being addressed in new ways as techniques are developed to exploit large-volume, complex datasets. In computer science, researchers have recently developed various novel machine learning algorithms for diverse applications. Simultaneously, there have been many exciting demonstrations of successful machine learning applications in the Earth sciences using deep, active, reinforcement, supervised, and unsupervised learning techniques. The goal of this focus is to develop novel machine learning tools to resolve some of the most challenging geoscience problems.

 Projects for 2021:

 Project: Interpretability for Unsupervised Text Models

Focus Area: Interpretability

Social media data contain insight into individuals' opinions and day-to-day activities. Developing useful ways to mine these data can contribute to our understanding of a number of domains. Here, we focus on public health applications in the context of COVID-19. In particular, we have two large social media datasets (Twitter and Reddit) that we are interested in mining for health misinformation and health behaviors. The goal is to identify misinformation and behaviors, describe them, and analyze how their patterns change over time and geospatially. The description task is particularly challenging and requires novel interpretability techniques, especially to assist human analysts in sifting through large quantities of text. We have focused on unsupervised learning, but there is also ample opportunity to develop classification models to identify different pieces of interest.

 Project: Radio Frequency Feature Identification

Focus Areas: Interpretability, ML for Physics

The ability to identify radio features from the data would be significant to some cognitive radio domains. Specifically, the ability to identify symbol transition boundaries in digital transmissions from the raw data would be valuable. Although this does not assist with demodulating the information, the ability to identify meaningful shifts in transmission characteristics would advance the basic science of understanding non-stationary distributions from a machine learning perspective. The students involved in this project will be given a string of time-series data containing sequential patterns in the form of modulated symbols. The patterns will be random; however, each symbol will be drawn from a fixed generating function. The task will be to apply time-series techniques to identify the boundaries of the symbols, and potentially the characteristics of the symbols, from the raw data. Ground truth and the generating functions will be provided.
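To make the task concrete, here is a toy sketch of boundary detection on synthetic data. It is not the project's data or method: the signal is a hypothetical stand-in in which each symbol is a constant amplitude level plus Gaussian noise, and the detector simply compares the means of adjacent sliding windows. All names and parameter values are assumptions made for illustration.

```python
import random


def make_signal(symbols, samples_per_symbol, levels, noise, rng):
    """Hypothetical generator: each symbol index selects a fixed amplitude
    level, held for samples_per_symbol samples, with Gaussian noise added."""
    signal = []
    for s in symbols:
        signal.extend(levels[s] + rng.gauss(0.0, noise)
                      for _ in range(samples_per_symbol))
    return signal


def true_boundaries(symbols, samples_per_symbol):
    """Sample indices at which the symbol value actually changes."""
    return [i * samples_per_symbol
            for i in range(1, len(symbols))
            if symbols[i] != symbols[i - 1]]


def detect_boundaries(signal, window, threshold):
    """Score each index by |mean(next window) - mean(previous window)| and
    return the highest-scoring index of every run that exceeds threshold."""
    scores = [0.0] * len(signal)
    for i in range(window, len(signal) - window + 1):
        left = sum(signal[i - window:i]) / window
        right = sum(signal[i:i + window]) / window
        scores[i] = abs(right - left)
    boundaries = []
    i = 0
    while i < len(signal):
        if scores[i] > threshold:
            j = i
            while j < len(signal) and scores[j] > threshold:
                j += 1
            # keep a single index per run: the one with the largest score
            boundaries.append(max(range(i, j), key=lambda k: scores[k]))
            i = j
        else:
            i += 1
    return boundaries


if __name__ == "__main__":
    rng = random.Random(0)
    symbols = [0, 1, 1, 0, 2, 0]
    signal = make_signal(symbols, 50, [-1.0, 0.0, 1.0], 0.02, rng)
    print("true:    ", true_boundaries(symbols, 50))
    print("detected:", detect_boundaries(signal, window=10, threshold=0.5))
```

Two repeated symbols produce no detectable transition in this toy setting, so the ground truth lists only the boundaries where the symbol value changes. Real modulated symbols differ in phase or frequency rather than in mean level, so a practical detector would need a different window statistic.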

 Project: Exploring the Intersection Between Interpretability and Tensor Factorization

Focus Areas: Interpretability

Unsupervised Machine Learning (ML) methods aim to extract sets of hidden (latent) features from unlabeled datasets. Unsupervised ML methods include classical neural networks, clustering, various auto-encoders, and the contemporary blind source separation (BSS) techniques based on matrix factorization. Tensor (i.e., multidimensional array) factorization methods are the natural extension of matrix factorization to the decomposition of high-dimensional datasets, and they provide meaningful links among the low-dimensional features hidden in different dimensions of the data tensor. A limitation shared by most factorization techniques is the difficulty of relating the extracted latent factors and subspaces to physically interpretable quantities. Nonnegative factorization overcomes this limitation, because non-negativity leads to a collection of strictly additive features that are parts of the data and hence amenable to simple and meaningful interpretation. We are looking for graduate students interested in applying and developing novel ML algorithms based on nonnegative tensor factorization and tensor networks. We will explore latent features buried in various data, such as pictures, text, and computer simulations of biological molecules, that naturally incorporate explainable hidden variables, features, and topics.
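The additive, parts-based character of nonnegative factorization can be seen in the simpler matrix case. The sketch below factors a small nonnegative matrix V into W times H using the classical multiplicative update rules of Lee and Seung; it is a stand-in for illustration only, since the project itself concerns tensor factorization and tensor networks. The helper names, the example matrix, and the parameter values are hypothetical.

```python
import random


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def frobenius_error(v, w, h):
    """Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2
               for rv, rwh in zip(v, wh)
               for x, y in zip(rv, rwh)) ** 0.5


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Approximate nonnegative V (n x m) as W (n x rank) times H (rank x m).
    Multiplicative updates keep every entry of W and H nonnegative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


if __name__ == "__main__":
    # Each row is a nonnegative mix of two "parts": [1,1,0,0] and [0,0,1,1].
    v = [[1.0, 1.0, 0.0, 0.0],
         [2.0, 2.0, 0.0, 0.0],
         [0.0, 0.0, 3.0, 3.0],
         [1.0, 1.0, 3.0, 3.0]]
    w, h = nmf(v, rank=2)
    print("reconstruction error:", frobenius_error(v, w, h))
    for row in h:
        print("learned part:", [round(x, 2) for x in row])
```

Because every factor entry stays nonnegative, each row of V is rebuilt by adding parts together and never by cancelling them, which is what makes the learned rows of H readable as features.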

 Project: Adsorption equilibria of fluid mixtures in subsurface nanopores

Focus Areas: Physics-informed Machine Learning, Machine Learning for Earth Sciences

The subsurface provides more than 75% of the world’s energy and an ideal location for CO2 and nuclear waste storage. Although ubiquitous in subsurface characterization, adsorption equilibria of multi-component systems are poorly characterized. The majority of current experimental measurements are conducted on single-component systems. Currently, less than 2% of binary mixtures have experimentally derived adsorption models. Alternatively, molecular simulation provides an essential research tool that allows us to probe experimentally challenging areas such as adsorption equilibria of multi-component mixtures. However, many simulations would be needed to build an exhaustive thermodynamic model in regimes where simple adsorption isotherms are inadequate to describe the adsorption equilibria. We propose building machine learning capabilities to develop adsorption models for multi-component systems of interest to subsurface applications using a molecular-simulation-generated database. This combines the strengths of molecular simulation in capturing adsorption equilibria of complex mixtures with the strengths of deep learning for modeling high-dimensional spaces.

 Project: Smart Sampling of chemical and materials simulations

Focus Areas: Physics-informed Machine Learning

Sampling chemical space in physics-driven simulations, such as molecular dynamics, is difficult because such simulations tend to be biased toward equilibrium states. Samplers targeting specific non-equilibrium processes, such as reaction transition states and the environments found in shocked materials, are required to generate unbiased data for training machine learning potentials. We aim to employ machine learning methods to develop a smart sampler that drives sampling into non-equilibrium chemical environments, generating more robust training data sets for machine learning-based atomistic potentials.

 Project: Deep Learning for Scientific Spatiotemporal Data Analytics in Earth Sciences

Focus Areas: Machine Learning for Earth Sciences

Spatiotemporal measurements and observations are ubiquitous in many scientific disciplines. Accordingly, spatiotemporal data analysis has become an emerging research area, driven by the development and application of novel computational techniques that can handle large-scale data. Unlike industry-oriented applications, the analysis of scientific spatiotemporal data requires that both the governing physics and the temporal/spatial correlations be taken into account. Our summer projects will focus on the development of spatiotemporal deep learning and computing techniques. In particular, we will explore Earth science problems to demonstrate the efficacy of the techniques. Our target problems include computational imaging and visualization for subsurface applications [1, 2], earthquake characterization and detection [3, 4], and others. Students with strong deep learning skills and/or domain knowledge backgrounds are encouraged to apply. For any additional questions regarding this project, please contact Dr. Youzuo Lin at ylin@lanl.gov.

[1] Zhongping Zhang and Youzuo Lin, “Data-driven Seismic Waveform Inversion: A Study on the Robustness and Generalization,” IEEE Transactions on Geoscience and Remote Sensing, 58(10):6900-6913, 2020.

[2] Yue Wu and Youzuo Lin, “InversionNet: An Efficient and Accurate Data-driven Full Waveform Inversion,” IEEE Transactions on Computational Imaging, 6(1):419-433, 2019.

[3] Yue Wu, Youzuo Lin, Zheng Zhou, David Chas Bolton, Ji Liu, and Paul Johnson, “DeepDetect: A Cascaded Region-based Densely Connected Network for Seismic Event Detection,” IEEE Transactions on Geoscience and Remote Sensing, 57(1):62-75, 2019.

[4] Zhongping Zhang, Zheng Zhou, Tianlang Chen, and Youzuo Lin, “Adaptive Filtering for Event Recognition from Noisy Signal: An Application to Earthquake Detection,” IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 3327-3331, 2019.

 Project: Uncertainty Quantification for ChemCam

Focus Areas: Uncertainty Quantification

Machine learning (ML) models generalize patterns from datasets and result in emergent behaviors that are poorly understood by their creators and users. ML models are trained and validated on available datasets -- whether from simulations, experiments or observations -- but must be trusted when deployed on real data to answer scientific puzzles. We approach uncertainty quantification (UQ) of ML from a hypothesis-testing viewpoint, in which patterns learned from opportunistically collected data are assessed as falsifiable hypotheses with respect to output predictions, and suggest what information is needed to build interpretable mechanistic models. This vision of exploiting machine learning to support the scientific method requires combining UQ and the systematic study of failure modes (sensitivities and robustness) of data-driven algorithms with the ability to learn and transfer knowledge from one dataset to another related one. While elements of each of these tasks exist today in the context of building predictive models, there is no coherent mathematical framework that brings them together to support the scientific method. This project will advance our understanding of the mathematical framework of machine learning and develop the software tools needed to apply ML to mission-critical science and security problems. As a testbench problem, we consider the geochemical analysis of rocks using ChemCam. The ChemCam instrument on the Mars rover "Curiosity" observes spectra of rock targets, but the mapping from spectral response to element abundance is non-linear, subject to matrix effects, and poorly characterized by existing models. ML may be able to better predict element abundance, but it must also quantify the uncertainty of output predictions, identify regions of input space that are poorly characterized, and evaluate how well the model is identifying known physical processes.