# Project Descriptions

Creates next-generation leaders in Machine Learning for Scientific Applications

## Contacts

**Program Lead**- Nick Lubbers

**Program Co-Lead**- Youzuo Lin

**Program Co-Lead**- Natalie Klein

**Program Co-Lead**- Yen Ting Lin

**Administrative Assistant**- Iris Equnio

**Los Alamos Applied Machine Learning Fellowship 2024 Project Descriptions**

There are 12 projects described in this document, with project numbers randomly assigned.

Please see aml.lanl.gov for the application form, FAQ, and contact information.

__Project 1: New approach to learning of RBMs and FNNs__

**Keywords:** statistical learning, learning algorithms, Restricted Boltzmann Machines (RBMs), graphical models, Markov Random Fields (MRFs), feedforward neural networks

**Abstract:** Restricted Boltzmann Machines (RBMs) are among the simplest and most widely used classes of graphical models with latent variables, and they serve as a building block for deep belief networks. The most popular heuristic for learning RBMs is the contrastive divergence algorithm, but there is no guarantee that it will succeed. Recently, a number of rigorous results have been established for algorithms that learn RBMs in the sense of learning the marginal Markov Random Field (MRF) induced on the visible nodes when the distribution is marginalized over the hidden variables. This project will explore an alternative approach, where the goal is to learn an explicit instance of an RBM (i.e., the parameters of a weighted bipartite graph on visible and hidden variables) that is compatible with the induced MRF on the visible nodes while avoiding the convergence issues of the contrastive divergence algorithm, thus combining the benefits of exactness and computational tractability. The method will be based on the sample-efficient Interaction Screening family of estimators, which are particularly useful in scientific applications where available data are scarce. At the end of the project, we will explore how the developed algorithms translate to general feedforward neural networks (FNNs), based on a recently discovered mapping between RBMs and FNNs.
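To give a feel for the Interaction Screening idea, here is a minimal sketch for the simplest setting: a zero-field Ising model with spins in {-1, +1}. It is not the RBM method this project will develop. The objective form below is the one commonly given for pairwise binary models, and the function names, plain gradient descent, step size, and optional `weights` argument are all assumptions made for illustration.

```python
import itertools
import numpy as np

def interaction_screening_fit(samples, u, weights=None, lr=0.1, n_steps=2000):
    """Estimate the couplings of spin `u` by gradient descent on the
    interaction screening objective (zero-field Ising model assumed):
        S_u(theta) = sum_m w_m * exp(-s_u^(m) * sum_{j != u} theta_j * s_j^(m))
    `weights` defaults to uniform 1/M over the M samples."""
    samples = np.asarray(samples, dtype=float)
    m = samples.shape[0]
    w = np.full(m, 1.0 / m) if weights is None else np.asarray(weights, dtype=float)
    s_u = samples[:, u]                       # spin whose neighborhood we learn
    others = np.delete(samples, u, axis=1)    # all remaining spins
    theta = np.zeros(others.shape[1])
    for _ in range(n_steps):
        local = np.exp(-s_u * (others @ theta))
        grad = -(w * local * s_u) @ others    # gradient of S_u w.r.t. theta
        theta -= lr * grad
    return theta

def exact_ising_distribution(J):
    """Enumerate all states of a small Ising model with
    P(s) proportional to exp(sum_{i<j} J_ij s_i s_j); J symmetric, zero diagonal."""
    p = J.shape[0]
    states = np.array(list(itertools.product([-1, 1], repeat=p)), dtype=float)
    log_weight = 0.5 * np.einsum("mi,ij,mj->m", states, J, states)
    probs = np.exp(log_weight)
    return states, probs / probs.sum()
```

Passing the enumerated states with their exact probabilities as `weights` evaluates the population objective, whose minimizer is the true coupling vector of spin `u`. With real data one would pass sampled configurations and leave `weights` unset.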

__Project 2: Utilizing Large Multimodal Language and Vision Models to Extract Data from Historic Documents for Orphan Wells__

**Keywords:** large language models, fine-tuning, orphan wells

**Abstract:** Orphan wells, which are unclaimed, abandoned oil and gas wells, present significant environmental and health hazards. They emit methane, release harmful pollutants that contaminate air and water, and can even pose a risk of explosion. Addressing these hazards requires locating and plugging (i.e., filling with cement) these wells. However, the United States has hundreds of thousands, possibly millions, of undocumented orphan wells spread throughout the country [1]. Because the wells were abandoned by their owners, no one is accountable for managing their legacy environmental risks. The good news is that there is a wealth of historical records, such as images, maps, and handwritten documents, containing crucial information (locations, depths, and drilling details) that could be leveraged to pinpoint these wells. Large multimodal language and vision models offer a path to turn this vast dataset into actionable information. Our preliminary efforts show that large language models (LLMs) are capable of facilitating information extraction from PDF documents. However, these approaches require converting documents (images or PDFs) into text using optical character recognition (OCR) tools before passing them to the LLM pipeline. In addition, our current approach cannot handle other types of image-based information, such as maps. Recently, some pretrained LLMs (e.g., LLaVA [2] and LayoutLM [3]) can work directly with image-based documents without an explicit OCR workflow. In this AML project, the selected student will apply and fine-tune these models to extract information from orphan well documents. First, the student will test the LayoutLM model with zero-shot learning (requiring no additional data) and few-shot learning (using a small number of data samples to quickly adapt LLMs to new tasks). Second, the focus will shift to fine-tuning the selected LLM using a multi-lab database of orphan wells to improve performance in extracting information from historical records. We have access to terabytes of documents to perform this fine-tuning. This research project aims to lay the groundwork and offer a path forward for future researchers interested in harnessing large multimodal datasets.

[1] IOGCC (2021). Idle and orphan oil and gas wells: State and provincial regulatory strategies 2021. Interstate Oil and Gas Compact Commission (IOGCC) report.

[2] Liu, H., Li, C., Li, Y., & Lee, Y. J. (2023). Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744.

[3] Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., & Zhou, M. (2020, August). LayoutLM: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1192-1200).

__Project 3: Predictive uncertainty quantification for foundation models__

**Keywords:** foundation models, nuclear nonproliferation, natural language processing (NLP)

**Abstract:** Foundation models (FMs) are becoming a major component of generative artificial intelligence (AI). Such models are typically very large deep neural networks that make use of modern architectures like the transformer. These large, pre-trained neural networks are then adapted to a variety of tasks by a method such as fine-tuning, which updates a subset of weights using task-specific data. Bootstrapping is a classical resampling technique that has been used to derive uncertainties for predictions from machine learning models, e.g., random forests. While this technique usually cannot be applied to quantify the uncertainty of an entire FM due to the prohibitive cost of refitting, it can be applied, for instance, when fine-tuning a small subset of weights. Split-conformal inference is also becoming a commonly used tool for quantifying predictive uncertainties. It entails deriving a residual distribution on a hold-out set of training data and, in contrast to bootstrapping, does not require iterative refitting. The overarching goal of the project is to implement and compare these two techniques, bootstrapping and split-conformal inference, for predictive uncertainty quantification in task adaptation using foundation models. Time permitting, we will investigate how to incorporate uncertainty arising from the selection of fine-tuning weights and consider adaptations besides fine-tuning, such as few-shot learning. We will implement this work using deep neural networks such as NukeLM for text and PhaseNet for seismic data.
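The two techniques can be contrasted on a deliberately small stand-in problem. In the sketch below, a least-squares linear head on fixed feature vectors plays the role of "fine-tuning a small subset of weights"; this substitution, the function names, and the default settings are illustrative assumptions, not the project's implementation.

```python
import numpy as np

def fit_head(features, y):
    """Least-squares linear head on frozen features (hypothetical stand-in
    for fine-tuning a small subset of a foundation model's weights)."""
    coef, *_ = np.linalg.lstsq(features, y, rcond=None)
    return coef

def bootstrap_interval(features, y, new_features, n_boot=200, alpha=0.1, seed=0):
    """Percentile interval for the fitted prediction, obtained by refitting
    the head on data resampled with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.empty((n_boot, len(new_features)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # bootstrap resample
        preds[b] = new_features @ fit_head(features[idx], y[idx])
    lower = np.quantile(preds, alpha / 2, axis=0)
    upper = np.quantile(preds, 1 - alpha / 2, axis=0)
    return lower, upper

def split_conformal_interval(coef, cal_features, y_cal, new_features, alpha=0.1):
    """Prediction interval from absolute residuals on a held-out calibration
    set; the head is fitted once and never refitted."""
    residuals = np.sort(np.abs(y_cal - cal_features @ coef))
    n = len(residuals)
    k = int(np.ceil((n + 1) * (1 - alpha)))       # conformal rank
    q = residuals[k - 1] if k <= n else np.inf
    center = new_features @ coef
    return center - q, center + q
```

Note that the two intervals answer different questions in this sketch: the bootstrap interval reflects variability of the fitted head under resampling, while the split-conformal interval targets coverage of new observations.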

__Project 4: Traversing Chemical Space of Ligands with Monte Carlo Tree Search__

**Keywords:** cheminformatics, Monte Carlo Tree Search (MCTS), inverse design

**Abstract:** Predictions of binding between ligands and their targets, such as metals or proteins, form an important class of problems for the design of chemical processes and new pharmaceuticals [1]. However, chemical space is combinatorially large: billions upon billions of “small” molecules have potential applications in industry, the environment, and medicine. Identifying promising new chemistries using traditional discovery processes is harder than finding a needle in a haystack. The 2016 successes of AlphaGo brought to light the powerful combination of reinforcement learning with classic exploration algorithms such as Monte Carlo tree search (MCTS). More recently, these techniques have been pointed at design problems relating to chemistry [2,3]. This combination of MCTS and cheminformatics is extremely new, and the landscape of techniques is wide open.

In this project, the student and mentors will build an inverse-design framework for searching chemical space using MCTS, and use this framework to perform design applications for selective metal separation ligands [4] and protein binding site optimization. State-of-the-art cheminformatics models are powered by graph fingerprinting and/or graph neural network techniques because they are well-suited for rapidly processing molecular data. Ligand design problems for metal separations are inherently chemically complex. Metal separations typically involve a wide diversity in elements involved, simultaneous metal-ligand-solvent interactions, and higher-order effects such as temperature in experiments. Therefore, flexible, interpretable, rapid, and accurate predictive models for design are necessary. MCTS on interpretable chemical fingerprints combines a unique set of ML techniques capable of representing all these design targets, presenting a unique opportunity for exploring the chemical space of ligands. Specifically, the student will focus on developing a framework connecting existing fingerprints and an MCTS algorithm implementation for designing molecules with favorable organic phase solvation energies. The resulting proposed workflow could then be used to “grow” ligands from scratch for specific applications.
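As a rough illustration of "growing" a candidate with MCTS, the sketch below runs the standard select, expand, rollout, and backpropagate loop with a UCT selection rule. The action labels, the `toy_reward` function, and all parameter values are hypothetical placeholders. In the project, actions would correspond to chemical fragments or fingerprint edits, and the reward would come from a property model such as a solvation energy predictor.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state            # tuple of actions chosen so far
        self.parent = parent
        self.children = {}            # action -> Node
        self.visits = 0
        self.total_reward = 0.0

def uct_score(child, parent_visits, c=1.4):
    """Upper confidence bound for trees: mean reward plus exploration bonus."""
    if child.visits == 0:
        return math.inf
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def mcts(root_state, actions, max_len, reward_fn, n_iter=1000, c=1.4, seed=0):
    rng = random.Random(seed)
    root = Node(tuple(root_state))
    for _ in range(n_iter):
        node = root
        # 1) Selection: descend through fully expanded, non-terminal nodes.
        while len(node.state) < max_len and len(node.children) == len(actions):
            parent_visits = node.visits
            node = max(node.children.values(),
                       key=lambda ch: uct_score(ch, parent_visits, c))
        # 2) Expansion: add one untried action if the node is not terminal.
        if len(node.state) < max_len:
            untried = [a for a in actions if a not in node.children]
            action = rng.choice(untried)
            child = Node(node.state + (action,), parent=node)
            node.children[action] = child
            node = child
        # 3) Rollout: complete the sequence with random actions.
        state = node.state
        while len(state) < max_len:
            state = state + (rng.choice(actions),)
        reward = reward_fn(state)
        # 4) Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    # Read out the most-visited path as the proposed design.
    node = root
    while node.children:
        node = max(node.children.values(), key=lambda ch: ch.visits)
    return node.state

def toy_reward(state):
    """Hypothetical score in [0, 1]: fraction of 'N' labels in the sequence."""
    return state.count("N") / len(state) if state else 0.0
```

A call such as `mcts((), ("C", "N", "O"), max_len=4, reward_fn=toy_reward)` returns a tuple of action labels. Reading out the most-visited path rather than the highest mean reward is a common design choice because visit counts are less noisy.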

[1] Howes, L., Why small-molecule drug discovery is having a moment. C&EN, Oct 30, 2023.

[2] Kajita, S.; Kinjo, T.; Nishi, T., Autonomous molecular design by Monte-Carlo tree search and rapid evaluations using molecular dynamics simulations. Communications Physics 2020, 3 (1), 77. doi:10.1038/s42005-020-0338-y

[3] Jumper, J.; et al., Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596 (7873), 583-589. doi:10.1038/s41586-021-03819-2

[4] Taylor, M. G.; Burrill, D. J.; Janssen, J.; Batista, E. R.; Perez, D.; Yang, P., Architector for high-throughput cross-periodic table 3D complex building. Nature Communications 2023, 14 (1), 2786. doi:10.1038/s41467-023-38169-2

__Project 5: Energy-Based Classifiers__

**Keywords:** Hopfield model, equilibrium propagation, top-down, recurrent

**Abstract:** We propose to merge physics and machine learning by constructing deep learning models in which physics itself is employed to compute the objective function. Our approach to physics-based computation for ML applications uses a Hopfield model with a stacked architecture, analogous to a deep neural network with symmetrical bottom-up and top-down connections (https://www.frontiersin.org/articles/10.3389/fncom.2017.00024/full). Top-down inference allows features in higher layers to influence the activation of features in lower layers, implementing a form of top-down attention. Such models employ a novel learning framework termed "equilibrium propagation" (EQEP) for sculpting attractors that are optimized for specific tasks. We have recently demonstrated that classifiers constructed from physics-based deep Hopfield models are more robust than equivalent standard classifiers with identical feed-forward connectivity (https://openreview.net/forum?id=WWTOAKAczk). Students will extend energy-based classifiers to include lateral competition, including local connections that encourage sparse representations and long-range lateral connections analogous to attention models. Energy-based models employ only local Hebbian plasticity rules and are thus fully compatible with neuromorphic hardware, opening a path to real-time ML at the edge.

__Project 6: Differentiable programming for interpretable wildfire hydrology__

**Keywords:** wildland fire, hydrology, differentiable programming, convolutional neural networks, remote sensing

**Abstract:** Wildland fire prevalence is at a historic high with no signs of abating. Among other negative outcomes, increased fire activity threatens global water security through its impacts on river discharge. The consequences of severe burns on river systems can vary widely, from catastrophic flooding in some watersheds, such as that which followed the recent Calf Canyon/Hermit’s Peak Fire in New Mexico, to exacerbated water scarcity in others. In recent years, an explosion in the availability of accurate, satellite-based records of global fire activity has opened the door to data-driven approaches for understanding these hydrologic hazards. In this project, we aim to leverage these new datasets to develop interpretable forecasting tools through a first-ever application of differentiable programming to wildfire hydrology. Specifically, we will employ a lumped-basin rainfall-runoff model that embeds a convolutional neural network to represent the impact of vegetation and soil properties on runoff generation. The input to the neural network will be one or more time series of satellite-derived images of a basin, reflecting the temporally dynamic impact of wildland fire on the ground surface. These data, alongside meteorological forcing and stream discharge records, will be curated prior to student on-boarding, allowing the student to focus immediately on model development and testing. This project will represent a pioneering effort to apply differentiable programming to the nexus of fire science and water security. The end result of this research will be a data-informed tool that improves capabilities to predict the consequences of fire on river flow while elucidating some of the physical mechanisms by which burns alter catchment hydrology.

__Project 7: Tangent kernels to understand DNNs and to quantify uncertainty of ML predictions__

**Keywords:** uncertainty quantification

**Abstract:** Deep Neural Networks (DNNs) have been shown to yield estimates with good properties when relating (predicting) a response variable to a vector of features, even when the dimension of the feature vector and the number of parameters are large. Statisticians, on the other hand, have developed a theory that characterizes how well one can estimate a function in terms of the complexity of the class of functions, the dimensionality of the input vector, and the number of observations. A naïve interpretation of that theory suggests that DNNs should not be as good as they are. This project seeks to reconcile theory and practice by studying the properties of the gradient of the model. Initial calculations have shown that these gradients provide insights into the geometry of DNNs. Complex Taylor series provide a novel and exciting approach to expanding on our initial results, which we hope will lead to new insights. The student we propose to mentor will work with us on expanding these ideas.

Students in this research project will:

- Learn about the basic structure of deep neural networks.
- Learn about automatic differentiation.
- Explore numerically the geometry of the response surface implied by DNNs.
- Decompose the linearization of the DNNs by level, and explore how ideas from two-stage regression can provide insights into the relative importance of each level in the DNN.
- Explore representations based on tangent kernels (ML kernels based on tangents).
- Make connections between tangents, complex Taylor series expansions, and uncertainty quantification.

The outcome will be a manuscript suitable for publication in a CS workshop/conference. The student will also have the opportunity to present a talk and prepare a poster.
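For readers new to tangent kernels, the following sketch computes an empirical tangent kernel, K(x, x') = ∇θ f(x) · ∇θ f(x'), for a one-hidden-layer tanh network. The architecture, the initialization, and the hand-written gradients are assumptions chosen to keep the example dependency-free; in practice an automatic differentiation library would supply the gradients.

```python
import numpy as np

def init_params(d_in, width, seed=0):
    """Random parameters for f(x) = v . tanh(W x + b) (illustrative scaling)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(width, d_in)) / np.sqrt(d_in)
    b = np.zeros(width)
    v = rng.normal(size=width) / np.sqrt(width)
    return W, b, v

def forward(params, X):
    W, b, v = params
    return np.tanh(X @ W.T + b) @ v              # shape (n,)

def param_gradients(params, X):
    """Gradient of f(x) with respect to all parameters, one row per input.
    Column order: flattened W, then b, then v."""
    W, b, v = params
    h = np.tanh(X @ W.T + b)                     # hidden activations, (n, width)
    dpre = (1.0 - h**2) * v                      # df / d(pre-activation)
    grad_W = dpre[:, :, None] * X[:, None, :]    # (n, width, d_in)
    grad_b = dpre
    grad_v = h
    return np.concatenate([grad_W.reshape(len(X), -1), grad_b, grad_v], axis=1)

def tangent_kernel(params, X1, X2):
    """Empirical tangent kernel: inner products of parameter gradients."""
    return param_gradients(params, X1) @ param_gradients(params, X2).T
```

The same Jacobian gives the linearization f(x; θ + δ) ≈ f(x; θ) + ∇θ f(x) · δ, which is the first-order Taylor expansion in parameter space that the tangent-kernel view builds on.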

__Project 8: Calibration of radiation flow experiments using physics informed neural networks__

**Keywords:** physics informed neural networks, statistical model emulation and calibration, sparse data, joining ML and simulations

**Abstract:** This project will build statistical and machine learning tools to calibrate and validate physics-based simulations against experimental results of radiation flow in high-energy-density physics experiments. The goal is to provide a toolkit for calibrating a dataset of simulations to experimental data that are sparse in time and space (e.g., radiographic imagery and X-ray spectroscopy), with the aim of using the calibrated model to guide the design of future experiments and better inform understanding of radiation flow processes, including inertial confinement fusion and supernova explosions. The pre-generated simulation data are dynamic spatio-temporal processes generated with multi-physics codes that take a set of input parameters and produce synthetic measurements that can be calibrated to the observational data. Success in this project will be defined by the following goals, listed in order of increasing complexity and difficulty:

1) Compare the performance of Gaussian processes and convolutional neural networks for calibrating the computer model simulations to the sparse experimental data.

2) Explore the use of recurrent neural networks to emulate and interpolate between simulation parameters and extend these networks to physics-informed networks that follow a set of governing equations.

__Project 9: Accelerating Partial Differential Equation Solver for Radiation Belt Modeling by Physics-Informed Machine Learning__

**Keywords:** space physics, radiation belt, diffusion equation, PINN, PDE solver

**Abstract:** The radiation belt is a torus-shaped region in the Earth’s magnetosphere where high-energy charged particles are trapped by the planet's magnetic field, posing potential hazards to space-borne assets and human activities. It is crucial to gain a comprehensive understanding of the belt's dynamics and to develop models that can predict its behavior. At the core of many radiation belt models lies the challenging task of solving diffusion equations with highly inhomogeneous and time-dependent coefficients. These equations describe the evolution of the phase space density (PSD) function of charged particles, taking into account the interactions between the particles and various plasma waves. Conventional approaches solve these equations with either implicit or semi-implicit finite difference methods. However, the primary challenge lies in devising a robust numerical scheme that not only maintains numerical accuracy and stability but also ensures the positivity of the PSD function in an environment where both the PSD and the diffusion coefficients vary by orders of magnitude across the simulation domain.

Our proposed project aims to leverage recent advances in implicit neural representations for scalar fields and physics-informed neural networks (PINNs) to speed up the partial differential equation (PDE) solver in radiation belt models. Our approach employs a neural network to represent the PSD function and uses automatic differentiation (AD) to compute partial derivatives of the implicit representation with respect to time and space. This transforms the problem of solving PDEs into the task of minimizing, with respect to the neural network's parameters, the difference between the left and right sides of the diffusion equation. Historically, the computational cost of automatic differentiation and of training deep neural networks has hindered the effectiveness of such an approach. However, our preliminary study, which employed JAX for automatic differentiation, Flax for neural networks, and Optax for optimization, demonstrated that by using Fourier features as positional encoding and 2 layers of 256 SIREN nodes, we can accurately represent solutions to the Poisson equation. The training time is comparable to that of a finite difference solver on a 1024 by 1024 domain using the FleCSI library (on the order of seconds to minutes). Furthermore, through careful crafting of differential operators, efficient vectorization, and better training schemes, we expect to achieve performance on par with, or even exceeding, traditional finite difference solvers.

A summer student will be recruited to extend our current implementation to include spatially dependent diffusion coefficients, investigate how nonlinearity impacts the accuracy of our neural solver, and explore different forms of diffusion coefficients (derived from theory and observations). The student will benchmark the new solver against LANL’s radiation belt code DREAM3D and demonstrate that this new approach can be computationally more efficient than conventional methods like Crank-Nicolson.

This is a collaborative project leveraging expertise in Space Science, Data Science, and Applied Machine Learning. The project will be led by Xiangrong Fu, who has been a mentor and lecturer in the Space Weather Summer School since 2014. As an expert in numerical modeling of magnetospheric processes, he is now leading an effort to modernize DREAM3D for modeling natural and artificial radiation belts. He will oversee the whole project and mentor the student. He will be assisted by Li-Ta Lo, who is a co-lead of the Data Science summer school. Li-Ta is an expert in accelerated parallel programming and will be in charge of performance evaluation and improvement. In addition, Yen Ting Lin, who specializes in data-driven learning for dynamical systems and physics-informed machine learning, will serve as a co-mentor and oversee the development of the proposed PINN, from architecture to training. The selected student will have the opportunity to work alongside mentors from each of these domains, making it a rich and interdisciplinary learning experience.
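For orientation, the conventional baseline named in this abstract can be sketched in a few lines. The example below is a Crank-Nicolson scheme for a 1D diffusion equation ∂f/∂t = ∂/∂x (D(x) ∂f/∂x) with a spatially dependent coefficient and fixed boundary values. It is not DREAM3D's scheme: the 1D setting, the uniform grid, the dense linear solve, and all names are simplifying assumptions for illustration.

```python
import numpy as np

def crank_nicolson_diffusion(f0, D_face, dx, dt, n_steps):
    """Advance df/dt = d/dx( D(x) df/dx ) with Crank-Nicolson.

    f0     : values on a uniform grid, including the two boundary points,
             which are held fixed (Dirichlet boundaries, an assumption).
    D_face : diffusion coefficient at the midpoints between neighboring
             grid points, length len(f0) - 1.
    """
    f = np.array(f0, dtype=float)
    n = f.size - 2                                   # number of interior points
    main = -(D_face[:-1] + D_face[1:]) / dx**2       # diagonal of the operator
    off = D_face[1:-1] / dx**2                       # off-diagonals
    L = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    A = np.eye(n) - 0.5 * dt * L                     # implicit half step
    B = np.eye(n) + 0.5 * dt * L                     # explicit half step
    for _ in range(n_steps):
        f[1:-1] = np.linalg.solve(A, B @ f[1:-1])
    return f
```

With constant `D_face` and a sine initial profile, the result can be checked against the analytic decaying solution. A production solver would use a banded or sparse solve instead of a dense one. Crank-Nicolson does not guarantee positivity for large time steps, which is one reason the positivity requirement described above is challenging.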

__Project 10: Extracting multiscale microstructural fingerprint by applying graph-based deep learning__

**Keywords:** materials science, graph attention network, graph-based learning, multilevel graph representation

**Abstract:** The microstructure of materials plays a pivotal role in determining the in-service performance of structural components. Traditional methods of microstructure analysis are both cost-intensive and reliant on specialized expertise. While machine learning has been successfully employed to predict material properties, existing models often fall short in comprehensively addressing the complex interplay across various microstructural scales. To overcome this challenge, we at MST-8 have developed METIS3D, a hierarchical graph-based microstructural analysis tool. The unique aspect of METIS3D lies in its ability to represent 3D microstructural information of materials as multilevel graphs. This is particularly important for understanding the detailed interactions among the grains and their constituents within materials. Previously, we have collected and processed 3D experimental data with METIS3D to obtain graph-based representations of the 3D microstructures.

In this 10-week internship, the student will apply advanced deep learning methods, such as Graph Attention Networks (GATs), to multilevel graph-based representations of microstructures. GATs will be used to process and interpret the complex, non-Euclidean data inherent in these graphs. The student will train the GAT to identify and quantify critical relationships and patterns (i.e., signatures) within the microstructure, such as the interactions between different grains or phases. These extracted signatures are critical for identifying and understanding the physics-based mechanisms that govern material deformation processes, including phenomena like twin evolution, twin transmission, and plastic dissipation. By the end of the 10-week internship, the student will have gained a deeper understanding of microstructural relationships, deformation mechanisms, and graph-based learning for materials science applications.

In a broader context, the hierarchical signatures extracted during the internship have a crucial role beyond immediate analysis: they will guide the generation of statistically representative virtual microstructures for mechanics-based simulations. The validity and reliability of these simulations will be further enhanced by cross-validating their results with experimental data, particularly focusing on mechanical properties. Through continuous refinement based on comparative analysis, this process enables precise predictions of material behavior. Importantly, this approach holds the potential for scalability to other polycrystalline materials, illustrating a significant impact on materials science and mechanics-based simulations.

__Project 11: Functional variational inference for Bayesian neural network emulators__

**Keywords:** Bayesian inference, surrogate models, emulator models, model selection

**Abstract:** In an ideal world, state-of-the-art machine learning techniques, such as deep neural networks, would provide accurate measures of uncertainty. This is especially important in Los Alamos applications, such as emulation or surrogate modeling of physics simulations, where the ability to reflect uncertainty when predicting away from observed input settings is vital. Bayesian Neural Networks (BNNs) offer a solution by providing built-in uncertainty quantification, essential for high-stakes decision-making and building trust in the algorithm; however, accurate approximate inference and model selection remain unsolved problems. This is partly because the integrity of probabilistic predictions from BNNs depends on the selected prior distribution over the parameters. Recent works suggest that widely used default prior choices can lead to poor quantification of uncertainty, but guidance for selecting priors and the potential impact of different prior choices are severely understudied. Because the impact of weight-space prior choices for BNNs is unclear, in this project we seek to use function-space variational inference (e.g., [1]) to directly impose priors on the function space represented by a neural network. We represent the approximate functional posterior via a neural network that can flexibly encapsulate properties of many distributions [3]. To evaluate our method, we will apply these functional BNNs to a challenging surrogate estimation problem that has previously been addressed with Gaussian process models: predicting storm surge based on SLOSH simulations [2]. Such scientific ML problems are appealing because a ground-truth function is known, which allows rigorous empirical analysis of model calibration.

[1] Rudner, T. G., Chen, Z., Teh, Y. W., & Gal, Y. (2022). Tractable function-space variational inference in Bayesian neural networks. Advances in Neural Information Processing Systems, 35, 22686-22698.

[2] Hutchings, G., Sansó, B., Gattiker, J., Francom, D., & Pasqualini, D. (2023). Comparing emulation methods for a high‐resolution storm surge model. Environmetrics, 34(3), e2796.

[3] Lu, Y., and Lu, J. (2020). A universal approximation theorem of deep neural networks for expressing probability distributions. Advances in Neural Information Processing Systems, 33, 3094-3105.

__Project 12: Heterogeneous Transfer Learning for Quantum Chemistry__

**Keywords:** machine learning potentials, transfer learning, data fusion, quantum chemistry

**Abstract:** Large-scale atomistic simulation using first-principles methods remains an open problem due to the high cost of quantum chemistry (QC) methods. Machine Learning Potentials (MLPs) offer a path towards achieving the accuracy of quantum chemistry methods at drastically reduced cost. Typically, MLPs are trained on datasets evaluated using a specific quantum chemistry method, and they map features (atomic coordinates and species) to labels (chemical properties such as energy, forces, dipoles, etc.). Generating the training data using high-fidelity quantum chemistry methods is computationally expensive. By training an MLP on a low-fidelity source dataset and re-training on a smaller high-fidelity target dataset, higher-accuracy machine learning models can be produced. This technique is the primary way that transfer learning is applied to MLPs. Transfer learning can be broadly categorized as homogeneous, where the features and labels are the same for source and target, or heterogeneous, where the features and/or labels differ. The aim of this project is to investigate the advantages of heterogeneous transfer learning over "classic" homogeneous transfer learning, applied to different chemical properties and to the fusion of experimental and simulated datasets.

The initial step of this project focuses on how the theoretical similarities between different QC methods affect the sample efficiency of homogeneous transfer learning. The student will learn to train MLPs and then apply different transfer learning strategies to establish empirical trends showing which QC methods and training procedures are most effective for transfer learning. Standard datasets such as QM9 have been evaluated using many different QC methods and can be readily used. The results would be publishable and would benefit the QC and MLP communities by providing a cost-efficient method for using existing datasets when constructing new MLPs.

The main focus of the project is to apply heterogeneous transfer learning methods to MLPs across different labels, such as energy, molecular dipoles, band gaps, etc. Quantum mechanics explains how these chemical properties are related, which serves as the theoretical motivation for heterogeneous transfer learning. This step of the project will leverage the code and results from ongoing projects at LANL on multi-task learning applied to chemical systems. Additionally, the student will explore heterogeneous transfer learning as a way to use reference data of experimental origin. Homogeneous transfer learning is difficult in this setting because experimental measurements frequently provide information on quantities averaged over some characteristic scale, in contrast to atomistic simulated data. Sparsity in experimental data further complicates the problem. Therefore, heterogeneous transfer learning to experiment is a promising method for correcting systematic biases that may be present in MLPs trained on QC simulations only. Ultimately, this project will be one of the earliest explorations of heterogeneous transfer learning for quantum chemistry and will systematically evaluate whether known similarities between properties and methods lead to more effective transfer learning.
