June 3, 2025

Los Alamos contributes to unprecedented dataset to train AI models

Meta leads molecular simulations dataset effort using Lab software and tools

In this representation made with the Architector software and an example from the Open Molecules 2025 dataset, lanthanum, a rare earth metal, is surrounded by diverse bonding molecules. Lanthanum alloys are used in batteries and hydrogen gas applications.

A collaborative effort between Meta, Lawrence Berkeley National Laboratory and Los Alamos National Laboratory draws on Los Alamos’ expertise in building tools for molecular screening. The release of Open Molecules 2025, an unprecedented dataset of molecular simulations, can accelerate opportunities for machine learning to transform research in fields such as biology, materials science and energy technologies.

“A prohibitive part of molecular design has been the extreme computational cost needed to achieve quantum chemistry-level accuracy,” said Michael G. Taylor, researcher at Los Alamos and project member. “In order to train machine learning models capable of quantum chemistry-level accuracy, we need vast amounts of diverse, valid training data. Open Molecules 2025 bridges this gap with a dataset of over 100 million density-functional theory calculations that we can use to train machine learning models accurate enough for all kinds of chemical challenges.”

The dataset is key to unlocking the use of machine learning potentials for chemical applications, such as designing a new drug to fight disease or a battery cell to store energy. Because the dataset is built from density functional theory calculations, it enables a precise, atomic-level understanding of molecular behavior and interactions. Unique software designed by Taylor played a critical role in helping Open Molecules 2025 reach its goals.

Novel software helps build the dataset

To help run the calculations and build the dataset, the collaboration leveraged the capabilities of the Architector software, designed by Taylor. Architector is state-of-the-art software for predicting 3D structures of metal complexes. Metal complexes are chemicals in which a central metal atom is bound to an array of other molecules or atoms, and they represent important chemistry relevant to applications from biology to materials science.

Architector, as employed by Taylor and collaborators in the Lab’s Theoretical division, has mainly been applied to “f-block” elements: lanthanides like cerium and ytterbium, and actinides such as thorium and uranium. The f-block includes many of the elements often referred to as rare earth elements, which are valuable for an array of industrial purposes, including high-tech applications in telecommunications, imaging, data storage and more.

Metal complexes represent an important class of chemistry explored with the Open Molecules 2025 dataset. Other classes include biomolecules such as proteins and RNA, small molecules that might be the basis of drug discovery, and electrolyte metals surrounded by different solvents. Taylor estimates that the chemistry explored by Architector represents up to a third of the entire dataset.

An investment in foundational chemistry knowledge

Meta devoted its vast computing power to running the density functional theory calculations. Counting only the rare earth molecular simulations, the Open Molecules 2025 project produced data on approximately 20,000 structures for each of the 17 rare earth elements. The next-largest dataset available in the literature has approximately 1,000 structures in total for each rare earth element.

The immense volume of data generated can now be used to train other machine learning models in a fraction of the time and at a fraction of the cost. The dataset could lead to pre-trained foundation models that can be fine-tuned with minimal added data in areas of interest. The entire Open Molecules 2025 effort, including initial machine learning models trained on the data, will be open to the public, giving researchers the ability to use data and models relevant to their research.

“Chemical design often boils down to predicting the properties of new chemistries with minimal information and computational expense,” said Taylor. “Having this dataset, with the ability to train machine learning models to do that predictive work, is potentially transformative for scientific discovery.”

In addition to Meta and Lawrence Berkeley National Laboratory, collaborators on the project include representatives from Carnegie Mellon University, Genentech, the University of California, Berkeley, New York University, Princeton University, Stanford University and the University of Cambridge.

LA-UR-25-24633