Los Alamos National Laboratory

Community Programs Office
Connections Newsletter
The Lab's Connection to the Northern New Mexico Community!

March 2012

Turning Supercomputer “Failures” into Successes

Each year, the Lab needs to replace thousands of computer nodes (in this case, computer processors) that make up part of its high-performance computing (HPC) system. These are the same systems used to run complex simulations and models involving millions of billions of calculations per second. The replacements are necessary because newer machines are more efficient and therefore less expensive to operate, which ultimately helps keep the Lab’s overall HPC operating budget under control. Several years ago, it occurred to Gary Grider, the deputy division leader for the Lab’s HPC operations, that while the large number of older nodes might no longer make sense for the Lab’s own use, they could still be of use to the national computing community.

While it’s not within the Lab’s charter or budget to repurpose this equipment, it became evident that with some outside help, unique and vital resources could be made available to the computing community.

The partners that helped create this new computer resource included the New Mexico Consortium and the National Science Foundation, along with additional experts from Carnegie Mellon University and the University of Utah.

Why all this effort to reuse some “old computers”?

Supercomputers are a different animal than your ordinary desktop computer. Imagine you had two thousand of them, joined together, each concurrently working on a subset of a very complicated equation. Once each computer, or node, completes its bit of work, the results are fed into the larger equation so that subsets feed into subsets and up and up until the final answers result. It takes very specialized skills to figure out how to break up complex tasks into thousands of these subsets, and there aren’t many opportunities to work on such a large number of nodes at once. Students might be able to work on a few nodes at a time, and even professional computer scientists might only be able to work with a few hundred nodes at a time.
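The idea of splitting a job into subsets and feeding the partial results upward can be illustrated with a small sketch. It is only a toy under stated assumptions: threads on one machine stand in for nodes, the "equation" is a simple sum of squares, and the function names (split, tree_combine, parallel_sum_of_squares) are hypothetical, not part of any Lab or PRObE software.

```python
from concurrent.futures import ThreadPoolExecutor


def split(values, parts):
    """Divide values into `parts` nearly equal contiguous chunks."""
    size, extra = divmod(len(values), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def tree_combine(partials):
    """Merge partial results pairwise, level by level, until one remains."""
    while len(partials) > 1:
        merged = [partials[i] + partials[i + 1]
                  for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:
            # An odd one out is carried up to the next level unchanged.
            merged.append(partials[-1])
        partials = merged
    return partials[0]


def parallel_sum_of_squares(values, workers=8):
    """Each worker (a stand-in for a node) handles one chunk of the input."""
    chunks = split(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: sum(x * x for x in c), chunks))
    return tree_combine(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))
```

On a real machine the hard part is what this sketch leaves out: moving data between thousands of nodes and keeping them all busy at once.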

Scaling from a few hundred nodes to thousands that all have to interact is no easy matter. With ten computers talking to each other, you might have a node failure. With many thousands of nodes in the mix, you will have many failures. Questions of what will happen to the system, and to its specially designed software, would ideally be worked out in advance of installation, but real-world time constraints mean this can only be done with a smaller number of test nodes.
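A rough calculation shows why failures go from possible to practically certain as the node count grows. The sketch below assumes that nodes fail independently and that each has the same chance of failing on a given day; the 0.1 percent figure is purely illustrative and is not a measured rate for any Lab system.

```python
def prob_any_failure(nodes, p_node):
    """Chance that at least one of `nodes` independent nodes fails."""
    return 1.0 - (1.0 - p_node) ** nodes


def expected_failures(nodes, p_node):
    """Average number of failed nodes over the same period."""
    return nodes * p_node


if __name__ == "__main__":
    p = 0.001  # assumed per-node daily failure probability (illustrative)
    for n in (10, 100, 1000, 10000):
        print(n, round(prob_any_failure(n, p), 4), expected_failures(n, p))
```

Under these assumptions, the chance of seeing at least one failure rises steeply with the number of nodes, while the expected number of failures grows in direct proportion to it.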

PRObE (short for Parallel Reconfigurable Observational Environment) will provide a dedicated test environment with thousands of nodes, rather than tens or hundreds. The opportunity to manipulate these systems in isolation will be a boon to computer science. It enables students and researchers to gain a deeper understanding of how the parts build into the whole and gives unprecedented access to large-scale computers for systems research. During operation, users can have complete control over the hardware and can even inject failures into the system to see how the systems and software respond.
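What "injecting failures" means can be shown with a toy simulation. This is a sketch under heavy assumptions, not PRObE's actual tooling: nodes are just numbers, a failure is simulated by listing a node as failed, and the hypothetical function run_with_injected_failures simply reruns affected tasks on a surviving node.

```python
def run_with_injected_failures(tasks, node_count, failed_nodes):
    """Assign tasks round-robin; rerun any task whose node was made to fail.

    Returns (results, retried), where results maps each task index to the
    node that finally ran it and its value, and retried lists the indexes
    of tasks that had to be moved.
    """
    failed = set(failed_nodes)
    healthy = [n for n in range(node_count) if n not in failed]
    if not healthy:
        raise RuntimeError("every node failed; nothing left to run on")
    results, retried = {}, []
    for i, task in enumerate(tasks):
        node = i % node_count
        if node in failed:
            # Simulated failure: move the task to a surviving node.
            retried.append(i)
            node = healthy[len(retried) % len(healthy)]
        results[i] = (node, task())
    return results, retried


if __name__ == "__main__":
    jobs = [lambda k=k: k * k for k in range(6)]
    results, retried = run_with_injected_failures(jobs, 3, failed_nodes=[1])
    print(results, retried)
```

A real testbed exercises much harder cases, such as nodes that fail partway through a computation or networks that drop messages, which is why hands-on access to the hardware matters.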

The program will operate under the auspices of the New Mexico Consortium, itself a nonprofit organization made up of the University of New Mexico, the New Mexico Institute of Mining and Technology, and New Mexico State University. The Consortium will provide coordination of the various entities involved, schedule access, and keep everything up and running from its location at the Los Alamos Research Park. Carnegie Mellon University brings considerable computing experience to the table, and the Flux Research Lab at the University of Utah is providing software that will help the researchers manage the PRObE testbed.

Of course, the concept would never have gotten off the ground without funding. Enter the National Science Foundation (NSF), an independent federal agency with an annual budget of $6.9 billion charged with promoting the advancement of science and basic research. PRObE’s goals aligned well with the recognized need to support advances in supercomputing architectures, which is a focus of NSF’s Advanced Computing Infrastructure Strategic Plan. NSF anticipates that work done through the $10 million, five-year PRObE grant will ultimately serve not only supercomputing within the government and academic communities but also the national economy.

PRObE has already provided hands-on experience to students. Local high school and undergraduate students spent last summer helping to physically assemble and individually test the 2,500 computer nodes, and others spent their winter break advancing the set-up work even further. The first large PRObE computer system is still being assembled with thousands of cables and more than 50 computer racks to form a 1,024-node cluster. A smaller version, made up of 128 nodes, is finished and ready for final testing.

PRObE also has an upcoming educational component in the form of a Computer System, Cluster, and Networking Summer Institute that starts in June. Participating students will work in small teams over nine weeks and gain experience setting up, configuring, administering, testing, monitoring, and scheduling computer systems, supercomputing clusters, and computer networks. The summer school is a joint effort between the Consortium and the Lab. Students will get onsite, hands-on experience and have a chance to directly interact with Lab staff members.

With leveraged resources from the Lab, NSF, and others, along with the expertise of researchers across the country, PRObE promises to advance the science and education of supercomputing.   

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA. © Copyright 2015 LANS, LLC. All rights reserved.