Los Alamos National Laboratory Advanced Simulation and Computing (ASC) Program
Ensuring the safety and reliability of the nation's nuclear weapons stockpile

LANL Capabilities Thrive from Improved Proficiency Planning and Operations

New HPC Project Management Office bolsters operations

Contact  

  • FOUS Program Manager
  • Jason Hick

ECCCE piping

Due to integration and collaboration with HPC facility operations, the Exascale Class Computer Cooling Equipment project was completed 10 months ahead of schedule and approximately $20 million under budget. The new piping was integrated with the existing infrastructure to avoid conflicts.

High Performance Computing (HPC) has improved its operational excellence on recent projects by applying expert knowledge of HPC facilities, better planning, improved communication, and a task-oriented focus, all driven by the new Project Management Office (PMO).

Simultaneous excellence in nuclear security, Science, Technology & Engineering, operations, and community relations: these are the hallmarks of Los Alamos National Laboratory. How we do our work is as important as what we do.

The new HPC PMO is using innovative strategies to produce a ten-year plan and to communicate that plan to other organizations. The PMO's coordinated involvement and structure allow for proactive budget discussions, integration of HPC facility operations experts, and rapid response and feedback on projects, all of which have led to greater overall success.

The coordination has helped identify opportunities to leverage LANL institutional resources. In 2019, LANL institutional utilities initiated a large maintenance project, expected to be completed in 2021, to sustain maximum power to the Strategic Computing Complex (SCC). The project supports the deployment of advanced HPC systems and ensures a solid foundation of energy for future power increases.

Who is the PMO

During the last 18 months, key individuals with unique skill sets collaborated on HPC projects in ways never before seen, forming the new HPC PMO team. The PMO achieved its primary goal: to bring together only the resources necessary without duplicating staff or efforts while continuing to adjust resources as needs are identified. This ability to tailor the workforce, the needed skills, and the structure of the project to accommodate rapid change is a notable benefit in the HPC environment.

Within the PMO, individuals can serve various functions, so a small number of people can fill multiple roles. This concept has also allowed the PMO to execute multiple projects simultaneously and more efficiently than was possible in the past. Members bring novel approaches to problem solving from their unique perspectives, and all are working toward a common goal.

The HPC PMO identifies staff who can fill multiple roles, such as

  • project management,
  • facilities management,
  • construction management,
  • field, design, and/or project engineering,
  • technical subcontract administration, and
  • project controls.

Shared contingency is a benefit

Just as the resources (people and skills) are tailored and shared within the HPC PMO, so are the risks, or contingencies, which leads to a reduction in overall costs. As the HPC PMO works on various projects, it simultaneously plans for the contingencies of each of those projects. By combining contingencies, the HPC PMO can plan for the worst-case scenario and mitigate risks across the various projects more efficiently.
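One way to see why a shared contingency can cost less than separate ones is a simple risk-pooling calculation. The sketch below is illustrative only: the article does not describe how the HPC PMO sizes its contingency, and the function names, the dollar figures, the sizing factor `k`, and the assumption that project overruns are independent are all hypothetical.

```python
import math


def separate_contingency(sigmas, k=2.0):
    """Total reserve when each project holds its own contingency,
    each sized at k standard deviations of that project's overrun."""
    return sum(k * s for s in sigmas)


def pooled_contingency(sigmas, k=2.0):
    """One shared reserve sized at k standard deviations of the combined overrun.

    Assumes (hypothetically) that project overruns are independent,
    so their variances add.
    """
    return k * math.sqrt(sum(s * s for s in sigmas))


# Hypothetical overrun standard deviations, in $M, for three projects.
sigmas = [3.0, 4.0, 12.0]

separate = separate_contingency(sigmas)
pooled = pooled_contingency(sigmas)

print(f"separate reserves: {separate:.1f} $M")  # 38.0 $M
print(f"pooled reserve:    {pooled:.1f} $M")    # 26.0 $M
```

Under these assumptions the pooled reserve is never larger than the sum of the separate reserves, because it is unlikely that every project overruns at once. If overruns were strongly correlated, the saving would shrink.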

This shared contingency paired with close integration of HPC facility operations is key to reducing the need for contingency funds within projects. These funds then become available for reinvestment in efficiency improvements within LANL HPC facilities, which in turn further reduces risk. Some examples of reinvested savings are below.

Integration with line item projects

The Exascale Class Computer Cooling Equipment (ECCCE) project enables at least two advanced technology systems to be warm-water cooled at the same time. The project added new evaporative cooling towers, tower water pumps, heat exchangers, tanks, piping, and accessories. It was completed 10 months ahead of schedule and approximately $20 million under budget because of increased integration and collaboration with HPC facility operations, which contributed knowledge and expertise to the project.

LANL Capital Projects and HPC Facilities partnered to provide knowledge of existing systems and reviewed daily work to reduce unknowns. Plan-of-the-day meetings provided details of the HPC facility to the subcontractor, who was then able to select the most effective means and methods.

The time and funding saved on ECCCE will support strategic computing upgrades for the Advanced Simulation and Computing (ASC) Program, such as the minor construction upgrades for 60 MW of computing at the SCC. Sixty megawatts of power matches the cooling available for supporting at least two simultaneous exascale-class systems in the SCC.

BIM

A snapshot of the current BIM model of the SCC showing some of the installation design for the new supercomputer, Crossroads, in color.

Investing to improve operations

Through strategic planning and coordination of resources between the HPC PMO and the LANL ASC Program, funding freed by efficiency gains is identified and reinvested to further improve the HPC Balance of Plant or other infrastructure supporting facility systems and general operations.

For example, the following Sanitary Effluent Reclamation Facility (SERF) improvements for reducing water consumption came from HPC PMO efficiency gains. Los Alamos uses reclaimed water from SERF for the SCC’s cooling systems, and the availability of SERF reclaimed water reduces potable water use. Analysis of SERF outages showed that the most likely obstacle to providing reclaimed water for cooling was the sensitivity of pneumatic pumps to power perturbations. Replacing pumps and installing a generator to ride through brief power outages are expected to increase SERF water availability and save millions of gallons of potable water.

Another recent investment that is improving operations involves Building Information Modeling (BIM), which moves LANL engineering practices from 2-D to 3-D. BIM decreases construction errors and reduces the time required to install new supercomputers. The 3-D modeling process allows for a highly accurate design and layout, within millimeters, of wiring, piping, and other facility systems. A BIM file includes an information-rich collection of data that different stakeholders, from the design teams to contractors, construction engineers, and the owners, can access and collaborate on remotely and rapidly. These models incorporate live updates during construction that reduce change orders. BIM also removes the complication of outside contractors gaining access to LANL’s secure facility to take measurements, saving weeks of measuring while also maintaining security.

Opportunities to excel at maintenance

The bus duct replacement project is important because it enables the full range of electrical distribution, both delta and wye, required in HPC systems. To keep the HPC facility available during the replacement of electrical substations, the project needs close coordination with HPC facilities. The HPC PMO managed this coordination and identified a number of critical schedule improvements.

In turn, this coordination by the HPC PMO enabled the project to support several major maintenance projects on the electrical substations while shortening the overall schedule. This will benefit HPC and its ASC users by keeping maintenance of critical systems up to date while minimizing future downtime for HPC systems.

Summary

Recent projects benefited from the application of expert HPC facility knowledge and the implementation of better planning, communication, and task-oriented focus driven by the new Project Management Office (PMO) for High Performance Computing (HPC). The PMO improved HPC operational excellence and efficiency for ASC and LANL by making the best use of individuals' expertise, tailored to each project, executing projects simultaneously, and adjusting technical strengths as the projects unfold. This new approach to how we do our work minimizes risks as each project develops and allows for a tailored mitigation approach based on shared risks, improving what we do.