Automated OpenCHAMI Integration Testing and Cluster Deployment
Marcos Johnson-Noya, Alana Kihn, Madison Mejia
OpenCHAMI is a new open-source high performance computing (HPC) system management stack that brings cloud-like concepts to traditional HPC management software. This project implements a CI/CD pipeline using GitHub Actions to accelerate further OpenCHAMI development. The pipeline runs on an ephemeral self-hosted virtual machine (VM) that is destroyed and recreated for each run to ensure security and reproducibility. The pipeline deploys OpenCHAMI and tests its various microservices using the Hurl framework. After the OpenCHAMI services are deployed, the pipeline boots a fully deployed HPC cluster that can be used for further integration tests.
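As a rough illustration of the kind of endpoint check the Hurl test suite performs, the following Python sketch queries a microservice endpoint and asserts on the HTTP status. The base URL, token handling, and endpoint path are assumptions for illustration; the pipeline's actual checks are written as Hurl files.

```python
# Minimal smoke test, analogous to the Hurl checks run by the pipeline.
# The base URL, token, and endpoint path below are placeholders (assumptions).
import requests

BASE = "https://demo.openchami.cluster:8443"   # hypothetical API gateway address
TOKEN = "replace-with-deployment-token"        # access token issued by the deployment

def check(path, expect=200):
    r = requests.get(f"{BASE}{path}",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     verify=False, timeout=10)
    assert r.status_code == expect, f"{path}: got {r.status_code}"
    return r

if __name__ == "__main__":
    check("/hsm/v2/State/Components")   # example microservice endpoint (assumed path)
    print("smoke tests passed")
```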
Cray Programming Environment Containerization
Ever J. Dominguez, Almond J. Heil
The Cray Programming Environment (CPE) is a set of vendor-supported packages that provide tools for high performance software development. Updates to the CPE install their packages to a fixed location. While Cray nominally supports 'stacking' multiple CPE releases in the same location, this poses a challenge when upgrading, since production software can break when the CPE is updated to a new release. This can extend scheduled downtimes, as substantial troubleshooting must be performed as part of the upgrade process. To support having multiple versions of the CPE installed without conflicts, this project aims to provide CPE images that are usable from alternate locations. New versions of the CPE come as ISO files from the vendor, and we seek to repackage them as SquashFS files so they can easily be mounted at new locations. SquashFS is a read-only, compressed filesystem which is performant at scale and can be stored as a single file, making it a good fit for applications in the HPC division. To perform the relocation, ELF executables must be modified to load libraries from the custom location, and broken symbolic links must be fixed. Then, the modified CPE directory can be saved as a SquashFS file, allowing for maintainability and reproducibility of scientific workloads.
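The relocation steps described above can be sketched as follows. This is a simplified illustration rather than the production tooling: the staging path, the assumed original install prefix, the new prefix, and the compression settings are all placeholders.

```python
# Rough sketch of the relocation: rewrite RPATHs, re-point absolute symlinks,
# then pack the tree into SquashFS. Paths and prefixes are illustrative assumptions.
import os, subprocess

SRC = "/pe/staging/cpe-release"          # unpacked CPE tree (hypothetical path)
OLD_PREFIX = "/opt/cray"                 # assumed original install prefix
NEW_PREFIX = "/usr/projects/cpe/release" # alternate mount location (hypothetical)

def is_elf(path):
    with open(path, "rb") as f:
        return f.read(4) == b"\x7fELF"

for root, dirs, files in os.walk(SRC):
    for name in files:
        path = os.path.join(root, name)
        if os.path.islink(path):
            # Re-point absolute symlinks at the new prefix.
            target = os.readlink(path)
            if target.startswith(OLD_PREFIX):
                os.remove(path)
                os.symlink(target.replace(OLD_PREFIX, NEW_PREFIX, 1), path)
        elif os.path.isfile(path) and is_elf(path):
            # Rewrite the RPATH so libraries resolve under the new location.
            rpath = subprocess.run(["patchelf", "--print-rpath", path],
                                   capture_output=True, text=True).stdout.strip()
            if OLD_PREFIX in rpath:
                subprocess.run(["patchelf", "--set-rpath",
                                rpath.replace(OLD_PREFIX, NEW_PREFIX), path])

# Pack the relocated tree into a single SquashFS image.
subprocess.run(["mksquashfs", SRC, "cpe-release.squashfs", "-comp", "zstd"], check=True)
```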
Cluster Management with Containerization on Switches
Dohyun Lee, Anvitha Ramachandran, Robin Simpson
Network switches, such as those from Arista and Mellanox, often have underutilized computational resources in the form of built-in processors and memory. By leveraging these untapped resources, we can optimize the functionality and efficiency of computational cluster networks. Our research focuses on deploying containers directly onto these switches to execute various auxiliary tasks ranging from metric logging to system-wide management via post-boot configuration. By doing so, we can significantly enhance the capabilities of the cluster without the need for additional dedicated hardware. Our research involved five distinct scenarios where switch utilization could have a profound impact on HPC clusters: running cloud-init services via a link-local connection; configuring a Telegraf container to export metrics; deploying a caching proxy; creating a reconfigurable IPv6 DHCP/DNS provider for a VLAN; and implementing client detection with Magellan discovery. These scenarios were containerized with Podman and Docker, and tested both physically on the switch and virtually on a QEMU VM, both running SONiC OS. Our testing and findings indicate that network switches can indeed be used for these scenarios, and they offer a wide range of possibilities beyond these applications. The containers ran as expected on the switches, and although there were some minor issues, workarounds were implemented. Overall, this is a positive result that can be further explored with more scenarios.
Charliecloud as a Kubernetes Runtime
London Bielicke, Angelica A. Loshak
Kubernetes is a popular system for automating container deployment and managing applications whose components run on various machines with different environments. HPC users require new systems like Kubernetes to support the increasing demand for novel workflows, especially in AI. Many HPC users rely on Slurm to manage and provision containerized jobs. However, Kubernetes could either replace or supplement Slurm in complex systems. For example, users can schedule containerized jobs, scale containers, and maintain metrics on container health using Kubernetes' declarative approach. Furthermore, Kubernetes decouples the container runtime from container management by supporting various container runtimes, meaning that users can customize their workloads using container systems that meet their needs. HPC workloads could also benefit from Charliecloud, a lightweight, fully unprivileged container runtime. However, Kubernetes only supports container services that implement the Container Runtime Interface (CRI), which Charliecloud does not yet support. We developed a prototype server program with essential CRI methods that map to Charliecloud operations, so that Kubernetes can run containers using Charliecloud. Kubernetes can now communicate with Charliecloud over a server to: (1) create and start containers, (2) track container health metrics, (3) run pods, a Kubernetes component that creates another layer of isolation around containers, and (4) respawn containers within a pod. Kubernetes expects certain features that Charliecloud does not use, such as network namespaces. However, Charliecloud can communicate with Kubernetes without certain expected operations. For example, the method PortForward() is a no-op, since Charliecloud containers share the host IP address. Furthermore, Kubernetes runs containers using separate create and start operations, while Charliecloud uses a single operation. Mapping Charliecloud operations to CRI methods ensures that Charliecloud and Kubernetes make compatible assumptions. By implementing the CRI as a server in Charliecloud with 700 lines of code, modifying fewer than 50 lines of other Charliecloud source code, and making no changes to Kubernetes, we demonstrate that Kubernetes and Charliecloud are compatible tools. Integrating these technologies facilitates scientific advancements that require large compute power and workflows novel to HPC.
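To make the mapping concrete, here is a heavily simplified Python sketch of the idea. The real prototype implements the CRI's gRPC service (omitted here); the method names, bookkeeping, and ch-run invocation shown are illustrative rather than the actual Charliecloud code.

```python
# Conceptual sketch of the CRI-to-Charliecloud mapping described above; not the
# prototype itself, which speaks gRPC using the CRI protobuf definitions.
import subprocess, uuid

class CharliecloudCRI:
    def __init__(self):
        self.containers = {}          # container id -> (image dir, command)

    def CreateContainer(self, image_dir, command):
        # Charliecloud has no separate "created" state, so creation just records
        # what StartContainer should launch later (create + start map to one step).
        cid = str(uuid.uuid4())
        self.containers[cid] = (image_dir, command)
        return cid

    def StartContainer(self, cid):
        image_dir, command = self.containers[cid]
        # ch-run launches the containerized process unprivileged.
        return subprocess.Popen(["ch-run", image_dir, "--"] + command)

    def PortForward(self, cid, port):
        # No-op: Charliecloud containers share the host network namespace.
        pass
```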
HPCInfo Improvements with Focus on Fairshare
Matthew Vandeberg
HPCInfo is a reporting tool that uses Grafana to visualize valuable cluster, project, and user activity at Los Alamos National Laboratory. This tool allows management to view cluster usage data, PIs to view allocation data for all the members of their projects, and users to track their cluster usage as well as visualize certain characteristics of their jobs. One data point that would help users better understand their cluster utilization, and possibly why their jobs aren't running on the clusters, is the fairshare value. As part of my project, I successfully designed and implemented a panel to display the fairshare value for a given user on a Grafana dashboard. In addition to the new fairshare panel, I was able to fix several other lingering issues in the HPCInformer codebase. One of these issues was partition states not being set correctly in one of the Python scripts that collect data for HPCInfo; the fix involved adding another dictionary to the script to accurately track partition states for each cluster. Another fix involved adding a new data point that tracks DST activity to one of the SQL tables used by HPCInfo. Future improvements were also examined, such as daemonizing the server side of HPCInformer, which currently operates through cron jobs, and fully moving my changes over to production.
Communication Performance Assessment of Sapphire Rapids Architecture
Jackson Wesley
The Sapphire Rapids architecture from Intel is an approach to microarchitecture design using chiplets: discrete, modular chips assembled into a package. This chiplet design offers powerful compute as well as cost-effective customization options and is the foundation of the Cray Shasta architecture used in modern supercomputers such as Crossroads, Tycho, and Rocinante at LANL. Because communication is a dominant cost in parallel applications, it is valuable to understand the on-node and off-node bandwidth and latency performance of these microarchitectures. In this study we present bandwidth and latency trends over a range of message sizes for on-node chiplets in on-socket and off-socket communication using Fabtests, a Libfabric testing package. Additionally, we measure on-node and off-node communication and compare with MPI communication measurements using the OSU Benchmarks library. We observe unanticipated differences in bandwidth performance at certain message sizes, occurring at small and very large messages. This study provides an assessment of the communication performance of the Sapphire Rapids architecture, allowing for communication improvements in HPC codes of interest.
Software is for People: A Pavilion Case Study
Hank Wikle
The development of software tools is a critical component of computing, both within and beyond the field of high-performance computing (HPC). These tools not only underpin the enterprise of scientific computing, but also support the maintenance, administration, and testing of computer systems. While software tools are technical artifacts that exist to achieve specific technical goals, they retain a secondary role as works of technical communication that convey their logic and intent to contributors. However, this dual role, along with its implications for users and developers of the software, is often overlooked. Recentering the practice of software development on human users and contributors gives rise to a set of basic principles, namely usability, readability, and extensibility, that support the transparent communication of an application's logic and intent to both the user and the developer. This in turn forms the foundation for an easier understanding of the software and its code base, which can ease adoption of a software tool and enable faster and more efficient development. This presentation explores these principles and illustrates them using a single feature developed for the Pavilion acceptance testing framework as a model.
Crafting an FTP Hook for Fishing in MarFS
Jordan Hebert
As the need for an integrated and optimized workflow between user file systems and the MarFS file system (i.e., Campaign) grows, the infrastructure by which users read and write data from storage must become more flexible. This requires a daemon that can read and write a MarFS file system as well as other POSIX file systems. The goal of this work is to research, configure, and eventually incorporate an FTP-based data transfer tool that can seamlessly read and write the sea of data that flows between various user file systems and MarFS systems, like Campaign.
Cluster Care: Reducing Downtime with Automated Node Failure Recovery
Robin E. Preble
Maintaining the functionality of thousands of nodes in large clusters is a labor-intensive task for system administrators. Nodes often fail for common and well-documented reasons with established remediation procedures. Manually intervening to fix each of these nodes is time consuming and can necessitate urgent responses, requiring employees to come in during off hours. This presentation gives an overview of Cluster Care, an automated solution to streamline the remediation process for node failures within clusters. This tool collects data on the current state of each node in the cluster and automatically executes configurable remediation procedures based on this information. Information about node states and actions performed is logged for tracking in Splunk. The system's modular design allows for extensive customization, enabling admins to configure monitoring tools, action commands, and mappings from node states to specific remediations, making it easy to adopt the tool in a variety of cluster environments. With Cluster Care's robust checks and automated remediations, users can expect to see improved availability for production workflows.
MUSTANG: A Powerful Vehicle for MarFS Object Cataloging and Retrieval
Paul D. Karhnak
MarFS is a POSIX-like file system which stores user data at Los Alamos National Laboratory on a medium-term basis. MUSTANG (MarFS Underlying Storage Tree and Namespace Gatherer) is a utility to scan a user-facing MarFS instance and its contents in parallel. MUSTANG outputs a list of unique MarFS object IDs encountered during traversal which may later be used to retrieve specific objects from backup storage. Eventually, MUSTANG is intended for use as an administrative tool in Marchive (MarFS archive) hybrid POSIX-tape file systems to efficiently dictate which tapes need to be accessed to move an object or objects into a user-facing POSIX file system. MUSTANG has demonstrated promising performance at scale for various realistic datasets.
Trusted Platform Provisioning for the OpenCHAMI Cluster Management Stack
Lucas Ritzdorf
Security is an increasingly critical factor in the modern computing landscape. Institutions live and die by their cybersecurity practices (or lack thereof), and high-performance computing is no exception to this. Concerningly, even cutting-edge supercomputers by well-known vendors lack key protective measures, necessitating awkward “security through obscurity” practices to guard against exploits such as escalation of privilege. This project implements a token-based, hardware-backed authentication system for supercomputer nodes, ensuring that only users with appropriate permissions can access protected configuration resources.
LDMS (Lightweight Distributed Metric Service) Deployment on CSM Systems
M. Aiden Phillips
The Lightweight Distributed Metric Service (LDMS) is a piece of software specific to High Performance Computing, meant to track system metrics at a wide range of scales, from a few nodes to tens of thousands of nodes. It is an open-source project created by a team at Sandia National Laboratories (SNL) and now has contributions from many HPC centers, including LANL, and from hardware vendors such as Intel and Cray. The deployment of LDMS currently provided by HPE/Cray on Los Alamos National Laboratory (LANL) clusters lacks both functionality and updates: it is based on a release of LDMS from 2016 (~8 years ago), and due to stability concerns we do not have it configured to run on compute nodes. The inclusion of LDMS on LANL clusters should give greater insight into issues occurring on clusters for both system administrators and users.
For this project, a newer release of the LDMS software needed to be stood up on testbeds, tested, packaged, and deployed. This presentation focuses on these tasks, the problems that cropped up in the process, and the solutions that were devised.
Development of a Capacity ON Demand User Interaction Toolkit (CONDUIT) Web Dashboard
Christa Collins
CONDUIT is a data transfer orchestration system for high performance computing that eases the burden on users when transferring data between networked storage systems. Previously, CONDUIT utilized an external job scheduler, which caused instability issues and lacked TLS support. In addition, users could only interface with CONDUIT through the CLI or through SLURM job directives (DWS). To resolve the external job scheduler's issues, we developed and integrated a custom internal job scheduler. To provide an alternative to the CLI and DWS, we developed a web dashboard. We utilized WebSocket communication in the CONDUIT HTTP server to relay user job progression to the web dashboard. We achieved this by building a web interface written in TypeScript, using React as our frontend web framework, and writing a backend communication API in Go. This web dashboard extends the CLI and DWS by providing another way for users to interact with CONDUIT.
Effective Database Design for Efficient Workflow Orchestration
Kabir Vats
High Performance Computing (HPC) resources are valuable, and increasing efficiency by dedicating these resources to essential tasks is a vital goal in workflow orchestration. Build and Execution Environment (BEE) is a workflow orchestration and containerization software package that uses a Neo4j graph database to track dependencies between tasks in a workflow. BEE currently launches a containerized instance of Neo4j for each workflow, imposing large resource costs on the system it runs on and significantly increasing the setup time for each submitted workflow. This presentation demonstrates a way to combine these graph databases into a single database that manages all workflows at once and compares the memory usage and runtime of BEE workflows before and after this change.
Are we there yet? Predicting the Queue Wait Times and Job Runtimes for HPC Jobs
Christin Whitton
This project utilizes historical data from the Grizzly cluster, a medium-sized supercomputer here at LANL, to predict both the job runtime (what the user roughly predicts when submitting their job with their “wallclock limit”) and the minutes a job will wait in the queue. SLURM data from 2018 and 2022 were used to build models to predict these variables. Multiple models built with different machine-learning algorithms, including Random Forest and AdaBoost, were compared. Using the built models, we show that machine learning algorithms offer an improved estimate over the user-predicted runtime and over a simulated queue wait-time.
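A minimal sketch of this kind of modeling pipeline is shown below, using scikit-learn. The column names, feature set, and data export format are assumptions for illustration, not the project's actual SLURM schema or preprocessing.

```python
# Illustrative regression setup for queue wait time; column names are assumed.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

jobs = pd.read_csv("grizzly_slurm_jobs.csv")      # hypothetical export of SLURM data
features = ["req_nodes", "req_cpus", "wallclock_limit_min", "partition_id", "qos_id"]
X, y = jobs[features], jobs["queue_wait_min"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
for model in (RandomForestRegressor(n_estimators=200, random_state=0),
              AdaBoostRegressor(n_estimators=200, random_state=0)):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(type(model).__name__, "MAE (minutes):", mean_absolute_error(y_test, pred))
```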
Leveraging Lustre to Implement Incremental Indexing in GUFI
Migeljan Imeri
When dealing with filesystems on the order of terabytes or petabytes, trivial metadata operations such as looking up a file can suddenly become quite lengthy to perform. These lengthy metadata operations can negatively impact the user experience when users attempt to manage their data, and they lower the performance of the file system while the operations are taking place. The Grand Unified File-Index (GUFI) is a tool that indexes filesystems, allowing greatly improved performance when querying large parallel file systems. Currently, updating this index requires a full filesystem walk, which can be inefficient if only a small number of modifications were made in the filesystem. Using the Lustre changelogs feature, which records modifications to the filesystem, we can point GUFI at only the specific directories that need reindexing, rather than doing a full filesystem walk. Leveraging these changelogs should allow for significantly faster index updates, which in turn would allow for more up-to-date indexes, as the reindexes could be performed more frequently.
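The reindexing idea can be sketched as follows. The changelog parsing is simplified and the filesystem and MDT names are placeholders; a production consumer would register a changelog user with the MDT and clear processed records.

```python
# Sketch: derive the set of parent directories touched since the last index update
# from Lustre changelog records, so GUFI can re-walk only those directories.
import re, subprocess

MDT = "lustre-MDT0000"    # placeholder MDT name
MOUNT = "/mnt/lustre"     # placeholder client mount point

out = subprocess.run(["lfs", "changelog", MDT],
                     capture_output=True, text=True).stdout

dirs_to_reindex = set()
for line in out.splitlines():
    # Records carry a parent FID field such as p=[0x200000007:0x1:0x0];
    # the parent directory is what needs reindexing.
    m = re.search(r"p=\[([^\]]+)\]", line)
    if m:
        path = subprocess.run(["lfs", "fid2path", MOUNT, m.group(1)],
                              capture_output=True, text=True).stdout.strip()
        if path:
            dirs_to_reindex.add(path)

for d in sorted(dirs_to_reindex):
    print(d)   # feed these directories to GUFI's indexer instead of a full tree walk
```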
Enhancing Workflow Manager and Resource Manager to Support Elastic Scientific Workflows in HPC Systems
Rajat Bhattarai
As scientific workflows grow increasingly complex, integrating AI tasks with traditional high-performance computing (HPC) simulations, dynamic resource management, and elastic execution become essential for optimizing resource utilization in HPC supercomputers. Furthermore, dynamic resource management is necessary for workflows since the computational requirements may not be known when workflows are submitted to HPC systems for execution, and may change over time. Current resource management systems and workflow managers in HPC systems provide limited support for dynamic resource allocation. Static allocations of resources can lead to an overprovisioning of resources, with the risk of underutilization, or an underprovisioning of resources, resulting in workflows terminating prematurely. In this work, we identify some key enhancements needed in HPC software stacks, specifically in Workflow Managers and Resource Managers, in order to support elastic workflows. We develop and evaluate a prototype for dynamic resource management based on the elastic PMIx-enabled Parsl workflow manager and a customized hierarchical scheduler built on top of Slurm. Through experiments and case studies involving some real applications, we demonstrate that dynamic resource management and elastic workflows can improve system and workflow performance by enhancing resource utilization and reducing workflow execution times.
Echo State Networks: An Approach to Non-Intrusive Anomaly Detection in Manufacturing
Kendric Hood
This poster and presentation investigate the applicability of Echo State Networks (ESNs) for developing non-destructive tests (NDT) with non-intrusive load monitoring (NILM) in manufacturing. Specifically, we evaluate their performance on gas metal arc welding (GMAW) compared to other models. Our findings demonstrate that ESNs can effectively utilize raw data without requiring subject matter experts for preprocessing or feature engineering. For GMAW, we show that the power drawn by the welder is sufficient to accurately identify anomalies with an ESN model. ESNs can learn from only one data point, a single example weld, due to their use of the pseudo-inverse matrix method for model training. This allows implementation of NILM in a wide range of manufacturing processes where large amounts of training data are unavailable or impractical to collect. Our comparative analysis reveals that alternative models, such as transformers, demand significantly more data and are unrealistic in scenarios with limited data availability. Through a comparative analysis of different artificial neural networks (ANNs), we show that ESNs can outperform common models in terms of accuracy, implementation overhead, and complexity.
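A minimal echo state network with a pseudo-inverse readout looks roughly like the following. The reservoir size, spectral radius, and synthetic signal are illustrative stand-ins for the welder power data, not the project's actual configuration.

```python
# Minimal ESN sketch: fixed random reservoir, readout trained in one shot
# with the pseudo-inverse, as described above.
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_reservoir = 1, 200

W_in = rng.uniform(-0.5, 0.5, (n_reservoir, n_inputs))
W = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))        # keep spectral radius below 1

def reservoir_states(u):
    x = np.zeros(n_reservoir)
    states = []
    for u_t in u:
        x = np.tanh(W_in @ np.atleast_1d(u_t) + W @ x)
        states.append(x.copy())
    return np.array(states)

# One "example weld": predict the next power sample from the current reservoir state.
u = np.sin(np.linspace(0, 20, 500)) + 0.05 * rng.standard_normal(500)
X = reservoir_states(u[:-1])
y = u[1:]
W_out = np.linalg.pinv(X) @ y                    # single-shot pseudo-inverse training

pred = X @ W_out
print("training MSE:", float(np.mean((pred - y) ** 2)))
# At inference time, a large prediction error on a new weld flags an anomaly.
```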
Edge-Disjoint Spanning Trees on Star-Product Networks
Daniel Hwang
Network topologies, or the design of networks, can be represented using the mathematical structure of graphs. In order to construct larger networks from two smaller networks, we can take their graph product. More specifically, we can take their star product, which is a generalization of the Cartesian product, to form star-product networks. In this presentation, we give a brief description of star-product networks and their significance when it comes to network design. Moreover, we also introduce the concept of edge-disjoint spanning trees on an arbitrary graph and our project’s goal of maximizing the number of edge-disjoint spanning trees on star-product networks. We will also present a poster describing our constructions of edge-disjoint spanning trees in more detail.
Benchmarking Effects of Erasure Scheme and MPI Configuration on MarFS Throughput
Janya Budaraju, Paul Karhnak, Zachary Snyder
MarFS is an open-source, medium-term campaign storage platform used in Los Alamos National Laboratory (LANL) supercomputing clusters. MarFS performs parallel file operations to read from and write to a mount point, employing multi-layer erasure coding through an Intel Intelligent Storage Acceleration Library (ISA-L) Reed-Solomon error correction implementation to create redundancy and resiliency. Currently, optimizing MarFS erasure scheme and Message Passing Interface (MPI) parameters for a specific cluster requires close hardware familiarity; we created a software suite to benchmark MarFS throughput across multiple configuration parameters, abstracting away hardware-level considerations. In addition, we identified performance patterns and validated assumptions about ISA-L erasure coding in HPC workloads. Our results offer detailed insight into ISA-L erasure coding performance on our cluster and demonstrate our tools' viability. The tools we developed provide a key starting point for optimizing MarFS performance, accommodating growing storage system performance needs at LANL.
Charliecloud’s Successful Prototype Integration with Slurm: A Promising Approach with Some Strings
Layton McCafferty, Nicholas Volpe, Hank Wikle
Containerization is becoming increasingly important in HPC. Container technologies leverage Linux kernel isolation mechanisms to promote software package flexibility, application portability, and customization of user software stacks. Charliecloud brings these benefits to HPC, providing a lightweight, fully unprivileged container runtime. Slurm, another key tool in HPC, is a workload manager responsible for scheduling the allocation of resources and jobs across multiple interconnected nodes. In 2021, Slurm 21.08 added support for container workflows via the --container flag, which provides an interface for interacting with standardized container bundles via any container runtime compliant with the Open Container Initiative (OCI) standard. Since ease of use can make or break adoption of software tools, it is important that the Charliecloud runtime integrate smoothly with Slurm. Issues related to ease of use can pose barriers to adoption of Charliecloud, which in turn prevents users from reaping the benefits that it provides. In collaboration with developers at SchedMD, our team successfully prototyped an approach to integrating Charliecloud with Slurm's --container flag and underlying features. While this new approach is a viable solution to the original problem, the somewhat convoluted and intricate configuration process places a potential burden on both system administrators and users. Specifically, this new approach imposes limitations in two key ways: (1) it may require an upgrade to Slurm 23.02, a process which at present is error-prone and may not always be possible; and (2) it requires options and arguments passed to Charliecloud's runtime to be hard-coded in a configuration file. We believe that this prototype shows promise and that future work may be able to eliminate these limitations.
Monitoring Clusters Using Extended Berkeley Packet Filter (eBPF)
Weston Cadena, Alexis Ng, M. Aiden Phillips
Modern High Performance Computing (HPC) systems vary immensely in size, software, and purpose, creating vast differences in production workflows and performance. Troubleshooting system calls, both in kernel and user space, within HPC has traditionally not been straightforward. By exposing metrics that were previously impossible to view, eBPF delivers highly detailed and objective information clarifying system performance [3]. eBPF tools are used for performance analysis and observability at low cost and overhead by attaching probes to kernel- and user-space system calls. These probes provide non-invasive access to kernel and user routines without disrupting processes. The BPF Compiler Collection (BCC) [4] is a set of tools built on eBPF that provide insight into the kernel and user stack and into software applications using tracing, sampling, and snooping. These tools could be used for improved system monitoring, application profiling, and general troubleshooting, giving users avenues to improve simulation efficiency and giving administrators more in-depth answers about HPC systems. This project aimed to determine the viability of BCC tools in an HPC environment. With BCC tools, we were able to characterize the following (a minimal usage sketch follows the list):
- NFS Latency and Bandwidth
- CPU stacks and nature of running processes
- Intercommunication of parallel processes over TCP
- Cache hits and misses over time
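A minimal example of the BCC pattern used above, shown as a sketch rather than one of the project's actual scripts, attaches a kprobe to a kernel function and periodically reports per-process counts. The traced function and sampling interval are illustrative choices, and the script must run with root privileges.

```python
# Minimal BCC sketch: count vfs_read() calls per PID and print the top callers.
from bcc import BPF
import time

prog = r"""
BPF_HASH(counts, u32, u64);
int trace_vfs_read(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="vfs_read", fn_name="trace_vfs_read")

while True:
    time.sleep(1)
    top = sorted(b["counts"].items(), key=lambda kv: -kv[1].value)[:5]
    for pid, count in top:
        print(pid.value, count.value)
    b["counts"].clear()
```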
Evaluating Lustre Network Performance over InfiniBand and RoCE
Matthew Vandeberg, David Medin, Benjamin Schlueter
With the increasing performance of Ethernet, the possibility of replacing an InfiniBand network with a RoCE-based Ethernet network has become more feasible. Currently, high-performance computing relies on highly parallel network file access to maximize computational performance. To accomplish this, a Lustre file system is often used alongside an InfiniBand network, which provides the network speed required by many HPC applications. This project aims to evaluate the implementation difficulties and performance differences of replacing a traditional InfiniBand-based Lustre network with Ethernet that takes advantage of RDMA using RoCE.
Benefits of Time Series Data Tables for HPCInfo
Kenton Romero
Seeing the trees for the forest: Describing HPC Filesystems with the Grand Unified File-Index (GUFI)
Jenna Kline
High performance computing (HPC) filesystems are extremely large, complex, and difficult to manage with existing tools. It is challenging for HPC administrators to describe the current structure of their filesystems, predict how they will change over time, and anticipate the requirements for future filesystems as they continue to evolve. Previous studies of filesystem characteristics largely predate the modern HPC filesystems of the last decade. The Grand Unified File Index (GUFI) was used to collect the data used to compute the characteristics of six HPC filesystem indexes from Los Alamos National Laboratory (LANL), representing 2.8 PB of data and containing 36 million directories and 600 million files. We will present a methodology that uses GUFI to characterize the shape of HPC filesystems and help system administrators understand their key characteristics.
Development of a Capacity ON Demand User Interaction Toolkit (CONDUIT) Job Launch Mechanism
Christa Collins
Post-Exascale Star Product Networks and Allreduce Spanning Trees
Aleyah Dawkins
Networks are based on mathematical graphs. One important family of networks is based on the star product graph. This is a generalization of network topologies based on Cartesian products (such as HyperX), and includes networks such as SlimFly and PolarFly that target post-exascale systems. For these star-product networks to be useful, they must support collectives such as Allreduce (including broadcast and reduction) and others. These problems map immediately to the problem of finding a large number of more or less edge-disjoint spanning trees in the network graph. In this talk, we look at results that construct a maximal number of truly edge-disjoint trees on Cartesian networks. We attempt to generalize these results to the class of star-product networks. Success here would mean a general method of constructing spanning trees enabling efficient Allreduce that would apply to all networks in this emerging post-exascale family of networks.
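For context, a classical benchmark for how many edge-disjoint spanning trees a graph can admit is the Nash-Williams–Tutte tree-packing theorem, stated below as background; this is a standard result cited for orientation, not a claim about the constructions described in this talk.

```latex
% Nash-Williams--Tutte tree-packing theorem (background reference):
% a graph $G=(V,E)$ contains $k$ pairwise edge-disjoint spanning trees
% if and only if, for every partition $\mathcal{P}$ of $V$ into $r$ parts,
% the number of edges joining distinct parts satisfies
\[
  e_G(\mathcal{P}) \;\ge\; k\,(r-1).
\]
```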
Creating, Debugging, and Optimizing Roles in Shasta Keycloak for Improved Administrator Privilege Separation
Airam Flores
A year in the life of a Charliecloud developer
Lucas Caudill
Improving MPI with Rust
Jacob Tronge
MPI continues to play an important role in HPC applications. At the same time, existing implementations fail to provide guarantees of safety and correctness, instead leaving many important error conditions to be checked by the user. Many of these errors, such as mismatched types or mismatched collective arguments, can lead to memory corruption, segmentation faults, and undefined behavior, which typically stem from a lack of memory safety guarantees. The MPI implementations themselves often encounter similar problems from within the library. Implementations are increasingly required to adapt to new hardware and programming environments, requiring extensive development and testing that can leave room for memory-safety-related errors. One way to solve these problems is to work with newer languages that are designed to guarantee memory safety; the Rust programming language is one such language, attempting to guarantee memory safety while maintaining performance close to that of C. In this presentation I'll give a brief overview of two different prototypes written in Rust that attempt to solve these problems and then show results indicating that performance is close enough to the original versions to merit further consideration of Rust for HPC applications.
Detecting Spatter in Laser Powder Bed Fusion with Computer Vision
Sean Tronsen
Additive manufacturing (AM) allows for the fabrication of components with designs previously impossible to build in one step using traditional methods such as casting or injection molding. The key difference is that parts created using an AM process are built layer by layer, enabling the use of internal support structures and complex geometries. Metal and ceramic parts can be synthesized using a process known as laser powder bed fusion (LPBF), which involves selectively fusing material on a bed of powdered substrate using both a laser and a scanning mirror. The scanning mirror aims the energy deposited by the beam to fuse powdered material and create each layer of the build. Laser powder bed fusion fabrication is not without issues, and our work aims to help improve the process by providing additional methods to detect failure early on. We accomplish this by leveraging techniques in machine learning and computer vision to locate anomalies present in images that capture the state of the powder bed. Our goal additionally involves creating methods to analyze how material properties and machine configurations affect the rate at which anomalies are produced during fabrication.
Cray EX40 (Chicoma) Cluster Intrusion Detection Project
Daniel Wild
Security changes or configurations can often reduce the performance of a supercomputer or cluster (Shah, 2023). Analysis of a cluster's external network traffic provides an opportunity to identify potential malicious traffic, cluster misuse, or configuration problems without causing a negative performance impact. Using a mirror port, this project captured the external network traffic to and from the Cray EX40 (Chicoma) cluster for three months and analyzed it using two open-source intrusion detection tools, Suricata (Suricata, n.d.) and Zeek (Zeek, 2020). These intrusion detection tools were compiled and installed from source. Ansible roles and installation scripts were developed to automate future deployment and maintenance on production systems. The tools were tuned for high performance computing requirements using eBPF filters integrated in the build to bypass elephant flows and reduce packet loss. This project successfully identified security concerns such as excessive (approximately 1610) Secure Shell connection attempts over a short (approximately 12 hour) time interval from a single source, as well as four invalid certificates.
This project also identified several cluster configuration issues, including anomalous switch and node Domain Name Service (DNS) queries, outbound Hypertext Transfer Protocol traffic with Automatic Private Internet Protocol Addressing, and Transmission Control Protocol errors across the network. Anomalous node DNS queries were so prevalent that they encompassed approximately 97% of all DNS traffic within the network. Intrusion detection tools monitoring external cluster network traffic can provide security while enabling insights into configuration issues, potentially increasing cluster performance and improving the user experience.
Pavilion test searching
Frank Keithley
Elastic Workflows with PMIx
Rajat Bhattarai
Scientific workflow applications are growing in complexity. Elastic workflow applications that can change the number of processors while being executed promise improved application and system performance. The current High-Performance Computing (HPC) software infrastructure, which includes resource managers (RMs), workflow management systems (WMSs), and application runtimes, does not support malleable applications and workflows. In this work, we investigate the challenges and requirements for elastic workflows, identifying shortcomings in middleware, RMs, or the WMSs themselves that impact the ability to support malleable applications. We also present our early experience with using PMIx as an advanced middleware to support one of the popular workflow management systems, Parsl, to enable fine-grained dynamic resource management. Our evaluation indicates that fine-grained elastic resource management improves system and application performance in terms of system utilization and application turnaround time.
Writing UMT Pavilion configs for Crossroads Acceptance Testing
Shivam Mehta
Before a new high performance computing system is installed, the vendor provides a report for the new system. This report contains the specifications of the new system, such as flops, performance, power consumption, cache storage, bandwidth, etc. These specifications do not include performance metrics of real-world applications. The testing applications the vendors use are not a good measure of performance because they are optimized for the new specific architecture. Using real-world applications allows us to compare the performance metrics of the new system with those of the current system. Ensuring that these applications perform efficiently and satisfy the requirements of the specifications is called acceptance testing.
Several simulation and benchmark applications representing workloads of all the national labs have been developed for the upcoming Crossroads acceptance testing. One of these simulations is the Unstructured Mesh Transport (UMT) application. UMT is an Advanced Simulation and Computing (ASC) proxy application developed by Lawrence Livermore National Laboratory that solves a thermal radiative transport equation using discrete ordinates. To perform this simulation on Crossroads, we need to build UMT using Pavilion. Pavilion is a framework to run and analyze tests targeted for HPC systems. Once UMT is built using Pavilion, we can modify the number of nodes and tasks per node to run the simulation on; this will provide us with metrics that we can compare to our current system as well as to the defined testing specifications.
Exploring Rust in High-Performance Computing for Mitigating Errors and Improving Security
Jake Tronge
Most existing HPC applications and libraries are written in C, C++, or Fortran. Research into programming errors and security vulnerabilities has revealed that many of the errors that appear in programs written in these languages are related to memory errors, such as buffer overflows, double-frees, and data corruption. These can be traced back to the fact that these languages perform few or no checks to ensure that memory is being properly used within code. Within the past decade there has been a lot of research into using "memory-safe" languages; these perform checks, either at compile time or at runtime, that are designed to ensure that memory is being used correctly. This eliminates many, if not all, of the memory-related issues that plague most low-level and performance-driven software. In HPC, where testing and debugging are made many times more complicated by the need for parallelism and the use of specialized hardware, memory-safe languages may offer serious improvements to development and security. Rust is one promising memory-safe language; it was originally developed by the Mozilla Foundation for improving memory safety in the Firefox web browser, but has since spread to many other projects and organizations. Our work looks into how Rust can be used in HPC, comparing it with existing codes through benchmarking, as well as finding places where Rust might be able to offer improvements one component at a time. Initially we have looked at improving existing MPI bindings for Rust and noting where Rust diverges from existing languages. In future work we plan to research other ways in which Rust might improve HPC software and development.
Porting the Energy Exascale Earth System Model to the Chicoma LANL HPC Platform
Timothy Goetsch and Franklin Keithley
The Energy Exascale Earth System Model (E3SM) is an Earth system modeling, simulation, and prediction suite used to meet the scientific needs of the nation and the mission needs of the Department of Energy (DOE). Through funding by the DOE, the Institutional Computing (IC) Program provides laboratory resources, such as compute time on the High Performance Computing (HPC) supercomputers at Los Alamos National Laboratory (LANL), to scientists and engineers through a peer-reviewed proposal process. Chicoma will soon become IC's only HPC resource for funded projects; therefore, it is imperative to port the E3SM suite in support of climate scientists' research at LANL. Along with scientific software support, the Programming & Runtime Environments Team (PRETeam) of HPC-ENV strives to support the installation and management of E3SM on LANL HPC resources. We have ported E3SM's build system to support Chicoma's AMD Rome EPYC processors against one Cray Programming Environment (CPE) by leveraging the Intel oneAPI compiler suite and Math Kernel Library (MKL) and HPE Cray's HDF5, PNetCDF, NetCDF, and MPICH installations. In the future, porting to the GPU partition and adding support for more CPEs will improve the usability, versatility, and performance of E3SM, enabling the scientific endeavors underway to further the nation's predictive capabilities of the Earth's climate and environmental systems in order to deliver future sustainable energy solutions.
Tropical Neural Networks
Jose Ortiz
Relating Epigenetic Information to the Structure of DNA Using Deep Learning
Vanessa Job
In order for genes to be expressed, their encodings in DNA must be accessible. DNA is packaged around protein spools, forming a structure called chromatin, where DNA is wound or unwound from these spools depending on which regions are being accessed (i.e., which genes are being expressed). Modifications called epigenetic tags attach to the spools; these tags influence which regions of DNA are exposed. The structure of chromatin is dynamic: different pieces of DNA may be exposed depending on the cell's environment. Some changes in structure are normal, for instance exposure of clock genes at different stages of the circadian cycle, while other changes in structure indicate disease such as cancer. The goal of our research is to build a machine learning system to relate the structure of chromatin to epigenetic tags. To do this, we relate data from structural assays (which determine the current configuration of chromatin) to epigenetic assays (which indicate the location of epigenetic tags on the genome). Since assays require costly materials and the time of an expert to conduct, using a machine learning system to predict the result of one assay from another has the potential to accelerate the pace of biological discovery. Additionally, such a system will enable virtual experiments that will elucidate the relationship between epigenetic tags and the structure of chromatin.
Evaluating TCP Protocol Performance on High-Speed Networks
Noah Jones, Jerrod Parten and Lucas Ritzdorf
High performance computing (HPC) clusters rely on specialized low-latency, high-bandwidth communication networks to enable fast data transfer between compute nodes, with minimal processor overhead. This is implemented using a co-processor in the network card with direct access to system memory, which allows for modification of RAM contents without involving the CPU. These cards are connected via high-speed cabling and switches to form a cohesive network fabric with an architecture fundamentally incompatible with that of traditional Ethernet. Software interoperability layers exist, though they reintroduce the processor overhead that these networks were designed to avoid, increasing latency and reducing bandwidth. Since many HPC services, particularly high-performance filesystems and data transfer mechanisms, communicate via internet protocol (IP) networks, this reduced speed can be detrimental to system performance and utility. Here, we use a cluster of ten compute nodes and a single master node, connected by a Mellanox InfiniBand fabric, to evaluate the base performance of IP over InfiniBand (IPoIB) connections and fine-tune system parameters to maximize IPoIB link throughput. We also evaluate the effects of IPv6 addressing, and of relaying data to an Ethernet network through an intermediate input/output node. After system tuning, we consistently match vendor performance estimates (achieving roughly 3.2 times the out-of-box bandwidth), and our results suggest that IPoIB bandwidth on a capable host system can begin to approach that of a raw InfiniBand connection. The performance gains demonstrated here may enable future HPC systems to more efficiently communicate with IP networks using existing high-speed fabrics, reducing costs and maintenance requirements associated with dedicated Ethernet hardware.
Charliecloud's Git-based Cache is Competitive with Alternatives
Z. Noah Hounshel, Ashlynn Lee and Ben Stormer
The essential software stack of an application is often bundled together in a "container" for ease of portability and use. Systems for building containers frequently use a build cache to decrease build time. Two industry-standard container build systems are Podman and Docker. Standard Docker requires its users to have administrator privileges in order to build and run containers, and while Podman includes a "rootless" mode, it is not fully unprivileged because it uses setuid executables. Charliecloud is a LANL-developed container build system designed for unprivileged use. Docker and Podman use a build cache based on the OverlayFS file system, whereas Charliecloud uses a Git-based cache. Charliecloud's Git-based cache is a recent development and its performance is largely untested. We measured the build speed of container images for Charliecloud, Docker, and Podman to determine the practicality of Charliecloud's Git-based build cache. We built six container images of differing size and complexity using Charliecloud, Docker, and Podman.
We built the images with varying levels of the build cache being "filled" in the respective systems, including with Charliecloud's cache disabled. Hyperfine, a command-line benchmarking tool, was used to run each test repeatedly and gather time data from each run.
We found that Charliecloud's cache generally resulted in slower build times than Docker's and Podman's, but was not meaningfully slower than Podman's. We also found that Podman produced the most inconsistent build times of the three technologies. On our compute nodes, Charliecloud was roughly 2 times slower than both Docker and Podman for cold-cache tests, with the average Charliecloud build time being 290 seconds (averaged across all tests). For hot-cache tests, Charliecloud was roughly 2 times slower than Podman but 10 times slower than Docker, with the average Charliecloud build time being roughly 5 seconds (averaged across all tests).
Our results suggest that Charliecloud's Git-based cache system is a viable form of container build caching. Charliecloud's cache has room for several optimizations, but was not so much slower than Podman's caching system as to be problematic in most workflows. Potential optimizations include fixing a bug that fails to cache "COPY" commands and checking Dockerfile hashes to reduce Git tree traversal time.
Performance Analysis of Non-Volatile Memory Express Over Fabrics (NVMeoF) Using Infiniband and Ethernet
Christa Collins, Joseph Sarrao and Zach Wadhams
With the increase in complexity of scientific computing codes, the way in which data is transferred and stored on high-performance computing (HPC) systems needs to become more dynamic. Non-volatile memory express over fabrics (NVMeoF) allows for this flexibility. While past storage structures have included static storage over the network for each server node, NVMeoF allows access to all storage media over various network types such as InfiniBand, RDMA (Remote Direct Memory Access) over Converged Ethernet (RoCE), and TCP. This provides the capability to dynamically allocate storage pools that are specially designed for the jobs running on a subset of worker nodes. This project's primary objective is to analyze the performance of NVMeoF over various high-speed networks such as InfiniBand, RoCE, and TCP. We analyzed the data throughput and input/output operations per second (IOPS) of NVMeoF with each network type using the IO benchmarking tool FIO, and compared these results with the same tests run against local NVMe storage as a baseline.
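The measurement harness can be sketched in a few lines of Python around FIO's JSON output. The device paths, job parameters, and local-versus-fabric comparison targets below are illustrative assumptions rather than the exact test matrix.

```python
# Sketch: run FIO against a block device and pull IOPS/bandwidth from its JSON output.
import json, subprocess

def run_fio(filename, rw="randread", bs="4k", iodepth=32, runtime=30):
    cmd = ["fio", "--name=nvmeof-test", f"--filename={filename}",
           f"--rw={rw}", f"--bs={bs}", f"--iodepth={iodepth}",
           "--ioengine=libaio", "--direct=1", "--time_based",
           f"--runtime={runtime}", "--output-format=json"]
    result = json.loads(subprocess.run(cmd, capture_output=True, text=True).stdout)
    job = result["jobs"][0]["read" if "read" in rw else "write"]
    return job["iops"], job["bw"]   # fio reports bandwidth in KiB/s

# Hypothetical targets: a local NVMe drive vs. an NVMeoF-attached namespace.
for target in ["/dev/nvme0n1", "/dev/nvme1n1"]:
    iops, bw = run_fio(target)
    print(target, f"IOPS={iops:.0f}", f"BW={bw} KiB/s")
```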
Streamlining Machine Learning for Molecular Dynamics by Interfacing Python with LAMMPS
Steven Anaya
Charliecloud's Git-based Cache is Competitive with Alternatives
Z. Noah Hounshel, Ashlynn Lee and Ben Stormer
Evaluating TCP Protocol Performance on High-Speed Networks
Noah Jones, Jerrod Parten and Lucas Ritzdorf
HPC Network Security Analytics Using Virtual Appliances
Victoria Sasaoka
Automating and Customizing the Node Health Check Tool for Support System
Aedan Wells
Implementing Kexec with Ironic to Reduce HPC System Downtime
Kam Killfirst
Reboot times in High Performance Computing systems continue to grow as the amount of resources being managed increases with technological improvements. Using Kexec to fast-boot into a new kernel and ramdisk reduces the time that these systems are offline. Traditional rebooting processes take time as the system performs the functions of the power-on self-test before booting into the operating system. This presentation will explain how using Kexec with Ironic can reduce the downtime of High Performance Computing systems.
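The core fast-reboot step itself is small. The sketch below shows the kexec load-and-execute sequence with placeholder kernel, ramdisk, and command-line values; in this work the step is orchestrated through Ironic rather than run by hand.

```python
# Sketch of the kexec fast-reboot step; paths and kernel command line are placeholders.
import subprocess

KERNEL = "/boot/vmlinuz-new"          # hypothetical staged kernel
INITRD = "/boot/initramfs-new.img"    # hypothetical staged ramdisk
CMDLINE = "console=ttyS0 root=live:http://provisioner/image.squashfs"  # example only

# Stage the new kernel and ramdisk without resetting the hardware...
subprocess.run(["kexec", "-l", KERNEL, f"--initrd={INITRD}",
                f"--append={CMDLINE}"], check=True)
# ...then jump straight into it, skipping firmware power-on self-test entirely.
subprocess.run(["kexec", "-e"], check=True)
```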
Number Representations and their Applications to Hardware Devices
Andrew Alexander and Matthew Broussard
As Moore's Law falters, scientists are seeking out novel devices to optimize the performance, energy usage, and space of computation. However, the physics behind these devices may not be intrinsically suited to the same binary data representations as traditional transistor-based hardware. We investigate various number representations and map them to developing technologies in order to optimize future projects, illustrating that it may be critical to look beyond the standard 2's complement binary infrastructure when pursuing new computer architectures.
Machine Architecture Impact on Application Performance
Nicklaus Przybylski
Relating Epigenetic Information to the Structure of DNA Using Deep Learning
Vanessa Job
Perils of the One-Size-Fits-All Kernel: A Fast, Secure Search for FileSystem Metadata
Prajwal Challa
Robust Architectures for Arithmetic Circuits via Quantum Sampling
Vanessa Job and Nathan Kodama
Can it scale?: Metadata Performance Testing of Lustre Dynamic Namespace
Megan Booher & Seema Kulkarni
Lustre is a parallel file system used in high performance computing for its massively scalable storage. Metadata Servers (MDSs) store metadata in the dynamic namespace (DNE). Utilizing DNE, multiple MDSs can be used to scale performance, which allows for greater computational ability. The previous version, DNE v1, requires users to manually distribute the namespace across Metadata Targets (MDTs). This is difficult for users not well versed in systems and requires human intervention. The new version, DNE v2, automates this distribution task, but in previous versions of Lustre it performed significantly slower than DNE v1. The purpose of this research is to benchmark the performance differences between DNE v1 and v2 in the latest version of Lustre and, based on our findings, suggest which technology should be adopted for HPC operations. Our test environment consisted of 5 MDSs, 2 OSSs, and 6 client nodes. Mdtest was used to evaluate the scaling of metadata performance. The tests show that DNE v2 does not scale linearly when more MDTs are used, unlike v1, whose performance increases as more MDTs are used. We suggest that LANL continue to use DNE v1 due to its positive scaling, which will aid efficient high performance computing in divisions across the lab.
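For illustration, the two DNE modes can be exercised roughly as follows. The MDT index, stripe count, and mdtest parameters are examples, not the exact test matrix used in this work.

```python
# Sketch: create a DNE v1 remote directory and a DNE v2 striped directory,
# then drive metadata load on each with mdtest under MPI.
import subprocess

MOUNT = "/mnt/lustre"   # placeholder client mount point

# DNE v1: explicitly place a directory on a chosen MDT.
subprocess.run(["lfs", "mkdir", "-i", "1", f"{MOUNT}/dne1_dir"], check=True)

# DNE v2: stripe a directory's metadata across several MDTs.
subprocess.run(["lfs", "setdirstripe", "-c", "4", f"{MOUNT}/dne2_dir"], check=True)

# Compare creates/stats/removes per second between the two layouts.
for d in ("dne1_dir", "dne2_dir"):
    subprocess.run(["mpirun", "-np", "48", "mdtest",
                    "-n", "10000", "-d", f"{MOUNT}/{d}"], check=True)
```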
Exploring the Trusted Platform Module to Establish Mutual Trust in High Performance Computing
Devon Bautista & Rebecca Whitten
When using computers to process sensitive data, one needs to be able to trust that the senders and receivers of that data are authentic. One way to provide that trust is with cryptographic proof that a sender and receiver are who they claim to be. Mutual Transport Layer Security (mTLS) is a protocol that builds off of TLS to provide mutual cryptographic proof-of-identity between a sender and receiver. One major difficulty in cryptographic systems, in general, is secure key storage. This is a particular challenge in HPC, where nodes are typically stateless, meaning they don't have any persistent storage. The Trusted Platform Module, or TPM, is a secure, independent cryptoprocessor that provides many cryptographic functions, including secure key storage, in its own separate, non-volatile storage. This can be a good solution even for nodes with persistent storage. Keys and certificates used for mutual authentication, like in mTLS, can be stored in this tamper-resistant piece of hardware without the need for secondary storage. This project explores how the TPM can be used to securely store keys and perform cryptographic operations to establish mutual trust between nodes using mTLS. We explore how to interact with the TPM via various software stacks and evaluate useful applications of the TPM, including how to implement mTLS using the TPM to store secrets and enforce mutual authentication.
Managing Configuration Secrets from Ansible using Hashicorp Vault
Susan Foster & Raafiul Hossain
Within High Performance Computing (HPC) at Los Alamos National Laboratory, the management of secrets is of utmost importance. Currently, HPC uses Ansible Vault for storage and management of secrets, but there are deficiencies in Ansible Vault's design that are not suitable for HPC's evolving needs. The best solution to this issue is replacing Ansible Vault with Hashicorp Vault to increase the security and storage capability of configuration management and secrets. Due to the constraints of the physical hardware, it was important to be able to containerize any solution, so the deployment of Vault was integrated with Docker. We were successfully able to recreate Ansible Vault's capabilities while adding additional layers of security and flexibility provided by Hashicorp Vault.
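As a sketch of the retrieval pattern, the hvac Python client can read a secret from a Hashicorp Vault KV store as shown below. The Vault address, token handling, mount point, and secret path are placeholders; in practice, Ansible playbooks would typically use a Vault lookup plugin rather than a standalone script.

```python
# Sketch: fetch one secret from a KV v2 engine using the hvac client.
import hvac

client = hvac.Client(url="https://vault.example.lanl.gov:8200",  # placeholder address
                     token="s.exampletoken")                     # placeholder token
assert client.is_authenticated()

secret = client.secrets.kv.v2.read_secret_version(
    mount_point="hpc",                        # hypothetical KV mount
    path="clusters/example/root")             # hypothetical secret path
password = secret["data"]["data"]["password"] # KV v2 nests payload under data.data
print("retrieved secret for configuration run")
```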
Characterizing the impact of compiler and MPI version differences in Containers with Spack
David Bernado & Martha Dix
Containers are becoming an increasingly common solution to help meet the need for software flexibility on HPC systems. They allow for user-defined software stacks where users can install and manage their own software configurations on HPC resources. While working with containers on HPC machines, LANL scientists have noted that different compiler and MPI versions inside the container may affect the containerized application's results and/or performance. Previous studies have examined performance differences between container implementations and bare metal, but they have not explored how specific combinations of compiler and MPI implementation versions affect software and performance. We addressed this gap by analyzing the results of the CTS-2 benchmark scientific mini-apps, LAGHOS and HPCG, inside containers built with a matrix of different GCC and OpenMPI versions using the Spack package manager. The acceptance tests passed for every container combination, but not without observable differences. Our results for HPCG indicated that the GCC and OpenMPI version combinations had little effect, though there was a slight decrease in performance overall. Our results for LAGHOS indicated that the GCC version slightly affected performance while the OpenMPI version did not. Our results suggest specific compiler and MPI versions may lead to slightly improved performance. Future work includes: 1) experimenting with different compiler and MPI implementations, e.g., intel, intel-mpi, etc.; 2) using other figure-of-merit acceptance tests for next-generation hardware in containers, e.g., SNAP and Quicksilver; and 3) experimenting with different compiler flags to help optimize the compiler, e.g., binutils, nvptx, etc.
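The version matrix itself can be expressed compactly with Spack specs. The sketch below shows the idea with example version numbers, installing specs directly; the project built container images from such specs rather than installing on the host like this.

```python
# Sketch: enumerate a GCC x OpenMPI x application matrix as Spack specs and install each.
import itertools, subprocess

gcc_versions = ["9.4.0", "11.3.0", "12.2.0"]      # example versions (assumptions)
openmpi_versions = ["4.0.7", "4.1.4"]             # example versions (assumptions)
apps = ["hpcg", "laghos"]

for gcc, ompi, app in itertools.product(gcc_versions, openmpi_versions, apps):
    spec = f"{app} %gcc@{gcc} ^openmpi@{ompi}"
    print("building", spec)
    subprocess.run(["spack", "install"] + spec.split(), check=True)
```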
Analyzing Server-side Scalability of Image Filesystems & Attachment Technologies
Timothy Bargo, Aedan Wells, and Michelle Yoon
High Performance Computing often deals in managing hundreds to thousands of compute nodes to solve large, complex problems. As we push the boundaries of compute, we continue to optimize the performance of all components of the cluster. A common method of compute cluster deployment is to utilize a master server to provide operating system images to the compute nodes. On large compute clusters this deployment method can lead to large workloads on the master server. In our work we compared different image filesystem types and attachment technologies to determine the most performant and scalable method to deploy a compute cluster. Our research demonstrated that the SquashFS filesystem and the Ceph RBD image attachment technology produce lower CPU and network loads in comparison to other combinations of SquashFS, Ext4, ISOFS, and XFS filesystems with Ceph RBD, NFS, NFS loopback, and iSCSI attachment technologies. Our results provide guidance in selecting the most scalable combination of technologies to deploy compute clusters.
Offloading Calculations to Computational Storage Devices: Spark and HDFS
Cunningham, Goldstein, Hammock, Janz, Liu and Rimerman
As the amount of data in the world grows, researchers have sought solutions to process data more quickly. One potential solution that has been explored in a High Performance Computing (HPC) context is using Computational Storage Devices (CSDs) to process data closer to where it is stored. To evaluate the functionality of these devices in an HPC environment, our team used Apache Spark and the Hadoop Distributed File System (HDFS) to offload computations and benchmark the drives. We wrote tests that perform arithmetic and matrix operations on Trinity sensor data. We ran our Spark experiments while scaling up the number of CSDs and cores used for each benchmark, and compared the amount of time elapsed for each computation. Our results show that the CSDs can be effectively used to offload tasks from the host machine because their compute power scales well and they can effectively work on workloads of various file sizes. However, we also determined that these devices have hardware reliability and workflow issues that make them unready for a production HPC environment.
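As a rough illustration of the approach (not the team's actual benchmark code, which ran against Trinity sensor data on the CSDs), the sketch below times a simple arithmetic reduction through Spark; the data size and operation are placeholders.

    # Sketch: time a simple arithmetic workload in Spark, in the spirit of the CSD benchmarks.
    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csd-arith-sketch").getOrCreate()
    values = spark.sparkContext.parallelize(range(10_000_000), numSlices=64)

    start = time.perf_counter()
    # A representative arithmetic reduction; the real benchmarks also included matrix operations.
    total = values.map(lambda x: (x * 3.0 + 1.0) ** 0.5).sum()
    elapsed = time.perf_counter() - start

    print(f"sum = {total:.3e}, elapsed = {elapsed:.2f} s")
    spark.stop()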
Using Computational Storage Devices: OpenMP/MPI and Charliecloud
Cunningham, Goldstein, Hammock, Janz, Liu and Rimerman
As the amount of data in the world grows, researchers have sought solutions to process data more quickly. One potential solution that has been explored in a High Performance Computing (HPC) context is using Computational Storage Devices (CSDs) to process data closer to where it is stored. In our previous experiments, the six CSDs showed limited applicability and complex overhead when used with Apache Spark and Hadoop. To address this overhead, we turned to other computational tools such as Python, C++, and MPI. In addition, we benchmarked less computationally intensive tasks like building container images with Charliecloud. Our results show that Spark contributes significantly to the speed of computation on CSDs. We find that the CSDs are ineffective for offloading our operations while the host is under stress. Finally, using CSDs for offloading Charliecloud image building is a promising potential solution for HPC users that we recommend be researched further.
SquashFS & FUSE for Better HPC Containers
Megan Phinney
Charliecloud is a lightweight container implementation for high performance computing. The typical filesystem image formats for Charliecloud are SquashFS and tar archives. SquashFS is a compressed, read-only filesystem that unprivileged users can mount in user space with SquashFUSE; it is the preferred image format due to its various efficiencies. The current SquashFS workflow is non-ideal due to user complexity and difficulties with HPC job schedulers. We designed a new workflow that links SquashFUSE into Charliecloud so that mounting and unmounting require only a single user command. An additional persistent process, the FUSE loop, is needed to service the FUSE requests: once the containerized application process finishes, Charliecloud unmounts the SquashFS and ends the FUSE loop. Last summer, we created a working prototype with a modified version of the SquashFUSE code. This summer I am converting our prototype for production use in Charliecloud using SquashFUSE’s new shared library. Our new SquashFS workflow is more user friendly, cleans up after itself, and is more compatible with HPC job schedulers. We were able to reduce user commands from three to one, increase reliability, and decrease mount/unmount time by more than 50%.
Integration of the PENNANT mini-app into the Pavilion Test Harness
Timothy Goetsch
The Los Alamos National Laboratory (LANL) High Performance Computing (HPC) support teams test end-user applications on production Department of Energy (DOE) National Nuclear Security Administration (NNSA) supercomputers. LANL implements the FLAG physics application to carry out radiation-hydrodynamics simulations for research in fields such as crater impacting. PENNANT is an unstructured mesh physics mini-app, designed for advanced architecture research, that incorporates mesh data structures and implements physics algorithms adapted from FLAG intended to simulate some of its memory access patterns. Like various other mini-apps, PENNANT serves as a proxy application to FLAG for tuning and optimization efforts. Due to PENNANT's small size relative to FLAG, its lightweight and practical nature also makes it valuable for evaluating new hardware and programming models for unstructured mesh physics applications. Even so, manual testing of PENNANT’s supported build configurations is time-consuming. LANL’s Pavilion Test Harness addresses this issue by enabling the creation of portable, abstract test definitions. This project focuses on building a Pavilion test to verify that PENNANT builds and runs on bare-metal LANL production systems while analyzing its performance and portability. Harnessing PENNANT under Pavilion supports continuous development and integration for developers and captures performance profiles for support teams to use in continuous application monitoring. Future work will involve comparing bare-metal performance with containerized performance using Charliecloud, LANL’s in-house container build and runtime environment.
MarFS and libNE Utility Development
Daniel Perry
As the capabilities of high performance computing (HPC) supercomputers continue to grow, so do the sizes of the datasets being computed on these machines. Object storage systems have proven themselves capable of scaling to adequately accommodate storing large amounts of data while also enabling high-speed accesses. However, many users and applications still expect a POSIX filesystem interface, as opposed to the Representational State Transfer (REST) semantics utilized by most object stores. MarFS is open-source software developed at Los Alamos National Laboratory (LANL) which implements a near-POSIX interface over a scalable multi-component object store that offers few compromises. The laboratory’s current production system provides 60 PB of storage with access speeds approaching 25 GB/sec. LANL is currently in the middle of a complete rewrite of the MarFS code base to provide stability, functionality, and performance improvements to the filesystem. An essential component of this new implementation is libNE, a library that handles the multi-component functionality of MarFS through parallel erasure coding to provide both high-performance transfers and failure tolerance. Part of libNE is its data abstraction layer (DAL), which allows the use of “hot-swappable” underlying storage systems. In addition to other development work within libNE, we implemented several DALs within the library, such as an AWS S3 and recursive DAL, along with others for testing/benchmarking purposes to complement the default DAL which interfaces with POSIX-based filesystems. MarFS is intended to be interfaced with in one of two ways: either users access it interactively through a filesystem in userspace (FUSE) mount, or data is transferred through batch jobs handled through pftool, a parallel file transfer tool also developed by LANL. We also created interfaces which allow both of these utilities to interact with the filesystem. These interfaces are designed to ease future development and integration efforts for long-term support and maintenance.
Network Monitoring and Analytics with sFlow
Conner Whitfield
Monitoring is a vital part of any production environment, as metrics provide insight into errors, failures, high utilization, and other critical data that is essential for ensuring system stability. sFlow is one such metric tool, and it demonstrates production value across various monitoring capabilities for the Los Alamos National Laboratory (LANL) high-performance computing (HPC) production network environment. sFlow provides a wide variety of monitoring metrics when deployed on switch stacks, such as Arista and Cumulus, including CPU utilization, memory utilization, various interface I/O metrics, and more. This variety of metrics, combined with sFlow’s Prometheus support and capability for custom metrics, shows that sFlow meets production monitoring needs, integrates with existing Splunk monitoring infrastructure, and can replace the custom monitoring scripts used to gather this same information. This presentation will describe the processes and challenges of implementing a production network monitoring solution that utilizes sFlow, Telegraf, sFlow-RT, and Prometheus.
Dust Destruction in Core Collapse Supernovae
Sarah Stangl
Performance Analysis of Common Loop Optimizations
Brian Gravelle
In High Performance Computing, developers tune applications, especially computationally intensive kernels, for specific systems. In this presentation, we combine two methods for conducting performance analysis: Roofline visualization and hardware counter analysis. The Rooflines allow the user to understand the performance of the application relative to the hardware’s potential while the hardware counters enable a deep understanding of how a computational kernel makes use of the CPU. We discuss the background of these methods and demonstrate their use to gain insight into a matrix multiplication benchmark running on an A64FX CPU from Fujitsu.
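For reference, the roofline model referred to here bounds attainable performance by the lesser of the machine's peak compute rate and the product of arithmetic intensity and peak memory bandwidth (the textbook formulation, not a result specific to the A64FX study):

    P_{\mathrm{attainable}} = \min\left( P_{\mathrm{peak}},\ I \cdot B_{\mathrm{peak}} \right), \qquad I = \frac{\text{floating-point operations}}{\text{bytes moved to/from memory}}

Hardware counters supply the operation and traffic counts needed to estimate I for a given kernel.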
Machine learning for physics simulation anomaly detection
Adam Good
Multi-physics hydrodynamic direct numerical simulations (DNS) are often computationally intensive, requiring significant computational resources to complete. For simulations requiring thousands of processors, the probability of anomalies occurring during a simulation is not insignificant. Since these simulations often run for a long time without human validation, such undetected anomalies can be costly. We present results of our application of ML-based techniques for anomaly detection to hydrodynamics simulations. By treating the intermediate output of hydrodynamic simulations as images or videos, we borrow ML techniques from computer vision for the task of anomaly detection. We generated a training dataset using CLAMR, a cell-based adaptive mesh refinement application which implements the shallow water equations. We modified the application to obtain a wider range of experiments for our dataset and generated a range of experiments whose states can be learned using computer vision techniques. Additionally, those same experiments could be run with anomalies injected at runtime so our models could be trained to differentiate between nominal and anomalous simulation states. We also present ML models using PetaVision, a neuromorphic computing simulation toolkit, as well as other autoencoders, and demonstrate that they can predict the state of a simulation at a succeeding time step based on the states of a number of preceding time steps. Additionally, we use these autoencoders with a classifier to determine if a given simulation state is anomalous. Our experiments show that our models can predict simulation state accurately enough for the classifier to detect anomalies despite notable differences between the predicted and observed states.
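The classification step can be summarized with a small sketch (illustrative only; the project's models are the PetaVision and autoencoder networks described above, and the predictor and threshold here are hypothetical stand-ins): compute the error between the predicted and actual state at each time step and flag steps whose error is large.

    # Sketch: flag anomalous simulation states from prediction error.
    # `predict_next` stands in for a trained predictor; the threshold is illustrative.
    import numpy as np

    def anomaly_flags(frames, predict_next, threshold):
        """frames: array of shape (T, H, W); returns one flag per predicted frame."""
        flags = []
        for t in range(1, len(frames)):
            predicted = predict_next(frames[:t])           # predict frame t from preceding states
            error = np.mean((predicted - frames[t]) ** 2)  # per-frame prediction error (MSE)
            flags.append(error > threshold)                # large error -> likely anomalous
        return np.array(flags)

    # Toy example with a "persistence" predictor standing in for the trained model.
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(20, 32, 32))
    frames[15] *= 4.0                                      # inject an anomalous state
    flags = anomaly_flags(frames, predict_next=lambda hist: hist[-1], threshold=2.5)
    print(np.flatnonzero(flags) + 1)                       # flagged frames: the anomaly and its successor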
Exploring OpenSNAPI Use Cases and Evolving Requirements
Brody Williams
Tropical Matrix Factorization
Jose Ortiz
Tropical Geometry is a fairly new branch of mathematics where the usual addition and multiplication operations are replaced by the minimum/maximum and the usual addition, respectively. Tropical arithmetic is faster and more robust than classical arithmetic, and results in a piecewise-linear geometry which is seen to be an approximation of classical geometry. As a result, the classical notion of convexity becomes a piecewise-linear representation which is generally distinct. This project is an exploration of applications of tropical geometry, with a focus on tropical matrix factorization and associated algorithms. We establish a link between the tropical matrix factorization problem and the problem of finding generators for a tropical convex hull and propose a method which finds a factorization in O(nm²) time for a special case, improving on previous methods. We discuss possible applications to inexact computing.
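For concreteness, in the min-plus convention the tropical semiring operations and the induced tropical matrix product are (standard definitions, not specific to this project's algorithms):

    x \oplus y = \min(x, y), \qquad x \otimes y = x + y

    (A \otimes B)_{ij} = \min_{k} \left( A_{ik} + B_{kj} \right)

Tropical factorization of an n-by-m matrix M then seeks matrices B (n-by-r) and C (r-by-m) with M = B \otimes C, which is the form of the problem connected above to finding generators of a tropical convex hull.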
Enhancing the MPI Sessions Prototype for Use on Exa-Scale Systems
Tom Herschberg
One of the new features to be included in the upcoming Message Passing Interface (MPI) 4.0 specification is MPI Sessions. A major goal of Sessions is to provide a more flexible way for applications to allocate and use MPI resources, and thereby potentially expand the application space which can make use of the high performance messaging capabilities provided by MPI implementations. Therefore, it is important to have a working MPI Sessions prototype that can be used to study the performance and behavior of the new MPI Sessions functionalities. Such a prototype has already been developed and implemented in Open MPI, but it is currently only functional over a limited number of network stacks. In particular, the prototype cannot currently make use of the network stack (OFI libfabric) expected to be the interface of choice on DOE exa-scale systems such as the Argonne Aurora and Oak Ridge Frontier systems. My work this summer involved modifying the Sessions prototype to make it compatible with OFI libfabric, which offers better performance on these exa-scale systems. With these modifications, the MPI Sessions prototype can be tested and studied on existing systems with libfabric support, including NERSC Cori, Argonne Theta, and LANL Trinity.
Survey of Tools to Assess Reduced Precision on Floating Point Applications
Quinn Dibble
Investigating the Efficacy of Unstructured Text Analysis for Failure Detection in Syslog
Katy Felkner
Each node of a supercomputer produces a detailed log of its operation, called a syslog. It is impossible for system administrators to review all syslog data produced by the thousands of compute nodes associated with a single HPC machine. However, analysis of these logs to detect and predict failures is crucial to maintaining the health of supercomputers. The majority of prior work using machine learning to study syslog has relied heavily on the semi-structured nature of system logs, and there has been less work examining syslogs as unstructured, purely textual natural language data. We show that treating syslog output as unstructured natural language text without regard for numeric variables does not perform well, and that researchers must exploit the structure within syslog data to produce more useful results. In order to extract features from syslog text, we employ several popular word embeddings and then cluster both word- and message-level vectors using K-Means and DBSCAN. Finally, we prepared a dataset for supervised learning by aggregating the syslog into 15-minute time windows and extracting the distribution of clusters within each window. Our predictive models achieved a relatively low maximum AUC of 0.59 using a gradient-boosted random forest. This performance barely outperforms random guessing, but does suggest the presence of signal that could be amplified in future work. We also make available our datasets, generated using a virtual compute cluster to simulate failures. We conclude that incorporating domain knowledge into predictive models, along with numerical features and structural information from syslog data, rather than unilaterally applying natural language processing techniques, is crucial to building deployable, trustworthy tools.
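As a rough sketch of the clustering-and-windowing pipeline described above (the embeddings, cluster count, and data here are illustrative stand-ins; the study also used DBSCAN and several word-embedding methods):

    # Sketch: cluster message-level vectors and build windowed cluster-distribution features.
    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    n_msgs, dim, k = 5000, 64, 20
    embeddings = rng.normal(size=(n_msgs, dim))              # stand-in for message-level vectors
    timestamps = pd.date_range("2021-01-01", periods=n_msgs, freq="30s")

    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)

    df = pd.DataFrame({"t": timestamps, "cluster": labels})
    # Aggregate into 15-minute windows and compute the distribution of clusters per window.
    counts = (df.groupby([pd.Grouper(key="t", freq="15min"), "cluster"])
                .size().unstack(fill_value=0))
    features = counts.div(counts.sum(axis=1), axis=0)        # normalize counts to a distribution
    print(features.head())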
Deploying Machine Learning Workflows into an HPC Environment
Ragini Gupta
Modern high performance computing (HPC) applications require complex workflows for running scientific simulation and big data analytics tasks across cluster nodes. With the increasing demand to solve more complicated problems and process large amounts of data in HPC, machine learning is gaining momentum. Most machine learning algorithms are focused on utilizing high performance technologies that help boost the performance of big data and data mining frameworks. Typically, a workflow of interdependent jobs is submitted on a front-end node of an HPC cluster and follows a strict, fixed process. These jobs are distributed across thousands of compute nodes, and once a job finishes, the output from these nodes is aggregated into an output file returned to the user. In this project, we explored a new, adaptable workflow for running machine learning jobs that are coupled loosely so that the different machine learning stages of any user-specific application can be executed. An open-standard workflow specification language, the Common Workflow Language (CWL), is employed to define the pipeline framework for an end-to-end machine learning algorithm by expressing the workflow in a portable way, combining disparate command line tools for different jobs and passing files around in a top-to-bottom scheme. These workflows can be further built with HPC’s portable container environment, the Build and Execution Environment (BEE). Since machine learning algorithms are driven by tuning parameters, these parameters are embedded as arguments in the command line tool descriptions to achieve accurate and optimized results. The project provides a proof of concept for executing machine learning algorithms as a workflow with BEE for a given HPC application.
Managing Dynamic Workflows in BEE
Steven Anaya
The BEE Workflow Engine is a tool to manage and execute workflows on HPC systems. Workflows written in Common Workflow Language (CWL) are parsed by BEE, stored in a Neo4j graph database, and may then be executed using a workload manager such as Slurm. The challenge of managing these workflows, however, is that they may include steps which have a number of input dependencies (i.e., files) unknown prior to execution. In order to support as much of the CWL specification as possible, BEE must be able to handle this complexity. The way BEE represents and manages workflows is evolving to tackle this problem as well as to improve how workflows are visualized.
Embracing Open Firmware in HPC for Faster and More Secure Provisioning
Devon Bautista
Firmware is the first piece of software that is run when a computer powers on and its primary job is to initialize the computer’s hardware, determine which operating system (OS) to boot, and boot that OS. The firmware in most motherboards shipped from hardware vendors, both consumer and commercial, is often closed-source and contains many insufficiently-audited drivers. With the ubiquity of the Unified Extensible Firmware Interface (UEFI) specification in firmware implementation, firmware has become even more complex, often assuming the role of an OS. For instance, Intel’s EDKII firmware, the GRUB bootloader, and the Linux kernel all have their own network, filesystem, and USB drivers. All of these redundancies complicate and lengthen the boot process, and present a greater attack surface which can bypass the checks of even the OS. Here, it is shown that by replacing proprietary vendor firmware with a Linux kernel, a project which has reputable drivers, undergoes heavy scrutiny, and is updated frequently, one can get a more secure, flexible, and resilient boot on HPC nodes. This allows for node provisioning to occur earlier in the boot process in firmware rather than after the OS boots, as well as greater control over how provisioning is implemented. Since Linux supports many platforms, this allows more homogeneous firmware images to be managed in a scalable way by the maintainer of the nodes instead of relying on the vendor. This project applies existing efforts to make firmware open source, including LinuxBoot and u-root, to HPC and large clusters. Being able to control what the firmware does provides one of the last missing pieces of open source in an HPC stack.
Performing Survival Analysis on HPC System Memory Error Data
Stephen Penton
Statistical analysis of time-to-event variables presents a unique challenge due to their nature. As traditional regression techniques are not sufficient to fully capture time-to-event information, survival analysis is the branch of statistics used to answer questions about a population's lifetimes, including the rate at which individuals experience specific events. These techniques require specific data, and there are various models that can be utilized to perform different analyses. We present a tool that ingests data from a user and produces an initial set of results for survival analysis at varying levels of complexity. We demonstrate its capabilities by testing it on system memory error data from two LANL machines: Cielo and Trinitite. We show the tool's use within an HPC-specific domain to provide insight into the impact of specific features on when memory errors occur.
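The tool itself is not reproduced here, but the sketch below illustrates the kind of analysis it automates, using the lifelines package on synthetic "time to first memory error" data (the library choice, column names, and data are assumptions for illustration only):

    # Sketch: Kaplan-Meier and Cox proportional-hazards fits on synthetic data.
    import numpy as np
    import pandas as pd
    from lifelines import KaplanMeierFitter, CoxPHFitter

    rng = np.random.default_rng(0)
    n = 500
    df = pd.DataFrame({
        "duration_days": rng.exponential(scale=400, size=n),   # time until error (or censoring)
        "observed": rng.integers(0, 2, size=n),                 # 1 = error observed, 0 = censored
        "dimm_vendor": rng.integers(0, 3, size=n),               # example covariate
    })

    kmf = KaplanMeierFitter()
    kmf.fit(df["duration_days"], event_observed=df["observed"])
    print(kmf.median_survival_time_)                             # estimated median time to error

    cph = CoxPHFitter()                                          # effect of covariates on the hazard rate
    cph.fit(df, duration_col="duration_days", event_col="observed")
    cph.print_summary()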
Memory Trace Analysis using Machine Learning
Braeden Slade
The Arm Instruction Emulator (ArmIE) provides its users with the capability to compile SVE code with the Arm Compiler and run the SVE binary without SVE-enabled hardware. Memory traces are produced in a binary file during the emulator’s run time. These files contain information about each trace, including whether an access was a read or a write and what memory address was accessed. Using Python libraries like Pandas and PySpark, it was possible to decode and plot the data to get a general idea of what the data looks like. After this, early stages of machine learning techniques were applied to the data using libraries like Keras, TensorFlow, and scikit-learn. Using decision trees, neural networks, and clustering techniques, models are generated that can predict with a certain degree of accuracy what address was accessed, as well as whether it was a read or write. This talk will cover the methods and tools mentioned above and how they were applied to accomplish these goals, as well as discuss the next steps moving forward for the coming months.
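The decoding itself is specific to ArmIE's binary trace format, so the sketch below assumes the traces have already been decoded into a table with address and read/write columns (hypothetical names) and only illustrates the decision-tree step on synthetic data:

    # Sketch: predict read vs. write from simple address-derived features with a decision tree.
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    n = 10_000
    trace = pd.DataFrame({
        "address": rng.integers(0, 2**32, size=n, dtype=np.int64),
        "is_write": rng.integers(0, 2, size=n),
    })

    # Derive coarse features from the address (cache-line index and page offset, for example).
    X = pd.DataFrame({
        "cache_line": trace["address"] // 64,
        "page_offset": trace["address"] % 4096,
    })
    y = trace["is_write"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    model = DecisionTreeClassifier(max_depth=8).fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))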
Exploring the Feasibility of In-Line Compression on HPC Mini-Apps
Dakota Fulp
To meet the demands of solving ever larger and more complicated problems, high-performance computing (HPC) systems have grown in size and complexity and are capable of solving previously intractable problems. However, as applications become more capable, the size of their data increases. These large data sets lead to significant transfer times, causing bottlenecks as applications attempt to work with them. These bottlenecks reduce the system's efficiency by reducing the amount of productive work done. Lossy compression is capable of reducing the size of data while introducing a small, user-controllable, amount of error into the data. In this study, we show the effects of using in-line compression within two HPC mini-apps to reduce the size of the data being worked on within the application. We analyze the effects of using in-line compression on overall accuracy, storage, and throughput. Our analysis shows that, while significant improvements to the in-line lossy compression algorithm are needed, the algorithm can reduce the amount of storage required while introducing a small amount of inaccuracy and overhead, all of which are controlled by the user. We aim to further improve the algorithm by reducing the overhead introduced while enhancing its usability within HPC applications. Specifically, we aim to develop a standard MPI interface for the compressed data, enable the in-line compression to be used with OpenMP, and improve the algorithm's API to handle dynamic data types better.
No-Cost and Low-Cost Methods of Reducing Floating Point Error in Sums
Vanessa Job
Because floating point addition on finite precision machines is not associative, mathematically equivalent floating-point summations can yield different computational results. Depending on the ordering and grouping chosen, rounding errors propagated across timesteps can be substantial, leading to significant inaccuracy in final results. We examine sums in two codes, an adaptive mesh refinement hydrocode and a chemical reaction network. We reduce error by generating proper ordering and grouping of the sums and verify on typical simulation runs. Our techniques show accuracy comparable to Kahan sums without extra overhead. We present heuristics that could improve accuracy for many codes. With minimal effort, researchers can apply these techniques and see improvement in accuracy with little or no overhead and minimal disruption to the code base.
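For reference, Kahan (compensated) summation, the baseline the techniques are compared against, carries a running correction term that recovers the low-order bits lost in each addition; a minimal sketch:

    # Kahan (compensated) summation versus naive left-to-right accumulation.
    def kahan_sum(values):
        total = 0.0
        c = 0.0                      # running compensation for lost low-order bits
        for x in values:
            y = x - c                # apply the compensation to the next term
            t = total + y            # low-order bits of y may be lost here...
            c = (t - total) - y      # ...and are recovered into c
            total = t
        return total

    values = [1.0] + [1e-16] * 1_000_000
    naive = 0.0
    for x in values:
        naive += x
    print(naive)                     # 1.0 -- every small term is absorbed and lost
    print(kahan_sum(values))         # ~1.0000000001, recovering the accumulated 1e-10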
Data placement and movement in a heterogeneous memory environment
Onkar Patil
Heterogeneous memory architectures are becoming increasingly prominent in upcoming HPC systems. High Bandwidth Memory (HBM) is available on most GPGPUs to enable fetching of large amounts of data from DRAM-based memory. Non-volatile, byte-addressable memory (NVM) has been introduced by Intel in the form of NVDIMMs named Intel Optane DC PMM. This memory module has the ability to persist the data stored in it without the need for power. These memory technologies expand the memory hierarchy into a hybrid/heterogeneous memory system due to their differences in access latency and memory bandwidth from DRAM, which has been the predominant byte-addressable main memory technology. The Optane DC memory modules have up to 8x the capacity of DDR4 DRAM modules, which can expand the byte-addressable space up to 6 TB per node. Many applications can now scale up their problem size given such a memory system. However, the complexity of data allocation, placement, and movement can make the use of heterogeneous memory difficult for HPC application programmers. Existing codes were written for homogeneous memory systems, and rewriting them would be a big challenge in terms of portability. Our aim is to move the onus from the application programmer to the compiler to modify and adapt HPC applications for a heterogeneous memory system. The compiler framework analyzes the code at the IR level. It narrows down on the dynamic allocations in the code and replaces allocation function calls like malloc()/realloc() with equivalent function calls from the SICM library. The SICM (Simple Interface Complex Memory) library is an interface to allocate memory on the different memory devices available on a given compute node. It is a bare-metal library that utilizes the NUMA and jemalloc libraries to create arenas where memory can be allocated, and the arenas can be moved between the different memory devices. However, it required an initialization and a finalization procedure before allocating memory in a program, which could be a daunting task. Also, the SICM library was not performance-aware, i.e., it was not able to classify the NUMA nodes based on their memory performance. We added a small memory characterization script that ran micro-benchmarks to measure the memory performance of each NUMA node for every CPU group and also measured the memory transfer speed between NUMA nodes for every CPU group. This classification is read by the SICM library at runtime so that it is completely performance-aware in a heterogeneous memory system. We introduced additional wrapper APIs that enable easy allocation based on the performance required from the system for a given data structure. The compiler framework utilizes these wrapper APIs to transform existing codes to use the SICM library and allocate memory in a performance-aware manner on a heterogeneous system. It consists of a transformation pass that analyzes the code at the Module, Function, and Instruction level to narrow down on the initialization, finalization, and allocation/free points in the code. It then inserts the IR code equivalents of the SICM APIs to transform the code. The transformed IR, or bitcode, is then linked into an executable and executed on the targeted system.
bueno: Benchmarking, Performance, and Provenance
Jacob Dickens
This presentation explores an alternative approach to the application benchmarking process. Application benchmarking is an important part of the lifecycle of any production program, but many contemporary approaches are complicated by a diversity of ever-changing software stacks. Furthermore, the challenges of measuring application performance are exacerbated as programs change during their development. As we will discuss, application benchmarking can potentially benefit from integration with lightweight containers, a form of packaged programs typically managed by container runtimes. Our goal is to provide reproducible, automated benchmarking environments across different systems to alleviate an otherwise tedious, fallible, and challenging task. Thus, we present bueno: a software framework providing tools that can improve the user’s ability to perform application benchmarking and corresponding analysis. Its ultimate goal is to provide an extensible, straightforward toolset to aid in automated testing with environmental provenance, since conducting application analyses without repeatability is problematic. Replicating the application environment, ideally mirroring the native configuration, is essential for meaningful comparative analysis. Application provenance and environment repeatability become especially important when benchmarking across multiple platforms or distributions. We continue to take great care while developing bueno, namely studying the vital aspects of reproducible benchmarking and prioritizing support for a broad array of scientific applications.
Parallelization and vectorization of nuDust
Ezra Brooker/Sarah Stangl
Using a suite of 1D core-collapse supernovae (CCSNe) models of varying progenitor masses and explosion energies, we model dust grain formation in the late-time ejecta phase. With this suite of dust grain formation models, we hope to answer these important questions: 1) Does dust yield depend on the explosion energy and progenitor of the CCSN? 2) Can we use observations of dust as a tracer in young supernova remnants (ySNRs) to obtain information on the explosion and the progenitor star? In an effort to accelerate this research, we extended the parallelism and vectorization capabilities of the open-source code nuDust. Exploiting the inherent data parallelism of Lagrangian hydrodynamics, we scaled nuDust from single-process execution to large-scale parallel execution on LANL HPC machines. We report on our methods for parallelism and results, and show promising initial work on vectorization and offloading using Numba, a library for just-in-time compilation of NumPy routines.
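The dust-nucleation kernels themselves are not shown here; the sketch below only illustrates the Numba pattern used for this kind of per-particle parallelism (JIT compilation plus a parallel range), with a toy rate expression standing in for the real physics:

    # Sketch of the Numba pattern: JIT-compile a per-particle loop and parallelize it with prange.
    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def toy_rates(temperature, density):
        out = np.empty(temperature.size)
        for i in prange(temperature.size):                          # each Lagrangian particle is independent
            out[i] = density[i] * np.exp(-1.0e4 / temperature[i])   # toy stand-in for nucleation physics
        return out

    rng = np.random.default_rng(0)
    T = rng.uniform(1.0e3, 1.0e4, size=1_000_000)
    rho = rng.uniform(1.0e-18, 1.0e-12, size=1_000_000)
    print(toy_rates(T, rho)[:3])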
Perils of the One-Size-Fits-All Kernel: A Fast, Secure Search for File System Metadata
Prajwal Challa
As we progress beyond the peta-scale storage era, the traditional file metadata management tools and utilities available for distributed and parallel file systems are unable to scale to the billions of files commonly stored in large HPC and cloud data centers. Existing tools that provide parallel capabilities for searching and sifting through large file systems are generally only available for administrators due to the relaxed access controls those tools provide. To better enable users and administrators to locate and manage massive data sets distributed across storage systems and data centers, GUFI (Grand Unified File Index) provides a fast, secure parallel search for locating data sets of interest. GUFI leverages mostly traditional techniques such as embedded databases and indexing to provide a high-performance metadata search capability. However, as we have improved the performance of our metadata indexing service we have identified several bottlenecks within the Linux kernel that limit the performance of our parallel query tools at extreme scales. In this talk we present a detailed description of the GUFI architecture and why it can be securely accessed by users. We then describe why enabling secure user access to the index inevitably runs into performance limitations within current Linux kernels. Finally, we describe an engineering approach for eliminating the kernel-based bottlenecks without requiring additional hardware and describe the performance of GUFI both before and after the application of our performance engineering.
The first virtual Supercomputing Institute
Richard Snyder
Due to COVID-19, the Supercomputing Institute was unable to occur in person; however, the staff decided that the show would go on. This talk will cover the challenges of online teaching in the context of provisioning and configuring a Linux-based supercomputer, as well as some of the tools and methods that we used to overcome these challenges.
Technical Project Management Migration to the Cloud and Confluence Documentation
Morgan Jones
This presentation details my goals as a summer student intern for Technical Project Management (TPM) in HPC-DO. These goals elaborate on the project management tools and features used to enhance overall project organization, and explain the relevance of record keeping and documentation of project information in Confluence and G-Suite for TPM use in past, current, and future projects.
Easier JupyterLab Instances for HPC Users
Dylan Wallace
Currently, HPC users wanting to use a JupyterLab instance have two options, both of which involve tunneling the JupyterLab session over an SSH connection. The easiest method is to launch the JupyterLab instance on a front-end node, but the downside is that front-end nodes are shared amongst users, so self-restraint in consuming computational resources in the JupyterLab instance is required on the user’s part. The other is to tunnel the JupyterLab instance from an interactive Slurm allocation. This second option is cumbersome, as it requires a "double tunnel" over two SSH connections and suffers from a lack of immediacy if the interactive job can’t be granted at submission time. We would like to make it much easier for HPC users to have JupyterLab instances on our HPC resources. First, we opted to use Apache Mesos, a cluster manager, to host JupyterLab instances and run their corresponding jobs on its worker nodes. However, due to lack of support for Apache Mesos, we’ve recently opted to use a JupyterLab platform powered by Jupyter Enterprise Gateway (JEG) on a Kubernetes cluster. JEG provides optimal resource allocations by enabling Jupyter kernels to be launched in their own Kubernetes pods, allowing notebooks to use minimal resources. By default, Jupyter runs kernels locally, potentially exhausting the server's resources. By leveraging the functionality of Kubernetes, JEG distributes kernels across the compute cluster, dramatically increasing the number of simultaneously active kernels. In this talk, we make the case for a JEG-on-Kubernetes system to provide JupyterLab sessions for our HPC users.
Towards CFD Fault Detection and Resolution Scaling with Machine Learning
Adam Good
Computational Fluid Dynamics (CFD) is a powerful technique that has resulted in many benefits to society, including the creation of accurate climate models and the prevention of aneurysm ruptures with improved endovascular coil designs. CFD calculations take significant time to run and usually rely on high performance computing clusters. Unfortunately, this leaves these simulations vulnerable to computational faults which may cause errors in output. We aim to improve and harden CFD calculations by applying various machine learning techniques for image/video anomaly detection, scaling, and reconstruction. We will begin by applying machine learning to CFD visualizations at varying resolutions with computational faults injected using the PINFI tool as a proof of concept, and plan to fine-tune the most promising techniques to work on raw numeric CFD data. Our first step is to build datasets appropriate for our work via modifications to CLAMR, a testbed tool for simulating the shallow water problem using adaptive mesh refinement. At this point we have modified CLAMR to standardize simulation time in simulations at different resolutions, and are currently separating the domain of the simulation from the mesh resolution. Additionally, we have automated PINFI’s injection process, and we are refining the injection process for efficiency and reliability.
The Future of Stereo 3D Data Analysis and Visualization
John Dermer
Data visualization is an important tool for analyzing scientific information. With the availability of desktop stereo 3D at LANL becoming limited, it is important to look at alternative methods of visualization so that LANL does not lose the ability to have individual stereo 3D data analysis tools. With the advancement of virtual reality (VR) hardware and software, there is an opportunity to evaluate VR visualization as a replacement for traditional desktop stereo data visualization. VR headsets are now in use in multiple facilities for a variety of uses such as training, cooperative work environments, and data visualization. VR headsets offer an opportunity not only to maintain existing environments but to improve on their capabilities. I will discuss my work in this field.
Memory Address Decoding and Fault Analysis
Dylan Wallace
Analyzing memory error logs can help us to identify and fix previously unknown faults in memory. Using Cray’s memory error log tool, xthwerrlog, we’re able to see when an error occurred and what address it’s associated with, but we aren’t able to dive deeper into DRAM for fault identification. For instance, all the ECC is done in the integrated memory controller (iMC), so there’s no direct way to tell if the fault was in the iMC, the DIMM, or the channel between them. So, how do we determine where the error occurred? When we are looking at a physical address, the address bits contain the location of the row, bank, bank group, etc. of the bad bit(s). If those bad bits are consistently the same ones going bad, we can say with good probability that the fault is in the same physical location. For instance, we can statistically determine what sort of fault we are seeing. If we see lots of faults on a DIMM that vary only in the column value, then we can call that a column fault, and the same logic applies to the DRAM row, bank group, and bank. Using a set of Intel documentation, we were able to decode physical addresses given to us by xthwerrlog and statistically determine the location of those faults in memory. In this talk, we look at some of the statistics we have been able to gather from this work.
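The actual DDR4 field layout comes from the Intel documentation and depends on the platform's memory configuration, so the sketch below uses a made-up bit layout purely to illustrate the mechanics of extracting fields and tallying which one varies across a set of error addresses:

    # Sketch: extract DRAM fields from physical addresses and tally which field varies.
    # The bit layout below is hypothetical, not the real iMC address mapping.
    FIELDS = {                       # field name -> (shift, width), illustrative only
        "column":     (3, 10),
        "bank":       (13, 2),
        "bank_group": (15, 2),
        "row":        (17, 16),
    }

    def decode(addr):
        return {name: (addr >> shift) & ((1 << width) - 1)
                for name, (shift, width) in FIELDS.items()}

    # Errors that differ only in the column field suggest a column fault.
    error_addresses = [0x1a2b3c40, 0x1a2b3c48, 0x1a2b3d40, 0x1a2b3f40]
    decoded = [decode(a) for a in error_addresses]
    for name in FIELDS:
        distinct = len({d[name] for d in decoded})
        print(f"{name}: {distinct} distinct value(s)")   # only "column" varies in this example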
Investigating Hard Disk Drive Failure Through Disk Torture
Daniel Perry
Hard disk drives (HDDs) are typically expected to fail after several years of use, with the probability of a disk failing increasing as the disk ages. High performance computing (HPC) storage systems at Los Alamos National Laboratory (LANL) experienced instances where HDDs began failing at rates much higher than those expected at these systems’ points in their hardware’s lifecycle. In order to determine the causes for these past failures, and to mitigate similar failures on future systems, we began to investigate how external variables such as temperature, vibration, and disk workload affect the health of HDDs. Our test bed for experimentation is a Dell R7425 server connected to four 84-bay JBOD enclosures containing 4-terabyte Seagate HDDs, totaling 336 hard disks. The initial stages of this ongoing project are composed of two main objectives: establishing a monitoring framework to collect HDD failure data and creating artificial workloads to simulate high-stress environments. Our implemented monitoring system collects disk temperature, health, and performance data in addition to temperature data from the enclosures. The collected data is stored in a Prometheus time series database, where it can be accessed through Grafana visualization tools. Development of high-stress disk operation workloads led to an in-depth analysis of the internal physical geometry of the test HDDs, to facilitate the creation of tests focusing a disproportionate workload on one disk head.
Virtualizing the Network for Testing & Development
Conner Whitfield & Robby Rollins
Testing and experimentation are key to the future development of a network, ensuring that new configurations and topologies work properly. However, these forms of research require either a physical or virtual solution. A physical approach can require on-site management and investment in potentially unused or unfit equipment. Instead, a virtual solution is a better fit for this application, as virtualization presents reduced cost, easier remote interaction, and the ability to efficiently clone or modify existing setups. Currently, the Los Alamos National Laboratory (LANL) high-performance computing (HPC) division does not have a representation of its production networking infrastructure with which to experiment with different network configurations. With a network as large as Trinity's, with 20,000 nodes, it can be very difficult to make a scale model of the network to test. In order to tackle this issue, we set up a Kernel-based Virtual Machine (KVM) server where we created virtual machines (VMs) for Arista, Cumulus Linux, and CentOS. In our project we show that by using VMs we’re able to simulate a large network at a fraction of the cost of using dedicated test infrastructure. With this virtual network we were then able to gather different sFlow metrics with Prometheus and pull that information into Grafana. Through creation of a virtual network similar to currently deployed networks, it was found that virtualization presents unique challenges while still allowing for network testing and experimentation. These challenges come in the form of limitations such as kernel restrictions on how network bridges operate, and how KVM’s command line tool interacts with open VM sessions. Despite these restrictions, we were able to construct a virtual network that closely resembled one of the network configurations that LANL HPC currently uses. We were then able to test sFlow and Prometheus features inside of this virtual network. KVM's flexibility and accessibility, along with other features, help to create a highly modular environment for network emulation for testing new use cases in a non-production environment. KVM is a great virtualization tool for network administrators to test and emulate different network configurations that they want to bring into production.
Analyzing Frameworks for HPC Systems Regression Testing
Berkelly Gonzalez & Sadie Nederveld
Systems testing is an important part in the lifecycle of a high-performance computing system that occurs during DSTs as well as when an unexpected problem occurs. System administrators need a way to quickly test the system when something goes wrong, as well as a way to perform regression testing. In regression testing, they track the health of the system in a more in-depth manner with the goal of detecting problems with a system before they become problems for the users. Frameworks that are available for testing on HPC systems were generally created for application testing, acceptance testing, or node health testing, leaving uncertainty as to whether they could be used in a systems regression testing context. In our project, we set out to either find a framework that can perform HPC system regression testing, particularly on management nodes, or determine if a new framework needs to be created. During our project, we first explored existing frameworks designed specifically for HPC systems regression testing. When we found that there were no such frameworks, we tested three LANL-based non-regression HPC frameworks: Pavilion, Interstate, and Node Health Check (NHC). We implemented these frameworks to evaluate their potential to be used for system regression testing based on a list of desired characteristics. Through this, we found that none of the frameworks we tested had all of the desired characteristics of a systems regression testing framework. However, although neither Pavilion nor NHC were designed for the purpose of systems regression testing, they both have the potential to be used for this purpose. Both frameworks have Splunk integration for tracking the health of the system over time and can be easily extended with custom test scripts. This gives each of them enough of the characteristics necessary to function as a systems regression testing framework. Going forward, either Pavilion or NHC could be effectively integrated into the systems regression workflow. On the other hand, while either Pavilion or NHC could be used on its own, there are characteristics where one is stronger and the other weaker. Pavilion does not come with any test templates or pre-written tests that could be used for systems testing, but it can be run from a CM and target other nodes on the system. NHC comes with a variety of tests that can be used for systems testing, but it has to be directly run on each node to perform the check. Therefore, a second potential route forward is to use Pavilion and NHC together by having Pavilion run NHC. Pavilion will run scripts in various languages and has the best result parsing of the frameworks tested, while NHC has many useful built-in tests. By using them together, each framework can complement the other. Furthermore, using pre-existing frameworks that are open source, as opposed to creating a new framework, has additional benefits, including ongoing development and support. We conclude that it is unnecessary to create a framework that is specific to systems regression testing, because existing tools can be used in combination to achieve the desired result.
A Virtual Cluster Monitoring Toolkit for Bottleneck Analysis
Natasha Frumkin & Christian Marquardt
Monitoring computing clusters and detecting potential bottlenecks is a necessary part of designing and engineering efficient, reliable systems. HPC-Collab, a tool for constructing virtualized HPC clusters, automatically configures a virtual cluster with a specified cluster topology and software configurations. However, HPC-Collab suffers from scalability issues. Due to long installation times of about two hours for a fifteen-node cluster, using HPC-Collab for spinning up proof-of-concept experimental clusters is time-intensive. Our project collects useful metrics from the virtual cluster nodes as they are provisioned and correlates virtual cluster activity with the host. We designed a monitoring infrastructure which collects relevant system data from the virtual clusters as they are created. At the same time, we collect host machine metrics to correlate activity from virtualized nodes with underlying IO patterns, network traffic, and memory usage. Using additional helper scripts, data from multiple virtual nodes is gathered and visualized through custom dashboards. This enables system performance analysis at each stage of the process. From our experiments, we have identified that the major bottlenecks are primarily due to network bandwidth and large overhead from the underlying virtualization providers. Contrary to what we originally inferred, we have shown that neither the random IO patterns of guest-to-host layered file systems nor CPU usage are bottlenecks in the provisioning process. Additionally, we have provided quantitative measurements to verify how long each provisioning run takes as well as where in the provisioning process we see large fluctuations in network traffic and RAM usage. As a result of our work, cluster provisioning time was reduced by 50% for the standard reference cluster model. In the future, we plan to provide fine-grained graphs which we directly match up with the phases in virtual cluster provisioning that are particularly slow. Ultimately, we hope to integrate automatic monitoring into the HPC-Collab project and provide developers with a visualization tool for quick and intuitive bottleneck analysis. This aligns with HPC-Collab's project goal: promoting engineered, rather than artisanal, cluster construction.
Auto-Mounted SquashFS for Charliecloud Containers
Anna Chernikov & Megan Phinney
Charliecloud is a lightweight container workflow for high performance computing. The typical filesystem image formats for Charliecloud are SquashFS and tar archives. SquashFS is a compressed, read-only filesystem that unprivileged users can mount in user space with SquashFUSE. It is the preferred image format due to its efficiency, file deduplication features, and faster distribution time. The current SquashFS workflow is complicated because there is no library the runtime can interact with at the program level; the user must manually mount and unmount the SquashFS. To simplify the workflow, we converted the FUSE filesystem operations from the existing SquashFUSE executables to library functions. We reference the library functions in the Charliecloud container runtime source code to handle mounting, unmounting, and filesystem operations for the SquashFS image. To compare the performance of the two workflows, we measured the duration of the mount, execution, and unmount phases. Our experiment results show the run time of the new user-friendly SquashFS workflow is comparable to that of the old workflow. The new SquashFS Charliecloud workflow reduces user complexity at no cost in performance.
Integration of the ECP Proxy Apps Suite into the Pavilion Test Harness
Christine Kendrick, Yolanda Reyes & Anaira Quezada
Researching solutions for critical challenges such as clean energy and nuclear studies, the Exascale Computing Project (ECP) was developed by the U.S. Department of Energy (DOE) to increase the capabilities of High Performance Computing (HPC) systems to execute the DOE mission of supporting science, engineering and nuclear stockpile stewardship. Exascale computing is the next goalpost for the DOE HPC centers to remain globally competitive. The ECP Proxy Applications Project models the performance-critical computations of large applications and employs modern parallel programming methods targeting Exascale systems. We integrated several ECP proxy applications using Los Alamos National Laboratory’s Pavilion HPC Test Harness by developing portable test configurations to generalize build commands, runtime inputs, and capture performance results. Programmable inputs in the test definitions facilitate the adaptability of proxy applications through test permutations and results comparison. This ability to parameter sweep on input data and build configurations simplifies determining the optimal runtime configuration. Combined into a Pavilion test suite, these applications run a wide variety of multi-dimensional mathematical operations that benchmark machine performance, thereby demonstrating Pavilion’s capability to build, run, and provide analyzable results of the ECP proxy applications. By integrating these applications, we demonstrate Pavilion’s strengths in supporting HPC benchmarking for Exascale computing.
Stay GUFI with Performance Regression Testing
Skylar Hagen
The Grand Unified File Index (GUFI) is an index that contains file system metadata and is designed to allow users and administrators to securely and efficiently query this metadata, while minimizing the impact on resources within HPC storage systems. Metadata is becoming increasingly important because massive numbers of files are being stored on file systems today, and file counts are only going to continue increasing. GUFI prides itself on vastly outperforming standard metadata tools; thus, tracking performance improvements, or more importantly degradations, over time is vital. In order to properly detect these variations, a baseline must be established for a standard set of operations. Once a baseline is established, our solution will take the current build of GUFI and document select performance values for a fixed set of operations following each commit. Statistics from multiple runs of the same operations will be collected and can then be compared against previously recorded statistics to determine if there have been significant changes in performance. This talk will provide an overview of the current design for performance regression testing in the GUFI software suite.
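The harness itself is not shown here, but the comparison step might look like the following sketch, which applies a Welch t-test and a relative-slowdown threshold to timings from the baseline and the current commit (the numbers and thresholds are illustrative assumptions):

    # Sketch: compare timings from the current build against a recorded baseline.
    from statistics import mean
    from scipy.stats import ttest_ind

    baseline_s = [1.92, 1.88, 1.95, 1.90, 1.93]   # seconds for a fixed query set, prior commit
    current_s  = [2.10, 2.07, 2.12, 2.09, 2.11]   # same query set, current commit

    stat, p_value = ttest_ind(current_s, baseline_s, equal_var=False)
    slowdown = mean(current_s) / mean(baseline_s) - 1.0

    if p_value < 0.01 and slowdown > 0.05:        # statistically significant and >5% slower
        print(f"regression: {slowdown:.1%} slower (p={p_value:.3g})")
    else:
        print("no significant performance change")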
Evaluating Hardware Compression Offload in a Lustre File System
Mariana Hernandez
Advancements in solid-state storage media, network, and PCIe performance have allowed HPC storage system designers to begin practically designing an efficient bandwidth tier. Although SSDs have become affordable, this increased speed often means a reduction in capacity per device. When designing a storage system with an optimal performance level and limited capacity, compression becomes a vital component of the design. In measuring existing compression algorithms in ZFS, we find that these compression techniques either drastically slow performance while providing high levels of compression or minimally impact performance while providing little compression benefit. To attain both fast and efficient compression of scientific datasets, compression can be offloaded to specialty hardware, such as Eideticom’s NoLoad FPGA. We benchmarked two types of compression, GZIP and LZ4, on an in-memory ZFS-based Lustre file system. Our results show that offloading compression to the FPGA proved to both minimize impact on performance and provide high compression efficiency. This talk will present challenges in storage node efficiency and provide insight into the potential efficiency gains using computational storage, while also describing the difficulties around configuring such a system.
Integration of The Energy Exascale Earth System Model (E3SM) into The Pavilion Test Harness
Timothy Goetsch
The Los Alamos National Laboratory (LANL) High Performance Computing (HPC) support teams test end-user applications on production Department of Energy (DOE) National Nuclear Security Administration (NNSA) supercomputers. The Energy Exascale Earth System Model (E3SM) scientific application furthers our nation’s predictive capabilities of the Earth’s climate and environmental systems to deliver future sustainable energy solutions. Manual testing of E3SM’s supported build configurations is time-consuming. LANL’s Pavilion Test Harness addresses this issue by enabling the creation of portable, abstract test definitions. This project focuses on building a Pavilion test to verify that E3SM builds and runs on LANL production systems while analyzing its performance and portability. Harnessing E3SM under Pavilion supports continuous development and integration for developers and captures performance profiles for support teams to use in continuous application monitoring. Subsequently, LANL HPC testing teams will support running E3SM under Pavilion for continual evaluation of our systems’ ability to support the DOE’s Climate and Environmental Science Division’s (CESD) workload.
OpenSNAPI: Toward a Unified API for SmartNICs
Brody Williams
The end of Moore’s Law and Dennard Scaling has produced a renaissance in the field of computer architecture. Unable to continue leveraging silicon-level processor improvements to further enhance performance and scalability, system architects have been forced to explore other options. In this new era of heterogeneous architectures and hardware/software codesign, a new class of devices known as “accelerators” has emerged. Independently designed for optimized execution of distinct workloads, these devices have proven critical to the continued advancement of application performance. SmartNICs, accelerator devices integrated with a network controller, have conventionally been utilized to offload low-level networking functionality. However, newer SmartNIC variants, which incorporate a system-on-chip (SoC) with traditional designs, are challenging this precedent. Leveraging significantly augmented resources, these new devices offer increased versatility and the potential to more effectively complement a given architecture’s CPU. In this talk, we introduce the motivation underlying acceleration, explore the fundamentals of SmartNICs, and discuss traditional use cases. We also detail our initial efforts to investigate the feasibility and benefits of SmartNICs as general-purpose accelerators. We present the OpenSNAPI project created to define a uniform application programming interface (API) for this emerging class of devices. Finally, we provide a brief tutorial regarding development of SmartNIC-accelerated applications on Los Alamos National Laboratory’s SmartNIC-enabled platforms.
Hunting for Bottlenecks in ZFS Failure Recovery using NVMe Drives
Trevor Bautista
Thanks to currently available storage technology, modern data storage systems are capable of very high bandwidth. Determining which data protection schemes to use for high bandwidth storage systems heavily depends on disk bandwidth and capacity, which determine the amount of time it takes to fully recover data from drive failures. Within HPC, storage systems must balance throughput and data protection according to the goals of each system. With the recent affordability and performance increase of Non-Volatile Memory Express (NVMe) SSD technologies, the significance of disk bandwidth as a potential bottleneck is drastically lowered. However, it is unclear if the overlying filesystem, ZFS, exhibits inherent bottlenecks when not limited by disk bandwidth. Here we hunt for ZFS rebuild and resilver performance bottlenecks when used with NVMe devices. Using various realistic ZFS configurations, we simulate disk failure and measure both the amount of data rebuilt/resilvered and the amount of time this operation takes. These configurations vary from a default production ZFS configuration with no external I/O load to a rebuild-favored ZFS configuration with a heavy user workload. With our results, we expect to provide storage system designers with a better understanding of ZFS rebuild/resilver performance bottlenecks.
Comparative Analysis of Metric Collecting Software
David Huff
High Performance Supercomputers are designed for massive computations requiring fast processing, network, and storage performance. At Los Alamos National Laboratory, the High Performance Computing Monitoring Team develops, implements, and maintains new infrastructure and tools to support the HPC data centers. The Monitoring Team runs low-overhead metric software on the supercomputers, allowing them to collect data without hindering scientific software performance. With the emergence of new metric software, we studied the impact and benefits of running these tools in a production HPC environment: Prometheus node_exporter, Lightweight Distributed Metric Service (LDMS), Telegraf, and Fluent Bit. We evaluate the pros and cons of these tools to meet our data needs on production clusters.
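As one simple illustration of how collector overhead can be quantified, the sketch below periodically samples a single collector process’s CPU time and resident memory from /proc. The PID is a placeholder, the approach assumes the agent runs as one process, and a full evaluation would also account for network and storage costs.

"""Hedged sketch: sample a metric collector's CPU and memory overhead from /proc.

Assumes the collector of interest (e.g. node_exporter or telegraf) runs as a
single Linux process whose PID is known; the PID below is a placeholder.
"""
import os
import time

PID = 12345                              # placeholder PID of the collector
CLK_TCK = os.sysconf("SC_CLK_TCK")       # clock ticks per second
PAGE_SIZE = os.sysconf("SC_PAGESIZE")    # bytes per page
SAMPLES = 60
INTERVAL = 5.0                           # seconds between samples

def cpu_seconds(pid):
    """Total user+system CPU time consumed by the process, in seconds.
    (Simplified: assumes the process name contains no spaces.)"""
    fields = open(f"/proc/{pid}/stat").read().split()
    utime, stime = int(fields[13]), int(fields[14])
    return (utime + stime) / CLK_TCK

def rss_mib(pid):
    """Resident set size in MiB, read from /proc/<pid>/statm."""
    pages = int(open(f"/proc/{pid}/statm").read().split()[1])
    return pages * PAGE_SIZE / (1024 * 1024)

def main():
    start_cpu, start_wall = cpu_seconds(PID), time.monotonic()
    for _ in range(SAMPLES):
        time.sleep(INTERVAL)
        print(f"rss={rss_mib(PID):.1f} MiB")
    used = cpu_seconds(PID) - start_cpu
    wall = time.monotonic() - start_wall
    # Fraction of one core consumed by the collector over the window.
    print(f"average CPU utilization: {100 * used / wall:.2f}% of one core")

if __name__ == "__main__":
    main()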
Generating HPC Job Profiles and Expectations with Time-Series Data
Brett Layman
In the field of High Performance Computing, it can be challenging to quantify the expected performance of a job, especially if that job utilizes multiple nodes executing in parallel. Current techniques for job analysis allow users to track their code via function calls and to determine what resource their code is bound by. However, these techniques often demand a significant amount of time and effort on the part of the developer to fully understand the limitations of their code and to investigate issues. We demonstrate a system for automatically collecting a job’s relevant time-series data and using that data to produce complete job profiles for every job on a machine. Our system uses LDMS data such as CPU utilization, memory, and bandwidth. These job profiles encompass not only raw time-series data but also preprocessed statistical series, and can be readily visualized by simply navigating to a web page. In addition to visualizing a profile on its own, profiles from past successful runs can be used to generate a statistical expectation for a job’s behavior, which can be visualized as a cloud path. When a job deviates significantly from this path, the deviation indicates anomalous behavior. We have also developed a machine learning technique to classify jobs into workload types when that information is not reliably supplied by the user. Providing a readily available live visualization of a job’s performance allows users and administrators to take immediate action when a job is idling or in a failed state, saving valuable compute hours. The utility of job profiles extends beyond providing a window into an individual job’s performance. For example, job profiles taken together across a whole system can help us to visualize and automatically detect system-wide anomalies. Job expectations can also be applied to system benchmark jobs to monitor for system changes. Finally, detection of anomalous job behavior can be used to label log data and improve log analytics. Encapsulating this type of data into job profiles thus enables a variety of new avenues for data analysis and investigation. In summary, we have built a system that automatically generates statistical expectations for a job’s performance based on metric time-series data and makes that information readily available through web visualization for all jobs on a system.
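A minimal sketch of the expectation idea follows, assuming several past runs of the same workload have already been aligned onto a common timeline. The data is synthetic and the band width (three standard deviations) is illustrative, not the production system’s actual statistics.

"""Hedged sketch: build a statistical expectation ("cloud path") from past runs
and flag deviations in a live run. Assumes past runs of the same workload have
already been resampled onto a common timeline; all data here is synthetic.
"""
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a metric (e.g. CPU utilization) from 20 past runs,
# each resampled to 100 timesteps.
past_runs = 70 + 5 * rng.standard_normal((20, 100))

# Per-timestep expectation band: mean +/- k standard deviations.
k = 3.0
mean = past_runs.mean(axis=0)
std = past_runs.std(axis=0)
lower, upper = mean - k * std, mean + k * std

# A live run that idles (drops near zero) partway through.
live_run = 70 + 5 * rng.standard_normal(100)
live_run[60:] = 2.0

# Flag timesteps where the live run leaves the expectation band.
outside = (live_run < lower) | (live_run > upper)
anomalous_steps = np.flatnonzero(outside)
print(f"{anomalous_steps.size} of {live_run.size} timesteps outside the band")
if anomalous_steps.size:
    print("first deviation at timestep", anomalous_steps[0])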
Using Statistical Methods to Validate Hardware Performance Monitors
Brian Gravelle
High Performance Computing has benefited from rapid increases in the performance of processor architectures, but many applications must be updated to take advantage of new features. Hardware performance monitors (HPMs), or counters, are commonly used to examine how an application is using a particular system. Unfortunately, these HPMs differ significantly between processor vendors and generations. These differences inhibit developers trying to study performance on multiple systems or trying to understand performance on a new system. In this talk, we present preliminary results defining and validating counter sets that expose performance metrics common to all modern HPC CPUs. These counter sets will help improve the portability of performance analysis. We present our novel benchmark for identifying counters of interest and results on multiple types of HPC nodes.
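As a small illustration of the statistical approach, the sketch below repeatedly measures a fixed kernel under Linux perf stat and reports the mean and spread of a derived metric (instructions per cycle). The benchmark command and event names are placeholders, and the counter sets validated in this work are necessarily architecture specific; a counter set whose derived metrics vary wildly across identical runs is suspect for cross-platform comparison.

"""Hedged sketch: repeated perf-stat measurements of a fixed kernel to check
that a derived metric (IPC) is stable across runs. The benchmark command and
event names are placeholders; real validation compares equivalent counter
sets across vendors and processor generations.
"""
import csv
import io
import statistics
import subprocess

BENCH = ["dd", "if=/dev/zero", "of=/dev/null", "bs=1M", "count=2048"]
EVENTS = "cycles,instructions"
RUNS = 10

def measure_ipc():
    """Run the benchmark under perf stat (CSV output) and return instructions/cycle."""
    result = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", EVENTS, "--"] + BENCH,
        capture_output=True, text=True, check=True)
    counts = {}
    # perf stat writes its CSV rows (value, unit, event name, ...) to stderr.
    for row in csv.reader(io.StringIO(result.stderr)):
        if len(row) > 2 and row[2] in ("cycles", "instructions"):
            counts[row[2]] = float(row[0])
    return counts["instructions"] / counts["cycles"]

def main():
    samples = [measure_ipc() for _ in range(RUNS)]
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    print(f"IPC over {RUNS} runs: {mean:.3f} +/- {stdev:.3f}")
    print(f"relative spread: {100 * stdev / mean:.1f}%")

if __name__ == "__main__":
    main()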