Projects
Many of Los Alamos National Laboratory's High Performance Computing (HPC) codes are heavily memory bandwidth bound. These codes often exhibit high levels of sparse memory access, which differs significantly from many other HPC codes but shares some similarity with other major market workloads such as graph analysis and sparse table joins in database applications.
Historically, the floating point operations per second (FLOPs) of a given architecture served as a reasonable proxy for total application performance in high performance computing (HPC). As FLOP instructions per cycle (IPC) has increased by orders of magnitude over memory IPC, and as our HPC codes have grown increasingly complex with highly sparse data structures, this is no longer the case. In most respects, dense FLOPs are a solved problem, and memory bandwidth for dense and, especially, sparse workloads is the primary performance challenge in HPC architecture. To address this challenge, the Laboratory is working on a number of technologies to better understand sparsity in our applications, potential hardware and software solutions, and how our applications may need to change to leverage these solutions. This deep codesign of complex applications and memory technologies is a primary focus area as we shape our next generation architectures.
* Shipman et al., "The Future of HPC in Nuclear Security," to appear in IEEE – Special Issue on the Future of HPC, 2023
Analysis shows that Laboratory applications are largely bottlenecked on the memory subsystem
To better understand this bottleneck we have developed a set of tools and techniques to characterize memory access. These techniques rely upon program instrumentation to capture memory access patterns. Examples of memory access patterns in our xRAGE and FLAG codes can be found at github.com/lanl/spatter.
These patterns can be used to drive system benchmarks using the Spatter tool.
Our tool for capturing these patterns, known as gs_patterns, is currently in the process of being released.
Figure: Spatter benchmark with FLAG memory access patterns, single-node weak scaling (solid = gather, dashed = scatter)
Spatter can be used to assess current hardware technologies as well as future (simulated) hardware without relying upon full-scale applications.
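As a rough illustration (not code from Spatter or gs_patterns), the C sketch below shows the two basic kernels that such index patterns drive; the array names and the simple strided pattern are arbitrary placeholders, whereas gs_patterns extracts the real, far more irregular patterns from application traces.

```c
/* Illustrative gather/scatter kernels of the kind driven by
 * Spatter-style index patterns.  Not code from Spatter itself;
 * array names and the strided pattern are arbitrary. */
#include <stdio.h>
#include <stdlib.h>

/* gather: dst[i] = src[pattern[i]] -- a sparse read */
static void gather(double *dst, const double *src,
                   const size_t *pattern, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[pattern[i]];
}

/* scatter: dst[pattern[i]] = src[i] -- a sparse write */
static void scatter(double *dst, const double *src,
                    const size_t *pattern, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[pattern[i]] = src[i];
}

int main(void)
{
    enum { N = 1 << 20, STRIDE = 8 };
    double *src = calloc((size_t)N * STRIDE, sizeof *src);
    double *dst = calloc(N, sizeof *dst);
    size_t *pattern = malloc(N * sizeof *pattern);
    if (!src || !dst || !pattern) return 1;

    /* A simple strided pattern stands in for a captured one. */
    for (size_t i = 0; i < N; i++)
        pattern[i] = i * STRIDE;

    gather(dst, src, pattern, N);
    scatter(src, dst, pattern, N);

    printf("dst[0] = %f\n", dst[0]);
    free(src); free(dst); free(pattern);
    return 0;
}
```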
To obtain more detailed information on our applications' data structures and access patterns, we have developed a complementary set of proxy applications. Previous physics "proxy" efforts revolved around creating simplified models of the entire application, but those proxies exhibit very different memory access patterns from the original applications. We have developed EAP Patterns (EAP/AMR, Fortran) and UME (LAP/unstructured ALE, C++), which extract a subset of the data structures and algorithms used in these applications, exposing the challenges of multi-level indirection, complex iteration patterns, and non-trivial data access mechanisms.
- UME focuses on a regional zone-gradient algorithm that is central to advection operations. This algorithm is applied to many fields in each of many material regions. Several mesh connectivity options are provided to allow exploration of the impact of connectivity on memory access patterns.
- EAP Patterns focuses on the cell gradients subroutine, which is a face-based loop over a semi-structured AMR mesh (quad/oct-tree with a minimum leaf chunk size of 2d).
- Download EAP Patterns
- UME will be released shortly
These proxies exercise both one and two levels of indirection, as illustrated below.
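The minimal C sketch below shows the two access styles; the map and field names are made up for illustration and are not taken from either proxy.

```c
/* Minimal sketch of one- and two-level indirect access of the kind
 * the proxies exercise.  The map and field names are illustrative
 * only and are not taken from UME or EAP Patterns. */
#include <stddef.h>

/* One level of indirection: a face-based loop reads cell-centered
 * data through a face-to-cell map (EAP Patterns style). */
void face_loop(double *grad, const double *cell_val,
               const size_t *face_to_cell, size_t nfaces)
{
    for (size_t f = 0; f < nfaces; f++) {
        size_t left  = face_to_cell[2 * f];
        size_t right = face_to_cell[2 * f + 1];
        grad[f] = cell_val[right] - cell_val[left];
    }
}

/* Two levels of indirection: a regional loop first maps a local
 * zone index to a global zone, then maps the zone to its points
 * (UME style). */
void region_loop(double *zone_avg, const double *point_val,
                 const size_t *region_to_zone,
                 const size_t *zone_to_point,
                 size_t npts_per_zone, size_t nzones_in_region)
{
    for (size_t z = 0; z < nzones_in_region; z++) {
        size_t zone = region_to_zone[z];                /* first indirection  */
        double sum = 0.0;
        for (size_t p = 0; p < npts_per_zone; p++)
            sum += point_val[zone_to_point[zone * npts_per_zone + p]]; /* second */
        zone_avg[z] = sum / (double)npts_per_zone;
    }
}
```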
- Goal: Prototype for evaluating sparse memory acceleration across hardware, approaches, and techniques
- Integrate with other micro-benchmarks and proxies for evaluation
- Client/Controller design to service sparse memory requests (0, 1, and 2 levels of indirection)
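As a rough illustration of what such a client/controller interface could look like, the sketch below defines a request descriptor covering the three indirection depths; every name and field here is hypothetical and is not taken from the prototype.

```c
/* Hypothetical request descriptor for a sparse-memory client/controller
 * interface.  None of these names or fields come from the actual
 * prototype; they only illustrate the 0-, 1-, and 2-level cases. */
#include <stddef.h>
#include <stdint.h>

enum indirection_depth {
    DEPTH_0 = 0,   /* dense/strided: base[i]             */
    DEPTH_1 = 1,   /* one level:     base[idx1[i]]       */
    DEPTH_2 = 2    /* two levels:    base[idx2[idx1[i]]] */
};

struct sparse_request {
    enum indirection_depth depth;
    uint64_t base;        /* address of the data array     */
    uint64_t idx1, idx2;  /* addresses of the index arrays */
    size_t   count;       /* number of elements to move    */
    size_t   elem_size;   /* element size in bytes         */
    int      is_scatter;  /* 0 = gather, 1 = scatter       */
};

/* A client would enqueue requests; the controller would walk the
 * index arrays and issue the resulting memory accesses. */
int controller_submit(struct sparse_request *req); /* hypothetical */
```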
Existing metadata search solutions that provide low-latency, searchable indexes for data centers have two traditional problems: first, these systems do not enforce POSIX metadata permissions, and second, the metadata search time is proportional to the total number of metadata entries indexed within the system. To provide a fast, secure metadata index that enables both users and administrators to efficiently search across all file systems within a data center, we have developed the Grand Unified File Index (GUFI).
GUFI provides a single unified index for searching across multiple file systems. The index is constructed by scanning each of the file systems within the data center and creating a set of clustered databases that are sharded in a way that enforces standard POSIX file system permissions. Because GUFI strictly enforces POSIX metadata permissions it can be directly accessed by users and administrators. The index GUFI builds is used to accelerate interactive command line queries and as part of a web-based metadata search across an entire data center. Because the underlying index is stored in a set of embedded databases that each understand SQL, GUFI is able to support advanced query types that would be far too time-consuming with traditional metadata query tools. All of these capabilities are provided with transparent parallelism that is designed to achieve high levels of performance when accessing solid-state disks that store the index.
The essence of the design is to replicate the directory structure of the source tree without the files. In each index directory, GUFI places an SQLite database that contains information about the directory and about the files within that directory. All of the index trees from the different source file/storage systems are combined into a single search tree. The permissions of each database are set to match those of the corresponding directory, so POSIX file metadata is protected exactly as in the source trees.
Queries are simply SQL queries distributed across many threads, which walk the index tree in parallel using breadth-first search. Users with millions of files can find information in seconds, and administrators can query the entire set of billions of files in at most a few minutes.
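To make the per-directory database design concrete, here is a minimal C sketch that queries a single directory of a GUFI-style index with the SQLite library; the database file name ("db.db"), table name ("entries"), and column names are assumptions made for illustration, and real GUFI queries are issued through GUFI's own parallel tools rather than hand-written code like this.

```c
/* Illustrative query against one directory of a GUFI-style index.
 * The database file name ("db.db"), table name ("entries"), and
 * column names are assumptions for illustration only; GUFI's own
 * tools walk the whole index tree with many threads rather than
 * touching one directory at a time like this. */
#include <stdio.h>
#include <sqlite3.h>

static int print_row(void *unused, int argc, char **argv, char **col)
{
    (void)unused;
    for (int i = 0; i < argc; i++)
        printf("%s=%s  ", col[i], argv[i] ? argv[i] : "NULL");
    printf("\n");
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <index-directory>/db.db\n", argv[0]);
        return 1;
    }

    sqlite3 *db = NULL;
    if (sqlite3_open(argv[1], &db) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }

    /* Find files larger than 1 GiB recorded in this directory's database. */
    const char *sql =
        "SELECT name, size FROM entries WHERE size > 1073741824;";

    char *err = NULL;
    if (sqlite3_exec(db, sql, print_row, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "query failed: %s\n", err);
        sqlite3_free(err);
    }

    sqlite3_close(db);
    return 0;
}
```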
GUFI is open source and is available for download on GitHub.
Read the Supercomputing paper.
ZFS is an open-source file system and volume manager that offers high data integrity and a rich feature set, all through software-defined storage pools. LANL uses ZFS in multiple storage tiers, from Lustre scratch storage systems to the colder campaign storage tier. With LANL's migration from rotating media to low-latency, high-bandwidth NVMe flash devices, it is imperative that ZFS be able to leverage these new devices and capture all the available performance they provide. When LANL first started evaluating ZFS with NVMe SSDs, a significant portion of the available device bandwidth was being lost: roughly forty-five percent for writes and seventy percent for reads in various ZFS storage pool configurations. This disparity between device capabilities and ZFS performance was mainly due to design elements of the ZFS file system that were originally built to mask slow rotating hard disk drives. LANL has been working on two projects to improve ZFS performance on NVMe devices: the first adds direct-io support to ZFS, and the second allows ZFS to harness the capabilities of NVMe computational storage devices.
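From the application's side, direct I/O is requested with the O_DIRECT open flag and suitably aligned buffers; the generic POSIX sketch below shows that calling pattern (it is not ZFS code, and the path and sizes are arbitrary).

```c
/* Generic POSIX example of requesting direct I/O with O_DIRECT and a
 * page-aligned buffer -- the calling pattern the ZFS direct-io work
 * lets applications take advantage of.  Path and sizes are arbitrary. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t block = 1 << 20;           /* 1 MiB write */
    void *buf = NULL;

    /* O_DIRECT generally requires sector/page-aligned buffers. */
    if (posix_memalign(&buf, 4096, block) != 0)
        return 1;
    memset(buf, 0, block);

    int fd = open("/path/to/zfs/file", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        free(buf);
        return 1;
    }

    /* The write bypasses host-side caching (with the direct-io
     * patches, the ZFS ARC), going straight to the NVMe devices. */
    if (write(fd, buf, block) != (ssize_t)block)
        perror("write");

    close(fd);
    free(buf);
    return 0;
}
```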
While the direct-io project has allowed LANL to better leverage NVMe devices with ZFS, there are still performance shortcomings when using data protection schemes and data transformations. LANL's ZFS file systems use checksums, erasure coding, and compression to add data integrity and shrink the overall I/O footprint of datasets. However, these operations are not only computationally expensive, but their combination results in multiple passes over the data in memory, making them memory bandwidth intensive. As an example, using two devices of erasure coding with gzip compression and checksums yields only six percent of all available ZFS performance through the ARC. To remedy this, LANL has added functionality allowing ZFS to offload checksums, erasure coding, and compression to computational storage devices. This functionality was developed in two parts. The first is a layer in ZFS called the ZFS Interface for Accelerators (Z.I.A.), which works with ZFS data structures and allows users to specify which operations they would like offloaded to computational storage devices. The second is an open-source kernel module that Z.I.A. leverages, called the DPU-SM. The DPU-SM provides generic hooks for any file system to communicate with computational storage devices in a standard way while allowing a variety of offload providers to register their implementations in a consumable fashion. Using this new patch set, a 16x speedup can be achieved with ZFS storage pools that use checksums, erasure coding, and gzip compression.
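As a rough, purely hypothetical illustration of the offload-provider idea, the sketch below shows what a provider registration interface might look like; none of these names correspond to the actual Z.I.A. or DPU-SM APIs.

```c
/* Hypothetical offload-provider registration interface, illustrating
 * the idea behind Z.I.A. and the DPU-SM: a file system hands buffers
 * to whichever provider has registered implementations of checksum,
 * erasure coding, and compression.  None of these names correspond
 * to the real Z.I.A. or DPU-SM APIs. */
#include <stddef.h>
#include <stdint.h>

struct offload_ops {
    int (*checksum)(const void *buf, size_t len, uint64_t *out);
    int (*erasure_encode)(void **data, size_t ndata,
                          void **parity, size_t nparity, size_t len);
    int (*compress)(const void *in, size_t in_len,
                    void *out, size_t *out_len);
};

/* A provider (for example, a computational storage device driver)
 * registers its operations under a name. */
int offload_register_provider(const char *name,
                              const struct offload_ops *ops);

/* The file system looks a provider up and pushes the work down
 * instead of spending host memory bandwidth on it. */
const struct offload_ops *offload_find_provider(const char *name);
```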
Both the direct-io and Z.I.A. patches are currently open pull requests against OpenZFS master. The addition of both will allow ZFS to fully leverage the capabilities of NVMe devices.
The ZFS Interface for Accelerators paper is available at osti.gov.
Offloading some processing activities to, or near, data storage devices or sets of devices.
Projects associated with processing data near storage where the processing does not need to know anything about the data or its format:
- Accelerated Box of Flash (ABOF 1)
The ABOF 1 project was done to demonstrate offloading data/format-agnostic functions that are expensive to provide in a standard host-based solution. The primary expensive attribute addressed was memory bandwidth. Functions like erasure, compression, and encoding can be quite memory bandwidth intensive, requiring many times the bandwidth of the storage devices themselves. The ABOF 1 project was a partnership with Eideticom, Nvidia, Aeon, SK hynix, and LANL. The concept was to build a demonstration of offloading erasure, compression, and encoding from the main host processor for a popular kernel-based file system, ZFS.
See the Press Release and Presentation
- Key Value Computational Storage Device (KV-CSD)
The KV-CSD project demonstrated an ordered key-value store at the device level (not just a hash). This work was a partnership between SK hynix and LANL.
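The practical difference between an ordered key-value device and a simple hash is the ability to scan a key range in sorted order; the minimal sketch below illustrates that capability with hypothetical names that are not the KV-CSD's actual interface.

```c
/* Minimal illustration of why an ordered key-value store differs from
 * a hash table: keys can be scanned in sorted order over a range.
 * These names are illustrative only, not the KV-CSD's interface. */
#include <stddef.h>

struct kv_handle;   /* opaque handle to the device-side store */

/* Point operations, available from either a hash or an ordered store. */
int kv_put(struct kv_handle *h, const void *key, size_t klen,
           const void *val, size_t vlen);
int kv_get(struct kv_handle *h, const void *key, size_t klen,
           void *val, size_t *vlen);

/* Range scan: visit every key in [start, end) in sorted order.  This
 * is what ordering at the device level makes possible and what a
 * hash-only device cannot provide efficiently. */
int kv_scan(struct kv_handle *h,
            const void *start_key, size_t start_len,
            const void *end_key, size_t end_len,
            int (*visit)(const void *key, size_t klen,
                         const void *val, size_t vlen, void *arg),
            void *arg);
```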
See the Press Release