Projects
Many of Los Alamos National Laboratory's High Performance Computing (HPC) codes are heavily memory bandwidth bound. These codes often exhibit high levels of sparse memory access, which differs significantly from many other HPC codes but shares some similarity with other major market workloads such as graph analysis and sparse table joins in database applications.
Historically, the floating point operations per second (FLOPs) of a given architecture served as a reasonable proxy for total application performance in high performance computing (HPC). As FLOP instructions per cycle (IPC) has increased by orders of magnitude over memory IPC, and as our HPC codes have grown increasingly complex with highly sparse data structures, this is no longer the case. In most respects, dense FLOPs are a solved problem, and memory bandwidth for dense and, especially, sparse workloads is the primary performance challenge in HPC architecture. To address this challenge, the Laboratory is working on a number of technologies to better understand sparsity in our applications, potential hardware and software solutions, and how our applications may need to change to leverage these solutions. This deep codesign of complex applications and memory technologies is a primary focus area as we shape our next generation architectures.
* Shipman et al., "The Future of HPC in Nuclear Security," to appear in IEEE – Special Issue on the Future of HPC, 2023
Analysis shows that Laboratory applications are largely bottlenecked on the memory subsystem
To better understand this bottleneck we have developed a set of tools and techniques to characterize memory access. These techniques rely upon program instrumentation to capture memory access patterns. Examples of memory access patterns in our xRAGE and FLAG codes can be found at github.com/lanl/spatter.
These patterns can be used to drive system benchmarks using the Spatter tool.
Our tool for capturing these patterns, known as gs_patterns, is currently in the process of being released.
Figure: Spatter benchmark with FLAG memory access patterns, single-node weak scaling (solid = gather, dashed = scatter)
Spatter can be used to assess current hardware technologies as well as future (simulated) hardware without relying upon full-scale applications.
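As a rough illustration (not code from Spatter or gs_patterns), the C sketch below shows the two basic kernels that such index patterns drive; the array names and the simple strided pattern are arbitrary placeholders, whereas gs_patterns extracts the real, far more irregular patterns from application traces.

```c
/* Illustrative gather/scatter kernels of the kind driven by
 * Spatter-style index patterns.  Not code from Spatter itself;
 * array names and the strided pattern are arbitrary. */
#include <stdio.h>
#include <stdlib.h>

/* gather: dst[i] = src[pattern[i]] -- a sparse read */
static void gather(double *dst, const double *src,
                   const size_t *pattern, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[pattern[i]];
}

/* scatter: dst[pattern[i]] = src[i] -- a sparse write */
static void scatter(double *dst, const double *src,
                    const size_t *pattern, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[pattern[i]] = src[i];
}

int main(void)
{
    enum { N = 1 << 20, STRIDE = 8 };
    double *src = calloc((size_t)N * STRIDE, sizeof *src);
    double *dst = calloc(N, sizeof *dst);
    size_t *pattern = malloc(N * sizeof *pattern);
    if (!src || !dst || !pattern) return 1;

    /* A simple strided pattern stands in for a captured one. */
    for (size_t i = 0; i < N; i++)
        pattern[i] = i * STRIDE;

    gather(dst, src, pattern, N);
    scatter(src, dst, pattern, N);

    printf("dst[0] = %f\n", dst[0]);
    free(src); free(dst); free(pattern);
    return 0;
}
```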
To obtain more detailed information on our applications' data structures and access patterns, we have developed a complementary set of proxy applications. Previous physics "proxy" efforts revolved around creating simplified models of the entire application, but those proxies exhibit very different memory access patterns from the original applications. We have developed EAP Patterns (EAP/AMR, Fortran) and UME (LAP/unstructured ALE, C++), which extract a subset of the data structures and algorithms used in these applications, exposing the challenges of multi-level indirection, complex iteration patterns, and non-trivial data access mechanisms.
- UME focuses on a regional zone-gradient algorithm that is central to advection operations. This algorithm is applied to many fields in each of many material regions. Several mesh connectivity options are provided to allow exploration of the impact of connectivity on memory access patterns.
- EAP Patterns focuses on the cell gradients subroutine, which is a face-based loop over a semi-structured AMR mesh (quad/oct-tree with a minimum leaf chunk size of 2d).
- Download EAP Patterns
- UME will be released shortly
These proxies exercise both one and two levels of indirection, as illustrated below.
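The minimal C sketch below shows the two access styles; the map and field names are made up for illustration and are not taken from either proxy.

```c
/* Minimal sketch of one- and two-level indirect access of the kind
 * the proxies exercise.  The map and field names are illustrative
 * only and are not taken from UME or EAP Patterns. */
#include <stddef.h>

/* One level of indirection: a face-based loop reads cell-centered
 * data through a face-to-cell map (EAP Patterns style). */
void face_loop(double *grad, const double *cell_val,
               const size_t *face_to_cell, size_t nfaces)
{
    for (size_t f = 0; f < nfaces; f++) {
        size_t left  = face_to_cell[2 * f];
        size_t right = face_to_cell[2 * f + 1];
        grad[f] = cell_val[right] - cell_val[left];
    }
}

/* Two levels of indirection: a regional loop first maps a local
 * zone index to a global zone, then maps the zone to its points
 * (UME style). */
void region_loop(double *zone_avg, const double *point_val,
                 const size_t *region_to_zone,
                 const size_t *zone_to_point,
                 size_t npts_per_zone, size_t nzones_in_region)
{
    for (size_t z = 0; z < nzones_in_region; z++) {
        size_t zone = region_to_zone[z];                /* first indirection  */
        double sum = 0.0;
        for (size_t p = 0; p < npts_per_zone; p++)
            sum += point_val[zone_to_point[zone * npts_per_zone + p]]; /* second */
        zone_avg[z] = sum / (double)npts_per_zone;
    }
}
```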
- Goal: Prototype for evaluating sparse memory acceleration across hardware, approaches, and techniques
- Integrate with other micro-benchmarks and proxies for evaluation
- Client/Controller design to service sparse memory requests (0, 1, and 2 levels of indirection)
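As a rough illustration of what such a client/controller interface could look like, the sketch below defines a request descriptor covering the three indirection depths; every name and field here is hypothetical and is not taken from the prototype.

```c
/* Hypothetical request descriptor for a sparse-memory client/controller
 * interface.  None of these names or fields come from the actual
 * prototype; they only illustrate the 0-, 1-, and 2-level cases. */
#include <stddef.h>
#include <stdint.h>

enum indirection_depth {
    DEPTH_0 = 0,   /* dense/strided: base[i]             */
    DEPTH_1 = 1,   /* one level:     base[idx1[i]]       */
    DEPTH_2 = 2    /* two levels:    base[idx2[idx1[i]]] */
};

struct sparse_request {
    enum indirection_depth depth;
    uint64_t base;        /* address of the data array     */
    uint64_t idx1, idx2;  /* addresses of the index arrays */
    size_t   count;       /* number of elements to move    */
    size_t   elem_size;   /* element size in bytes         */
    int      is_scatter;  /* 0 = gather, 1 = scatter       */
};

/* A client would enqueue requests; the controller would walk the
 * index arrays and issue the resulting memory accesses. */
int controller_submit(struct sparse_request *req); /* hypothetical */
```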
Existing metadata search solutions that provide low-latency, searchable indexes for data centers have two traditional problems: first, these systems do not enforce POSIX metadata permissions, and second, the metadata search time is proportional to the total number of metadata entries indexed within the system. To provide a fast, secure metadata index that enables both users and administrators to efficiently search across all file systems within a data center, we have developed the Grand Unified File Index (GUFI).
GUFI provides a single unified index for searching across multiple file systems. The index is constructed by scanning each of the file systems within the data center and creating a set of clustered databases that are sharded in a way that enforces standard POSIX file system permissions. Because GUFI strictly enforces POSIX metadata permissions it can be directly accessed by users and administrators. The index GUFI builds is used to accelerate interactive command line queries and as part of a web-based metadata search across an entire data center. Because the underlying index is stored in a set of embedded databases that each understand SQL, GUFI is able to support advanced query types that would be far too time-consuming with traditional metadata query tools. All of these capabilities are provided with transparent parallelism that is designed to achieve high levels of performance when accessing solid-state disks that store the index.
The essence of the design is to replicate the directory structure of the source tree without the files. In each index directory, GUFI places an SQLite database that contains information about the directory and about the files within that directory. All of the index trees from the different source file/storage systems are combined into a single search tree. The permissions of each database are set to match those of the corresponding directory, so POSIX file metadata is protected exactly as in the source trees.
Queries are simply SQL queries distributed across many threads, which walk the index tree in parallel using breadth-first search. Users with millions of files can find information in seconds, and administrators can query the entire set of billions of files in at most a few minutes.
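To make the per-directory database design concrete, here is a minimal C sketch that queries a single directory of a GUFI-style index with the SQLite library; the database file name ("db.db"), table name ("entries"), and column names are assumptions made for illustration, and real GUFI queries are issued through GUFI's own parallel tools rather than hand-written code like this.

```c
/* Illustrative query against one directory of a GUFI-style index.
 * The database file name ("db.db"), table name ("entries"), and
 * column names are assumptions for illustration only; GUFI's own
 * tools walk the whole index tree with many threads rather than
 * touching one directory at a time like this. */
#include <stdio.h>
#include <sqlite3.h>

static int print_row(void *unused, int argc, char **argv, char **col)
{
    (void)unused;
    for (int i = 0; i < argc; i++)
        printf("%s=%s  ", col[i], argv[i] ? argv[i] : "NULL");
    printf("\n");
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <index-directory>/db.db\n", argv[0]);
        return 1;
    }

    sqlite3 *db = NULL;
    if (sqlite3_open(argv[1], &db) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }

    /* Find files larger than 1 GiB recorded in this directory's database. */
    const char *sql =
        "SELECT name, size FROM entries WHERE size > 1073741824;";

    char *err = NULL;
    if (sqlite3_exec(db, sql, print_row, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "query failed: %s\n", err);
        sqlite3_free(err);
    }

    sqlite3_close(db);
    return 0;
}
```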
GUFI is open source and is available for download on GitHub.
Read the Supercomputing paper.
ZFS is an open-source file system and volume manager that offers high data integrity and a rich feature set, all through software-defined storage pools. LANL uses ZFS in multiple storage tiers, from Lustre scratch storage systems to the colder campaign storage tier. With LANL's migration from rotating media to low-latency, high-bandwidth NVMe flash devices, it is imperative that ZFS be able to leverage these new devices and capture all the available performance they provide. When LANL first started evaluating ZFS with NVMe SSDs, a significant portion of the available device bandwidth was being lost: roughly forty-five percent for writes and seventy percent for reads in various ZFS storage pool configurations. This disparity between device capabilities and ZFS performance was mainly due to design elements of the ZFS file system that were originally built to mask slow rotating hard disk drives. LANL has been working on two projects to improve ZFS performance on NVMe devices: the first adds direct-io support to ZFS, and the second allows ZFS to harness the capabilities of NVMe computational storage devices.
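From the application's side, direct I/O is requested with the O_DIRECT open flag and suitably aligned buffers; the generic POSIX sketch below shows that calling pattern (it is not ZFS code, and the path and sizes are arbitrary).

```c
/* Generic POSIX example of requesting direct I/O with O_DIRECT and a
 * page-aligned buffer -- the calling pattern the ZFS direct-io work
 * lets applications take advantage of.  Path and sizes are arbitrary. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t block = 1 << 20;           /* 1 MiB write */
    void *buf = NULL;

    /* O_DIRECT generally requires sector/page-aligned buffers. */
    if (posix_memalign(&buf, 4096, block) != 0)
        return 1;
    memset(buf, 0, block);

    int fd = open("/path/to/zfs/file", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        free(buf);
        return 1;
    }

    /* The write bypasses host-side caching (with the direct-io
     * patches, the ZFS ARC), going straight to the NVMe devices. */
    if (write(fd, buf, block) != (ssize_t)block)
        perror("write");

    close(fd);
    free(buf);
    return 0;
}
```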
While the direct-io project has allowed LANL to better leverage NVMe devices with ZFS, there are still performance shortcomings when using data protection schemes and data transformations. LANL's ZFS file systems use checksums, erasure coding, and compression to add data integrity and shrink the overall I/O footprint of datasets. However, these operations are not only computationally expensive, but their combination results in multiple passes over the data in memory, making them memory bandwidth intensive. As an example, using two devices of erasure coding with gzip compression and checksums yields only six percent of all available ZFS performance through the ARC. To remedy this, LANL has added functionality allowing ZFS to offload checksums, erasure coding, and compression to computational storage devices. This functionality was developed in two parts. The first is a layer in ZFS called the ZFS Interface for Accelerators (Z.I.A.), which works with ZFS data structures and allows users to specify which operations they would like offloaded to computational storage devices. The second is an open-source kernel module that Z.I.A. leverages, called the DPU-SM. The DPU-SM provides generic hooks for any file system to communicate with computational storage devices in a standard way while allowing a variety of offload providers to register their implementations in a consumable fashion. Using this new patch set, a 16x speedup can be achieved with ZFS storage pools that use checksums, erasure coding, and gzip compression.
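As a rough, purely hypothetical illustration of the offload-provider idea, the sketch below shows what a provider registration interface might look like; none of these names correspond to the actual Z.I.A. or DPU-SM APIs.

```c
/* Hypothetical offload-provider registration interface, illustrating
 * the idea behind Z.I.A. and the DPU-SM: a file system hands buffers
 * to whichever provider has registered implementations of checksum,
 * erasure coding, and compression.  None of these names correspond
 * to the real Z.I.A. or DPU-SM APIs. */
#include <stddef.h>
#include <stdint.h>

struct offload_ops {
    int (*checksum)(const void *buf, size_t len, uint64_t *out);
    int (*erasure_encode)(void **data, size_t ndata,
                          void **parity, size_t nparity, size_t len);
    int (*compress)(const void *in, size_t in_len,
                    void *out, size_t *out_len);
};

/* A provider (for example, a computational storage device driver)
 * registers its operations under a name. */
int offload_register_provider(const char *name,
                              const struct offload_ops *ops);

/* The file system looks a provider up and pushes the work down
 * instead of spending host memory bandwidth on it. */
const struct offload_ops *offload_find_provider(const char *name);
```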
Both the direct-io and Z.I.A. patches are currently open pull requests against OpenZFS master. The addition of both will allow ZFS to fully leverage the capabilities of NVMe devices.
The ZFS Interface for Accelerators paper is available at osti.gov.
Offloading some processing activities to, or near, data storage devices or sets of devices.
Projects associated with processing data near storage where the processing does not need to know anything about the data or its format:
- Accelerated Box of Flash (ABOF 1)
The ABOF 1 project was done to demonstrate offloading data/format-agnostic functions that are expensive to provide in a standard host-based solution. The primary expensive attribute addressed was memory bandwidth. Functions like erasure, compression, and encoding can be quite memory bandwidth intensive, requiring many times the bandwidth of the storage devices themselves. The ABOF 1 project was a partnership with Eideticom, Nvidia, Aeon, SK hynix, and LANL. The concept was to build a demonstration of offloading erasure, compression, and encoding from the main host processor for a popular kernel-based file system, ZFS.
See the Press Release and Presentation
- Key Value Computational Storage Device (KV-CSD)
The KV-CSD project demonstrated an ordered key-value store at the device level (not just a hash). This work was a partnership between SK hynix and LANL.
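The practical difference between an ordered key-value device and a simple hash is the ability to scan a key range in sorted order; the minimal sketch below illustrates that capability with hypothetical names that are not the KV-CSD's actual interface.

```c
/* Minimal illustration of why an ordered key-value store differs from
 * a hash table: keys can be scanned in sorted order over a range.
 * These names are illustrative only, not the KV-CSD's interface. */
#include <stddef.h>

struct kv_handle;   /* opaque handle to the device-side store */

/* Point operations, available from either a hash or an ordered store. */
int kv_put(struct kv_handle *h, const void *key, size_t klen,
           const void *val, size_t vlen);
int kv_get(struct kv_handle *h, const void *key, size_t klen,
           void *val, size_t *vlen);

/* Range scan: visit every key in [start, end) in sorted order.  This
 * is what ordering at the device level makes possible and what a
 * hash-only device cannot provide efficiently. */
int kv_scan(struct kv_handle *h,
            const void *start_key, size_t start_len,
            const void *end_key, size_t end_len,
            int (*visit)(const void *key, size_t klen,
                         const void *val, size_t vlen, void *arg),
            void *arg);
```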
See the Press Release