Los Alamos National Laboratory - Crossroads
A critical element for improved predictive capability.

Benchmarks and Performance Analysis

Crossroads Benchmarks, Micro-Benchmarks, & ASC Code Suite

Ensuring that real applications perform efficiently on Crossroads is key to the platform's success. A suite of benchmarks and several ASC simulation codes have been developed for RFP response evaluation and system acceptance. These codes are representative of the workloads of the NNSA laboratories. 

Crossroads Benchmarks

  • SNAP  [Summary (pdf), Source]
    A proxy for the performance of a modern discrete ordinates neutral particle transport application.
  • HPCG  [Summary (pdf), Source]
    High Performance Conjugate Gradient benchmark.
  • PENNANT  [Summary (pdf), Source]
    A mini-application that operates on 2D unstructured finite-element meshes of arbitrary polygons.
  • MiniPIC  [Summary (pdf), Source]
    A Particle-In-Cell proxy application that solves the discrete Boltzmann equation in an electrostatic field in an arbitrary domain with reflective walls.
  • UMT  [Summary (pdf), Source]
    A proxy application that performs three-dimensional, non-linear, radiation transport calculations using deterministic (Sn) methods.
  • VPIC  [Summary (pdf), Source* - A, B, C, D, E, F, VPIC_Results_Summary (pdf)]
    A 3D relativistic, electromagnetic Particle-In-Cell plasma simulation code. 
    *NOTE:  The VPIC source is split into 6 pieces that must be reassembled into a single xz-compressed tar file.
    To reassemble:  cat vpic_crossroads.tar.xz.* > vpic_crossroads.tar.xz
  • Branson  [Summary (pdf), Source (using ParMetis), Source with Metis (using Metis)]
    A proxy application for the Implicit Monte Carlo method, which models the exchange of radiation with material at high temperatures. 

Micro-Benchmarks

The following microbenchmarks will be used in support of specific requirements in the RFP.

  • DGEMM 
    The DGEMM benchmark measures the sustained floating-point rate of a single node.
  • IOR
    IOR is used for testing performance of parallel file systems using various interfaces and access patterns.
  • Mdtest
    A metadata benchmark that performs open/stat/close operations on files and directories.
  • STREAM
    The STREAM benchmark measures sustainable memory bandwidth using four simple vector kernels (a rough sketch of the triad kernel follows this list).
  • MPI Benchmarks
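
To illustrate the kind of measurement STREAM performs, the following is a rough, self-contained sketch of its triad kernel (a = b + scalar*c). It is not the official benchmark; the array length, repetition count, and file name are arbitrary choices for this illustration.

    // triad_sketch.cpp - rough illustration of the STREAM triad kernel; not the official benchmark.
    // Build with optimization and OpenMP enabled (e.g. -O3 -fopenmp) for a meaningful number.
    #include <algorithm>
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::ptrdiff_t n = 1 << 25;            // ~33 million doubles per array (arbitrary size)
        const double scalar = 3.0;
        std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);

        double best = 1e30;                          // keep the best (shortest) time of a few repetitions
        for (int rep = 0; rep < 5; ++rep) {
            const auto t0 = std::chrono::steady_clock::now();
            #pragma omp parallel for
            for (std::ptrdiff_t i = 0; i < n; ++i)
                a[i] = b[i] + scalar * c[i];         // triad kernel: two reads and one write per element
            const auto t1 = std::chrono::steady_clock::now();
            best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
        }

        const double bytes = 3.0 * sizeof(double) * static_cast<double>(n);  // b and c read, a written
        std::printf("Triad bandwidth: %.1f GB/s (checksum %g)\n", bytes / best / 1e9, a[n / 2]);
        return 0;
    }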

ASC Simulation Code Suite

In addition to the Crossroads benchmarks, an ASC Simulation Code Suite representing the three NNSA laboratories will be used to judge performance at the time of acceptance (Mercury from Lawrence Livermore, PARTISN from Los Alamos, and SPARC from Sandia). NNSA mission requirements forecast the need for a 6X or greater improvement over the ASC Trinity system (Haswell partition) for the code suite, measured using SSI. Final acceptance performance targets will be negotiated after a final system configuration is defined. Source code will be provided to the Offeror, but obtaining it will require compliance with export control laws and no-cost licensing agreements.

Note: Each code will require special handling. Refer to section 3.5.4 of the Crossroads 2021 Technical Specs (pdf)

  • Mercury: Lawrence Livermore National Laboratory
    • For details on how to obtain the code and the relevant paperwork, vendors should contact Dave Richards.
  • PARTISN: Los Alamos National Laboratory
    • For details on how to obtain the code and the relevant paperwork, vendors should contact Jim Lujan.

Scalable System Improvement (SSI) metric

Scalable System Improvement (SSI): An Application Performance Benchmarking Metric for HPC

Scalable System Improvement (SSI) provides a means to measure relative application performance between two high-performance computing (HPC) platforms. SSI was designed to be a single metric that captures performance improvement across a wide variety of application and platform characteristics, for example capability, throughput, strong scaling, weak scaling, and system size. It also provides parameters that allow architecture teams and benchmark analysts to define the workload characteristics and to weight benchmarks independently, which is valuable in procurements that represent more than one organization and/or varied workloads.

Given two platforms, with one used as the reference, SSI is defined as the following weighted geometric mean:

    SSI = [ Π_{i=1..M} (c_i x U_i x S_i)^(w_i) ]^(1 / Σ_{i=1..M} w_i)

Where:

  • M - total number of applications,
  • c - capability scaling factor,
  • U - utilization factor = (n_ref / n) x (N / N_ref),
    • n is the total number of nodes used for the application,
    • N is the total number of nodes in the respective platform,
    • ref refers to the reference system,
  • S - application speedup = (t_ref / t) or (FOM / FOM_ref),
  • w - weighting factor.

The capability factor allows the design team to define weak-scaled problems. For example, if for a given application the problem size (or some other measure of complexity) is four times that of the problem run on the reference system, then c_i would be 4 for that application.

The utilization factor is the ratio of the platform utilizations used in obtaining the reported time or figure of merit (FOM). The utilization factor rewards using fewer nodes (n) to achieve a given speedup, and it also rewards providing more nodes in aggregate (N).

Speedup is calculated using an application-specific figure of merit. Nominally, speedup is defined as the ratio of the execution times. Some applications define a different FOM, such as a dimensionless number, time per iteration for a key code segment, grind time, or floating-point operations per second. Speedup rewards a faster time, or a higher FOM.

A necessary condition of the SSI calculation is that speedup (S) must be >= 1.0. The reason for this condition is that a user expects turn-around time on the new system to be at least as good as on the previous-generation machine. In addition, without this constraint one could run a given benchmark on an unreasonably small number of nodes on the target system in order to minimize node-hours (and, for example, avoid scaling effects) and hence inflate SSI.

The weighting factor allows an architecture team or benchmark analyst to weight some applications heavier than others. If all applications have equal weight, the weighted geometric mean is equivalent to the geometric mean.

Analysis of the SSI calculation shows that SSI is maximized by minimizing (n x t) or (n / FOM).
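
For concreteness, the definition above can also be expressed directly in code. The following is a minimal sketch (illustrative only, not ACES-provided tooling) of the weighted geometric mean over the per-application terms c_i x U_i x S_i; the function and variable names are ours.

    // ssi_sketch.cpp - minimal sketch of the SSI weighted geometric mean.
    // Inputs are per-application capability factors (c), utilization factors (U),
    // speedups (S), and weights (w), as defined in the text above.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    double ssi(const std::vector<double>& c, const std::vector<double>& U,
               const std::vector<double>& S, const std::vector<double>& w) {
        double wlogsum = 0.0, wsum = 0.0;
        for (std::size_t i = 0; i < w.size(); ++i) {
            wlogsum += w[i] * std::log(c[i] * U[i] * S[i]);  // w_i * ln(c_i U_i S_i)
            wsum    += w[i];
        }
        return std::exp(wlogsum / wsum);  // weighted geometric mean
    }

    int main() {
        // Two hypothetical applications with equal weights: the result reduces to the
        // plain geometric mean of the cUS terms, sqrt(2 * 8) = 4.
        std::printf("SSI = %.2f\n", ssi({1.0, 1.0}, {1.0, 1.0}, {2.0, 8.0}, {1.0, 1.0}));
        return 0;
    }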

SSI is best illustrated with an example. This example uses data obtained from a workshop publication comparing NERSC’s Hopper (Cray XE6) and Edison (Cray XC30) platforms.[1] Application names, node counts, and times are summarized in the following table.

              Hopper (6,384 nodes)     Edison (5,576 nodes)
              # Nodes   Time (sec)     # Nodes   Time (sec)
  FLASH          512       331.62         512       142.89
  GTC           1200       344.10         400       266.21
  MILC           512      1227.22        1024       261.10
  UMT            512       270.10        1024        59.90
  MiniFE         512        45.20        2048         5.10

The weighted geometric mean can be easily calculated in a spreadsheet using the equivalent logarithmic form:

    SSI = exp( Σ_{i=1..M} w_i ln(x_i) / Σ_{i=1..M} w_i )

Where: x = cUS.

While the original study was a strong scaling analysis, for illustrative purposes we’re going to assume that the UMT and MiniFE benchmarks were run at four times the problem size on Edison and hence c=4. The weights are assigned arbitrarily, again for illustrative purposes.

              w     c      U      S     cUS
  FLASH       1     1     0.87   2.32   2.03
  GTC         4     1     2.62   1.29   3.39
  MILC        4     1     0.44   4.70   2.05
  UMT         2     4     0.44   4.51   7.88
  MiniFE      2     4     0.22   8.86   7.74

  SSI = 3.61
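
The table above can be reproduced from the raw node counts and times. The following self-contained sketch (again illustrative only, not the ACES spreadsheet) recomputes U, S, and cUS for each application and arrives at the same SSI of roughly 3.61, with Hopper as the reference.

    // ssi_example.cpp - recomputes the Hopper/Edison example above.
    // Hopper is the reference system (6,384 nodes); Edison is the target (5,576 nodes).
    #include <cmath>
    #include <cstdio>

    struct Row { const char* name; double w, c, n_ref, t_ref, n, t; };

    int main() {
        const double N_ref = 6384.0;  // nodes in Hopper
        const double N     = 5576.0;  // nodes in Edison
        const Row rows[] = {
            // name      w  c  n_ref   t_ref     n      t
            {"FLASH",    1, 1,  512,   331.62,  512,  142.89},
            {"GTC",      4, 1, 1200,   344.10,  400,  266.21},
            {"MILC",     4, 1,  512,  1227.22, 1024,  261.10},
            {"UMT",      2, 4,  512,   270.10, 1024,   59.90},
            {"MiniFE",   2, 4,  512,    45.20, 2048,    5.10},
        };

        double wlogsum = 0.0, wsum = 0.0;
        for (const Row& r : rows) {
            const double U   = (r.n_ref / r.n) * (N / N_ref);  // utilization factor
            const double S   = r.t_ref / r.t;                  // speedup from execution times
            const double cUS = r.c * U * S;
            std::printf("%-7s U=%.2f S=%.2f cUS=%.2f\n", r.name, U, S, cUS);
            wlogsum += r.w * std::log(cUS);
            wsum    += r.w;
        }
        std::printf("SSI = %.2f\n", std::exp(wlogsum / wsum));  // prints SSI = 3.61
        return 0;
    }
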
Appendix: Which Mean to Use

There are a few excellent references on which Pythagorean mean to use when benchmarking systems.[2,3] Fleming states that the arithmetic mean should NOT be used to average normalized numbers and that the geometric mean should be used instead. Smith summarizes that “If performance is to be normalized with respect to a specific machine, an aggregate performance measure such as total time or harmonic mean rate should be calculated before any normalizing is done. That is, benchmarks should not be individually normalized first.” However, the SSI metric normalizes each benchmark first and then calculates the geometric mean, for the following reasons.

  • The geometric mean is best when comparing different figures of merit. One might think that the use of speedup is a single FOM, but for SSI each application’s FOM is independent. Hence we cannot add results together to calculate a total time, total work, or total rate, as Smith recommends and as would be required for the arithmetic and harmonic means to be meaningful.
  • The geometric mean normalizes the ranges being averaged so that no single application result dominates the resultant mean. Its central tendency reinforces this, since the geometric mean is always less than or equal to the arithmetic mean. 
  • The geometric mean is the only mean with the property that the geometric mean of (Xi/Yi) equals the geometric mean of (Xi) divided by the geometric mean of (Yi); hence the resultant ranking is independent of which platform is used for normalization when calculating speedup (a short numeric check follows this list).
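
A quick numeric check of that last property, using arbitrary made-up values, is shown below: the geometric mean of the per-benchmark ratios equals the ratio of the geometric means, so choosing the other platform as the reference simply inverts the result without changing any ranking.

    // geomean_property.cpp - numeric check that geomean(X_i / Y_i) == geomean(X_i) / geomean(Y_i).
    // X and Y are arbitrary illustrative values (e.g. run times on two platforms).
    #include <cmath>
    #include <cstdio>

    double geomean(const double* v, int n) {
        double logsum = 0.0;
        for (int i = 0; i < n; ++i) logsum += std::log(v[i]);
        return std::exp(logsum / n);
    }

    int main() {
        const double X[] = {120.0, 45.0, 300.0};
        const double Y[] = { 60.0, 30.0, 100.0};
        double ratio[3];
        for (int i = 0; i < 3; ++i) ratio[i] = X[i] / Y[i];
        std::printf("geomean(X/Y)          = %.4f\n", geomean(ratio, 3));
        std::printf("geomean(X)/geomean(Y) = %.4f\n", geomean(X, 3) / geomean(Y, 3));
        // Both lines print 2.0801: normalizing each benchmark first gives the same answer.
        return 0;
    }
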
References
  1. Cordery, M.J.; B. Austin, H. J. Wasserman, C. S. Daley, N. J. Wright, S. D. Hammond, D. Doerfler, "Analysis of Cray XC30 Performance using Trinity-NERSC-8 benchmarks and comparison with Cray XE6 and IBM BG/Q", PMBS2013: Sixth International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems, November 11, 2013.
  2. Fleming, Philip J.; John J. Wallace, "How not to lie with statistics: the correct way to summarize benchmark results". Communications of the ACM 29 (3): 218–221, 1986.
  3. Smith, James E., "Characterizing computer performance with a single number". Communications of the ACM 31 (10): 1202–1206, 1988.

 

SSI Reference Values

The spreadsheet called out in the Technical Requirements document for the calculation of SSI can be found here. The table below provides the reference values obtained on Trinity Haswell. The values in the table are tentative and are subject to change until the final Request for Proposals (RFP) is issued.

 

SSI Reference Values

  Application    # of Nodes (n)    Time or FOM                 # MPI Ranks    # OMP Threads
  SNAP                4096         183.36 sec                      65536            2
  PENNANT             4096         1.459503E11 zones/sec          131072            1
  HPCG                4352         40232.40 Gflops/sec            139264            2
  VPIC                4096         5.89E12 particles/sec          262144            1
  MiniPIC             2048         1.8906E9 updates/sec            32768            2
  UMT                 3906         1.37071E12 unknowns/sec        125000            1
  Branson             3456         393.55 sec                     110592            1

General Run Rules

Crossroads Application Performance: Instructions and Run Rules

Introduction

Application performance is a key driver for the DOE’s NNSA computing platforms. As such, application benchmarking and performance analysis will play a critical role in evaluation of the Offeror’s proposal. The ACES application benchmark suite has been carefully chosen to represent characteristics of the expected Crossroads workload, which consists of solving complex scientific problems using diverse computational techniques at large-scale and high levels of parallelism. The applications will be used as an integral part of the system acceptance test and as a continual measurement of performance throughout the operational lifetime of the system.

An aggregate performance measure, Scalable System Improvement (SSI), will be used in evaluating the application performance potential of the Offeror’s proposed system.

Run Rules

Specific run rules for each benchmark are included with the respective benchmark distribution, which also supplies benchmark source code, benchmark specific requirements and instructions for compiling, executing, verifying numerical correctness and reporting results.

The application benchmarks represent the NNSA workload and span a variety of algorithmic and scientific spaces. The list of application benchmarks is contained in the RFP Technical Requirements document.

Distribution

Each benchmark is a separate distribution and contains a README.ACES file describing how to build and run the code as well as any supporting library requirements. Note that each respective README.ACES contains its own instructions and run rules and thus must be considered as a supplement to this document. If there is a discrepancy between the two documents, information in the README.ACES takes precedence. If anything is unclear, please notify the ACES team.

Problem Definitions

For each application, multiple problem sizes will be defined:

  • The small problem is intended to be used for node level performance analysis and optimization
  • A medium problem may be defined if the ACES team feels it would be of benefit to the Offeror for investigating inter-node communications at a relatively small scale
  • The reference (or large) problem will be of sufficient size to represent a workload on current platforms

The reference (large) problem will be used in the calculation of the SSI baseline parameters. The baseline SSI will be derived from the current generation NNSA/ASC Trinity platform. Reference times (or figures of merit) and platform specifics will be detailed in the ACES provided spreadsheet.

ACES will define the problem sizes to be used by the Offeror in determining their proposed system benchmark results. This includes the weights (w) and capability factors (c) used in the SSI calculation. These problem definitions and factors will be used by the Offeror in the calculation of SSI for the reference (large) problem, depending on scalability and application drivers for the respective program workloads.

For any given problem, the Offeror is allowed to decompose the problem (using strong scaling) as necessary for best performance on their proposed system, subject to (1) any constraints inherent in the codes and (2) any rules pertaining to the calculation of SSI.

Base Results

The base set of results must utilize the same programming method provided in the ACES distribution of the respective benchmark application. The Offeror is allowed to use any version of a given programming method (e.g. the MPI and OpenMP standards) that is available and supported for the proposed system, provides the best results, and meets any other requirements specified in the RFP Technical Requirements document.

The base case is necessary to provide a point of reference relative to known systems and to ensure that any proposed system can adequately execute legacy codes. The base case will be used to understand baseline performance for the applications and to understand the potential for application performance improvement when compared against the optimized case. The following conditions must be met for base case results:

  • The full capabilities of the code are maintained, and the underlying purpose of the benchmark is not compromised
  • Any libraries and tools used for optimization, e.g. optimized BLAS libraries, compilers, special compiler switches, source preprocessors, execution profile feedback optimizers, etc., are allowed as long as they will be made available and supported as part of the delivered system
  • Any libraries used must not specialize or limit the applicability of the benchmark nor violate the measurement goals of a particular benchmark
  • All input parameters such as grid size, number of particles, etc., must not be changed
  • All results must pass validation and correctness tests.

Optimized Results

The optimized set of results allows the Offeror to highlight the features and benefits of the proposed system by submitting benchmarking results with optimizations beyond those of the base case. Aggressive code changes that enhance performance are permitted. The Offeror is allowed to optimize the code in a variety of ways including (but not limited to):

  • An alternative programming model
  • An alternative execution model
  • Alternative data layouts.

The rationale and relative effect on performance of any optimization shall be fully described in the response.

Submission Guidelines

The Offeror must provide results for their proposed platform for all applications defined by the RFP Technical Requirements document.

All benchmark results for the proposed system shall be recorded in a spreadsheet, which will be provided by ACES. If results are simulated or emulated, or if performance projections are used, this must be clearly indicated, and all methods, tools, etc. used in arriving at the result must be specified. In addition, each surrogate system used for projections must be fully described.

The Offeror shall submit electronic copies of the benchmark spreadsheet, benchmark source codes, compile/build scripts, output files, and documentation of any code optimizations or configuration changes as described in the run rules, preferably on a USB thumb drive, CD, or similar medium. Do not include files that cannot be easily read by a human (e.g. object files, executables, core dump files, or large binary data files). An audit trail showing any changes made to the benchmark codes must be supplied, and it must be sufficient for ACES to determine that the changes conform to the spirit of the benchmark and do not violate any specific restrictions.

Change Log

Updates on 8/20/2018

  • Updated Branson source code to fix a bug that caused a hang when the number of processors was not a multiple of 8

Updates on 8/27/2018

  • Updated SSI_Xroads.xlsx with correct FOM for MiniPIC
  • Updated Summary_MiniPIC & README.ACES in MiniPIC-xroads-v1.0.0.tgz to reflect correct FOM for Large problem and Target problem
  • Updated Summary_UMT & README.ACES in umt-xroads-v1.0.0.tgz to reflect correct FOM for 1 Node test case

Updates on 9/10/2018

  • Updated IOR - Removed randomOffset from the Load2 POSIX SharedFile input deck

Updates on 9/11/2018

  • Updated Branson input file proxy_large.xml, line 48 to set y_end=1.2 (not 1.0) for uniform spacing.

Updates on 9/24/2018

  • Updated PENNANT Summary and README.ACES to clarify run rules on modifying the number of nodes used for the reference problem.

Updates on 11/20/2018

  • Updated MiniPIC Summary & README.ACES to specify shasum of the Trilinos commit used along with its branch and URL for download.
  • Updated SSI_Xroads.xlsx - Fixed formula in "Applications" tab, cell K18 to sum from K11:K17; Fixed "Example" tab (added content to cells H21:H27 so cells I11:I17 have the right data to perform the formula, fixed cells G11:G17)
  • Added VPIC_Results_Summary.pdf to provide additional VPIC scaling results
  • Updated UMT tarball - Added a single-node 32-rank input deck (ATS3/grid32MPI_3x3x4.cmg),  result on Trinity Haswell node is 8.57334e+08.

Updates on 1/28/2019

  • PARTISN FOM problem scaling instructions.  File available by request.
  • Updated SSI_Xroads.xlsx - modified FOM for Branson (new value) and VPIC (new units)
  • Updated Branson
    • New source (github version tagged 0.81) to change from dynamic to static RMA. 
    • A new FOM is associated with this version.
    • See new Summary_Branson.pdf for full list of changes. 
  • Updated VPIC
    • Modified VPIC FOM from seconds to particles/second.  This change was to allow flexibility in mesh construction for different node architectures. 
    • New Summary_VPIC.pdf reflecting change and instructions for mesh construction (section 4, 6a and 7)

Updates on 2/11/2019

  • Updated Branson
    • summary-branson.pdf reflects change to memory per node calculations to use 'number of nodes' instead of 'n ranks per node' for particle and mesh memory
    • Additional Branson source provided on Crossroads website (branson-0.82.tar.gz) that uses Metis instead of ParMetis

Benchmarks FAQ

  • Can we modify the benchmarks to include extra OpenMP or OpenACC pragmas?
    For the baseline run, the Offeror may modify the benchmarks to include minimal OpenMP pragmas or OpenACC directives as required to provide a baseline run on the proposed system, but the benchmark must remain a standard-compliant program that maintains existing output subject to the validation criteria described in the benchmark run rules.  Minimal implies no change to code structure.
    For the optimized run, the Offeror is encouraged to fully develop their use of OpenMP or OpenACC.

  • Is there duplication in the Crossroads benchmarks (e.g. UMT and SNAP are both transport codes and MiniPIC and VPIC are both PIC codes)?
    Each Crossroads benchmark was carefully chosen to represent a workload, programming model or method of importance to ASC applications. 

    Although UMT and SNAP are both transport codes, UMT treats photon transport and SNAP treats neutral particle transport. This distinction can be misleading to someone not working in transport, since both codes describe using discrete ordinates and solving the Boltzmann transport equation (discrete ordinates simply means discretizing in both the xyz domain and the angular variables that specify the direction of radiation, so both codes use similar language, e.g. flux, energy groups, angles), but they are indeed different. While they both solve the Boltzmann equation, SNAP is a proxy for neutral particle, linear transport, and UMT is a proxy for thermal radiative transport. The specific discretizations and mesh descriptions of these two code types create unique numerical challenges, including mesh sweep scheduling (graph efficiency), data needs, data layout, and computational intensity. 

    VPIC and MiniPIC are both particle-in-cell applications, but they use different assumptions, programming approaches, and infrastructure. Some key differences: VPIC is a heavily used production application implemented with three levels of parallelism (asynchronous MPI at the top level, thread parallelism via OpenMP or Pthreads at the mid level, and vectorization via explicit vendor-specific vector intrinsics at the lowest level), while MiniPIC uses the Kokkos framework and the Trilinos library, which are essential components of many ASC workloads.

  • Is it permitted to change the use of Kokkos parallel-for regions from Team-Based Parallelism to Flat-Parallelism or vice versa for the Baseline Benchmark Runs?
    Yes. This is analogous to changing the OpenMP directives that are supplied in the APEX benchmark suite. For the Baseline benchmark runs, the Offeror is permitted to change the parallelism strategy provided the rest of the kernel/computation code remains unchanged.

  • Is it permitted to change the behavior of Trilinos packages when performing benchmarking of MiniPIC for the Baseline Benchmark Runs?
    Yes. However, the changes should be made within each Trilinos package and should not restructure code across package boundaries. Packages are denoted by the directories immediately under “packages” in the Trilinos source tree. The purpose of Trilinos is to provide a comprehensive suite of packages with clear abstractions between them; modifying within a package, and therefore working within the abstraction design, is permitted. Numerical reproducibility must be maintained with any modifications.