Benchmarks and Performance Analysis
A critical element for improved predictive capability
Crossroads Benchmarks, MicroBenchmarks, & ASC Code Suite
Assuring that real applications perform efficiently on Crossroads is key to its success. A suite of benchmarks and several ASC simulation codes have been developed for RFP response evaluation and system acceptance. These codes are representative of the workloads of the NNSA laboratories.
Crossroads Benchmarks
 SNAP [Summary (pdf), Source]
 A proxy for the performance of a modern discrete-ordinates neutral-particle transport application.
 HPCG [Summary (pdf), Source]
 The High Performance Conjugate Gradient benchmark.
 PENNANT [Summary (pdf), Source]
 A mini-application for a 2D, unstructured, finite-element mesh with arbitrary polygons.
 MiniPIC [Summary (pdf), Source]
 A Particle-in-Cell proxy application that solves the discrete Boltzmann equation in an electrostatic field in an arbitrary domain with reflective walls.
 UMT [Summary (pdf), Source]
 A proxy application that performs three-dimensional, non-linear, radiation transport calculations using deterministic (Sn) methods.
 VPIC [Summary (pdf), Source*: A, B, C, D, E, F, VPIC_Results_Summary (pdf)]
 A 3D relativistic, electromagnetic Particle-in-Cell plasma simulation code.
 *NOTE: The VPIC source is split into six files that must be reassembled into a single xz archive. To reassemble: cat vpic_crossroads.tar.xz.* > vpic_crossroads.tar.xz
 Branson [Summary (pdf), Source (using ParMetis), Source with Metis (using Metis)]
 A proxy application for the Implicit Monte Carlo method, modeling the exchange of radiation with material at high temperatures.
ASC Simulation Code Suite
In addition to the Crossroads benchmarks, an ASC Simulation Code Suite representing the three NNSA laboratories will be used to judge performance at time of acceptance (Mercury from Lawrence Livermore, PARTISN from Los Alamos, and SPARC from Sandia). NNSA mission requirements forecast the need for a 6X or greater improvement over the ASC Trinity system (Haswell partition) for the code suite, measured using SSI. Final acceptance performance targets will be negotiated after a final system configuration is defined. Source code will be provided to the Offeror, but obtaining it will require compliance with export control laws and no-cost licensing agreements.
Note: Each code will require special handling. Refer to section 3.5.4 of the Crossroads 2021 Technical Specs (pdf).
 Mercury: Lawrence Livermore National Laboratory
 For details on how to obtain the code and the relevant paperwork, vendors should contact Dave Richards.
 PARTISN: Los Alamos National Laboratory
 For details on how to obtain the code and the relevant paperwork, vendors should contact Jim Lujan.
 SPARC: Sandia National Laboratories
 Download the Militarily Critical Technical Data Agreement DD2345 (pdf) and follow the instructions.
 Download the Participant Data Sheet (doc) and follow the instructions.
 Both forms are required before a license can be granted to use Sandia National Laboratories' SPARC codes or accompanying input problems.
Note: If your institution already has a DLA Logistics Information Service approved DD2345 License, please send a copy to Sandia National Laboratories with your completed Participant Data Sheet.
Questions? Contact Simon Hammond or Jim Laros
Scalable System Improvement (SSI) metric
Scalable System Improvement (SSI): An Application Performance Benchmarking Metric for HPC
Scalable System Improvement (SSI) provides a means to measure relative application performance between two high-performance computing (HPC) platforms. SSI was defined to provide a single metric that measures performance improvement across a wide variety of application and platform characteristics: capability, throughput, strong scaling, weak scaling, system size, etc. It also provides parameters that allow architecture teams and benchmark analysts to define the workload characteristics and to weight benchmarks independently, which is valuable in procurements that represent more than one organization and/or varied workloads.
Given two platforms, using one as a reference, SSI is defined as the weighted geometric mean

SSI = [ Π_{i=1..M} (c_i U_i S_i)^w_i ]^(1 / Σ_{i=1..M} w_i)

Where:
 M = total number of applications,
 c = capability scaling factor,
 U = utilization factor = (n_ref / n) x (N / N_ref),
 n = total number of nodes used for the application,
 N = total number of nodes in the respective platform,
 ref = refers to the reference system,
 S = application speedup = (t_ref / t) or (FOM / FOM_ref),
 w = weighting factor.
The capability factor allows the design team to define weak-scaled problems. For example, if for a given application the problem size (or some other metric of complexity) is four times larger than the problem run on the reference system, c_i would be 4 for that application.
The utilization factor is the ratio of the platform utilizations used in obtaining the reported time or figure of merit (FOM). The utilization factor rewards using fewer nodes (n) to achieve a given speedup, and it also rewards providing more nodes in aggregate (N).
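As a quick numeric illustration of the utilization factor (a sketch; the node counts below are illustrative examples, not reference values):

```python
def utilization(n_ref, n, n_total_ref, n_total):
    """U = (n_ref / n) * (N / N_ref): rewards using fewer nodes (n) for a
    given speedup and rewards providing more nodes in aggregate (N)."""
    return (n_ref / n) * (n_total / n_total_ref)

# Same application node count, target platform slightly smaller in aggregate:
print(round(utilization(512, 512, 6384, 5576), 2))   # -> 0.87
# Using 4x as many nodes on that platform cuts U by a factor of four:
print(round(utilization(512, 2048, 6384, 5576), 2))  # -> 0.22
```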
Speedup is calculated using an application-specific figure of merit. Nominally, speedup is defined as the ratio of the execution times. Some applications define a different FOM, such as a dimensionless number, time per iteration for a key code segment, grind time, floating-point operations per second, etc. Speedup rewards a faster time or a higher FOM.
A necessary condition of the SSI calculation is that speedup (S) must be >= 1.0, because a user expects turnaround time to be at least as good as on a previous-generation machine. Without this condition, one could also run a given benchmark on an unreasonably small number of nodes on the target system in order to minimize node-hours (and, for example, avoid scaling effects) and hence inflate SSI.
The weighting factor allows an architecture team or benchmark analyst to weight some applications heavier than others. If all applications have equal weight, the weighted geometric mean is equivalent to the geometric mean.
Analyzing the SSI calculation, it can be observed that SSI is maximized by minimizing (n x t) or (n / FOM).
SSI is best illustrated with an example. This example uses data obtained from a workshop publication comparing NERSC’s Hopper (Cray XE6) and Edison (Cray XC30) platforms.[1] Application names, node counts, and timings are summarized in the following table.
             Hopper (6,384 nodes)       Edison (5,576 nodes)
             # Nodes    Time (sec)      # Nodes    Time (sec)
 FLASH       512        331.62          512        142.89
 GTC         1200       344.10          400        266.21
 MILC        512        1227.22         1024       261.10
 UMT         512        270.10          1024       59.90
 MiniFE      512        45.20           2048       5.10
The weighted geometric mean can be easily calculated in a spreadsheet using the following form:

weighted geometric mean = exp( Σ_i w_i ln(x_i) / Σ_i w_i )

Where: x = cUS. (In a spreadsheet this can be expressed, for example, as =EXP(SUMPRODUCT(w, LN(x))/SUM(w)) over the w and x ranges.)
While the original study was a strong scaling analysis, for illustrative purposes we’re going to assume that the UMT and MiniFE benchmarks were run at four times the problem size on Edison and hence c=4. The weights are assigned arbitrarily, again for illustrative purposes.
 SSI = 3.61

            w    c    U      S      cUS
 FLASH      1    1    0.87   2.32   2.03
 GTC        4    1    2.62   1.29   3.39
 MILC       4    1    0.44   4.70   2.05
 UMT        2    4    0.44   4.51   7.88
 MiniFE     2    4    0.22   8.86   7.74
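The full example can be reproduced in a short script. This sketch recomputes U, S, and the weighted geometric mean from the raw Hopper/Edison data above (weights and capability factors as assigned in the example):

```python
import math

# Reference platform: Hopper (6,384 nodes); target platform: Edison (5,576 nodes)
N_REF, N = 6384, 5576

# (name, weight w, capability c, n_ref, t_ref [sec], n, t [sec]) from the tables above
apps = [
    ("FLASH",  1, 1,  512,  331.62,  512, 142.89),
    ("GTC",    4, 1, 1200,  344.10,  400, 266.21),
    ("MILC",   4, 1,  512, 1227.22, 1024, 261.10),
    ("UMT",    2, 4,  512,  270.10, 1024,  59.90),
    ("MiniFE", 2, 4,  512,   45.20, 2048,   5.10),
]

def ssi(apps, n_total_ref, n_total):
    log_sum = w_sum = 0.0
    for name, w, c, n_ref, t_ref, n, t in apps:
        u = (n_ref / n) * (n_total / n_total_ref)  # utilization factor
        s = t_ref / t                              # speedup; must be >= 1.0
        assert s >= 1.0, f"{name}: S < 1.0 is not allowed"
        log_sum += w * math.log(c * u * s)
        w_sum += w
    return math.exp(log_sum / w_sum)               # weighted geometric mean

print(round(ssi(apps, N_REF, N), 2))  # -> 3.61
```

This matches the SSI value of 3.61 shown in the table.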
Appendix: Which Mean to Use
There are a few excellent references on which Pythagorean mean to use when benchmarking systems.[2,3] Fleming states that the arithmetic mean should NOT be used to average normalized numbers, and that the geometric mean should be used instead. Smith summarizes that “If performance is to be normalized with respect to a specific machine, an aggregate performance measure such as total time or harmonic mean rate should be calculated before any normalizing is done. That is, benchmarks should not be individually normalized first.” However, the SSI metric normalizes each benchmark first and then calculates the geometric mean, for the following reasons.
 The geometric mean is best when comparing different figures of merit. One might think that the use of speedup is a single FOM, but for SSI each application’s FOM is independent. Hence we cannot add results together to calculate total time, nor total work, nor total rate as is recommended by Smith and as would be needed for correctness in the arithmetic and harmonic means.
 The geometric mean normalizes the ranges being averaged so that no single application result dominates the resultant mean. Its central tendency reinforces this, as the geometric mean is always less than or equal to the arithmetic mean.
 The geometric mean is the only mean with the property that the geometric mean of (Xi/Yi) equals the geometric mean of (Xi) divided by the geometric mean of (Yi); hence the resultant ranking is independent of which platform is used for normalization when calculating speedup.
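This ratio property is easy to check numerically; the timings below are arbitrary values chosen only for illustration:

```python
import math

def geomean(vals):
    """Unweighted geometric mean."""
    return math.exp(sum(map(math.log, vals)) / len(vals))

x = [2.0, 8.0, 5.0]   # times on platform X (illustrative)
y = [1.0, 4.0, 10.0]  # times on platform Y (illustrative)

# geomean(x_i / y_i) == geomean(x) / geomean(y), so the comparison of two
# platforms does not depend on which one is used for normalization.
mean_of_ratios = geomean([xi / yi for xi, yi in zip(x, y)])
ratio_of_means = geomean(x) / geomean(y)
print(abs(mean_of_ratios - ratio_of_means) < 1e-9)  # -> True
```

The same check fails for the arithmetic and harmonic means, which is why normalize-then-average is only safe with the geometric mean.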
References
 [1] Cordery, M.J.; B. Austin; H.J. Wasserman; C.S. Daley; N.J. Wright; S.D. Hammond; D. Doerfler, "Analysis of Cray XC30 Performance using Trinity-NERSC-8 benchmarks and comparison with Cray XE6 and IBM BG/Q", PMBS 2013: Sixth International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems, November 11, 2013.
 [2] Fleming, Philip J.; John J. Wallace, "How not to lie with statistics: the correct way to summarize benchmark results", Communications of the ACM 29(3): 218–221, 1986.
 [3] Smith, James E., "Characterizing computer performance with a single number", Communications of the ACM 31(10): 1202–1206, 1988.
SSI Reference Values
The spreadsheet called out in the Technical Requirements document for the calculation of SSI can be found here. The table below provides the reference values obtained on Trinity Haswell. The values in the table are tentative and are subject to change until the final Request for Proposals (RFP) is issued.
 Application   # of Nodes (n)   Time or FOM                   # MPI ranks   # OMP threads
 SNAP          4096             183.36 sec                    65536         2
 PENNANT       4096             1.459503E11 zones/sec         131072        1
 HPCG          4352             40232.40 Gflops/sec           139264        2
 VPIC          4096             5.89E12 particles/sec         262144        1
 MiniPIC       2048             1.8906E9 updates/sec          32768         2
 UMT           3906             1.37071E12 unknowns/sec       125000        1
 Branson       3456             393.55 sec                    110592        1
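Note that the reference table mixes time-based entries (SNAP, Branson) with rate-based FOMs; per the SSI definitions, speedup is t_ref / t for a time and FOM / FOM_ref for a rate. A minimal sketch (the measured values below are hypothetical, chosen only to illustrate the 6X improvement target):

```python
def speedup(kind, ref_value, measured_value):
    """S = t_ref / t for a time-based entry, FOM / FOM_ref for a rate-based FOM."""
    if kind == "rate":
        return measured_value / ref_value
    return ref_value / measured_value

# Hypothetical measurements on a proposed system (illustrative only):
print(round(speedup("time", 183.36, 30.56), 2))      # SNAP time -> 6.0
print(round(speedup("rate", 5.89e12, 3.534e13), 2))  # VPIC rate -> 6.0
```

Either way, a larger S means better performance, so both kinds feed into the SSI formula identically.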
General Run Rules
Crossroads Application Performance: Instructions and Run Rules
Introduction
Application performance is a key driver for the DOE’s NNSA computing platforms. As such, application benchmarking and performance analysis will play a critical role in evaluation of the Offeror’s proposal. The ACES application benchmark suite has been carefully chosen to represent characteristics of the expected Crossroads workload, which consists of solving complex scientific problems using diverse computational techniques at largescale and high levels of parallelism. The applications will be used as an integral part of the system acceptance test and as a continual measurement of performance throughout the operational lifetime of the system.
An aggregate performance measure, Scalable System Improvement (SSI), will be used in evaluating the application performance potential of the Offeror’s proposed system.
Run Rules
Specific run rules for each benchmark are included with the respective benchmark distribution, which also supplies benchmark source code, benchmark specific requirements and instructions for compiling, executing, verifying numerical correctness and reporting results.
The application benchmarks represent the NNSA workload and span a variety of algorithmic and scientific spaces. The list of application benchmarks is contained in the RFP Technical Requirements document.
Distribution
Each benchmark is a separate distribution and contains a README.ACES file describing how to build and run the code as well as any supporting library requirements. Note that each respective README.ACES contains its own instructions and run rules and thus must be considered as a supplement to this document. If there is a discrepancy between the two documents, information in the README.ACES takes precedence. If anything is unclear, please notify the ACES team.
Problem Definitions
For each application, multiple problem sizes will be defined:
 The small problem is intended to be used for node level performance analysis and optimization
 A medium problem may be defined if the ACES team feels it would be of benefit to the Offeror for investigating inter-node communications at a relatively small scale
 The reference (or large) problem will be of sufficient size to represent a workload on current platforms
The reference (large) problem will be used in the calculation of the SSI baseline parameters. The baseline SSI will be derived from the current generation NNSA/ASC Trinity platform. Reference times (or figures of merit) and platform specifics will be detailed in the ACES provided spreadsheet.
ACES will define the problem sizes to be used by the Offeror in determining their proposed system benchmark results. This includes the weights (w) and capability factors (c) used in the SSI calculation. These problem definitions and factors will be used by the Offeror in the calculation of SSI for the reference (large) problem, depending on scalability and application drivers for the respective program workloads.
For any given problem, the Offeror is allowed to decompose the problem (using strong scaling) as necessary for best performance on their proposed system, subject to (1) any constraints inherent in the codes and (2) any rules pertaining to the calculation of SSI.
Base Results
The base set of results must utilize the same programming method provided in the ACES distribution of the respective benchmark application. The Offeror is allowed to use any version of a given programming method (e.g. the MPI and OpenMP standards) available and supported for the proposed system that provides the best results and meets any other requirements specified in the RFP Technical Requirements document.
The base case is necessary to provide a point of reference relative to known systems and to ensure that any proposed system can adequately execute legacy codes. The base case will be used to understand baseline performance for the applications and to understand the potential for application performance improvement when compared against the optimized case. The following conditions must be met for base case results:
 The full capabilities of the code are maintained, and the underlying purpose of the benchmark is not compromised
 Any libraries and tools used for optimization, e.g. optimized BLAS libraries, compilers, special compiler switches, source preprocessors, execution profile feedback optimizers, etc., are allowed as long as they will be made available and supported as part of the delivered system
 Any libraries used must not specialize or limit the applicability of the benchmark nor violate the measurement goals of a particular benchmark
 All input parameters such as grid size, number of particles, etc., must not be changed
 All results must pass validation and correctness tests.
Optimized Results
The optimized set of results allows the Offeror to highlight the features and benefits of the proposed system by submitting benchmarking results with optimizations beyond those of the base case. Aggressive code changes that enhance performance are permitted. The Offeror is allowed to optimize the code in a variety of ways including (but not limited to):
 An alternative programming model
 An alternative execution model
 Alternative data layouts.
The rationale and relative effect on performance of any optimization shall be fully described in the response.
Submission Guidelines
The Offeror must provide results for their proposed platform for all applications defined by the RFP Technical Requirements document.
All benchmark results for the proposed system shall be recorded in a spreadsheet, which will be provided by ACES. If results are simulated, emulated and/or performance projections are used, this must be clearly indicated and all methods, tools, etc. used in arriving at the result must be specified. In addition, each surrogate system used for projections must be fully described.
The Offeror shall submit electronic copies of the benchmark spreadsheet, benchmark source codes, compile/build scripts, output files, and documentation of any code optimizations or configuration changes as described in the run rules, preferably on a USB thumb drive, CD, or similar medium. Do not include files that cannot be easily read by a human (e.g. object files, executables, core dump files, or large binary data files). An audit trail showing any changes made to the benchmark codes must be supplied, and it must be sufficient for ACES to determine that the changes made conform to the spirit of the benchmark and do not violate any specific restrictions.
Change Log
Updates on 8/20/2018
 Updated Branson source code to fix a bug that caused a hang when the number of processors was not a multiple of 8
Updates on 8/27/2018
 Updated SSI_Xroads.xlsx with correct FOM for MiniPIC
 Updated Summary_MiniPIC & README.ACES in MiniPIC-xroads-v1.0.0.tgz to reflect the correct FOM for the Large problem and Target problem
 Updated Summary_UMT & README.ACES in umt-xroads-v1.0.0.tgz to reflect the correct FOM for the 1-node test case
Updates on 9/10/2018
 Updated IOR: removed randomOffset from the Load2 POSIX shared-file input deck
Updates on 9/11/2018
 Updated Branson input file proxy_large.xml, line 48 to set y_end=1.2 (not 1.0) for uniform spacing.
Updates on 9/24/2018
 Updated PENNANT Summary and README.ACES to clarify run rules on modifying the number of nodes used for the reference problem.
Updates on 11/20/2018
 Updated MiniPIC Summary & README.ACES to specify shasum of the Trilinos commit used along with its branch and URL for download.
 Updated SSI_Xroads.xlsx: fixed formula in "Applications" tab, cell K18, to sum from K11:K17; fixed "Example" tab (added content to cells H21:H27 so cells I11:I17 have the right data for the formula; fixed cells G11:G17)
 Added VPIC_Results_Summary.pdf to provide additional VPIC scaling results
 Updated UMT tarball: added a single-node, 32-rank input deck (ATS3/grid32MPI_3x3x4.cmg); the result on a Trinity Haswell node is 8.57334e+08.
Updates on 1/28/2019
 PARTISN FOM problem scaling instructions. File available by request.
 Updated SSI_Xroads.xlsx: modified FOM for Branson (new value) and VPIC (new units)
 Updated Branson
 New source (GitHub version tagged 0.81) to change from dynamic to static RMA.
 A new FOM is associated with this version.
 See new Summary_Branson.pdf for full list of changes.
 Updated VPIC
 Modified VPIC FOM from seconds to particles/second. This change was to allow flexibility in mesh construction for different node architectures.
 New Summary_VPIC.pdf reflecting change and instructions for mesh construction (section 4, 6a and 7)
Updates on 2/11/2019
 Updated Branson
 Summary_Branson.pdf reflects a change to the memory-per-node calculations to use 'number of nodes' instead of 'n ranks per node' for particle and mesh memory
 Additional Branson source provided on the Crossroads website (branson-0.82.tar.gz) that uses Metis instead of ParMetis
Benchmarks FAQ
 Can we modify the benchmarks to include extra OpenMP or OpenACC pragmas?
For the baseline run, the Offeror may modify the benchmarks to include minimal OpenMP pragmas or OpenACC directives as required to provide a baseline run on the proposed system, but the benchmark must remain a standard-compliant program that maintains existing output subject to the validation criteria described in the benchmark run rules. Minimal implies no change to code structure. For the optimized run, the Offeror is encouraged to fully develop their use of OpenMP or OpenACC.
 Is there duplication in the Crossroads benchmarks (e.g. UMT and SNAP are both transport codes and MiniPIC and VPIC are both PIC codes)?
Each Crossroads benchmark was carefully chosen to represent a workload, programming model, or method of importance to ASC applications. Although UMT and SNAP are both transport codes, UMT treats photon transport and SNAP treats neutral-particle transport. This can be misleading to someone outside the transport field, since both codes describe using discrete ordinates and solving the Boltzmann transport equation (discrete ordinates simply means discretizing both the spatial (xyz) domain and the angular variables that specify the direction of radiation, so both codes use similar language, e.g. flux, energy groups, angles), but they are indeed different. While both solve the Boltzmann equation, SNAP is a proxy for neutral-particle, linear transport, and UMT is a proxy for thermal radiative transport. The specific discretizations and mesh descriptions of these two code types create unique numerical challenges, including mesh-sweep scheduling (graph efficiency), data needs, data layout, and computational intensity.
VPIC and MiniPIC are both Particle-in-Cell applications, but they use different assumptions, programming approaches, and infrastructure. Some key differences: VPIC is a heavily used production application, implemented with three levels of parallelism: asynchronous MPI at the top level, thread parallelism via OpenMP or Pthreads at the mid level, and vectorization via explicit use of vendor-specific vector intrinsics at the lowest level. MiniPIC uses the Kokkos framework and the Trilinos library, which are essential components of many ASC workloads.

Is it permitted to change the use of Kokkos parallel_for regions from team-based parallelism to flat parallelism, or vice versa, for the Baseline Benchmark Runs?
Yes. This is analogous to changing the OpenMP directives supplied in the APEX benchmark suite. For the Baseline benchmark runs, the Offeror is permitted to change the parallelism strategy provided the rest of the kernel/computation code remains unchanged.
Is it permitted to change the behavior of Trilinos packages when performing benchmarking of MiniPIC for the Baseline Benchmark Runs?
Yes. However, the changes should be made within each Trilinos package and must not be a restructuring that crosses package boundaries. Packages are denoted by the directories immediately under “packages” in the Trilinos source tree. The purpose of Trilinos is to provide a comprehensive suite of packages with clear abstractions between them; modifying within a package, and therefore using the abstraction design, is permitted. Numerical reproducibility must be maintained with any modifications.