Running VPIC on Roadrunner: Unraveling the Mysteries of Plasma
By Ben Bergen, Brian Albright, Kevin Bowers, Lin Yin, Pat Fasel, and Paul Weber
Plasma makes up 99 percent of the matter in the visible universe. Dense plasma is the stuff of stars, with tenuous plasma filling the space between stars.
Plasma is created by ionizing gas atoms, usually by heating them, to form a hot mix of positively charged atomic nuclei (ions) and negatively charged electrons. On Earth, plasmas can be seen in lightning and auroras but are otherwise rare because temperatures are relatively cool.
Common applications for plasma include neon signs and fluorescent tubes. However, artificial plasmas, those that exist only briefly during experiments, hold the key to harnessing fusion energy and to exploiting the properties of ultra-intense lasers.
To better understand the nature and behavior of plasmas, Los Alamos scientists developed VPIC (vector particle-in-cell), a computer code that simulates plasma behavior more efficiently than any other code. VPIC simulations enable researchers to study plasmas in ways that exceed conventional theory- and experiment-driven approaches.
Using VPIC, scientists at Los Alamos have conducted the following studies:
- Modeled the interactions of intense lasers with plasmas in laser-driven fusion experiments.
- Used ultra-intense lasers to accelerate plasma electrons to produce x-rays or plasma ions, which can be used to treat cancer, detect contraband nuclear materials, or "spark" fusion in preheated and precompressed fusion fuel.
- Used high-power plasma diodes to produce intense bursts of x-rays for imaging.
- Studied magnetic-confinement fusion.
- Examined magnetic reconnection, a physical process common throughout the cosmos. This process produces bursts of energy-charged particles during solar flares,
To further enhance VPIC, Los Alamos scientists modified the computer code to run on the supercomputer known as Roadrunner (see sidebar). Such modifications have taken VPIC to a new level, resulting in new discoveries, some of which are addressed in this article.
To simulate plasma behavior, VPIC follows the motions of simulated particles as simulated electric and magnetic force fields push the particles. Each particle represents thousands of plasma ions and electrons. The force fields can be applied externally to the simulated volume or generated by the charges, or the motion of the charges, of the simulated particles themselves. VPIC tracks particle motion in three dimensions, accounting for increases in particle masses as their speeds approach that of light, according to Einstein's theory of special relativity.
The Numbers Add Up
VPIC employs vector computing, in which data are operated upon in units known as vectors that contain many pieces of data (usually real numbers). The number of pieces of data in a vector is the vector size. In principle, a single vector can contain many thousands of pieces of data. However, most modern supercomputers use much smaller vectors (for example, four pieces of data each), called "short" vectors, a size that allows graphics data to be processed economically.
In many modern computing architectures, the execution units—the units that actually carry out the floating-point computations—are wide enough to accommodate multiple data. These data have the same instruction applied to them simultaneously, thus increasing throughput. This is a low-energy strategy for increasing performance.
In VPIC's case, the short vectors contain only four pieces of single-precision data that add up to 16 bytes. To adapt VPIC to run on Roadrunner, scientists began by modifying the code completed in 2007. The code used an abstraction layer called "v4," written in the high-level computer language C++. The v4 abstraction layer enables VPIC to support several different architectures with a single implementation.
Using the C++ compiler, VPIC expresses algorithms in a high-level abstraction using v4 operators, which are then converted to the correct "machine language" for the given architecture. The approach gives VPIC a real advantage in handling the diversity of different architectures currently being used in supercomputing. This feature also made it easier for scientists to adapt VPIC to Roadrunner.
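The idea of a short-vector abstraction can be sketched in a few lines. The example below is only an illustration written in Python, not VPIC's actual v4 layer (which is C++); the class name V4Float and its methods are hypothetical. It shows one operator acting on four values at once and confirms that four single-precision values occupy 16 bytes.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class V4Float:
    """Hypothetical 4-lane short vector; not VPIC's real v4 class."""
    lanes: tuple  # exactly four floats

    def __post_init__(self):
        if len(self.lanes) != 4:
            raise ValueError("a short vector here holds exactly four values")

    def __add__(self, other):
        # One operator applied to all four lanes.
        return V4Float(tuple(a + b for a, b in zip(self.lanes, other.lanes)))

    def __mul__(self, other):
        return V4Float(tuple(a * b for a, b in zip(self.lanes, other.lanes)))

    def to_bytes(self):
        # Four single-precision values: 4 x 4 bytes = 16 bytes.
        return struct.pack("<4f", *self.lanes)


x = V4Float((1.0, 2.0, 3.0, 4.0))
v = V4Float((0.5, 0.5, 0.5, 0.5))
dt = V4Float((2.0, 2.0, 2.0, 2.0))

# Advance four positions with a single high-level expression.
x_new = x + v * dt
print(x_new.lanes)            # (2.0, 3.0, 4.0, 5.0)
print(len(x_new.to_bytes()))  # 16
```

In a compiled implementation, the same high-level expression would be mapped by the compiler to whatever four-wide instructions the target architecture provides, which is the portability benefit described above.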
Handling data intelligently is key in obtaining the best supercomputing performance. As the theoretical maximum number of operations that can be done in a second has steadily increased in supercomputers, the performance of memory subsystems has not kept up, specifically in terms of bandwidth and latency.
Usually given in gigabytes per second, bandwidth is the maximum rate at which a microprocessor moves data to or from memory. Latency refers to time delays associated with moving data to or from memory. Poor bandwidth and latency cause modern microprocessors to spend most of their time waiting for data.
To circumvent bandwidth and latency limitations, VPIC reduces the number and size of data accesses. In the Roadrunner implementation of VPIC, the real numbers representing the positions, momenta, charges, and currents of all particles are stored in the local memory of a synergistic processing element (SPE) core during the update of a given region of data. The storage is in contiguous blocks whose size equals the maximum size that can be directly transferred to or from local memory.
Transferring data in blocks of this size achieves the highest fraction of the available bandwidth to local memory, which makes data movement much more efficient. Once data are stored in local memory, VPIC's carefully hand-optimized algorithms can use the data as many times as necessary before they are written back to main memory, a process that takes place on the Cell Broadband Engine (Cell) chip away from the SPE.
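Why large contiguous blocks help can be seen with a simple cost model: each transfer pays a fixed latency plus a time proportional to its size. The sketch below assumes that model; the peak bandwidth, latency, and block sizes are illustrative placeholders, not measured Roadrunner or Cell figures.

```python
def effective_bandwidth(block_bytes, peak_bytes_per_s, latency_s):
    """Achieved bandwidth for one transfer under a latency + size/peak model."""
    transfer_time = latency_s + block_bytes / peak_bytes_per_s
    return block_bytes / transfer_time


PEAK = 25.6e9      # assumed peak bandwidth in bytes per second (illustrative)
LATENCY = 200e-9   # assumed fixed cost per transfer in seconds (illustrative)

for size in (16, 128, 1024, 16384):
    fraction = effective_bandwidth(size, PEAK, LATENCY) / PEAK
    print(f"{size:6d} bytes -> {fraction:.1%} of peak")
```

Under this model the fixed latency dominates small transfers, so the achieved fraction of peak bandwidth rises steadily with block size.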
Adapting VPIC for Roadrunner
In addition to porting VPIC's basic algorithms to the Cell architecture, scientists needed to make two primary structural enhancements to the code so that it would run on Roadrunner.
The first structural enhancement addressed Roadrunner's hierarchical nature, involving multiple layers of computing elements and memory. The goal was to enable data to remain resident on a Cell accelerator, thus avoiding the movement of large blocks of data across a slow connection between the Opterons (server and workstation processors) and the Cell chips.
To overcome this potentially serious bottleneck, scientists developed a messaging "relay" (much like a telecommunication switch relay) that forwards messages to a Cell chip's Opteron host processor and then on to the other Cell chips. This technique essentially flattens the machine's communication topology, making it seem to the Cell chips as if they are "talking" to each other (Figure 1). The code's original design made this adaptation to Roadrunner straightforward. Subsequent performance analysis has shown that any added latency in the relay is insignificant.
Figure 1. Roadrunner's underlying physical network structure does not allow direct communication between different Cell processors. To enable Cell-to-Cell communication, a relay forwards messages from one Cell processor to another.
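The relay idea can be sketched as a toy message router. This is a hypothetical illustration, not VPIC's relay code: the Relay class, the one-host-per-Cell pairing, and the naming of hosts are all assumptions made for the example. The point is that the sender and receiver see only a Cell-to-Cell exchange, while the recorded path goes through both host processors.

```python
from collections import defaultdict


class Relay:
    """Toy relay: Cells never talk directly; their hosts forward for them."""

    def __init__(self):
        self.inbox = defaultdict(list)  # pending messages per destination Cell
        self.hops = []                  # path taken by each message

    @staticmethod
    def host_of(cell):
        # Assumption for this sketch: one host process per Cell, same index.
        return f"host{cell}"

    def send(self, src_cell, dst_cell, payload):
        path = [f"cell{src_cell}", self.host_of(src_cell),
                self.host_of(dst_cell), f"cell{dst_cell}"]
        self.hops.append(path)
        self.inbox[dst_cell].append((src_cell, payload))

    def recv(self, cell):
        # Messages are delivered in the order they were sent.
        return self.inbox[cell].pop(0)


relay = Relay()
relay.send(0, 3, "boundary data")
src, msg = relay.recv(3)
print(src, msg)        # 0 boundary data
print(relay.hops[0])   # ['cell0', 'host0', 'host3', 'cell3']
```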
The second structural enhancement involved adding a data-parallel thread-management framework with abstractions that enable a single-source implementation to launch and control execution threads both on the Cell and on homogeneous, multicore processors. In other words, portions of the VPIC algorithm are broken into small work units that are independent of one another and are dispatched onto separate computational "threads."
In the VPIC framework, it makes no difference if these threads run on a Cell SPE or even on separate cores of a homogeneous, multicore supercomputer. This feature is particularly advantageous on traditional supercomputers, as it yields greater flexibility in allocating node resources (Figure 2). Moreover, this feature could increase scalability by reducing the size of the communications network (reducing the number of Message-Passing Interface ranks) required to run a large simulation. Los Alamos scientists have demonstrated this VPIC feature on the Cray XT5 Kraken supercomputer at Oak Ridge National Laboratory. This supercomputer uses six execution threads per chip.
Figure 2. VPIC enables parts of the algorithm, such as the particle advance, to be broken down into small, independent work units that can be dispatched onto individual "threads" of computations. As a result, VPIC can conduct many operations in parallel.
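A minimal sketch of this work-unit pattern, using Python's standard thread pool, is shown below. The function names and the trivial position update are hypothetical stand-ins for the real particle advance, and the example illustrates the structure only (Python threads do not speed up pure-Python arithmetic).

```python
from concurrent.futures import ThreadPoolExecutor


def advance_block(block, dt):
    """Work unit: advance one independent block of (position, velocity) pairs."""
    return [(x + v * dt, v) for x, v in block]


def split(particles, n_units):
    """Cut the particle list into contiguous, non-overlapping blocks."""
    size = (len(particles) + n_units - 1) // n_units
    return [particles[i:i + size] for i in range(0, len(particles), size)]


particles = [(float(i), 1.0) for i in range(10)]
blocks = split(particles, n_units=4)

# Each block is dispatched to a worker thread; map() preserves block order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda b: advance_block(b, 0.5), blocks))

advanced = [p for block in results for p in block]
print(advanced[0], advanced[-1])  # (0.5, 1.0) (9.5, 1.0)
```

Because no block reads or writes another block's particles, the result is the same whether the blocks run on one thread or many.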
When Los Alamos scientists began this project three years ago, VPIC was already well poised to take advantage of Roadrunner's unique capabilities. This was no accident but rather the result of deep thinking about current trends in computer-architecture design, as well as careful planning in implementing a future-proofed code.
VPIC's development helped expose three principal areas for which new tools and techniques could improve the efficient use of computing resources in the future: efficient data movement, thread control for data and task-level parallelism, and the need for portable and low-level kernel specifications. An associated challenge exists in finding new programming models that are flexible enough to support a variety of architectural characteristics and capabilities.
To address these hurdles, the computer-science community has begun to develop new language standards, such as OpenCL (Open Computing Language), a cross-platform development framework for modern processors. One challenge facing VPIC will be for researchers to find innovative ways to express the physics of simulations to facilitate advanced discovery and enhance humanity's knowledge of the universe.
Simulating Plasma Behavior
A microprocessor typically stores one piece of data in a fixed amount of computer memory amounting to four bytes, for the most common representation of a number with a decimal point (as opposed to an integer). This way of storing "real" numbers is called single-precision floating-point representation (the same "floating point" as in floating-point operations, or flops). Figure 3 shows how a single-precision floating-point number is stored in computer memory.
Figure 3. A microprocessor typically stores one piece of data in a fixed amount of computer memory amounting to four bytes. This way of storing "real" numbers is called single-precision floating-point representation, as illustrated here.
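The four-byte layout can be inspected directly. The sketch below assumes the common IEEE 754 single-precision format (1 sign bit, 8 exponent bits, 23 fraction bits), which may or may not match the exact labeling used in the figure; the helper name float32_fields is made up for this example.

```python
import struct


def float32_fields(value):
    """Split a number's IEEE 754 single-precision encoding into its fields."""
    raw = struct.pack(">f", value)        # 4 bytes, big-endian
    bits = int.from_bytes(raw, "big")
    sign = bits >> 31                     # 1 bit
    exponent = (bits >> 23) & 0xFF        # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF            # 23 bits
    return len(raw), sign, exponent, fraction


# -6.5 = -1.625 x 2^2, so the stored exponent is 2 + 127 = 129.
print(float32_fields(-6.5))  # (4, 1, 129, 5242880)
```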
To simulate plasma behavior, a particle-in-cell plasma code first defines a three-dimensional Cartesian grid that fills the simulated volume in which the simulated plasma evolves over simulated time (Figure 4). The smallest volumes defined by the grid are known as "cells" (not to be confused with the Cell Broadband Engine). The code also defines "field states" at staggered locations on the grid.
Figure 4. The particle-in-cell method evolves kinetic plasma by representing particles on a Cartesian grid.
VPIC uses a staggered (Yee) mesh to assign these locations. The field states are solutions to two of Maxwell's four electromagnetic-field equations. At the start of a simulation, the code adds small parcels—the simulated particles—to some of the cells, as dictated by the initial physical conditions of the experiment being simulated.
The code then steps through time in tiny increments. At the start of a new time step, the code uses each particle's current position, velocity, and the value of the time increment to calculate the particle's position change during the time increment. The code also uses the electric and magnetic forces acting on each particle to calculate how the particle's momentum (the product of its relativistic mass and velocity) changes.
Because the particles are charged, each one produces an electrical current as it moves through the simulated volume. The code uses the new positions and values of the charges and currents in each cell to calculate changes to the electric and magnetic fields throughout the simulated volume. The code then increases time by another time increment, with the entire process repeated again and again until the simulation is complete.
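The particle half of this cycle can be sketched for a single particle. The example below is a deliberately reduced illustration, not VPIC's algorithm: it is one-dimensional, holds the electric field fixed, ignores the magnetic field, and omits the current deposit and field update. The function names and the field strength are assumptions chosen for the example.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def velocity(p, m):
    """Velocity from relativistic momentum p for a particle of rest mass m."""
    gamma = math.sqrt(1.0 + (p / (m * C)) ** 2)  # relativistic mass is gamma * m
    return p / (gamma * m)


def push(x, p, q, m, e_field, dt):
    """One time step: move with the current velocity, then update momentum."""
    x = x + velocity(p, m) * dt
    p = p + q * e_field * dt
    return x, p


# An electron in a strong, constant field (values chosen only for illustration).
q, m = -1.602176634e-19, 9.1093837015e-31
x, p = 0.0, 0.0
for _ in range(1000):
    x, p = push(x, p, q, m, e_field=-1.0e10, dt=1.0e-15)

v = velocity(p, m)
print(f"final speed = {v / C:.3f} c")
```

The momentum grows linearly with time under the constant force, but the speed approaches and never exceeds that of light because the relativistic mass grows with it.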
Breakthrough in Stimulated Raman Scattering
Research in inertial-confinement fusion (ICF) has both weapons and energy applications. In experiments conducted at the National Ignition Facility (NIF), 192 laser beams implode a fusion-fuel-filled spherical capsule suspended inside a cylindrical gold hohlraum (Figures 5 and 6). The goal is to ignite the fusion fuel so that it releases substantially more energy than the lasers pump into it. (Read more about NIF on page 20.)
Figure 5. A closeup view of a hohlraum (the German word for "cavity"), which holds the fusion-fuel-filled spherical capsule.
Before ignition takes place, the lasers vaporize and ionize the gas within the hohlraum. Intense laser light striking the resultant plasma leads to a phenomenon known as stimulated Raman scattering (SRS), which amplifies periodic density variations in the plasma. Large density variations in the plasma reflect subsequent laser light, thereby reducing the implosion's drive energy and symmetry. Either effect significantly reduces fusion yield.
To compress the fuel capsule symmetrically requires nearly uniform laser intensity. NIF's laser beams obtain such intensity by passing each beam through a random-phase plate, which breaks the beam into an ensemble of laser speckles. To predict the effects of SRS on ICF experiments, scientists must understand the onset and saturation of SRS in a single laser speckle.
Until recently, the essential nonlinear physics governing SRS growth was a mystery. Using VPIC simulations run on Roadrunner, scientists have demonstrated how nonlinear SRS physics affects laser penetration and energy deposition during a fusion experiment. These simulations modeled large, three-dimensional plasma volumes at unprecedented time and space scales and over a range of laser intensities. Each simulation typically used 4,096 processors.
Figure 6. During NIF ignition experiments, 192 laser beams implode a fusion-fuel-filled spherical capsule. Laser light striking the plasma gives rise to stimulated Raman scattering, which reflects significant amounts of laser light. Reflection reduces the quality of the implosion and therefore the fusion yield.
During the simulations, scientists found that SRS reflectivity within a single speckle, a measure of how much of the incident laser light the plasma reflects, behaved nonlinearly. The reflectivity rose abruptly over a small range of intensity at a threshold, then reached a plateau at higher laser intensity, where the SRS instability saturated nonlinearly (see inset in Figure 7). Single-speckle experiments at the Los Alamos Trident Laser Facility have since validated this simulated behavior.
Figure 7. This figure shows VPIC simulations of plasma during two bursts of SRS growth. In the inset, VPIC simulations of SRS development reveal instability and saturation resulting from laser intensity.
Researchers ran the largest of the VPIC SRS simulations on 16 Roadrunner connected units using 11,520 processors, nearly the full Roadrunner system. This simulation employed a record 0.4 trillion particles and 2 billion computational cells. It ran for 58,160 time steps (~10^19 floating-point operations), long enough for two bursts of SRS to grow from noise levels to significant amplitudes at a laser intensity near the SRS onset.
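A quick back-of-the-envelope calculation puts these figures in perspective. The sketch below takes the operation count to be roughly 10^19, an order-of-magnitude reading of the figure quoted above, so the derived per-step values are rough estimates only; they also fold in all field and bookkeeping work, not just the particle advance.

```python
particles = 0.4e12     # 0.4 trillion particles
cells = 2e9            # 2 billion computational cells
steps = 58_160         # time steps
total_flops = 1e19     # order-of-magnitude operation count (assumed reading)

particles_per_cell = particles / cells
flops_per_step = total_flops / steps
flops_per_particle_step = total_flops / (steps * particles)

print(f"particles per cell:          {particles_per_cell:.0f}")
print(f"operations per time step:    {flops_per_step:.2e}")
print(f"operations per particle-step: {flops_per_particle_step:.0f}")
```

This works out to about 200 particles per cell and on the order of a few hundred operations per particle per time step.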
Figure 7 shows isosurfaces of electrostatic field associated with these bursts. The wave fronts exhibit bending or "bowing" (a phenomenon arising from nonlinear electron trapping), as well as self-focusing that breaks up the phase fronts.
For the first time, these simulations have enabled researchers to understand the essential nature of the nonlinear onset and saturation of SRS. Current research focuses on determining whether neighboring speckles can interact by exchanging hot electrons—or laser or plasma waves—to better reflect laser light. This kind of study is possible only on very large (petascale-class) machines like Roadrunner, where kinetic simulations of laser-plasma interaction in three dimensions at realistic laser-speckle and multi-speckle scales can be performed with unprecedented size, speed, and fidelity.