The Invisible Neutron Threat
A neutron produced by a cosmic ray and traveling at nearly the speed of light strikes a military C-141B Starlifter carrying over 100 troops at 37,000 feet over the Sea of Japan. Immediately the pilot notices something is wrong. Very wrong. The plane is suddenly banking to the right and is in danger of going out of control. What is happening?
Is a single subatomic particle capable of causing such a big problem? The answer is yes: a microchip in a plane's flight controller can malfunction and produce an erroneous command after being struck by a neutron. These neutrons, like ghosts, can pass through materials without being noticed. At aircraft cruising altitudes, about 2,000 of them per second penetrate each square yard of the aircraft's surface, passing through the passengers, seats, and onboard electronics and exiting on the other side. What happens when a high-energy neutron collides head-on with a silicon atom's nucleus in a transistor of the onboard electronics?
For over 20 years the military, the commercial aerospace industry, and the computer industry have known that high-energy neutrons streaming through our atmosphere can cause computer errors known as single-event upsets (SEUs). These are "soft" errors—no permanent damage is done—but a single digit in computer memory suddenly changes, or a logic circuit produces an erroneous result that may hang up (or crash) an application. The neutron's head-on collision with a nucleus is what does the mischief. It produces a burst of electric charge that causes a single transistor—the basic building block of the integrated circuits patterned on the surface of a microchip—to flip from the OFF state to the ON state.
The rate at which SEUs occur in a microchip is proportional to the number of neutrons reaching the microchip per second, called the neutron radiation intensity. In the atmosphere, the neutron intensity keeps increasing with altitude up to 60,000 feet and then levels out, and the rate of SEUs follows along. At 30,000 feet, for example, both the neutron intensity and the SEU rate are 300 times higher than they are at sea level. Unfortunately, neutrons are so penetrating that there is no practical way to shield critical equipment on an aircraft. So, the military and the aerospace industry have developed mitigation strategies.
On October 7, 2008, an Airbus A330-303 operated by Qantas Airways was en route from Perth to Singapore. At 37,000 feet, one of the plane's three air data inertial reference units failed, causing incorrect data to be sent to the plane's flight control systems. This caused the plane to suddenly and severely pitch down, throwing unrestrained occupants to the plane's ceiling. At least 110 of the 303 passengers and 9 of the 12 crew members were injured. The injuries of 12 of the occupants were serious, and another 39 occupants required treatment at a hospital. An SEU was the only potential cause for the malfunctions not ruled out. All other potential causes were found to be "unlikely" or "very unlikely." However, the Australian Transport Safety Bureau (ATSB) found it had "insufficient evidence to estimate the likelihood" that an SEU was the cause. –ATSB Transport Safety Report, Aviation Occurrence Investigation AO-2008-070, Final
If an SEU occurs in a flight controller on a manned aircraft, the pilot can override the flight controller, or better, the circuits in the controller can automatically correct the error through triple modular redundancy (TMR). In TMR, the signal in one electronic circuit is compared with the results from two other identical circuits. The error-affected circuit is then overridden—in short, outvoted by the other two circuits—before the wrong signal ever leaves the controller. TMR has worked very well for flight controllers and other critical devices that depend on microchips. However, TMR mitigation is very expensive in terms of dollars, time, weight added to the aircraft, and space required, so until recently TMR was considered uneconomical for the less-critical functions like imaging and data processing devices.
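The voting step in TMR can be illustrated with a minimal sketch. It assumes the three redundant circuits each produce an integer word and that the voter works bit by bit; the function name `tmr_vote` and the example values are hypothetical, not taken from any real flight controller.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote: each output bit takes the value that
    at least two of the three redundant copies agree on."""
    return (a & b) | (a & c) | (b & c)


# Three identical circuits should all produce the same word.
correct = 0b10110010

# Simulate an SEU: a neutron flips bit 5 in one copy only.
upset = correct ^ (1 << 5)

# The two unaffected copies outvote the corrupted one.
voted = tmr_vote(correct, upset, correct)
print(bin(voted))
```

A voter like this masks a single upset in any one copy. It cannot help if two copies are hit in the same bit position before the error is corrected, which is one reason the redundancy is costly to do well.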
The SEU rate per microchip depends on three things multiplied together: the neutron intensity, the intrinsic sensitivity of each transistor to neutron-induced SEUs, and the number of transistors on the microchip. Suppose the SEU rate for a particular microchip with particular transistors, used at a certain altitude, is 1 every 1000 hours, and there are 100 microchips in use. Then at that altitude, 1 of those 100 microchips will suffer an SEU once every 10 hours. In other words, the higher the altitude, the greater the neutron sensitivity of the transistor, and the larger the number of microchips in use, the higher the SEU rate.
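The arithmetic above can be sketched in a few lines. This is only an illustration: the function names and the units of the three factors are assumptions, and the intensity, sensitivity, and transistor-count values in the last line are made-up numbers chosen to show the proportionality, not measured data.

```python
def chip_seu_rate(neutron_intensity, sensitivity_per_transistor, n_transistors):
    """SEU rate for one chip: the product of the three factors in the text.
    Units are hypothetical; only the proportionality matters here."""
    return neutron_intensity * sensitivity_per_transistor * n_transistors


def fleet_mean_hours_between_seus(per_chip_rate_per_hour, n_chips):
    """Average time between upsets anywhere among n_chips identical chips."""
    return 1.0 / (per_chip_rate_per_hour * n_chips)


# Example from the text: 1 SEU per 1000 hours per chip, 100 chips in use.
per_chip_rate = 1.0 / 1000.0
hours_between_upsets = fleet_mean_hours_between_seus(per_chip_rate, 100)
print(hours_between_upsets)  # about 10 hours

# Doubling the neutron intensity doubles the per-chip rate (illustrative values).
ratio = chip_seu_rate(2.0, 1e-12, 1e9) / chip_seu_rate(1.0, 1e-12, 1e9)
```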
Many companies have visited Los Alamos National Laboratory to use the services of the ICE House.
How Big Is the Neutron Threat?
Today the military has increasing concerns about the neutron threat because the number of airborne microchip-based devices is increasing rapidly. For example, in the Iraq and Afghanistan wars, awesome arrays of microchip-based off-the-shelf computers and imaging devices have been deployed on surveillance and other military aircraft to deliver critical battlefield information. Some are flown over the North Pole at up to 60,000 feet and give the U.S. military a view of the entire northern hemisphere. The neutron intensity there is about 2,000 times that at sea level.
Other lower-altitude aircraft are giving soldiers real-time imagery of the streets and neighborhoods they are about to enter. The military counts on having the information processed onboard and quickly downloaded to soldiers on the ground. However, the SEU rate per microchip at sea level in the latest off-the-shelf devices has grown rapidly in the last 5 years as the transistor size has decreased and the number of transistors on each chip has increased. Is the SEU risk now too high? Is mitigation worth the cost? And how can these risks be measured before the equipment is deployed?
The military is not alone in facing this problem. The same microchips used in avionics are appearing everywhere in our digital world, for example, in ground-level civilian systems for banking, transportation, medicine, communication, entertainment, and more. They are critical in insulin monitors and GPS-enabled emergency response systems, in antilock brakes and smart stoplights, and in smart phones, increasingly realistic video games, advanced audio systems, and the supercomputers that forecast the weather and predict the performance of our nuclear weapons. (See sidebar "Supercomputer Testing at the ICE House.")
Will Moore's Law Come to an End?
The evolution of the digital world is due to a single driver: the shrinking size of individual transistors. Each time the area of the transistor is cut in half, the industry doubles the number of transistors per microchip, and the chip performance (number of operations per second) doubles. For the last 40 years, transistor area has halved and chip performance has doubled every 2 years, a rate of increase known as Moore's Law. Because smaller transistor size reduces fabrication costs and allows transistors to operate at lower voltages, the increased performance comes at little extra cost, enabling more microchips to be used in an ever-greater number of products. It is no wonder Moore's Law is hailed as an engine of growth for our economy.
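To get a feel for what doubling every 2 years means over 40 years, the growth factor can be computed directly. This is a back-of-the-envelope sketch; the function name is hypothetical and the calculation assumes perfectly regular doubling, which real chip generations only approximate.

```python
def moore_growth_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Growth factor after `years` if the quantity doubles every
    `doubling_period_years` (assumes idealized, perfectly regular doubling)."""
    return 2.0 ** (years / doubling_period_years)


# 40 years at one doubling every 2 years is 20 doublings.
factor = moore_growth_factor(40)
print(f"{factor:,.0f}x")
```

Twenty doublings amount to roughly a millionfold increase in transistors per chip, which is why even a small rise in per-transistor sensitivity matters.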
Yet Moore's Law may come to an end, due in large part to the neutron threat. The drive toward smaller transistors is now leading to an increased sensitivity to SEUs per transistor, particularly in transistors with subcomponents that are 65 nanometers (billionths of a meter) or less wide. At those dimensions, billions of transistors can be patterned on a chip, but the critical electric charge needed to flip a transistor becomes very low. Because much smaller bursts of charge from neutrons hitting silicon nuclei can now cause an SEU, the SEU rate increases sharply.
Heather Quinn of Los Alamos' Intelligence and Space Research Division is a reliability expert for electronic data systems aboard satellites and aircraft. Quinn, who has been measuring the rate of SEUs since she came to LANL in 2004, warns that the more our society goes toward automation and the more that advanced microchips with billions of transistors per microchip are used, the greater the neutron problem will become.
LANSCE: Dealing with the Neutron Threat
Today it's widely recognized that neutron radiation is a major factor limiting the reliability of advanced electronics. Chipmakers and users have been learning the hard way that they need to measure neutron-induced effects in advance to avoid dangerous, costly failures. Boeing was among the first to see the problem. In the early 1990s, Boeing was concerned about the electronics going into their new 777 commercial airliner and needed a rapid way to test for neutron-induced failures. But how and where could they quantify the risk?
Boeing's Eugene Normand knew that the neutron beams at LANSCE's Weapons Neutron Research (WNR) facility, the most intense high-energy neutron source in the world, have the same energy spectrum (numbers of neutrons at different energies) as the neutron radiation in the atmosphere. Normand contacted Steve Wender, director of WNR, and proposed that Boeing be allowed to place its electronics in WNR's neutron beam to replicate exposure to the neutron energy spectrum in the atmosphere. That way Boeing could research neutron-induced electronic upsets and the relative rates at which they would occur aboard the new aircraft. By using WNR, Boeing could assess the atmospheric neutron risk at a single facility instead of traveling to different single-energy neutron sources and then filling in data for the other neutron energies with theoretical guesswork.
Routers run the digital world, sending information from one computer network to another across cities, regions, nations, and continents. An office building can contain thousands of them. Their sheer numbers make them a target for neutron-induced SEUs.
Wender pointed out that, in addition, the WNR neutron beam intensity is a million times greater than the neutron intensity at about 30,000 feet. That meant that one hour of exposure in WNR's neutron beam should produce the same number of SEUs as 100 years of exposure at normal cruising altitudes. It would be neutron testing on steroids.
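The "one hour equals about 100 years" figure follows from the acceleration factor alone, as this rough check shows. The function name and the 365.25-day year are assumptions made for illustration; the million-fold factor is the one quoted above.

```python
HOURS_PER_YEAR = 24 * 365.25  # assumption: average calendar year


def equivalent_field_years(beam_hours: float, acceleration_factor: float) -> float:
    """Years of exposure at altitude that deliver the same neutron dose
    as `beam_hours` in a beam `acceleration_factor` times more intense."""
    return beam_hours * acceleration_factor / HOURS_PER_YEAR


# One hour in a beam a million times more intense than at about 30,000 feet.
years = equivalent_field_years(1.0, 1e6)
print(round(years))  # on the order of 100 years
```

A million hours works out to a little over a century, so the article's round figure of 100 years is an order-of-magnitude statement rather than an exact conversion.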
Wender began working with a team from Boeing, Honeywell, and LSI (the semiconductor storage and networking giant) to develop one of WNR's neutron beam lines as the first one-stop shop for predicting the SEU rates from atmospheric neutron radiation. That beam line was gradually transformed into the world's best user facility for determining the risks of neutron SEUs.
The ICE House
Called the Irradiation of Chips and Electronics (ICE) House, the facility is now a mecca for the global electronics and avionics industries, from chip producers to consumer product companies.
On the military front, the Department of Defense (DoD) has asked Quinn to place electronic components planned for DoD aircraft in the neutron beam at the ICE House and test for neutron-induced SEU rates. While military airplanes have an overall lifetime of 20 to 30 years, their electronics get refreshed every 5 to 10 years. DoD wants to increase the flexibility and range of functions on each microchip, which today means deploying electronics with transistor components as small as 28 nanometers. It also means greater use of field-programmable gate arrays (FPGAs): chips that can be reprogrammed remotely with an uploaded bit stream of new program instructions. These FPGAs give DoD the option to alter the mission focus of an aircraft in midair if, for example, a new threat suddenly emerges.
Quinn not only tests components at the ICE House but also tests possible mitigation strategies. Parts susceptible to neutron-induced "latch-up" (in which the part suddenly draws a large current and could potentially burn out) are considered unacceptable and are immediately screened out. But parts susceptible to "soft" (nondestructive) errors, such as SEUs, can often be helped. Quinn will recommend a redesign or the use of error-correcting software or built-in redundancy (like TMR), depending on the test results.
Growing Demand for the ICE House
Among the five neutron sources in the world that attempt to reproduce the effects of atmospheric neutrons, the ICE House is the only one in the United States and, according to a recent article published in the Institute of Electrical and Electronics Engineers (IEEE) Transactions on Nuclear Science, ICE House test results most closely match what can be expected in the field.
Beyond aircraft manufacturers and DoD, more industries are using the ICE House to test their new products. Automotive standards require that a car's computer system be tested for neutron radiation effects once it has more than a specific amount of microchip-based memory. The Joint Electron Devices Engineering Council, representing about 300 manufacturers and users of electronics, states in its published standard for testing memory units that WNR "is the preferred facility" for accelerated neutron-induced SEU testing.
Chipmakers such as Intel are developing new transistor designs that are small but hold enough charge to be resistant to the effects of neutrons while operating at low voltages. To test these new designs, they are requesting significant amounts of beam time at the ICE House.
"The ICE House is the only facility providing users from the military, industry, and academia with easy, economical access to neutrons that mimic the environment," says Wender. "And we are rapidly becoming oversubscribed."
Adding On to Meet Demand
To meet increased demand, LANL's management organization, LANS, LLC, has capitalized the construction of a second beam line for the ICE House. It should be completed in 2012 and will come none too soon. The high-tech industry hopes to keep Moore's Law going for at least another decade, during which time the subcomponents of transistors will downsize from 45 nanometers to 4.5 nanometers, making the transistors all the more susceptible to neutron-induced threats.
To make systems more tolerant of neutron-induced errors and device variations, researchers are envisioning more-powerful mitigation strategies that involve every layer of the system—from the software applications and operating systems to the individual circuit components. "This is not a problem that we expect to go away anytime soon, and solving it must have a high priority," states IBM Fellow Carl J. Anderson in a recent study of cross-layer reliability sponsored by the National Science Foundation and edited by Quinn, Nick Carter of Intel, and André DeHon of the University of Pennsylvania.
Those solutions to the neutron threat will have to be vetted. Undoubtedly LANSCE's ICE House, with its controlled and quantified neutron-radiation environment, will continue to be an invaluable resource to help researchers design, test, and certify the highly complex electronic automation systems we use now and envision for our future.
–Necia Grant Cooper