# Reliability as a challenge & opportunity to technology scaling

Jose Maiz, Fellow, Technology and Manufacturing Group, Director of Logic Technology Quality & Reliability

2007 Salishan HPC Conference, April 26, 2007



## Key Messages

- Technology scaling continues according to Moore's Law
  - 2X increase in functionality every 2 years
  - In the form of cores, integrated functionality or both
  - 65nm in 2005, 45nm 2007, 32nm 2009
- Technology & Reliability Challenges are many, but so are the opportunities
  - Many new device types and materials
  - A challenge as well as an opportunity
- High RAS will require global fault management strategies along with robust circuit design
  - Better understanding needed on RAS requirements
  - Research and Cost effectiveness of proposed options
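As a rough illustration of the cadence above, the sketch below computes the linear and area shrink between the listed nodes. Treating the node name as a literal linear feature size is a simplifying assumption, and `shrink_factors` is a hypothetical helper, not something from the talk.

```python
# Node names and years are from the slide. Treating the node name as a
# linear feature size is a simplifying assumption for illustration only.
nodes = [(2005, 65), (2007, 45), (2009, 32)]

def shrink_factors(nodes):
    """Return (year, linear_shrink, area_shrink, density_gain) per transition."""
    out = []
    for (_, prev_nm), (year, nm) in zip(nodes, nodes[1:]):
        linear = nm / prev_nm      # ratio of feature sizes
        area = linear ** 2         # area scales with the square
        out.append((year, linear, area, 1 / area))
    return out

for year, linear, area, density in shrink_factors(nodes):
    print(f"{year}: linear x{linear:.2f}, area x{area:.2f}, density x{density:.2f}")
```

Under this assumption each transition shrinks linear dimensions by about 0.7x, which roughly halves area and is consistent with the stated 2X increase in functionality every 2 years.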





#### Moore's Law Delivers Value to the End User



Twice the functionality at the same cost every 2 years



## Performance/ Watt

#### Trend in Performance/ Watt relative to i386



Performance/watt improvement for both integer and floating point

#### Lead in 45nm Technology and Products



Intel® Penryn Core<sup>™</sup>2 family processor, 410/820 M transistors (2C/4C). *World's first working 45 nm CPU*



- 153 Mbit SRAM
- 0.346 μm<sup>2</sup> cell
- 119 mm<sup>2</sup> chip size
- More than 1B transistors
- Functional in Jan 2006
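The SRAM test chip figures can be cross-checked with a little arithmetic. This is a sketch under two assumptions that the slide does not state: that 1 Mbit means 2**20 bits, and that the cell is a conventional 6-transistor (6T) SRAM cell.

```python
# Values from the slide
BITS = 153 * 2**20          # 153 Mbit, assuming 1 Mbit = 2**20 bits
CELL_AREA_UM2 = 0.346       # cell area in square micrometers
CHIP_AREA_MM2 = 119         # chip size in square millimeters

# Assumption: 6T SRAM cell (not stated on the slide)
TRANSISTORS_PER_CELL = 6

array_area_mm2 = BITS * CELL_AREA_UM2 / 1e6      # 1 mm^2 = 1e6 um^2
array_fraction = array_area_mm2 / CHIP_AREA_MM2  # share of die used by cells
cell_transistors = BITS * TRANSISTORS_PER_CELL   # transistors in cells only

print(f"cell array area: {array_area_mm2:.1f} mm^2")
print(f"share of chip:   {array_fraction:.0%}")
print(f"cell transistors: {cell_transistors / 1e9:.2f} B")
```

Under these assumptions the bit cells alone account for a little under 1B transistors and roughly half the die, so a total above 1B is plausible once decoders, sense amplifiers, and other periphery are included.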



## 45 nm Yield Improvement Trend



[Chart: yield improvement trend by year, 2000 to 2008]

- Excellent yield learning and good reliability too
- On track for production ramp in 2H '07





## High-k + Metal Gate Transistors

Integrated 45 nm CMOS process

- High performance
- Low leakage
- Meets reliability requirements
- Manufacturable in high volume




- Low resistance layer
- Work function metal (different for NMOS and PMOS)
- High-k dielectric (hafnium based)
- Silicon substrate



|              | High-k vs. SiO<sub>2</sub> | Benefit            |
|--------------|----------------------------|--------------------|
| Capacitance  | 60% greater                | Faster transistors |
| Gate leakage | >100x reduction            | Cooler chips       |

## Very High Innovation Rate

| Area                    | Innovations |
|-------------------------|-------------|
| Materials               | HighK-MG transistors for performance & low power<br>LowK ILDs for interconnect<br>Novel materials for strain and electrical performance |
| Transistor architecture | Novel transistor architectures for HighK-MG<br>TriGate transistors and III-V integration in the future |
| Chip architecture       | Efficient performance/power with Core<sup>™</sup>2<br>MultiCore<br>Monolithic integration of graphics, memory controller, etc. |
| Platform integration    | Power & form factor optimization |



## Many Reliability Challenges

- Increased Electric Fields
- The shrinking V<sub>max</sub>-V<sub>min</sub> window
- The development of robust High K/ Metal Gate transistors
- Dimensional scaling of interconnects and their liners
- Thermo-Mechanical limitations of very LowK ILDs
- Soft Errors
- Defectivity with scaled technology
- Transient and intermittent errors
- Fault tolerance

#### Innovation needed more than ever



## Gate Dielectric Field Trend



Substantial increases in E field enabled by HK/MG





- $V_{max}$: Scaling for density, performance & power
- $V_{min}$: Transistor variability & increased bit count



### **Transistor variability impacts V<sub>min</sub>**



Due to random dopant fluctuation and other process parameters

- Develop design techniques that can handle variability



#### V<sub>min</sub> impacted by bit count



- Impacted by variability and defectivity
- Improved process and cell upsizing help

**Need robust manufacturing process now and fault tolerance techniques in the future**
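One way to see why bit count pushes V<sub>min</sub> up is to ask how many standard deviations of margin every cell needs so that the whole array works. The sketch below assumes independent cells with a Gaussian distribution of per-cell V<sub>min</sub>; that model, the 99% array yield target, and the helper name `required_sigma` are illustrative assumptions, not data from the talk.

```python
import math
from statistics import NormalDist

def required_sigma(n_bits: int, array_yield: float) -> float:
    """Sigma margin each cell must meet so that all n_bits cells pass
    with probability array_yield (assumes independent Gaussian cells)."""
    # Per-cell failure probability: 1 - array_yield**(1/n_bits),
    # computed with expm1 to stay accurate for very large n_bits.
    p_fail = -math.expm1(math.log(array_yield) / n_bits)
    # One-sided tail: distance from the mean, in sigmas, at that probability.
    return -NormalDist().inv_cdf(p_fail)

for n_bits in (2**20, 16 * 2**20, 153 * 2**20):
    print(f"{n_bits:>11d} bits: {required_sigma(n_bits, 0.99):.2f} sigma")
```

The required margin grows only slowly with bit count, but it never stops growing, which is one reason the array V<sub>min</sub> is set by the rare worst-case cell rather than by the typical one.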

## Transistor degrades during use



Slow but continuous process
Addressed by variation tolerant design and Frequency Guardbands at test



#### ... and so does Product operating frequency



Test Guardbands used to eliminate customer impact
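The guardband idea can be sketched as simple arithmetic: a fresh part must run faster at test than its rated frequency, by enough to absorb the slowdown expected over its life. The function name and the numbers below are hypothetical, and a single end-of-life slowdown fraction is a deliberate simplification.

```python
def guardbanded_test_frequency(spec_mhz: float, eol_slowdown: float) -> float:
    """Minimum frequency a fresh part must reach at test so that it still
    meets spec_mhz after slowing down by the fraction eol_slowdown."""
    if not 0 <= eol_slowdown < 1:
        raise ValueError("eol_slowdown must be in [0, 1)")
    return spec_mhz / (1 - eol_slowdown)

# Hypothetical example: 3000 MHz rating, 5% slowdown assumed over life
f_test = guardbanded_test_frequency(3000.0, 0.05)
print(f"test at {f_test:.0f} MHz to guarantee 3000 MHz at end of life")
```

The cost of the guardband is the gap between the test frequency and the rated frequency, which is why reducing degradation through process improvements recovers performance.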



## Transistor degradation



Process improvements are a must to counter the effect of increased E fields



#### Transistor architecture & materials are changing

#### Many new materials

- HighK/Metal Gates for gate leakage control and performance scaling
  - Integration and reliability challenges
- Low K ILDs
  - Thermomechanical risk may slow their introduction
- Lead free Bumps
- Clever changes in planar transistors
  - Strain, epitaxial Source/Drain layers
- Novel Transistor architectures like tri-gate
- Exotic options explored: from Carbon Nanotubes and semiconductor nanowires to III-V compounds



#### Tri-Gate Transistors

Transistor gate wraps around 3 sides of Si channel (Tri-Gate)
Transistor channel is "fully depleted", unlike normal bulk CMOS
Fully depleted operation reduces leakage current by up to 10x



## **Increasing Electron Mobility**

| Semiconductor | Si | GaAs | InAs | InSb |
|---------------|----|------|------|------|
| n-Mobility    | 1  | 8    | 33   | 50   |

GaAs, InAs and InSb are compound semiconductors.

Increased electron mobility leads to higher performance and less energy consumption

 The challenge is integrating them with Silicon and improving Hole mobility





#### Scaling of the interconnect



[Chart: x-axis is Copper CD (μm), 0.05 to 0.65]

Effective resistivity increase due to:

- Cross section reduction due to barriers
- Increased scattering from grain boundaries and surfaces

**Tough but manageable challenges** 
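The first contribution, cross section lost to barriers, can be sketched geometrically. The model below assumes a rectangular line with a non-conducting barrier of fixed thickness on both sidewalls and the bottom; the 10 nm barrier, the aspect ratio of 2, and the function name are illustrative assumptions, and scattering from grain boundaries and surfaces is not modeled.

```python
RHO_CU_BULK = 1.68  # micro-ohm*cm, bulk copper near room temperature

def effective_resistivity(width_nm: float,
                          barrier_nm: float = 10.0,
                          aspect_ratio: float = 2.0) -> float:
    """Effective resistivity of a line whose drawn cross section is
    width x height but whose copper fills only the part inside the barrier."""
    height_nm = aspect_ratio * width_nm
    cu_width = width_nm - 2 * barrier_nm   # barrier on both sidewalls
    cu_height = height_nm - barrier_nm     # barrier on the bottom
    if cu_width <= 0 or cu_height <= 0:
        raise ValueError("barrier consumes the whole line")
    drawn_area = width_nm * height_nm
    return RHO_CU_BULK * drawn_area / (cu_width * cu_height)

for width_nm in (650, 250, 50):
    print(f"{width_nm:4d} nm line: {effective_resistivity(width_nm):.2f} micro-ohm*cm")
```

Because the barrier thickness is held constant in this sketch while the line shrinks, its share of the cross section grows, and the effective resistivity rises steeply at the narrow end of the range.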


## Single Event Upsets

- Transient errors that corrupt data but do not produce permanent damage (at limited doses)
  - Charge burst that overwhelms a storage node
- α particles in materials and atmospheric neutrons in terrestrial systems
- Cosmic rays and heavy nuclei in space
  - Orders of magnitude higher fluxes than at sea level
- This is just one class of transient errors
  - Others are noise related fails in the interconnect fabric etc.



## Single Event Upsets: Cache cell



- SEU errors due to neutrons from cosmic rays and α particles from residual impurities
- Reduction in charge collection dominates over reduction in critical charge



#### Single event upsets: Multi-bit fails



Multi-bit errors are increasing as a proportion of fails
Expected consequence of increased charge sharing
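A small sketch of why a growing share of multi-bit fails matters for protection: a single parity bit flags any odd number of flipped bits in a word but is blind to two flips in the same word. The slide does not say which protection scheme is used; single-bit even parity is chosen here purely as an illustration, and the word value is arbitrary.

```python
def parity(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") % 2

def flip(word: int, *bit_positions: int) -> int:
    """Return word with the given bit positions inverted (simulated upset)."""
    for pos in bit_positions:
        word ^= 1 << pos
    return word

stored = 0b1011_0010
stored_parity = parity(stored)

single_upset = flip(stored, 3)       # one bit flipped
double_upset = flip(stored, 3, 4)    # two adjacent bits flipped

single_detected = parity(single_upset) != stored_parity
double_detected = parity(double_upset) != stored_parity
print("single-bit upset detected:", single_detected)
print("double-bit upset detected:", double_detected)
```

The double-bit case leaves the parity unchanged, so the corrupted word passes the check silently.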



#### Single Event Upsets: Logic latches



Similar trend starting for latches
45 nm results still preliminary



#### Single Event Upset: Chip impact



Saturation for cache arrays
Getting there for logic. Perhaps in 32nm



## Circuit contributions to SEU in a typical microprocessor

- Static combinational logic
- Residual unprotected memory
- Flipflops



#### Many protection options proposed

- Adding Parity/ ECC to logic arrays and Register Files
- Replication of functional units or cores
- Lock-step for cores or complete chips
- Residue Checking
- Redundant multithreading
- Fingerprinting
- Modified Scan latches
- Hardening of worst contributing latches
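Of the options listed, residue checking is compact enough to sketch. The idea is to carry a small residue alongside each operand and compare it against the residue of the result. The modulus of 3, the adder as the protected unit, and the fault injection hook are assumptions for illustration, not details from the talk.

```python
MODULUS = 3  # assumed check modulus for this sketch

def checked_add(a: int, b: int, inject_fault_bit=None):
    """Add two non-negative integers and verify the sum with a residue check.
    inject_fault_bit optionally flips one bit of the sum to mimic an upset."""
    result = a + b
    if inject_fault_bit is not None:
        result ^= 1 << inject_fault_bit   # simulated transient error
    # The residue of a correct sum equals the sum of the operand residues.
    expected_residue = (a % MODULUS + b % MODULUS) % MODULUS
    ok = result % MODULUS == expected_residue
    return result, ok

print(checked_add(1234, 5678))                      # fault free
print(checked_add(1234, 5678, inject_fault_bit=5))  # one bit flipped
```

A single flipped bit changes the sum by a power of two, and no power of two is divisible by 3, so this check catches every single-bit error in the result. Some multi-bit errors can still escape it.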

What is the goal that we are trying to meet? What is the value proposition for HPC?



## Recap of fail types & trends

| Fail type | Trend | Solution space |
|-----------|-------|----------------|
| Hard fails | Flat | • Continued process improvements<br>• Architectural fault tolerance (2) |
| Parametric degradation | Increasing | • Continued process improvements<br>• Guard-bands<br>• Architectural fault tolerance |
| Intermittent | ?? | • Continued process improvements<br>• Architectural fault tolerance (2) |
| Transient (noise) | Increasing | • Continued process improvements<br>• Guard-bands<br>• Architectural fault tolerance (1) |
| Transient (ionizing radiation) | Increasing to flat | • Improved estimation methodology<br>• Hardening of critical elements<br>• Architectural fault tolerance (1) |

- (1) Such as EDAC (at the local circuit, component or system level)
- (2) Requires self-diagnostics and redundancy

## Fault Tolerance trends/ needs

- Conservatism in technology to eliminate errors to many sigma will cut into performance
- Fault tolerant schemes will allow a few errors to occur by providing the means to detect and correct them
  - Minimal to no impact to the customer
- The continuation of Moore's Law makes transistors plentiful and enables much broader thinking in fault tolerance
  - Local circuit and functional circuit block level
  - Multi/many-core availability
  - Complement hardware/chip strategies with platform system strategies
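To make "detect and correct" concrete at the local circuit level, here is a minimal sketch of a single-error-correcting code. The talk does not name a specific code; the textbook Hamming(7,4) code is used here as an assumed stand-in, and the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword laid out as
    [p1, p2, d1, p4, d2, d3, d4] (positions 1 to 7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(code):
    """Correct up to one flipped bit; return (data_bits, error_position).
    error_position is 0 when no error was seen, else 1 to 7."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # points at the flipped position
    if syndrome:
        c[syndrome - 1] ^= 1          # repair the single-bit error
    return [c[2], c[4], c[5], c[6]], syndrome

word = [1, 0, 1, 1]
codeword = hamming74_encode(word)
codeword[4] ^= 1                      # simulate a single-bit upset
recovered, position = hamming74_decode(codeword)
print(recovered == word, "error at position", position)
```

The three extra bits per four data bits are the overhead in transistors that buys the correction, which is the kind of trade that plentiful transistors make affordable. Two flips in the same codeword would be miscorrected by this code.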



## Help needed from HPC experts

 What are the RAS Requirements for various categories and uses of HPC?

- Are there agreed targets that can guide us?
- Can a \$ value be assigned to them?

 How can System architecture and Software help and complement the effort at the component level?



## Key Messages

- Technology scaling continues according to Moore's Law
  - 2X increase in functionality every 2 years
  - In the form of cores, integrated functionality or both
  - 65nm in 2005, 45nm 2007, 32nm 2009
- Technology & Reliability Challenges are many, but so are the opportunities
  - Many new device types and materials
  - A challenge as well as an opportunity
- High RAS will require global fault management strategies along with robust circuit design
  - Better understanding needed on RAS requirements
  - Research and Cost effectiveness of proposed options





