

#### Energy Efficient Computing: From Bits to Buildings

Horst D. Simon, Lawrence Berkeley National Laboratory and EECS Dept., UC Berkeley, hdsimon@lbl.gov

The Salishan Conference on High-Speed Computing, April 29, 2009



#### Acknowledgements

A large number of individuals have contributed to energy efficiency in computing at Berkeley Lab and UC Berkeley, and to this presentation:

David Bailey (CRD), Michael Banda (CRD), Michael Bennett (ITD), Shoaib Kamil (CRD), Jonathan Koomey (Stanford), Randy Katz (EECS), Tsu Jae King (EECS), Chuck McParland (CRD), Bruce Nordman (EETD), Lenny Oliker (CRD), Ekow Otoo (CRD), Vern Paxson (UCB/ICSI/CRD), Doron Rotem (CRD), Dale Sartor (EETD), John Shalf (NERSC), Erich Strohmaier (CRD), Bill Tschudi (EETD), Howard Walter (NERSC), Michael Wehner (CRD), Kathy Yelick (NERSC/CRD) ... and many others

Almost all Berkeley resources about energy efficiency are available at

http://www.lbl.gov/CS/html/energy%20efficient%20computing.html



#### **Energy "Spaghetti" Chart**



# Power has become an industry-wide issue for computing

Two interrelated issues:

- Building and infrastructure problem -- continued increase in demand for computing ("buildings")
- Computer technology problem -- no more power density scaling ("bits")



### Why does saving energy matter?



#### Energy Consumption in the United States 1949 - 2005

[Figure: U.S. energy consumption, vertical axis 0 to 200, with the actual and projected curves diverging after 1973]

- Actual: E/GDP drops 2.1% per year; new physical supply = 25 Q
- If E/GDP had dropped 0.4% per year: avoided supply = 70 Quads in 2005
- Dollar annotations on the chart: \$1.7 trillion and \$1.0 trillion
- 70 Quads per year saved or avoided corresponds to 1 billion cars off the road

Source: Art Rosenfeld, California Energy Commission, http://www.energy.ca.gov/commission/commissioners/rosenfeld\_docs/index.html
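The avoided-supply figure on this chart is compound-rate arithmetic, and a rough check can be sketched as follows. The 1973 start year, the 2005 end year, and the value of about 100 Quads for actual 2005 consumption are readings of the chart axis and should be treated as assumptions; the function name is illustrative only.

```python
def counterfactual_energy(actual_quads, years, actual_drop, slow_drop):
    """Energy use if E/GDP had fallen at slow_drop instead of actual_drop.

    GDP is taken to be identical in both scenarios, so the ratio of energy
    use equals the ratio of the two compounded intensity factors.
    """
    actual_factor = (1.0 - actual_drop) ** years
    slow_factor = (1.0 - slow_drop) ** years
    return actual_quads * slow_factor / actual_factor


# Assumed chart readings: 1973 to 2005, about 100 Quads actually used in 2005.
YEARS = 2005 - 1973
ACTUAL_2005_QUADS = 100.0

would_have_used = counterfactual_energy(ACTUAL_2005_QUADS, YEARS, 0.021, 0.004)
avoided = would_have_used - ACTUAL_2005_QUADS
print(f"counterfactual: {would_have_used:.0f} Quads, avoided: {avoided:.0f} Quads")
```

Under these assumptions the avoided supply comes out a little above 70 Quads, in line with the value quoted on the chart.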


**An Honest Question?** 

# Does the HPC community really care about reducing the carbon footprint?

## NO!



#### **HPC Interests**

- Energy efficiency in computer rooms
  - Spend more resources on computing than on infrastructure
- Energy efficient technology
  - Maintain performance growth and get things done that could not be done before



#### **Khazzoom-Brookes Postulate**

- Energy efficiency at the micro-level leads to higher energy consumption at the macro-level
  - cheaper energy increases use
  - increased energy efficiency leads to economic growth
  - increased efficiency in one bottleneck resource increases use of companion technologies
- HPC follows Khazzoom-Brookes



### **Energy and IT**

- "Big IT" all electronics
  - PCs / etc., consumer electronics, telephony
    - Residential, commercial, industrial
  - More than 200 TWh/year
  - \$16 billion/year
    - Based on \$0.08/kWh
  - Nearly 150 million tons of CO<sub>2</sub> per year
    - Roughly equivalent to 30 million cars!

One central baseload power plant (about 7 TWh/yr)



**Numbers represent U.S. only**
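The figures on this slide are linked by simple unit conversions, which the following sketch makes explicit. The inputs are the slide's own numbers; treating "tons" as metric tons of 1000 kg is an assumption, and the derived emission intensity and per-car values are implied by the slide rather than stated on it.

```python
# Inputs taken from the slide (U.S. only).
BIG_IT_TWH_PER_YEAR = 200.0   # "More than 200 TWh/year"
PRICE_USD_PER_KWH = 0.08      # electricity rate used on the slide
CO2_MILLION_TONS = 150.0      # "Nearly 150 million tons of CO2 per year"
CARS_MILLION = 30.0           # "Roughly equivalent to 30 million cars"
PLANT_TWH_PER_YEAR = 7.0      # one central baseload power plant

KWH_PER_TWH = 1e9
KG_PER_TON = 1000.0           # assumption: metric tons

total_kwh = BIG_IT_TWH_PER_YEAR * KWH_PER_TWH
cost_billion_usd = total_kwh * PRICE_USD_PER_KWH / 1e9
equivalent_plants = BIG_IT_TWH_PER_YEAR / PLANT_TWH_PER_YEAR
kg_co2_per_kwh = CO2_MILLION_TONS * 1e6 * KG_PER_TON / total_kwh
tons_co2_per_car = CO2_MILLION_TONS / CARS_MILLION

print(f"annual cost: ${cost_billion_usd:.0f} billion")
print(f"equivalent baseload plants: {equivalent_plants:.1f}")
print(f"implied intensity: {kg_co2_per_kwh:.2f} kg CO2/kWh")
print(f"implied emissions per car: {tons_co2_per_car:.0f} tons CO2/year")
```

The cost line reproduces the \$16 billion/year on the slide; the plant count shows that "Big IT" corresponds to a few dozen baseload plants rather than one.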



#### ... and IT electricity use is increasing

Data taken from: Jonathan Koomey, "Estimating Total Power Consumption by Servers in the U.S. and the World". Available at: http://www.koomey.com/publications.html





#### **Worldwide IT Carbon Footprint**

IT footprints: emissions by sub-sector, 2020

- 2007 worldwide IT carbon footprint: 2% = 830m tons CO<sub>2</sub>, comparable to the global aviation industry
- Expected to grow to 4% by 2020
- Total emissions: 1.43bn tonnes CO<sub>2</sub> equivalent
- Sub-sector values shown in the figure: 820m, 360m and 260m tons CO<sub>2</sub>



#### **2020 IT Carbon Footprint**

#### "SMART 2020: Enabling the Low Carbon Economy in the Information Age", The Climate Group

Fig. 2.3 The global footprint by subsector



#### Datacenters: Owned by single entity interested in reducing opex



# Power has become an industry-wide issue for computing

Two interrelated issues:

- Building and infrastructure problem -- continued increase in demand for computing ("buildings")
- Computer technology problem -- no more power density scaling ("bits")



## **Absolute Power Levels**







#### **The Problem**



Unrestrained IT power consumption could eclipse hardware costs and put great pressure on affordability, data center infrastructure, and the environment.

Source: Luiz André Barroso (Google), "The Price of Performance," *ACM Queue*, Vol. 2, No. 7, pp. 48-53, September 2005 (Modified with permission)



#### **Top Challenges to Clusters**







#### Responses

- Cloud
- Containerized data centers
- Large scale data "factories"
- Increased emphasis on computer room and building efficiency



#### **Containerized Datacenter Mechanical-Electrical Design**




#### **Data Center Economic Reality (2006)**

- June 2006 Google begins building a new data center near the Columbia River on the border between Washington and Oregon
  - Because the location is "at the intersection of cheap electricity and readily accessible data networking"

"Hiding in Plain Sight, Google Seeks More Power" by John Markoff, NYT, June 14, 2006

- Microsoft and Yahoo are building big data centers upstream in Wenatchee and Quincy, Wash.
  - To keep up with Google, which means they need cheap electricity and readily accessible data networking

Source: New York Times, June 14, 2006



Google Dalles Oregon Facility 68,680 Sq Ft Per Pod



Source: Levy and Snowhorn, Data Center Power Trends, February 18, 2008











#### Microsoft Quincy, Wash. 470,000 Sq Ft, 47MW!



Source: Levy and Snowhorn, Data Center Power Trends, February 18, 2008



#### Microsoft's Chicago Modular Datacenter



#### The Million Server Datacenter

- 24000 sq. m housing 400 containers
  - Each container contains 2500 servers
  - Integrated computing, networking, power, cooling systems
- 300 MW supplied from two power substations situated on opposite sides of the datacenter
- Dual water-based cooling systems circulate cold water to containers, eliminating need for air conditioned rooms
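The slide's title and per-server power follow directly from its bullet points, as this small sketch shows. All inputs come from the slide; note that the 300 MW is the supplied site power, so the per-server figure includes cooling and distribution overhead rather than server draw alone.

```python
CONTAINERS = 400
SERVERS_PER_CONTAINER = 2500
SITE_POWER_MW = 300.0
FLOOR_AREA_M2 = 24000.0

servers = CONTAINERS * SERVERS_PER_CONTAINER
# All-in site power divided by server count (includes infrastructure overhead).
watts_per_server = SITE_POWER_MW * 1e6 / servers
kw_per_m2 = SITE_POWER_MW * 1e3 / FLOOR_AREA_M2

print(f"servers: {servers:,}")
print(f"site power per server: {watts_per_server:.0f} W")
print(f"power density: {kw_per_m2:.1f} kW per square metre")
```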



#### Potential Benefits of Improved Data Center Energy Efficiency:

- 20-40% savings typically possible
- Aggressive strategies can yield better than 50% savings
- Extend life and capacity of existing data center infrastructures
- But is my center good or bad?





#### Benchmarking for Energy Performance Improvement:

Energy benchmarking can allow comparison to peers and help identify best practices

LBNL conducted studies of over 30 data centers:

- Found wide variation in performance
- Identified best practices





#### High Level Metric: Data Center Infrastructure Efficiency (DCiE)

Ratio of Electricity Delivered to IT Equipment to Total
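As a minimal sketch, DCiE can be computed from two meter readings. The function names and the example readings (1000 kW of IT load in a 1600 kW facility) are hypothetical; the reciprocal relationship to PUE follows from the definition of PUE as total facility power divided by IT equipment power.

```python
def dcie(it_power_kw: float, total_facility_power_kw: float) -> float:
    """Fraction of facility electricity that reaches the IT equipment."""
    if total_facility_power_kw <= 0:
        raise ValueError("total facility power must be positive")
    if not 0 <= it_power_kw <= total_facility_power_kw:
        raise ValueError("IT power must lie between 0 and total facility power")
    return it_power_kw / total_facility_power_kw


def pue_from_dcie(dcie_value: float) -> float:
    """PUE is the reciprocal of DCiE."""
    if dcie_value <= 0:
        raise ValueError("DCiE must be positive")
    return 1.0 / dcie_value


# Hypothetical readings, for illustration only.
example = dcie(it_power_kw=1000.0, total_facility_power_kw=1600.0)
print(f"DCiE = {example:.3f}, PUE = {pue_from_dcie(example):.2f}")
```

A higher DCiE is better: a value of 1.0 would mean every kilowatt entering the building reaches the IT equipment.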





# Using benchmark results to find best practices:

- Air management
- Right-sizing
- Central plant optimization
- Efficient air handling
- Liquid cooling
- Free cooling
- Humidity control
- Improve power chain
- On-site generation
- Design and M&O processes





# UC's Computational Research and Theory (CRT) Facility

#### **Berkeley Weather**





### **Use Free Cooling:**

- Water-side Economizers
  - No contamination question
  - Can be in series with chiller
- Outside-Air Economizers
  - Can be very effective (24/7 load)
  - Must consider humidity



#### **System Design Approach:**

- Air-Side Economizer (93% of hours)
- Direct Evaporative Cooling for Humidification/ precooling
- Low Pressure-Drop Design (1.5" total static)



#### **Hours of Operation**

| Mode   | Operation          | Hours |
|--------|--------------------|-------|
| Mode 1 | 100% Economiser    | 2207  |
| Mode 2 | OA + RA            | 5957  |
| Mode 3 | Humidification     | 45    |
| Mode 4 | Humid + CH cooling | 38    |
| Mode 5 | CH only            | 513   |
| Total  |                    | 8760  |
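The hours in this table can be checked against the 93% economizer figure quoted for the system design. The sketch below uses the table's values; reading "OA + RA" as outside air mixed with return air and "CH" as chiller operation is an assumption about the abbreviations.

```python
# Annual operating hours per mode, from the table.
hours = {
    "100% Economiser": 2207,
    "OA + RA": 5957,
    "Humidification": 45,
    "Humid + CH cooling": 38,
    "CH only": 513,
}

total_hours = sum(hours.values())
economizer_hours = hours["100% Economiser"] + hours["OA + RA"]
chiller_hours = hours["Humid + CH cooling"] + hours["CH only"]

economizer_share = economizer_hours / total_hours
chiller_share = chiller_hours / total_hours

print(f"total: {total_hours} h")
print(f"economizer modes: {economizer_share:.1%}")
print(f"modes using the chiller: {chiller_share:.1%}")
```

The first two modes together account for about 93% of the 8760 hours in a year, which matches the air-side economizer share stated in the design approach.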



#### Water Cooling: Four-pipe System

- Allows multiple temperature feeds at server locations through mixing of CHW & TRW
- Closed-loop treated cooling water from cooling towers (via heat exchanger)
- Chilled water from chillers
- Headers, valves and caps for modularity and future flexibility

### **Predicted CRT**

# Performances based on annual energy

DCiE of 0.88 based on peak power





## **Design Guidelines Are Available**

- Design Guides were developed based upon the observed best practices
- Guides are available through PG&E and LBNL websites
- Self benchmarking protocol also available

http://hightech.lbl.gov/datacenters.html





## **Links to Get Started**

DOE Website: Sign up to stay up to date on new developments www.eere.energy.gov/datacenters

Lawrence Berkeley National Laboratory (LBNL) http://hightech.lbl.gov/datacenters.html



LBNL Best Practices Guidelines (cooling, power, IT systems) http://hightech.lbl.gov/datacenters-bpg.html

ASHRAE Data Center technical guidebooks http://tc99.ashraetcs.org/

The Green Grid Association – White papers on metrics http://www.thegreengrid.org/gg\_content/

Energy Star® Program http://www.energystar.gov/index.cfm?c=prod\_development.server\_efficiency

Uptime Institute white papers www.uptimeinstitute.org



### TALK TO DALE: Join his network to share information and pull the market towards higher-efficiency products

Contact Information: Dale Sartor, P.E., Lawrence Berkeley National Laboratory, Applications Team, MS 90-3111, University of California, Berkeley, CA 94720

DASartor@LBL.gov, (510) 486-5988, http://Ateam.LBL.gov





**Power consumption has become an industry-wide issue for computing**

Two interrelated issues:

- Building and infrastructure problem -- continued increase in demand for computing
- Computer technology problem -- no more power density scaling ("bits")



## **An Early Warning**

- Presented by Shekhar Borkar in Berkeley in November 2000



## Power will be a problem



## **Power density will increase**



## **Traditional Sources of Performance Improvement are Flat-Lining (2004)**

- New Constraints
  - 15 years of exponential clock rate growth has ended
- Moore's Law reinterpreted:
  - How do we use all of those transistors to keep performance increasing at historical rates?
  - Industry Response: #cores per chip doubles every 18 months *instead* of clock frequency!
  - multicore





## **Estimated Exascale Power Requirements**

- LBNL IJHPCA Study for ~1/5 Exaflop for Climate Science
  - Extrapolation of Blue Gene and AMD design trends
  - Estimate: 20 MW for BG and 179 MW for AMD
- DOE E3 Report
  - Extrapolation of existing design trends to exascale in 2016
  - Estimate: 130 MW
- DARPA Study
  - More detailed assessment of component technologies for exascale system
  - Estimate: more than 120 MW
- The current approach is not sustainable!
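To put these power estimates in operating-cost terms, the sketch below converts megawatts into an annual electricity bill. It reuses the \$0.08/kWh rate quoted earlier in this talk and assumes continuous full-load operation for 8760 hours per year, which is an idealization; the function name is illustrative. The LBNL figures refer to a system of about 1/5 Exaflop.

```python
HOURS_PER_YEAR = 8760
PRICE_USD_PER_KWH = 0.08  # rate quoted earlier in the talk


def annual_cost_million_usd(power_mw, price_usd_per_kwh=PRICE_USD_PER_KWH):
    """Yearly electricity cost in millions of dollars at constant load."""
    kwh_per_year = power_mw * 1000.0 * HOURS_PER_YEAR
    return kwh_per_year * price_usd_per_kwh / 1e6


estimates_mw = {
    "LBNL study, Blue Gene trend": 20,
    "LBNL study, AMD trend": 179,
    "DOE E3 report": 130,
    "DARPA study (lower bound)": 120,
}

for name, megawatts in estimates_mw.items():
    print(f"{name}: {megawatts} MW -> ${annual_cost_million_usd(megawatts):.0f}M per year")
```

At this rate each megawatt of continuous load costs roughly \$0.7 million per year, so a system above 100 MW implies an electricity bill approaching \$100 million annually.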



## **DARPA Exascale Study**

- Commissioned by DARPA to explore the challenges for Exaflop computing
- Two models for future performance growth
  - Simplistic: ITRS roadmap; power for memory grows linearly with # of chips; power for interconnect stays constant
  - Fully scaled: same as simplistic, but memory and router power grow with peak flops per chip



## We won't reach Exaflops with this approach



# ... and the power costs will still be staggering



From Peter Kogge, DARPA Exascale Study



## **Extrapolating to Exaflop/s in 2018**

|  | BlueGene/L (2005) | Exaflop directly scaled | Exaflop compromise using expected technology | Assumption for "compromise guess" |
|---|---|---|---|---|
| Node Peak Perf | 5.6GF | 20TF | 20TF | Same node count (64k) |
| Hardware concurrency/node | 2 | 8000 | 1600 | Assume 3.5GHz |
| System Power in Compute Chip | 1 MW | 3.5 GW | 35 MW | 100x improvement (very optimistic) |
| Link Bandwidth (each unidirectional 3-D link) | 1.4Gbps | 5 Tbps | 1 Tbps | Not possible to maintain bandwidth ratio. |
| Wires per unidirectional 3-D link | 2 | 400 wires | 80 wires | Large wire count will eliminate high density and drive links onto cables where they are 100x more expensive. Assume 20 Gbps signaling |
| Pins in network on node | 24 pins | 5,000 pins | <u>1,000 pins</u> | 20 Gbps differential assumed. 20 Gbps over copper will be limited to 12 inches. Will need optics for in rack interconnects. 10Gbps now possible in both copper and optics. |
| Power in network | 100 KW | 20 MW | 4 MW | 10 mW/Gbps assumed. Now: 25 mW/Gbps for long distance (greater than 2 feet on copper) for both ends one direction. 45mW/Gbps optics both ends one direction. + 15mW/Gbps of electrical. Electrical power in future: separately optimized links for power. |
| Memory Bandwidth/node | 5.6GB/s | 20TB/s | 1 TB/s | Not possible to maintain external bandwidth/Flop |
| L2 cache/node | 4 MB | 16 GB | 500 MB | About 6-7 technology generations |
| Data pins associated with memory/node | 128 data pins | 40,000 pins | <u>2000 pins</u> | 3.2 Gbps per pin |
| Power in memory I/O (not DRAM) | 12.8 KW | 80 MW | 4 MW | 10 mW/Gbps assumed. Most current power in address bus. Future probably about 15mW/Gbps maybe get to 10mW/Gbps (2.5mW/Gbps is c*v^2*f for random data on data pins) Address power is higher. |
| QCD CG single iteration time | 2.3 msec | 11 usec | 15 usec | Requires: 1) fast global sum (2 per iteration) 2) hardware offload for messaging (Driverless messaging) |

Source: David Turek, IBM
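Some rows of this table can be cross-checked from the others. The sketch below recomputes system peak performance and memory I/O power from the per-node entries; interpreting "64k" as 65,536 nodes is an assumption, and the function names are illustrative.

```python
NODES = 64 * 1024  # assumption: "64k" read as 65,536 nodes


def system_peak_exaflops(node_peak_teraflops):
    """Aggregate peak performance of all nodes, in Exaflop/s."""
    return node_peak_teraflops * 1e12 * NODES / 1e18


def memory_io_power_mw(data_pins_per_node, gbps_per_pin, milliwatts_per_gbps):
    """System-wide memory I/O power in MW from the per-pin signaling cost."""
    watts_per_node = data_pins_per_node * gbps_per_pin * milliwatts_per_gbps / 1000.0
    return watts_per_node * NODES / 1e6


print(f"system peak: {system_peak_exaflops(20):.2f} Exaflop/s")
print(f"memory I/O, directly scaled: {memory_io_power_mw(40000, 3.2, 10):.0f} MW")
print(f"memory I/O, compromise: {memory_io_power_mw(2000, 3.2, 10):.1f} MW")
```

With 3.2 Gbps per pin and 10 mW/Gbps, the results land close to the 80 MW and 4 MW entries in the "Power in memory I/O" row.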

## **Power Efficiency related to Processors**





## **Green Flash:** Ultra-Efficient Climate Modeling

- Project by Shalf, Oliker, Wehner and others at LBNL
- An alternative route to exascale computing
  - Target specific machine designs to answer a scientific question
  - Use of new technologies driven by the consumer market.



# Ultra-Efficient "Green Flash" Computing at NERSC: 100x over Business as Usual

Radically change HPC system development via application-driven hardware/software co-design

- Achieve 100x power efficiency and 100x capability of mainstream HPC approach for targeted high-impact applications
- Accelerate development cycle for exascale HPC systems
- Approach is applicable to numerous scientific applications
- Proposed pilot application: Ultra-high resolution climate change simulation



#### Path to Power Efficiency Reducing Waste in Computing

- Examine methodology of low-power embedded computing market
  - optimized for low power, low cost and high computational efficiency

"Years of research in low-power embedded computing have shown only one design technique to reduce power: reduce waste."

— Mark Horowitz, Stanford University & Rambus Inc.

- Sources of waste
  - Wasted transistors (surface area)
  - Wasted computation (useless work/speculation/stalls)
  - Wasted bandwidth (data movement)
  - Designing for serial performance









## Design for Low Power: More Concurrency



- Cubic power improvement with lower clock rate due to V<sup>2</sup>F
  - Slower clock rates enable use of simpler cores
  - Simpler cores use less area (lower leakage) and reduce cost
- Tailor design to application to <u>reduce waste</u>

This is how iPhones and MP3 players are designed to maximize battery life
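The cubic relationship can be illustrated with the usual dynamic-power model, in which power scales with V²F. The sketch below additionally assumes that supply voltage can be lowered in proportion to clock frequency and that the workload parallelizes perfectly; both are idealizations, and the function names are illustrative.

```python
def relative_dynamic_power(freq_scale, cores=1):
    """Dynamic power relative to one core at full clock, using P ~ V^2 * F.

    Idealized assumption: supply voltage scales in proportion to frequency,
    so per-core power scales with the cube of the frequency.
    """
    voltage_scale = freq_scale
    return cores * voltage_scale ** 2 * freq_scale


def relative_throughput(freq_scale, cores=1):
    """Aggregate throughput relative to one full-speed core (perfect scaling)."""
    return cores * freq_scale


# Two half-speed cores versus one full-speed core.
print(relative_throughput(0.5, cores=2))     # 1.0
print(relative_dynamic_power(0.5, cores=2))  # 0.25
```

Under these assumptions, two cores at half the clock rate deliver the same throughput as one full-speed core at a quarter of the dynamic power, which is the case for more concurrency at lower clock rates.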



## **Low Power Design Principles**




- IBM Power5 (server)
  - 120W@1900MHz
  - Baseline
- Intel Core2 sc (laptop)
  - 15W@1000MHz
  - 4x more FLOPs/watt than baseline
- IBM PPC 450 (BG/P low power)
  - 0.625W@800MHz
  - 90x more
- Tensilica XTensa (Moto Razr)
  - 0.09W@600MHz
  - 400x more

Even if each core operates at 1/3 to 1/10th efficiency of largest chip, you can pack 100s more cores onto a chip and consume 1/20 the power
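A crude way to see where these ratios come from is to compare clock rate per watt across the four cores. This is only a proxy: the slide quotes FLOPs per watt, which also depends on how many floating-point operations each core completes per cycle, so the sketch below should be read as an approximation under that stated simplification.

```python
# (watts, clock in MHz) for each core, as listed on the slide.
cores = {
    "IBM Power5": (120.0, 1900.0),
    "Intel Core2 sc": (15.0, 1000.0),
    "IBM PPC 450": (0.625, 800.0),
    "Tensilica XTensa": (0.09, 600.0),
}


def mhz_per_watt(watts, mhz):
    return mhz / watts


baseline = mhz_per_watt(*cores["IBM Power5"])
relative = {name: mhz_per_watt(*spec) / baseline for name, spec in cores.items()}

for name, ratio in relative.items():
    print(f"{name}: {ratio:.1f}x clock-per-watt versus Power5")
```

The proxy gives roughly 4x, 80x and 420x, the same order as the 4x, 90x and 400x FLOPs-per-watt advantages quoted on the slide.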



## Customization Continuum: Green Flash



- <u>Application-driven does NOT necessitate a special purpose machine</u>
- MD-Grape: Full custom ASIC design
  - 1 Petaflop performance for one application using 260 kW for \$9M
- D.E. Shaw Anton System: Full and Semi-custom design
  - Simulate 100x–1000x timescales vs any existing HPC system (~200kW)
- Application-Driven Architecture (Green Flash): Semicustom design
  - Highly programmable core architecture using C/C++/Fortran
  - Goal of 100x power efficiency improvement vs general HPC approach
  - Better understand how to build/buy application-driven systems
  - Potential: 1km-scale model (~200 Petaflops peak) running in O(5 years)







## **Green Flash Strawman System Design**

We examined three different approaches (in 2008 technology)

Computation 0.015° x 0.02° x 100L: 10 PFlops sustained, ~200 PFlops peak

- AMD Opteron: Commodity approach, lower efficiency for scientific applications offset by cost efficiencies of mass market
- BlueGene: Generic embedded processor core and customized system-on-chip (SoC) to improve power efficiency for scientific applications
- Tensilica XTensa: Customized embedded CPU w/SoC provides further power efficiency benefits but maintains programmability

| Processor | Clock | Peak/Core (Gflops) | Cores/Socket | Sockets | Cores | Power | Cost 2008 |
|---|---|---|---|---|---|---|---|
| AMD Opteron | 2.8GHz | 5.6 | 2 | 890K | 1.7M | 179 MW | \$1B+ |
| IBM BG/P | 850MHz | 3.4 | 4 | 740K | 3.0M | 20 MW | \$1B+ |
| Green Flash / Tensilica XTensa | 650MHz | 2.7 | 32 | 120K | 4.0M | 3 MW | \$75M |
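The comparison in this table can be condensed into two derived quantities: power per core and power relative to the Green Flash design. The sketch below uses the socket counts and power figures from the table; the dictionary layout and variable names are illustrative.

```python
designs = {
    "AMD Opteron": {"cores_per_socket": 2, "sockets": 890_000, "power_mw": 179},
    "IBM BG/P": {"cores_per_socket": 4, "sockets": 740_000, "power_mw": 20},
    "Green Flash": {"cores_per_socket": 32, "sockets": 120_000, "power_mw": 3},
}

reference_mw = designs["Green Flash"]["power_mw"]
summary = {}
for name, d in designs.items():
    core_count = d["cores_per_socket"] * d["sockets"]
    summary[name] = {
        "cores": core_count,
        "watts_per_core": d["power_mw"] * 1e6 / core_count,
        "power_vs_green_flash": d["power_mw"] / reference_mw,
    }

for name, s in summary.items():
    print(f"{name}: {s['cores'] / 1e6:.2f}M cores, "
          f"{s['watts_per_core']:.2f} W/core, "
          f"{s['power_vs_green_flash']:.1f}x Green Flash power")
```

The core counts computed from sockets differ slightly from the rounded "Cores" column, and the commodity design draws about 60 times the power of the Green Flash strawman for a comparable number of cores.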

#### Climate System Design Concept Strawman Design Study



## **Green Flash Hardware Demo at SC08**

- Demonstrated during SC '08
- Proof of concept
  - CSU atmospheric model ported to Tensilica Architecture
  - Single Tensilica processor running atmospheric model at 50MHz
- Emulation performance advantage
  - Processor running at 50MHz vs. functional model at 100 kHz
  - 500x Speedup


Actual code running - not representative benchmark







## Silicon Photonics for Energy-Efficient Communication



- Silicon photonics enables optics to be integrated with conventional CMOS
- Enables up to 27x improvement in communication energy efficiency!




## Summary

- Power consumption is a huge problem in HPC
  - "Bits": we may not be able to scale to Exaflops without new technologies
  - "Buildings": we may have to spend more \$\$ on infrastructure and less on computing







## Outline

1. Power consumption has become an industry-wide issue for computing
2. Building and computer room energy efficiency
3. Computer architecture for energy efficiency: the Green Flash project
4. Future



## **Processor Technology Trend**

- 1990s: R&D computing hardware dominated by desktop/COTS
  - Had to learn how to use COTS technology for HPC
- 2010: R&D investments moving rapidly to consumer electronics/embedded processing
  - Must learn how to leverage embedded processor technology for future HPC systems

[Figure: Market in Japan (B\$)]





#### Consumer Electronics Convergence



#### **Consumer Electronics has Replaced PCs as the Dominant Market Force in CPU Design!!**








#### Power fundamentals 2018--2020

- Processor budget: **15 MW** for a sustained HPL Exaflops (10 pJ/op) {250}
- Memory budget: **25-50 MW** (25 pJ/op) {300} [1/2 Byte/sec/Flops]
- Interconnect budget: **50 MW** (5 pJ/op) [0.1 B/F] {30}
- I/O budget: **5 MW** (5 pJ/byte) 1 petabyte/sec
- Power and cooling budget @30%: **30 MW**

## **Total Power required 125 MW!**
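The total can be reproduced by adding up the component budgets, as in the sketch below. The component values are taken from the slide; the code only makes visible that the 125 MW headline corresponds to the low end of the memory range.

```python
# Component budgets in MW, from the slide.
fixed_budgets_mw = {
    "processor": 15,
    "interconnect": 50,
    "io": 5,
    "power_and_cooling": 30,
}
memory_range_mw = (25, 50)  # the slide gives a range for memory

fixed_total = sum(fixed_budgets_mw.values())
total_low_mw = fixed_total + memory_range_mw[0]
total_high_mw = fixed_total + memory_range_mw[1]

print(f"total with 25 MW memory: {total_low_mw} MW")
print(f"total with 50 MW memory: {total_high_mw} MW")
```

With the memory budget at its upper bound, the same breakdown adds up to 150 MW rather than 125 MW.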



### Power Ranking and How Not to do it!

- To rank objects by
  - Weight or Volume
  - Rmax (TOP500)
    - A 'larger' syster
- The ratio of 2 exter
  - (weight/volumne =
  - Performance / Pow
- One cannot 'rank' objects with densities BY SIZE:
  - Density does not tell anything about size of an object
  - A piece of lead is not heavier or larger than one piece of wood.
- Linpack (sub-linear) / Power (linear) will always sort smaller systems before larger ones!
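The last point can be demonstrated with a toy model in which Linpack performance grows slightly sub-linearly with node count while power grows linearly. The scaling exponent of 0.95 and the node counts are hypothetical values chosen for illustration, not measurements.

```python
def linpack_performance(nodes, per_node=1.0, exponent=0.95):
    """Toy model: performance grows sub-linearly with system size."""
    return per_node * nodes ** exponent


def power(nodes, per_node=1.0):
    """Toy model: power grows linearly with system size."""
    return per_node * nodes


def efficiency(nodes):
    """Performance per unit power, the 'density' being ranked."""
    return linpack_performance(nodes) / power(nodes)


system_sizes = [16384, 64, 1024]
ranked = sorted(system_sizes, key=efficiency, reverse=True)
print(ranked)  # smallest system first
for n in ranked:
    print(f"{n:>6} nodes: efficiency {efficiency(n):.3f}")
```

Because efficiency in this model is proportional to the node count raised to a negative power, identical hardware ranks lower simply by being part of a larger system.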





## The Transition to Low-Power Technology is Inevitable

## Does it make sense to build systems that require the electric power equivalent of an aluminum smelter?

- Information "factories" are only affordable for a few government labs and large commercial companies (Google, MSN, Yahoo ...)
  - Midrange installations will soon hit the 1-2 MW wall, requiring costly new installations
  - Economics will change if operating expenses of a server exceed acquisition cost
- The industry will switch to low-power technology within 2-3 years
- Embedded processors or game processors will be the next step (BG, Cell, Nvidia, SiCortex, Tensilica)
  - Example: Roadrunner (RR), first Petaflops system



## **Absolute Power Levels**



#### **Power Efficiency related to Processors**



#### Frequencies and Power Efficiency

Power rating is 80 Watts each!

#### **Maximum Power Efficiency of Harpertown E54xx**



## **Most Power Efficient Systems**



## **Convergence of Platforms**

- Multiple parallel general-purpose processors (GPPs)
- Multiple application-specific processors (ASPs)



## BG/L—the Rise of the Embedded Processor

#### **TOP 500 Performance by Architecture**





## Summary (1)

- LBNL has taken a comprehensive approach to the power-in-computing problem
  - Component level (investigate use of low-power components and build new system)
  - System level (measuring and understanding energy consumption of systems)
  - Computer Room level (understand airflow and cooling technology)
  - Building Level (enforce rigorous energy standards in new computer building and use of innovative energy savings technology)



## Summary (2)

- Economic factors are already driving us toward more energy-efficient solutions in computing
- Incremental improvements are well on track, but we may ultimately need revolutionary new technology to reach the Exaflop/s level and beyond



## Outline

1. Power consumption has become an industry-wide issue for computing
2. Building and computer room energy efficiency
3. Computer architecture for energy efficiency: the Green Flash project
4. Towards a better understanding of "green computing"



## **Focus on PUE**

- PUE = "power usage effectiveness" metric promoted by "Green Grid"
- PUE = total facility power/ computer equipment power
- Reduce PUE by consistent application of facilities improvements

| Scenario            | PUE |
|---------------------|-----|
| Current Trends      | 1.9 |
| Improved Operations | 1.7 |
| Best Practices      | 1.3 |
| State-of-the-Art    | 1.2 |
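To show what these PUE values mean in practice, the sketch below applies the definition above to a fixed IT load. The 1000 kW IT load is a hypothetical example and the function names are illustrative; the PUE values are those in the table.

```python
SCENARIOS = {
    "Current Trends": 1.9,
    "Improved Operations": 1.7,
    "Best Practices": 1.3,
    "State-of-the-Art": 1.2,
}

IT_LOAD_KW = 1000.0  # hypothetical IT equipment load


def facility_power_kw(it_load_kw, pue):
    """Total facility power implied by PUE = total power / IT power."""
    return it_load_kw * pue


def overhead_fraction(pue):
    """Share of total facility power that does not reach IT equipment."""
    return (pue - 1.0) / pue


for name, pue in SCENARIOS.items():
    total_kw = facility_power_kw(IT_LOAD_KW, pue)
    print(f"{name}: {total_kw:.0f} kW total, {overhead_fraction(pue):.0%} overhead")

baseline_kw = facility_power_kw(IT_LOAD_KW, SCENARIOS["Current Trends"])
best_kw = facility_power_kw(IT_LOAD_KW, SCENARIOS["State-of-the-Art"])
saving = 1.0 - best_kw / baseline_kw
print(f"facility power saved, current trends to state-of-the-art: {saving:.0%}")
```

For the same computing load, moving from a PUE of 1.9 to 1.2 cuts total facility power by a little over a third.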

