Workshops - Wednesday, October 14
Workshop on Performance Analysis of Extreme-Scale Systems and Applications
Organizers
Adolfy Hoisie, Los Alamos National Laboratory
Jeff Hollingsworth, University of Maryland
Building extreme-scale parallel systems and applications that can achieve high performance is a dauntingly difficult task. Today's systems have complex processors, deep memory hierarchies, and complex I/O subsystems.
A particular focus this year is I/O and performance. I/O systems are becoming more complex with the addition of non-magnetic stable storage to the I/O hierarchy. In addition, the number of I/O nodes in leadership-class machines now approaches the number of compute nodes found in entire parallel machines only a few years ago.
Given this multi-disciplinary mix of performance and productivity concerns, this workshop examines their interplay across hardware, applications, and system software design. The invited speakers will not only cover these areas but will also address the state of the art in methodologies for performance analysis and optimization, including benchmarking, modeling, tool development, tuning and steering, as well as metrics for productivity.
Agenda
Click on a link to view the abstract and presentation.
7:30 - 8:30 | Continental Breakfast
8:15 - 8:30 | Welcome
I/O
8:30 - 9:00 | Crossing the Memory Wall: The Server Push Architecture
Xian-He Sun (Illinois Institute of Technology)
Data access is a known bottleneck of high-performance computing (HPC). The prime sources of this bottleneck are the performance gap between the processor and disk storage and the large memory requirements of ever-hungry applications. Although advanced memory hierarchies and parallel file systems have been developed in recent years, they only provide high bandwidth for contiguous, well-formed data streams, and they perform poorly when serving small and noncontiguous data requests. The problematic data-access wall remains after years of study and, in fact, is becoming probably the most notorious bottleneck of HPC. We propose a new I/O architecture for HPC. Unlike traditional I/O designs where data is stored and retrieved on request, our architecture is based on a novel "Server-Push" model in which a data access server proactively pushes data from a file server to the memory of compute nodes and makes application-specific data layout decisions automatically. In this talk, we present two successful designs of the server-push architecture, the post-analysis-based signature approach and the runtime compiler-supported pre-execution approach, and their encouraging implementation results in MPICH2. We also discuss the design and implementation of key components of the server-push architecture, such as data prefetching, data layout, and performance modeling.
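A rough sketch of the push model (not the authors' implementation), assuming a hypothetical predictor has already produced the access pattern: a server thread stages predicted blocks into compute-node memory before the application asks for them.

    import queue, threading, time

    def read_block(i):
        # stand-in for a file-server read; the sleep simulates storage latency
        time.sleep(0.001)
        return b"x" * 4096

    def push_server(predicted, buf):
        # the server proactively pushes predicted blocks toward the consumer
        for i in predicted:
            buf.put((i, read_block(i)))

    predicted = list(range(100))   # assumed access pattern, e.g. from a post-analysis signature
    buf = queue.Queue(maxsize=8)   # staging buffer in compute-node memory
    threading.Thread(target=push_server, args=(predicted, buf), daemon=True).start()

    for _ in predicted:
        i, data = buf.get()        # the block is usually already resident: no I/O stall
        # ... compute on data ...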
9:00 - 9:30 | Extreme Scale Data Intensive Computing at NERSC
Harvey Wasserman (Lawrence Berkeley Laboratory)
In keeping with the focus of this year's LACSS on data-intensive architectures and applications, this talk gives an overview of some high-profile data-intensive computing projects at NERSC, as well as some recent improvements to NERSC infrastructure made to support these projects.
Presentation (pdf)
9:30 - 10:00 | Scalable Methods for Performance and Power Data Collection and Analysis
Karen Karavanic (Portland State University)
Challenging science goals continue to push the high end of computing to larger and more complex systems. Coupled with this trend is increasing concern about the power consumption of data centers and computer laboratories, which in some cases matches or exceeds the resources required to power a small city. Together these trends drive the need for a new approach to parallel performance analysis that integrates traditional application-oriented performance data with measurements of the physical runtime environment. We have developed the needed infrastructure for combined evaluation of system, application, and machine-room performance in the high-end environment. We demonstrate the integration of measured performance data from the application, system, and physical room environment, and discuss the challenges encountered, with a particular focus on scalability.
Presentation (pdf)
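A toy illustration of this kind of integration (field names and numbers invented): application phase markers can be aligned with machine-room power samples by timestamp.

    import bisect

    power  = [(0.0, 210.0), (1.0, 390.5), (2.0, 401.2), (3.0, 215.3)]  # (sec, watts), invented
    phases = [(0.5, "setup"), (1.2, "solve"), (2.8, "teardown")]       # (sec, phase), invented

    times = [t for t, _ in power]
    for t, phase in phases:
        j = max(bisect.bisect_right(times, t) - 1, 0)   # nearest preceding power sample
        print(f"{phase:>9}: ~{power[j][1]:.0f} W at t = {t} s")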
10:00 - 10:30 | Coffee Break
10:30 - 11:00 | Parallel I/O Performance: From Events to Ensembles
Leonid Oliker (Lawrence Berkeley Laboratory)
Parallel I/O is fast becoming a bottleneck to the research agendas of many users of extreme-scale parallel computers. The principal cause is the concurrency explosion of high-end computation, coupled with the difficulty of providing parallel filesystems that perform reliably at such scales. More than just being a bottleneck, parallel I/O performance at scale is notoriously variable, influenced by numerous factors inside and outside the application, making it extremely difficult to isolate cause and effect for performance events. In this talk, we first present I/O performance analysis across a broad spectrum of parallel filesystems using a lightweight, portable benchmark called MADbench2. Next we examine a statistical approach to understanding I/O behavior that moves from the analysis of performance events to the exploration of performance ensembles. Using this methodology, we examine two I/O-intensive scientific computations from cosmology and climate science, and demonstrate that our approach can help identify application and middleware performance deficiencies, resulting in more than 4x runtime improvement for both examined applications.
Presentation (pdf)
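The move "from events to ensembles" can be caricatured in a few lines: treat the per-call times collected across ranks and runs as a distribution and characterize its tail, rather than chasing any single slow event (sample values invented).

    import statistics

    # hypothetical per-call write times (seconds) gathered across ranks and runs
    samples = [0.80, 0.90, 0.85, 0.88, 7.40, 0.87, 0.91, 6.90, 0.86]

    med = statistics.median(samples)
    tail = [s for s in samples if s > 5 * med]
    print(f"median = {med:.2f} s; {len(tail)}/{len(samples)} calls exceed 5x the median")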
Benchmarking and Modeling
11:00 - 11:30 | Performance Analysis of Intel Nehalem Based Clusters
Subhash Saini (NASA Ames Research Center)
Large-scale clusters based on commodity components must have balanced computational power, memory performance (latency and bandwidth), and interconnect performance to achieve good scalability. The Intel Nehalem processor offers some important initial steps toward ameliorating the memory latency and bandwidth problem. It has overcome problems associated with the sharing of the front-side bus (FSB) in previous processor generations by integrating an on-chip memory controller and by connecting the two processors through the Intel QuickPath Interconnect (QPI). For power management it provides Turbo mode technology, a frequency-stepping mode that enables the processor frequency to be increased in increments of 133 MHz. For optimal utilization of processor resources it provides hyper-threading technology, which enables two threads to execute on each core to hide latencies related to data access. For larger systems, interconnect latency, bandwidth, and topology have an increasing impact, as a higher fraction of communication, remote memory access, and I/O is done over the network. For scalability, intra-node and inter-node communication must be balanced for both small and large messages, especially for collectives. Identifying and understanding the interplay among the application codes, software stacks (MPI libraries, I/O, OS, firmware, etc.), and hardware components (processor, memory, HCA, QDR, etc.) is necessary to achieve scalability. A critical evaluation of these features with full-scale scientific applications is presented.
Presentation (pdf)
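The Turbo stepping is simple arithmetic; a minimal illustration (the base frequency and the number of active turbo steps below are assumptions, not figures from the talk):

    base_ghz = 2.93          # example Nehalem-EP base frequency (assumed)
    step_mhz = 133           # Turbo mode increment quoted in the abstract

    for bins in range(3):    # hypothetical number of active turbo steps
        print(f"{bins} steps -> {base_ghz + bins * step_mhz / 1000:.3f} GHz")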
11:30 - 12:00 | Early Experience with Dash - A Supercomputer for Data Intensive Computing Based on Flash
Allan Snavely (San Diego Supercomputer Center)
12:00 - 1:30 | Lunch Break
1:30 - 2:00 | A Performance Analysis and Comparison of Roadrunner, Blue Gene/P, and an AMD Barcelona/InfiniBand Cluster
Kei Davis (LANL)
We present a performance analysis and comparison of three major tri-lab ASC computing platforms: LANL's Roadrunner, LLNL's Blue Gene/P Dawn, and LANL's version of the tri-lab capacity cluster (TLCC), Lobo. As ever, our interest is not primarily peak performance or Linpack performance, but performance on applications of interest, in this case to DOE. We give an overview of these machines' architectures, their low-level performance characteristics, their performance and scaling behavior with applications of interest (both measured and modeled), and various optimization issues.
Presentation (pdf)
2:00 - 2:30 | Toward Scalable Implementations of Particle Methods for Physics and Data Analysis Applications
Rich Vuduc (Georgia Tech)
Tools
2:30 - 3:00 | Gaining Insight into Parallel Program Performance using Sampling
John Mellor-Crummey (Rice University)
Event-based sampling is a useful technique for gaining insight into program performance on scalable parallel systems. First, it can be used to pinpoint performance bottlenecks without any preconceived hypothesis about their nature or location. Second, its overhead and data collection rates can be directly controlled, which can make it practical even for extreme-scale parallelism. Third, it can provide deep insight into a wide range of performance losses in parallel programs. This talk will describe strategies for using sampling-based measurement to gain insight into the performance of optimized parallel programs on parallel systems composed of multicore processors. We will describe techniques for gaining insight into scalability bottlenecks within and across nodes in parallel systems, pinpointing sources of lock contention in multithreaded programs, understanding the temporal behavior of parallel programs, and pinpointing sources of inadequate parallelism in multicore runtime systems based on work stealing.
Presentation (pdf)
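A minimal, Unix-only sketch of the timer-driven sampling the talk builds on (real tools sample full call stacks and hardware counters): a periodic signal attributes each CPU-time tick to whatever code it interrupts, with overhead set by the sampling interval.

    import collections, signal

    hits = collections.Counter()

    def sample(signum, frame):
        # attribute this CPU-time tick to the interrupted function
        if frame is not None:
            hits[frame.f_code.co_name] += 1

    signal.signal(signal.SIGPROF, sample)
    signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)   # one sample per 10 ms of CPU time

    def busy():
        s = 0
        for i in range(10**7):
            s += i * i
        return s

    busy()
    signal.setitimer(signal.ITIMER_PROF, 0)            # stop sampling
    print(hits.most_common(3))                         # hottest code, no instrumentation needed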
3:00 - 3:30 | Coffee Break
3:30 - 4:00 | ScalaTrace: Ultra-Scalable Tracing, Analysis, and Modeling of HPC Codes
Frank Mueller (North Carolina State University)
Characterizing the communication behavior of large-scale applications is a difficult and costly task due to code/system complexity and long execution times. An alternative to running actual codes is to gather their communication traces and then replay them, which facilitates application tuning and future procurements. While past approaches lacked lossless scalable trace collection, we contribute an approach that produces communication traces orders of magnitude smaller, if not near constant in size, regardless of the number of nodes, while preserving structural information. We introduce intra- and inter-node compression techniques for MPI events, we develop a scheme to preserve the time and causality of communication events, and we present results of our implementation for BlueGene/L. Given this novel capability, we discuss its impact on communication tuning, multi-level I/O tracing, and trace extrapolation, and their implications for exascale computing. To the best of our knowledge, such a concise, scalable representation of MPI traces combined with time-preserving deterministic MPI call replay is without precedent.
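Not ScalaTrace itself, but a toy version of the core observation: SPMD communication traces are dominated by repeated patterns, so storing a (pattern, repeat count) pair can shrink a trace by orders of magnitude. The event stream below is invented.

    def compress(trace):
        # greedy search for a single repeating window; the real intra-/inter-node
        # compression described above is far more general
        n = len(trace)
        for w in range(1, n // 2 + 1):
            if n % w == 0 and trace == trace[:w] * (n // w):
                return trace[:w], n // w
        return trace, 1

    # hypothetical per-rank event stream from an iterative SPMD code
    trace = ["MPI_Isend", "MPI_Irecv", "MPI_Waitall"] * 1000
    pattern, count = compress(trace)
    print(pattern, "x", count)   # a 3-event pattern repeated 1000 times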
4:00 - 4:30 | Assigning Blame
Jeff Hollingsworth (University of Maryland)
Parallel programs are increasingly being written using programming frameworks and other environments that allow parallel constructs to be programmed with greater ease. The data structures used allow the modeling of complex mathematical structures, such as linear systems and partial differential equations, using high-level programming abstractions. While this allows programmers to model complex systems in a more intuitive way, it also makes the debugging and profiling of these systems more difficult, owing to the complexity of mapping the high-level abstractions down to low-level parallel programming constructs. This talk discusses our mapping mechanism, called variable blame, for creating these mappings and using them to assist in the profiling and debugging of programs created using advanced parallel programming techniques. I also describe a prototype implementation of the system.
Presentation (pdf)
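In spirit, such a mapping aggregates low-level costs up to the high-level variables that caused them; a toy ledger illustrating the idea (all names and costs invented, not the tool's actual mechanism):

    # each low-level event is charged to the high-level variable
    # whose abstraction generated it
    events = [
        ("malloc",        "A_matrix",      0.4),   # seconds
        ("MPI_Allreduce", "residual_norm", 1.7),
        ("memcpy",        "A_matrix",      0.9),
    ]

    blame = {}
    for op, variable, cost in events:
        blame[variable] = blame.get(variable, 0.0) + cost

    # report variables by how much low-level cost maps back to them
    for var, cost in sorted(blame.items(), key=lambda kv: -kv[1]):
        print(f"{var}: {cost:.1f}s")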
4:30 - 5:00 | Discussion and Closing
Workshop on Novel Computing Architectures
Organizing Committee
Zachary K. Baker, Los Alamos National Laboratory
Justin Tripp, Los Alamos National Laboratory
Maya Gokhale, Lawrence Livermore National Laboratory
Multimedia processors and FPGAs have become mainstream solutions in
application domains requiring intensive computations over large data
sets. What is beyond GPU, Cell, and FPGA? Is massive parallelism the
magic bullet that could revolutionize performance in the face of
slowing clock rates and bandwidth limits?
The Novel Computing Architectures workshop at the Los Alamos Computer
Science Symposium (LACSS) seeks to explore future computing models and
their realization in architectures capable of extreme performance (in
non-cryogenic environments!). We will explore emerging research in
non-traditional architectures and new approaches to old problems.
Where will the industry be in 5-10 years? Will it be more of the same
(but slower and more parallel) or are there revolutionary computing
approaches/devices in our future?
Topics may include:
- Future general purpose computing architectures
- Addressing the memory wall
- Moving computation to the data
- Nano-computing
- Quantum Computing
- Biologically-inspired or facilitated computing
The half-day workshop will include several invited speakers and
several papers chosen from submitted abstracts.
See http://www.lanl.gov/conferences/lacss/2009/ for more details, or contact zbaker@lanl.gov.
Agenda
7:30 - 8:30 | Continental Breakfast
8:15 - 8:30 | Welcome
8:30 - 9:00 | Peter Athanas (VT), "Elemental Computing with the ElementCXI ECA"
Abstract (pdf)
9:00 - 9:30 | Ron Sass (UNC), "Lessons Learned from the RCC Project: What Might a 'Real' All-FPGA Cluster Look Like?"
Abstract (pdf)
9:30 - 10:00 | Daniel Creveling (LANL), "Neural Computing for Distributed Sensor Networks"
10:00 - 10:15 | Coffee Break
10:15 - 10:45 | Andreas Olofsson (Adapteva), "A Manycore Coprocessor Architecture for Heterogeneous Computing"
Abstract (pdf)
10:45 - 11:15 | Ron Minnich (Sandia National Lab, Livermore), "Worms and Botnets as Computing Engines"
11:15 - 12:00 | Bob Benner/Uzoma Onunkwo (SNL), "Tilera Hash Tables"
Abstract (pdf)
12:00 - 1:30 | Lunch Break
Workshop on HPC Resiliency
Workshop general co-chairs:
Stephen L. Scott, Oak Ridge National Laboratory
Chokchai (Box) Leangsuksun, eXtreme Computing Research Group, Louisiana Tech University
Program chair:
Christian Engelmann, Oak Ridge National Laboratory
Program Committee:
Greg Bronevetsky, Lawrence Livermore National Laboratory
Franck Cappello, UIUC-INRIA Joint Laboratory on PetaScale Computing
Nathan Debardeleben, Los Alamos National Laboratory
Ann Gentile, Sandia National Laboratories
Frank Mueller, North Carolina State University
Jon Stearley, Sandia National Laboratories
Geoffroy Vallee, Oak Ridge National Laboratory
Recent trends in high-performance computing (HPC) systems have clearly indicated that future increases in performance, beyond those resulting from improvements in single-processor performance, will be achieved through corresponding increases in system scale, i.e., using a significantly larger component count. As the raw computational performance of the world's fastest HPC systems increases from today's terascale to next-generation petascale capability and beyond, their number of computational, networking, and storage components will grow from the ten to one hundred thousand compute nodes of today's systems to several hundreds of thousands of compute nodes and more in the foreseeable future. This substantial growth in system scale, and the resulting component count, poses a challenge for HPC system and application software with respect to fault tolerance and resilience.
Furthermore, recent experiences on extreme-scale HPC systems with non-recoverable soft errors, i.e., bit flips in memory, cache, registers, and logic, have added another major source of concern. The probability of such errors grows not only with system size, but also with increasing architectural vulnerability caused by employing accelerators, such as FPGAs and GPUs, and by shrinking nanometer technology. Reactive fault tolerance technologies, such as checkpoint/restart, are unable to handle high failure rates due to the associated overheads, while proactive resiliency technologies, such as preemptive migration, simply fail because random soft errors cannot be predicted. Moreover, soft errors may even remain undetected, resulting in silent data corruption.
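The checkpoint/restart overhead argument can be made concrete with the well-known Young/Daly first-order approximation for the optimal checkpoint interval, tau ≈ sqrt(2 · delta · M), where delta is the cost of writing one checkpoint and M is the system mean time between failures. A short illustration with invented numbers:

    from math import sqrt

    delta = 600.0                      # checkpoint cost in seconds (assumed)
    for mtbf_hours in (24.0, 4.0, 1.0):
        m = mtbf_hours * 3600
        tau = sqrt(2 * delta * m)      # Young/Daly optimal checkpoint interval
        print(f"MTBF {mtbf_hours:>4.0f} h -> checkpoint every ~{tau / 60:.0f} min, "
              f"~{100 * delta / tau:.0f}% of time spent checkpointing")

As the MTBF shrinks from a day to an hour, the fraction of time lost to checkpointing grows from a few percent to nearly a third, which is why reactive techniques alone cannot absorb high failure rates.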
The goal of the HPC Resiliency Summit is to bring together experts in the area of fault tolerance and resiliency for high-performance computing from national laboratories and universities to present their achievements and to discuss the challenges ahead. The secondary goal is to raise awareness in the HPC community about existing solutions, ongoing and planned work, and future research and development needs. The workshop program consists of a series of invited talks by experts and a round table discussion.
Web sites
Los Alamos Computer Science Symposium 2009
HPC Resiliency Summit at Los Alamos Computer Science Symposium 2009
Program topics:
- Current system and application resiliency
- Application-level fault handling
- MPI-level fault handling
- System-level checkpoint/restart
- System-level preemptive migration
- System health monitoring
- System log analysis
- System failure analysis
- HPC resiliency standards
- Soft error issues
- Computational redundancy concepts
- Resiliency for HPC file/storage systems
Agenda
7:30 - 8:30 | Continental Breakfast
8:15 - 8:30 | Welcome
8:30 - 9:30 | Keynote: Resilience Challenges
John T. Daly, U.S. Department of Defense
9:30 - 10:00 | Increasing Fault Resiliency in a Message-Passing Environment
Rolf Riesen, Sandia National Laboratories
10:00 - 10:30 | Coffee Break
10:30 - 11:00 | Transparent Process-Level Fault Tolerance for MPI: Challenges and Solutions
Frank Mueller, North Carolina State University
Process-Level Fault Tolerance for Job Healing in HPC Environments
Abstract:
As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Frequently deployed checkpoint/restart mechanisms generally require a complete restart. Yet some node failures can be anticipated by detecting a deteriorating health status in today's systems, which can be exploited by proactive fault tolerance (FT). Our work proposes novel, scalable mechanisms in support of proactive FT and significant enhancements to reactive FT. The contributions are three-fold. First, we provide a transparent job-pause service allowing live nodes to remain active and roll back to the last checkpoint while failed nodes are dynamically replaced by spares before resuming from the last checkpoint. Second, we complement reactive FT with proactive FT through a process-level live migration mechanism that supports continued execution of an application during much of the migration. Third, we develop incremental checkpointing techniques that capture only data changed since the last checkpoint to reduce the cost of reactive FT.
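A rough sketch of the incremental idea, using hash-based change detection with invented block sizes (production systems typically rely on page protection or dirty-bit tracking instead):

    import hashlib

    def dirty_blocks(data, prev_hashes, blocksize=4096):
        # hash each block; only blocks whose hash changed need to be rewritten
        dirty, hashes = [], []
        for off in range(0, len(data), blocksize):
            h = hashlib.sha256(data[off:off + blocksize]).digest()
            hashes.append(h)
            i = off // blocksize
            if i >= len(prev_hashes) or prev_hashes[i] != h:
                dirty.append(off)
        return dirty, hashes

    state = bytearray(64 * 4096)                 # the application's checkpointable state
    _, hashes = dirty_blocks(state, [])          # first checkpoint writes everything
    state[5 * 4096] = 1                          # the application then touches one page
    dirty, hashes = dirty_blocks(state, hashes)
    print("blocks to write:", [off // 4096 for off in dirty])   # -> [5]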
Biography:
Dr. Stephen L. Scott is a Senior Research Scientist and team leader of the System Software Research Team in the Computer Science Group of the Computer Science and Mathematics Division at the Oak Ridge National Laboratory (ORNL). Dr. Scott’s research interest is in experimental systems with a focus on high performance distributed, heterogeneous, and parallel computing. He is a founding member of the Open Cluster Group (OCG) and Open Source Cluster Application Resources (OSCAR). Within this organization, he has served as the OCG steering committee chair, as the OSCAR release manager, and as working group chair. Dr. Scott is the project lead principal investigator for the Reliability, Availability and Serviceability (RAS) for Petascale High-End Computing research team. This multi-institution research effort, funded by the Department of Energy – Office of Science, concentrates on adaptive, reliable, and efficient operating and runtime system solutions for ultra-scale scientific high-end computing (HEC) as part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS). Dr. Scott is also principal investigator of a project investigating techniques in virtualized system environments for petascale computing and is Co-PI of a related storage effort, funded by the National Science Foundation, which is investigating the advantages of storage virtualization in petascale computing environments. Dr. Scott serves on a number of scientific advisory boards and is presently serving as the chair of the international Scientific Advisory Committee for the European Commission’s XtreemOS project. Stephen has published over 100 peer-reviewed papers in the areas of parallel, cluster and distributed computing and holds both the Ph.D. and M.S. in computer science.
11:00 - 11:30 | Overview of the Scalable Checkpoint/Restart (SCR) Library
Adam Moody, Lawrence Livermore National Laboratory
A Coordinated Infrastructure for Fault Tolerant Systems (CIFTS)
Abstract:
The need for fault tolerance on leadership-class systems has steadily increased and continues to increase as emerging high-performance systems move toward petascale performance. While most high-end systems do provide mechanisms for detection, notification, and perhaps handling of hardware- and software-related faults, the individual components present in the system perform these actions separately. Knowledge about occurring faults is seldom shared between different programs and almost never on a system-wide basis. A typical system contains numerous programs that could benefit from such knowledge, including applications, middleware libraries, job schedulers, file systems, math libraries, monitoring software, operating systems, and checkpointing software. The Coordinated Infrastructure for Fault Tolerant Systems (CIFTS) initiative provides the foundation necessary to enable systems to adapt to faults in a holistic manner. CIFTS achieves this through the Fault Tolerance Backplane (FTB), a unified management and communication framework which any program can use to publish fault-related information. In this talk, I will present some of the work done by the CIFTS group towards the development of the FTB and FTB-enabled components.
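The publish/subscribe pattern at the heart of such a backplane can be sketched in a few lines (the topic names and API below are invented for illustration, not the FTB interface):

    subscribers = {}

    def subscribe(topic, handler):
        subscribers.setdefault(topic, []).append(handler)

    def publish(topic, event):
        # fan the fault event out to every component that registered interest
        for handler in subscribers.get(topic, []):
            handler(event)

    # a scheduler and an MPI library both react to the same disk warning
    subscribe("fault.disk", lambda e: print("scheduler: drain node", e["node"]))
    subscribe("fault.disk", lambda e: print("MPI lib: reroute traffic off", e["node"]))
    publish("fault.disk", {"node": "n0042", "severity": "warn"})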
Biography:
Rinku Gupta is a senior scientific developer at Argonne National Laboratory and the lead developer for the Fault Tolerance Backplane project. She received her MS degree in Computer Science from Ohio State University in 2002. She has several years of experience developing systems and infrastructure for enterprise high-performance computing. Her research interests primarily lie towards middleware libraries, programming models and fault tolerance in high-end computing systems.
11:30 - 12:00 | Designing Fault Resilient and Fault Tolerant Systems with InfiniBand
DK Panda, The Ohio State University
Towards Support for Fault Tolerance in the MPI Standard
Abstract:
As the number of components comprising computer systems has grown, so has the need to deal with component failure for applications to utilize the full capabilities of these systems. As we face an explosion in system size, it is important to consider fault-tolerance through the full stack, from the hardware clear to the application, if we are to use the full capabilities of these emerging systems. The MPI Forum is currently considering what changes to make to the MPI standard to deal with failure. This talk will present the direction being taken by the MPI Forum's Fault Tolerance working group for responding to failures.
Biography:
Gregory A. Koenig is an R&D Associate at Oak Ridge National Laboratory where his work involves developing scalable runtime systems and parallel tools for ultrascale-class parallel computers. His interests also include middleware for grid and on-demand/utility computing incorporating technologies such as virtualization, fault detection and avoidance, and resource scheduling. He holds a PhD (2007) and MS (2003) in computer science from the University of Illinois at Urbana-Champaign as well as three BS degrees (mathematics, 1996; electrical engineering technology, 1995; computer science, 1993) from Indiana University-Purdue University Fort Wayne.
12:00 - 1:30 | Lunch Break
1:30 - 2:00 | Adaptive Runtime Support for Fault Tolerance
Esteban Meneses, Celso Mendes, and Laxmikant Kale, University of Illinois at Urbana-Champaign
Studying Systems as Artifacts
Abstract:
Imperfections are an unavoidable characteristic of complex systems; the costs of these imperfections make it imperative for us to devise generic methods for effectively detecting and isolating them. Toward this end, we present a technique that infers the dependency structure of a system by looking for anomalous behavior correlated in time across components. I'll present some early results on a supercomputer and an autonomous vehicle, as well as provide a motivational survey of my work on system management: job scheduling, quality of service guarantees, checkpointing, and log analysis.
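A toy rendering of "anomalous behavior correlated in time" (all data invented): components whose anomaly windows overlap strongly are inferred to be dependent.

    # 1 marks an anomalous monitoring window for that component
    anomalies = {
        "fs_server":  [0, 1, 0, 1, 1, 0, 1, 0],
        "io_node_3":  [0, 1, 0, 1, 1, 0, 1, 0],
        "login_node": [1, 0, 0, 0, 1, 0, 0, 1],
    }

    def overlap(a, b):
        # Jaccard overlap of anomaly windows as a crude dependency score
        both = sum(1 for x, y in zip(a, b) if x and y)
        either = sum(1 for x, y in zip(a, b) if x or y)
        return both / either if either else 0.0

    names = list(anomalies)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            print(f"{a} <-> {b}: {overlap(anomalies[a], anomalies[b]):.2f}")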
Biography:
Adam Oliner is a third-year PhD student in the Computer Science Department at Stanford University, working with Alex Aiken. He is a DOE High Performance Computer Science Fellow and honorary Stanford Graduate Fellow. Before coming to Stanford, he earned a Master's of Engineering in electrical engineering and computer science at MIT, where he also received undergraduate degrees in computer science and mathematics. He interned several times at IBM with the Blue Gene/L system software team and spent a summer studying supercomputers logs at Sandia National Labs.
2:00 - 2:30 | Data Fusion and Statistical Analysis: Piercing the Darkness of the Black Box
Jim Brandt, Sandia National Laboratories
Combining System Characterization and Novel Execution Models to Achieve Scalable Robust Computing
Abstract:
New platforms are growing in both size and complexity, both within a node element and within the high-bandwidth, low-latency networks that provide the communication paths between node elements. Multi-core architectures add even more diversity to communication paths and contention for resources as the core count per socket continues to grow. Furthermore, the corresponding growth in component count contributes to an ever-shrinking system-wide mean time to component failure. Understanding the heterogeneous and hierarchical nature of the platform will allow better utilization of the underlying platform resources and better handling of failures or expected-failure situations.
This talk presents our ongoing work on using system characterization and resource-state monitoring and analysis, in conjunction with intelligent resource management and existing and new programming models, to make applications not only more resilient to system faults but also more efficient.
Biography:
Jim Brandt has been involved in research in high-performance computing platforms, performance optimization tools, and informatics for over 10 years. He is the lead of Sandia's OVIS (http://ovis.ca.sandia.gov) project, which is developing an open-source tool for intelligent real-time monitoring and analysis of large HPC clusters. OVIS has been used for analyzing system data from Sandia's Red Storm, Thunderbird, TLCC, and Talon clusters, as well as chemical sensor data in conjunction with Sandia's SNIFFER project. Jim's relevant workshop organization activities include: organizer of the 2006 tri-lab RAS workshop, chair of the 2008 Sandia Workshop on Data Mining and Data Analysis, and organizer of the 2007 Red Storm performance optimization workshop.
2:30 - 3:00 | Reliability-Aware Scalability Models for High Performance Computing
Ziming Zheng, Illinois Institute of Technology
Root Cause Analysis
Abstract:
Because the functional interdependencies among components are numerous, complex, and dynamic, determining the root cause of failures on HPC systems requires extensive knowledge, unwavering tenacity, and often a good "hunch". The difficulty of this task, however, grows on future systems not simply with the increasing number of components, but combinatorially with their interdependencies. Furthermore, as global checkpoint/restart overheads increase, so does the importance of a focused response to faults, which requires root cause determination. Consider a supercomputer as a graph where vertices are components (hardware or software), edges are dependencies (physical or functional), and labels are symptomatic factors (text, numeric thresholds, waveforms, etc.): is this model useful for determining the root cause of failures within HPC systems, to the benefit of human or automated responders?
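One simple inference the graph framing enables can be sketched directly (components, dependencies, and symptoms all invented): a root-cause candidate is a symptomatic component none of whose dependencies are themselves symptomatic.

    deps = {                       # component -> what it depends on
        "app":       ["mpi", "fs_client"],
        "mpi":       ["nic"],
        "fs_client": ["fs_server"],
        "fs_server": ["raid"],
        "nic":       [],
        "raid":      [],
    }
    symptomatic = {"app", "fs_client", "fs_server"}

    roots = [c for c in symptomatic
             if not any(d in symptomatic for d in deps.get(c, []))]
    print("root-cause candidates:", roots)   # -> ['fs_server']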
Biography:
Jon Stearley enjoys variety and challenge, vocationally ranging from electrical engineering, neuroimaging programming, infrastructure architecture, and resilient supercomputing. Having spent the majority of recent efforts on log analysis (http://www.cs.sandia.gov/sisyphus), he is currently seeking to expand his scope of system information to compute upon, focusing on novel methods to determine the root cause of failures.
3:00 - 3:30 | Coffee Break
3:30 - 4:00 | Scalable Fault Tolerance for Exascale HPC
Zizhong (Jeffrey) Chen, Colorado School of Mines
Accurate Prediction of Soft Error Vulnerability of Scientific Applications
Abstract:
Understanding the soft error vulnerability of supercomputer applications is critical as these systems are using ever larger numbers of devices that have decreasing feature sizes and, thus, increasing frequency of soft errors. As many large-scale parallel scientific applications use BLAS and LAPACK linear algebra routines, the soft error vulnerability of these methods constitutes a large fraction of the applications' overall vulnerability. This talk analyzes the vulnerability of these routines in the context of overall application error vulnerability. We develop a novel technique that uses vulnerability profiles of individual routines to model the propagation of errors through chained invocations of them. We use our propagation models to assemble vulnerability profiles of arbitrary scientific applications that are primarily composed of calls to BLAS and LAPACK. We demonstrate that the resulting application vulnerability profiles are highly accurate while having very low overhead.
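A deliberately crude illustration of composing per-routine profiles, under an independence assumption that the talk's propagation models go well beyond (all numbers invented):

    # P(a soft error striking this routine corrupts its output)
    profiles = {"dgemm": 0.30, "dgetrf": 0.45, "dtrsm": 0.20}
    chain = ["dgemm", "dgetrf", "dtrsm"]    # call sequence of a hypothetical application

    p_clean = 1.0
    for routine in chain:
        p_clean *= 1.0 - profiles[routine]
    print(f"P(corrupted final output) ~ {1.0 - p_clean:.2f}")   # ~0.69 for these numbers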
Biography:
Greg Bronevetsky graduated from Cornell University in 2006 under the direction of Keshav Pingali. He currently holds a Lawrence Post-doctoral Fellowship at the Lawrence Livermore National Laboratory. Greg's work focuses on compiler analyses for parallel applications and scalable fault tolerance techniques.
4:00 - 4:30 | Fault Tolerant Algorithms for Heat Transfer Problems
Hatem Ltaief, University of Tennessee, Knoxville
Modular Redundancy in HPC Systems: Why, Where, When and How?
Abstract:
In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation high-performance computing (HPC) systems. One major source of concern is non-recoverable soft errors, i.e., bit flips in memory, cache, registers, and logic. The probability of such errors grows not only with system size, but also with increasing architectural vulnerability caused by employing accelerators and by shrinking nanometer technology. Reactive fault tolerance technologies, such as checkpoint/restart, are unable to handle high failure rates due to the associated overheads, while proactive resiliency technologies, such as preemptive migration, simply fail because random soft errors cannot be predicted. This talk proposes a new, bold direction in resiliency for HPC: targeting resiliency for next-generation extreme-scale HPC systems at the system software level through computational redundancy strategies, i.e., dual- and triple-modular redundancy.
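The core of triple-modular redundancy fits in a few lines; a minimal sketch with a simulated soft error (error rate and workload invented):

    import random

    def tmr(f, x):
        # triple-modular redundancy: run three replicas and majority-vote the result
        a, b, c = f(x), f(x), f(x)
        if a == b or a == c:
            return a
        if b == c:
            return b
        raise RuntimeError("no majority: all three replicas disagree")

    def flaky_square(x):
        # a replica whose result is occasionally hit by a simulated bit flip
        r = x * x
        return r ^ 1 if random.random() < 0.1 else r   # ~10% soft-error rate (assumed)

    print(tmr(flaky_square, 12))   # almost always 144 despite single-replica flips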
Biography:
Christian Engelmann is an R&D Staff Member in the System Research Team of the Computer Science Research Group in the Computer Science and Mathematics Division at the Oak Ridge National Laboratory (ORNL). He holds an MSc in Computer Science from the University of Reading and an MSc in Computer Systems Engineering from the Technical College for Engineering and Economics (FHTW) Berlin. As part of his research activities at ORNL, Christian is currently pursuing a PhD in Computer Science at the University of Reading. His research aims at providing high-level reliability, availability, and serviceability for next-generation supercomputers to improve their resiliency (and ultimately efficiency) with novel high-availability and fault tolerance system software solutions. Another research area concentrates on "plug-and-play" supercomputing, where transparent portability eliminates most of the software modifications caused by diverse platforms and system upgrades.
4:30 - 5:00 | Discussion and Closing
Making Resilience a Reality Through a Resilience Consortium
Abstract:
The study of large-scale systems is challenging, and attempting to draw objective conclusions is even more difficult. To better understand these systems and provide meaningful information to the entire HPC community, some basic guidelines should be defined. From the data in the log files to the reports presented, a standard set of terminology and metrics with unified semantics should be introduced. There should also be cohesion among the various researchers and industry personnel to ensure that resilience research continues to grow. To initiate this process, a consortium of researchers and industry personnel has been formed. This talk will highlight some of the challenges encountered in performing resilience research and how we plan to address them through the resilience consortium.
Biography:
James Elliott is a PhD student at Louisiana Tech University studying under Dr. Box Leangsuksun. His interests lie in modeling and analyzing resilience mechanisms at various levels of the software stack.