Symposium Talks Agenda (Draft) - Tuesday, October 13, 2009
7:30 AM-8:20 AM
Breakfast
Welcome: Adolfy Hoisie (LANL)
Opening Remarks: Alan Bishop (LANL)
Presentation Slides (pdf)
8:20 AM
Keynote Speaker: Prof. Randal Bryant (Carnegie Mellon University), Data-Intensive Scalable Computing: Taking Google-Style Computing Beyond Web Search
Web search engines have become fixtures in our society, but few people realize that they are actually publicly accessible supercomputing systems, where a single query can unleash the power of several hundred processors operating on a data set of over 200 terabytes. With Internet search, computing has risen to entirely new levels of scale, especially in terms of the sizes of the data sets involved. Google and its competitors have created a new class of large-scale computer systems, which we label "Data-Intensive Scalable Computer" (DISC) systems. DISC systems differ from conventional supercomputers in that their focus is on data: they acquire and maintain continually changing data sets, in addition to performing large-scale computations over the data.
With the massive amounts of data arising from such diverse sources as telescope imagery, medical records, online transaction records, and web pages, DISC systems have the potential to achieve major advances in science, health care, business, and information access. DISC opens up many important research topics in system design, resource management, programming models, parallel algorithms, and applications. It points toward new ways of organizing large-scale computing systems that are more robust, scalable, and cost effective than current high-performance computing systems.
Presentation Slides (pdf)
9:10 AM
Bette Korber (LANL), Problems in HIV Biology Solved by Parallel Computing
We have confronted many issues in HIV biology through interdisciplinary collaborative efforts among people with strong computational, biological, and statistical backgrounds. While the problems we are dealing with are not always "capability scale", parallel computing has nonetheless been critical to addressing them. In this talk I will briefly describe the biology behind some of the problems we have addressed with parallel computing, and their application to HIV, including: phylogenetic tree reconstruction, finding signature patterns in diverse HIV protein sequences that correlate with biological phenotypes, vaccine design, and clustering immunological data into statistically meaningful patterns. I will also briefly discuss results from one of our first ventures into a new DNA sequencing methodology, ultradeep sequencing, where we have been able to characterize in extraordinary detail how HIV evolves in vivo while evading the host immune response.
Presentation Slides (pdf)
9:50 AM
Jacek Becla (Stanford University), Data-Intensive Scientific Computing: Requirements & Solutions
Peta-scale data volumes and the growing complexity of scientific analytics require new approaches: the underlying systems must scale out, data has to be distributed, and approaches to fault tolerance, concurrency, performance tuning, and I/O optimization need to be reconsidered from the ground up; the list is very long. In addition, scientific data is often highly correlated, ordered, uncertain, and exhibits adjacency properties.
The talk will cover key requirements of data-intensive scientific computing, with examples from BaBar, LSST, and other collaborations, and contrast them with traditional HPC as well as with the needs of data-intensive industrial users. It will discuss emerging trends and solutions, including MapReduce and shared-nothing MPP DBMSs, with a particular focus on SciDB, a new open-source DBMS for scientific research.
Presentation Slides (pdf)
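The MapReduce model mentioned above can be sketched in a few lines: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This is a minimal single-process sketch of the programming model, not any particular framework's API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # mapper: emit a (word, 1) pair for every word in one input split
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # group intermediate values by key, as the framework does
    # between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reducer: aggregate (here, sum) the values for each key
    return {key: sum(values) for key, values in groups.items()}

docs = ["data intensive computing", "data driven science"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = reduce_phase(shuffle(pairs))
```

In a real shared-nothing deployment the map and reduce calls run on different nodes and the shuffle moves data across the network; the structure of the computation is the same.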
10:30 AM
Coffee Break
11:00 AM
Xian-He Sun (Illinois Institute of Technology), Reevaluating Amdahl's Law in the Multicore Era
Multicore architecture has become the trend in high-performance processors. While it is generally accepted that we have entered the multicore era, concerns remain about when, or whether, we will move into the manycore stage. Recently, Hill and Marty presented a pessimistic view of multicore scalability, citing Amdahl's law and the memory-wall problem. The technology is available, but major vendors are hesitant to build processors with a large number of cores. This is a very interesting phenomenon, in which history seems to repeat the scalability debate of parallel processing that occurred 20 years ago. In this introductory talk we first review the history and concepts of scalable computing, then examine current technologies and the memory-wall problem. We then use the same hardware cost model of multicore chips used by Hill and Marty to introduce two performance models from the scalable-computing point of view. These models show that there is no inherent, immovable upper bound on the scalability of multicore architectures. Finally, we conclude with proposed solutions to the memory-wall problem that make the potential scalability of multicore reachable in practice.
Presentation Slides (pdf)
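The Hill-Marty cost model the talk builds on can be sketched directly: a chip budget of n base core equivalents (BCEs) is divided into cores of r BCEs each, and single-core performance is assumed to follow Pollack's rule, perf(r) = sqrt(r). This is a sketch of the published symmetric-chip model only, not of the speaker's extended performance models:

```python
import math

def perf(r):
    # Pollack's-rule assumption: a core built from r base core
    # equivalents (BCEs) delivers sqrt(r) times base performance
    return math.sqrt(r)

def symmetric_speedup(f, n, r):
    """Hill-Marty speedup of a symmetric multicore chip.

    f: parallelizable fraction of the workload
    n: total chip budget in BCEs
    r: BCEs spent on each core (the chip has n // r cores)
    """
    cores = n // r
    sequential = (1 - f) / perf(r)      # serial part, one core
    parallel = f / (perf(r) * cores)    # parallel part, all cores
    return 1.0 / (sequential + parallel)

# With a 256-BCE budget and f = 0.99, compare many small cores
# against a few large ones.
results = {r: symmetric_speedup(0.99, 256, r) for r in (1, 4, 16, 64, 256)}
```

Under this model the best core size depends on f: highly parallel workloads favor many small cores, while a large serial fraction favors fewer, fatter cores.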
11:40 AM-1:00 PM
Lunch
Luncheon Speaker: Jim Ahrens (LANL), Data-Intensive Applications on Numerically-Intensive Supercomputers
With the advent of the era of petascale supercomputing, via the delivery of the Roadrunner supercomputing platform at Los Alamos National Laboratory, there is a pressing need to address the problem of visualizing massive petascale-sized results. In this presentation, I discuss progress on a number of approaches, including in-situ analysis, multi-resolution out-of-core streaming, and interactive rendering on the supercomputing platform. These approaches are placed in context by the emerging area of data-intensive supercomputing.
1:40 PM
Maya Gokhale (LLNL), Application-driven Evaluation of HPC Architectures for Data Analytics
Using a common Information Retrieval computation, the Term-Frequency Inverse-Document-Frequency measure, we analyze three architectural alternatives: the commodity data cluster, popularized by the Hadoop Map/Reduce framework; an FPGA co-processor; and the Tilera many-core co-processor. The architectures' throughput, energy-efficiency, and programmability are quantitatively evaluated. Trade-offs with respect to accuracy and precision are discussed.
Presentation Slides (pdf)
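The TF-IDF measure used in the evaluation follows directly from its definition: a term's weight in a document is its frequency there, scaled down by how many documents contain it. A minimal sketch of the measure itself; the talk's implementations on Hadoop, an FPGA, and the Tilera processor are of course far more involved:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute a {term: weight} dict for each document.

    docs: list of token lists. tf is the term's share of its document;
    idf is log(N / number of documents containing the term).
    """
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({term: (count / len(doc)) * math.log(n / df[term])
                        for term, count in tf.items()})
    return weights

docs = [["data", "intensive", "computing"],
        ["data", "cluster"],
        ["fpga", "computing"]]
weights = tfidf(docs)
```

A term that appears in every document gets weight zero (idf = log 1), which is exactly the behavior that makes the measure useful for ranking.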
2:20 PM
Randal Burns (Johns Hopkins University), TurbulenceDB: A Data-Intensive Architecture for the Analysis of Multiscale Fluid Simulations
We describe TurbulenceDB, a database cluster that stores the complete space-time history of high-resolution, multiscale fluid simulations for retrospective analysis. This includes the evolution of the architecture of the database cluster from the commodity database nodes on which it is currently deployed toward low-power, Amdahl-balanced blades.
We also describe the key components of systems software that allow for community access to TurbulenceDB over the Internet. These include our strategies for data distribution, indexing, I/O scheduling, and data-driven batch scheduling. Finally, we will describe how the archival approach of TurbulenceDB has enabled new types of analyses and discuss how to enhance its utility by integrating TurbulenceDB with HPC platforms.
Presentation Slides (pdf)
3:00 PM
Jeffrey Vitter (Texas A&M University), Searching Document Collections for the Most Relevant Documents
The world is drowning in data. A key challenge is how to make use of this data in a meaningful way. This talk addresses how to search huge document collections and, for a given query pattern, find the most relevant documents that contain that pattern. "Most relevant" may mean the documents that contain the greatest number of instances of the pattern, or the documents with the highest PageRank (as used by Google). Inverted indexes do not handle general pattern search. Suffix trees and suffix arrays support general pattern search, but they are too expensive in terms of space usage; in addition, they require finding every occurrence of the pattern, which can be very expensive when the number of pattern occurrences is much larger than the number of documents. We improve upon the results of Muthukrishnan with a linear-space data structure that yields optimal time performance. We also develop a more succinct search structure whose size is proportional to the size of the documents when compressed using a high-order entropy method.
Joint work with Wing-Kai Hon and Rahul Shah.
Presentation Slides (pdf)
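The query the talk targets, the k most relevant documents containing a pattern, can be stated as a naive baseline that scans every document and counts occurrences. The point of the succinct suffix-based structures above is to answer the same query without such a scan; the function names and "count of occurrences" scoring here are illustrative, not from the paper:

```python
import heapq

def count_occurrences(text, pattern):
    # overlapping occurrence count via repeated find()
    count, i = 0, text.find(pattern)
    while i != -1:
        count += 1
        i = text.find(pattern, i + 1)
    return count

def top_k_documents(docs, pattern, k):
    """Return ids of the k documents with the most occurrences of
    `pattern`, most relevant first; documents without the pattern
    are excluded."""
    scores = [(count_occurrences(doc, pattern), i)
              for i, doc in enumerate(docs)]
    return [i for c, i in heapq.nlargest(k, scores) if c > 0]
```

The baseline's cost grows with the total text size on every query, whereas the talk's linear-space structure answers it in time proportional to the pattern length plus the number of reported documents.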
3:30 PM
Coffee Break
4:10 PM
Bruce Hendrickson (SNL), Data Analytics and High Performance Computing: When Worlds Collide
Advanced data analysis is finding myriad roles in scientific research, commerce, and national security. High-performance computing has an important role to play in this field when response time matters, yet many aspects of traditional approaches to HPC are poorly suited to data-centric applications. This talk will detail this mismatch and discuss possible consequences for the future.
Presentation Slides (pdf)
4:50 PM
John Feo (PNNL), Graphs are not Grids
The structure and dynamics of the grids used in scientific applications differ from those of the graphs used in analytic applications. The former tend to have a regular structure, add and delete edges locally, be large-world, and show small variation in edge count per node. The latter tend to have no regular structure, add and delete edges between any two nodes, be small-world, and have a power-law distribution of edge count per node. Consequently, the systems and models of computation built for scientific applications are inappropriate for graph applications. In this talk, I explain the differences between grids and graphs, describe system features that accelerate graph algorithms, and present research being carried out in the Center for Adaptive Supercomputing Software at PNNL in support of graph applications.
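The degree-distribution contrast drawn above can be illustrated with two toy generators: a 2D mesh, where per-node edge counts vary by at most two, and a heavy-tailed sampler standing in for the power-law edge counts of an analytic graph. Both generators are hypothetical illustrations, not from the talk:

```python
import random

def grid_degrees(n):
    # node degrees in an n x n 2D mesh: 4 in the interior,
    # 3 on a boundary edge, 2 at a corner
    degrees = []
    for i in range(n):
        for j in range(n):
            degrees.append(4 - (i in (0, n - 1)) - (j in (0, n - 1)))
    return degrees

def powerlaw_degrees(m, alpha=2.5):
    # sample m node degrees from a heavy-tailed (Pareto) distribution:
    # most nodes have few edges, a few hubs have very many
    return [max(1, int(random.paretovariate(alpha - 1))) for _ in range(m)]
```

The mesh's near-uniform degrees make work partitioning and caching easy; the hubs in the power-law sample are what break locality-based assumptions in systems built for scientific grids.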
6:00 PM-7:00 PM
Symposium Reception