Modern high-end computing designs combine large numbers of processors and/or nodes into an image of a single machine. Such computing environments are prone to periodic hardware failures and must be able to overcome these failures and still perform useful work. To minimize downtime, we are developing new techniques and algorithms that provide fault-tolerant and highly efficient task scheduling in multiprocessor computing systems.
One often-neglected measure of robustness is the timeliness of system operation. In addition to our fault-tolerance and scheduling work, we have developed useful approaches for designing and reasoning about systems with timeliness requirements, i.e., real-time systems.
Robustness is also a factor in the cost of owning a cluster, so we are looking into building clusters that minimize the "total cost of ownership / performance" ratio. Whereas the traditional "price / performance" ratio addresses the question "What is the highest-performance supercomputer I can afford to buy?", it does not address the more important question "What is the highest-performance supercomputer that I can afford to own?" In addition to the cost of acquisition, owning a supercomputer requires one to consider the facility and utility costs (including space, power, and cooling). It also requires consideration of the maintenance and system administration costs. The latter costs are directly affected by robustness (or the lack thereof). We call this project Supercomputing in Small Spaces.
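As a rough illustration of how the two ratios can rank machines differently, consider the Python sketch below. It is not the project's actual cost model: the cost categories follow the paragraph above, but the `ClusterCosts` fields, the assumption that operating costs accrue linearly per year, and every number are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClusterCosts:
    """Hypothetical cost and performance figures for one cluster."""
    acquisition: float        # purchase price
    facility_per_year: float  # space, power, and cooling
    admin_per_year: float     # maintenance and system administration
    years_owned: float        # assumed ownership period
    performance: float        # sustained performance, in any consistent unit


def total_cost_of_ownership(c: ClusterCosts) -> float:
    """Acquisition cost plus operating costs over the ownership period."""
    return c.acquisition + c.years_owned * (c.facility_per_year + c.admin_per_year)


def price_performance(c: ClusterCosts) -> float:
    """Traditional ratio: what the machine costs to buy per unit of performance."""
    return c.acquisition / c.performance


def tco_performance(c: ClusterCosts) -> float:
    """Ownership ratio: what the machine costs to own per unit of performance."""
    return total_cost_of_ownership(c) / c.performance


# Two made-up clusters with equal performance: A is cheaper to buy,
# B is cheaper to house and administer.
cluster_a = ClusterCosts(100_000, 40_000, 60_000, years_owned=4, performance=100)
cluster_b = ClusterCosts(150_000, 10_000, 15_000, years_owned=4, performance=100)

for name, c in (("A", cluster_a), ("B", cluster_b)):
    print(name, price_performance(c), tco_performance(c))
```

With these invented figures, cluster A has the better price / performance ratio while cluster B has the better total cost of ownership / performance ratio, which is the gap between the "afford to buy" and "afford to own" questions.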
Operated by the University of California for the National Nuclear Security Administration of the US Department of Energy. Copyright © 2001 UC.