USRC Publications

View the publications from various staff members of the USRC.
2023
  • Reid Priedhorsky, Jordan Ogas, Claude H. (Rusty) Davis IV, Z. Noah Hounshel, Ashlyn Lee, Benjamin Stormer, and R. Shane Goff. Charliecloud’s layer-free, git-based container build cache. In Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, SC-W '23, page 135–146, New York, NY, USA, 2023. Association for Computing Machinery. [  DOI | http ]
  • Howard Porter Pritchard Jr. MTT maintenance/deployment on ECP platforms [slide]. 3 2023. [  DOI | http ]
  • Howard Porter Pritchard Jr. Preliminary design for Intel Level Zero (L0) component for Open MPI accelerator framework: ECP STPR17-129 highlight quadchart. 6 2023. [  DOI | http ]
  • Anna Mataleena Pietarila Graham, Jeffrey Robert Haack, Sumathi Lakshmiranganatha, Jonathan David Pietarila Graham, Howard Porter Pritchard Jr., Theresa Joyce Lee, Thomas Gerard Lowe Henderson, Nathan Henry Hart, Thomas Saller, Charles Roger Ferenbaugh, Christopher Michael Mauney, and Shane Patrick Fogerty. El Capitan hackathon, April 2023: LANL code team outbriefs [slides]. 4 2023. [  DOI | http ]
  • Howard Porter Pritchard, Jr. and Rajat Bhattarai. Elastic workflows with PMIx [slides]. 8 2023. [  DOI | http ]
  • Jake Tronge and Howard Pritchard. Embedding Rust within Open MPI. In Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, SC-W '23, page 438–447, New York, NY, USA, 2023. Association for Computing Machinery. [  DOI | http ]
  • Howard Porter Pritchard, Jr. Implement Intel Level Zero (ZE) component for Open MPI accelerator framework [poster]. 9 2023. [  DOI | http ]
  • Jake Tronge, Howard Pritchard, and Jed Brown. Improving MPI safety for modern languages. In Proceedings of the 30th European MPI Users' Group Meeting, EuroMPI '23, New York, NY, USA, 2023. Association for Computing Machinery. [  DOI | http ]
  • W. M. Jones, C. S. Walker, V. E. Hafener, W. D. Graham, N. A. DeBardeleben, and S. T. Senator. Incorporating staggered planned maintenance reservations to improve performance in computational clusters. In 2023 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops), pages 32--36, Los Alamitos, CA, USA, Oct 2023. IEEE Computer Society. [  DOI | http ]
  • Sharmistha Chakrabarti and Adan E. Vela. Modeling and characterizing aircraft trajectories near airports using extracted control actions. Journal of Aerospace Information Systems, 20(2):81--101, 2023. [  DOI | http ]
  • Howard P Pritchard, Thomas Naughton III, Amir Shehata, and David Bernholdt. Open MPI for HPE Cray EX systems. Technical report, Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States), 2023.
  • Sharmistha Chakrabarti and Adan Vela. Optimal Sequencing and Scheduling of Airport Operations. [  DOI | http ]
2022
  • Sharmistha Chakrabarti, Adan Vela, and Keumjin Lee. A Data-Driven Modeling Analysis for Identifying Potential Inefficiencies in Aircraft Landing Ordering. [  DOI | http ]
  • Jan Fecht, Martin Schreiber, Martin Schulz, Howard Pritchard, and Daniel J. Holmes. An emulation layer for dynamic resources with MPI Sessions. In Hartwig Anzt, Amanda Bienz, Piotr Luszczek, and Marc Baboulin, editors, High Performance Computing. ISC High Performance 2022 International Workshops, pages 147--161, Cham, 2022. Springer International Publishing.
  • Hyun Lim, Oleg Korobkin, Julien Loiseau, Christopher Mauney, Irina Sagert, Alexander Kaltenborn, Bing-Jyun Tsao, and Wesley Even. Conservation of Angular Momentum in the Fast Multipole Method. In APS April Meeting Abstracts, volume 2022 of APS Meeting Abstracts, page S17.062, April 2022.
  • Ezra S Brooker, Sarah M Stangl, Christopher M Mauney, et al. Dependence of dust formation on the supernova explosion. The Astrophysical Journal, 931(2):85, 2022.
  • Coleman Nichols, Megan Hickman Fulp, Nathan DeBardeleben, and Jon C Calhoun. Exploring data reduction techniques for additive manufacturing analysis. In 2022 IEEE/ACM 8th International Workshop on Data Analysis and Reduction for Big Scientific Data (DRBSD), pages 21--28. IEEE, 2022.
  • Niranjhana Narayanan, Zitao Chen, Bo Fang, Guanpeng Li, Karthik Pattabiraman, and Nathan DeBardeleben. Fault injection for TensorFlow applications. IEEE Transactions on Dependable and Secure Computing, 2022.
  • Anna Mataleena Pietarila Graham, Jeffrey Robert Haack, Alex Roberts Long, Christopher Michael Mauney, Daniel Alphin Holladay, Rob Tuan Aulwes, Philipp Valentin Ferdinand Edelmann, Jonathan David Pietarila Graham, Sumathi Lakshmiranganatha, Anna M. Matsekh, Nathan Henry Hart, Robert Joseph Zerr, and Daniel J. Magee. LANL code teams' outbriefs from El Cap hackathon [slides]. 10 2022. [  DOI | http ]
  • Howard Porter Pritchard Jr. MPI Forum Sessions WG - MPI BoF SC 2022. 11 2022. [  DOI | http ]
  • Howard Porter Pritchard Jr. MPI Sessions-working group activities post MPI 4.0 standard ratification. Technical report, Los Alamos National Lab. (LANL), Los Alamos, NM (United States), 2022.
  • Howard Porter Pritchard, Jr. MTT maintenance/deployment on ECP platforms [poster]. 12 2022. [  DOI | http ]
  • David Bernholdt and Howard Porter Pritchard Jr. MTT/GitLab CI Open MPI and PMIx/PRRTE testing [slides]. 3 2022. [  DOI | http ]
  • Nicklaus Przybylski, William M. Jones, and Nathan DeBardeleben. Online detection and classification of state transitions of multivariate shock and vibration data. In 2022 IEEE High Performance Extreme Computing Conference (HPEC), pages 1--7, Sep. 2022. [  DOI ]
  • Alexandra Ruth Stewart, Li-Ta Lo, Oleg Korobkin, Irina Sagert, Julien Loiseau, Hyun Lim, Mark Alexander Kaltenborn, Christopher Michael Mauney, and Jonah Maxwell Miller. Realistic kilonova up close. arXiv preprint arXiv:2201.01865, 2022.
  • Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M Ciorba, Nathan DeBardeleben, et al. Resiliency in numerical algorithm design for extreme scale simulations. The International Journal of High Performance Computing Applications, 36(2):251--285, 2022.
  • Jeffrey Hammett Peterson, Jonah Maxwell Miller, Anna Mataleena Pietarila Graham, Daniel Alphin Holladay, Christopher Michael Mauney, Richard Felix Berger, and Karen Chung-Yen Tsai. Singularity-eos xcap report. 12 2022. [  DOI | http ]
  • Jonah M Miller, Daniel Holladay, Chad D Meyer, Joshua C Dolence, Sriram Swaminarayan, Christopher M Mauney, and Karen Tsai. Spiner: Performance portable routines for generic, tabulated, multi-dimensional data. Journal of Open Source Software, 7(75):4367, 2022.
  • Dominik Huber, Maximilian Streubel, Isaías Comprés, Martin Schulz, Martin Schreiber, and Howard Pritchard. Towards dynamic resource management with MPI Sessions and PMIx. In Proceedings of the 29th European MPI Users' Group Meeting, pages 57--67, 2022.
2021
  • Elisabeth Ann Moore, Nathan A DeBardeleben, and Sean P Blanchard. Analysis of system log data using machine learning, June 10 2021. US Patent App. 17/061,956.
  • Reid Priedhorsky, Jordan Andrew Ogas, and Hunter Patrick Easterday. Charliecloud 101. Technical report.
  • Oleg Korobkin, Hyun Lim, Irina Sagert, Julien Loiseau, Christopher Mauney, M Alexander R Kaltenborn, Bing-Jyun Tsao, and Wesley P Even. Conservation of angular momentum in the fast multipole method. arXiv preprint arXiv:2107.07166, 2021.
  • Ezra Sebastian Brooker, Sarah Marie Stangl, Christopher Michael Mauney, and Christopher Lee Fryer. Dependence of dust formation on the supernova explosion and nudust. 3 2021. [  DOI | http ]
  • Sarah Marie Stangl, Ezra Brooker, Christopher Mauney, and Christopher Fryer. Dust formation and growth in core collapse supernovae explosions [slides]. 4 2021. [  DOI | http ]
  • Craig Walker, Braeden Slade, Gavin Bailey, Nicklaus Przybylski, Nathan DeBardeleben, and William M. Jones. Exploring the tradeoff between reliability and performance in HPC systems. In 2021 IEEE High Performance Extreme Computing Conference (HPEC), pages 1--7, Sep. 2021. [  DOI ]
  • M. A. Kaltenborn, W. Even, O. Korobkin, H. Lim, J. Loiseau, C. Mauney, and I. Sagert. FleCSPH for Modeling Binary White Dwarf Mergers. In A. D. Kapinska, editor, 37th Annual New Mexico Symposium, page 1, November 2021.
  • Howard Porter Pritchard Jr. MTT/GitLab CI ongoing Open MPI testing-focus on GPU/OFI and OpenPMIx (quadchart for ECP OMPI-X milestone STPR17-80). Technical report, Los Alamos National Lab. (LANL), Los Alamos, NM (United States), 2021.
  • Reid Priedhorsky, R Shane Canon, Timothy Randles, and Andrew J Younge. Minimizing privilege for building HPC containers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--14, 2021.
  • Irina Sagert, Oleg Korobkin, Mark Alexander Randolph Kaltenborn, Hyun Lim, Julien Loiseau, Christopher Michael Mauney, Ingo Tews, Bing-Jyun Tsao, and Wesley Paul Even. Overview of FleCSPH solid material modeling capabilities [slides]. 5 2021. [  DOI | http ]
  • Da Zhang, Gagandeep Panwar, Jagadish B Kotra, Nathan DeBardeleben, Sean Blanchard, and Xun Jian. Quantifying server memory frequency margin and using it to improve performance in HPC systems. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pages 748--761. IEEE, 2021.
  • Max Grossman, Steve Poole, Howard Pritchard, and Vivek Sarkar. SHMEM-ML: Leveraging OpenSHMEM and Apache Arrow for scalable, composable machine learning. In Stephen Poole, Oscar Hernandez, Matthew Baker, and Tony Curtis, editors, OpenSHMEM and Related Technologies. OpenSHMEM in the Era of Exascale and Smart Networks, pages 111--125, Cham, 2022. Springer International Publishing.
  • Nathan DeBardeleben, Tom Burr, Stephen Penton, Craig Walker, Josip Loncaric, and William M Jones. Statistical framework for two-party acceptance testing of HPC systems for reliability. In 2021 IEEE/ACM 11th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), pages 21--30. IEEE, 2021.
  • Daniel Oliveira, Sean Blanchard, Nathan DeBardeleben, Fernando Fernandes dos Santos, Gabriel Piscoya Dávila, Philippe Navaux, Andrea Favalli, Opale Schappert, Stephen Wender, Carlo Cazzaniga, et al. Thermal neutrons: a possible threat for supercomputer reliability. The Journal of Supercomputing, 77:1612--1634, 2021.
  • Samuel Keith Gutierrez and Howard Porter Pritchard, Jr. Toward well-provenanced computer system benchmarking: An update. 9 2021. [  DOI | http ]
  • Kurt B Ferreira, Scott Levy, Victor Kuhns, Nathan DeBardeleben, and Sean Blanchard. Understanding the effects of DRAM correctable error logging at scale. In 2021 IEEE International Conference on Cluster Computing (CLUSTER), pages 421--432. IEEE, 2021.
2020
  • Daniel Oliveira, Sean Blanchard, Nathan DeBardeleben, Fernando F dos Santos, Gabriel Piscoya Davila, Philippe Navaux, Stephen Wender, Carlo Cazzaniga, Christopher Frost, Robert Baumann, et al. An overview of the risk posed by thermal neutrons to the reliability of computing devices. In 2020 50th Annual IEEE-IFIP International Conference on Dependable Systems and Networks-Supplemental Volume (DSN-S), pages 92--97. IEEE, 2020.
  • Sharmistha Chakrabarti and Adan Ernesto Vela. Clustering aircraft trajectories according to air traffic controllers' decisions. In 2020 AIAA/IEEE 39th Digital Avionics Systems Conference (DASC), pages 1--9, Oct 2020. [  DOI ]
  • Laura Monroe and Vanessa Job. Computationally inequivalent summations and their parenthetic forms. arXiv preprint arXiv:2005.05387, 2020.
  • Huan Ke, Haryadi S Gunawi, David Bonnie, Nathan DeBardeleben, Michael Grosskopf, Terry Grové, Dominic Manno, Elisabeth Moore, and Brad Settlemyer. Extreme protection against data loss with single-overlap declustered parity. In 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 343--354. IEEE, 2020.
  • Julien Loiseau, Hyun Lim, Mark Alexander Kaltenborn, Oleg Korobkin, Christopher M. Mauney, Irina Sagert, Wesley P. Even, and Benjamin K. Bergen. FleCSPH: Parallel and distributed SPH implementation based on the FleCSI. Astrophysics Source Code Library, record ascl:2007.011, July 2020.
  • Julien Loiseau, Hyun Lim, Mark Alexander Kaltenborn, Oleg Korobkin, Christopher M Mauney, Irina Sagert, Wesley P Even, and Benjamin K Bergen. FleCSPH: The next generation FleCSIble parallel computational infrastructure for smoothed particle hydrodynamics. SoftwareX, 12:100602, 2020.
  • Alfred Torrez, Reid Priedhorsky, and Timothy Randles. HPC container runtime performance overhead: At first order, there is none.
  • Dylan Wallace, William M. Jones, Robert Robey, Laura Monroe, Terry Grové, and Nathan DeBardeleben. Impact of contextual error correction techniques in CLAMR. In 2020 SoutheastCon, pages 1--2, March 2020. [  DOI ]
  • Robert B. Ross, George Amvrosiadis, Philip Carns, Charles D. Cranor, Matthieu Dorier, Kevin Harms, Greg Ganger, Garth Gibson, Samuel K. Gutierrez, Robert Latham, Bob Robey, Dana Robinson, Bradley Settlemyer, Galen Shipman, Shane Snyder, Jerome Soumagne, and Qing Zheng. Mochi: Composing data services for high-performance computing environments. Journal of Computer Science and Technology, 35(1):121--144, Jan 2020. [  DOI | http ]
  • Vanessa Job, Terry Grové, Shane Fogerty, Christopher Mauney, Brett Neuman, Laura Monroe, and Robert W Robey. Order matters: A case study on reducing floating point error in sums via ordering and grouping. In 2020 IEEE/ACM 4th International Workshop on Software Correctness for HPC Applications (Correctness), pages 10--19. IEEE, 2020.
  • Zitao Chen, Niranjhana Narayanan, Bo Fang, Guanpeng Li, Karthik Pattabiraman, and Nathan DeBardeleben. TensorFI: A flexible fault injection framework for TensorFlow applications. In 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE), pages 426--435. IEEE, 2020.
  • Daniel Oliveira, Sean Blanchard, Nathan DeBardeleben, Fernando F Dos Santos, Gabriel Piscoya Dávila, Philippe Navaux, Carlo Cazzaniga, Christopher Frost, Robert C Baumann, and Paolo Rech. Thermal neutrons: a possible threat for supercomputers and safety critical applications. In 2020 IEEE European Test Symposium (ETS), pages 1--6. IEEE, 2020.
2019
  • Gagandeep Panwar, Da Zhang, Yihan Pang, Mai Dahshan, Nathan DeBardeleben, Binoy Ravindran, and Xun Jian. 2019. Quantifying Memory Underutilization in HPC Systems and Using it to Improve Performance via Architecture Support. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '52). ACM, New York, NY, USA, 821-835. DOI: https://doi.org/10.1145/3352460.3358267 (ACM Digital Library) (data from paper)
  • Jieyang Chen, Nan Xiong, Xin Liang, Dingwen Tao, Sihuan Li, Kaiming Ouyang, Kai Zhao, Nathan DeBardeleben, Qiang Guan, and Zizhong Chen. 2019. TSM2: optimizing tall-and-skinny matrix-matrix multiplication on GPUs. In Proceedings of the ACM International Conference on Supercomputing (ICS '19). ACM, New York, NY, USA, 106-116.  (ACM Digital Library)
  • BinFI: An Efficient Fault Injector for Safety-Critical Machine Learning Systems
    Zitao Chen, Guanpeng Li, Karthik Pattabiraman, and Nathan DeBardeleben, To appear in The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2019. (TensorFI-BinaryFI on github) (blog at UBC about the paper)
  • S. Huang, S. Liang, S. Fu, W. Shi, D. Tiwari and H. Chen, "Characterizing Disk Health Degradation and Proactively Protecting Against Disk Failures for Reliable Storage Systems," 2019 IEEE International Conference on Autonomic Computing (ICAC), Umea, Sweden, 2019, pp. 157-166.
    doi: 10.1109/ICAC.2019.00027 (IEEE Digital Library)
  • Z. Qiao, S. Liang, S. Fu, H. Chen and B. Settlemyer, "Characterizing and Modeling Reliability of Declustered RAID for HPC Storage Systems," 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks – Industry Track, Portland, OR, USA, 2019, pp. 17-20.
    doi: 10.1109/DSN-Industry.2019.00011 (IEEE Digital Library)
  • Zhi Qiao, Song Fu, Hsing-Bung Chen and Bradley Settlemyer, Exploring Declustered Software RAID for Enhanced Reliability and Recovery Performance in Storage Systems, The 38th International Symposium on Reliable Distributed Systems (SRDS 2019). Oct. 1st – Oct. 4th, 2019, Lyon, France.  (to appear)
  • Nathan Hjelm, Howard Pritchard, Samuel K. Gutierrez, Daniel J. Holmes, Ralph Castain and Anthony Skjellum, "MPI Sessions: Evaluation of an Implementation in Open MPI," IEEE Cluster 2019.  (to appear)
2018
  • M. Hickman et al., "Enhancing HPC System Log Analysis by Identifying Message Origin in Source Code," 2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), Memphis, TN, 2018, pp. 100-105. doi: 10.1109/ISSREW.2018.00-23 (IEEE Digital Library)
  • E. Baseman et al., "Physics-Informed Machine Learning for DRAM Error Modeling," 2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), Chicago, IL, 2018, pp. 1-6. doi: 10.1109/DFT.2018.8602983 (IEEE Digital Library)
  • Amvrosiadis G, Park J W, Ganger G, Gibson G, Baseman E, and DeBardeleben N.  On the Diversity of Cluster Workloads and its Impact on Research Results.  USENIX ATC 2018. (USENIX link with abstract, presentation slides, and audio)
  • Z Qiao, J Hochstetler, S Liang, S Fu, H Chen, B Settlemyer. Developing Cost-Effective Data Rescue Schemes to Tackle Disk Failures in Data Centers.  International Conference on Big Data, 194-208
  • Qiang Liu, Nageswara SV Rao, Satyabrata Sen, Bradley W Settlemyer, Hsing-Bung Chen, Joshua M Boley, Rajkumar Kettimuthu, Dimitrios Katramatos.  Virtual Environment for Testing Software-Defined Networking Solutions for Scientific Workflows.  Proceedings of the 1st International Workshop on Autonomous Infrastructure for Science, Pages 3-11. ACM.
  • Nageswara SV Rao, Qiang Liu, Satyabrata Sen, Raj Kettimuthu, Josh Boley, Bradley W Settlemyer, Hsing B Chen, Dimitrios Katramatos, Dantong Yu.  Software-Defined Network Solutions for Science Scenarios: Performance Testing Framework and Measurements.  Proceedings of the 19th International Conference on Distributed Computing and Networking, Pages 53-64.  ACM.
  • Michael A Sevilla, Carlos Maltzahn, Peter Alvaro, Reza Nasirigerdeh, Bradley W Settlemyer, Danny Perez, David Rich, Galen M Shipman.  Programmable Caches with a Data Management Language and Policy Engine.  Proceedings of the International Symposium on Cluster, Cloud and Grid Computing (CCGrid'18).
  • Scott Levy, Kurt B. Ferreira, Nathan DeBardeleben, Taniya Siddiqua, Vilas Sridharan, and Elisabeth Baseman. 2018. Lessons learned from memory errors observed over the lifetime of Cielo. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '18). IEEE Press, Piscataway, NJ, USA, Article 43, 12 pages. DOI: https://doi.org/10.1109/SC.2018.00046 (ACM Digital Library)
  • DeLucia, A, Baseman E. Work in Progress: Topic Modeling for HPC Job State Prediction. Machine Learning for Computing Systems Workshop, HPDC 2018.
  • Goetting I, Baseman E, Cao H. Work in Progress: Causal Relationships amongst Sensors in the Trinity Supercomputer. Machine Learning for Computing Systems Workshop, HPDC 2018.
  • DeLucia, A, Baseman E. High Performance Computing Job Outcome by Mining System Logs. Southern Data Science Conference 2018.
2017
  • Tan L, DeBardeleben N, Guan Q, Blanchard S, Lang M. 2017. RSVP: Soft Error Resilient Power Savings at Near-Threshold Voltage using Register Vulnerability. the 3rd International Workshop on Recent Advances in the DependabIlity AssessmeNt of Complex systEms (RADIANCE). 
  • Tan L, DeBardeleben N, Guan Q, Blanchard S, Lang M. 2017. Using Virtualization to Quantify Power Conservation via Near-Threshold Voltage Reduction for Inherently Resilient Applications. Parallel Computing. 
  • Otstott D, Ionkov L, Lang M, Zhao M. 2017. TCASM: An asynchronous shared memory interface for high-performance application composition. Parallel Computing. 63: 61-78.
  • Wu P, DeBardeleben N, Guan Q, Blanchard S, Chen J, Tao D, Liang X, Ouyang K, Chen Z. 2017. Silent Data Corruption Resilient Two-sided Matrix Factorizations. Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 415–427, ACM, Austin, Texas, USA, 2017, ISBN: 978-1-4503-4493-7.
  • Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Garth A Gibson, Charles D Cranor, Bradley W Settlemyer, Gary Grider, Fan Guo.  2017.  Software-defined storage for fast trajectory queries using a deltaFS indexed massive directory.  Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems, 7-12, ACM
  • Lei Cao, Bradley W Settlemyer, John Bent.  2017.  To share or not to share: comparing burst buffer architectures.  Conference Proceedings of the 25th High Performance Computing Symposium.  Pages 4-14, Society for Computer Simulation International.
  • Baseman E. Helping Exascale Computers Help Us: Machine Learning for High Performance Computing. Women in Machine Learning Workshop, NIPS 2017.
  • Haque A, DeLucia A, Baseman E. Markov Chain Modeling for Anomaly Detection in High Performance Computing System Logs. HPC User Support Tools Workshop, Supercomputing 2017.
  • Siddiqua T, Sridharan V, Raasch S, DeBardeleben N, Ferreira K, Levy S, Baseman E, Guan Q. Lifetime Memory Reliability Data from the Field. DFT 2017.
  • Baseman E, DeBardeleben N, Ferreira K, Sridharan V, Siddiqua T, Tkachenko O. Automating DRAM Fault Mitigation by Learning from Experience. DSN (Industrial Track) 2017.
2016
  • Baseman E, Blanchard S, Li Z, Fu S. 2016. Relational Synthesis of Text and Numeric Data for Anomaly Detection on Computing System Logs. 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 882-885. 
  • Baseman E, DeBardeleben N, Ferreira K, Levy S, Raasch S, Sridharan V, Siddiqua T, Guan Q. 2016. Improving DRAM Fault Characterization through Machine Learning. 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop (DSN-W), pp. 250-253.
  • DeBardeleben N. 2016. Extreme scale and bleeding edge technology lead to a need for resilient high performance computing systems. 2016 IEEE International Reliability Physics Symposium (IRPS), pp. 3B-1-1-3B-1-8.
  • Wu P, Guan Q, DeBardeleben N, Blanchard S, Tao D, Liang X, Chen J, Chen Z. 2016. Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra. Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, pp. 31–42, ACM, Kyoto, Japan, 2016, ISBN: 978-1-4503-4314-5.
  • Fang B, Wu P, Guan Q, DeBardeleben N, Monroe L, Blanchard S, Chen Z, Pattabiraman K, Ripeanu M. 2016. SDC is in the Eye of the Beholder: A Survey and Preliminary Study. 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, DSN Workshops 2016, Toulouse, France, June 28 - July 1, 2016, pp. 72–76.
  • Monroe L, Jones WM, Lavigne SR, Davis CH IV, Guan Q, DeBardeleben N. 2016. On the Inherent Resilience of Integer Operations. Euro-Par 2016: Parallel Processing Workshops - Euro-Par 2016 International Workshops, Grenoble, France, August 24-26, 2016, Revised Selected Papers, pp. 648–659.
  • Nageswara SV Rao, Qiang Liu, Satyabrata Sen, Greg Hinkel, Neena Imam, Ian Foster, Rajkumar Kettimuthu, Bradley W Settlemyer, Chase Q Wu, Daqing Yun.  2016.  Experimental analysis of file transfer rates over wide-area dedicated connections.  IEEE 18th International Conference on High Performance Computing and Communications Pages 198-205, Best Paper Winner.
  • John Bent, Bradley W Settlemyer, Gary Grider.  Serving data to the lunatic fringe: The evolution of HPC storage.  The USENIX Magazine 41 (2), 34-39.
  • NSV Rao, G Hinkel, N Imam, BW Settlemyer.  Measurements of file transfer rates over dedicated long-haul connections.  2nd International Workshop on The Lustre Ecosystem.
  • Morrow A, Baseman E, Blanchard S. Ranking Anomalous High Performance Computing Sensor Data using Unsupervised Clustering. CSCI: Symposium on Parallel and Distributed Computing and Computational Science 2016.
  • Baseman E, Blanchard S, DeBardeleben N, Bonnie A, Morrow A. Interpretable Anomaly Detection for Monitoring of High Performance Computing Systems. Outlier Definition, Detection, and Description on Demand Workshop, KDD 2016.
  • Guan Q, DeBardeleben N, Wu P, Eidenbenz S, Blanchard S, Monroe L, Baseman E, Tan L. Design, Use, and Evaluation of P-FSEFI: A Parallel Soft Error Fault Injection Framework for Emulating Soft Errors in Parallel Applications. SIMUTOOLS 2016.
  • Baseman E, DeBardeleben N, Ferreira K, Levy S, Raasch S, Sridharan V, Siddiqua T, Guan Q. Improving DRAM Fault Characterization through Machine Learning. DSN (Industrial Track) 2016.
2015
  • Guan Q, DeBardeleben N, Blanchard S, Fu S. 2015. Empirical Studies of the Soft Error Susceptibility of Sorting Algorithms. 5th Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop with HPDC 2015.
  • Wang K, Zhou X, Qiao K, Lang M, McClelland B, Raicu I. 2015. Towards Scalable Distributed Workload Manager with Monitoring-Based Weakly Consistent Resource Stealing. ACM HPDC.
  • Wang K, Qiao K, Sadooghi I, Zhou X, Li T, Lang M, Raicu I. 2015. Load-balanced and locality-aware scheduling for data-intensive workloads at extreme scales. Concurrency and Computation: Practice and Experience (00): 1-29.
  • Sridharan V, DeBardeleben N, Blanchard S, Ferreira K, Stearley J, Shalf J, Gurumurthi S. 2015. Memory Errors in Modern Systems: The Good, The Bad, and the Ugly. Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems. 
  • Tiwari D, Gupta S, Rogers J, Maxwell D, Rech P, Vazhkudai S, Oliveira D, Londo D, DeBardeleben N, Navaux P and others. 2015. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. IEEE 21st International Symposium on High Performance Computer Architecture (HPCA): 331-342.
  • Huang S, Fu S, DeBardeleben N, Guan Q, Xu C. 2015. Differentiated Failure Remediation with Action Selection for Resilient Computing. IEEE Pacific Rim International Symposium on Dependable Computing (PRDC).
  • Guan Q, DeBardeleben N, Blanchard S, Fu S. 2015. Addressing Statistical Significance of Fault Injection: Empirical Studies of the Soft Error Susceptibility. IEEE Pacific Rim International Symposium on Dependable Computing (PRDC).
  • Guan Q, DeBardeleben N, Atkinson B, Robey R, Jones W. 2015. Towards Building Resilient Scientific Applications: Resilience Analysis on the Impact of Soft Error and Transient Error Tolerance with CLAMR Hydrodynamics Mini-App. IEEE Cluster 2015.
  • DeBardeleben N, Blanchard S, Kaeli D, Rech P. 2015. Field, experimental, and analytical data on large-scale HPC systems and evaluation of the implications for exascale system design. 2015 IEEE 33rd VLSI Test Symposium (VTS), pp. 1-2, 2015, ISSN: 1093-0167.
2014
  • Snir M, Wisniewski R, Abraham J, Adve S, Bagchi S, Balaji P, Belak J, Bose P, Cappello F, Carlson B and others. 2014. Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications.
  • DeBardeleben N, Blanchard S, Sridharan V, Gurumurthi S, Stearley J, Ferreira K, Shalf J. 2014. Extra Bits on SRAM and DRAM Errors - More Data From the Field. Silicon Errors in Logic - System Effects (SELSE-10), Stanford University. 
  • Bautista Gomez L, Cappello F, Carro L, DeBardeleben N, Fang B, Gurumurthi S, Pattabiraman K, Rech P, Sonza Reorda M. 2014. GPGPUs: How to Combine High Computational Power with High Reliability. Design, Automation & Test in Europe (DATE14), Dresden, Germany. 
  • Guan Q. 2014. F-SEFI: A Fine-grained Soft Error Fault Injector for Profiling Application Vulnerability. Poster presentation: LANL Predictive Science Panel Review, Los Alamos, NM. 
  • DeBardeleben N. 2014. Reliability Requirements for GPUs in HPC. HiPEAC 2014, Vienna, Austria.
  • DeBardeleben N. 2014. Reliability Requirements for GPUs in HPC. Design, Automation & Test in Europe (DATE14), as part of "Embedded Tutorial: GPGPUs: how to combine high computational power with high reliability". 
  • Atkinson B, DeBardeleben N, Guan Q, Robey R, Jones WM. 2014. Fault Injection Experiments with the CLAMR Hydrodynamics Mini-App. Software Reliability Engineering Workshops (ISSREW), 2014 IEEE International Symposium: 6-9. 
2013
  • Ionkov L, Lang M, Maltzahn C. 2013. DRepl: Optimizing Access to Application Data for Analysis and Visualization. 
  • Yuan X, Mahapatra S, Lang M, Pakin S. 2013. RRR: A Load Balanced Routing Scheme for Slimmed Fat-trees. 
  • Pakin S, Lang M. 2013. Understanding the Performance of Two Production Supercomputers. 
  • Akkan H, Lang M, Liebrock L. 2013. Understanding and isolating the noise in the Linux kernel. International Journal of High Performance Computing Applications.
  • Soltero P, Bridges P, Arnold D, Lang M. 2013. A Gossip-based Approach to Exascale System Services. 
  • Akkan H, Ionkov L, Lang M. 2013. Transparently Consistent Asynchronous Shared Memory. 
  • Pakin S, Lang M. 2013. Energy Modeling of Supercomputers and Large-Scale Scientific Applications. IEEE. 
  • Wang K, Kulkarni A, Lang M, Arnold D, Raicu I. 2013. Using Simulation to Explore Distributed Key-Value Stores for Extreme-Scale System Services. 
  • Yuan X, Mahapatra S, Nienaber W, Pakin S, Lang M. 2013. A New Routing Scheme for Jellyfish and its Performance with HPC Workloads. Supercomputing Conference. 
  • Akkan H, Lang M, Ionkov L. 2013. HPC Runtime Support for Fast and Power Efficient Locking and Synchronization. IEEE. 
  • Pakin S, Luang X, Lang M. 2013. Predicting the performance of extreme-scale supercomputer networks. The Next Wave (http://www.nsa.gov/research/tnw/). 20(2). 
  • Huang B, Sass R, DeBardeleben N, Blanchard S. 2013. PyDac: A Resilient Run-time Framework for Divide-and-Conquer Applications on a Heterogeneous Many-core Architecture. Proceedings of the 6th Workshop on UnConventional High Performance Computing 2013 (UCHPC 2013).
  • DeBardeleben N, Blanchard S, Monroe L, Romero P, Grunau D, Idler C, Wright C. 2013. GPU Behavior on a Large HPC Cluster. 6th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids in conjunction with the 19th International European Conference on Parallel and Distributed Computing (Euro-Par 2013), Aachen, Germany.
  • Jian X, Blanchard S, DeBardeleben N, Sridharan V, Kumar R. 2013. Reliability Models for Double Chipkill Detect/Correct Memory Systems. SELSE (Silicon Errors in Logic, System Effects): 6. 
  • Snir M, Wisniewski RW, Abraham JA, Adve SV, Bagchi S, Balaji P, Belak J, Bose P, Cappello F, Carlson B and others. 2013. Addressing Failures in Exascale Computing. Argonne National Laboratory Technical Report.
  • Jian X, DeBardeleben N, Blanchard S, Sridharan V, Kumar R. 2013. Analyzing Reliability of Memory Subsystems with Double Chipkill Detect/Correct. The 19th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2013). Vancouver, BC, Canada. 
  • Sridharan V, Stearley J, DeBardeleben N, Blanchard S, Gurumurthi S. 2013. Feng Shui of Supercomputer Memory: Positional Effects in DRAM and SRAM Faults. SC13, Denver Colorado. 
2012
  • Kulkarni A, Wang K, Lang M. 2012. Exploring the Design Tradeoffs for Exascale System Services Through Simulation.
  • Kulkarni A, Lumsdaine A, Lang M, Ionkov L. 2012. Optimizing Latency and Throughput for Spawning Processes on Massively Multicore Processors. 
  • Akkan H, Lang M, Liebrock LM. 2012. Stepping Towards Noiseless Linux Environment.
  • Kulkarni A, Manzanares A, Ionkov L, Lang M, Lumsdaine A. 2012. The Design and Implementation of a Multi-level Content-Addressable Checkpoint File System.
  • Jones WM, Daly JT, DeBardeleben N. 2012. Application monitoring and checkpointing in HPC: looking towards exascale systems. Proceedings of the 50th Annual Southeast Regional Conference: 262-267. 
  • DeBardeleben N, Blanchard S, Guan Q, Zhang Z, Fu S. 2012. Experimental Framework for Injecting Logic Errors in a Virtual Machine to Profile Applications for Soft Error Resilience. Euro-Par 2011: Parallel Processing Workshops Lecture Notes in Computer Science. 7156: 282-291. 
  • Geist A, Snir M, Roman E, Still B, Clay R, Engelmann C, Ross R, Schulz M, Krishnamoorthy S, Vishnu A and others. 2012. US Department of Energy Fault Management Workshop Report.
  • Daly J, Harrod B, Hoang T, Nowell L, Adolf B, Borkar S, DeBardeleben N, Elnozahy M, Heroux M, Rogers D and others. 2012. Inter-Agency Workshop on HPC Resilience at Extreme Scale.
2011
  • DeBardeleben N, Blanchard SP, Fu S, Guan Q, Zhang Z. 2011. Experimental Framework for Injecting Logic Errors in a Virtual Machine to Profile Applications for Soft Error Resilience. 
  • Kulkarni A, Lang M, Lumsdaine A. 2011. GoDEL: A multidirectional dataflow execution model for large-scale computing. 
  • Ionkov L. 2011. Gostor: Storage beyond POSIX.
  • Greenberg H, Lang M, Ionkov L, Blanchard SP. 2011. REDfish - REsilient Dynamic dIstributed Scalable System Services for Exascale.