Operational Data
These data are characterized as "operational" in nature, such as telemetry, environmental data, etc.
ATLAS is a collaboration with Carnegie Mellon University's Parallel Data Lab. The repository contains traces from LANL supercomputers as well as from other systems, including a financial company. There are also links to publications and insights garnered from these data. The data for the ATLAS project are hosted by Carnegie Mellon.
This tar file contains uncompressed VPIC restart files. The provided data are a subset of a much larger restart data set. VPIC is an open-source, particle-in-cell code developed at Los Alamos National Laboratory that is primarily used for astrophysics simulations. These data have proven difficult to compress and are frequently used by the Laboratory to test various compression algorithms.
The data are System Environment Data Collections (SEDC) sensor telemetry (voltages, temperatures, fan speeds, water flow information, etc.). These data cover 2/9/16–2/18/16, during which Trinity was in "open science" testing by users. See the README for more information. Released under LA-UR-17-24849.
This section outlines a collection of data released by LANL under LA-UR-19-28211 and available below: memory usage data from three open clusters, collected from late 2018 through early 2019.
Citation
The following datasets were released in conjunction with the following paper:
- Gagandeep Panwar, Da Zhang, Yihan Pang, Mai Dahshan, Nathan DeBardeleben, Binoy Ravindran, and Xun Jian. 2019. Quantifying Memory Underutilization in HPC Systems and Using it to Improve Performance via Architecture Support. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '52). ACM, New York, NY, USA, 821-835. DOI: doi.org/10.1145/3352460.3358267 (ACM Digital Library)
Overview
This entire collection of data has been released openly with the identifier LA-UR-19-28211. The data consist of 19,678 JSON files totaling 87.4 GB uncompressed. These JSON files contain memory usage statistics from three compute clusters at LANL: grizzly, badger, and snow. The data are relatively fine-grained (every 10 seconds, where records exist) for every node in each cluster and identify which (numerical) job ID used that memory. Since the job IDs are unique, one can follow parallel jobs across multiple nodes. For instance, job ID 123 running on nodes 100, 101, and 102 will have separate memory utilization records for all three nodes for the time it was running.
Each archive has a detailed README which explains the data format and the cluster sizes, memory system sizes, etc. A brief sample of some trivial analysis from this data is at the bottom of the page. For a more detailed analysis, we encourage you to see the paper referenced above.
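As a rough illustration of working with these files, the sketch below flattens one dataset's JSON files into a single table. The field names used here ("timestamp", "host", "job_id", "meminfo.active", "meminfo.memfree") and the assumption that each file holds a list of records are illustrative only; the README bundled with each archive documents the actual schema.

```python
# Minimal sketch: flatten one dataset's JSON files into a pandas DataFrame.
# NOTE: the record layout and field names below are assumptions for illustration;
# see the README bundled with each archive for the real schema.
import json
import pathlib

import pandas as pd


def load_dataset(directory):
    """Load every *.json file under `directory` into one DataFrame."""
    rows = []
    for path in sorted(pathlib.Path(directory).glob("*.json")):
        with open(path) as fh:
            records = json.load(fh)  # assumed: each file holds a list of records
        for rec in records:
            rows.append({
                "timestamp": rec.get("timestamp"),  # assumed epoch seconds
                "host": rec.get("host"),
                "job_id": rec.get("job_id"),
                "meminfo.active": rec.get("meminfo.active"),
                "meminfo.memfree": rec.get("meminfo.memfree"),
            })
    df = pd.DataFrame(rows)
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s", errors="coerce")
    return df


# Example (hypothetical path to an unpacked archive): df = load_dataset("grizzly0/")
```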
Data
Cluster Name | Date Range | File Size (compressed) | File Size (uncompressed) | # of Files | Link |
--- | --- | --- | --- | --- | --- |
Grizzly (dataset 0) | 11/1/18 - 11/27/18 | 1.6 GB | 12 GB | 1295 | grizzly0 |
Grizzly (dataset 1) | 12/1/18 - 12/22/18 | 1.3 GB | 10 GB | 999 | grizzly1 |
Grizzly (dataset 2) | 12/22/18 - 1/11/19 | 1.0 GB | 8.2 GB | 1001 | grizzly2 |
Grizzly (dataset 3) | 1/12/19 - 2/5/19 | 1.3 GB | 9.7 GB | 1201 | grizzly3 |
Grizzly (dataset 4) | 2/5/19 - 2/22/19 | 799 MB | 6.1 GB | 801 | grizzly4 |
Grizzly (dataset 5) | 2/22/19 - 3/18/19 | 927 MB | 7.4 GB | 1167 | grizzly5 |
Badger | 11/1/18 - 3/18/19 | 2.1 GB | 17 GB | 6607 | badger |
Snow | 11/1/18 - 3/18/19 | 2.4 GB | 17 GB | 6607 | snow |
Samples
The following images are intended to give simple snapshots of the raw data. They are not a scientific analysis but may be useful to give a reader an idea of what the data hold.
Hosts
Here we simply look at the hostnames reporting data, which may be indicative of how these hosts are utilized. When combined with job and time information (in the dataset, but not shown here), more interesting and advanced analytics could be performed. This first plot is from the grizzly0 dataset, and most hosts report about the same number of records.
Metrics
As the README files explain, the datasets contain meminfo.memfree and meminfo.active samples from every node at periodic intervals. Using this, one can get an idea of memory usage. For instance, the following plot shows a quick glimpse at the meminfo.active metric for the snow dataset.
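Reusing the hypothetical load_dataset() helper sketched in the Overview (paths and field names remain assumptions), something like the following would produce a similar glimpse:

```python
# Rough glimpse of meminfo.active over time, mirroring the plot described above.
# Assumes the illustrative schema and load_dataset() helper from the Overview sketch.
import matplotlib.pyplot as plt

df = load_dataset("snow/")  # hypothetical path to the unpacked snow archive
active = (
    df.set_index("timestamp")["meminfo.active"]
      .resample("10min")    # coarsen the ~10 s samples for plotting
      .mean()
)
active.plot(title="snow: mean meminfo.active across reporting hosts")
plt.ylabel("meminfo.active")
plt.show()
```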
JobIDs
The datasets include per-node memory usage data along with the JobID consuming that memory. Since these systems are used for parallel computation, one can see the amount of memory consumed by parallel jobs and determine how many nodes each job ran on. The following plot shows a snapshot of the number of records in just the first hour of the grizzly1 dataset. It does not show how much memory was used or how many nodes these jobs ran on, just the number of records in the dataset. It is presented here (as mentioned earlier) to give the reader an idea of what is in the data.
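Again reusing the hypothetical load_dataset() helper and assumed field names from the Overview sketch, a per-job summary over the first hour might look like this:

```python
# Per-job view: number of records and distinct nodes per job ID in the first hour.
import pandas as pd

df = load_dataset("grizzly1/")  # hypothetical path to the unpacked grizzly1 archive
start = df["timestamp"].min()
first_hour = df[df["timestamp"] < start + pd.Timedelta(hours=1)]

per_job = first_hour.groupby("job_id").agg(
    records=("timestamp", "size"),   # record count per job
    nodes=("host", "nunique"),       # distinct nodes the job touched
)
print(per_job.sort_values("records", ascending=False).head(10))
```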
Failure Data
The following are collections of data related to failures on high performance computing (HPC) systems. Many of the data dumps include READMEs explaining the datasets as well as descriptions of the machines and/or systems the statistics were taken from. Failures include memory, processor, network, etc.
NOTE: These data are historical in nature and were originally released in 2005.
To enable open computer science research, access to computer operational data is essential. Data on failure, availability, usage, environment, performance, and workload characterization are among the most needed by computer science researchers. The following sets of data are provided under universal release to any computer science researcher to enable computer science work.
All we ask is that if you use these data in your research, please recognize Los Alamos National Laboratory for providing these data.
All files and content available for download are covered by the LA-URs listed in the filenames. Each file is a tar.gz archive containing a data file (either CSV or text) and a README with some explanations.
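As a starting point, here is a hedged sketch for unpacking one of the archives listed below and loading its data file with pandas. The member names inside each archive are not known here, and some data files are plain text rather than CSV, so consult the bundled README before parsing.

```python
# Hedged sketch: open one of the tar.gz archives and load its data file.
import tarfile

import pandas as pd


def load_failure_archive(path):
    with tarfile.open(path, "r:gz") as tar:
        # Pick the first regular file that is not the README; some archives hold a
        # plain-text data file rather than a CSV, so adjust parsing as needed.
        data_member = next(
            m for m in tar.getmembers()
            if m.isfile() and "readme" not in m.name.lower()
        )
        with tar.extractfile(data_member) as fh:
            return pd.read_csv(fh)


# Example (hypothetical filename): failures = load_failure_archive("all-systems-failures.tar.gz")
```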
Description | Size, gzipped (unpacked) | # of Records | Link |
--- | --- | --- | --- |
All systems failure/interrupt data 1996-2005 | 336K (2.8M) | 23,741 | Data |
System 20 usage with domain info | 9.9M (50M) | 489,376 | Data |
System 20 usage with node info - nodes number from zero | 10M (42M) | 489,376 | Data |
System 20 usage event info - nodes number from zero | 3.1M (32M) | 433,490 | Data |
System 20 node internal disk failure info - nodes number from zero | 8K (16K) | 14 | Data |
System 15 usage with node info - nodes number from zero | 560K (2.3M) | 17,823 | Data |
System 16 usage with node info - nodes number from one | 52M (308M) | 1,630,479 | Data |
System 23 usage with node info - nodes number from one | 15M (58M) | 654,927 | Data |
System 8 usage with node info - nodes number from one | 14M (64M) | 763,293 | Data |
Bianca Schroeder (at Carnegie Mellon University at the time) was kind enough to provide a frequently asked questions (FAQ) document about the 1996–2005 failure data set. NOTE: These responses should be read in the context of this dataset and the time at which they were written.
All datasets reference anonymized "system numbers." To understand the failure, usage, and event info, you also have to understand the machine and/or system layout. For this, we have provided this Excel file. NOTE: Care needs to be taken when looking at the above data, as some of the systems may appear to grow (or shrink) in size for short time periods. Sadly, these time periods are not well documented, but systems were occasionally combined to perform a larger calculation than could be done on any one part alone. The machine and/or system layout data are a single snapshot in time, not a snapshot over time, so additional effort may be required to identify these periods. Luckily, this likely only impacts certain types of studies, and the time periods were relatively short.
Storage Data
The following are collections of data related to storage. They include things like file system stats (fsstats—available on the Ultrascale Systems Research Center [USRC] Software page), dumps from parallel file systems, archive systems, Network File System (NFS) systems, and even workstations.
NOTE: These data are historical in nature and were originally released in 2012.
To enable computer science research, Los Alamos National Laboratory is releasing static file tree data (fsstats) for some of our parallel file systems. These data include aggregate information on capacity, file and directory sizes, filename lengths, link counts, etc. The fsstats cover 9 anonymous parallel file systems ranging from 16 TB to 439 TB of total capacity used, with file counts ranging from 2,024,729 to 43,605,555.
The Machine System Number is explained more under the anonymous machine tab elsewhere on this page, including information such as the number of nodes, cores, memory size, and dates of use.
All files are small (approximately 15 KB each) and uncompressed.
All files and content available for download are covered by document LA-UR-07-5769.
Anonymous Filesystem | Date | Machine System Number | Link |
--- | --- | --- | --- |
Anonymous Parallel Filesystem 1 | January 11, 2012 | 28,31,36,37 | Data |
Anonymous Parallel Filesystem 2 | January 3, 2012 | 28,31,36,37 | Data |
Anonymous Parallel Filesystem 3 | January 3, 2012 | 28,31,36,37 | Data |
Anonymous Parallel Filesystem 4 | January 4, 2012 | 28,31,36,37 | Data |
Anonymous Parallel Filesystem 5 | January 11, 2012 | 28,31,36,37 | Data |
Anonymous Parallel Filesystem 6 | November 21, 2011 | 29,30,34,38,39,64,65 | Data |
Anonymous Parallel Filesystem 7 | November 17, 2011 | 29,30,34,38,39,64,65 | Data |
Anonymous Parallel Filesystem 8 | November 21, 2011 | 27,35 | Data |
Anonymous Parallel Filesystem 9 | November 21, 2011 | 27,35 | Data |
Frequently Asked Questions:
- Can you tell me a little bit about how these file systems are typically used?
- Group 1: these users tend to run N-N codes (each process writes its own file); they write tons of tiny files.
- Group 2: you see a lot more N-1 codes (many processes writing to a single shared file); thus, Group 2 has fewer files than Group 1 typically has.
- Group 3: test systems (non-production runs meant to test or debug).
Really any "group" can run on any system, but here are the trends:
File systems 6 and 7 tend to be Group 1 usage, while 1–5 and 8–9 are more Group 2.
- Can we compare these file systems to the 2008 fsstats released by LANL?
- For the 2008 fsstats, there are two Group 2 file systems, and one (panscratch1.csv) is from Group 3. Comparing data across groups may not be useful.
- Why is the negative overhead for anonymous file system 6 so large?
- Almost all of the negative overhead for anonymous file system 6 comes from 2(!) files from the same user.
File1: 1,029,403,328,512 bytes ≈ 958 GB
File2: 58,650,690,272 bytes ≈ 54.6 GB
So, in total, roughly 1013 GB came from the same user.
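For reference, these figures treat GB as GiB (2^30 bytes); a quick check:

```python
# Quick check of the byte-to-GB figures above (GB here means GiB, i.e. 2**30 bytes).
for name, size in [("File1", 1029403328512), ("File2", 58650690272)]:
    print(f"{name}: {size / 2**30:.1f} GB")  # -> File1: 958.7 GB, File2: 54.6 GB
```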
NOTE: These data are historical in nature and were originally released in 2007.
These data represent file system statistics gathered from backup information for 3000-plus LANL workstations. The following set of data is provided under universal release to any computer science researcher to enable computer science work.
The data are contained in a tar file with three CSV files that represent the three primary types of workstations used at LANL (Mac, UNIX, Windows). Please refer to the README contained in the tar file for a description of the data gathered. All files and content available for download are covered by document LA-UR-07-5769.
Each archive contains a README explaining the data in more detail.
Dataset Release | Results Upload Date | Size | Link |
--- | --- | --- | --- |
version 3.0 | August 31, 2009 | 9.2M | Data |
version 2.0 | March 11, 2009 | 9M | Data |
version 1.1 | October 20, 2008 | 12M | Data |
Frequently Asked Questions:
- Why does version 3.0 have fewer nodes than previous versions?
- Analysis of this run showed that approximately 80 nodes were not surveyed due to an internal problem that prevented the script from getting information on those nodes. Furthermore, approximately 300 nodes were removed from the backup system, and 340 new nodes were added. The net effect was a reduction in total nodes.
- Why does version 2.0 have fewer nodes (versus version 1.1), yet the number of files and total data size have grown?
- The new nodes identified by this run accounted for much of the growth in terms of file counts and data size. Approximately 209 million files belonged to these new nodes, while the number of files related to the nodes no longer backed up was much lower. This resulted in a net gain of files.
- Will anonymous node names and file space names persist over the life of this project?
- Yes, the node name and filespace will be unique to that particular node over the life of this project. A future survey of the nodes will identify those nodes that have already been assigned an anonymous name, and that name (anonymous) will be retained.
- Why does the inactive files histogram have min bucket and max bucket values of equal size?
- This project re-used code that created histograms based on bucket ranges. Because this histogram is meant to show inactive files (1,2,3,4), a range is not necessary. This will be fixed in a future version. See the README contained in the tarball for more information.
NOTE: These data are historical in nature and were originally released in 2011.
These data represent a file system walk with information such as file sizes, creation time, modification time, UID/GID, etc. The data are highly anonymized and are released under document LA-UR-07-5769. Please see the README contained in the individual archives for a more detailed description of the data. The files are compressed text, and the file sizes below represent the compressed download sizes.
The Machine System Number is explained more under the anonymous machine tab elsewhere on this page, including information such as the number of nodes, cores, memory size, and dates of use.
All files and content available for download are covered by document LA-UR-07-5769.
Anonymous Filesystem | Size | Number of Records | Machine System Number | Link |
--- | --- | --- | --- | --- |
Anonymous Archive 1 | 1.4G | 112,020,366 | 29,30,34,38,39,57-61,63,64,65 | Data |
Anonymous Global NFS 1 | TBD | 6,437,081 | 29,30,34,38,39,57-61,63,64,65 | TBD |
Anonymous non-shared NFS Filesystem 1 | 3.5M | 306,340 | 30, 64 | Data |
Anonymous non-shared NFS Filesystem 2 | 6.9M | 590,610 | 39 | Data |
Anonymous non-shared NFS Filesystem 3 | 10M | 855,361 | 57-61 | Data |
Anonymous non-shared NFS Filesystem 4 | 1.8M | 163,267 | 29 | Data |
Anonymous non-shared NFS Filesystem 5 | 8M | 634,008 | 34, 65 | Data |
Anonymous non-shared NFS Filesystem 6 | 2.5M | 222,111 | 38 | Data |
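For illustration only, a reader for these dumps might look like the sketch below. The compression format, column layout, and field names are assumptions; each archive's README defines the actual format.

```python
# Hypothetical reader for the anonymized file-system-walk dumps above.
import gzip

import pandas as pd

# Assumed column order -- a guess for illustration, not the documented format.
ASSUMED_COLUMNS = ["size_bytes", "create_time", "modify_time", "uid", "gid", "anon_path"]


def load_walk(path):
    # The dumps are described as compressed text; assume gzip with whitespace-delimited fields.
    with gzip.open(path, "rt") as fh:
        return pd.read_csv(fh, sep=r"\s+", header=None, names=ASSUMED_COLUMNS)


# Example (hypothetical filename): walk = load_walk("anon-nfs-fs-1.txt.gz")
```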
Frequently Asked Questions:
- What is the difference between global and non-shared NFS file systems?
- The global NFS file system is shared among a number of different clusters. The non-shared NFS file systems are visible to only a single cluster.
- Is the create time field really the time the file was created?
- Yes; this field is the actual creation time, not UNIX "ctime" (change time).
- Are the user identifier (UID)/group identifier (GID) values or paths consistent across file systems?
- No. No assumptions can be made across files. The paths within a single file maintain the file tree hierarchy.
- I notice several instances where the GID is different from the UID, and multiple UIDs have the same GID. Is this right?
- Yes, this is especially the case for the globally accessible file systems. Multiple UIDs sharing the same GID usually indicates users sharing files with a group via UNIX permissions. Local NFS file systems usually have a one-to-one mapping between UID and GID, but this is not required.
- Is the blocksize field the actual blocksize used to store the file, or is it the size allocated to a file?
- This field is for the size of each block allocated.
- Are these file systems/archive mostly used by developers or users?
- Most of the file systems and machines are open to both developers and users. The exception to this is machine 29, which is targeted solely to developers. Some developers also work on their desktop machines, so these files may not be representative of their entire workloads.