Operational Data
These data are characterized as "operational" in nature, such as telemetry, environmental data, etc.
ATLAS is a collaboration with Carnegie Mellon University's Parallel Data Lab. The repository contains traces from LANL supercomputers as well as from other systems, including a financial company. There are also links to publications and insights garnered from these data. The data for the ATLAS project are hosted by Carnegie Mellon.
This tar file contains uncompressed VPIC restart files. The provided data are a subset of a much larger restart data set. VPIC is an open-source, particle-in-cell code developed at Los Alamos National Laboratory that is primarily used for astrophysics simulations. These data have proven difficult to compress and are frequently used by the Laboratory to test various compression algorithms.
The data are System Environment Data Collections (SEDC) sensor telemetry (voltages, temperatures, fan speeds, water flow information, etc.). These data cover 2/9/16–2/18/16, during which Trinity was in "open science" testing by users. See the README for more information. Released under LA-UR-17-24849.
This section outlines a collection of data released by LANL under LA-UR-19-28211 and available below: memory usage data from three open clusters, collected from late 2018 through early 2019.
Citation
The following datasets were released in conjunction with the following paper:
- Gagandeep Panwar, Da Zhang, Yihan Pang, Mai Dahshan, Nathan DeBardeleben, Binoy Ravindran, and Xun Jian. 2019. Quantifying Memory Underutilization in HPC Systems and Using it to Improve Performance via Architecture Support. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '52). ACM, New York, NY, USA, 821-835. DOI: doi.org/10.1145/3352460.3358267 (ACM Digital Library)
Overview
This entire collection of data has been released openly with the identifier LA-UR-19-28211. The data consist of 19,678 JSON files totaling 87.4 GB uncompressed. These JSON files contain memory usage statistics from three compute clusters at LANL: grizzly, badger, and snow. The data are relatively fine-grained (every 10 seconds, where records exist) for every node in each cluster and identify which (numerical) job ID used that memory. Since the job IDs are unique, one can follow parallel jobs across multiple nodes. For instance, job ID 123 running on nodes 100, 101, and 102 will have separate memory utilization records for all three nodes for the time it was running.
Each archive has a detailed README which explains the data format and the cluster sizes, memory system sizes, etc. A brief sample of some trivial analysis from this data is at the bottom of the page. For a more detailed analysis, we encourage you to see the paper referenced above.
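As a rough illustration of working with these files, the sketch below flattens one dataset's JSON files into a single table. The field names used here ("timestamp", "host", "job_id", "meminfo.active", "meminfo.memfree") and the assumption that each file holds a list of records are illustrative only; the README bundled with each archive documents the actual schema.

```python
# Minimal sketch: flatten one dataset's JSON files into a pandas DataFrame.
# NOTE: the record layout and field names below are assumptions for illustration;
# see the README bundled with each archive for the real schema.
import json
import pathlib

import pandas as pd


def load_dataset(directory):
    """Load every *.json file under `directory` into one DataFrame."""
    rows = []
    for path in sorted(pathlib.Path(directory).glob("*.json")):
        with open(path) as fh:
            records = json.load(fh)  # assumed: each file holds a list of records
        for rec in records:
            rows.append({
                "timestamp": rec.get("timestamp"),  # assumed epoch seconds
                "host": rec.get("host"),
                "job_id": rec.get("job_id"),
                "meminfo.active": rec.get("meminfo.active"),
                "meminfo.memfree": rec.get("meminfo.memfree"),
            })
    df = pd.DataFrame(rows)
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s", errors="coerce")
    return df


# Example (hypothetical path to an unpacked archive): df = load_dataset("grizzly0/")
```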
Data
Cluster Name | Date Range | File Size (compressed) | File Size (uncompressed) | # of Files | Link |
--- | --- | --- | --- | --- | --- |
Grizzly (dataset 0) | 11/1/18 - 11/27/18 | 1.6 GB | 12 GB | 1295 | grizzly0 |
Grizzly (dataset 1) | 12/1/18 - 12/22/18 | 1.3 GB | 10 GB | 999 | grizzly1 |
Grizzly (dataset 2) | 12/22/18 - 1/11/19 | 1.0 GB | 8.2 GB | 1001 | grizzly2 |
Grizzly (dataset 3) | 1/12/19 - 2/5/19 | 1.3 GB | 9.7 GB | 1201 | grizzly3 |
Grizzly (dataset 4) | 2/5/19 - 2/22/19 | 799 MB | 6.1 GB | 801 | grizzly4 |
Grizzly (dataset 5) | 2/22/19 - 3/18/19 | 927 MB | 7.4 GB | 1167 | grizzly5 |
Badger | 11/1/18 - 3/18/19 | 2.1 GB | 17 GB | 6607 | badger |
Snow | 11/1/18 - 3/18/19 | 2.4 GB | 17 GB | 6607 | snow |
Samples
The following images are intended to give simple snapshots of the raw data. They are not a scientific analysis but may be useful to give a reader an idea of what the data hold.
Hosts
Here we simply look at the hostnames reporting data, which may be indicative of how these hosts are utilized. When combined with job and time information (in the dataset, but not shown here), more interesting and advanced analytics could be performed. This first plot is from the grizzly0 dataset, and most hosts report about the same number of records.
Metrics
As the README files explain, the datasets contain meminfo.memfree and meminfo.active samples from every node at periodic intervals. Using this, one can get an idea of memory usage. For instance, the following plot shows a quick glimpse at the meminfo.active metric for the snow dataset.
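Reusing the hypothetical load_dataset() helper sketched in the Overview (paths and field names remain assumptions), something like the following would produce a similar glimpse:

```python
# Rough glimpse of meminfo.active over time, mirroring the plot described above.
# Assumes the illustrative schema and load_dataset() helper from the Overview sketch.
import matplotlib.pyplot as plt

df = load_dataset("snow/")  # hypothetical path to the unpacked snow archive
active = (
    df.set_index("timestamp")["meminfo.active"]
      .resample("10min")    # coarsen the ~10 s samples for plotting
      .mean()
)
active.plot(title="snow: mean meminfo.active across reporting hosts")
plt.ylabel("meminfo.active")
plt.show()
```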
JobIDs
The datasets include per-node memory usage data along with the JobID consuming that memory. Since these systems are used for parallel computation, one can see the amount of memory consumed by parallel jobs and determine how many nodes each job ran on. The following plot shows a snapshot of the number of records in just the first hour of the grizzly1 dataset. It does not show how much memory was used or how many nodes these jobs ran on, just the number of records in the dataset. It is presented here (as mentioned earlier) to give the reader an idea of what is in the data.
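Again reusing the hypothetical load_dataset() helper and assumed field names from the Overview sketch, a per-job summary over the first hour might look like this:

```python
# Per-job view: number of records and distinct nodes per job ID in the first hour.
import pandas as pd

df = load_dataset("grizzly1/")  # hypothetical path to the unpacked grizzly1 archive
start = df["timestamp"].min()
first_hour = df[df["timestamp"] < start + pd.Timedelta(hours=1)]

per_job = first_hour.groupby("job_id").agg(
    records=("timestamp", "size"),   # record count per job
    nodes=("host", "nunique"),       # distinct nodes the job touched
)
print(per_job.sort_values("records", ascending=False).head(10))
```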
Failure Data
The following are collections of data related to failures on high performance computing (HPC) systems. Many of the data dumps include READMEs explaining the datasets as well as descriptions of the machines and/or systems the statistics were taken from. Failures include memory, processor, network, etc.
NOTE: These data are historical in nature and were originally released in 2005.
To enable open computer science research, access to computer operational data is essential. Data on failure, availability, usage, environment, performance, and workload characterization are among the most needed by computer science researchers. The following sets of data are provided under universal release to any computer science researcher to enable computer science work.
All we ask is that if you use these data in your research, please recognize Los Alamos National Laboratory for providing these data.
All files and content available for download are covered by the LA-URs listed in the filenames. Each file is a tar.gz archive containing a data file (either CSV or text) and a README with some explanations.
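As a starting point, here is a hedged sketch for unpacking one of the archives listed below and loading its data file with pandas. The member names inside each archive are not known here, and some data files are plain text rather than CSV, so consult the bundled README before parsing.

```python
# Hedged sketch: open one of the tar.gz archives and load its data file.
import tarfile

import pandas as pd


def load_failure_archive(path):
    with tarfile.open(path, "r:gz") as tar:
        # Pick the first regular file that is not the README; some archives hold a
        # plain-text data file rather than a CSV, so adjust parsing as needed.
        data_member = next(
            m for m in tar.getmembers()
            if m.isfile() and "readme" not in m.name.lower()
        )
        with tar.extractfile(data_member) as fh:
            return pd.read_csv(fh)


# Example (hypothetical filename): failures = load_failure_archive("all-systems-failures.tar.gz")
```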
Description | Size, gzipped (unpacked) | # of Records | Link |
--- | --- | --- | --- |
All systems failure/interrupt data 1996-2005 | 336K (2.8M) | 23,741 | Data |
System 20 usage with domain info | 9.9M (50M) | 489,376 | Data |
System 20 usage with node info - nodes number from zero | 10M (42M) | 489,376 | Data |
System 20 usage event info - nodes number from zero | 3.1M (32M) | 433,490 | Data |
System 20 node internal disk failure info - nodes number from zero | 8K (16K) | 14 | Data |
System 15 usage with node info - nodes number from zero | 560K (2.3M) | 17,823 | Data |
System 16 usage with node info - nodes number from one | 52M (308M) | 1,630,479 | Data |
System 23 usage with node info - nodes number from one | 15M (58M) | 654,927 | Data |
System 8 usage with node info - nodes number from one | 14M (64M) | 763,293 | Data |
Bianca Schroeder (at Carnegie Mellon University at the time) was kind enough to provide a frequently asked questions (FAQ) document about the 1996–2005 failure data set. NOTE: These responses should be read in the context of this dataset and the time at which they were written.
All datasets reference anonymized "system numbers." To understand the failure, usage, and event info, you also have to understand the machine and/or system layout. For this, we have provided this Excel file. NOTE: Care needs to be taken when looking at the above data, as some of the systems may appear to grow (or shrink) in size for short time periods. Sadly, these time periods are not well documented, but systems were occasionally combined to perform a larger calculation than could be done on any one part alone. The machine and/or system layout data are a single snapshot in time, not a snapshot over time, so additional effort may be required to identify these periods. Luckily, this likely only impacts certain types of studies, and the time periods were relatively short.
Storage Data
The following are collections of data related to storage. They include things like file system stats (fsstats—available on the Ultrascale Systems Research Center [USRC] Software page), dumps from parallel file systems, archive systems, Network File System (NFS) systems, and even workstations.
NOTE: These data are historical in nature and were originally released in 2012.
To enable computer science research, Los Alamos National Laboratory is releasing static file tree data (fsstats) for some of our parallel file systems. These data include aggregate information on capacity, file and directory sizes, filename lengths, link counts, etc. The fsstats cover 9 anonymous parallel file systems ranging from 16 TB to 439 TB of total capacity used, with file counts ranging from 2,024,729 to 43,605,555.
The Machine System Number is explained more under the anonymous machine tab elsewhere on this page, including information such as the number of nodes, cores, memory size, and dates of use.
All files are small (approximately 15 KB each) and uncompressed.
All files and content available for download are covered by document LA-UR-07-5769.
Anonymous Filesystem | Date | Machine System Number | Link |
--- | --- | --- | --- |
Anonymous Parallel Filesystem 1 | January 11, 2012 | 28,31,36,37 | Data |
Anonymous Parallel Filesystem 2 | January 3, 2012 | 28,31,36,37 | Data |
Anonymous Parallel Filesystem 3 | January 3, 2012 | 28,31,36,37 | Data |
Anonymous Parallel Filesystem 4 | January 4, 2012 | 28,31,36,37 | Data |
Anonymous Parallel Filesystem 5 | January 11, 2012 | 28,31,36,37 | Data |
Anonymous Parallel Filesystem 6 | November 21, 2011 | 29,30,34,38,39,64,65 | Data |
Anonymous Parallel Filesystem 7 | November 17, 2011 | 29,30,34,38,39,64,65 | Data |
Anonymous Parallel Filesystem 8 | November 21, 2011 | 27,35 | Data |
Anonymous Parallel Filesystem 9 | November 21, 2011 | 27,35 | Data |
Frequently Asked Questions:
- Can you tell me a little bit about how these file systems are typically used?
- Group 1: these users tend to run N-N codes (each process writes its own file); they write tons of tiny files.
- Group 2: you see a lot more N-1 codes (many processes writing to a single shared file); thus, Group 2 has fewer files than Group 1 typically has.
- Group 3: test systems (non-production runs meant to test or debug).
Really any "group" can run on any system, but here are the trends:
File systems 6 and 7 tend to be Group 1 usage, while 1–5 and 8–9 are more Group 2.
- Can we compare these file systems to the 2008 fsstats released by LANL?
- For the 2008 fsstats, there are two Group 2 file systems, and one (panscratch1.csv) is from Group 3. Comparing data across groups may not be useful.
- Why is the negative overhead for anonymous file system 6 so large?
- Almost all of the negative overhead for anonymous file system 6 comes from 2(!) files from the same user.
File1: 1,029,403,328,512 bytes ≈ 958 GB
File2: 58,650,690,272 bytes ≈ 54.6 GB
So, in total, roughly 1013 GB came from the same user.
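For reference, these figures treat GB as GiB (2^30 bytes); a quick check:

```python
# Quick check of the byte-to-GB figures above (GB here means GiB, i.e. 2**30 bytes).
for name, size in [("File1", 1029403328512), ("File2", 58650690272)]:
    print(f"{name}: {size / 2**30:.1f} GB")  # -> File1: 958.7 GB, File2: 54.6 GB
```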
NOTE: These data are historical in nature and were originally released in 2007.
These data represent file system statistics gathered from backup information for 3000-plus LANL workstations. The following set of data is provided under universal release to any computer science researcher to enable computer science work.
The data are contained in a tar file with three CSV files that represent the three primary types of workstations used at LANL (Mac, UNIX, Windows). Please refer to the README contained in the tar file for a description of the data gathered. All files and content available for download are covered by document LA-UR-07-5769.
Each archive contains a README explaining the data in more detail.
Dataset Release | Results Upload Date | Size | Link |
--- | --- | --- | --- |
version 3.0 | August 31, 2009 | 9.2M | Data |
version 2.0 | March 11, 2009 | 9M | Data |
version 1.1 | October 20, 2008 | 12M | Data |
Frequently Asked Questions:
- Why does version 3.0 have fewer nodes than previous versions?
- Analysis of this run showed that approximately 80 nodes were not surveyed due to an internal problem that prevented the script from getting information on those nodes. Furthermore, approximately 300 nodes were removed from the backup system, and 340 new nodes were added. The net effect was a reduction in total nodes.
- Why does version 2.0 have fewer nodes (versus version 1.1), yet the number of files and total data size have grown?
- The new nodes identified by this run accounted for much of the growth in terms of file counts and data size. Approximately 209 million files belonged to these new nodes, while the number of files related to the nodes no longer backed up was much lower. This resulted in a net gain of files.
- Will anonymous node names and file space names persist over the life of this project?
- Yes, the node name and filespace will be unique to that particular node over the life of this project. A future survey of the nodes will identify those nodes that have already been assigned an anonymous name, and that name (anonymous) will be retained.
- Why does the inactive files histogram have min bucket and max bucket values of equal size?
- This project re-used code that created histograms based on bucket ranges. Because this histogram is meant to show inactive files (1,2,3,4), a range is not necessary. This will be fixed in a future version. See the README contained in the tarball for more information.
NOTE: These data are historical in nature and were originally released in 2011.
These data represent a file system walk with information such as file sizes, creation time, modification time, UID/GID, etc. The data are highly anonymized and are released under document LA-UR-07-5769. Please see the README contained in the individual archives for a more detailed description of the data. The files are compressed text, and the file sizes below represent the compressed download sizes.
The Machine System Number is explained more under the anonymous machine tab elsewhere on this page, including information such as the number of nodes, cores, memory size, and dates of use.
All files and content available for download are covered by document LA-UR-07-5769.
Anonymous Filesystem | Size | Number of Records | Machine System Number | Link |
--- | --- | --- | --- | --- |
Anonymous Archive 1 | 1.4G | 112,020,366 | 29,30,34,38,39,57-61,63,64,65 | Data |
Anonymous Global NFS 1 | TBD | 6,437,081 | 29,30,34,38,39,57-61,63,64,65 | TBD |
Anonymous non-shared NFS Filesystem 1 | 3.5M | 306,340 | 30, 64 | Data |
Anonymous non-shared NFS Filesystem 2 | 6.9M | 590,610 | 39 | Data |
Anonymous non-shared NFS Filesystem 3 | 10M | 855,361 | 57-61 | Data |
Anonymous non-shared NFS Filesystem 4 | 1.8M | 163,267 | 29 | Data |
Anonymous non-shared NFS Filesystem 5 | 8M | 634,008 | 34, 65 | Data |
Anonymous non-shared NFS Filesystem 6 | 2.5M | 222,111 | 38 | Data |
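For illustration only, a reader for these dumps might look like the sketch below. The compression format, column layout, and field names are assumptions; each archive's README defines the actual format.

```python
# Hypothetical reader for the anonymized file-system-walk dumps above.
import gzip

import pandas as pd

# Assumed column order -- a guess for illustration, not the documented format.
ASSUMED_COLUMNS = ["size_bytes", "create_time", "modify_time", "uid", "gid", "anon_path"]


def load_walk(path):
    # The dumps are described as compressed text; assume gzip with whitespace-delimited fields.
    with gzip.open(path, "rt") as fh:
        return pd.read_csv(fh, sep=r"\s+", header=None, names=ASSUMED_COLUMNS)


# Example (hypothetical filename): walk = load_walk("anon-nfs-fs-1.txt.gz")
```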
Frequently Asked Questions:
- What is the difference between global and non-shared NFS file systems?
- The global NFS file system is shared among a number of different clusters. The non-shared NFS file systems are visible to only a single cluster.
- Is the create time field really the time the file was created?
- Yes; this field is the actual creation time, not UNIX "ctime" (change time).
- Are the user identifier (UID)/group identifier (GID) values or paths consistent across file systems?
- No. No assumptions can be made across files. The paths within a single file maintain the file tree hierarchy.
- I notice several instances where the GID is different from the UID, and multiple UIDs have the same GID. Is this right?
- Yes, this is especially the case for the globally accessible file systems. Multiple UIDs sharing the same GID usually indicates users sharing files with a group via UNIX permissions. Local NFS file systems usually have a one-to-one mapping between UID and GID, but this is not required.
- Is the blocksize field the actual blocksize used to store the file, or is it the size allocated to a file?
- This field is for the size of each block allocated.
- Are these file systems/archive mostly used by developers or users?
- Most of the file systems and machines are open to both developers and users. The exception to this is machine 29, which is targeted solely to developers. Some developers also work on their desktop machines, so these files may not be representative of their entire workloads.