USRC Data Sources Storage Data

Traces and statistics

Description

The following are collections of data related to storage. They include things like file system stats (fsstats—available on the Ultrascale Systems Research Center [USRC] Software page), dumps from parallel file systems, archive systems, Network File System (NFS) systems, and even workstations.

Citation

Please use the following citation unless a data source below offers a more specific one. If you use BibTeX, the snippet below can be copied as-is:

@MISC{usrc:datasources-general,
    author = {{Los Alamos National Laboratory}},
    title = {{Ultrascale Systems Research Center (USRC) Data Sources}},
    howpublished= {\url{https://usrc.lanl.gov/data-sources.php}}
}

Parallel File Systems File System Statistics (fsstats)

NOTE: These data are historical in nature and were originally released in 2012.

To enable computer science research, Los Alamos National Laboratory (LANL) is releasing static file tree data (fsstats) for some of our parallel file systems. These data include aggregate information on capacity, file and directory sizes, filename lengths, link counts, etc. The fsstats cover 9 anonymous parallel file systems ranging from 16 TB to 439 TB total capacity used, and file counts range from 2,024,729 to 43,605,555.
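As a rough illustration of the kind of aggregate information described above, the following Python sketch walks a directory tree and tallies file counts, total size, filename lengths, and link counts. It is not the fsstats tool and does not reproduce its output format; the function name `summarize_tree` and the layout of the returned dictionary are assumptions made for this example.

```python
import os
import stat
from collections import Counter


def summarize_tree(root):
    """Walk `root` and collect simple aggregate statistics (illustrative only)."""
    summary = {
        "files": 0,
        "directories": 0,           # subdirectories found below root
        "total_bytes": 0,
        "name_lengths": Counter(),  # filename length -> number of files
        "link_counts": Counter(),   # hard-link count -> number of files
    }
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk lists symlinked directories here too, but does not descend into them.
        summary["directories"] += len(dirnames)
        for name in filenames:
            # lstat does not follow symbolic links.
            info = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(info.st_mode):
                continue  # skip symlinks, sockets, and other special files
            summary["files"] += 1
            summary["total_bytes"] += info.st_size
            summary["name_lengths"][len(name)] += 1
            summary["link_counts"][info.st_nlink] += 1
    return summary
```

Calling `summarize_tree("/some/path")` returns the dictionary of totals; the released data files contain only aggregates of this general kind, not per-file records.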

The Machine System Number is explained more under the anonymous machine tab elsewhere on this page, including information such as the number of nodes, cores, memory size, and dates of use.

All files are very small (approximately 15 KB each) and uncompressed.

All files and content available for download are covered by document LA-UR-07-5769.

Table with list of data files.
Anonymous Filesystem | Date | Machine System Number | Link
Anonymous Parallel Filesystem 1 | January 11, 2012 | 28,31,36,37 | Data
Anonymous Parallel Filesystem 2 | January 3, 2012 | 28,31,36,37 | Data
Anonymous Parallel Filesystem 3 | January 3, 2012 | 28,31,36,37 | Data
Anonymous Parallel Filesystem 4 | January 4, 2012 | 28,31,36,37 | Data
Anonymous Parallel Filesystem 5 | January 11, 2012 | 28,31,36,37 | Data
Anonymous Parallel Filesystem 6 | November 21, 2011 | 29,30,34,38,39,64,65 | Data
Anonymous Parallel Filesystem 7 | November 17, 2011 | 29,30,34,38,39,64,65 | Data
Anonymous Parallel Filesystem 8 | November 21, 2011 | 27,35 | Data
Anonymous Parallel Filesystem 9 | November 21, 2011 | 27,35 | Data

Frequently Asked Questions:

  1. Can you tell me a little bit about how these file systems are typically used?
    • Usage falls into three rough groups:
      Group 1: these users tend to run N-N codes (each process writes its own file), so they write very large numbers of tiny files.
      Group 2: these users run many more N-1 codes (all processes write to a single shared file), so Group 2 typically has fewer files than Group 1.
      Group 3: test systems, used for non-production runs meant to test or debug.

      Any group can run on any system, but the trends are as follows:
      File systems 6 and 7 tend to see Group 1 usage, while 1–5 and 8–9 see more Group 2 usage.
  2. Can we compare these file systems to the 2008 fsstats released by LANL?
    • For the 2008 fsstats, there are two Group 2 file systems, and one (panscratch1.csv) is from Group 3. Comparing data across groups may not be useful.
  3. Why is the negative overhead for anonymous file system 6 so large?
    • Almost all of the negative overhead for anonymous file system 6 comes from just two files owned by the same user:
      File1: 1029403328512 bytes = 958.7 GB
      File2: 58650690272 bytes = 54.6 GB
      Together, about 1013 GB came from that one user (taking 1 GB as 2^30 bytes).
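The byte-to-gigabyte conversion in the answer to question 3 can be checked with a few lines of Python. This is only a sketch: it assumes that "GB" in the answer means 2^30 bytes, which is the interpretation that matches the quoted per-file figures to within rounding. The dictionary and function names are made up for the example.

```python
GIB = 2**30  # assumption: "GB" in the FAQ answer means 2**30 bytes

# Byte counts quoted in the FAQ answer for anonymous file system 6.
file_sizes_bytes = {
    "File1": 1029403328512,
    "File2": 58650690272,
}


def to_gib(n_bytes):
    """Convert a byte count to units of 2**30 bytes."""
    return n_bytes / GIB


for name, size in file_sizes_bytes.items():
    print(f"{name}: {to_gib(size):.1f} GB")

total_gib = to_gib(sum(file_sizes_bytes.values()))
print(f"Total: {total_gib:.1f} GB")
```

Under a decimal definition (1 GB = 10^9 bytes) the same files would come to roughly 1029 GB and 59 GB, so the choice of unit matters when comparing against other tools.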
Workstation File System Statistics (fsstats)

NOTE: These data are historical in nature and were originally released in 2007.

These data represent file system statistics gathered from backup information for more than 3,000 LANL workstations. The data set is provided under universal release to any computer science researcher, to enable computer science work. The data were acquired by reusing parts of the Petascale Data Storage Institute (PDSI) tool (available on the USRC Software page).

The data are packaged as a tar file containing three CSV files, one for each of the three primary types of workstations used at LANL (Mac, UNIX, Windows). Please refer to the README contained in the tar file for a description of the data gathered. All files and content available for download are covered by document LA-UR-07-5769.

Each archive contains a README explaining the data in more detail.

Table with list of data files.
Dataset Release | Results Upload Date | Size | Link
version 3.0 | August 31, 2009 | 9.2M | Data
version 2.0 | March 11, 2009 | 9M | Data
version 1.1 | October 20, 2008 | 12M | Data

Frequently Asked Questions:

  1. Why does version 3.0 have fewer nodes than previous versions?
    • Analysis of this run showed that approximately 80 nodes were not surveyed because an internal problem prevented the script from getting information on them. In addition, approximately 300 nodes were removed from the backup system and 340 new nodes were added. The net effect was a reduction of roughly 40 surveyed nodes (340 added, minus 300 removed, minus 80 not surveyed).
  2. Why does version 2.0 have fewer nodes than version 1.2, yet the number of files and total data size have grown?
    • The new nodes identified by this run accounted for much of the growth in terms of file counts and data size. Approximately 209 million files belonged to these new nodes, while the number of files related to the nodes no longer backed up was much lower. This resulted in a net gain of files.
  3. Will anonymous node names and file space names persist over the life of this project?
    • Yes, the anonymous node name and filespace name will remain unique to that particular node over the life of this project. A future survey will identify nodes that have already been assigned an anonymous name, and that name will be retained.
  4. Why does the inactive files histogram have min bucket and max bucket values of equal size?
    • This project re-used code that created histograms based on bucket ranges. Because this histogram is meant to show inactive files (1,2,3,4), a range is not necessary. This will be fixed in a future version. See the README contained in the tarball for more information.
Archive and NFS Metadata

NOTE: These data are historical in nature and were originally released in 2011.

These data come from a file system walk and include information such as file sizes, creation time, modification time, UID/GID, etc. The data are highly anonymized and are released under document LA-UR-07-5769. Please see the README contained in the individual archives for a more detailed description of the data. The files are compressed text, and the sizes listed below are the compressed download sizes.

The Machine System Number is explained more under the anonymous machine tab elsewhere on this page, including information such as the number of nodes, cores, memory size, and dates of use.

All files and content available for download are covered by document LA-UR-07-5769.

Table with list of data files.
Anonymous Filesystem | Size | Number of Records | Machine System Number | Link
Anonymous Archive 1 | 1.4G | 112,020,366 | 29,30,34,38,39,57-61,63,64,65 | Data
Anonymous Global NFS 1 | TBD | 6,437,081 | 29,30,34,38,39,57-61,63,64,65 | TBD
Anonymous non-shared NFS Filesystem 1 | 3.5M | 306,340 | 30,64 | Data
Anonymous non-shared NFS Filesystem 2 | 6.9M | 590,610 | 39 | Data
Anonymous non-shared NFS Filesystem 3 | 10M | 855,361 | 57-61 | Data
Anonymous non-shared NFS Filesystem 4 | 1.8M | 163,267 | 29 | Data
Anonymous non-shared NFS Filesystem 5 | 8M | 634,008 | 34,65 | Data
Anonymous non-shared NFS Filesystem 6 | 2.5M | 222,111 | 38 | Data

Frequently Asked Questions:

  1. What is the difference between global and non-shared NFS file systems?
    • The global NFS file system is shared among a number of different clusters. The non-shared NFS file systems are visible to only a single cluster.
  2. Is the create time field really the time the file was created?
    • Yes. This field is not UNIX "ctime" (change time).
  3. Are the user identifiers (UIDs), group identifiers (GIDs), or paths consistent across file systems?
    • No. No assumptions can be made across files. The paths within a single file maintain the file tree hierarchy.
  4. I notice several instances where the GID is different from the UID, and multiple UIDs have the same GID. Is this right?
    • Yes, this is especially the case for the globally accessible file systems. Multiple UIDs with the same GID usually indicate users sharing files through a group and UNIX permissions. Local NFS file systems usually have a one-to-one mapping between UID and GID, but this is not required.
  5. Is the blocksize field the actual blocksize used to store the file, or is it the size allocated to a file?
    • This field is for the size of each block allocated.
  6. Are these file systems/archive mostly used by developers or users?
    • Most of the file systems and machines are open to both developers and users. The exception to this is machine 29, which is targeted solely to developers. Some developers also work on their desktop machines, so these files may not be representative of their entire workloads.
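The UID/GID relationship discussed in question 4 can be explored with a short Python sketch. It assumes that the (UID, GID) pairs have already been extracted from one data file; the actual record layout is documented in each archive's README and is not reproduced here. The function names and the sample pairs are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def uids_per_gid(records):
    """Map each GID to the set of UIDs seen with it.

    `records` is an iterable of (uid, gid) pairs taken from ONE data file;
    identifiers are not comparable across files.
    """
    groups = defaultdict(set)
    for uid, gid in records:
        groups[gid].add(uid)
    return dict(groups)


def is_one_to_one(records):
    """Return True if every UID has exactly one GID and vice versa."""
    records = list(records)
    uid_to_gids = defaultdict(set)
    for uid, gid in records:
        uid_to_gids[uid].add(gid)
    gid_to_uids = uids_per_gid(records)
    return (all(len(gids) == 1 for gids in uid_to_gids.values())
            and all(len(uids) == 1 for uids in gid_to_uids.values()))


# Made-up pairs, not taken from the released data.
shared_like = [(101, 500), (102, 500), (103, 501)]
local_like = [(201, 201), (202, 202)]

print(uids_per_gid(shared_like))   # GID 500 is shared by two UIDs
print(is_one_to_one(shared_like))
print(is_one_to_one(local_like))
```

Because identifiers are anonymized independently per file, a check like this is only meaningful within a single data file.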