USRC Data Sources Storage Data

Traces and statistics.

Description

The following collections of data are related to storage. They include things like file system statistics (fsstats; the tool is available on the USRC Software page) dumped from parallel file systems, archive systems, NFS systems, and even workstations.

Citation

Please use the following citation unless the data source below offers a more specific citation:

If you'd like to cite our data and use BibTeX, please use the following snippet:

@MISC{usrc:datasources-general,
    author = {{Los Alamos National Laboratory}},
    title = {{Ultrascale Systems Research Center (USRC) Data Sources}},
    howpublished= {\url{https://usrc.lanl.gov/data-sources.php}}
}

Parallel File Systems File System Statistics (fsstats)

NOTE: This data is historical in nature and was originally released in 2012.

To enable computer science research, Los Alamos National Laboratory is releasing static file tree data (fsstats) for some of our parallel filesystems. These data include aggregate information on capacity, file and directory sizes, filename lengths, link counts, etc. The fsstats cover 9 anonymous parallel filesystems ranging from 16 TB to 439 TB of total capacity used, with file counts ranging from 2,024,729 to 43,605,555.

The Machine System Number is explained more under the anonymous machine tab elsewhere on this page, including information such as the number of nodes, cores, memory size, and dates of use.

All files are small (approximately 15 KB each) and uncompressed.
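
Because the files are small, uncompressed text, they can be inspected directly with any tooling. The sketch below simply prints one downloaded dump; the filename fs1.txt is a placeholder, and the assumption of a line-oriented layout is ours, not part of the release (the fsstats documentation on the USRC Software page describes the actual format).

# Minimal sketch: print a downloaded fsstats dump for quick inspection.
# Assumptions (not taken from the release): the file is named "fs1.txt"
# and is line-oriented text; see the fsstats documentation on the USRC
# Software page for the real field layout.
from pathlib import Path

def show_fsstats(path: str) -> None:
    """Print every non-empty line of an fsstats dump."""
    for line in Path(path).read_text().splitlines():
        if line.strip():
            print(line)

if __name__ == "__main__":
    show_fsstats("fs1.txt")  # placeholder filename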

All files and content available for download are covered by LA-UR-07-5769.

Anonymous Filesystem Date Machine System Number Link
Anonymous Parallel Filesystem 1 January 11, 2012 28,31,36,37 Data
Anonymous Parallel Filesystem 2 January 3, 2012 28,31,36,37 Data
Anonymous Parallel Filesystem 3 January 3, 2012 28,31,36,37 Data
Anonymous Parallel Filesystem 4 January 4, 2012 28,31,36,37 Data
Anonymous Parallel Filesystem 5 January 11, 2012 28,31,36,37 Data
Anonymous Parallel Filesystem 6 November 21, 2011 29,30,34,38,39,64,65 Data
Anonymous Parallel Filesystem 7 November 17, 2011 29,30,34,38,39,64,65 Data
Anonymous Parallel Filesystem 8 November 21, 2011 27,35 Data
Anonymous Parallel Filesystem 9 November 21, 2011 27,35 Data

FAQ:

  1. Can you tell me a little bit about how these filesystems are typically used?
    • Group 1: these users tend to have N-N codes; they write tons of tiny files.
    • Group 2: you see a lot more N-1 codes, thus fewer files than Group 1 typically has.
    • Group 3: test systems; non-production runs meant to test or debug.

      Really, any "Group" can run on any system, but here are the trends:
      filesystems 6 and 7 tend to be Group 1 usage, while 1-5 and 8-9 are more Group 2.
  2. Can we compare these filesystems to the 2008 fsstats released by LANL?
    • For the 2008 fsstats there are two Group 2 filesystems and one (panscratch1.csv) is from Group 3. Comparing data across groups may not be useful.
  3. Why is the negative overhead for anonymous filesystem 6 so large?
    • Almost all of the negative overhead for anonymous filesystem 6 comes from 2(!) files from the same user:
      File1: 1,029,403,328,512 bytes = 958.7 GB
      File2: 58,650,690,272 bytes = 54.6 GB
      So in total roughly 1013 GB from the same user (see the sketch after this FAQ for the conversion).
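
The GB figures in FAQ 3 use 2^30-byte gigabytes. A quick arithmetic check, as a minimal sketch:

# Quick check of the byte-to-GB conversion quoted in FAQ 3 (GB here = 2^30 bytes).
GB = 2 ** 30

file1 = 1_029_403_328_512  # bytes
file2 = 58_650_690_272     # bytes

print(round(file1 / GB, 1))            # 958.7
print(round(file2 / GB, 1))            # 54.6
print(round((file1 + file2) / GB, 1))  # 1013.3, i.e. roughly 1013 GB in total
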
Workstation File System Statistics (fsstats)

NOTE: This data is historical in nature and was originally released in 2007.

This data represents file system statistics gathered from backup information for more than 3,000 LANL workstations. The data is provided under universal release to any computer science researcher in order to enable computer science work. It was acquired by re-using some aspects of the Petascale Data Storage Institute (PDSI) tool (available on the USRC Software page).

The data is distributed as a tar file containing three CSV files that represent the three primary types of workstations used at LANL (Mac, UNIX, Windows). Please refer to the README contained in the tar file for a description of the data gathered. All files and content available for download are covered by LA-UR-07-5769.

Each archive contains a README explaining the data in more detail.
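
The tarball can be unpacked and its CSV files previewed with standard tooling; the sketch below is only illustrative. The archive name fsstats-v3.tar.gz is a placeholder, and the bundled README (not this sketch) is the authority on file names and column meanings.

# Sketch: unpack a workstation fsstats tarball and peek at its CSV files.
# Assumptions (placeholders, not taken from the release): the archive is
# named "fsstats-v3.tar.gz"; the README inside documents the real file
# names and columns.
import csv
import io
import tarfile

def peek(archive: str, max_rows: int = 3) -> None:
    """Print the first few rows of every CSV member in the tarball."""
    with tarfile.open(archive, "r:*") as tar:
        for member in tar.getmembers():
            if not member.isfile() or not member.name.endswith(".csv"):
                continue  # skip the README and anything else
            print(f"== {member.name} ==")
            text = tar.extractfile(member).read().decode("utf-8", errors="replace")
            for i, row in enumerate(csv.reader(io.StringIO(text))):
                if i >= max_rows:
                    break
                print(row)

if __name__ == "__main__":
    peek("fsstats-v3.tar.gz")  # placeholder archive name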

Dataset Release Results Upload Date Size Link
version 3.0 August 31, 2009 9.2M Data
version 2.0 March 11, 2009 9M Data
version 1.1 October 20, 2008 12M Data

FAQ:

  1. Why does version 3.0 have fewer nodes than previous versions?
    • Analysis of this run showed that approximately 80 nodes were not surveyed due to an internal problem that prevented the script from getting information on those nodes. Furthermore, approximately 300 nodes were removed from the backup system and 340 new nodes were added. The net effect was a reduction in total nodes.
  2. Why does version 2.0 have fewer nodes (versus version 1.1) yet the number of files and total data size have grown?
    • The new nodes identified by this run accounted for much of the growth in terms of file counts and data size. Approximately 209 million files belonged to these new nodes, while the number of files related to the nodes no longer backed up was much lower. This resulted in a net gain of files.
  3. Will anonymous node names and file space names persist over the life of this project?
    • Yes. The node name and filespace name will be unique to that particular node over the life of this project. A future survey of the nodes will identify those nodes that have already been assigned an anonymous name, and that anonymous name will be retained.
  4. Why does the inactive files histogram have min bucket and max bucket values of equal size?
    • This project re-used code that created histograms based on bucket ranges. Since this histogram is meant to show inactive files (1, 2, 3, 4), a range is not necessary. This will be fixed in a future version. See the README contained in the tarball for more information.

Archive and NFS Metadata

NOTE: This data is historical in nature and was originally released in 2011.

This data represents a file system walk, with information such as file sizes, creation time, modification time, UID/GID, etc. The data is highly anonymized and released under LA-UR-07-5769. Please see the README contained in the individual archives for a more detailed description of the data. The files are compressed text, and the file sizes listed below are the compressed download sizes.

The Machine System Number is explained more under the anonymous machine tab elsewhere on this page, including information such as the number of nodes, cores, memory size, and dates of use.

All files and content available for download are covered by LA-UR-07-5769.
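
Once decompressed, the per-file records can be processed line by line. The sketch below tallies bytes per UID under assumed placeholders: gzip compression, whitespace-delimited fields, file size in the first column, UID in the second. The README inside each archive documents the real layout, so adjust the field indices accordingly.

# Sketch: tally bytes per UID from one decompressed metadata walk.
# Assumptions (placeholders, not taken from the release): gzip-compressed,
# whitespace-delimited records with the file size in column 0 and the UID
# in column 1; each archive's README gives the real layout.
import gzip
from collections import defaultdict

def bytes_per_uid(path: str) -> dict:
    """Sum file sizes per UID, skipping malformed or header lines."""
    totals = defaultdict(int)
    with gzip.open(path, "rt", errors="replace") as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 2:
                continue
            try:
                size = int(fields[0])
            except ValueError:
                continue  # header or non-numeric line
            totals[fields[1]] += size
    return dict(totals)

if __name__ == "__main__":
    # "nfs1.txt.gz" is a placeholder for one of the downloaded archives.
    for uid, total in sorted(bytes_per_uid("nfs1.txt.gz").items()):
        print(uid, total)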

Anonymous Filesystem Size Number of Records Machine System Number Link
Anonymous Archive 1 1.4G 112,020,366 29,30,34,38,39,57-61,63,64,65 Data
Anonymous Global NFS 1 TBD 6,437,081 29,30,34,38,39,57-61,63,64,65 TBD
Anonymous non-shared NFS Filesystem 1 3.5M 306,340 30,64 Data
Anonymous non-shared NFS Filesystem 2 6.9M 590,610 39 Data
Anonymous non-shared NFS Filesystem 3 10M 855,361 57-61 Data
Anonymous non-shared NFS Filesystem 4 1.8M 163,267 29 Data
Anonymous non-shared NFS Filesystem 5 8M 634,008 34,65 Data
Anonymous non-shared NFS Filesystem 6 2.5M 222,111 38 Data

FAQ:

  1. What is the difference between global and non-shared NFS filesystems?
    • The global NFS filesystem is shared among a number of different clusters. The non-shared NFS filesystems are visible to only a single cluster.
  2. Is the create time field really the time the file was created?
    • Yes. This field is not UNIX "ctime" (change time).
  3. Are the UID/GID or paths consistent across filesystems?
    • No. No assumptions can be made across files. The paths within a single file maintain the file tree hierarchy.
  4. I notice several instances where the GID is different from the UID, and multiple UIDs have the same GID. Is this right?
    • Yes, this is especially the case for the globally accessible filesystems. Multiple UIDs with the same GID usually indicate users sharing files via a group and UNIX permissions. Local NFS filesystems usually have a one-to-one mapping between UID and GID, but this is not required.
  5. Is the blocksize field the actual blocksize used to store the file, or is it the size allocated to a file?
    • This field is for the size of each block allocated.
  6. Are these filesystems/archive mostly used by developers or users?
    • Most of the filesystems and machines are open to both developers and users. The exception to this is machine 29, which is targeted solely to developers. Some developers also work on their desktop machines so these files may not be representative of their entire workloads.