Archiving data

A good archive system for research data (simulation data and output, software snapshots) is an important part of research: it ensures that old data is not lost and can be retrieved later. PDC has enabled the use of its Tivoli tape robot system for CTL, which allows automatic archival and retrieval of large amounts (terabytes) of data. The system group at CSC, together with PDC, has developed a user-friendly interface to the Tivoli system that does not require knowledge of its internal details.

The archive system consists of a computer at CSC, na-hold.nada.kth.se, with a large disk (currently 2 TB) that acts as a gateway (temporary storage) to the Tivoli system. CTL members put data in a directory structure under /HOLDING/ARCHIVE. Every night at 5 am, all data under /HOLDING/ARCHIVE is moved to the Tivoli tape robot system, and is deleted from the na-hold computer once it has been archived successfully. After the archiving, a full list of everything in the archive system is saved at /var/log/csc-na-archive-log (on na-hold). If you need to retrieve something from the archive system, get the file name from the list and contact the system group at CSC. It is also possible to delete files from the archive system (to remove obsolete data for cost reasons, or to remove incorrectly archived data); for this, too, contact the system group.

Use case of archive system

This is a typical use case of the archival system:

Assume you have a directory with research data at /NOBACKUP/jjan/important_data on the machine descartes.nada.kth.se.
Copy the directory to the na-hold computer (this example uses rsync, but you can copy it in many ways):
rsync --rsync-path "/usr/local/bin/rsync" -avP /NOBACKUP/jjan/important_data jjan@na-hold:/HOLDING/INCOMING/
Log in to na-hold and make an archive file of your directory:
ssh jjan@na-hold.nada.kth.se
cd /HOLDING/INCOMING
tar cvf important_data.tar important_data
Rename the archive file following the standard naming scheme below for easy retrieval:
mv important_data.tar sphere-jjan-2009.tar
Copy the archive file to the /HOLDING/ARCHIVE area, choosing an appropriate directory for the project associated with the data. In this case, assume it is "adaptive-incompressible":
cp sphere-jjan-2009.tar /HOLDING/ARCHIVE/projects/adaptive-incompressible/
Remove the files from the /HOLDING/INCOMING directory.
Wait until the next day (the data is moved at 5 am).
Verify in the /var/log/csc-na-archive-log file that the data was archived (open it in a text editor and search for the file name; there should be an "Archive date" for the file).
If the file has been successfully archived, you can remove the original from your disk if you no longer need it in your current work (in this example it was located on descartes):
ssh jjan@descartes.nada.kth.se
rm -r /NOBACKUP/jjan/important_data
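
As an alternative to opening the log in a text editor, the verification step could be sketched as a small shell function. This is only an illustration: the function name find_in_archive_log is hypothetical, and it assumes the log is a plain-text file in which the archived file name appears on a line of its own entry. The actual layout of /var/log/csc-na-archive-log may differ, so you should still check the matching entry for its "Archive date".

```shell
# Hypothetical helper (not part of the archive system): print every line of
# the archive log that contains the given file name, prefixed with its line
# number. Returns non-zero if the name does not occur in the log at all.
# Assumption: the log is plain text; its exact layout is not known here.
find_in_archive_log() {
    name=$1
    log=${2:-/var/log/csc-na-archive-log}   # default: the log on na-hold
    # -F treats the name as a fixed string, so dots are not wildcards.
    grep -n -F -- "$name" "$log"
}
```

For example, on na-hold you might run find_in_archive_log sphere-jjan-2009.tar and then inspect the reported line in the log before deleting the original data.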

Data organization

To make retrieval simple, secure and systematic, the data has to be organized by owner (project or person) and named according to a standard naming scheme.

The archived data is organized in subdirectories:

/HOLDING/ARCHIVE
- projects
  - adaptive-acoustics
  - adaptive-incompressible
  - adaptive-compressible
  - adaptive-fsi
  - heart
  - parallel-amr
  - unicorn
  - ...
- personal
  - jjan
  - ...

When you insert an archive file into the system, we also enforce a naming scheme for the archive files to further ease retrieval. The naming scheme is data_name-researcher_name-timestamp, with the dash "-" separating the identifiers. data_name should be a descriptive name of the research or software in the archive file, for example "sphere" for a simulation of airflow around a sphere, or even "flow_sphere". researcher_name should be the username of the researcher responsible for generating the data, e.g. "jjan" for Johan Jansson. timestamp should describe the time frame in which the data was generated: typically a year, plus month and day if more detail is required. Thus, in the above example the archive file "important_data.tar" is renamed to "sphere-jjan-2009.tar".
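
The naming scheme can be illustrated with a short shell sketch. The function archive_name is hypothetical and not part of the archive system. It also makes one assumption beyond the text above: data_name and researcher_name are rejected if they contain a dash, so that the three identifiers can always be told apart.

```shell
# Hypothetical helper: build "data_name-researcher_name-timestamp".
# Assumption (not stated in the scheme): the first two identifiers must not
# contain "-", since the dash is the separator.
archive_name() {
    data_name=$1
    researcher=$2
    timestamp=$3
    # All three identifiers are required.
    for part in "$data_name" "$researcher" "$timestamp"; do
        if [ -z "$part" ]; then
            echo "archive_name: all three identifiers are required" >&2
            return 1
        fi
    done
    # Reject a dash inside data_name or researcher_name.
    case "$data_name$researcher" in
        *-*)
            echo "archive_name: data_name and researcher_name must not contain '-'" >&2
            return 1
            ;;
    esac
    printf '%s-%s-%s\n' "$data_name" "$researcher" "$timestamp"
}
```

With this helper, the rename in the use case could be written as mv important_data.tar "$(archive_name sphere jjan 2009).tar", which gives sphere-jjan-2009.tar.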

Data size limit

The cost of having data archived is not negligible, and naturally there are limits on how much we can archive. Currently it costs approximately 2000 SEK/TB/year to keep data in the archive system. A reasonable limit is 1 TB/person/year, with a yearly review of which data in the archive system is obsolete and can be permanently deleted.
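
To get a feeling for what a given amount of data costs, the arithmetic can be sketched as follows. The function archive_cost_sek is hypothetical, and the rate of 2000 SEK/TB/year is the approximate figure quoted above, so the result is an estimate only.

```shell
# Hypothetical helper: estimated cost in SEK of keeping a given number of
# terabytes in the archive for a given number of years (default: 1 year).
# Assumption: a flat rate of 2000 SEK per TB per year, as quoted above.
archive_cost_sek() {
    # awk is used because plain shell arithmetic cannot handle fractions.
    awk -v tb="$1" -v years="${2:-1}" \
        'BEGIN { printf "%.0f\n", tb * years * 2000 }'
}
```

For example, archive_cost_sek 1 5 prints 10000: keeping 1 TB in the archive for five years costs roughly 10000 SEK.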
