It is often useful to combine a large number of small files into a small number of large files, especially when saving multiple directories to archival storage.
Depending on the types of files, compression techniques may also be applied in order to reduce file size and data transmission time.
The Unix tar command, combined with either gzip or bzip2 compression, is a popular choice, but it may not scale well to large numbers of files.
How to use ptar
The Blue Waters ptar module offers an alternative: it uses either the Parallel Implementation of GZip (pigz) or Parallel Bzip2 (pbzip2), both of which use multiple threads to compress data in parallel.
To use this functionality on Blue Waters, simply load the ptar module:
module load ptar
Subsequent invocations of the tar command with compression enabled (i.e., using the -z or -j flag) will use a multi-threaded version of the compression library, thereby achieving significant speedup.
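As a minimal sketch of this workflow, the commands below load the module when it is available and then create and list a compressed archive. The directory and archive names are placeholders, and the guard around `module` is an addition so that the snippet also runs on systems without the ptar module, where the stock tar and gzip are used instead.

```shell
# Load the ptar module if the `module` command exists (as on Blue Waters).
# Elsewhere this step is skipped and the standard tar/gzip are used.
if command -v module >/dev/null 2>&1; then
    module load ptar || true
fi

# Placeholder input: a scratch directory holding a few small files.
workdir=$(mktemp -d)
mkdir "$workdir/mydata"
for i in 1 2 3; do
    echo "sample $i" > "$workdir/mydata/file$i.txt"
done

# -c create, -z gzip-format compression, -f archive file name.
# -C changes into $workdir first so the archive stores relative paths.
tar -czf "$workdir/mydata.tar.gz" -C "$workdir" mydata

# List the archive to confirm its contents.
tar -tzf "$workdir/mydata.tar.gz"

# For bzip2-format compression, swap -z for -j, for example:
#   tar -cjf "$workdir/mydata.tar.bz2" -C "$workdir" mydata
```

The tar command line is the same with or without the module; according to the description above, only the compression backend changes.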
As an example, various strategies were used to create an archive file of a directory containing 128,641 files totaling 15 GB.
Using pigz compression (i.e., -z) was the fastest and produced a 6.1 GB archive file.
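A measurement of this kind could be taken roughly as follows. This is only a sketch: it uses a small scratch directory as a stand-in for the real data set, all paths and variable names are placeholders, and the time it prints will not match the figures quoted here. Interactively, prefixing the tar command with `time` is the more common way to get real, user, and sys times.

```shell
# Stand-in data set: a scratch directory with 20 small files.
scratch=$(mktemp -d)
mkdir "$scratch/dataset"
i=1
while [ "$i" -le 20 ]; do
    printf 'record %s\n' "$i" > "$scratch/dataset/part_$i.dat"
    i=$((i + 1))
done

# Record wall-clock seconds before and after creating the archive.
# With the ptar module loaded, -z should be handled by pigz.
start=$(date +%s)
tar -czf "$scratch/dataset.tar.gz" -C "$scratch" dataset
end=$(date +%s)
elapsed=$((end - start))
echo "real time: ${elapsed} s"

# Report the size of the resulting archive.
du -h "$scratch/dataset.tar.gz"
```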
Using pbzip2 compression (i.e., -j) produced a smaller (5.4 GB) archive file, but was much slower.
With the standard, non-threaded gzip compression, the same task required 35 minutes of real time, whereas pigz required only 1.5 minutes.
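To put the quoted figures in perspective, the short calculation below derives the pigz speedup over gzip and the compression ratio of each archive. It uses only the numbers given above, assumes the sizes are in gigabytes, and the variable names are illustrative.

```shell
# Figures quoted above: sizes in GB (assumed), wall-clock times in minutes.
orig_size=15
pigz_size=6.1
pbzip2_size=5.4
gzip_minutes=35
pigz_minutes=1.5

# awk handles the floating-point division; LC_ALL=C keeps "." as the decimal point.
speedup=$(LC_ALL=C awk -v a="$gzip_minutes" -v b="$pigz_minutes" 'BEGIN { printf "%.1f", a / b }')
pigz_ratio=$(LC_ALL=C awk -v a="$orig_size" -v b="$pigz_size" 'BEGIN { printf "%.2f", a / b }')
pbzip2_ratio=$(LC_ALL=C awk -v a="$orig_size" -v b="$pbzip2_size" 'BEGIN { printf "%.2f", a / b }')

echo "pigz speedup over gzip: ${speedup}x"
echo "pigz compression ratio: ${pigz_ratio}:1"
echo "pbzip2 compression ratio: ${pbzip2_ratio}:1"
```

On these numbers pigz is roughly 23 times faster than gzip, while pbzip2 yields a somewhat better compression ratio (about 2.78:1 versus 2.46:1).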
Additional Information / References
For more information about the Blue Waters ptar module, please send email to "firstname.lastname@example.org".