Multi-core File Compression

Published on Tuesday, 30 August 2011 09:10
Written by Douglas Eadline

From the getting your cores off the sofa department

Every now and then there are really nifty multi-core applications that help with some of the more mundane Linux HPC chores. The -j option for make is one such example. I recently stumbled upon two other applications that take advantage of multi-core for file compression.

The two packages are parallel gzip and parallel bzip2. As you might surmise, each application uses multiple cores to speed up the compression of large files. I checked to see if these were in the Scientific Linux Yum repository, but they did not show up when I tried to install them. Thus, I decided to build them myself.

Downloading and building was simple (make sure you have the bzip2-devel and zlib-devel libraries installed). I also created spec files for them and produced my own RPMs to be used in my collection of cluster packages.

Of course, building packages is only half the fun. I decided to test the packages on my Limulus personal cluster. The head node is currently a quad-core Intel Q6600 running Scientific Linux 6.0. To create a large file, I used tar to archive /usr/share (assuming it holds a mix of file types):


$ tar cvf usr.share.tar /usr/share

The resulting file was 1.1GB, which was big enough for my simple tests. I then ran both the serial and the parallel versions on the file, both compressing and decompressing. Note that decompression is mostly a serial task and thus does not benefit much from multiple cores. To be a bit more formal, I wrote a simple script (see below) that runs all the tests and checks that the file survived intact. I then placed the results in the following table.
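The integrity check in that script boils down to comparing md5 sums before and after a compress/decompress round trip. Here is a minimal sketch of the idea, using plain gzip so it runs anywhere; pigz or pbzip2 drop in the same way:

```shell
#!/bin/sh
# Sketch of the round-trip integrity check; gzip stands in for
# pigz/pbzip2, which are invoked identically.
set -e
tmp=$(mktemp)
head -c 1048576 /dev/urandom > "$tmp"   # 1 MB of throwaway test data
before=$(md5sum "$tmp" | cut -d' ' -f1)
gzip "$tmp"                             # writes $tmp.gz, removes $tmp
gunzip "$tmp.gz"                        # restores $tmp
after=$(md5sum "$tmp" | cut -d' ' -f1)
[ "$before" = "$after" ] && echo "file survived"
rm -f "$tmp"
```

If the two md5 sums differ, the `[ ... ]` test fails and, with `set -e`, the script aborts before declaring success.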

Version            Time (seconds)   Speed-up   Compressed Size   % of Original
Sequential gzip    97               1          461MB             42
Parallel pigz      27               3.6        460MB             42
Sequential bzip2   233              1          413MB             38
Parallel pbzip2    69               3.4        414MB             38

Table One: Results for various parallel compression packages for a 1.1GB file
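The speed-up column is simply the sequential time divided by the parallel time, which a quick awk one-liner confirms from the numbers above:

```shell
# Speed-up = sequential time / parallel time (times from Table One)
awk 'BEGIN { printf "gzip/pigz:    %.1fx\n", 97/27;
             printf "bzip2/pbzip2: %.1fx\n", 233/69 }'
```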

The results were impressive. Basically, I saw a ~3.5 times speed-up for either method. Sequential bzip2 compression takes much longer than sequential gzip, but its compression can be much better. In this case the extra compression was not all that great, but in cases where the file is more "compressible," bzip2 will create much smaller files, and pbzip2 can make bzip2's compression times more tolerable. Overall, a nice pay-off for less than an hour of compiling and testing. Now I'll get back to my regular work, which quite interestingly involves creating some disk images with dd. All I need now is a fast compression tool ...
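Since pigz reads standard input, it drops straight into a dd pipeline. A hedged sketch of the idea follows; the device path and thread count here are illustrative placeholders, not from my setup (pigz's -p flag sets the number of compression threads):

```shell
# Stream a disk image through pigz with 4 compression threads.
# /dev/sdX is a placeholder device name.
dd if=/dev/sdX bs=4M | pigz -p 4 > disk.img.gz

# Restore it the same way (decompression is largely serial):
unpigz -c disk.img.gz | dd of=/dev/sdX bs=4M
```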

Simple Test Script

  #!/bin/bash

  #Create file
  tar cvf usr.share.tar /usr/share/  

  #Initial File size
  echo "Initial size and md5"
  md5sum usr.share.tar
  ls -lh usr.share.tar

  #gzip
  echo "Sequential gzip"
  time gzip usr.share.tar
  ls -lh usr.share.tar.gz
  gunzip usr.share.tar.gz

  #pigz
  echo "Parallel pigz"
  time pigz usr.share.tar
  ls -lh usr.share.tar.gz
  unpigz usr.share.tar.gz

  #bzip2
  echo "Sequential bzip2"
  time bzip2 usr.share.tar
  ls -lh usr.share.tar.bz2
  bunzip2 usr.share.tar.bz2

  #pbzip2
  echo "Parallel pbzip2"
  time pbzip2 usr.share.tar
  ls -lh usr.share.tar.bz2
  bunzip2 usr.share.tar.bz2

  # Final File size
  echo "Final size and md5"
  md5sum usr.share.tar
  ls -lh usr.share.tar