Benchmark: Reading uncompressed and compressed files from disc

In this post I’ll compare the running time of reading uncompressed and compressed files from disc.

I’ll run a test using two files, data.txt (858M) and data.txt.gz (83M), that have the same content.

About cat and zcat

The well-known cat command prints the contents of a file. The lesser-known zcat prints the contents of a gzip’ed file.

Storing log files in gzip’ed format not only takes up less space, it also makes reading them from disc a lot faster, even taking into account the extra overhead of on-the-fly decompression.

This is because I/O is slow and unzipping is fast. You want to transfer as few blocks from disc as possible, and pay with a little CPU.

To read data.txt using cat:

cat < data.txt
# or
cat data.txt

To read data.txt.gz using zcat:

zcat < data.txt.gz
# or
zcat data.txt.gz # Doesn't work on my Mac (10.6.8), zcat 1.3.12
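
If you want to do the same read from a script rather than the shell, here is a minimal sketch using Python's standard gzip module. The helper name read_lines and the rule that a ".gz" suffix means "compressed" are my own assumptions for illustration; they are not part of the benchmark.

```python
import gzip


def read_lines(path):
    """Yield the text lines of a plain or gzip-compressed file."""
    # Assumption: a ".gz" suffix means the file is gzip-compressed.
    opener = gzip.open if path.endswith(".gz") else open
    # "rt" opens in text mode; gzip.open decompresses on the fly while reading.
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")
```

Calling read_lines("data.txt") and read_lines("data.txt.gz") should yield the same lines, since both files hold the same content.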

Benchmark: Reading compressed and uncompressed files from disc

To compare the running time of reading uncompressed and compressed files, I’ll do the following:

  1. Generate two fake log files with identical content: data.txt (858M) and data.txt.gz (83M). The latter is the gzip-compressed version of the former
  2. Simulate a cold disc cache by flushing it with the purge command (only works on Mac OS X)
  3. Read the files using cat and zcat respectively. To simulate some processing, I count the lines using wc -l
  4. Perform the same test again, this time on a warm cache
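
The read-and-count step can also be sketched in Python, which gives sub-second timings. This is only an illustration under assumptions: time_read is a hypothetical helper, it decompresses with Python's gzip module instead of zcat (so absolute numbers will differ from the shell pipeline), and it does not flush the disc cache itself.

```python
import gzip
import time


def time_read(path, chunk_size=1 << 20):
    """Read a file to the end and return (newline_count, elapsed_seconds)."""
    # Assumption: a ".gz" suffix means the file is gzip-compressed.
    opener = gzip.open if path.endswith(".gz") else open
    start = time.perf_counter()
    newlines = 0
    with opener(path, "rb") as f:
        # Read in 1 MiB chunks and count newline bytes, like `wc -l` does.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            newlines += chunk.count(b"\n")
    return newlines, time.perf_counter() - start
```

For a cold-cache measurement you would still run purge before each call; calling time_read a second time on the same file then approximates the warm-cache case.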

Here are the results of the benchmark (also see run_benchmark.sh below):

Creating data.txt
Creating data.txt.gz
File sizes:
-rw-r--r--  1 kostas  staff   858M Feb 28 16:58 data.txt
-rw-r--r--  1 kostas  staff    83M Feb 28 16:59 data.txt.gz
Flushing disc cache to approximate cold disc buffer for benchmark
Uses the purge command, which only works on Mac OS X.
 
Running benchmark on cold disc cache
 
Running benchmark 1: Reading uncompressed data
cat < data.txt | wc -l
 10000000
Time:	23 s
 
Running benchmark 2: Reading compressed data
zcat < data.txt.gz | wc -l
 10000000
Time:	6 s
 
Rerunning benchmark on warm disc cache
 
Running benchmark 1: Reading uncompressed data
cat < data.txt | wc -l
 10000000
Time:	1 s
 
Running benchmark 2: Reading compressed data
zcat < data.txt.gz | wc -l
 10000000
Time:	5 s
 
Benchmarks done. Cleaning up.

Conclusion

These conclusions assume data that compresses to about 10% of its original size, which is typical of log files.

  1. For files on disc: Reading compressed files is ~4 times faster
  2. For files in disc cache: Reading uncompressed files is ~5 times faster
  3. For very large files that don’t fit into the disc cache: Reading compressed files should be faster, since every read has to go to disc
  4. For smaller files that are read often: Reading uncompressed files will likely be faster, since they stay in the disc cache
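
The ratios above follow directly from the benchmark output. Here is the arithmetic as a short sketch; the sizes and timings are copied from the results, the variable names are mine, and because the script measures whole seconds the figures are rough.

```python
# Sizes (MB) and timings (s) copied from the benchmark output.
SIZE_MB = {"data.txt": 858, "data.txt.gz": 83}
COLD_S = {"data.txt": 23, "data.txt.gz": 6}
WARM_S = {"data.txt": 1, "data.txt.gz": 5}

# How much smaller the compressed file is.
compression_ratio = SIZE_MB["data.txt.gz"] / SIZE_MB["data.txt"]

# Cold cache: compressed read vs. uncompressed read.
cold_speedup = COLD_S["data.txt"] / COLD_S["data.txt.gz"]

# Warm cache: uncompressed read vs. compressed read.
warm_speedup = WARM_S["data.txt.gz"] / WARM_S["data.txt"]

# Rate at which the cold disc delivered the uncompressed file.
cold_disc_mb_per_s = SIZE_MB["data.txt"] / COLD_S["data.txt"]

# Rate at which zcat produced uncompressed output from the warm cache.
warm_zcat_mb_per_s = SIZE_MB["data.txt"] / WARM_S["data.txt.gz"]

print(f"compressed size:       {compression_ratio:.1%} of original")  # 9.7%
print(f"cold cache, zcat wins: {cold_speedup:.1f}x")                  # 3.8x
print(f"warm cache, cat wins:  {warm_speedup:.1f}x")                  # 5.0x
print(f"cold disc read rate:   {cold_disc_mb_per_s:.1f} MB/s")        # 37.3 MB/s
print(f"warm zcat output rate: {warm_zcat_mb_per_s:.1f} MB/s")        # 171.6 MB/s
```

The last two lines show why compression wins on a cold cache in this test: zcat produced uncompressed data several times faster than the disc could deliver it.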

Code listings

create_data.py:

#!/usr/bin/env python3
"""Write a fake log file to stdout: lines of random words from a tiny vocabulary."""

import hashlib
import random

NUM_WORDS = 10
LETTERS_PER_WORD = 8
WORDS_PER_LINE = 10
NUM_LINES = 10000000  # ten million


def main():
    random.seed(42)  # fixed seed, so every run produces the same file

    # Build the vocabulary: the first 8 hex digits of md5("0"), md5("1"), ...
    words = []
    for i in range(NUM_WORDS):
        md5 = hashlib.md5()
        md5.update(str(i).encode("ascii"))  # hashlib needs bytes in Python 3
        words.append(md5.hexdigest()[:LETTERS_PER_WORD])

    # Each line is WORDS_PER_LINE words drawn uniformly at random.
    r = random.random
    for _ in range(NUM_LINES):
        line = [words[int(r() * NUM_WORDS)] for _ in range(WORDS_PER_LINE)]
        print(" ".join(line))


if __name__ == "__main__":
    main()
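
To sanity-check how well this kind of data compresses without generating the full file, here is a small sketch that builds a sample in memory and compresses it with zlib (the same DEFLATE algorithm gzip uses, minus the gzip header). The function name sample_compression_ratio, the sample size, and the use of random.Random.choice are my assumptions; the sample is not byte-identical to what create_data.py writes.

```python
import hashlib
import random
import zlib


def sample_compression_ratio(num_lines=10000, seed=42):
    """Return compressed size / raw size for a small sample of the fake log data."""
    rng = random.Random(seed)
    # Same vocabulary as create_data.py: 10 words of 8 hex digits each.
    words = [hashlib.md5(str(i).encode("ascii")).hexdigest()[:8] for i in range(10)]
    lines = (" ".join(rng.choice(words) for _ in range(10)) for _ in range(num_lines))
    raw = ("\n".join(lines) + "\n").encode("ascii")
    # Level 6 is the default compression level of both zlib and gzip.
    return len(zlib.compress(raw, 6)) / len(raw)


if __name__ == "__main__":
    print(f"sample compresses to {sample_compression_ratio():.1%} of its size")
```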

run_benchmark.sh:

#!/bin/sh
 
# CREATING DATA
echo "Creating data.txt"
./create_data.py > data.txt
echo "Creating data.txt.gz"
gzip < data.txt > data.txt.gz
 
echo "File sizes:"
ls -lh data.txt*
 
# PURGE DISC CACHE
echo ""
echo "Flushing disc cache to approximate cold disc buffer for benchmark"
echo "Uses the purge command, which only works on Mac OS X."
purge # this is a blocking operation. Mac OS X only
 
# COLD START
echo ""
echo "Running benchmark on cold disc cache"
 
# RUNNING BENCHMARK 1
echo ""
echo "Running benchmark 1: Reading uncompressed data"
 
T0=`date +%s`
echo "cat < data.txt | wc -l"
cat < data.txt | wc -l
T1=`date +%s`
printf "Time:\t%s s\n" "$((T1-T0))"
 
# RUNNING BENCHMARK 2
echo ""
echo "Running benchmark 2: Reading compressed data"
 
T0=`date +%s`
echo "zcat < data.txt.gz | wc -l"
zcat < data.txt.gz | wc -l
T1=`date +%s`
printf "Time:\t%s s\n" "$((T1-T0))"
 
# WARM START
 
echo ""
echo "Rerunning benchmark on warm disc cache"
 
# RUNNING BENCHMARK 1
echo ""
echo "Running benchmark 1: Reading uncompressed data"
 
T0=`date +%s`
echo "cat < data.txt | wc -l"
cat < data.txt | wc -l
T1=`date +%s`
printf "Time:\t%s s\n" "$((T1-T0))"
 
# RUNNING BENCHMARK 2
echo ""
echo "Running benchmark 2: Reading compressed data"
 
T0=`date +%s`
echo "zcat < data.txt.gz | wc -l"
zcat < data.txt.gz | wc -l
T1=`date +%s`
printf "Time:\t%s s\n" "$((T1-T0))"
 
# CLEANING UP
echo ""
echo "Benchmarks done. Cleaning up."
 
rm data.txt data.txt.gz
