In this post I’ll compare the running time of reading uncompressed and compressed files from disc.
I’ll run a test using two files, data.txt (858M) and data.txt.gz (83M), that have the same content.
About cat and zcat
The well-known command cat prints the contents of a file. The lesser-known zcat prints the contents of a gzip'ed file.
Storing log files in gzip'ed format not only takes up less space; it also makes reading them from disc a lot faster, even taking into account the extra overhead of on-the-fly decompression.
This is because I/O is slow and unzipping is fast. You want to transfer as few blocks from disc as possible, and pay with a little CPU.
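To put rough numbers on that trade-off, here is a back-of-envelope sketch. The ~40 MB/s sequential read rate is my assumption for a spinning disc of that era, not a measured value; the file sizes are the ones from the test.

```python
# Back-of-envelope: raw transfer time for each file at an assumed disc speed.
DISC_MB_PER_S = 40.0   # assumed sequential read throughput (not measured)

uncompressed_mb = 858  # size of data.txt
compressed_mb = 83     # size of data.txt.gz

t_uncompressed = uncompressed_mb / DISC_MB_PER_S
t_compressed = compressed_mb / DISC_MB_PER_S

print("uncompressed transfer: ~%d s" % t_uncompressed)  # ~21 s
print("compressed transfer:   ~%d s" % t_compressed)    # ~2 s
```

Transferring a tenth of the blocks saves on the order of twenty seconds, which leaves plenty of budget for decompression.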
To read data.txt using cat:
cat < data.txt
# or
cat data.txt
To read data.txt.gz using zcat:
zcat < data.txt.gz
# or
zcat data.txt.gz # Doesn't work on my Mac (10.6.8), zcat 1.3.12
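Where zcat rejects the filename form, gzcat (or the equivalent gzip -dc, which gzcat aliases) reads gzip'ed files the same way; a quick sketch using a throwaway file:

```shell
# Make a small gzip'ed file, then read it back without relying on zcat.
printf 'first line\nsecond line\n' | gzip > demo.txt.gz

# gzip -dc decompresses to stdout, like zcat, but accepts .gz filenames.
gzip -dc demo.txt.gz

rm demo.txt.gz
```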
Benchmark: Reading compressed and uncompressed files from disc
To compare the running time of reading uncompressed and compressed files, I'll do the following:
- Generate two fake log files with identical content: data.txt (858M) and data.txt.gz (83M). Obviously, the latter file is compressed
- Simulate a cold disc cache by running the purge command to flush the disc cache (only works on Mac OS X)
- Read the files using cat and zcat respectively. To simulate some processing, I count the lines using wc -l
- Perform the same test on the warm cache
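Since purge only exists on Mac OS X, here is a roughly equivalent cache flush for Linux (requires root; this is a reference sketch, not part of the script below):

```shell
# Linux equivalent of purge: flush dirty pages, then drop the page cache.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
```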
Here are the results of the benchmark (also see run_benchmark.sh below):
Creating data.txt
Creating data.txt.gz
File sizes:
-rw-r--r-- 1 kostas staff 858M Feb 28 16:58 data.txt
-rw-r--r-- 1 kostas staff 83M Feb 28 16:59 data.txt.gz
Flushing disc cache to approximate cold disc buffer for benchmark
Uses the purge command, which only works on Mac OS X.
Running benchmark on cold disc cache
Running benchmark 1: Reading uncompressed data
cat < data.txt | wc -l
10000000
Time: 23 s
Running benchmark 2: Reading compressed data
zcat < data.txt.gz | wc -l
10000000
Time: 6 s
Rerunning benchmark on warm disc cache
Running benchmark 1: Reading uncompressed data
cat < data.txt | wc -l
10000000
Time: 1 s
Running benchmark 2: Reading compressed data
zcat < data.txt.gz | wc -l
10000000
Time: 5 s
Benchmarks done. Cleaning up.
Conclusion
Assuming data that compresses to about 10% of its original size, which is typical of log files:
- For files on disc: reading compressed files is ~4 times faster
- For files in the disc cache: reading uncompressed files is ~5 times faster
- For very large files that don't fit into the disc cache: reading compressed files will always be faster
- For smaller files that are read often: reading uncompressed files will likely be faster
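The ratios above follow directly from the measured times:

```python
# Speedup ratios from the benchmark timings reported above.
cold_cat, cold_zcat = 23.0, 6.0  # seconds, cold disc cache
warm_cat, warm_zcat = 1.0, 5.0   # seconds, warm disc cache

print("cold cache: zcat is %.1fx faster" % (cold_cat / cold_zcat))  # 3.8x
print("warm cache: cat is %.1fx faster" % (warm_zcat / warm_cat))   # 5.0x
```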
Code listings
create_data.py:
#!/usr/bin/python
import sys
import random
import hashlib

def main(argv):
    NUM_WORDS = 10
    LETTERS_PER_WORD = 8
    WORDS_PER_LINE = 10
    NUM_LINES = 10000000  # ten million
    random.seed(42)
    # Create the word array: NUM_WORDS short pseudo-words derived from md5.
    words = []
    for i in range(NUM_WORDS):
        md5 = hashlib.md5()
        md5.update("%s" % i)
        word = md5.hexdigest()[:LETTERS_PER_WORD]
        words.append(word)
    # Emit NUM_LINES lines of WORDS_PER_LINE randomly chosen words each.
    for i in range(NUM_LINES):
        r = random.random
        line = []
        for j in range(WORDS_PER_LINE):
            line.append(words[int(r() * NUM_WORDS % NUM_WORDS)])
        print " ".join(line)

if __name__ == "__main__":
    main(sys.argv)
run_benchmark.sh:
#!/bin/sh
# CREATING DATA
echo "Creating data.txt"
./create_data.py > data.txt
echo "Creating data.txt.gz"
gzip < data.txt > data.txt.gz
echo "File sizes:"
ls -lh data.txt*
# PURGE DISC CACHE
echo ""
echo "Flushing disc cache to approximate cold disc buffer for benchmark"
echo "Uses the purge command, which only works on Mac OS X."
purge # this is a blocking operation. Mac OS X only
# COLD START
echo ""
echo "Running benchmark on cold disc cache"
# RUNNING BENCHMARK 1
echo ""
echo "Running benchmark 1: Reading uncompressed data"
T0=`date +%s`
echo "cat < data.txt | wc -l"
cat < data.txt | wc -l
T1=`date +%s`
printf "Time:\t%s s\n" "$((T1-T0))"
# RUNNING BENCHMARK 2
echo ""
echo "Running benchmark 2: Reading compressed data"
T0=`date +%s`
echo "zcat < data.txt.gz | wc -l"
zcat < data.txt.gz | wc -l
T1=`date +%s`
printf "Time:\t%s s\n" "$((T1-T0))"
# WARM START
echo ""
echo "Rerunning benchmark on warm disc cache"
# RUNNING BENCHMARK 1
echo ""
echo "Running benchmark 1: Reading uncompressed data"
T0=`date +%s`
echo "cat < data.txt | wc -l"
cat < data.txt | wc -l
T1=`date +%s`
printf "Time:\t%s s\n" "$((T1-T0))"
# RUNNING BENCHMARK 2
echo ""
echo "Running benchmark 2: Reading compressed data"
T0=`date +%s`
echo "zcat < data.txt.gz | wc -l"
zcat < data.txt.gz | wc -l
T1=`date +%s`
printf "Time:\t%s s\n" "$((T1-T0))"
# CLEANING UP
echo ""
echo "Benchmarks done. Cleaning up."
rm data.txt data.txt.gz