In this post I'll compare the running time of reading uncompressed and compressed files from disc.
I'll run a test using two files, data.txt (858M) and data.txt.gz (83M), that have the same content.
About cat and zcat
The well-known command cat prints the contents of a file. The lesser-known zcat prints the contents of a gzip'ed file.
Storing log files in gzip'ed format not only takes up less space; it also makes reading them from disc a lot faster, even taking into account the extra overhead of on-the-fly decompression.
This is because I/O is slow and unzipping is fast. You want to transfer as few blocks from disc as possible, and pay with a little CPU.
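To put rough numbers on that trade-off, here is a back-of-envelope estimate. The ~100 MB/s sequential read speed is an assumed figure, not something measured in this post; the file sizes are the ones used here:

```python
# Back-of-envelope I/O estimate. DISC_MB_PER_S is an assumption, not a
# measurement; the file sizes are the ones used in this post.
DISC_MB_PER_S = 100.0    # assumed sequential read speed of a spinning disc
UNCOMPRESSED_MB = 858.0  # data.txt
COMPRESSED_MB = 83.0     # data.txt.gz

t_uncompressed = UNCOMPRESSED_MB / DISC_MB_PER_S  # ~8.6 s just moving blocks
t_compressed = COMPRESSED_MB / DISC_MB_PER_S      # ~0.8 s just moving blocks

print("uncompressed I/O: %.1f s" % t_uncompressed)
print("compressed I/O:   %.1f s" % t_compressed)
```

As long as decompressing costs less than the roughly 8 seconds of I/O it saves, reading the compressed file wins.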
To read data.txt using cat:
cat < data.txt  # or: cat data.txt
To read data.txt.gz using zcat:
zcat < data.txt.gz  # or: zcat data.txt.gz -- doesn't work on my Mac (10.6.8, zcat 1.3.12)
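If zcat misbehaves (as `zcat data.txt.gz` does on my Mac), the same stream-decompress-and-count can be done with Python's standard gzip module. A minimal sketch; `count_lines_gz` is a hypothetical helper name, not part of the benchmark:

```python
import gzip

def count_lines_gz(path):
    """Count lines in a gzip'ed file, like `zcat path | wc -l`."""
    count = 0
    with gzip.open(path, "rb") as f:
        for _ in f:  # gzip.open streams; the whole file is never held in memory
            count += 1
    return count
```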
Benchmark: Reading compressed and uncompressed files from disc
To compare the running time of reading uncompressed and compressed files, I'll do the following:
- Generate two fake log files with identical content: data.txt (858M) and data.txt.gz (83M). Obviously, the latter file is compressed
- Simulate a cold disc cache by running the purge command to flush the disc cache (only works on Mac OS X)
- Read the files using cat and zcat respectively. To simulate some processing, I count the lines using wc -l
- Perform the same test on the warm cache
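The timing part of those steps can be sketched in a few lines of Python (`time_pipeline` is a hypothetical helper; the actual run_benchmark.sh uses `date +%s` in shell):

```python
import subprocess
import time

def time_pipeline(cmd):
    """Run a shell pipeline such as 'cat data.txt | wc -l' and return elapsed seconds."""
    t0 = time.time()
    subprocess.call(cmd, shell=True)
    return time.time() - t0
```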
Here are the results of the benchmark (also see run_benchmark.sh below):
Creating data.txt
Creating data.txt.gz
File sizes:
-rw-r--r--  1 kostas  staff   858M Feb 28 16:58 data.txt
-rw-r--r--  1 kostas  staff    83M Feb 28 16:59 data.txt.gz

Flushing disc cache to approximate cold disc buffer for benchmark
Uses the purge command, which only works on Mac OS X.

Running benchmark on cold disc cache

Running benchmark 1: Reading uncompressed data
cat < data.txt | wc -l
10000000
Time:   23 s

Running benchmark 2: Reading compressed data
zcat < data.txt.gz | wc -l
10000000
Time:   6 s

Rerunning benchmark on warm disc cache

Running benchmark 1: Reading uncompressed data
cat < data.txt | wc -l
10000000
Time:   1 s

Running benchmark 2: Reading compressed data
zcat < data.txt.gz | wc -l
10000000
Time:   5 s

Benchmarks done. Cleaning up.
Conclusion
Assuming data that compresses to about 10% of its original size, which is typical of log files:
- For files on disc: Reading compressed files is ~4 times faster
- For files in disc cache: Reading uncompressed files is ~5 times faster
- For very large files that don't fit into the disc cache: Reading compressed files will always be faster
- For smaller files that are read often: Reading uncompressed files will likely be faster.
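The speedup factors above come straight from the measured times (23 s vs 6 s on a cold cache, 1 s vs 5 s on a warm one):

```python
# Measured times from the benchmark output above, in seconds.
cold_cat, cold_zcat = 23.0, 6.0
warm_cat, warm_zcat = 1.0, 5.0

cold_speedup = cold_cat / cold_zcat  # ~3.8x: zcat wins on a cold cache
warm_speedup = warm_zcat / warm_cat  # 5x: cat wins on a warm cache

print("cold: zcat is %.1fx faster" % cold_speedup)
print("warm: cat is %.1fx faster" % warm_speedup)
```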
Code listings
create_data.py:
#!/usr/bin/python
import sys
import random
import hashlib

def main(argv):
    NUM_WORDS = 10
    LETTERS_PER_WORD = 8
    WORDS_PER_LINE = 10
    NUM_LINES = 10000000  # ten million

    random.seed(42)

    # create word array
    words = []
    for i in range(NUM_WORDS):
        md5 = hashlib.md5()
        md5.update("%s" % i)
        word = md5.hexdigest()[:LETTERS_PER_WORD]
        words.append(word)

    # write NUM_LINES lines of WORDS_PER_LINE randomly chosen words
    for i in range(NUM_LINES):
        line = []
        for j in range(WORDS_PER_LINE):
            line.append(words[int(random.random() * NUM_WORDS)])
        print " ".join(line)
    return None

if __name__ == "__main__":
    main(sys.argv)
run_benchmark.sh:
#!/bin/sh

# CREATING DATA
echo "Creating data.txt"
./create_data.py > data.txt
echo "Creating data.txt.gz"
gzip < data.txt > data.txt.gz
echo "File sizes:"
ls -lh data.txt*

# PURGE DISC CACHE
echo ""
echo "Flushing disc cache to approximate cold disc buffer for benchmark"
echo "Uses the purge command, which only works on Mac OS X."
purge  # this is a blocking operation. Mac OS X only

# COLD START
echo ""
echo "Running benchmark on cold disc cache"

# RUNNING BENCHMARK 1
echo ""
echo "Running benchmark 1: Reading uncompressed data"
T0=`date +%s`
echo "cat < data.txt | wc -l"
cat < data.txt | wc -l
T1=`date +%s`
echo "Time:\t$((T1-T0)) s"

# RUNNING BENCHMARK 2
echo ""
echo "Running benchmark 2: Reading compressed data"
T0=`date +%s`
echo "zcat < data.txt.gz | wc -l"
zcat < data.txt.gz | wc -l
T1=`date +%s`
echo "Time:\t$((T1-T0)) s"

# WARM START
echo ""
echo "Rerunning benchmark on warm disc cache"

# RUNNING BENCHMARK 1
echo ""
echo "Running benchmark 1: Reading uncompressed data"
T0=`date +%s`
echo "cat < data.txt | wc -l"
cat < data.txt | wc -l
T1=`date +%s`
echo "Time:\t$((T1-T0)) s"

# RUNNING BENCHMARK 2
echo ""
echo "Running benchmark 2: Reading compressed data"
T0=`date +%s`
echo "zcat < data.txt.gz | wc -l"
zcat < data.txt.gz | wc -l
T1=`date +%s`
echo "Time:\t$((T1-T0)) s"

# CLEANING UP
echo ""
echo "Benchmarks done. Cleaning up."
rm data.txt data.txt.gz