Fast compression algorithms like Snappy, QuickLZ and LZ4 are designed for a general stream of bytes, and typically don't treat byte-sequences representing numbers in any special way. Geospatial data is special in the sense that it often contains a large amount of numbers, like floats, representing coordinates.
Luckily for me, I happened to share an office today with the inventor of QuickLZ, which made it a bit easier to test the idea. Lasse suggested that I try to replace floats with a '$' plus 8 random characters, which is the result referred to as "lz4-geo" in the benchmark section. The idea was to try to gauge how well the approach would work.
It's a good idea?
This makes me wonder whether there would be value in an extension of an existing algorithm, such that string floats are stored as float-floats or as integers (by multiplying by a constant and truncating). That is, store "1234567.1234567" as a float/int instead of a bunch of bytes.
It's a bad idea?
Although existing algorithms don't recognize number literals, and even though geospatial data might not contain repeated coordinates very often, these algorithms (Snappy etc) might still compress them well, given that substrings inside number literals, say sequences like "23" or "0.1", are much more frequent than the entire literal may be.
Trying it out: Benchmark
OK, first things first. How well do the existing algorithms actually compress geographical data like GeoJSON? To get an idea of this, I ran a simple benchmark written in Python (only compared compression rate).
I tested with some openstreetmap data for Denmark. One file contains a GeoJSON FeatureCollection (Denmark highway dataset). The other file contains 1000 WKT points (Denmark POI dataset).
Here is the result:
BENCHMARK: example1.json algorithm ratio --- lz4: 3.798459 zlib(1) 5.112881 zlib(9) 6.230600 lz4-geo: 3.517534 BENCHMARK: example2.wkt algorithm ratio --- lz4: 1.606806 zlib(1) 2.398542 zlib(9) 2.499737 lz4-geo: 2.729407
Benchmark code (Python):
import lz4 import zlib import re import string import random floatregex = r'[0-9]+\.[0-9]+' def randomchars(match): result = ['$'] result.extend([random.choice(string.letters[0:8]) for i in range(8)]) return "".join(result) def benchmark(filename): print "" print "BENCHMARK: %s" % filename data = open(filename,'r').read() orig = len(data) #print "length uncompressed: %d" % orig print "algorithm ratio" print "---" print "lz4: %f" % (orig / float(len(lz4.dumps(data)))) print "zlib(1) %f" % (orig / float(len(zlib.compress(data,1)))) print "zlib(9) %f" % (orig / float(len(zlib.compress(data,9)))) lasse_idea = re.sub(floatregex,randomchars,data) print "lz4-geo: %f" % (orig / float(len(lz4.dumps(lasse_idea)))) benchmark('example1.json') benchmark('example2.wkt') print "" print "benchmark done"
GeoJSON FeatureCollection benchmark
For GeoJSON FeatureCollections the algorithms do fairly well, with lz4 having a compression ratio of 3.5. My assumption that coordinates are random does not hold. There are many long runs of zeros and nines in the data, which compress nicely when viewed as text. The string-replace approach actually did worse.
For a dataset that contains mostly floating points, the algorithms do fairly bad.