Is there a need for a fast compression algorithm for geospatial data?

Fast compression algorithms like Snappy, QuickLZ and LZ4 are designed for a general stream of bytes, and typically don’t treat byte-sequences representing numbers in any special way. Geospatial data is special in the sense that it often contains a large amount of numbers, like floats, representing coordinates.

Luckily for me, I happened to share an office today with the inventor of QuickLZ, which made it a bit easier to test the idea. Lasse suggested that I try to replace floats with a ‘$’ plus 8 random characters, which is the result referred to as “lz4-geo” in the benchmark section. The idea was to try to gauge how well the approach would work.

It’s a good idea?

This makes me wonder whether there would be value in an extension of an existing algorithm, such that string floats are stored as float-floats or as integers (by multiplying by a constant and truncating). That is, store “1234567.1234567” as a float/int instead of a bunch of bytes.

It’s a bad idea?

Although existing algorithms don’t recognize number literals, and even though geospatial data might not contain repeated coordinates very often, these algorithms (Snappy etc) might still compress them well, given that substrings inside number literals, say sequences like “23” or “0.1”, are much more frequent than the entire literal may be.

Trying it out: Benchmark

OK, first things first. How well do the existing algorithms actually compress geographical data like GeoJSON? To get an idea of this, I ran a simple benchmark written in Python (only compared compression rate).

I tested with some openstreetmap data for Denmark. One file contains a GeoJSON FeatureCollection (Denmark highway dataset). The other file contains 1000 WKT points (Denmark POI dataset).

Here is the result:

BENCHMARK: example1.json
algorithm	ratio
---
lz4:		3.798459
zlib(1)		5.112881
zlib(9)		6.230600
lz4-geo:	3.517534

BENCHMARK: example2.wkt
algorithm	ratio
---
lz4:		1.606806
zlib(1)		2.398542
zlib(9)		2.499737
lz4-geo:	2.729407

Benchmark code (Python):

import lz4
import zlib
import re
import string
import random

floatregex = r'[0-9]+\.[0-9]+'

def randomchars(match):
	result = ['$']
	result.extend([random.choice(string.letters[0:8]) for i in range(8)])
	return "".join(result)

def benchmark(filename):

	print ""
	print "BENCHMARK: %s" % filename

	data = open(filename,'r').read()
	orig = len(data)

	#print "length uncompressed: %d" % orig

	print "algorithm	ratio"
	print "---"
	print "lz4:		%f" % (orig / float(len(lz4.dumps(data))))
	print "zlib(1)		%f" % (orig / float(len(zlib.compress(data,1))))
	print "zlib(9)		%f" % (orig / float(len(zlib.compress(data,9))))


	lasse_idea =  re.sub(floatregex,randomchars,data)

	print "lz4-geo:	%f" % (orig / float(len(lz4.dumps(lasse_idea))))

benchmark('example1.json')
benchmark('example2.wkt')

print ""
print "benchmark done"

Conclusion

GeoJSON FeatureCollection benchmark

For GeoJSON FeatureCollections the algorithms do fairly well, with lz4 having a compression ratio of 3.5. My assumption that coordinates are random does not hold. There are many long runs of zeros and nines in the data, which compress nicely when viewed as text. The string-replace approach actually did worse.

WKT Points

For a dataset that contains mostly floating points, the algorithms do fairly bad.