Note: It is worth noting that (the way I interpret it) one of the authors of LevelDB (Sanjay) recommends using TokyoCabinet if the workload is more read-oriented. In turn the TokyoCabinet authors recommend using KyotoCabinet :-) (see video with a guy who really likes KyotoCabinet)
LevelDB is a database library (C++, 350 kB) written by people (Sanjay Ghemawat and Jeff Dean of GFS, BigTable, MapReduce, Spanner etc fame) at Google. It is an embedded database, i.e. one you use directly in your programming language and it runs in the same process as your program. The interface is a key-value store.
In this post I want to do three things. 1) Learn about the LevelDB API, 2) try it out in Python, 3) Learn details of how LevelDB is implemented. Actually I want to do 1+2 at the same time. I find that examples are a fast way to learn what something is.
Going off on a tangent: Automake
There are two code-trees that essentially make up LevelDB: LevelDB and Snappy (compression library used by LevelDB). It seems to me that Google are using autoconf/automake, which I’m not so familiar with. So if I’m to build leveldb and snappy myself, I should first get up-to-speed with that way of building (I’m a ./configure && make && make install kind of guy). Someone wrote a tutorial. The tools can be downloaded here:
- autoconf: download + extract + ./configure && make && make install
- automake: download + extract + ./configure && make && make install
- libtool: download + extract + ./configure && make && make install
What LevelDB is, the API
Put, Get, Delete, CreateSnapshot, WriteBatch, and RangeIteration with options for
Before going crazy with hello worlds and such, I wanted to briefly get an idea about LevelDB. There are two sides to that coin. One is understanding the API, and another is understanding the implementation (that comes later).
The best explanation of the LevelDB interface is perhaps to read the C++ examples in the LevelDB documentation. It highlights the best features of LevelDB (apart from Put,Get,Delete):
- Making multiple operations atomic
- Creating consistent snapshots
- Iterating over key ranges (something you can’t do in most key-value stores)
- Implementing alternative sorting orders for keys using custom Comparators
Even though I don’t normally code in C++, I found these examples easy enough to understand.
In the following I’ll recreate the examples in Python, and show you how to quickly install the Python binding (including the leveldb library).
Trying LevelDB in Python
Note: The Python bindings don’t support the full feature set of LevelDB. Sadly it seems that it is not possible to provide a custom Comparator. This means that implementing alternative orderings like e.g. a space-filling curve is not as easy as it could have been with arbitrary ordering of keys.
Note: In the C++ examples, a call to e.g. Get returns a Status: Did the read go well? In C++ the value to read is passed by reference to Get, so that the status can be returned I guess. I’m not sure how access the Status of an operation when using Python. Well, just ignore that…
Note: This is the extremely lazy way of getting leveldb up and running in Python. I’m not convinced this is how you should really do it in the “proper” way. However, it works. This also installs the C++ leveldb library, as it comes bundled with py-leveldb.
pip install leveldb
Making a clean slate:
import leveldb leveldb.DestroyDB('./db') db = leveldb.LevelDB('./db')
db.Put('hello', 'world') # persisted to disc print db.Get('hello')
Atomically moving a value from one key to another:
# read value to move value = db.Get('hello') # create a batch object batch = leveldb.WriteBatch() batch.Put('hello2',value) batch.Delete('hello') db.Write(batch, sync=True) # The extra cost of the synchronous write will be amortized across all of the writes in the batch.
Iterate over a key-range:
# put some values db.Put('aaa','1') db.Put('bbb','2') db.Put('ccc', '3') db.Put('ddd', '4') # iterate over a range of keys for k,v in db.RangeIter('b','d'): print k,v # Prints: # bbb, 2 # ccc, 3
Create a snapshot:
# put a value db.Put('a','xxx') # create a read-only snapshot snapshot = db.CreateSnapshot() # overwrite value in database db.Put('a', 'yyy') # read old value from snapshot print snapshot.Get('a') # prints 'xxx'
Note: I’m not really sure how to persist the snapshop to disc using Python.
Going off on a tangent: LevelDB and spatial data
Keys and spatial data
The most obvious way to use keys in LevelDB to store spatial data, is to store keys corresponding to values on a space-filling curve (e.g. Z-order or Hilbert curve). For the Python binding the trouble is that custom comparators are not possible, so the only ordering is alphabetical. I have to think of how this can be mapped to a space-filling curve (maybe it is totally obvious!).
I find that Z-orders are the easiest to understand and remember, since they correspond to QuadTrees.
Values and spatial data
LevelDB uses Snappy to compress the string keys and value, if it is available. Spatial data contains a lot of coordiantes, i.e. floats, so it is interesting how Snappy handles things like integers and floats.
Reading the Snappy format description it seems to me that both integers and floats are stored just as strings. Compression in Snappy comes from repeated byte-sequences in the stream (run-length coding and back-references). So maybe not so hot for spatial data that usually contains a large number of floats, i.e. coordinates. Maybe I missed something in the Snappy spec.
This is just assuming that spatial data is something like GeoJSON, in which case nothing magical will happen :-) One could of course be a little less lazy and do the compression in the application, using e.g. Protocol Buffers or something even more efficient for spatial data.
Deeper into LevelDB
This is the section that talks about how LevelDB is implemented. This section will gradually get better.
Comparison to bitcask
LevelDB is a persistent ordered map; bitcask is a persistent hash table (no ordered iteration). – Sanjay
Bitcask stores a fixed size record in memory for every key. So for databases with large number of keys, it may use too much memory for some applications. bitcask can guarantee at most one disk seek per lookup I think. leveldb may have to do a small handful of disk seeks. – Sanjay
To clarify, leveldb stores data in a sequence of levels. Each level stores approximately ten times as much data as the level before it. A read needs one disk seek per level. So if 10% of the db fits in memory, leveldb will need to do one seek (for the last level since all of the earlier levels should end up cached in the OS buffer cache). If 1% fits in memory, leveldb will need two seeks.. – Sanjay
Comparison to TokyoCabinet
The short story. LevelDB is optimized for writes, and TokyoCabinet for reads. See the following comments by Sanjay:
TokyoCabinet is something we seriously considered using instead of writing leveldb. TokyoCabinet has great performance usually. I haven’t done a careful head-to-head comparison, but it wouldn’t surprise me if it was somewhat faster than leveldb for many workloads. Plus TokyoCabinet is more mature, has matching server code etc. and may therefore be a better fit for many projects. – Sanjay
However because of a fundamental difference in data structures (TokyoCabinet uses btrees for ordered storage; leveldb uses log structured merge trees), random write performance (which is important for our needs) is significantly better in leveldb. This part we did measure. IIRC, we could fill TokyoCabinet with a million 100-byte writes in less than two seconds if writing sequentially, but the time ballooned to ~2000 seconds if we wrote randomly. The corresponding slowdown for leveldb is from ~1.5 seconds (sequential) to ~2.5 seconds (random). – Sanjay
Sanjay commented on HN about the durability of LevelDB. Sanjay also commented on what exactly it means that writes may be lost (for asynchronous writes), namely that the database is not corrupted following a crash, but merely result in incomplete data files.
Because of the way leveldb data is organized on disk, a single Get() call may involve multiple reads from disk. The optional FilterPolicy mechanism can be used to reduce the number of disk reads substantially. – LevelDB documentation
Bloom filter based filtering relies on keeping some number of bits of data in memory per key (in this case 10 bits per key since that is the argument we passed to NewBloomFilterPolicy). This filter will reduce the number of unnecessary disk reads needed for Get() calls by a factor of approximately a 100. Increasing the bits per key will lead to a larger reduction at the cost of more memory usage.
An LSM-Tree is used to increase write-throughput. This makes LevelDB more suitable for random writes-heavy workloads. In contrast KyotoCabinet implements ordered keys using a B-Tree, which is more suitable for read-heavy workloads.
Memtables and SSTables
LevelDB might be similar to Cassandra and other BigTable inspired databases when it comes to writes. In Cassandra this is how a write is processed:
Cassandra writes are first written to the CommitLog, and then to a per-ColumnFamily structure called a Memtable. When a Memtable is full, it is written to disk as an SSTable. – Cassandra wiki
Cache-oblivious B-trees are LSM-trees with faster searches.