SpaceBase is a spatial datastore that began life as military-grade software, which at least sounds kind of cool. It’s an in-memory database, really, so switch off the cluster and the data is gone. Apparently the same thing was invented in the 90s (unknown to the SpaceBase people?) by some Americans who also had the military as their first customer.
A distribution algorithm is used to map keys to servers in a distributed key-value store. There are several different ones, implemented in different systems, and with different properties. In this blog post I’ll briefly cover the best-known key hashing schemes, before I get to vbuckets.
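To make the idea concrete, here is a minimal Python sketch of two well-known schemes: plain modulo hashing and consistent hashing on a ring with virtual nodes. The names (`stable_hash`, `modulo_server`, `HashRing`, `moved_fraction`) and the server labels are made up for illustration, and this is not how any particular system implements it.

```python
import bisect
import hashlib


def stable_hash(value):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def modulo_server(key, servers):
    """Scheme 1: hash the key and take it modulo the number of servers."""
    return servers[stable_hash(key) % len(servers)]


class HashRing:
    """Scheme 2: consistent hashing, each server owning many points on a ring."""

    def __init__(self, servers, vnodes=100):
        # Every server is placed on the ring `vnodes` times to even out the load.
        self._ring = sorted(
            (stable_hash("%s#%d" % (server, i)), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, key):
        # The owner is the first ring point clockwise from the key's hash,
        # wrapping around to the start of the ring.
        idx = bisect.bisect_right(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


def moved_fraction(before, after, keys):
    """Fraction of keys whose server changes between two mapping functions."""
    moved = sum(1 for key in keys if before(key) != after(key))
    return moved / len(keys)


if __name__ == "__main__":
    keys = ["key-%d" % i for i in range(10000)]
    old, new = ["s1", "s2", "s3", "s4"], ["s1", "s2", "s3", "s4", "s5"]
    print("modulo:", moved_fraction(
        lambda k: modulo_server(k, old), lambda k: modulo_server(k, new), keys))
    ring_old, ring_new = HashRing(old), HashRing(new)
    print("ring:  ", moved_fraction(ring_old.server_for, ring_new.server_for, keys))
```

When a fifth server is added, modulo hashing reassigns most keys, while the ring only hands keys over to the new server. That difference in how much data has to move is the main property these schemes get compared on.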
wget http://fallabs.com/kyotocabinet/pkg/kyotocabinet-1.2.76.tar.gz
tar xzvf kyotocabinet-1.2.76.tar.gz
cd kyotocabinet-1.2.76
./configure && make && make install # takes a couple of minutes
Note: (The way I interpret it) one of the authors of LevelDB (Sanjay) recommends using TokyoCabinet if the workload is more read-oriented. In turn, the TokyoCabinet authors recommend using KyotoCabinet 🙂 (see video with a guy who really likes KyotoCabinet)
LevelDB is a database library (C++, 350 kB) written at Google by Sanjay Ghemawat and Jeff Dean (of GFS, BigTable, MapReduce, Spanner etc. fame). It is an embedded database, i.e. one you use directly from your programming language and that runs in the same process as your program. The interface is a key-value store.
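To illustrate what "embedded" means here, the sketch below uses Python’s standard-library `dbm.dumb` module as a stand-in. This is not LevelDB and not its API; it only shows the same usage pattern of a key-value store opened as a library call inside your own process, with no server to connect to. The function name and the keys are made up for the example.

```python
import dbm.dumb
import os
import tempfile


def demo_embedded_store(directory):
    """Put, delete and get a few keys in an in-process key-value store."""
    path = os.path.join(directory, "example-db")

    # Open (or create) the store. It lives in ordinary files on disk, and all
    # reads and writes happen inside this process.
    with dbm.dumb.open(path, "c") as db:
        db[b"user:1"] = b"alice"
        db[b"user:2"] = b"bob"
        del db[b"user:2"]

    # Reopen read-only to show that the data was persisted.
    with dbm.dumb.open(path, "r") as db:
        return db.get(b"user:1"), db.get(b"user:2")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(demo_embedded_store(tmp))
```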
BigQuery is an API developed by Google for querying big data using an SQL-like language. For beginners I can recommend this video presentation on BigQuery. It also covers the open source version (Apache Drill) of the software (Dremel) underlying BigQuery. There is also a Google I/O 2012 video.
Install the Python CLI (here I’m using pip):
pip install bigquery # this installs a program called 'bq', the Python CLI to BigQuery
I found these links, which have valuable tips for improving nginx performance:
I’ll try to benchmark before and after with ab and post the results here.
First I’ll describe the database that backs the service.
PostGIS database backend
Here is how to make a simple table in PostgreSQL that can store geo-tagged “observations”. It uses an hstore column for key-value pairs and a geography point for the GPS position. It’s very versatile, and could store anything from bird observations to Endomondo-like GPS tracks.
CREATE TABLE observations (
    id SERIAL PRIMARY KEY,
    utc_timestamp TIMESTAMP,
    geog GEOGRAPHY(Point, 4326),
    kvp HSTORE
);
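As a rough sketch of how a row could be written to this table from Python, the helpers below build a parameterized INSERT statement. The function names and sample values are made up for illustration. The `%s` placeholders assume a psycopg2-style driver, and the code only builds the statement and its parameters without connecting to a database. The point is passed as EWKT text to PostGIS’s `ST_GeogFromText`, with longitude first, and the key-value pairs use hstore’s text input format.

```python
def hstore_literal(pairs):
    """Render a dict in hstore text input format: "key"=>"value", ..."""
    def quote(text):
        # Backslashes and double quotes must be escaped inside quoted strings.
        return '"' + str(text).replace("\\", "\\\\").replace('"', '\\"') + '"'

    return ", ".join("%s=>%s" % (quote(k), quote(v)) for k, v in pairs.items())


def build_insert(utc_timestamp, lon, lat, pairs):
    """Return (sql, params) for inserting one observation.

    Assumes a driver that uses %s placeholders (psycopg2 style).
    """
    sql = (
        "INSERT INTO observations (utc_timestamp, geog, kvp) "
        "VALUES (%s, ST_GeogFromText(%s), %s::hstore)"
    )
    # WKT/EWKT point order is longitude first, then latitude.
    point = "SRID=4326;POINT(%s %s)" % (lon, lat)
    return sql, (utc_timestamp, point, hstore_literal(pairs))


if __name__ == "__main__":
    sql, params = build_insert(
        "2012-06-01 12:00:00", 12.5, 55.7, {"species": "robin", "count": 2}
    )
    print(sql)
    print(params)
```

With a real connection, the pair would be passed to the driver as `cursor.execute(sql, params)`.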
Warning: This is a work in progress, so this post will be updated.
TODO: Make a table of the file systems mentioned e.g. in the CRUSH paper (related work section).
A subset of properties that can be considered when designing a new distributed file system:
- Is the cluster static or dynamic?
- Is the cluster heterogeneous or homogeneous?
- Object based or block based?
- Is data ever migrated once it is written? If so, under which circumstances (storage added and/or storage removed)?
- Is the allocation based on metadata or a mapping function (consistent hashing etc.)?
- Is there a central allocator?
- Is there replication? Across failure domains?
- What are the assumptions about the workload (skew on new/old items etc)?
Today I’m reading about the PNUTS distributed database that was developed by Yahoo. I’m giving the PNUTS paper the first of three passes, and reading some background material I found on High Scalability about the database.
- PNUTS: Yahoo!’s Hosted Data Serving Platform
- Highscalability: Yahoo!’s PNUTS Database: Too Hot, Too Cold Or Just Right?
- Yahoo research: PNUTS – Platform for Nimble Universal Table Storage
- Idleprocess: Yahoo!’s Geo-Replication Service, PNUTS
My conclusions about PNUTS:
- See the High Scalability (HS) article instead. It’s a nice summary of the properties of PNUTS.