Giving KyotoCabinet a go

Install KC:

tar xzvf kyotocabinet-1.2.76.tar.gz
cd kyotocabinet-1.2.76
./configure && make && make install # takes a couple of minutes


Having a look at LevelDB

Note: As I read it, one of the authors of LevelDB (Sanjay) recommends using TokyoCabinet if the workload is more read-oriented. In turn, the TokyoCabinet authors recommend using KyotoCabinet 🙂 (see video with a guy who really likes KyotoCabinet)


LevelDB is a database library (C++, 350 kB) written at Google by Sanjay Ghemawat and Jeff Dean (of GFS, BigTable, MapReduce, Spanner etc. fame). It is an embedded database, i.e. one you use directly from your programming language and that runs in the same process as your program. The interface is a key-value store.


Toying around with Google BigQuery Python CLI

BigQuery is an API developed by Google for querying big data using an SQL-like language. For beginners I can recommend this video presentation on BigQuery. It also covers the open-source version (Apache Drill) of the software (Dremel) underlying BigQuery. There is also a Google I/O 2012 video.

Install the Python CLI (here I’m using pip):

pip install bigquery
# this installs a program called 'bq', the Python CLI to BigQuery


Simple Rest API for storing “point” observations

Database stuff

First I’ll describe the database that backs the service.

PostGIS database backend

Here is how to make a simple table in PostgreSQL that can store geo-tagged “observations”. It uses an hstore type for key-value pairs and a geography point for the GPS dot. It’s very versatile, and could store anything from bird observations to Endomondo-like GPS tracks.

CREATE TABLE observations(
    utc_timestamp TIMESTAMP,
    geog GEOGRAPHY(Point, 4326),
    kvp HSTORE
);
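
To give an idea of what a row for the observations table looks like, here is a minimal Python sketch that prepares an INSERT statement and its parameters. The helper names (hstore_literal, observation_insert) are made up for this illustration, and the %s placeholders assume a psycopg2-style driver. It only builds the statement and does not talk to a database.

```python
from datetime import datetime


def hstore_literal(pairs):
    """Render a dict as an hstore text literal: "key"=>"value", ..."""
    def quote(value):
        # Escape backslashes first, then double quotes.
        text = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return '"' + text + '"'
    return ", ".join(f"{quote(k)}=>{quote(v)}" for k, v in pairs.items())


def observation_insert(lon, lat, pairs, when):
    """Return (sql, params) for one observation (hypothetical helper).

    The point is written as EWKT, with longitude before latitude.
    """
    sql = (
        "INSERT INTO observations (utc_timestamp, geog, kvp) "
        "VALUES (%s, ST_GeogFromText(%s), %s::hstore)"
    )
    params = (
        when.isoformat(sep=" "),
        f"SRID=4326;POINT({lon} {lat})",
        hstore_literal(pairs),
    )
    return sql, params


if __name__ == "__main__":
    sql, params = observation_insert(
        12.57, 55.68, {"species": "blackbird"}, datetime(2012, 8, 29, 12, 0, 0)
    )
    print(sql)
    print(params)
```

Passing the values as parameters rather than pasting them into the SQL string leaves the escaping of the outer string literals to the driver.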


Design space of distributed file systems

Warning: This is work in progress, so this post will be updated
TODO: Make a table of the file systems mentioned e.g. in CRUSH paper (related work section).

Design space

A subset of properties that can be considered when designing a new distributed file system:

  • Is the cluster static or dynamic?
  • Is the cluster heterogeneous or homogeneous?
  • Object based or block based?
  • Is data ever migrated once it is written? If so under which circumstances (storage added and/or storage deleted)?
  • Is the allocation based on metadata or a mapping function (consistent hashing etc)?
  • Is there a central allocator?
  • Is there replication? Across failure domains?
  • What are the assumptions about the workload (skew on new/old items etc)?

Reading diary, 29 August 2012

Today I’m reading about PNUTS, the distributed database developed by Yahoo. I’m giving the PNUTS paper the first of three passes, and reading some background material about the database that I found on highscalability.

The paper:

  • PNUTS: Yahoo!’s Hosted Data Serving Platform

Background articles:

My conclusions about PNUTS:

  • See the HS article instead. It’s a nice summary of the properties of PNUTS.