When to be most careful about catching the flu?

Continuing on my blogification of Peter Norvigs excellent talk, the question is, when to watch out for the flu, e.g. if you live in Denmark?

1) Go to www.google.com/trends/
2) Type in the word “influenza”
3) Select your geographical region (Denmark in my case)
4) See data up to year 2008, to avoid the graph being squished by the outbreak of A(H1N1) (which leads to unusually many people talking about the flu)

Turns out the answer is: watch out in October and February.

Geocoding Python function for PostgreSQL

Gratefully making use of what others have provided, i.e. geopy, Google and plpythonu.

Type to hold result of geocoding:

CREATE TYPE geocoding AS (
  place text,
  latitude DOUBLE PRECISION,
  longitude DOUBLE PRECISION
);

Function that does the actual geocoding (to be extended with more vendors. Hint: look at geopy wiki). Takes an (arbitrary) input string to be geocoded:

CREATE OR REPLACE FUNCTION python_geocode
(
  input text,
  vendor text DEFAULT 'google'
) RETURNS SETOF geocoding AS
$$
  import time
  from geopy import geocoders
  # https://code.google.com/p/geopy/wiki/GettingStarted
 
  time.sleep(0.2)
  # TODO: Add other available vendors, e.g. Yahoo.
  if vendor.lower() == 'google':
    geocoder = geocoders.GoogleV3()
  else:
    raise ValueError("Invalid geocoder: %s" % vendor)
  try:
    for res in geocoder.geocode(input, exactly_one=False):
      yield {'place': res[0], 'latitude': res[1][0], 'longitude': res[1][1]}
  except:
    pass
$$ LANGUAGE plpythonu VOLATILE;

Example:

SELECT place, ST_SetSRID(ST_MakePoint(longitude, latitude), 4326)
FROM python_geocode('Kostas');

Playing with GraphViz and MathGenealogy data

Math in Genealogy is a great project (donate online). Sven Köhler from Potsdam, Germany has written a python script for visualizing the database, which I’m going to try.

First step is to clone the git repo:

$ git clone git@github.com:tzwenn/MathGenealogy.git

His instructions are quite simple:

$ ./genealogy.py --search 38586  # 30 seconds
$ ./genealogy.py --display 38586 > euler.dot  # 0.1 seconds

Next step is to install e.g. GraphViz, which is needed to visualize the dot file as a graph. Go to the download page for GraphViz, and follow instructions for your OS.

This should install the commandline tool also. Now you can visualize Leonard Euler’s supervisor family tree (direct descendants) like this:

$ dot euler.dot -Tpng -o euler.png

Looking at the database is easy. Every invocation of ./genealogy.py –search writes to a sqlite3 database file (genealogy.db).

$ sqlite3 genealogy.db

This opens up a prompt. Have a look at the schema of the database like this:

sqlite> .schema

And see what is inside the thesis table like this:

sqlite> select * from thesis;

Gregory Palamas 1363

This stuff blows my mind.

Gregory Palamas (1296–135), spelled Γρηγόριος Παλαμάς in greek, was a monk on Mount Athos, a place I’ve visited with my father two times. It is a beautiful peninsula in northern Greece, scattered with old monasteries, the entire half-island being the sole domain of men.

Simonopetra, Mount Athos.
Simonopetra, Mount Athos.

Palamas eventually became the Archbishop of Thessaloniki, which is a city I incidentally happened to live in from 1995-1996. Below is a picture of Gregory Palamas, in the form of an icon.

Gregorio Palamas
Gregorio Palamas

In his early youth, My father (Georgios Kefaloukos) was also a monk on Mount Athos. There he learned the art of icon painting, and could have painted one of Palamas, although I don’t think he did. Below is a picture of my father taken on Mount Athos.

My Father, Georgios Kefaloukos on Mount Athos.
My Father on Mount Athos in 1966

When I first heard about the Math in Genealogy project, I was thrilled to find out that a Gregory Palamas, who lived long ago and was the Archbishop of Thessaloniki, apparently had a transitive relationship with people in science through an unbroken chain of mentoring (112861 “descendants” in total). I became curious, and wanted to find out which famous people he might be connected to.

While Palamas was the Archbishop of Thessaloniki he mentored Nilos Kabasilas (1298-1363), who later replaced him as Archbishop. Nilos in turn mentored Demetrios Kydones (1333-1397) and this lineage of mentoring continues in an unbroken line, through many scholars and countries, until we eventually arrive in Germany and at the famous mathematician Carl Friedrich Gauß in 1799.

Gauß himself mentored a few students, one of whom was Christian Ludwig Gerling (1788-1864), who went on to mentor Julius Plücker (1801-1868) and so forth. Again the chain of mentoring continues until we reach Marcos Vaz Salles, a Brazilian Tenure-Track Assistant Professor at the University of Copenhagen, which is the city I was born in… And here comes the surprising part, for me at least, because Marcos is now mentoring me, together with Professor Martin Zachariasen!

An unbroken line of guys mentoring guys:

  1. Gregory Palamas
  2. Nilos Kabasilas, 1363
  3. Demetrios Kydones
  4. Manuel Chrysoloras
  5. Guarino da Verona, 1408
  6. Vittorino da Feltre, Università di Padova, 1416
  7. Ognibene (Omnibonus Leonicenus) Bonisoli da Lonigo, Università di Mantova
  8. Niccolò Leoniceno, Medicinae Dr., Università di Padova, 1453
  9. Antonio Musa Brasavola, Medicinae Dr., Università degli Studi di Ferrara, 1520
  10. Gabriele Falloppio, Medicinae Dr., Università di Padova / Università degli Studi di Ferrara, 1547
  11. Hieronymus (Girolamo Fabrici d’Acquapendente) Fabricius, Medicinae Dr., Università di Padova, 1559
  12. Adriaan van den Spieghel, Medicinae Dr., Université Catholique de Louvain / Università di Padova, 1603
  13. Adolph Vorstius, Philosophiae Dr., Medicinae Dr., Universiteit Leiden / Università di Padova, 1619, 1622
  14. Franciscus de le Boë Sylvius, Medicinae Dr., Universiteit Leiden / Universität Basel, 1634, 1637
  15. Rudolf Wilhelm Krause, Medicinae Dr., Universiteit Leiden, 1671
  16. Simon Paul Hilscher, Medicinae Dr., Friedrich-Schiller-Universität Jena, 1704
  17. Johann Andreas Segner, Magister artium, Medicinae Dr. Friedrich-Schiller-Universität Jena, 1726, 1734
  18. Johann Georg Büsch, Magister, Georg-August-Universität Göttingen, 1752
  19. Johann Elert Bode, Handelsakademie Hamburg
  20. Johann Friedrich Pfaff, Dr. phil. Georg-August-Universität Göttingen 1786
  21. Carl Friedrich Gauß, Ph.D., Universität Helmstedt, 1799
  22. Christian Ludwig Gerling, Dr. phil., Georg-August-Universität Göttingen, 1812
  23. Julius Plücker, Ph.D., Philipps-Universität Marburg, 1823
  24. C. Felix (Christian) Klein, Dr. phil., Rheinische Friedrich-Wilhelms-Universität Bonn, 1868
  25. Philipp Furtwängler, Ph.D., Georg-August-Universität Göttingen, 1896
  26. Nikolaus Hofreiter, Dr. phil., Universität Wien, 1927
  27. Edmund Hlawka, Dr. phil., Universität Wien, 1938
  28. Hermann Adolf Maurer, Ph.D., Technische Universität Wien, 1965
  29. Hans-Peter Kriegel, Dr. rer. nat., Universität Fridericiana zu Karlsruhe, 1976
  30. Bernhard Seeger, Dr.-Ing., Universität Bremen, 1989
  31. Jens-Peter Dittrich, Dr. rer. nat., Philipps-Universität Marburg, 2002
  32. Marcos Antonio Vaz Salles, Ph.D., Eidgenössische Technische Hochschule Zürich, 2008
  33. Me, getting mentored in 2013

Daniel Grosu, an Associate Professor at Wayne State University has managed to track the mentor lineage through Palamas and even further back to John Mauropous (990-1092), who was a scholar at the University of Constantinople. He was a Byzantine Greek poet, hymnographer and author of letters and orations, living in the 11th century AD. And that is where the tale ends. For now.

Things related to Docker

Docker is a cool idea and open-source product, that seems to be taking the tech community by storm. Wired will tell you why it is cool in a story titled The Man Who Built a Computer the Size of the Internet.

The short version goes: Docker is a way to deploy and move applications with dependencies between Linux servers, using a container concept. The idea is similar to how applications are installed on a Mac, i.e. “everything in a single package”.

There are a number of supporting and related technologies, which I will now list:

  • Google Borg/Apache Mesos are a related technologies, and perhaps Borg is the original rolemodel for Docker. Borg is apparently being replaced by a new system codenamed Omega (video). According to a Wired Story, it influenced Twitter to develop Mesos (originally developed by researchers at the University of California at Berkeley), now Apache Mesos, to do a similar thing as Borg. It might be fair to say that Docker is an easy version of Borg/Mesos/Omega, for non-geniuses (people generally hired by Google, Twitter etc).
  • CoreOS is a supporting technology, an OS designed for deploying containers such as Docker. As mentioned in Wired, the project is based on Google’s ChromeOS. According to the website of this operating system, CoreOS is “Linux kernel + systemd. That’s about it.”

This is it for now about Docker. Just heard about it a few hours ago in an email from a friend and supervisor.

Watched the RAMCloud video

Today I watched a video on RAMCloud. I have made an index over the various sections of the video, with direct links. You’ll find this index in the bottom of this post.

“The RAMCloud project is creating a new class of storage, based entirely in DRAM, that is 2-3 orders of magnitude faster than existing storage systems”

Notable features (in my oppinion) a fast DRAM backed key-value interface with durability, fast recovery and the potential for adding transactions.

Also, RAMCloud is open source.

How many requests per second can I get out of Redis?

Warning: This is not a very interesting post. I’m toying around with the Redis benchmarking tool. What would be significantly more interesting would be to toy around with the Lua API in Redis, which I’ll do in a subsequent post.

In this post, I’ll try to squeeze as many get/set requests out of Redis as I can. I’ll use the redis-benchmark tool to test just the set and get commands. This is not meant to be a benchmark, but a learning experience to see “what works”.

I’m testing the current stable version of Redis: 2.6.15.

Basic testing approach

First, compile Redis from source (it should “just work”) and place the binaries somewhere useful. Next, start Redis server (I use port 7777 for no specific reason):

redis-server --port 7777

To test (set and get):

redis-benchmark -p 7777 -t set,get -q

You should use the redis-benchmark tool to benchmark Redis, for exactly the reasons mentioned in the pitfalls and misconception section on the Redis benchmarking page. The primary reason is that the tool uses multiple connections, and easily enables commands to be pipelined.

This command above uses the -p switch to set the port, uses the -t to limit the commands we test, and finally the -q switch to limit the output to just the throughput.

Additional notes

Redis is a single-threaded server. Unfortunately it does not seems possible to use the benchmark tool to load-balance over several Redis instances, say running on different ports on the same machine. I guess nothing is keeping me from using consistent hashing (or another partitioning technique) with Redis, but the benchmarking tool does not seem to support any kind of partitioning.

Antirez has a blog post about using a Redis proxy called Twemproxy for doing partitioning with Redis. It can potentially increase the throughput. Unfortunately the Proxy uses the epoll system call in Linux, which does not exist on Mac OS X (where kqueue is used instead), so I can not try it.

All in all, I’ll be evaluating Redis in a purely single-node setup, using a TCP loopback connection to the Redis server running on my laptop.

A further thing that is noted on the benchmarking page is that:

Finally, when very efficient servers are benchmarked (and stores like Redis or memcached definitely fall in this category), it may be difficult to saturate the server. Sometimes, the performance bottleneck is on client side, and not server-side. In that case, the client (i.e. the benchmark program itself) must be fixed, or perhaps scaled out, in order to reach the maximum throughput.

Another reason that Redis may not be saturated by the benchmark is that Redis throughput may is limited by the network well before being limited by the CPU. As I’m running on a local machine, I’m assuming that this is not the case, but I’m not entirely sure that there are not other bottlenecks in the OS regarding communication between the benchmark process and the redis-server process. As noted on the benchmarking page: When client and server run on the same box, the CPU is the limiting factor with redis-benchmark.

Let’s keep all that in mind.

1: Running Redis server on my slightly old Macbook Pro

This is the 100% lazy installation. I compiled Redis from source on my laptop, using all defaults, and simply started it.

Hardware: 2.66 GHz Intel Core 2 Duo, 4 GB 1067 Mhz DDR3
OS: Mac OS X 10.6.8 (Snow Leopard)

The result is 37K and 38K requests per second for set and get respectively:

$ redis-benchmark -p 7777 -t set,get -q
SET: 37174.72 requests per second
GET: 37313.43 requests per second

The standard test uses just a single key. To increase the number of expected cache misses, I’ll run the same test using a million random keys (using the -r switch to set number of keys) to see if it makes a huge difference:

redis-benchmark -p 7777 -t set,get -r 1000000 -q

The difference is roughly 2.8% for set and 3% for get. Nothing dramatic. The performance overall is however not great for this initial setup running unmodified on my laptop.

2: Using pipelining

Now I’ll read the fucking manual. Maybe it helps. Redis has a page about benchmarking Redis. The first suggestion is to use pipelining. It is enabled by using the -P switch with an argument of the number of commands to bunch together in each request. I’ll try 16 as suggested on the page.

$ redis-benchmark -p 7777 -t set,get -P 16 -q
SET: 222222.22 requests per second
GET: 256410.25 requests per second

Actually the throughput varies a lot between different runs of this test, much more than the non-pipelined test. With that in mind, it seems that using a pipeline level of 100 is better than 16, about 30% higher throughput:

$ redis-benchmark -p 7777 -t set,get -P 100 -q
SET: 312500.00 requests per second
GET: 333333.34 requests per second

But using a pipeline level of 1000 is worse. Again there is a lot of variance, so I’d need to do a proper statistical analysis. Here I’m doing a rough estimation, and using pipelining of 100 dominates 1000, and that is all I care about.

Bottom line is that you can get 1 order of magnitude improvement to throughput by using pipelining, at least on my old Macbook Pro. Maybe it will be more or less on a “proper” server.

The question is, can we do better?

3: Using lua scripting

Redis supports Lua scripts that are evaluated server side. This can improve the throughput in the situation where a read is followed by a computation follow by say a write. Without scripting, even if using pipelining, there would be a roundtrip following the initial read in order to do the computation. The benefits of scripting are really application specific, and I’ll not explore that further.

4: Various potential optimizations

  • Use another memory allocation library. Default is libc. Unlikely to have any dramatic effect on the test
  • Other things to consider?

I have not tried any of these optimizations.

5: Givin’er all she’s got!

On the Redis page there are results posted for a high-end server, using TCP loopback (like I am) and without pipelining.

Here are the results for a 2 x Intel X5670 @ 2.93 GHz (without pipelining):

SET: 142653.36 requests per second
GET: 142450.14 requests per second

For Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (with pipelining):

SET: 552028.75 requests per second
GET: 707463.75 requests per second

Note that these are not the same machines.

That is roughly 3.8x increase in throughput (compared to my laptop), in the non-pipelining case (run on the high-end server) and roughly 2x in the pipelining case (run on the not-quite-as-high-end server). Again, take the numbers with a big grain of salt. They essentially say nothing wildly interesting. The main conclusion is that pipelining and perhaps Lua scripting is a good idea. Also that partitioning may improve throughput, in which case you could try the Twemproxy code if you’re on Linux.

Conclusion and next steps

Using a single-node instance of Redis running on my laptop I managed to get 300K requests per second (both get and set). This is achieved only if using pipelining (100 commands at a time). On a high-end machine someone got 700K get requests per second using pipelining, i.e. a bit more than twice the throughput.

My goal is to squeeze 1 million get requests per second out of Redis, for a “realistic workload”. For this I’ll use a partitioning approach. The approach is to use Twemproxy running on a multi-core Linux machine with several Redis instances. The exact setup will take some experimenting to get right.

Out of the box, both pipelining and Lua scripting are good avenues for improving performance with Redis server. I saw 1 order of magnitude improvement to throughput when using pipelining. Both approaches are quite application specific, perhaps Lua scripting more so than pipelining. I did not experiment with lua scripting. That would also be very interesting to try.