
Cosine similarity in Python
Cosine similarity is the normalised dot product of two vectors. I guess it is called “cosine” similarity because the dot product equals the product of the Euclidean magnitudes of the two vectors and the cosine of the angle between them, so dividing by the magnitudes leaves just the cosine. If you want, read more about cosine similarity and dot products on Wikipedia. Here is how […]
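As a rough illustration of the definition above, here is a minimal sketch in plain Python. The function name `cosine_similarity` and the choice to raise an error for zero vectors are my own assumptions, not necessarily what the full post does.

```python
import math


def cosine_similarity(a, b):
    """Return the dot product of a and b divided by the product of their norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1, regardless of their lengths.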

(Integer) Linear Programming in Python
Step one:
    brew install glpk
    pip install pulp
Step two:
    from pulp import *
    prob = LpProblem("test1", LpMinimize)
    # Variables
    x = LpVariable("x", 0, 4, cat="Integer")
    y = LpVariable("y", 1, 1, cat="Integer")
    z = LpVariable("z", 0, cat="Integer")
    # Objective
    prob += x + 4*y + 9*z
    # Constraints
    prob += x+y <= 10
    prob += […]

How to compute the pagerank of almost anything
Whenever two things have a directional relationship to each other, you can compute the pagerank of those things. For example, you can observe directional relationships between web pages that link to each other, scientists who cite each other, and chess players who beat each other. The relationship is directional because it matters in […]
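To make the idea concrete, here is a minimal power-iteration sketch in plain Python. It assumes the relationships are given as (source, target) pairs, where rank flows from source to target (for chess, from loser to winner). The function name, the damping factor of 0.85 and the fixed iteration count are illustrative choices, not taken from the full post.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Compute pagerank scores for the nodes in a list of (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share of rank.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = out_links[node]
            if not targets:
                # A node without outgoing links spreads its rank evenly.
                targets = nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank
```

For example, `pagerank([("a", "b"), ("c", "b"), ("b", "a")])` gives "b" the highest score, because both "a" and "c" point to it.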

Running LPsolver in Postgres
Having reinstalled PostgreSQL with support for Python and pointing at my non-system Python, it is time to test whether I can use the convex optimizer library I’ve installed in my Python 2.7 (pip install cvxopt). Install PL/Python if it is not already installed. Doesn’t hurt. create extension plpythonu; Create a function that […]

Clustering in Python
In a project I’m going to use clustering algorithms implemented in Python, such as k-means. Clustering: http://stackoverflow.com/questions/1545606/pythonkmeansalgorithm scipy.cluster has been reported to have some problems, so for now I’ll use PyCluster (following advice given on Stack Overflow). Install PyCluster: pip install PyCluster

Is there a need for a fast compression algorithm for geospatial data?
Fast compression algorithms like Snappy, QuickLZ and LZ4 are designed for a general stream of bytes, and typically don’t treat byte sequences representing numbers in any special way. Geospatial data is special in the sense that it often contains a large quantity of numbers, like floats, representing coordinates.

Trying a Python Rtree implementation
Rtree is a ctypes Python wrapper of libspatialindex that provides a number of advanced spatial indexing features for the spatially curious Python user.

A presentation on spatial indexing
A friend of mine, who is the CEO of a company that develops an embedded database, asked me to do a presentation on spatial indexing. This was an opportunity for me to brush up on R-trees and similar data structures. Download the slides. The presentation introduces R-trees and spatial indexing to a technical audience, who are […]

Having a look at vbuckets
A distribution algorithm is used to map keys to servers in a distributed key-value store. There are several different ones, implemented in different systems, and with different properties. In this blog post I’ll briefly cover the best-known key hashing schemes, before I get to vbuckets.
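As a rough sketch of the general idea, the snippet below hashes a key to one of a fixed number of vbuckets and then looks up the owning server in a table. The function names, the use of CRC32, the round-robin initial assignment and the tiny vbucket count are all assumptions made for illustration; real systems differ in the details.

```python
import zlib

NUM_VBUCKETS = 8  # illustrative only; real deployments use far more


def vbucket_of(key, num_vbuckets=NUM_VBUCKETS):
    """Hash a key to a vbucket id. This never changes when servers change."""
    return zlib.crc32(key.encode("utf-8")) % num_vbuckets


def build_vbucket_map(servers, num_vbuckets=NUM_VBUCKETS):
    """Assign vbuckets to servers round-robin (a hypothetical starting layout)."""
    return [servers[i % len(servers)] for i in range(num_vbuckets)]


def server_for(key, vbucket_map):
    """Find the server that owns a key via its vbucket."""
    return vbucket_map[vbucket_of(key, len(vbucket_map))]
```

The point of the indirection is that moving data to a new server only means reassigning entries in the table. Keys in untouched vbuckets stay where they are, which is not the case when you hash keys directly modulo the number of servers.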

Idea: Automatic theft prevention in public spaces
Background: When I’m at the library, I’d like to be able to go to the toilet without collecting all my stuff from the table. Part of the solution is to have a camera installed that films all the tables, but assuming we can hire someone to look at the camera feeds, that person might not notice […]