## Cosine similarity in Python

Cosine similarity is the normalised dot product between two vectors. It is called "cosine" similarity because the dot product equals the product of the Euclidean magnitudes of the two vectors and the cosine of the angle between them, so the normalised dot product is exactly that cosine. If you want, read more about cosine similarity and dot products on Wikipedia.

Here is how to compute cosine similarity in Python, either manually (well, using numpy) or using a specialised library:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# vectors
a = np.array([1, 2, 3])
b = np.array([1, 1, 4])

# manually compute cosine similarity
dot = np.dot(a, b)
norma = np.linalg.norm(a)
normb = np.linalg.norm(b)
cos = dot / (norma * normb)

# use library, operates on sets of vectors
aa = a.reshape(1, 3)
ba = b.reshape(1, 3)
cos_lib = cosine_similarity(aa, ba)

print(dot, norma, normb, cos, cos_lib)
```

The two values may differ slightly in the last decimals, due to floating-point rounding. On my computer I get:

• 0.9449111825230682 (manual)
• 0.9449111825230683 (library)

## (Integer) Linear Programming in Python

Step one:

```shell
brew install glpk
pip install pulp
```

Step two:

```python
from pulp import *

prob = LpProblem("test1", LpMinimize)

# Variables
x = LpVariable("x", 0, 4, cat="Integer")
y = LpVariable("y", -1, 1, cat="Integer")
z = LpVariable("z", 0, cat="Integer")

# Objective
prob += x + 4*y + 9*z

# Constraints
prob += x + y <= 5
prob += x + z >= 10
prob += -y + z == 7

prob.solve(GLPK())

# Solution
for v in prob.variables():
    print(v.name, "=", v.varValue)

print("objective=", value(prob.objective))
```

In the documentation there are further examples, e.g. one to minimise the cost of producing cat food.

## How to compute the pagerank of almost anything

Whenever two things have a directional relationship to each other, you can compute the pagerank of those things. For example, you can observe directional relationships between web pages that link to each other, scientists that cite each other, and chess players that beat each other. The relationship is directional because it matters in which direction the relationship points, e.g. who lost to whom in chess. Intuitively, you may think of a directional relationship as a transferral of some abstract value between two parties. For example, when one chess player loses to another, the value (i.e. relative skill level) of the winner increases and the value of the loser decreases. Furthermore, the amount of value that is transferred depends on the starting value of each party. If a master chess player loses to a novice, then the relative skill level of the novice will increase dramatically. Conversely, if a novice loses to a master, then that is to be expected; the relative skill level of each player should remain roughly the same as before - the status quo.

Below you'll see an illustration of a small graph with seven nodes and seven edges. The pagerank of each node is illustrated by shading it, where a darker color denotes a higher rank. If you study this figure, you should notice that:

• Nodes 1 through 4 all have a low rank, because no other nodes point to them
• Node 5 has a medium rank, because a low-rank node points to it
• Node 6 has a high rank, because many low-rank nodes point to it
• Node 7 has the highest rank, because a high-rank node points to it, while it points to nothing

## Compute pagerank with Python

The pageranks of the nodes in the example graph (see figure above) were computed in Python with the help of the networkx library, which can be installed with pip: `pip install networkx`. The code that creates the graph and computes pagerank is listed below:

```python
import networkx as nx

# Initialize directed graph
G = nx.DiGraph()

# Add edges (implicitly adds nodes)
G.add_edge(1, 6)
G.add_edge(2, 6)
G.add_edge(3, 6)
G.add_edge(4, 6)
G.add_edge(5, 6)
G.add_edge(4, 5)
G.add_edge(6, 7)

# Compute pagerank (keys are node IDs, values are pageranks)
pr = nx.pagerank(G)
"""
{
 1: 0.06242340798778012,
 2: 0.06242340798778012,
 3: 0.06242340798778012,
 4: 0.06242340798778012,
 5: 0.08895357136701444,
 6: 0.32374552689540625,
 7: 0.33760726978645894
}
"""
```

Notice that each node is represented by an integer ID, with no specific semantics tied to the nodes or the edges. In other words, the graph could equally well represent relationships between web pages, scientists or chess players (or something else entirely).

If your relationships can be assigned weights, e.g. the strength of a victory in chess or the prominence of a link on a web page, then you can add weights to the edges in the graph. Luckily, weighted edges can be easily added in networkx:

`G.add_edge(1, 2, weight=0.5)`
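By default, `nx.pagerank` uses the edge attribute named `weight` when it is present, so weighted edges change how much rank flows along each edge. A minimal sketch (the graph and its weights are made up for illustration):

```python
import networkx as nx

# hypothetical graph: node 1 points to node 2 strongly and to node 3 weakly
G = nx.DiGraph()
G.add_edge(1, 2, weight=3.0)
G.add_edge(1, 3, weight=1.0)

# nx.pagerank reads the 'weight' edge attribute by default
pr = nx.pagerank(G)

# node 2 receives three quarters of the rank node 1 passes on,
# node 3 only one quarter, so node 2 ends up with the higher pagerank
print(pr)
```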

## Dealing with time

You may ask yourself: should a chess game that took place last year impact a player's rank as much as a game that was won or lost just last week? In many situations, the most meaningful answer is no. A good way to represent the passing of time in a relationship graph is to give the edges weights that decrease over time by some decay function, so that relationships formed long ago carry much less weight than recently formed ones. A simple choice is inverse decay, which can be expressed in Python with the ** operator and a negative exponent:

```python
time_decayed_weight = max(.00001, time_passed) ** -1
G.add_edge(1, 2, weight=time_decayed_weight)
```

We use the trick `max(.00001, time_passed)` to ensure that we do not raise zero to the power of a negative number. The unit of time passed depends on the domain, and is not essential to the computation. For example, the unit could be milliseconds, years or millennia.
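Note that `time_passed ** -1` is inverse (power-law) decay rather than exponential decay. If you want a weight that truly decays exponentially, a sketch with a hypothetical half-life parameter could look like this, and the result would be passed to `G.add_edge` as the `weight` in the same way:

```python
# hypothetical parameter: a relationship loses half its weight
# every 30 time units (days, weeks - whatever the domain uses)
def exponential_decay_weight(time_passed, half_life=30.0):
    return 0.5 ** (time_passed / half_life)

# a recent relationship keeps most of its weight,
# while an old one is discounted heavily
recent = exponential_decay_weight(1)
old = exponential_decay_weight(365)
print(recent, old)
```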

To be continued...

## Running LP-solver in Postgres

Having reinstalled PostgreSQL with support for Python, pointed at my non-system Python, it is time to test whether I can use the convex optimisation library I've installed in my Python 2.7 (`pip install cvxopt`).

Install PL/Python, if it isn't already installed:

```sql
-- if not already installed. Doesn't hurt.
CREATE EXTENSION plpythonu;
```

Create a function that imports cvxopt:

```sql
CREATE OR REPLACE FUNCTION hello_cvxopt()
  RETURNS text
AS $$
  import cvxopt
  return cvxopt.__doc__
$$ LANGUAGE plpythonu IMMUTABLE;
```

See if it works:

```sql
SELECT hello_cvxopt();
-- should return a documentation string
```

Try the linear programming example:

```sql
CREATE OR REPLACE FUNCTION cvxopt_lp_example()
  RETURNS float[]
AS $$
  from cvxopt import matrix, solvers
  A = matrix([[-1.0, -1.0, 0.0, 1.0],
              [1.0, -1.0, -1.0, -2.0]])
  b = matrix([1.0, -2.0, 0.0, 4.0])
  c = matrix([2.0, 1.0])
  solvers.options['show_progress'] = False
  sol = solvers.lp(c, A, b)
  return list(sol['x'])
$$ LANGUAGE plpythonu IMMUTABLE;

SELECT cvxopt_lp_example();
-- should return something like "{0.499999995215,1.49999999912}"
```


## Is there a need for a fast compression algorithm for geospatial data?

Fast compression algorithms like Snappy, QuickLZ and LZ4 are designed for a general stream of bytes, and typically don't treat byte-sequences representing numbers in any special way. Geospatial data is special in the sense that it often contains a large amount of numbers, like floats, representing coordinates.
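One way to exploit that structure - sketched below under the assumption that fixed-point precision is acceptable - is to quantise coordinates and delta-encode them (the idea behind e.g. Google's encoded polyline format), so that nearby coordinates become small integers that a general-purpose compressor handles much better than raw floats:

```python
def encode(coords, precision=1e5):
    # quantise to fixed-point integers (assumption: ~1 m accuracy suffices)
    q = [round(c * precision) for c in coords]
    # delta-encode: nearby coordinates become small integers
    return [q[0]] + [b - a for a, b in zip(q, q[1:])]

def decode(deltas, precision=1e5):
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc / precision)
    return out

# a made-up track of nearby longitudes
track = [13.3700, 13.3701, 13.3702, 13.3704]
encoded = encode(track)
print(encoded)  # first value is large, the rest are small deltas
```

Decoding is the cumulative sum divided by the precision, so the roundtrip is exact up to the chosen quantisation step.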

## Trying a Python R-tree implementation

Rtree is a ctypes Python wrapper of libspatialindex that provides a number of advanced spatial indexing features for the spatially curious Python user.
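A minimal sketch of the Rtree API - insert a few bounding boxes and run an intersection and a nearest-neighbour query (the boxes and coordinates are made up):

```python
from rtree import index

# build an in-memory R-tree over 2D bounding boxes
idx = index.Index()

# insert(id, (minx, miny, maxx, maxy))
idx.insert(1, (0.0, 0.0, 1.0, 1.0))
idx.insert(2, (2.0, 2.0, 3.0, 3.0))
idx.insert(3, (5.0, 5.0, 6.0, 6.0))

# which boxes intersect the query window?
hits = list(idx.intersection((0.5, 0.5, 2.5, 2.5)))
print(sorted(hits))  # boxes 1 and 2 overlap the window, box 3 does not

# nearest-neighbour query: the single closest box to a point
nearest = list(idx.nearest((3.5, 3.5, 3.5, 3.5), 1))
```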

## A presentation on spatial indexing

A friend of mine, who is the CEO of a company that develops an embedded database, asked me to do a presentation on spatial indexing. This was an opportunity for me to brush up on R-trees and similar data structures.

The presentation introduces R-trees and spatial indexing to a technical audience who are, however, not spatial indexing experts.

## Having a look at vbuckets

A distribution algorithm is used to map keys to servers in a distributed key-value store. There are several different ones, implemented in different systems, and with different properties. In this blog post I'll briefly cover the best-known key hashing schemes, before I get to vbuckets.
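As a baseline for comparison, here is the simplest scheme, modulo hashing, along with its well-known weakness: almost all keys map to a different server when the number of servers changes (the key names below are made up):

```python
import hashlib

def server_for(key, n_servers):
    # naive modulo hashing: hash the key and map it onto one of n servers
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n_servers

keys = ['user:%d' % i for i in range(1000)]

# fraction of keys that map to a different server
# when the cluster grows from 4 to 5 servers
moved = sum(server_for(k, 4) != server_for(k, 5) for k in keys)
print(moved / len(keys))  # around 0.8: most keys remap
```

Schemes like consistent hashing and vbuckets exist precisely to keep that fraction small when the cluster changes size.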

## Background

When I'm at the library, I'd like to be able to go to the toilet without collecting all my stuff from the table. Part of the solution is to have a camera installed that films all the tables, but even assuming we can hire someone to watch the camera feeds, that person might not notice that my laptop was stolen. Of course they could be notified, and the culprit identified from the tapes, but what if the culprit is disguised? The only solution is to capture the thief before he or she leaves the building. For that to work, the security personnel must be notified of the theft exactly when it happens!

## Formal problem definition

Problem:

• Person A (me) leaves an artifact (a computer) at a table in a public space and goes somewhere (the restroom)
• Person B drops by the table and steals the computer
• By the time A is back, B has left the building
• Because B was wearing sunglasses and a blue beard, B cannot be identified from the surveillance tape