Cosine similarity is the normalised dot product between two vectors. I guess it is called "cosine" similarity because the dot product is the product of Euclidean magnitudes of the two vectors and the cosine of the angle between them. If you want, read more about cosine similarity and dot products on Wikipedia.

Here is how to compute cosine similarity in Python, either manually (well, using numpy) or using a specialised library:

import numpy as np
from sklearn.metrics.pairwiseimport cosine_similarity
# vectors
a = np.array([1,2,3])
b = np.array([1,1,4])# manually compute cosine similarity
dot = np.dot(a, b)
norma = np.linalg.norm(a)
normb = np.linalg.norm(b)
cos = dot / (norma * normb)# use library, operates on sets of vectors
aa = a.reshape(1,3)
ba = b.reshape(1,3)
cos_lib = cosine_similarity(aa, ba)print(
dot,
norma,
normb,
cos,
cos_lib[0][0])

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# vectors
a = np.array([1,2,3])
b = np.array([1,1,4])
# manually compute cosine similarity
dot = np.dot(a, b)
norma = np.linalg.norm(a)
normb = np.linalg.norm(b)
cos = dot / (norma * normb)
# use library, operates on sets of vectors
aa = a.reshape(1,3)
ba = b.reshape(1,3)
cos_lib = cosine_similarity(aa, ba)
print(
dot,
norma,
normb,
cos,
cos_lib[0][0]
)

The values might differ a slight bit on the smaller decimals. On my computer I get:

Whenever two things have a directional relationship to each other, then you can compute the pagerank of those things. For example, you can observe a directional relationships between web pages that link to each other, scientists that cite each other, and chess players that beat each other. The relationship is directional because it matters in what direction the relationship points, e.g. who lost to who in chess.

Intuitively, you may think of directional relationships as a transferal of some abstract value between two parties. For example, when one chess player loses to another in chess, then the value (i.e. relative skill level) of the winner will increase and the value of the loser decrease. Furthermore, the amount of value that is transfered depends on the starting value of each party. For example, if a master chess player loses to a novice chess player in chess, then the relative skill level of the novice will dramatically increase. Conversely, if a novice chess player loses to a master chess player in chess, then that is to be expected. In this situation, the relative skill level of each player should remain roughly the same as before - the status quo.

Below you'll see an illustration of a small graph with seven nodes and seven edges. The pagerank of each node is illustrated by shading it, where a darker color denotes a higher rank.

If you study this figure, you should notice that:

Nodes 1 through 4 all have low rank, because no other nodes point to them

Node 5 has a medium rank, because a low-rank node points to it

Node 6 has high rank, because many low-rank nodes point to it

Node 7 has the highest rank, because a high-rank node points to it, while it points to nothing

Compute pagerank with Python

The pageranks of the nodes in the example graph (see figure above) was computed in Python with the help of the networkx library, which can be installed with pip: pip install networkx. The code that creates a graph and computes pagerank is listed below:

Notice that each nodes is represented by an integer ID, with no specific semantics tied to the nodes nor the edges. In other words, the graph could equally well represent relationships between web pages, scientists and chess players (or something else entirely).

If your relationships can be assigned weights, e.g. the strength of a victory in chess or the prominence of a link on a web page, then you can add weights to the edges in the graph. Luckily, weighted edges can be easily added in networkx:

G.add_edge(1,2, weight=0.5)

G.add_edge(1, 2, weight=0.5)

Dealing with time

You may ask yourself, should a chess game that took place last year impact a player's rank as much as a game that was won or lost just last week? In many situations, the most meaningful answer would be no. A good way to represent the passing of time in a relationship graph is to use edge weights that decrease over time by some function. For example, an exponential decay function can be used, such that relationships that were formed a long time ago have exponentially lower weight than recently formed relationships. This can be achieved in Python with the ** operator with a negative exponent:

We use the trick max(.00001, time_passed) to ensure that we do not raise zero to the power of a negative number. The unit of time passed depends on the domain, and is not essential to the computation. For example, the unit could be milliseconds, years or millennia.

Having reinstalled PostgreSQL with support for Python and pointing at my non-system python, it is time to test whether I can use the convex optimizer library I've installed in my Python 2.7 (pip install cvxopt).

Install PL/Python if not already installed

-- if not already installed. Doesn't hurt.CREATE extension plpythonu;

-- if not already installed. Doesn't hurt.
create extension plpythonu;

Create a function that imports cvxopt:

CREATEORREPLACEFUNCTION hello_cvxopt()RETURNS text
AS $$
import cvxopt
RETURN cvxopt.__doc__
$$ LANGUAGE plpythonu IMMUTABLE;

CREATE OR REPLACE FUNCTION hello_cvxopt()
RETURNS text
AS $$
import cvxopt
return cvxopt.__doc__
$$ LANGUAGE plpythonu IMMUTABLE;

See if it works:

SELECT hello_cvxopt();
-- should return a documentation string

select hello_cvxopt();
-- should return a documentation string

CREATEORREPLACEFUNCTION cvxopt_lp_example()RETURNSFLOAT[]AS $$
FROM cvxopt import matrix, solvers
A = matrix([[-1.0,-1.0,0.0,1.0],[1.0,-1.0,-1.0,-2.0]])
b = matrix([1.0,-2.0,0.0,4.0])
c = matrix([2.0,1.0])
solvers.options['show_progress']=FALSE
sol=solvers.lp(c,A,b)RETURN list(sol['x'])
$$ LANGUAGE plpythonu IMMUTABLE;
SELECT cvxopt_lp_example();
-- should return something like "{0.499999995215,1.49999999912}"

CREATE OR REPLACE FUNCTION cvxopt_lp_example()
RETURNS float[]
AS $$
from cvxopt import matrix, solvers
A = matrix([ [-1.0, -1.0, 0.0, 1.0], [1.0, -1.0, -1.0, -2.0] ])
b = matrix([ 1.0, -2.0, 0.0, 4.0 ])
c = matrix([ 2.0, 1.0 ])
solvers.options['show_progress'] = False
sol=solvers.lp(c,A,b)
return list(sol['x'])
$$ LANGUAGE plpythonu IMMUTABLE;
select cvxopt_lp_example();
-- should return something like "{0.499999995215,1.49999999912}"

Fast compression algorithms like Snappy, QuickLZ and LZ4 are designed for a general stream of bytes, and typically don't treat byte-sequences representing numbers in any special way. Geospatial data is special in the sense that it often contains a large amount of numbers, like floats, representing coordinates.

Rtree is a ctypes Python wrapper of libspatialindex that provides a number of advanced spatial indexing features for the spatially curious Python user.

A friend of mine, who is the CEO of a company that develops an embedded database, asked me to do a presentation on spatial indexing. This was an opportunity for me to brush up on R-trees and similar datastructures.

A distribution algorithm is used to map keys to servers in a distributed key-value store. There are several different ones, implemented in different systems, and with different properties. In this blog post I'll briefly cover the best-known key hashing schemes, before I get to vbuckets.

When I'm at the library, I'd like to be able to go to the toilet, without collecting all my stuff from the table. Part of the solution is to have a camera installed that films all the tables, but assuming we can hire someone to look at the camera-feeds, that person might not notice that my laptop was stolen. Of course they could be notified, and the culprit identified from the tapes, but what if the culprit is "disguised? The only solution is to capture the thief before he/she leaves the building. For that to work, the security personel must be notified of the theft exactly when it happens!

Formal problem definition

Problem:

Person A, me, leaves an artifact (computer) at a table in a public space and goes somewhere (restroom).

Person B drops by table and steals computer

By the time A is back, B has left the building

Because B was wearing sunglasses and a blue beard, B can not be identified from the surveillance tape

Person A is sad

Solution requirement: Person B should be apprehended before he/she leaves the building, namely before person A is back and notices the theft. This means that an algorithm must detect the theft as it happens!

Solution approach:

Camera feed is routed to bank of algorithms

Algorithm X detects people and their location, and assigns unique IDs to different people

Algorithm Y detects artifacts, and associates each artifact with the ID of its owner

Algorithm Z detects the situations: 1) An owner has lefts his/her artifact 2) A person which is not the owner is very near an artifact. If both 1 and 2 hold for a given artifact, an event is fired

The events from algorithm Z are handed to the security staff, who can investigate visually whether a theft is taking place.