How to compute the pagerank of almost anything

Whenever two things have a directional relationship to each other, you can compute the pagerank of those things. For example, there are directional relationships between web pages that link to each other, scientists that cite each other, and chess players that beat each other. The relationship is directional because it matters in which direction the relationship points, e.g. who lost to whom in chess.

Intuitively, you may think of a directional relationship as a transfer of some abstract value between two parties. For example, when one chess player loses to another, the value (i.e. relative skill level) of the winner increases and the value of the loser decreases. Furthermore, the amount of value that is transferred depends on the starting value of each party. If a master chess player loses to a novice, the relative skill level of the novice will increase dramatically. Conversely, if a novice loses to a master, that is to be expected, and the relative skill level of each player should remain roughly the same as before – the status quo.
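
This transfer of value is essentially how Elo-style chess ratings work. As a rough illustration (the k step size of 32 and the ratings are made-up values, and this is separate from the pagerank computation below):

def expected_score(r_a, r_b):
    # Probability that the player rated r_a beats the player rated r_b
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_winner, r_loser, k=32):
    # The bigger the upset, the more value is transferred
    delta = k * (1 - expected_score(r_winner, r_loser))
    return r_winner + delta, r_loser - delta

print(update(1200, 2200))  # novice beats master: large transfer
print(update(2200, 1200))  # master beats novice: tiny transfer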

Below you’ll see an illustration of a small graph with seven nodes and seven edges. The pagerank of each node is illustrated by shading it, where a darker color denotes a higher rank.

If you study this figure, you should notice that:

  • Nodes 1 through 4 all have low rank, because no other nodes point to them
  • Node 5 has a medium rank, because a low-rank node points to it
  • Node 6 has high rank, because many low-rank nodes point to it
  • Node 7 has the highest rank, because a high-rank node points to it, while it points to nothing

Compute pagerank with Python

The pageranks of the nodes in the example graph (see figure above) were computed in Python with the help of the networkx library, which can be installed with pip: pip install networkx. The code that creates the graph and computes pagerank is listed below:

import networkx as nx
 
# Initialize directed graph
G = nx.DiGraph()
 
# Add edges (implicitly adds nodes)
G.add_edge(1,6)
G.add_edge(2,6)
G.add_edge(3,6)
G.add_edge(4,6)
G.add_edge(5,6)
G.add_edge(4,5)
G.add_edge(6,7)
 
# Compute pagerank (keys are node IDs, values are pageranks)
pr = nx.pagerank(G)
"""
{
  1: 0.06242340798778012, 
  2: 0.06242340798778012, 
  3: 0.06242340798778012, 
  4: 0.06242340798778012, 
  5: 0.08895357136701444, 
  6: 0.32374552689540625, 
  7: 0.33760726978645894
}
"""

Notice that each node is represented by an integer ID, with no specific semantics tied to the nodes or the edges. In other words, the graph could equally well represent relationships between web pages, scientists or chess players (or something else entirely).
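
In networkx, a node can in fact be any hashable object, so you could use e.g. strings directly. A minimal sketch (the player names are made up):

H = nx.DiGraph()
H.add_edge("alice", "bob")    # alice lost to bob
H.add_edge("carol", "bob")    # carol lost to bob
print(nx.pagerank(H))         # keys are now the player names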

If your relationships can be assigned weights, e.g. the strength of a victory in chess or the prominence of a link on a web page, then you can add weights to the edges in the graph. Weighted edges are easy to add in networkx, and nx.pagerank picks up the weight edge attribute by default:

G.add_edge(1, 2, weight=0.5)

Dealing with time

You may ask yourself: should a chess game that took place last year impact a player’s rank as much as a game that was won or lost just last week? In many situations, the most meaningful answer would be no. A good way to represent the passing of time in a relationship graph is to use edge weights that decrease over time according to some decay function. A simple choice is the inverse of the time passed, which can be computed in Python with the ** operator and a negative exponent:

time_decayed_weight = max(.00001, time_passed) ** -1
G.add_edge(1, 2, weight=time_decayed_weight)

We use the trick max(.00001, time_passed) to ensure that we never raise zero to a negative power, which would raise a ZeroDivisionError. The unit of time passed depends on the domain and is not essential to the computation. For example, the unit could be milliseconds, years or millennia.
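
If you instead want weights that drop exponentially with age, so that a relationship’s weight halves every fixed interval, you can use an exponential decay function. A minimal sketch (the half_life and time_passed values are made-up example numbers):

import math

half_life = 30.0                     # weight halves every 30 time units (assumed)
decay_rate = math.log(2) / half_life
time_passed = 7.0                    # e.g. days since the game was played
time_decayed_weight = math.exp(-decay_rate * time_passed)
G.add_edge(1, 2, weight=time_decayed_weight)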

To be continued…

Running LP-solver in Postgres

Having reinstalled PostgreSQL with support for Python, pointing at my non-system Python, it is time to test whether I can use the convex optimization library I’ve installed in my Python 2.7 (pip install cvxopt).

Install PL/Python if not already installed

-- IF NOT EXISTS makes this a no-op if the extension is already installed
CREATE EXTENSION IF NOT EXISTS plpythonu;

Create a function that imports cvxopt:

CREATE OR REPLACE FUNCTION hello_cvxopt()
  RETURNS text
AS $$
  import cvxopt
  return cvxopt.__doc__
$$ LANGUAGE plpythonu IMMUTABLE;

See if it works:

SELECT hello_cvxopt();
-- should return a documentation string

Try the linear programming example:

CREATE OR REPLACE FUNCTION cvxopt_lp_example()
  RETURNS FLOAT[]
AS $$
  # Minimize c'x subject to A*x <= b
  from cvxopt import matrix, solvers
  A = matrix([ [-1.0, -1.0, 0.0, 1.0], [1.0, -1.0, -1.0, -2.0] ])
  b = matrix([ 1.0, -2.0, 0.0, 4.0 ])
  c = matrix([ 2.0, 1.0 ])
  solvers.options['show_progress'] = False  # silence the solver's log output
  sol = solvers.lp(c, A, b)
  return list(sol['x'])
$$ LANGUAGE plpythonu IMMUTABLE;
 
SELECT cvxopt_lp_example();
-- should return something like "{0.499999995215,1.49999999912}"


Clustering in Python

In a project I’m going to use clustering algorithms implemented in Python, such as k-means.

Clustering

http://stackoverflow.com/questions/1545606/python-k-means-algorithm

scipy.cluster has been reported to have some problems, so for now I’ll use PyCluster (following advice given on stackoverflow).

Install PyCluster:

pip install PyCluster
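
As a quick smoke test, here is a minimal k-means sketch assuming PyCluster’s kcluster interface (note that the package is imported as Pycluster; the toy data is made up):

import numpy as np
import Pycluster

# Two well-separated blobs of ten 2-D points each
data = np.vstack([np.random.randn(10, 2), np.random.randn(10, 2) + 5])

# k-means with k=2: returns a cluster label per point, the within-cluster
# error of the best solution, and how many of the npass runs found it
labels, error, nfound = Pycluster.kcluster(data, nclusters=2, npass=10)
print(labels)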

Continue reading “Clustering in Python”

Is there a need for a fast compression algorithm for geospatial data?

Fast compression algorithms like Snappy, QuickLZ and LZ4 are designed for a general stream of bytes and typically don’t treat byte sequences that represent numbers in any special way. Geospatial data is special in the sense that it largely consists of numbers, such as floats representing coordinates.
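
To illustrate why number-aware treatment can pay off, here is a sketch of one classic trick: quantize coordinates to integers and delta-encode them before handing the bytes to a general-purpose compressor. (The simulated track, the 1e-7 quantization step and the use of zlib as a stand-in compressor are all assumptions made for the example.)

import random
import struct
import zlib

# Simulated GPS track: consecutive coordinates lie close together
coords = [59.3293]
for _ in range(9999):
    coords.append(coords[-1] + random.uniform(-0.00005, 0.00005))

raw = struct.pack('<%dd' % len(coords), *coords)  # plain 64-bit floats

# Quantize to fixed precision (1e-7 degrees, roughly centimeter scale),
# then delta-encode so that most values become small integers
q = [int(round(c * 1e7)) for c in coords]
deltas = [q[0]] + [b - a for a, b in zip(q, q[1:])]
enc = struct.pack('<%dq' % len(deltas), *deltas)

# The delta-encoded integers compress far better than the raw floats
print(len(zlib.compress(raw)), len(zlib.compress(enc)))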

Continue reading “Is there a need for a fast compression algorithm for geospatial data?”

A presentation on spatial indexing

A friend of mine, the CEO of a company that develops an embedded database, asked me to do a presentation on spatial indexing. This was an opportunity for me to brush up on R-trees and similar data structures.

Download the slides

The presentation introduces R-trees and spatial indexing to a technical audience that is not expert in spatial indexing.
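
To get a feel for what an R-tree offers, here is a minimal sketch using the rtree Python package (pip install Rtree), which wraps an R-tree; the bounding boxes are made-up examples:

from rtree import index

idx = index.Index()

# Insert two items, each with a (minx, miny, maxx, maxy) bounding box
idx.insert(1, (0.0, 0.0, 1.0, 1.0))
idx.insert(2, (2.0, 2.0, 3.0, 3.0))

# Which items intersect this query window? Only item 1.
print(list(idx.intersection((0.5, 0.5, 1.5, 1.5))))

# The item nearest to a point (expressed as a degenerate box)
print(list(idx.nearest((2.4, 2.4, 2.4, 2.4), 1)))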

Idea: Automatic theft prevention in public spaces

Background

When I’m at the library, I’d like to be able to go to the toilet without collecting all my stuff from the table. Part of the solution is to have a camera installed that films all the tables, but even assuming we can hire someone to watch the camera feeds, that person might not notice that my laptop was stolen. Of course they could be notified, and the culprit identified from the tapes, but what if the culprit is “disguised”? The only solution is to capture the thief before he/she leaves the building. For that to work, the security personnel must be notified of the theft exactly when it happens!

Formal problem definition

Problem:

  • Person A (me) leaves an artifact (a computer) at a table in a public space and goes somewhere (the restroom).
  • Person B drops by the table and steals the computer.
  • By the time A is back, B has left the building.
  • Because B was wearing sunglasses and a blue beard, B cannot be identified from the surveillance tape.
  • Person A is sad.

Solution requirement: Person B should be apprehended before he/she leaves the building, namely before person A is back and notices the theft. This means that an algorithm must detect the theft as it happens!

Solution approach:

  • Camera feed is routed to bank of algorithms
  • Algorithm X detects people and their location, and assigns unique IDs to different people
  • Algorithm Y detects artifacts, and associates each artifact with the ID of its owner
  • Algorithm Z detects two situations: 1) an owner has left his/her artifact, and 2) a person who is not the owner is very near an artifact. If both 1 and 2 hold for a given artifact, an event is fired (see the sketch below)

The events from algorithm Z are handed to the security staff, who can investigate visually whether a theft is taking place.
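
A minimal sketch of algorithm Z’s rule (the types, names and the nearness threshold are all hypothetical; algorithms X and Y are assumed to supply the tracked people and artifacts):

import math
from dataclasses import dataclass

@dataclass
class Person:
    id: int
    pos: tuple  # (x, y) position reported by algorithm X

@dataclass
class Artifact:
    owner_id: int
    pos: tuple  # (x, y) position reported by algorithm Y

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def theft_event(artifact, people, near=1.0):
    # 1) the owner has left the artifact...
    owner_away = all(p.id != artifact.owner_id or dist(p.pos, artifact.pos) > near
                     for p in people)
    # ...and 2) someone who is not the owner is very near it
    stranger_near = any(p.id != artifact.owner_id and dist(p.pos, artifact.pos) <= near
                        for p in people)
    return owner_away and stranger_near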

How to read computer science papers

Situation: You have a large pile of computer science papers in front of you. You want to read them all. What to do?

My suggestion is that you read the two guides below. They are really short and helpful. I’m one year into my CS PhD, and I still find reading a large pile of papers quite hard, especially when the papers explore problems in a field that I’m not very familiar with.

Continue reading “How to read computer science papers”

Finding a route from one wikipedia page to another

Here’s a game I like to play: select two wikipedia pages at random, and find a route from one to the other. I once stated the theorem that:

you can get from any page on wikipedia to the page on pollination in 7 steps or less. (it was actually another page, but let’s say it was pollination)

I devised a method for doing this using Google search. Let’s call the random page s, and the page you want to reach t, e.g. pollination. A given page on wikipedia has a set of incoming links (other pages that link to the page) and a set of outgoing links (other pages that the page links to). Let’s call these two sets in[p] and out[p]. They contain the direct ancestors and direct descendants of p, respectively.
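
Given in[p] and out[p], one natural way to find a short route is a bidirectional breadth-first search: expand forward from s along out-links and backward from t along in-links until the two frontiers meet. A minimal sketch (out_links and in_links are hypothetical helpers that would fetch the link sets, e.g. via Google search or the wikipedia API):

from collections import deque

def out_links(page):
    raise NotImplementedError  # hypothetical: return out[page]

def in_links(page):
    raise NotImplementedError  # hypothetical: return in[page]

def find_route(s, t, max_depth=7):
    forward, backward = {s: [s]}, {t: [t]}   # page -> path from s / from t
    fq, bq = deque([s]), deque([t])
    for _ in range(max_depth):
        # Expand the forward frontier one step along outgoing links
        for _ in range(len(fq)):
            p = fq.popleft()
            for q in out_links(p):
                if q in backward:            # frontiers meet: s..p -> q -> ..t
                    return forward[p] + [q] + backward[q][-2::-1]
                if q not in forward:
                    forward[q] = forward[p] + [q]
                    fq.append(q)
        # Expand the backward frontier one step along incoming links
        for _ in range(len(bq)):
            p = bq.popleft()
            for q in in_links(p):
                if q in forward:             # q links to p: s..q -> p -> ..t
                    return forward[q] + backward[p][::-1]
                if q not in backward:
                    backward[q] = backward[p] + [q]
                    bq.append(q)
    return None  # no route found within max_depth steps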

Continue reading “Finding a route from one wikipedia page to another”