Finding a route from one wikipedia page to another

Here’s a game I like to play. Select two wikipedia pages at random, and find a route from one to the other. I stated a theorem once that:

you can get from any page on wikipedia to the page on pollination in 7 steps or less. (it was actually another page, but let’s say it was pollination)

I devised a method for doing this using Google search. Let’s call the random page s, and the page you want to reach t, e.g. pollination. A given page on wikipedia has a set of incoming links (other pages linked to the page), and a set of outgoing links (other pages linked to by the page). Let’s call these two sets in[p] and out[p]. These two sets contain direct decendants and ancestors of p respectively.

Computing in[p] using google search

This is how to compute in[http://en.wikipedia.org/wiki/Pollination]. Type the following into the standard Google search page:

link:http://en.wikipedia.org/wiki/Pollination site:en.wikipedia.org (link)

Computing out[p]

This is trivial. To compute out[http://en.wikipedia.org/wiki/Ketchup] simply visit the page on Ketchup and look at the links.

Algorithm

The idea is to use BFS (breadth first search) simultaneously from s and t. Searching from s involves recursively computing out[p] to discover the set of decendants of s. Searching from t involves recursively computing in[p] to discover the set of ancestors of t. If there is a route from s to t, at some point the intersection between the decendants of s and the ancestors of t will be non-empty. This situation is illustrated by the green node:

Now, Google is a bit stingy with the requests, and do not allow requests made e.g. by curl. So for now the algorithm is run by hand.

Optimization using guided depth-first-search

One optimization when computing in[p], is to include a word that is very general or popular, e.g. the word “sex”. The idea is that the likelyhood that a given page p is both ancestor of t and a decendant of s, increases with the popularity of p. In this case the algorithm uses depth first to first explore paths going through popular pages.

As an example, the page on sex is a member of in[http://en.wikipedia.org/wiki/Pollination]. One of the reason that I chose Pollination to begin with…

2 Comments

Agustin

January 5, 2012 / 4:12 pm Log in to Reply

Incredible idea:
Searching from s involves recursively computing out[p] to discover the set of decendants of s. Searching from t involves recursively computing in[p] to discover the set of ancestors of t.

for my work Im doing this and I just did from p, not from both, telling my boss this idea and asking him what he thinks about it.

regards
- kostas
  
  January 5, 2012 / 7:45 pm Log in to Reply
  
  The algorithm was just made for fun, so that it is possible to find the path using only a browser. For a work application I’d consider downloading the Wikipedia database, and doing normal shortest path on a graph representation of the data. That would be much faster than my play idea. But thanks for the kind words.

Computing in[p] using google search

Computing out[p]

Algorithm

Optimization using guided depth-first-search

2 Comments

Leave a ReplyCancel Reply