Finding a route from one wikipedia page to another

Here’s a game I like to play. Select two wikipedia pages at random, and find a route from one to the other. I stated a theorem once that:

you can get from any page on wikipedia to the page on pollination in 7 steps or less. (it was actually another page, but let’s say it was pollination)

I devised a method for doing this using Google search. Let’s call the random page s, and the page you want to reach t, e.g. pollination. A given page on wikipedia has a set of incoming links (other pages linked to the page), and a set of outgoing links (other pages linked to by the page). Let’s call these two sets in[p] and out[p]. These two sets contain direct decendants and ancestors of p respectively.

Read more

Using CORS instead of JSONP to make cross site requests

Introduction to CORS

CORS (Cross Origin Resource Sharing) is a mechanism specifies by W3C (draft), for allowing browsers to make cross origin requests for resources on other domains under certain conditions. It’s related to JSONP because it solves a similar problem, namely loading data from one domain, into a web application running on a different domain. A difference is that CORS supports the full palette of HTTP verbs, not just GET.

See also: http://en.wikipedia.org/wiki/Cross-Origin_Resource_Sharing

Read more

How to put each word in a file on a separate line

Place each word on a separate line with sed and awk:

sed -e 's/[^[:alpha:]]/ /g' | awk '{ for (i=1;i<=NF;i++) print $i }'

sed is used to replace non alpha characters with spaces (optional)

awk places each word on a separate line.

Taking it a step further you can keep only unique words with the good ol’ lowercase, sort -u trick or sort|uniq if you prefer that:

awk '{ for (i=1;i<=NF;i++) print $i }' | tr "[:upper:]" "[:lower:]" | sort -u

How to split a log file into smaller files

In this example I had a big log file (many million lines), that I wanted to split into smaller logfiles (each one million lines) for processing on Elastic MapReduce.

-rw-r--r--  1 kostas staff 543067012012 Oct 11 13:45 huge_logfile

This is a job for the split command. Because individual lines in the log file must be kept intact, the -l option is used to specify the number of lines in each file. In this example, certain lines are first filtered out with grep, to show how split is used when data is piped in:

grep 'some-pattern' huge_logfile | split -a 6 -l 1000000 - log_

The dash in the split command is used to accept input from standard input, while the log_ is used as a prefix for generated filenames. The -a 6 option tells split to use a 6 character extension after the prefix when naming files. The output looks like this:

huge_logfile
log_aaaaaa
log_aaaaab
log_aaaaac
log_aaaaad
log_aaaaae
...

OpenStreetMap tiles with custom projection and grid using Mapnik

Previous effort

In a previous post I tried generating OpenStreetMap tiles using GeoServer. It ended when I couldn’t find a good style (SLD) to apply the OSM layer.

The (new) idea, that also failed miserably

In this post I tried to use Mapnik to generating OSM map tiles in EPSG:25832. It failed mainly because the Python scripts published by OSM for generating tiles don’t support epsg:25832 out of the box. Mapnik is however an obvious choice for OSM because:

  • Mapnik can read .osm files
  • There is a comprehensive style for Mapnik, that is being maintained by OSM

Read more