Installing pip and virtualenv on Mac

These instructions show how to install pip and virtualenv on a Mac running Snow Leopard (10.6.8) with Python 2.7. I used this setup to install Django 1.3.1 (installation instructions included).

Installing pip

(skip if you have pip installed)

First make sure you have either setuptools or distribute installed. Please consult your operating system’s package manager or install it manually:

curl | python
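Once setuptools or distribute is in place, the remaining steps are roughly as follows. This is a hedged sketch of the 2011-era toolchain (package names as they were then, sudo assumed); adjust to taste:

```shell
sudo easy_install pip              # use easy_install (from setuptools) to fetch pip
sudo pip install virtualenv        # then install virtualenv with pip
virtualenv --no-site-packages env  # create an isolated environment
source env/bin/activate            # activate it
pip install Django==1.3.1          # Django goes into the env, not system site-packages
```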

Read more

Finding a route from one Wikipedia page to another

Here’s a game I like to play: select two Wikipedia pages at random, and find a route from one to the other. I once stated a theorem that:

you can get from any page on Wikipedia to the page on pollination in 7 steps or less. (It was actually another page, but let’s say it was pollination.)

I devised a method for doing this using Google search. Let’s call the random page s, and the page you want to reach t, e.g. pollination. A given page on Wikipedia has a set of incoming links (other pages linking to the page) and a set of outgoing links (other pages linked to by the page). Let’s call these two sets in[p] and out[p]. They contain the direct ancestors and descendants of p, respectively.
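With in[p] and out[p] available, the route can be found by a bidirectional search: expand forward from s along out[p] and backward from t along in[p] until the two frontiers meet. A minimal sketch (the graph data here is a made-up toy, not real Wikipedia links):

```python
from collections import deque

def bidirectional_route(s, t, out_links, in_links):
    """Find a short path s -> t by searching forward from s (via out[p])
    and backward from t (via in[p]) until the frontiers meet.
    out_links/in_links map a page to its outgoing/incoming links."""
    if s == t:
        return [s]
    fwd = {s: None}   # parent pointers for pages reached from s
    bwd = {t: None}   # parent pointers for pages reached from t
    fq, bq = deque([s]), deque([t])
    while fq and bq:
        if len(fq) <= len(bq):            # expand the smaller frontier
            page = fq.popleft()
            for nxt in out_links.get(page, ()):
                if nxt not in fwd:
                    fwd[nxt] = page
                    if nxt in bwd:
                        return _join(nxt, fwd, bwd)
                    fq.append(nxt)
        else:
            page = bq.popleft()
            for prev in in_links.get(page, ()):
                if prev not in bwd:
                    bwd[prev] = page
                    if prev in fwd:
                        return _join(prev, fwd, bwd)
                    bq.append(prev)
    return None  # no route exists

def _join(meet, fwd, bwd):
    """Stitch the two half-paths together at the meeting page."""
    left, p = [], meet
    while p is not None:
        left.append(p)
        p = fwd[p]
    left.reverse()
    right, p = [], bwd[meet]
    while p is not None:
        right.append(p)
        p = bwd[p]
    return left + right
```

For example, with out_links = {"Apple": ["Fruit"], "Fruit": ["Flower"], "Flower": ["Pollination"]} and in_links as its inverse, bidirectional_route("Apple", "Pollination", out_links, in_links) returns ["Apple", "Fruit", "Flower", "Pollination"].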

Read more

Using CORS instead of JSONP to make cross site requests

Introduction to CORS

CORS (Cross-Origin Resource Sharing) is a mechanism specified by the W3C (draft) that allows browsers to make cross-origin requests for resources on other domains under certain conditions. It’s related to JSONP in that it solves a similar problem, namely loading data from one domain into a web application running on a different domain. One difference is that CORS supports the full palette of HTTP verbs, not just GET.
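The server opts in by sending response headers that tell the browser which origins may read the response. A minimal sketch of that server-side decision (the helper name and whitelist are hypothetical, not from the spec or this post):

```python
# Assumption: we only want to share our API with this one web app.
ALLOWED_ORIGINS = {"https://app.example.com"}

def cors_headers(request_origin):
    """Headers to attach to a response for a cross-origin request.
    Returns an empty dict when the origin is not whitelisted, in which
    case the browser refuses to expose the response to the page."""
    if request_origin not in ALLOWED_ORIGINS:
        return {}
    return {
        # Echo back the specific origin rather than using "*"
        "Access-Control-Allow-Origin": request_origin,
        # Unlike JSONP, CORS is not limited to GET:
        "Access-Control-Allow-Methods": "GET, POST, PUT, DELETE",
    }
```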


Read more

How to put each word in a file on a separate line

Place each word on a separate line with sed and awk:

sed -e 's/[^[:alpha:]]/ /g' | awk '{ for (i=1;i<=NF;i++) print $i }'

sed replaces non-alphabetic characters with spaces (this step is optional).

awk places each word on a separate line.

Taking it a step further, you can keep only the unique words with the good ol’ lowercase-then-sort -u trick (or sort | uniq if you prefer that):

awk '{ for (i=1;i<=NF;i++) print $i }' | tr "[:upper:]" "[:lower:]" | sort -u
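As a quick check, running the full pipeline over a small made-up sample:

```shell
echo "The bee, the flower & the Bee" \
  | sed -e 's/[^[:alpha:]]/ /g' \
  | awk '{ for (i=1;i<=NF;i++) print $i }' \
  | tr "[:upper:]" "[:lower:]" \
  | sort -u
# bee
# flower
# the
```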

How to split a log file into smaller files

In this example I had a big log file (many millions of lines) that I wanted to split into smaller log files (one million lines each) for processing on Elastic MapReduce.

-rw-r--r--  1 kostas staff 543067012012 Oct 11 13:45 huge_logfile

This is a job for the split command. Because individual lines in the log file must be kept intact, the -l option is used to specify the number of lines in each file. In this example, certain lines are first filtered out with grep, to show how split is used when data is piped in:

grep 'some-pattern' huge_logfile | split -a 6 -l 1000000 - log_

The dash in the split command tells it to accept input from standard input, while log_ is used as a prefix for the generated filenames. The -a 6 option tells split to use a 6-character suffix after the prefix when naming files. The output looks like this:
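To see the naming scheme in action, here is a hedged small-scale run of the same flags on generated data (25 lines, 10 per file; the filenames assume split’s default alphabetic suffixes):

```shell
seq 1 25 > sample_log          # stand-in for the huge log file
split -a 6 -l 10 sample_log log_
ls log_*
# log_aaaaaa, log_aaaaab and log_aaaaac, holding 10, 10 and 5 lines respectively
```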