Finding the most cited first author using the Linux command line

I have a text file containing article references. It looks like this:

- Miller HJ (2004) Tobler’s First Law and spatial analysis. Ann Assoc Am Geogr 94:284–289.
- Onsrud H, ed (2007) Research and Theory in Advanced Spatial Data Infrastructure Concepts (ESRI Press, Redlands, CA).
- Egenhofer M (2002) Toward the geospatial semantic web. Advances in Geographic Information Systems International Symposium, eds Makki Y, Pissinou N (Association for Computing Machinery, McLean, VA), pp 1–4.
- Anselin L, Florax R, Rey S, eds (2004) Advances in Spatial Econometrics: Methodology, Tools and Applications (Springer, Berlin). 
- Wang S, Armstrong M (2009) A theoretical approach to the use of cyberinfrastructure in geographical analysis. Int J Geogr Inf Sci 23:169–193. 
- Wang S (2010) A cyberGIS framework for the synthesis of cyberinfrastructure, GIS, and spatial analysis. Ann Assoc Am Geogr 100:535–557.
- Penninga F, Van Oosterom PJM (2008) A simplicial complex-based DBMS approach to 3D topographic data modelling. Int J Geogr Inf Sci 22:751–779. 
- Baker KS, Chandler CL (2008) Enabling long-term oceanographic research: Changing data practices, in- formation management strategies and informatics. Deep-Sea Res II 55(18–19):2132–2142.

I wanted to find out who the most common first author is in that long list of articles, and this is what I did:

cat refs-2009+.txt | \
sed -e '/^ *$/d' -e 's/^- //' | \
cut -d"(" -f1 | \
cut -d, -f1 | \
cut -d' ' -f1 | \
sort | \
uniq -c | \
sort -rn

The result is this:

   6 Craglia
   4 Wang
   4 Rajabifard
   4 Onsrud
   4 Masser
   4 Grus
   4 Crompvoets
   3 Yang
   3 Steiniger
   3 Gartner
   3 European
   3 Anselin
   2 Wright
   2 Smits
   2 Sieber
   2 Ramsey
   2 Poore
   2 Miller
   2 Lance
   2 Helly
   2 Georgiadou
   2 Fox
   2 Foster
   2 Bregt
   1 Zhang
   1 World
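
To sanity-check the pipeline, here is a minimal, self-contained version that runs on three made-up sample lines fed in with printf instead of the real refs-2009+.txt:

```shell
# Three hypothetical reference lines standing in for refs-2009+.txt
printf '%s\n' \
  '- Wang S (2010) A cyberGIS framework.' \
  '- Wang S, Armstrong M (2009) A theoretical approach.' \
  '- Miller HJ (2004) Toblers First Law.' |
sed -e '/^ *$/d' -e 's/^- //' |   # drop blank lines and the leading "- "
cut -d'(' -f1 |                   # keep everything before the "(year)"
cut -d, -f1 |                     # keep only the first author
cut -d' ' -f1 |                   # keep only the surname
sort | uniq -c | sort -rn         # count surnames, most frequent first
# Wang is counted twice, Miller once
```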

Sticking bicycle paths in CouchDB

In this installment of How To Stick Anything In CouchDB, I’ll take some bicycle path data released by the municipality of Copenhagen and stick it in CouchDB. As always, I’m working in Terminal under Mac OS X.

Bicycle paths in Copenhagen, served live by CouchDB:

The data is in shapefile format, so it’s almost too easy using shp2geocouch. First, download the data.

mv B12983F7B5E3451F8716F7C072A5B101.ashx

What have we downloaded?

$ ls *.shp
Cykelmidter_2006_r2_ETRS89zone32N.shp	Cykelmidter_2006_r2_WGS84.shp

Create a CouchDB database to hold the data. I’m using my CouchDB installation. You should use yours:

curl -X PUT

Use shp2geocouch to upload the data. It shouldn’t matter which one of the shape files we use, because shp2geocouch does reprojection:

shp2geocouch Cykelmidter_2006_r2_WGS84.shp


There is a problem with encoding. If you click on one of the street features on the map above, streets that contain the Danish letters [æ, ø, å] are missing those letters. I tried a test conversion from the shapefiles to GeoJSON with ogr2ogr, and the result is the same. God I hate encoding!

By the way, this is how to convert from a shapefile to GeoJSON using ogr2ogr:

ogr2ogr -f "GeoJSON" cykelmidter.json Cykelmidter_2006_r2_WGS84.shp

If you don’t have shp2geocouch, it’s really easy to install (at least it was on my Mac):

sudo gem install shp2geocouch

There is a RubyGems home page here, and a GitHub page here.

How to create JSON data from a text file on the internet

The following assumes a Linux command line (or the Mac OS X Terminal, in my case).

I want to wrangle text from the internet, turn it into JSON data, and ultimately stick it in CouchDB. Here I’m trying to turn a random text file containing prime numbers into structured JSON data that looks like this:

[2, 3, 5, 7,...]

The original file is here. It is fairly structured to begin with, but it’s not JSON:

                         The First 1,000 Primes
                          (the 1,000th is 7919)
         For more information on primes see

      2      3      5      7     11     13     17     19     23     29 
     31     37     41     43     47     53     59     61     67     71 
     73     79     83     89     97    101    103    107    109    113 

The following pipeline turns it into JSON:

curl | \
tail +4 | \
tr -cs "[:digit:]" "," | \
sed -e 's/^,/\[/' -e 's/,$/\]/' \
> primes.json

Let’s look at it with cat to make sure:

$ cat primes.json
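
Beyond eyeballing it with cat, you can let a JSON parser be the judge. This sketch assumes python3 is on the PATH (it ships with recent macOS and most Linux distributions):

```shell
# json.tool parses the file and exits non-zero if it is not valid JSON
if python3 -m json.tool primes.json > /dev/null 2>&1; then
  echo "primes.json is valid JSON"
else
  echo "primes.json is NOT valid JSON"
fi
```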

Explanation of the command

curl is used to download the file and print it on standard output in the terminal. With no other options, it issues an HTTP GET for the given URL.

tail +4 prints from line 4 onward, discarding the first three lines (on GNU systems, the portable spelling is tail -n +4).

tr -cs "[:digit:]" "," replaces every run of non-digit characters with a single comma (-c complements the digit set, -s squeezes repeats). The new text has a comma before the first digit and a comma after the last one, with no line breaks or spaces: ,2,3,5,7...,7919,

sed -e 's/^,/\[/' -e 's/,$/\]/' is perhaps a bit hard to read. It replaces the comma before the first digit with '[', and replaces the comma after the last digit with ']'.
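
Putting the pieces together, here is a self-contained variant of the whole pipeline that reads from a here-document instead of curl, so it can be tried offline. The header and numbers are made up to mimic the real file:

```shell
# Fake three-line header plus two rows of numbers, mimicking the primes file
cat <<'EOF' > sample.txt
                 The First 1,000 Primes
                  (the 1,000th is 7919)
         For more information on primes see
      2      3      5      7     11     13
     17     19     23     29     31     37
EOF
tail -n +4 sample.txt |            # print from line 4 onward (portable spelling of tail +4)
tr -cs "[:digit:]" "," |           # squeeze every run of non-digits into a single comma
sed -e 's/^,/\[/' -e 's/,$/\]/'    # leading comma -> [ , trailing comma -> ]
# prints [2,3,5,7,11,13,17,19,23,29,31,37]
```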

Who gains from blocked content on YouTube?

When I want to hear a particular rap song from 1992 on YouTube, the video service shows me this:

Yeah yeah, copyright or whatever, but what is the point? Who gains what exactly? By the way, if you can hear the song in the country you’re in, then fuck you 🙂

Who’s involved, let’s see. Me (the user/customer), Google (the owner of YouTube), the EU (makers of regional copyright laws), Sony (the copyright holder), and CMW (the artist).

Does Google, the owner of YouTube, win?

No. Google loses straight away, because I can hear the song on GrooveShark just fine (albeit without the video):

Does the EU win?

No. The EU might gain a little bit: because CMW is an American band, chances are that I’ll listen to an EU artist like Dizzee Rascal instead:

But that’s not going to happen, because I wanted to listen to CMW, and I’ve already found the song on another service, GrooveShark.

Does Sony Music Entertainment win?

No. I already bought the song on iTunes a couple of days ago. If I hadn’t bought it, I would have downloaded it with a torrent. The YouTube video being there or not did not factor into my decision to buy the song. I bought the song because it was insanely easy to do on my iPhone. Period. In fact, I might choose not to buy a song in the future if it’s owned by SME.

Does the artist win?

Hardly; in fact, they lose. I’m sure they appreciate that I bought the song, though I’m sure Sony Music Entertainment appreciates it a hell of a lot more, if I know anything about royalty splits! And the song being blocked on YouTube did not make me buy it, as I’ve already said. I was about to make CMW more famous by linking their video on my blog, but couldn’t. Sorry, CMW.

Do concerned mothers win?

Does the fictional organization of “concerned mothers against gangster rap” gain anything from a blocked gangster rap tune on the internet? Sure, but that is mere coincidence; it could just as well have been a song about flowers or teddy bears, or praise for “concerned mothers against gangster rap”.

By the way, you may check out the song CMW sampled on “N 2 Deep”. It’s by Lyn Collins, and features the distinct sound of the JB’s. Apparently the copyright holder (Polydor) is not insane:


I find that this blog post and video on innovation from Edinburgh by Ed Parsons is somehow related to this issue.

By the way, Ed, if you see the pingback: sorry that I stole your look for WordPress. I kinda liked it, and I do listen to gangster rap occasionally, so my morals are questionable.

Using shp2geocouch to push OSM data into geocouch

Today I installed the utility shp2geocouch on Mac OS X 10.6.

First I needed to update RubyGems…

sudo gem update --system

Then I could install shp2geocouch:

sudo gem install shp2geocouch

Next I downloaded OSM data for Copenhagen, Denmark:

cd copenhagen.shapefiles

Finally I used shp2geocouch to upload one of the shapefiles to the database:

shp2geocouch europe_northern_europe_denmark_copenhagen_highway.shp

This takes a while; the job was still running on my MacBook Pro after ~10 minutes, with 16,000 documents loaded. The final count was 33,306 documents.

As a final touch, the script replicates the geocouch-utils map browser and tells me:

view your data on a map at

The map uses OSM tiles from CloudMade as the background, and fetches clickable road data from IrisCouch using XHR:

Clicking the link, gives you this:

Installing spatial databases on EC2

The spatial databases covered are PostGIS, MySQL Spatial, MongoDB, and Apache Cassandra.

UPDATE: I’ll change this post or create a page to give the actual linux commands to run on the remote server.

PostGIS on EC2

I have found a nice tutorial that describes setting up Postgres on EC2 on an Ubuntu instance with all the trimmings. The blogger (Ron Evans) explains how he does things, including choice of filesystem on EBS, setting up security groups and general architectural decisions. It is quite detailed so you might even learn some linux admin tips from reading it.

I’m using the Amazon Linux AMI for now, and most of what is described should apply to that image as well. I noticed that he installs Postgres with the package manager (apt-get), while Amazon Linux AMIs come with yum.

There is a different tutorial that describes using yum instead of apt-get to install Postgres. As a side note, that writer also seems to prefer the EXT3 filesystem over XFS.

There is also a tutorial for installing Postgres 9.0 with yum that includes installation of PostGIS, which is probably the one I’ll end up following. There is a separate description for Postgres 8.4.

I recommend following this tutorial up to the point of installing Postgres, and then switching to this tutorial.

MySQL with Spatial Extension on EC2

The procedure for installing MySQL on EC2 is described on the MySQL website. The examples given include one using yum, so that is as easy as it gets.

It should be noted that there are community images on EC2 which come preinstalled with MySQL.

ec2-describe-images -a | grep -i mysql

The MySQL website also has a very good section for setting up replication for MySQL on EC2 and related subjects.

One aspect mentioned is scalability: it is “easier to create more EC2 instances to support more users than to upgrade the instance to a larger machine”. Good point, I think, and there are more, so I recommend reading that page; many of the hints also apply directly to running Postgres and MongoDB on EC2.

Another tutorial, by Sam Starling, describes setting up MySQL on an Amazon Linux AMI instance, which is the image that I’m using.

Spatial extensions are included in MySQL from version 4.1 and up.

MongoDB on EC2

UPDATE: All posts I’ve come across on MongoDB and spatial data seem to mention some kind of problem: either query times are long or there is inaccuracy. Perhaps I should take a look at Apache Cassandra for spatial data instead.

There is a tutorial for installing MongoDB on an Amazon Linux AMI 64 bit instance using yum, which is exactly what I have.

The MongoDB homepage also has a section specifically for installing MongoDB on EC2. Either way it seems easy enough.

The spatial capabilities of MongoDB are described on the MongoDB homepage, and also here.

I’ve come across criticism of MongoDB for spatial purposes. I’ll look at MongoDB and form my own opinion, but I’ll keep that post in mind if I run into problems. I’d like to understand the algorithms and data structures used in MongoDB before forming a final opinion.

Apache Cassandra on EC2

A colleague at the university sent me a link describing using Apache Cassandra for spatial data. An overview of Apache Cassandra articles can be found on the Cassandra website.

It seems that Cassandra cannot be installed via a package manager. Installation instructions are given as a quick guide. It requires Java 1.6 update 19 or later, and Amazon Linux AMIs come with Java 1.6 update 20 at present.

tar -zxf apache-cassandra-0.7.6-2-bin.tar.gz
cd apache-cassandra-0.7.6-2
less README.txt

General tips

When running database instances on EC2, use EBS (Elastic Block Store) to store the data. That way the data persists even when the database instance crashes and burns.

Create separate security groups for different tiers like database, web and others.

Do what this page describes with regard to replication, etc.

Oh, and applications with high availability demands should perhaps be spread out over multiple EC2 regions.

Opening and closing ports on EC2 instances

Assuming that the EC2 tools have been installed as described in a previous post, opening and closing ports is done with the ec2-authorize and ec2-revoke commands, respectively. These commands work on security groups rather than on individual instances. Recall that a set of instances belongs to a security group.

Opening port 80 on EC2 instances in the ‘default’ security group:

ec2-authorize default -p 80

Closing port 80 on EC2 instances in the ‘default’ security group:

ec2-revoke default -p 80

See also the Amazon command reference for the EC2 API.

Hints for managing Amazon Linux on EC2

I’m using Mac OS X and running instances in the EU West Region. My instances are of the Amazon Linux AMI.

Installing the EC2 command line tools

Having the command-line tools installed is a supplement to the AWS Management Console found online. I found a good tutorial about how to get started with the tools for EC2 on Mac OS X.

After downloading the tools from the Amazon download site, the tutorial describes how to set environment variables, create X.509 certificates, etc.

The only detail missing was that I’m running my instances in the EU West region. I found a hint in another tutorial on setting an additional environment variable. My resulting .profile file looks like this:

# Setup Amazon EC2 Command-Line Tools
export EC2_HOME=~/.ec2
export PATH=$PATH:$EC2_HOME/bin
export EC2_PRIVATE_KEY=`ls $EC2_HOME/pk-*.pem`
export EC2_CERT=`ls $EC2_HOME/cert-*.pem`
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home/
# This line is from second tutorial, for use with EU West Region:
export EC2_URL=

The first tutorial shows many examples of using the command-line tools to start instances, open ports, etc.

Package manager for Amazon Linux AMI

Maybe the EC2 tools can be used to install packages on an Amazon Linux AMI instance, but you could also use a package manager.

The Amazon Linux AMI comes with the yum package manager installed. A tutorial specifically aimed at installing PHP on an Amazon Linux AMI instance also gives a quick tour of yum. Basically, you do this:

$ sudo yum install <PACKAGE_NAME>

Installing Apache Web Server

As an example of using the EC2 tools and the yum package manager, let’s install the Apache web server. The command ec2-describe-instances lists running instances in the region given by the environment variable EC2_URL.

$ ec2-describe-instances
RESERVATION	r-xxxxxxxx	xxxxxxxxxxxxx	default
INSTANCE	i-xxxxxxxx	ami-xxxxxxx

default is the name of the security group for the instance; you may have used a different name. Security groups make it easier to apply a set of permissions to a range of instances. The command ec2-authorize applies a permission to a security group, like opening up port 80 for httpd.

# open up port 80 on instances belonging to security group 'default'
$ ec2-authorize default -p 80
PERMISSION  default  ALLOWS  tcp  80 80  FROM  CIDR

Log into the instance with ssh, then use the package manager to install httpd.

# use the key pair that you used when launching your instance
$ ssh -i ~/.ec2/ec2-keypair
# install httpd - starts an install process
$ sudo yum install httpd

Good Indian computer science videos on YouTube

While browsing the web for good videos to help me land a cool job at a high-profile tech firm, I came across this series from an Indian university.

Lecture – 16 Disk Based Data Structures

You should be able to easily find the other videos in the series through this one. Generally the subjects that are covered relate to data structures and algorithms:

  • Trees (Red-Black, B, AVL)
  • Hashing
  • Heaps
  • Sorting

The videos are very practical and relate the data structures to scenarios where they would be used, like bank transactions.