How to create JSON data from a text file on the internet

The following assumes a Linux command line (or the Mac OS X terminal, in my case).

I want to wrangle text from the internet, turn it into JSON data, and ultimately stick it in CouchDB. Here I’m trying to turn a random text file containing prime numbers into structured JSON data that looks like this:

[2, 3, 5, 7,...]

The original file is here: It is fairly structured to begin with, but it’s not JSON.

                         The First 1,000 Primes
                          (the 1,000th is 7919)
         For more information on primes see

      2      3      5      7     11     13     17     19     23     29 
     31     37     41     43     47     53     59     61     67     71 
     73     79     83     89     97    101    103    107    109    113 

The following command turns it into JSON:

curl | \
tail +4 | \
tr -cs "[:digit:]" "," | \
sed -e 's/^,/\[/' -e 's/,$/\]/' \
> primes.json

Let’s look at it with cat to make sure:

$ cat primes.json

Explanation of the command

curl downloads the file and prints it on standard output in the terminal. With no options it issues an HTTP GET for the given URL.

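tail +4 prints the file starting at line 4, so it discards the first three lines (the header). The portable spelling of the same thing is tail -n +4.

tr -cs "[:digit:]" "," replaces every run of non-digit characters with a single comma (-c selects the complement of the digits, -s squeezes repeats). The new text has a comma before the first number and a comma after the last one, and no line breaks or spaces: ,2,3,5,7...,7919,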

sed -e 's/^,/\[/' -e 's/,$/\]/' is perhaps a bit hard to read. It replaces the comma before the first digit with '[', and replaces the comma after the last digit with ']'.
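
To try the pipeline without downloading anything, here is a small sketch that feeds it a few lines shaped like the primes file. The sample text and the variable names are my own stand-ins, not the real file, and I use the tail -n +4 spelling so that it also runs on current Linux systems.

```shell
# Hypothetical excerpt shaped like the primes file: three header lines,
# a blank line, then rows of numbers.
sample='          The First 1,000 Primes
           (the 1,000th is 7919)
  For more information on primes see

      2      3      5      7     11     13
     17     19     23     29     31     37
'

# Same steps as the command above, reading the sample instead of curl output:
#   tail -n +4  start at line 4, dropping the three header lines
#   tr -cs      turn every run of non-digits into a single comma
#   sed         swap the leading and trailing comma for [ and ]
json=$(printf '%s' "$sample" \
  | tail -n +4 \
  | tr -cs '[:digit:]' ',' \
  | sed -e 's/^,/[/' -e 's/,$/]/')

echo "$json"
```

The header lines have to go before tr runs, because "1,000" and "7919" in the header contain digits and would otherwise end up in the array.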

Who gains from blocked content on YouTube?

When I want to hear a particular rap song from 1992 on YouTube, the video service shows me this:

Yeah yeah, copyright or whatever, but what is the point? Who gains what exactly? By the way, if you can hear the song in the country you’re in, then fuck you 🙂

Who’s involved, let’s see. Me (the user/customer), Google (the owner of YouTube), the EU (makers of regional copyright laws), Sony (the copyright holder), and CMW (the artist).

Does Google the owner of YouTube win?

No. Google loses straight away, because I can hear the song on GrooveShark just fine (albeit without the video):

Does the EU win?

No. The EU might gain a little bit, because CMW is an American band, so chances are that I’ll listen to an EU artist like Dizzee Rascal instead:

But that’s not going to happen, because I wanted to listen to CMW, and I’ve already found the song on another service, GrooveShark.

Does Sony Music Entertainment win?

No. I already bought the song on iTunes a couple of days ago. If I hadn’t bought it, I would have downloaded it with a torrent. Whether or not the YouTube video was there did not factor into my decision to buy the song. I bought the song because it was insanely easy to do on my iPhone. Period. In fact, I might choose not to buy a song in the future if it’s owned by SME.

Does the artist win?

Hardly; in fact they lose. I’m sure they appreciate that I bought the song, though I’m sure Sony Music Entertainment appreciates it a hell of a lot more, if I know anything about royalty splits! And the song being blocked on YouTube did not make me buy the song, as I’ve already said. I was about to make CMW more famous by linking their video on my blog, but couldn’t. Sorry CMW.

Do concerned mothers win?

Does the fictional organization of “concerned mothers against gangster rap” gain anything by a blocked gangster rap tune on the internet? Sure, but that is mere coincidence, it could just as well have been a song about flowers or teddy bears or a praise for “concerned mothers against gangster rap”.

By the way, you may check out the song CMW sampled on “N 2 Deep”. It’s by Lyn Collins, and features the distinct sound of the JB’s. Apparently the copyright holder (Polydor) is not insane:


I find that this blog post and video on innovation from Edinburgh by Ed Parsons is somehow related to this issue.

By the way, Ed, if you watch the pingback: sorry that I stole your look for WordPress. I kinda liked it, and I do listen to gangster rap occasionally, so my morals are questionable.

Using shp2geocouch to push OSM data into geocouch

Today I installed the utility shp2geocouch on Mac OS X 10.6.

First I needed to update RubyGems…

sudo gem update --system

Then I could install shp2geocouch

sudo gem install shp2geocouch

Next I downloaded OSM data for Copenhagen, Denmark

cd copenhagen.shapefiles

Finally I used shp2geocouch to upload one of the shapefiles to a database:

shp2geocouch europe_northern_europe_denmark_copenhagen_highway.shp

This takes a while on my MacBook Pro: after about 10 minutes, 16000 documents had been loaded. The final count was 33306 documents.

As a final touch, the script replicates geocouch-utils + map browser and tells me

view your data on a map at

The map uses OSM tiles from cloudmade as background, and fetches clickable road data from iriscouch using XHR:

Clicking the link gives you this:

Installing spatial databases on EC2

The spatial databases covered are PostGIS, MySQL Spatial, MongoDB and Apache Cassandra.

UPDATE: I’ll change this post or create a page to give the actual Linux commands to run on the remote server.

PostGIS on EC2

I have found a nice tutorial that describes setting up Postgres on EC2 on an Ubuntu instance with all the trimmings. The blogger (Ron Evans) explains how he does things, including choice of filesystem on EBS, setting up security groups and general architectural decisions. It is quite detailed, so you might even learn some Linux admin tips from reading it.

I’m using the Amazon Linux AMI for now, and most of what is described should apply to that image as well. I noticed that he installs Postgres with the package manager (apt-get), whereas Amazon Linux AMIs come with yum.

There is a different tutorial that describes using yum instead of apt-get to install Postgres. As a side note, that writer also seems to prefer the EXT3 filesystem over the XFS filesystem.

There is also a tutorial for installing Postgres 9.0 with yum that includes installation of PostGIS, which is probably the one I’ll end up following. There is a separate description for Postgres 8.4.

I recommend following this tutorial up to the point of installing Postgres, and then switching to this tutorial.

MySQL with Spatial Extension on EC2

The procedure for installing MySQL on EC2 is described on the MySQL website. The examples given include one using yum, so that is as easy as it gets.

It should be noted that there are community images on EC2 which come preinstalled with MySQL.

ec2-describe-images -a | grep -i mysql

The MySQL website also has a very good section for setting up replication for MySQL on EC2 and related subjects.

One aspect that is mentioned is about scalability, and that it is “easier to create more EC2 instances to support more users than to upgrade the instance to a larger machine”. Good point I think, and there are more, so I recommend reading that page and many of the hints also apply directly to running Postgres and MongoDB on EC2.

Another tutorial by Sam Starling describes setting up MySQL on an Amazon Linux AMI instance, which is the image that I’m using.

Spatial extensions are included in MySQL from version 4.1 and up.

MongoDB on EC2

UPDATE: All posts I’ve come across on MongoDB and spatial data seem to mention some kind of problem. Either query times are long or there is inaccuracy. Perhaps I should take a look at Apache Cassandra for spatial data instead.

There is a tutorial for installing MongoDB on an Amazon Linux AMI 64 bit instance using yum, which is exactly what I have.

The MongoDB homepage also has a section specifically for installing MongoDB on EC2. Either way it seems easy enough.

The spatial capabilities of MongoDB are described on the MongoDB homepage, and also here.

I’ve come across criticism of MongoDB for spatial purposes. I’ll look at MongoDB and form my own opinion, but keep this poster in mind if I run into problems. I’d like to understand the algorithms and data structures used in MongoDB before forming a final opinion.

Apache Cassandra on EC2

A colleague at the university sent me a link describing using Apache Cassandra for spatial data. An overview of Apache Cassandra articles can be found on the Cassandra website.

It seems that Cassandra cannot be installed via a package manager. Installation instructions are given as a quick guide. It requires Java 1.6 update 19 or later, and Amazon Linux AMIs come with Java 1.6 update 20 at present.

tar -zxf apache-cassandra-0.7.6-2-bin.tar.gz
cd apache-cassandra-0.7.6-2
less README.txt

General tips

When running database instances on EC2, use EBS (Elastic Block Storage) to store the data. That way the data is persisted even when the database instance crashes and burns.

Create separate security groups for different tiers like database, web and others.

Do what this page describes with regard to replication etc.

Oh, and running applications with high demands for availability should perhaps be spread out over multiple EC2 regions.

Opening and closing ports on EC2 instances

Assuming that the EC2 tools have been installed as described in a previous post, opening and closing ports is done with the ec2-authorize and ec2-revoke commands respectively. These commands work on security groups rather than on instances. Recall that a set of instances belongs to a security group.

Opening port 80 on EC2 instances in the ‘default’ security group.

ec2-authorize default -p 80

Close port 80 on EC2 instances in the ‘default’ security group

ec2-revoke default -p 80

See also the Amazon command reference for the EC2 API.

Hints for managing Amazon Linux on EC2

I’m using Mac OS X and running instances in the EU West Region. My instances run the Amazon Linux AMI.

Installing the EC2 command line tools

Having command-line tools installed is a supplement to the AWS management console found online. I found a good tutorial about how to get started with the tools for EC2 on Mac OS X.

After downloading the tools from the Amazon download site, the tutorial describes how to set environment variables, how to create X.509 certificates, and so on.

The only detail missing was that I’m running my instances in the EU West region. I found a hint in another tutorial on setting an additional environment variable. My resulting .profile file looks like this:

# Setup Amazon EC2 Command-Line Tools
export EC2_HOME=~/.ec2
export PATH=$PATH:$EC2_HOME/bin
export EC2_PRIVATE_KEY=`ls $EC2_HOME/pk-*.pem`
export EC2_CERT=`ls $EC2_HOME/cert-*.pem`
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home/
# This line is from second tutorial, for use with EU West Region:
export EC2_URL=
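
One caveat with the `ls` backticks above: they quietly assume that exactly one pk-*.pem and one cert-*.pem file exist in ~/.ec2. The sketch below is a hypothetical helper (single_match is my own name, not part of the EC2 tools) that fails loudly otherwise. It is demonstrated in a scratch directory with made-up file names, so it does not touch a real ~/.ec2.

```shell
# Hypothetical helper: print the one file matched by a glob, or fail.
# The calling shell expands the glob, so "$@" holds the matches.
single_match() {
  if [ "$#" -ne 1 ] || [ ! -e "$1" ]; then
    echo "expected exactly one matching file" >&2
    return 1
  fi
  printf '%s\n' "$1"
}

# Scratch directory standing in for ~/.ec2 (file names are made up).
demo_home=$(mktemp -d)
touch "$demo_home/pk-EXAMPLE.pem" "$demo_home/cert-EXAMPLE.pem"

demo_key=$(single_match "$demo_home"/pk-*.pem)
demo_cert=$(single_match "$demo_home"/cert-*.pem)
echo "$demo_key"
echo "$demo_cert"
```

In a .profile this could be used as export EC2_PRIVATE_KEY=$(single_match "$EC2_HOME"/pk-*.pem), but I have only tried the idea on the scratch directory above.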

The first tutorial shows many examples of using the command-line tools to start instances, open ports etc.

Package manager for Amazon Linux AMI

Maybe the tools can be used to install packages on the Amazon Linux AMI instance, but you could also use a package manager.

Amazon Linux AMI comes with the yum package manager installed. A tutorial which is specifically aimed at installing PHP on an Amazon Linux AMI instance also gives a quick tour of yum. Basically you do this:

$ sudo yum install <PACKAGE_NAME>

Installing Apache Web Server

An example of using the EC2 tools and the yum package manager together is installing the Apache Web Server. The command ec2-describe-instances lists running instances in the region given in the environment variable EC2_URL.

$ ec2-describe-instances
RESERVATION	r-xxxxxxxx	xxxxxxxxxxxxx	default
INSTANCE	i-xxxxxxxx	ami-xxxxxxx

default is the name of the security group for the instance. You may have used a different security group name. Security groups are used to make it easier to apply a set of permissions to a range of instances. The command ec2-authorize applies a permission to a security group, like opening up port 80 for httpd.

# open up port 80 on instances belonging to security group 'default'
$ ec2-authorize default -p 80
PERMISSION  default  ALLOWS  tcp  80 80  FROM  CIDR
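
The column layout of the ec2-describe-instances output shown earlier makes it easy to pick out fields in a script. Here is a minimal sketch that pulls the instance ids out with awk. It runs on the placeholder output from this post rather than on a live call, and the variable names are my own.

```shell
# Placeholder output copied from the listing in this post (not real ids).
sample_output=$(printf 'RESERVATION\tr-xxxxxxxx\txxxxxxxxxxxxx\tdefault\nINSTANCE\ti-xxxxxxxx\tami-xxxxxxx\n')

# Print the second field of every INSTANCE line, which is the instance id.
# awk splits on any whitespace by default, so tabs or spaces both work.
instance_ids=$(printf '%s\n' "$sample_output" | awk '$1 == "INSTANCE" { print $2 }')

echo "$instance_ids"
```

With the real tools you would pipe ec2-describe-instances into the same awk command instead of using the sample text.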

Log into the instance with ssh, then use the package manager to install httpd.

# use the key pair that you used when launching your instance
$ ssh -i ~/.ec2/ec2-keypair
# install httpd - starts an install process
$ sudo yum install httpd

Good Indian computer science videos on YouTube

While browsing the web for good videos to help me land a cool job at a high-profile tech firm, I came across this series from an Indian university.

Lecture – 16 Disk Based Data Structures

You should be able to easily find the other videos in the series through this one. Generally the subjects that are covered relate to data structures and algorithms:

  • Trees (Red-Black, B, AVL)
  • Hashing
  • Heaps
  • Sorting

The videos are very practical and relate the data structures to scenarios where they would be used, like for bank transactions etc.

Image search by sketching – continued

It’s a simple question

Can you search for images by sketching a similar image?

I went looking online for a search engine that had implemented this feature, which I’ll call image-search-by-sketching.

Update: Since I wrote this piece, GazoPa no longer exists. In the meantime Google has implemented image-search-by-image. You can’t sketch, but you can use an existing image.

Google’s implementation of image-search-by-image did both a good and a bad job when I last tried it (December 2011). When I tried it with my test image (the dog shape below), I got this blog post, which is good. But the related images were way off: the number one related image was a picture of a shoe.

I can see the similarity to my dog shape in the results that Google suggested, but I didn’t get a dog. No doubt it is a hard problem, and what I wish for is highly semantic, in the sense that I want the search engine to recognize that I’m looking for a dog. In my test below, GazoPa could have gotten it right for a number of reasons. Maybe they simply had far fewer items in their database to match the dog against, and the best match happened to be… a dog? I guess I’ll never know. R.I.P. GazoPa.

And so I went looking for such a search engine…

The first thing I did was ask this question on Stack Overflow, and I got a reply which pointed me to a couple of cool websites.

These are all cool websites, but at first not exactly what I was looking for. After trying GazoPa I realized that the website is almost exactly what I was looking for (a service that allows you to sketch-up an image query).

Trying GazoPa

GazoPa allows you (among other things) to upload an image, and performs a search for similar images. I’m not quite sure which images are in its index, but I proceeded with the following experiment. I drew up a rather crude dog in Dia, and uploaded this image to GazoPa. Here is the dog:

It actually gave some pretty decent results, with this one being the first hit:

It is not hard to imagine a site that combines the sketching I did in Dia with the GazoPa service.

Update: Unfortunately GazoPa no longer exists. I guess you could combine Google image search with a drawing program, but it would be more fun to do it with an indie search engine.

Image search by sketching in 2007

This is a post in my technology archaeology series.

What is search by sketching?

The idea is to search for images by drawing a sketch that roughly resembles what you are looking for. The sketch is your query. This idea was mentioned in 2007, in 2010 and sometime in the late 90’s (according to my friend Rasmus).

The idea is not new. A friend told me about an art search engine (I forget the name) where you could search for works of art by splashing colors on a crude web canvas, e.g. drawing some purple in the top, some yellow in the corner, and voila: “Is this the painting you were looking for?”

That is, based on your quick sketch, the algorithm finds matches in an art image database.

Applications of the technology

Here are some ideas for applications off the top of my head

  • Search for vector data in a spatial datasource. The user draws a sketch on top of a map (to get the scale correct), and relevant vectors are returned. My colleague and I talked about how Denmark looks like the word Foo.

    So we naturally thought about something geographical that looks like the word Bar. This could be a chain of islands or a series of lakes. In essence you’d draw the word “Bar” and ask for vector data that looks similar.

Online mentions of search by sketching

There is a blogpost that also talks about the idea and mentions concrete technology:

This guy has something that looks like a product and even a youtube video

Also Microsoft in Asia apparently has been working on this

But where is it? Why doesn’t Google support this on their image search?

I’ve asked on Stack Overflow.