Using shp2geocouch to push OSM data into geocouch

Today I installed the utility shp2geocouch on Mac OS X 10.6.

First I needed to update RubyGems…

sudo gem update --system

Then I could install shp2geocouch

sudo gem install shp2geocouch

Next I downloaded OSM data for Copenhagen, Denmark

wget http://downloads.cloudmade.com/europe/northern_europe/denmark/copenhagen/copenhagen.shapefiles.zip
unzip copenhagen.shapefiles.zip
cd copenhagen.shapefiles

Finally I used shp2geocouch to upload one of the shape files to iriscouch.com (database gd.iriscouch.com/cphosm).

shp2geocouch europe_northern_europe_denmark_copenhagen_highway.shp gd.iriscouch.com/cphosm

This takes a while. After about 10 minutes on my MacBook Pro, 16000 documents had been loaded into iriscouch.com. The final count was 33306 documents.
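As a back-of-the-envelope check of those numbers, here is a small sketch that estimates the total upload time. It assumes the upload rate stays constant, which may well not hold for a hosted CouchDB; the function name is made up for this example.

```python
def estimate_total_minutes(docs_loaded, minutes_elapsed, total_docs):
    """Estimate total upload time, assuming a constant upload rate."""
    docs_per_minute = docs_loaded / minutes_elapsed
    return total_docs / docs_per_minute


# Figures from the upload above: 16000 documents after ~10 minutes, 33306 in total
estimate = estimate_total_minutes(16000, 10, 33306)
print(f"estimated total: {estimate:.1f} minutes")
```

So the whole job should take a bit over 20 minutes if the rate holds.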

As a final touch, the script replicates geocouch-utils + map browser and tells me

view your data on a map at http://gd.iriscouch.com/cphosm/_design/geo/_rewrite

Clicking the link gives you a map. It uses OSM tiles from CloudMade as background, and fetches clickable road data from iriscouch using XHR.
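To give an idea of what such a request for road data could look like, here is a minimal sketch that builds a GeoCouch bounding-box query URL. The `_spatial` path and the `bbox` parameter follow the general GeoCouch URL pattern, but the design document name (`geo`) and the index name (`full`) are assumptions on my part; I have not checked what the map browser actually requests.

```python
from urllib.parse import urlencode


def spatial_query_url(db_url, bbox, design_doc="geo", index="full"):
    """Build a GeoCouch bounding-box query URL.

    bbox is (west, south, east, north) in degrees.
    design_doc and index are assumed names, not taken from the map browser.
    """
    west, south, east, north = bbox
    # keep the commas unescaped so the URL stays readable
    query = urlencode({"bbox": f"{west},{south},{east},{north}"}, safe=",")
    return f"{db_url}/_design/{design_doc}/_spatial/{index}?{query}"


# a rough box around central Copenhagen (coordinates chosen for illustration)
url = spatial_query_url("http://gd.iriscouch.com/cphosm", (12.5, 55.6, 12.6, 55.7))
print(url)
```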

Installing spatial databases on EC2

The spatial databases covered are PostGIS, MySQL spatial, MongoDB and Apache Cassandra.

UPDATE: I’ll change this post or create a page to give the actual linux commands to run on the remote server.

PostGIS on EC2

I have found a nice tutorial that describes setting up Postgres on EC2 on an Ubuntu instance with all the trimmings. The blogger (Ron Evans) explains how he does things, including choice of filesystem on EBS, setting up security groups and general architectural decisions. It is quite detailed so you might even learn some linux admin tips from reading it.

I’m using the Amazon Linux AMI for now, and most of what is described should apply to that image as well. I noticed that he installs Postgres with the apt-get package manager, whereas Amazon Linux AMIs come with yum.

There is a different tutorial that describes using yum instead of apt-get to install postgres. As a sidenote that writer also seems to prefer the EXT3 filesystem over the XFS filesystem.

There is also a tutorial for installing Postgres 9.0 with yum that includes installation of PostGIS, which is probably the one I’ll end up following. There is a separate description for Postgres 8.4.

I recommend following this tutorial up to the point of installing Postgres, and then switching to this tutorial.

MySQL with Spatial Extension on EC2

The procedure for installing MySQL on EC2 is described on the MySQL website. The examples given include one using yum, so that is as easy as it gets.

It should be noted that there are community images on EC2 which come preinstalled with MySQL.

ec2-describe-images -a | grep -i mysql

The MySQL website also has a very good section for setting up replication for MySQL on EC2 and related subjects.

One aspect that is mentioned is scalability: it is “easier to create more EC2 instances to support more users than to upgrade the instance to a larger machine”. Good point, I think, and there are more. I recommend reading that page, since many of the hints also apply directly to running Postgres and MongoDB on EC2.

Another tutorial by Sam Starling describes setting up MySQL on an Amazon Linux AMI instance, which is the image that I’m using.

Spatial extensions are included in MySQL from version 4.1 and up.

MongoDB on EC2

UPDATE: All posts I’ve come across on MongoDB and spatial data seem to mention some kind of problem. Either query times are long or there is inaccuracy. Perhaps I should take a look at Apache Cassandra for spatial data instead.

There is a tutorial for installing MongoDB on an Amazon Linux AMI 64 bit instance using yum, which is exactly what I have.

The MongoDB homepage also has a section specifically for installing MongoDB on EC2. Either way it seems easy enough.

The spatial capabilities of MongoDB are described on the MongoDB homepage, and also here.

I’ve come across criticism of MongoDB for spatial purposes. I’ll look at MongoDB and form my own opinion, but keep this criticism in mind if I run into problems. I’d like to understand the algorithms and data structures used in MongoDB before forming a final opinion.

Apache Cassandra on EC2

A colleague at the university sent me a link describing using Apache Cassandra for spatial data. An overview of Apache Cassandra articles can be found on the Cassandra website.

It seems that Cassandra cannot be installed via a package manager. Installation instructions are given as a quick guide. It requires Java 1.6 update 19 or later, and Amazon Linux AMIs come with Java 1.6 update 20 at present.

wget http://apache.mirrors.webname.dk//cassandra/0.7.6/apache-cassandra-0.7.6-2-bin.tar.gz
tar -zxf apache-cassandra-0.7.6-2-bin.tar.gz
cd apache-cassandra-0.7.6-2
less README.txt

General tips

When running database instances on EC2, use EBS (Elastic Block Store) to store the data. That way the data is persisted even when the database instance crashes and burns.

Create separate security groups for different tiers like database, web and others.

Do what this page describes with regards to replication etc.

Oh, and applications with high demands for availability should perhaps be spread out over multiple EC2 regions.

Opening and closing ports on EC2 instances

Assuming that the EC2 tools have been installed as described in a previous post, opening and closing ports is done with the ec2-authorize and ec2-revoke commands respectively. These commands work on security groups rather than on individual instances. Recall that every instance belongs to a security group, and a group can contain many instances.

Open port 80 on EC2 instances in the ‘default’ security group:

ec2-authorize default -p 80

Close port 80 on EC2 instances in the ‘default’ security group:

ec2-revoke default -p 80
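If you open and close ports often, a tiny wrapper can save some typing. The sketch below only builds the command line for the two commands shown above (group name followed by `-p` and a port); the helper name `port_command` is my own invention, and it deliberately does not run anything.

```python
import shlex


def port_command(action, group, port):
    """Build the argv for opening or closing a port on a security group."""
    tools = {"open": "ec2-authorize", "close": "ec2-revoke"}
    if action not in tools:
        raise ValueError(f"unknown action: {action!r}")
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return [tools[action], group, "-p", str(port)]


# print the commands; pass the lists to subprocess.run() to actually execute them
print(shlex.join(port_command("open", "default", 80)))
print(shlex.join(port_command("close", "default", 80)))
```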

See also the Amazon command reference for the EC2 API.

Hints for managing Amazon Linux on EC2

I’m using Mac OS X and running instances in the EU West Region. My instances run the Amazon Linux AMI.

Installing the EC2 command line tools

Having command-line tools installed is a supplement to the AWS management console found online. I found a good tutorial about how to get started with the tools for EC2 on Mac OS X.

After downloading the tools from the Amazon download site, the tutorial describes how to set environment variables and how to create X.509 certificates etc.

The only detail missing was that I’m running my instances in the EU West region. I found a hint in another tutorial on setting an additional environment variable. My resulting .profile file looks like this:

# Setup Amazon EC2 Command-Line Tools
export EC2_HOME=~/.ec2
export PATH=$PATH:$EC2_HOME/bin
export EC2_PRIVATE_KEY=`ls $EC2_HOME/pk-*.pem`
export EC2_CERT=`ls $EC2_HOME/cert-*.pem`
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home/
# This line is from second tutorial, for use with EU West Region:
export EC2_URL=https://eu-west-1.ec2.amazonaws.com
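A quick way to check that a shell picked up these settings is to look for the variables from Python. This is just a sketch; the list of names is copied from my .profile above, and the function name is made up.

```python
import os

# the variables set in the .profile above
PROFILE_VARS = ("EC2_HOME", "EC2_PRIVATE_KEY", "EC2_CERT", "JAVA_HOME", "EC2_URL")


def missing_ec2_vars(env=None):
    """Return the names of the variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in PROFILE_VARS if not env.get(name)]


missing = missing_ec2_vars()
print("missing:", ", ".join(missing) if missing else "none")
```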

The first tutorial shows many examples of using the command-line tools to start instances, open ports etc.

Package manager for Amazon Linux AMI

Maybe the tools can be used to install packages on the Amazon Linux AMI instance, but you could also use a package manager.

Amazon Linux AMI comes with the yum package manager installed. A tutorial which is specifically aimed at installing PHP on an Amazon Linux AMI instance also gives a quick tour of yum. Basically, you do this:

$ sudo yum install <PACKAGE_NAME>

Installing Apache Web Server

As an example of using the EC2 tools and the yum package manager, let’s install the Apache Web Server. The command ec2-describe-instances lists running instances in the region given in the environment variable EC2_URL.

$ ec2-describe-instances
RESERVATION	r-xxxxxxxx	xxxxxxxxxxxxx	default
INSTANCE	i-xxxxxxxx	ami-xxxxxxx	ec2-xx-xxx-xx-xx.eu-west-1.compute.amazonaws.com

default is the name of the security group for the instance. You may have used a different security group name. Security groups are used to make it easier to apply a set of permissions to a range of instances. The command ec2-authorize applies a permission to a security group, like opening up port 80 for httpd.

# open up port 80 on instances belonging to security group 'default'
$ ec2-authorize default -p 80
PERMISSION  default  ALLOWS  tcp  80 80  FROM  CIDR  0.0.0.0/0
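The ec2-describe-instances listing shown earlier is tab-separated, so it is easy to pick out the hostname and security group with a script. The sketch below assumes the first four columns are laid out as in my excerpt; the real output has more columns, which are simply ignored here.

```python
def parse_describe_instances(output):
    """Parse RESERVATION/INSTANCE lines into a list of dicts.

    Assumes tab-separated columns in the order shown in the excerpt above.
    """
    instances = []
    group = None
    for line in output.splitlines():
        fields = line.split("\t")
        if fields[0] == "RESERVATION" and len(fields) >= 4:
            # the security group applies to the INSTANCE lines that follow
            group = fields[3]
        elif fields[0] == "INSTANCE" and len(fields) >= 4:
            instances.append(
                {"id": fields[1], "ami": fields[2], "host": fields[3], "group": group}
            )
    return instances


sample = (
    "RESERVATION\tr-xxxxxxxx\txxxxxxxxxxxxx\tdefault\n"
    "INSTANCE\ti-xxxxxxxx\tami-xxxxxxx\tec2-xx-xxx-xx-xx.eu-west-1.compute.amazonaws.com\n"
)
for instance in parse_describe_instances(sample):
    print(instance["host"], instance["group"])
```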

Log into the instance with ssh and then use the package manager to install httpd.

# use the key pair that you used when launching your instance
$ ssh -i ~/.ec2/ec2-keypair ec2-user@ec2-xx-xxx-xx-xx.eu-west-1.compute.amazonaws.com
# install httpd - starts an install process
$ sudo yum install httpd

Good Indian computer science videos on YouTube

While browsing the web for good videos to help me land a cool job at a high-profile tech firm, I came across this series from an Indian university.

Lecture – 16 Disk Based Data Structures

http://www.youtube.com/watch?v=VbVroFR4mq4

You should be able to easily find the other videos in the series through this one. Generally the subjects that are covered relate to data structures and algorithms:

  • Trees (Red-Black, B, AVL)
  • Hashing
  • Heaps
  • Sorting

The videos are very practical and relate the data structures to scenarios where they would be used, like for bank transactions etc.

Image search by sketching – continued

It’s a simple question

Can you search for images by sketching a similar image?

I went looking online for a search engine that had implemented this feature, which I’ll call image-search-by-sketching.

Update: Since I wrote this piece, GazoPa no longer exists. In the meantime Google has implemented image-search-by-image. You can’t sketch, but you can use an existing image.

Google’s implementation of image-search-by-image did both a good and a bad job when I last tried it (December 2011). When I tried with my test image (dog-shape below), I got this blog post, which is good. But the related images are way off: the number one related image is a picture of a shoe?

I can see the similarity to my dog-shape in the results that Google suggested, but I didn’t get a dog. No doubt it is a hard problem, and what I wish for is highly semantic, in the sense that I want the search engine to recognize that I’m looking for a dog. In my test below, GazoPa could have gotten it right for a number of reasons. Maybe they simply had many fewer items in their database to match the dog against, and the best match happened to be… a dog? I guess I’ll never know. R.I.P. GazoPa.

And so I went looking for such a search engine…

The first thing I did was ask this question on Stack Overflow, and I got a reply which pointed me to a couple of cool websites.

These are all cool websites, but at first not exactly what I was looking for. After trying GazoPa I realized that the website is almost exactly what I was looking for (a service that allows you to sketch-up an image query).

Trying GazoPa

GazoPa allows you (among other things) to upload an image, and performs a search for similar images. I’m not quite sure which images are in its index, but I proceeded with the following experiment. I drew up a rather crude dog in Dia, and uploaded this image to GazoPa. Here is the dog:

It actually gave some pretty decent results, with this one being the first hit:

It is not hard to imagine a site that combines the sketching I did in Dia with the GazoPa service.

Update: Unfortunately GazoPa no longer exists. I guess you could combine Google image search with a drawing program, but it would be more fun to do it with an indie search engine.

Image search by sketching in 2007

This is a post in my technology archaeology series.

What is search by sketching?

The idea is to search for images by drawing a sketch that roughly resembles what you are looking for. The sketch is your query. This idea was mentioned in 2007, in 2010 and sometime in the late 90’s (according to my friend Rasmus).

The idea is not new. A friend told me about an art search engine (I forget the name) where you could search for works of art by splashing colors on a crude web canvas, e.g. drawing some purple in the top, some yellow in the corner, and voila: “Is this the painting you were looking for?”

That is, based on your quick sketch, the algorithm finds matches in an art image database.
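I don’t know how that art search engine worked, but the general idea can be illustrated with a toy example. The sketch below is purely my own illustration, not the engine’s algorithm: each image is reduced to four colors (top-left, top-right, bottom-left, bottom-right), and the best match is the image with the smallest squared color distance to the sketch. The painting names and colors are invented.

```python
PURPLE = (128, 0, 128)
YELLOW = (255, 255, 0)
GREEN = (0, 128, 0)
WHITE = (255, 255, 255)


def distance(sketch, image):
    """Sum of squared RGB differences, cell by cell."""
    return sum(
        (a - b) ** 2
        for sketch_cell, image_cell in zip(sketch, image)
        for a, b in zip(sketch_cell, image_cell)
    )


def best_match(sketch, database):
    """Return the name of the image closest to the sketch."""
    return min(database, key=lambda name: distance(sketch, database[name]))


# hypothetical image database: name -> four coarse color cells
database = {
    "painting_a": [PURPLE, PURPLE, WHITE, YELLOW],
    "painting_b": [GREEN, GREEN, GREEN, WHITE],
}

# "some purple in the top, some yellow in the corner"
sketch = [(120, 10, 130), (120, 10, 130), WHITE, (250, 240, 20)]
print(best_match(sketch, database))
```

A real system would need many more cells and a smarter distance measure, but the query-by-sketch principle is the same.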

Applications of the technology

Here are some ideas for applications off the top of my head:

  • Search for vector data in a spatial datasource. The user draws a sketch on top of a map (to get the scale correct), and relevant vectors are returned. My colleague and I talked about how Denmark looks like the word Foo.

    So we naturally thought about something geographical that looks like the word Bar. This could be a chain of islands or a series of lakes. In essence you’d draw the word “Bar” and ask for vector data that looks similar.

Online mentions of search by sketching

There is a blogpost that also talks about the idea and mentions concrete technology:

This guy has something that looks like a product and even a youtube video

Also Microsoft in Asia apparently has been working on this

But where is it? Why doesn’t Google support this on their image search?

I’ve asked on Stack Overflow:

http://stackoverflow.com/questions/5458174/image-search-by-sketching-who-has-implemented

BitTorrent for geodata was big in 2005

Big in 2005…

Today I’m trying to find out whether BitTorrent + geodata is a “thing”. I have found out that it WAS a thing… in 2005! Just like Coldplay, Gorillaz, Eminem, 50 Cent, James Blunt, Green Day… but it never really took off.

  • In 2006 Chris Holmes had a blog post titled Distribution of Geodata, where he said stuff like «What is needed is a standard, so that clients tile up the earth in the same way and make the same requests.» and «instead of asking the server to return a set of tiles that represents an area, it could ask a p2p network»
  • In 2005 Ed Parsons had a blog post titled Peer to Peer Geodata anyone ?, where he said stuff like «The idea of distributing large geodata datasets as small chucks is quite appealing and I have no doubt that when open geodata becomes more mature – this will be the obvious mechanism of supply» and «peer to peer means piracy in many minds, an unfortunate perception».
  • He and others mention GeoTorrent.org, a site offering geographical datasets via bittorrent.
  • In 2008 people ask: What happened to geotorrent.org?
  • In 2011, I’m asking the same thing: What is going on with P2P and geodata? Either I’m hopelessly old school, or a good idea simply went missing without a trace…
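The point about clients tiling up the earth in the same way can be made concrete. The sketch below uses the common “slippy map” tile scheme known from OSM to turn a coordinate into tile numbers, and then into a key that peers could use to ask each other for the same tile. The key format (`layer/zoom/x/y`) is just something I made up for illustration; it is not from any of the posts mentioned.

```python
import math


def tile_for(lat, lon, zoom):
    """Return (x, y) tile numbers in the slippy map scheme."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y


def tile_key(lat, lon, zoom, layer="osm"):
    """Build a lookup key for a tile; the format is a made-up example."""
    x, y = tile_for(lat, lon, zoom)
    return f"{layer}/{zoom}/{x}/{y}"


# Copenhagen, roughly
print(tile_key(55.6761, 12.5683, 10))
```

If every client computes keys the same way, a tile fetched by one peer can be served to any other peer asking for the same area.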

Ok, so people were still talking about P2P+geodata in 2006, 2007 and 2008, but the fact is that it has not seen a wide breakthrough in 2011. Or am I missing something?

GeoTorrent.org no longer answers HTTP requests, but the domain is still registered. GeoTorrent.org was run by ERMapper, which was bought by Leica Geosystems, which merged with Erdas, according to some person in 2008. It was a site devoted to offering geodata via BitTorrent. Richard Orchard was one of the people behind GeoTorrent.org. Maybe he knows what happened to it?

Using the keywords “P2P” and “geodata” I went looking on scholar.google.com. I did not find that much, and nothing that has been generally adopted (see some of the hits in the Links section below).

What am I looking for in 2011?

What I’m looking for is something like a plugin for GeoServer, or a web-gis framework that fetches tiles via P2P, or something like GeoNode with a P2P twist. Actually GeoNode could be it… is GeoNode it?

Conclusion: Some pros and cons of P2P geodata

  • In 2009 a guy on a mailing list said: «Pure P2P solutions are great for exchanging large files, but typically have too much latency to be practical»
  • In 2010 some Chinese guys said: «P2P technology offered a novel solution to spatial information online service and a good platform for sharing mass spatial data, it can avoid “single point of failure” and “hot spots bottleneck” problem»
  • In 2007 some Austrians said: «As disaster management inherently happens in highly dynamic environments, these applications suffer from deficiencies with respect to maintaining connections to the server representing their sole source of information. We propose to exploit peer-to-peer networks to interconnect field workers.»
  • They also said: «P2P oriented raster geo-data online services have been widely applied, whereas vector geo-data online services still have many issues that can′t be handled, such as vector geo-data organization pattern, segmentation, lossless reconstruction etc»
  • In 2006 Chris Holmes said: «The damn brilliant thing about using an architecture of participation for geospatial data information is that as a layer gets more popular it scales perfectly, since more people downloading and checking out the layer means that more people are serving it up to others.»

If «P2P oriented raster geo-data online services have been widely applied», then where has it gone now? I’d like to find out…

Links

How translate.google.com works

Actually, this is not about how translate.google.com works. It’s about loading HTML from a random URL, adding some extra JavaScript and CSS, and redisplaying the page on a different domain.

A simple test page

I’ve made a simple test web page:

<html>
  <head></head>
  <body>
    <h1>Hello world</h1>
    <img src="world.jpg" />
  </body>
</html>

Let’s translate it with Google Translate.

The translation page has two frames (details omitted):

<html>
<head>
<title>Google Translate</title>
</head>
<frameset>
	<frame src="/translate_n?...">
	<frame src="/translate_p?...">
</frameset>
</html>

Let’s look at the second frame, the one which begins with ...src="/translate_p?.... This is the actual translated page.

Compare the source code of the original page with the translated page. It’s been buffed up considerably.

<html>
<head>
<script>
	(function() {
		function ti_a(b) {
			this.t = {};
			this.tick = function(c, d, a) {
				a = a ? a : (new Date).getTime();
				this.t[c] = [ a, d ]
			};
			this.tick("start", null, b)
		}
		var ti_b = new ti_a;
		window.jstiming = {
			Timer : ti_a,
			load : ti_b
		};
		try {
			var ti_ = null;
			if (window.chrome && window.chrome.csi)
				ti_ = Math.floor(window.chrome.csi().pageT);
			if (ti_ == null)
				if (window.gtbExternal)
					ti_ = window.gtbExternal.pageT();
			if (ti_ == null)
				if (window.external)
					ti_ = window.external.pageT;
			if (ti_)
				window.jstiming.pt = ti_
		} catch (ti_c) {
		}
		;
	})()
</script>
<script
	src="http://translate.googleusercontent.com/translate/static/biEfM_qFbxU/js/translate_c.js"></script>
<script>
	_infowindowVersion = 1;
	_intlStrings._originalText = "Original English text:";
	_intlStrings._interfaceDirection = "ltr";
	_intlStrings._interfaceAlign = "left";
	_intlStrings._langpair = "en|da";
	_intlStrings._feedbackUrl = "http://translate.google.com/translate_suggestion";
	_intlStrings._currentBy = "Current translation on %1$s by %2$s";
	_intlStrings._unknown = "unknown";
	_intlStrings._suggestTranslation = "Contribute a better translation";
	_intlStrings._submit = "Contribute";
	_intlStrings._suggestThanks = "Thank you for contributing your translation suggestion to Google Translate.";
	_intlStrings._reverse = false;
</script>
<style type="text/css">
.google-src-text {
	display: none !important
}
 
.google-src-active-text {
	display: block !important;
	color: black !important;
	font-size: 12px !important;
	font-family: arial, sans-serif !important
}
 
.google-src-active-text a {
	font-size: 12px !important
}
 
.google-src-active-text a:link {
	color: #00c !important;
	text-decoration: underline !important
}
 
.google-src-active-text a:visited {
	color: purple !important;
	text-decoration: underline !important
}
 
.google-src-active-text a:active {
	color: red !important;
	text-decoration: underline !important
}
</style>
<meta http-equiv="X-Translated-By" content="Google">
<base href=http://skipperkongen.dk/tmp/test.html />
</head>
<body>
<iframe
	src="http://translate.google.com/translate_un?hl=en&ie=UTF-8&sl=en&tl=da&u=http://skipperkongen.dk/tmp/test.html&prev=_t&rurl=translate.google.com&twu=1&lang=en&usg=ALkJrhhpCjCAYEWwbQX9TROT-522jGdGEw"
	width=0 height=0 frameborder=0
	style="width: 0px; height: 0px; border: 0px;"></iframe>
<h1><span onmouseover=
	_tipon(this);
onmouseout=
	_tipoff();
>
<span class="google-src-text" style="direction: ltr; text-align: left">Hello
world</span> Hej verden </span></h1>
<img src=world.jpg />
</body>
<script>
	_addload(function() {
		_setupIW();
		_csi('en', 'da', 'http://skipperkongen.dk/tmp/test.html');
	});
</script>
</html>

Leaving only the really important stuff:

<head>
	<script src="http://translate.googleusercontent.com/translate/static/biEfM_qFbxU/js/translate_c.js"></script>
	<base href=http://skipperkongen.dk/tmp/test.html />
</head>
<body>
	<h1>Hej verden</h1>
	<img src=world.jpg />
</body>

Two notable things have changed from the original. The head section has extra script tags and a base tag. In the body section, the phrase Hello world has been translated into Danish.

Summary of modifications to original page

In summary, Google Translate has done the following to the original page:

  1. Added script tags
  2. Added a style tag
  3. Added a base tag
  4. Added an iframe tag
  5. Replaced content text with translated version
  6. Marked up content with some span tags, for a fancy tooltip
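Two of the modifications in the list, adding a base tag and replacing text with a marked-up translation, can be imitated with a few lines of string handling. This is only a naive sketch of the idea and certainly not how Google does it: it assumes a literal `<head>` tag, does plain string replacement rather than real HTML parsing, and the function name and the translation table are made up.

```python
def rewrite_page(html, page_url, translations):
    """Add a base tag and swap in translated text, keeping the source text hidden."""
    base_tag = f'<base href="{page_url}" />'
    # naive: only handles a literal "<head>" without attributes
    html = html.replace("<head>", "<head>" + base_tag, 1)
    for source, target in translations.items():
        wrapped = (
            f'<span><span class="google-src-text">{source}</span> {target}</span>'
        )
        html = html.replace(source, wrapped)
    return html


original = (
    "<html><head></head><body>"
    '<h1>Hello world</h1><img src="world.jpg" />'
    "</body></html>"
)
translated = rewrite_page(
    original,
    "http://skipperkongen.dk/tmp/test.html",
    {"Hello world": "Hej verden"},
)
print(translated)
```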

What do the script tags do?

This is the most complex part. I’m not done analysing this yet.

What does the style tag do?

This is simply to provide some styling of the fancy tooltip added with the span tags.

What does the iframe tag do?

In short I don’t know yet.

What does the base tag do?

The base tag is there to make sure that relative paths like the image path work, even if the HTML is loaded from a different domain than the original skipperkongen.dk domain.

This:

	<base href=http://skipperkongen.dk/tmp/test.html />

Makes this work:

	<img src="world.jpg" />
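The resolution rule the browser applies can be reproduced with Python’s standard library, which is a handy way to check what a relative path ends up as for a given base URL:

```python
from urllib.parse import urljoin

# the base URL from the base tag, and the relative image path from the page
base = "http://skipperkongen.dk/tmp/test.html"
resolved = urljoin(base, "world.jpg")
print(resolved)  # http://skipperkongen.dk/tmp/world.jpg
```

The file name at the end of the base URL is dropped, and the relative path is resolved against the remaining directory.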

Would this work with AJAX?

Many pages use Ajax to load content. I’m expecting Google Translate to not work in this case, because of the browser’s same-origin policy: the translated page is served from a Google domain, so its scripts cannot make XHR requests back to the original domain. In theory it could be done by creating a dynamic service proxy on the Google domain, not taking authentication issues into account.

Let’s try with a page that replaces the header text with AJAX.

<html>
	<head>
	<script src="http://code.jquery.com/jquery-1.5.min.js" type="text/javascript"></script>
	<script type="text/javascript">
		$(function() {
			// fetch message.txt (a relative URL) and put its text in the header
			$.get('message.txt', function(data) {
				$('h1').text(data);
			});
		});
	</script>
	</head>
	<body>
		<h1>...</h1>
	</body>
</html>

When loading via http://skipperkongen.dk/tmp/test2.html, the page looks like this:

Hello World

When loading via Google Translate, the page looks like this:

...

So the conclusion is that Google does not do anything about data fetched via AJAX.
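My guess at the reason is the browser’s same-origin rule: a script may only make plain XHR requests to the origin (scheme, host and port) that the page itself was loaded from. The sketch below shows that comparison. The Google host name is the one the injected script is loaded from, and I am assuming the translated page is served from there too; the path `/translate_c` is made up.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that defines an origin."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def same_origin(page_url, request_url):
    return origin(page_url) == origin(request_url)


message = "http://skipperkongen.dk/tmp/message.txt"
# loaded directly: page and message share an origin
print(same_origin("http://skipperkongen.dk/tmp/test2.html", message))
# loaded via the translation frame (assumed host, hypothetical path)
print(same_origin("http://translate.googleusercontent.com/translate_c", message))
```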

Building osm2pgsql on Mac OS X using homebrew

General instructions are here: http://wiki.openstreetmap.org/wiki/Osm2pgsql#Mac_OS_X

Note: I’m running Snow Leopard (10.6.6)

1. Install homebrew

Check that you don’t have it already:

$ which brew

If you don’t have Homebrew, install it, e.g. like this:

$ ruby -e "$(curl -fsSLk https://gist.github.com/raw/323731/install_homebrew.rb)"

2. Install proj

$ brew install proj
$ which proj
/usr/local/bin/proj

3. Install geos

$ brew install geos

4. Install osm2pgsql

First add pg_config to the path, then install osm2pgsql:

$ PATH=$PATH:/Library/PostgreSQL/9.0/bin/
$ brew install osm2pgsql
$ which osm2pgsql
/usr/local/bin/osm2pgsql

You should now have osm2pgsql installed.

Import OSM data into PostgreSQL

I did the following to import OSM data into PostgreSQL.

# create a user for the osm database
createuser -U postgres osm-user
# create the osm database
createdb -U postgres -E utf8 -O osm-user osm
# download some OSM data from cloudmade.com, I chose Copenhagen, Denmark.
wget http://downloads.cloudmade.com/europe/northern_europe/denmark/copenhagen/copenhagen.osm.bz2
# decompress it
bzip2 -d copenhagen.osm.bz2
# install PostGIS on the database (this creates the spatial_ref_sys table)
psql -U postgres -d osm -f /Library/PostgreSQL/9.0/share/postgresql/contrib/postgis.sql
# install the mercator projection 900913 on the database
# (it is added to spatial_ref_sys, so run this after postgis.sql)
wget http://svn.openstreetmap.org/applications/utils/export/osm2pgsql/900913.sql
psql -U postgres -d osm -f 900913.sql
# find the style to use with osm2pgsql 
brew list osm2pgsql # list locations installed by homebrew, including location of the default style
 
# Ready to import! use -s if you chose a large OSM dataset, this keeps RAM usage down.
# Use location of style file found with brew list osm2pgsql
osm2pgsql -d osm -U postgres -S /usr/local/Cellar/osm2pgsql/HEAD/share/osm2pgsql/default.style copenhagen.osm

You should now have some OSM data in your PostgreSQL database.
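If you import different extracts regularly, the osm2pgsql call can be wrapped in a small helper. The sketch below only assembles the command line from the flags used above (`-d`, `-U`, `-S`, and `-s` for large datasets); the function name and its defaults are my own, and nothing is executed unless you pass the list to subprocess yourself.

```python
def osm2pgsql_command(osm_file, style, database="osm", user="postgres", slim=False):
    """Build the osm2pgsql argv; slim=True adds -s to keep RAM usage down."""
    cmd = ["osm2pgsql", "-d", database, "-U", user, "-S", style]
    if slim:
        cmd.append("-s")
    cmd.append(osm_file)
    return cmd


style_file = "/usr/local/Cellar/osm2pgsql/HEAD/share/osm2pgsql/default.style"
print(" ".join(osm2pgsql_command("copenhagen.osm", style_file)))
```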