Python script for geocoding a text file

Assume that you have a file with some locations as text with one location per line.

For example, here are some school names in Copenhagen, Denmark, stored in schools.csv:

Hyltebjerg Skole
Ellebjerg Skole
Katrinedals Skole
Peder Lykke Skolen
Amager Fælled Skole
Tingbjerg Heldagsskole
Øster Farimagsgades Skole
Sankt Annæ Gymnasiums Grundskole
Lykkebo Skole
Randersgades Skole
Grøndalsvængets Skole
Sølvgades Skole
Skolen ved Sundet
Hanssted Skole
Den Classenske Legatskole
Tove Ditlevsens Skole
Lergravsparkens Skole
Vigerslev Allés Skole
Bavnehøj Skole
Ålholm Skole
Guldberg Skole
Husum Skole
Nyboder Skole
Vanløse Skole
Kirkebjerg Skole
Christianshavns Skole
Bellahøj Skole
Korsager Skole
Nørrebro Park Skole
Utterslev Skole
Skolen på Islands Brygge
Brønshøj Skole
Kirsebærhavens Skole
Rødkilde Skole
Vesterbro Ny Skole
Blågård Skole
Sønderbro Skole
Højdevangens Skole
Oehlenschlægersgades Skole
Vibenshus Skole
Valby Skole
Rådmandsgades Skole
Tagensbo Skole

Here is a script,, that will attempt to geocode each location in an input stream. It prints CSV output to stdout with the fields input_line, input_line_no, result_no, place, latitude, longitude:

from geopy import geocoders
import sys
import time
import pdb
geocoder = geocoders.GoogleV3()
SEPARATOR='|'  # can also use tab. Comma is bad, since the place will most likely contain a comma.
dummy = ['', ['', '']]
i = 0
header = ['input_line', 'input_line_no', 'result_no', 'place', 'latitude', 'longitude']
for line in sys.stdin:
    line = line.strip()
    results = geocoder.geocode(line, exactly_one=False) or [dummy]
    for j, res in enumerate(results):
        place = res[0]
        lat = str(res[1][0])
        lon = str(res[1][1])
        out = SEPARATOR.join([line, str(i), str(j), place, lat, lon])
        print (out)
    i += 1

Here is how you might use the script:

cat schools.csv | python 

Tip: you might want to

How to work with spatial data in Amazon Redshift

While Redshift does not offer native support for spatial data, indexes and functions, there exists a partial workaround. Redshift supports Python UDFs and can also load custom Python libraries. Out of the box, Redshift has numpy, scipy, pandas and many other useful Python libraries. For spatial functionality, one saving grace is the high quality spatial libraries that exist for Python, such as shapely. Of course, the alternative is to simply implement useful spatial functions in Python directly, which we will do here. The drawback is that this does not provide the means for spatial indexes or native spatial types in Redshift. As long as you are working mainly with point data, this should not be a huge obstacle. While polygons and operations on them are useful in many cases, a properly utilized GeoHash can usually do the trick.

So, let’s get into it! Connect to your Redshift cluster using a client of your choosing. I prefer SQLWorkbench/J. Properly connected, attempt to create the following UDF in Python, which implements the haversine formula using NumPy (thanks to jterrace for the solution).

CREATE OR REPLACE FUNCTION haversine (lat1 float, lon1 float, lat2 float, lon2 float)
    from math import radians, sin, cos, asin, sqrt, pi, atan2
    import numpy as np
    earth_radius_miles = 3956.0
    def haversine(lat1, lon1, lat2, lon2):
        """Gives the distance between two points on earth.
        lat1, lon1 = radians(lat1), radians(lon1)
        lat2, lon2 = radians(lat2), radians(lon2)
        dlat, dlon = (lat2 - lat1, lon2 - lon1)
        a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
        great_circle_distance = 2 * asin(min(1,sqrt(a)))
        return earth_radius_miles * great_circle_distance
    return haversine(lat1, lon1, lat2, lon2)
$$ LANGUAGE plpythonu;

Now, let’s use our new UDF to calculate the great-circle distance between a pair of points.

SELECT haversine(37.160316546736745, -78.75, 39.095962936305476, -121.2890625)
-- 2293.1324218790523

One very big drawback is that it is incredibly slow (an understatement). The following query computes the function just 100 times, which on my cluster took over 17.21 seconds (jeez!):

SELECT COUNT(haversine(37.160316546736745, -78.75, 39.095962936305476, lon2 % 360 - 180)) FROM generate_series(1, 100) lon2

Because the speed is so slow, I will investigate another way to achieve this goal with Redshift. Expect updates to this post.

Geocoding Python function for PostgreSQL

Gratefully making use of what others have provided, i.e. geopy, Google and plpythonu.

Type to hold result of geocoding:

CREATE TYPE geocoding AS (
  place text,

Function that does the actual geocoding (to be extended with more vendors. Hint: look at geopy wiki). Takes an (arbitrary) input string to be geocoded:

  input text,
  vendor text DEFAULT 'google'
) RETURNS SETOF geocoding AS
  import time
  from geopy import geocoders
  # TODO: Add other available vendors, e.g. Yahoo.
  if vendor.lower() == 'google':
    geocoder = geocoders.GoogleV3()
    raise ValueError("Invalid geocoder: %s" % vendor)
    for res in geocoder.geocode(input, exactly_one=False):
      yield {'place': res[0], 'latitude': res[1][0], 'longitude': res[1][1]}


SELECT place, ST_SetSRID(ST_MakePoint(longitude, latitude), 4326)
FROM python_geocode('Kostas');

The new Google Maps: Explorers delight!

Being interested in maps in general, and specifically vector data, of course I had to take the new Google Maps for a spin. Vector data aside, what fascinated me most was the option to display photos of the current location in the bottom of the screen. Seeing actual photos is incredibly more useful than being shown 100 identical icons indicating that someone took a photo of something here.

Continue reading “The new Google Maps: Explorers delight!”

Downloading environment data for Denmark

This post is very Denmark-specific. I was annoyed that downloading environmental data from the ministry of the environment was so cumbersome.

Long story made short, here are download URLs for environmental data:

Dataset URL
EF Fuglebeskyttelsesområder
EF Habitatområder

There are more. I’ll add them when I have the time. Why there ministry homepage does not provide direct links like this is beyond me.

A presentation on spatial indexing

A friend of mine, who is the CEO of a company that develops an embedded database, asked me to do a presentation on spatial indexing. This was an opportunity for me to brush up on R-trees and similar datastructures.

Download the slides

The presentation introduces R-trees and spatial indexing to a technical audience, who are however not spatial indexing experts.

Having a look at Spacebase

Spacebase is a spatial datastore that began life as military-grade software, which at least sounds kind of cool. It’s an in-memory database, really, so switch off the cluster and the data is gone. Apparently the same thing was (unknown to the SpaceBase people?) invented in the 90’s by some americans also having the military as their first customer.

Continue reading “Having a look at Spacebase”