Python script for geocoding a text file

Assume that you have a file with some locations as text with one location per line.

For example, here are some school names in Copenhagen, Denmark, stored in schools.csv:

Hyltebjerg Skole
Heibergskolen
Ellebjerg Skole
Katrinedals Skole
Peder Lykke Skolen
Amager Fælled Skole
Tingbjerg Heldagsskole
Øster Farimagsgades Skole
Sankt Annæ Gymnasiums Grundskole
Lykkebo Skole
Randersgades Skole
Strandvejsskolen
Sortedamskolen
Grøndalsvængets Skole
Sølvgades Skole
Skolen ved Sundet
Hanssted Skole
Holbergskolen
Den Classenske Legatskole
Tove Ditlevsens Skole
Lergravsparkens Skole
Vigerslev Allés Skole
Bavnehøj Skole
Ålholm Skole
Langelinieskolen
Guldberg Skole
Husum Skole
Nyboder Skole
Vanløse Skole
Kirkebjerg Skole
Christianshavns Skole
Bellahøj Skole
Kildevældsskolen
Korsager Skole
Nørrebro Park Skole
Utterslev Skole
Skolen på Islands Brygge
Brønshøj Skole
Kirsebærhavens Skole
Rødkilde Skole
Vesterbro Ny Skole
Blågård Skole
Sønderbro Skole
Højdevangens Skole
Oehlenschlægersgades Skole
Vibenshus Skole
Valby Skole
Rådmandsgades Skole
Lundehusskolen
Tagensbo Skole

Here is a script, geocode.py, that will attempt to geocode each location in an input stream. It prints CSV output to stdout with the fields input_line, input_line_no, result_no, place, latitude, longitude:

from geopy import geocoders
import sys
import time
import pdb
 
geocoder = geocoders.GoogleV3()
 
SEPARATOR='|'  # can also use tab. Comma is bad, since the place will most likely contain a comma.
dummy = ['', ['', '']]
i = 0
 
header = ['input_line', 'input_line_no', 'result_no', 'place', 'latitude', 'longitude']
print(SEPARATOR.join(header))
 
for line in sys.stdin:
    line = line.strip()
    results = geocoder.geocode(line, exactly_one=False) or [dummy]
    for j, res in enumerate(results):
        place = res[0]
        lat = str(res[1][0])
        lon = str(res[1][1])
        out = SEPARATOR.join([line, str(i), str(j), place, lat, lon])
        print (out)
    time.sleep(0.05)
    i += 1

Here is how you might use the script:

cat schools.csv | python geocode.py 

Tip: you might want to

How to work with spatial data in Amazon Redshift

While Redshift does not offer native support for spatial data, indexes and functions, there exists a partial workaround. Redshift supports Python UDFs and can also load custom Python libraries. Out of the box, Redshift has numpy, scipy, pandas and many other useful Python libraries. For spatial functionality, one saving grace is the high quality spatial libraries that exist for Python, such as shapely. Of course, the alternative is to simply implement useful spatial functions in Python directly, which we will do here. The drawback is that this does not provide the means for spatial indexes or native spatial types in Redshift. As long as you are working mainly with point data, this should not be a huge obstacle. While polygons and operations on them are useful in many cases, a properly utilized GeoHash can usually do the trick.

So, let’s get into it! Connect to your Redshift cluster using a client of your choosing. I prefer SQLWorkbench/J. Properly connected, attempt to create the following UDF in Python, which implements the haversine formula using NumPy (thanks to jterrace for the solution).

CREATE OR REPLACE FUNCTION haversine (lat1 float, lon1 float, lat2 float, lon2 float)
RETURNS float IMMUTABLE AS
$$
 
    from math import radians, sin, cos, asin, sqrt, pi, atan2
    import numpy as np
 
    earth_radius_miles = 3956.0
 
    def haversine(lat1, lon1, lat2, lon2):
        """Gives the distance between two points on earth.
        """
        lat1, lon1 = radians(lat1), radians(lon1)
        lat2, lon2 = radians(lat2), radians(lon2)
        dlat, dlon = (lat2 - lat1, lon2 - lon1)
        a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
        great_circle_distance = 2 * asin(min(1,sqrt(a)))
        return earth_radius_miles * great_circle_distance
 
    return haversine(lat1, lon1, lat2, lon2)
$$ LANGUAGE plpythonu;

Now, let’s use our new UDF to calculate the great-circle distance between a pair of points.

SELECT haversine(37.160316546736745, -78.75, 39.095962936305476, -121.2890625)
-- 2293.1324218790523

One very big drawback is that it is incredibly slow (an understatement). The following query computes the function just 100 times, which on my cluster took over 17.21 seconds (jeez!):

SELECT COUNT(haversine(37.160316546736745, -78.75, 39.095962936305476, lon2 % 360 - 180)) FROM generate_series(1, 100) lon2

Because the speed is so slow, I will investigate another way to achieve this goal with Redshift. Expect updates to this post.

Geocoding Python function for PostgreSQL

Gratefully making use of what others have provided, i.e. geopy, Google and plpythonu.

Type to hold result of geocoding:

CREATE TYPE geocoding AS (
  place text,
  latitude DOUBLE PRECISION,
  longitude DOUBLE PRECISION
);

Function that does the actual geocoding (to be extended with more vendors. Hint: look at geopy wiki). Takes an (arbitrary) input string to be geocoded:

CREATE OR REPLACE FUNCTION python_geocode
(
  input text,
  vendor text DEFAULT 'google'
) RETURNS SETOF geocoding AS
$$
  import time
  from geopy import geocoders
  # https://code.google.com/p/geopy/wiki/GettingStarted
 
  time.sleep(0.2)
  # TODO: Add other available vendors, e.g. Yahoo.
  if vendor.lower() == 'google':
    geocoder = geocoders.GoogleV3()
  else:
    raise ValueError("Invalid geocoder: %s" % vendor)
  try:
    for res in geocoder.geocode(input, exactly_one=False):
      yield {'place': res[0], 'latitude': res[1][0], 'longitude': res[1][1]}
  except:
    pass
$$ LANGUAGE plpythonu VOLATILE;

Example:

SELECT place, ST_SetSRID(ST_MakePoint(longitude, latitude), 4326)
FROM python_geocode('Kostas');

The new Google Maps: Explorers delight!

Being interested in maps in general, and specifically vector data, of course I had to take the new Google Maps for a spin. Vector data aside, what fascinated me most was the option to display photos of the current location in the bottom of the screen. Seeing actual photos is incredibly more useful than being shown 100 identical icons indicating that someone took a photo of something here.

Continue reading “The new Google Maps: Explorers delight!”

List of cloud GIS platforms

This is work in progress…

http://geocommons.com/
http://www.giscloud.com/
http://cartodb.com/
http://www.google.com/drive/start/apps.html#fusiontables
http://tables.googlelabs.com/
http://cloudmade.com/
http://mapbox.com/
http://mapquest.com
http://maps.google.com

Downloading environment data for Denmark

This post is very Denmark-specific. I was annoyed that downloading environmental data from the ministry of the environment was so cumbersome.

Long story made short, here are download URLs for environmental data:

Dataset URL
Kirkebyggelinjer KIRKEBYGGELINJER_SHAPE.zip
Skovbyggelinjer SKOVBYGGELINJER_SHAPE.zip
Åbeskyttelseslinjer AA_BES_LINJER_SHAPE.zip
Søbeskyttelseslinjer SOE_BES_LINJER_SHAPE.zip
EF Fuglebeskyttelsesområder EF_FUGLE_BES_OMR_SHAPE.zip
EF Habitatområder EF_HABITAT_OMR_SHAPE.zip
RAMSAR-områder RAMSAR_OMR_SHAPE.zip
Naturvildtreservat NATUR_VILDT_RESERVAT_SHAPE.zip

There are more. I’ll add them when I have the time. Why there ministry homepage does not provide direct links like this is beyond me.

A presentation on spatial indexing

A friend of mine, who is the CEO of a company that develops an embedded database, asked me to do a presentation on spatial indexing. This was an opportunity for me to brush up on R-trees and similar datastructures.

Download the slides

The presentation introduces R-trees and spatial indexing to a technical audience, who are however not spatial indexing experts.