How to work with spatial data in Amazon Redshift

While Redshift does not offer native support for spatial data, indexes and functions, there is a partial workaround. Redshift supports Python UDFs and can also load custom Python libraries. Out of the box, Redshift includes numpy, scipy, pandas and many other useful Python libraries. For spatial functionality, one saving grace is the high-quality spatial libraries that exist for Python, such as shapely. The alternative is to implement the spatial functions you need directly in Python, which is what we will do here. The drawback is that neither approach gives you spatial indexes or native spatial types in Redshift. As long as you are working mainly with point data, this should not be a huge obstacle. While polygons and operations on them are useful in many cases, a properly utilized GeoHash can usually do the trick.
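To make the GeoHash remark concrete, here is a minimal sketch of the standard geohash encoding in plain Python. The function name `geohash_encode` and its `precision` parameter are hypothetical choices for illustration, not part of Redshift or of any library mentioned here. Points that share a prefix fall inside the same grid cell, which is what makes a geohash column usable as a rough substitute for a spatial index on point data.

```python
# Standard geohash base-32 alphabet (digits plus letters, without a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=9):
    """Encode a (lat, lon) pair in degrees as a geohash string.

    Hypothetical helper for illustration; bits alternate between
    longitude and latitude, starting with longitude.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0          # accumulates 5 bits per output character
    bit_count = 0
    use_lon = True    # even-numbered bits refine longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2.0
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits = bits << 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2.0
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits = bits << 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


print(geohash_encode(42.6, -5.6, precision=5))
```

A longer hash of the same point starts with the shorter one, so filtering on a prefix selects everything in the enclosing cell.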

So, let's get into it! Connect to your Redshift cluster using a client of your choosing. I prefer SQLWorkbench/J. Once connected, create the following Python UDF, which implements the haversine formula using functions from Python's math module (thanks to jterrace for the solution).

CREATE OR REPLACE FUNCTION haversine (lat1 float, lon1 float, lat2 float, lon2 float)
RETURNS float IMMUTABLE AS $$
    from math import radians, sin, cos, asin, sqrt, pi, atan2
    import numpy as np
    earth_radius_miles = 3956.0
    def haversine(lat1, lon1, lat2, lon2):
        """Gives the distance in miles between two points on earth."""
        # Inputs are in degrees; the trigonometric functions expect radians.
        lat1, lon1 = radians(lat1), radians(lon1)
        lat2, lon2 = radians(lat2), radians(lon2)
        dlat, dlon = (lat2 - lat1, lon2 - lon1)
        a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
        # Clamp to 1 so rounding error cannot push asin out of its domain.
        great_circle_distance = 2 * asin(min(1,sqrt(a)))
        return earth_radius_miles * great_circle_distance
    return haversine(lat1, lon1, lat2, lon2)
$$ LANGUAGE plpythonu;

Now, let's use our new UDF to calculate the great-circle distance between a pair of points.

SELECT haversine(37.160316546736745, -78.75, 39.095962936305476, -121.2890625)
-- 2293.1324218790523
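The result above can be sanity-checked outside Redshift. The following is a minimal local sketch of the same formula in plain Python, using only the standard library; it assumes the same 3956-mile earth radius as the UDF and is meant as a cross-check, not as a replacement for the UDF.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3956.0  # same constant as in the UDF


def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1 = radians(lat1), radians(lon1)
    lat2, lon2 = radians(lat2), radians(lon2)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2.0) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2.0) ** 2
    # Clamp to 1.0 so rounding error cannot push asin out of its domain.
    return EARTH_RADIUS_MILES * 2 * asin(min(1.0, sqrt(a)))


distance = haversine(37.160316546736745, -78.75, 39.095962936305476, -121.2890625)
print(distance)  # should agree closely with the value returned by the query
```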

One very big drawback is that the UDF is incredibly slow, and that is an understatement. The following query calls the function just 100 times, which took 17.21 seconds on my cluster (jeez!):

SELECT COUNT(haversine(37.160316546736745, -78.75, 39.095962936305476, lon2 % 360 - 180)) FROM generate_series(1, 100) lon2
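For a rough baseline, the same 100 inputs can be pushed through the formula in a local Python interpreter and timed. This is only a sketch: the helper name `haversine_miles` is hypothetical, and the number it prints describes the arithmetic on your own machine. It says nothing about where the time goes inside Redshift.

```python
import time
from math import radians, sin, cos, asin, sqrt


def haversine_miles(lat1, lon1, lat2, lon2, radius=3956.0):
    """Hypothetical local helper: haversine distance in miles, inputs in degrees."""
    p1, p2 = radians(lat1), radians(lat2)
    a = (sin((p2 - p1) / 2.0) ** 2
         + cos(p1) * cos(p2) * sin(radians(lon2 - lon1) / 2.0) ** 2)
    return radius * 2 * asin(min(1.0, sqrt(a)))


start = time.perf_counter()
# Mirror the query: lon2 runs from 1 to 100 and is mapped with lon2 % 360 - 180.
results = [
    haversine_miles(37.160316546736745, -78.75, 39.095962936305476, lon2 % 360 - 180)
    for lon2 in range(1, 101)
]
elapsed = time.perf_counter() - start

print(len(results), "distances computed in", elapsed, "seconds")
```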

Because the UDF is this slow, I will investigate another way to achieve this goal with Redshift. Expect updates to this post.
