Pandas and SQLAlchemy offer powerful conversions between CSV files and tables in databases. Here is a small example:
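import pandas as pd
from sqlalchemy import create_engine
# Load the CSV file into a DataFrame.
df = pd.read_csv('mydata.csv')
# Open (or create) a SQLite database file.
engine = create_engine('sqlite:///mydata.db')
# Write the DataFrame to a table; the table name 'mydata' is an assumed example.
df.to_sql('mydata', engine, index=False)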
While Redshift does not offer native support for spatial data, indexes and functions, there is a partial workaround. Redshift supports Python UDFs and can also load custom Python libraries. Out of the box, Redshift ships with numpy, scipy, pandas and many other useful Python libraries. For spatial functionality, one saving grace is the set of high-quality spatial libraries that exist for Python, such as shapely. Of course, the alternative is to simply implement useful spatial functions in Python directly, which we will do here. The drawback is that this does not provide spatial indexes or native spatial types in Redshift. As long as you are working mainly with point data, this should not be a huge obstacle. While polygons and operations on them are useful in many cases, a properly utilized GeoHash can usually do the trick.
So, let’s get into it! Connect to your Redshift cluster using a client of your choosing; I prefer SQLWorkbench/J. Once connected, create the following Python UDF, which implements the haversine formula (thanks to jterrace for the solution).
CREATE OR REPLACE FUNCTION haversine (lat1 float, lon1 float, lat2 float, lon2 float)
RETURNS float IMMUTABLE AS $$
    from math import radians, sin, cos, asin, sqrt, pi, atan2
    import numpy as np
    earth_radius_miles = 3956.0
    def haversine(lat1, lon1, lat2, lon2):
        """Gives the distance in miles between two points on earth."""
        # Convert degrees to radians before applying the formula.
        lat1, lon1 = radians(lat1), radians(lon1)
        lat2, lon2 = radians(lat2), radians(lon2)
        dlat, dlon = (lat2 - lat1, lon2 - lon1)
        a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
        # Clamp to 1 to guard against floating-point rounding errors.
        great_circle_distance = 2 * asin(min(1,sqrt(a)))
        return earth_radius_miles * great_circle_distance
    return haversine(lat1, lon1, lat2, lon2)
$$ LANGUAGE plpythonu;
Now, let’s use our new UDF to calculate the great-circle distance between a pair of points.
SELECT haversine(37.160316546736745, -78.75, 39.095962936305476, -121.2890625)
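To sanity-check the formula before deploying it, the same calculation can be run locally. The following is a minimal sketch in plain Python that mirrors the body of the UDF. The name `haversine_miles` is my own choice for illustration, and I am assuming the local result should match what Redshift returns for the same inputs.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3956.0  # same constant as in the UDF


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1 = radians(lat1), radians(lon1)
    lat2, lon2 = radians(lat2), radians(lon2)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2.0) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2.0) ** 2
    # Clamp to 1 so rounding error cannot push asin out of its domain.
    return EARTH_RADIUS_MILES * 2 * asin(min(1, sqrt(a)))


if __name__ == "__main__":
    # Same pair of points as in the query above.
    print(haversine_miles(37.160316546736745, -78.75,
                          39.095962936305476, -121.2890625))
```

Two easy checks are that identical points give a distance of zero and that swapping the two points gives the same result.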
One very big drawback is that the UDF is incredibly slow (an understatement). The following query calls the function just 100 times, which took 17.21 seconds on my cluster (jeez!):
SELECT COUNT(haversine(37.160316546736745, -78.75, 39.095962936305476, lon2 % 360 - 180)) FROM generate_series(1, 100) lon2
Because the UDF is so slow, I will investigate another way to achieve this goal with Redshift. Expect updates to this post.
Inspired by and simplified from a set of slides on using an RDBMS for storing, managing, and querying graphs:
WITH RECURSIVE fib(i,j) AS (
    SELECT 0, 1
    UNION ALL
    SELECT j, i+j FROM fib WHERE j<1000
)
SELECT i FROM fib;
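One quick way to try a recursive query like this without a database server is Python's built-in sqlite3 module, since SQLite supports WITH RECURSIVE. The sketch below assumes a seed row of (0, 1) as the starting point of the recursion; the names `QUERY` and `fib_via_cte` are made up for this example.

```python
import sqlite3

# Recursive CTE: start from an assumed seed row (0, 1), then repeatedly
# derive (j, i + j) from each row (i, j) while j stays below 1000.
QUERY = """
WITH RECURSIVE fib(i, j) AS (
    SELECT 0, 1
    UNION ALL
    SELECT j, i + j FROM fib WHERE j < 1000
)
SELECT i FROM fib
"""


def fib_via_cte():
    """Run the recursive query in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        return [row[0] for row in conn.execute(QUERY)]
    finally:
        conn.close()


if __name__ == "__main__":
    print(fib_via_cte())
```

The WHERE clause is what stops the recursion: a row only produces a successor while its second column is below 1000, so the last value returned is the largest Fibonacci number under that limit.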