How to work with spatial data in Amazon Redshift

While Redshift does not offer native support for spatial data, indexes and functions, there exists a partial workaround. Redshift supports Python UDFs and can also load custom Python libraries. Out of the box, Redshift has numpy, scipy, pandas and many other useful Python libraries. For spatial functionality, one saving grace is the high quality spatial libraries that exist for Python, such as shapely. Of course, the alternative is to simply implement useful spatial functions in Python directly, which we will do here. The drawback is that this does not provide the means for spatial indexes or native spatial types in Redshift. As long as you are working mainly with point data, this should not be a huge obstacle. While polygons and operations on them are useful in many cases, a properly utilized GeoHash can usually do the trick.

So, let’s get into it! Connect to your Redshift cluster using a client of your choosing. I prefer SQLWorkbench/J. Properly connected, attempt to create the following UDF in Python, which implements the haversine formula using NumPy (thanks to jterrace for the solution).

CREATE OR REPLACE FUNCTION haversine (lat1 float, lon1 float, lat2 float, lon2 float)
    from math import radians, sin, cos, asin, sqrt, pi, atan2
    import numpy as np
    earth_radius_miles = 3956.0
    def haversine(lat1, lon1, lat2, lon2):
        """Gives the distance between two points on earth.
        lat1, lon1 = radians(lat1), radians(lon1)
        lat2, lon2 = radians(lat2), radians(lon2)
        dlat, dlon = (lat2 - lat1, lon2 - lon1)
        a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
        great_circle_distance = 2 * asin(min(1,sqrt(a)))
        return earth_radius_miles * great_circle_distance
    return haversine(lat1, lon1, lat2, lon2)
$$ LANGUAGE plpythonu;

Now, let’s use our new UDF to calculate the great-circle distance between a pair of points.

SELECT haversine(37.160316546736745, -78.75, 39.095962936305476, -121.2890625)
-- 2293.1324218790523

One very big drawback is that it is incredibly slow (an understatement). The following query computes the function just 100 times, which on my cluster took over 17.21 seconds (jeez!):

SELECT COUNT(haversine(37.160316546736745, -78.75, 39.095962936305476, lon2 % 360 - 180)) FROM generate_series(1, 100) lon2

Because the speed is so slow, I will investigate another way to achieve this goal with Redshift. Expect updates to this post.

How to randomly sample k lines from a file in *nix

You can use the shell to extract a random sample of lines from a file in *nix. The two commands you need are “shuf” and “head” (+ “tail” for CSV files with a header). The shuf command will randomly shuffle all the lines of its input. The head command will cut of the input after the first k lines. Examples for both general files and CSV files are given below.

General pattern

To randomly sample 100 lines from any file in *nix:

shuf INPUT_FILE | head -n 100 > SAMPLE_FILE

Pattern for CSV

If you file is a CSV file, you probably want to extract the header and only sample the body. You can use the head and tail commands, respectively, to extract the header and sample the contents of the CSV file.

Extract the header of the CSV file:

head -1 INPUT_FILE.csv > SAMPLE_FILE.csv

Sample 100 lines from the body of the CSV file and append to sample file (notice “>” above versus “>>” below):

tail +2 INPUT_FILE.csv | shuf | head -100 >> SAMPLE_FILE.csv

Install dependencies on Mac

On Mac, the shuf command is not shipped with the OS. You can get it via brew. It will be named “gshuf”:

brew install coreutils

So, on Mac you should replace shuf with gshuf in the example above.

Apache Zeppelin (incubator) rocks!

At Spark Summit Europe 2015, several presenters made use of Apache Zeppeling, which is a notebook (a la IPython) for Spark.

I immediately wanted to try it out myself. I also highly recommend you to download and try it out if you like Spark. But one note: download Zeppelin from GitHub rather than from the apache homepage. The GitHub one is significantly more up to date (today). You do not need to preinstall Spark (but you can if you want), because Zeppelin comes with a stand-alone installation of Spark.

PyBrain quickstart and beyond

After pip install bybrain, the PyBrain the quick start essentially goes as follows:

from import buildNetwork
from pybrain.structure import TanhLayer
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer
# Create a neural network with two inputs, three hidden, and one output
net = buildNetwork(2, 3, 1, bias=True, hiddenclass=TanhLayer)
# Create a dataset that matches NN input/output sizes:
xor = SupervisedDataSet(2, 1)
# Add input and target values to dataset
# Values correspond to XOR truth table
xor.addSample((0, 0), (0,))
xor.addSample((0, 1), (1,))
xor.addSample((1, 0), (1,))
xor.addSample((1, 1), (0,))
trainer = BackpropTrainer(net, xor)
for epoch in range(1000):

However, it does not work, which can be seen by running the following test?

testdata = xor
trainer.testOnData(testdata, verbose = True)  # Works if you are lucky!

Kristina Striegnitz code has written and published an XOR example that works more reliably. The code is effectively reproduced below, in case the original should disappear:

# ... continued from above
# Create a recurrent neural network with four hidden nodes (default is SigmoidLayer) 
net = buildNetwork(2, 4, 1, recurrent = True)
# Train the network using arguments for learningrate and momentum
trainer = BackpropTrainer(net, xor, learningrate = 0.01, momentum = 0.99, verbose = True)
for epoch in range(1000):
# This should work every time...
trainer.testOnData(testdata, verbose = True)

Log devices on your network

Fing logger (

# append current public ip
echo `date +"%Y/%m/%d %T"`";publicip;"`curl -s`";;;" >> $FING_LOG_FILE
# append current fing output
/usr/bin/fing -r1 -o log,csv,$FING_LOG_FILE,1000 --silent

Add to cron (run every hour):

0 * * * * /path/to/

How to merge two disjoint random samples?

The problem: Given two random samples, s1 and s2, of size k over two disjoint populations, p1 and p2, how to combine the two k-sized random samples into one k-sized random sample over p1 ∪ p2?

The solution: k times, draw an element s1 ∪ s2; with probability d1 = |p1| / |p1 ∪ p2|, draw the next element from p1; with probability d2 = 1 – d1 draw the next element from p2.

(the solution was on stackoverflow)

In python:

import random
import numpy
# sizes
e1 = 1000
e2 = 1000000
# populations
p1 = xrange(e1)
p2 = xrange(e1, e2)
# sample size
k = 500
# random samples
s1 = random.sample(p1, k)
s2 = random.sample(p2, k)
# merge samples
merge = []
for i in range(k):
  if s1 and s2:
    merge.append(s1.pop() if random.random < len(p1) / float(len(p1)+len(p2)) else s2.pop())
  elif s1:
# Validate
hist = numpy.histogram(merge, bins=[0,500000,1000000])
# The two bins should be roughly equal, i.e. the error should be small.
print abs(hist[0][0] - hist[0][1]) / float(k)
# alternatively, use filter to count values below 500K
print abs(len(filter(lambda x: x<500000, merge)) - 250) / 500.0

Which files has a *nix process opened?

To list all files that are opened by a *nix process with a given pid, say 42, use the lsof command:

(sudo) lsof -p 42

Of course, a process may have many files open. To list only files that have a name containing “log”, use the grep command:

(sudo) lsof -p 42 | grep log

This of course assumes you know the process id (pid) of the process. To find the pid of processes with a given name, e.g. httpd, use the ps command together with the grep command:

# notice that this prints the grep process as well
ps aux | grep httpd