How to work with spatial data in Amazon Redshift

While Redshift does not offer native support for spatial data, indexes and functions, there exists a partial workaround. Redshift supports Python UDFs and can also load custom Python libraries. Out of the box, Redshift has numpy, scipy, pandas and many other useful Python libraries. For spatial functionality, one saving grace is the high quality spatial libraries that exist for Python, such as shapely. Of course, the alternative is to simply implement useful spatial functions in Python directly, which we will do here. The drawback is that this does not provide the means for spatial indexes or native spatial types in Redshift. As long as you are working mainly with point data, this should not be a huge obstacle. While polygons and operations on them are useful in many cases, a properly utilized GeoHash can usually do the trick.

So, let’s get into it! Connect to your Redshift cluster using a client of your choosing. I prefer SQLWorkbench/J. Properly connected, attempt to create the following UDF in Python, which implements the haversine formula using NumPy (thanks to jterrace for the solution).

 CREATE OR REPLACE FUNCTION haversine (lat1 float, lon1 float, lat2 float, lon2 float) RETURNS float IMMUTABLE AS \$\$   from math import radians, sin, cos, asin, sqrt, pi, atan2 import numpy as np   earth_radius_miles = 3956.0   def haversine(lat1, lon1, lat2, lon2): """Gives the distance between two points on earth. """ lat1, lon1 = radians(lat1), radians(lon1) lat2, lon2 = radians(lat2), radians(lon2) dlat, dlon = (lat2 - lat1, lon2 - lon1) a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2 great_circle_distance = 2 * asin(min(1,sqrt(a))) return earth_radius_miles * great_circle_distance   return haversine(lat1, lon1, lat2, lon2) \$\$ LANGUAGE plpythonu;

Now, let’s use our new UDF to calculate the great-circle distance between a pair of points.

 SELECT haversine(37.160316546736745, -78.75, 39.095962936305476, -121.2890625) -- 2293.1324218790523

One very big drawback is that it is incredibly slow (an understatement). The following query computes the function just 100 times, which on my cluster took over 17.21 seconds (jeez!):

 SELECT COUNT(haversine(37.160316546736745, -78.75, 39.095962936305476, lon2 % 360 - 180)) FROM generate_series(1, 100) lon2

Because the speed is so slow, I will investigate another way to achieve this goal with Redshift. Expect updates to this post.

How to randomly sample k lines from a file in *nix

You can use the shell to extract a random sample of lines from a file in *nix. The two commands you need are “shuf” and “head” (+ “tail” for CSV files with a header). The shuf command will randomly shuffle all the lines of its input. The head command will cut of the input after the first k lines. Examples for both general files and CSV files are given below.

General pattern

To randomly sample 100 lines from any file in *nix:

 shuf INPUT_FILE | head -n 100 > SAMPLE_FILE

Pattern for CSV

If you file is a CSV file, you probably want to extract the header and only sample the body. You can use the head and tail commands, respectively, to extract the header and sample the contents of the CSV file.

Extract the header of the CSV file:

Sample 100 lines from the body of the CSV file and append to sample file (notice “>” above versus “>>” below):

 tail +2 INPUT_FILE.csv | shuf | head -100 >> SAMPLE_FILE.csv

Install dependencies on Mac

On Mac, the shuf command is not shipped with the OS. You can get it via brew. It will be named “gshuf”:

 brew install coreutils

So, on Mac you should replace shuf with gshuf in the example above.

Apache Zeppelin (incubator) rocks!

At Spark Summit Europe 2015, several presenters made use of Apache Zeppeling, which is a notebook (a la IPython) for Spark.

I immediately wanted to try it out myself. I also highly recommend you to download and try it out if you like Spark. But one note: download Zeppelin from GitHub rather than from the apache homepage. The GitHub one is significantly more up to date (today). You do not need to preinstall Spark (but you can if you want), because Zeppelin comes with a stand-alone installation of Spark.

PyBrain quickstart and beyond

After pip install bybrain, the PyBrain the quick start essentially goes as follows:

 from pybrain.tools.shortcuts import buildNetwork from pybrain.structure import TanhLayer from pybrain.datasets import SupervisedDataSet from pybrain.supervised.trainers import BackpropTrainer   # Create a neural network with two inputs, three hidden, and one output net = buildNetwork(2, 3, 1, bias=True, hiddenclass=TanhLayer)   # Create a dataset that matches NN input/output sizes: xor = SupervisedDataSet(2, 1)   # Add input and target values to dataset # Values correspond to XOR truth table xor.addSample((0, 0), (0,)) xor.addSample((0, 1), (1,)) xor.addSample((1, 0), (1,)) xor.addSample((1, 1), (0,))   trainer = BackpropTrainer(net, xor) #trainer.trainUntilConvergence() for epoch in range(1000): trainer.train()

However, it does not work, which can be seen by running the following test?

 testdata = xor trainer.testOnData(testdata, verbose = True) # Works if you are lucky!

Kristina Striegnitz code has written and published an XOR example that works more reliably. The code is effectively reproduced below, in case the original should disappear:

 # ... continued from above   # Create a recurrent neural network with four hidden nodes (default is SigmoidLayer) net = buildNetwork(2, 4, 1, recurrent = True)   # Train the network using arguments for learningrate and momentum trainer = BackpropTrainer(net, xor, learningrate = 0.01, momentum = 0.99, verbose = True) for epoch in range(1000): trainer.train()   # This should work every time... trainer.testOnData(testdata, verbose = True)

Protected: Temporary post

Fing logger (finglogger.sh):

 #!/bin/sh   FING_LOG_FILE=/path/to/fing.log   # append current public ip echo `date +"%Y/%m/%d %T"`";publicip;"`curl -s ipecho.net/plain`";;;" >> \$FING_LOG_FILE   # append current fing output /usr/bin/fing -r1 -o log,csv,\$FING_LOG_FILE,1000 --silent

Add to cron (run every hour):

 0 * * * * /path/to/finglogger.sh

How to merge two disjoint random samples?

The problem: Given two random samples, s1 and s2, of size k over two disjoint populations, p1 and p2, how to combine the two k-sized random samples into one k-sized random sample over p1 ∪ p2?

The solution: k times, draw an element s1 ∪ s2; with probability d1 = |p1| / |p1 ∪ p2|, draw the next element from p1; with probability d2 = 1 – d1 draw the next element from p2.

(the solution was on stackoverflow)

In python:

 import random import numpy   # sizes e1 = 1000 e2 = 1000000   # populations p1 = xrange(e1) p2 = xrange(e1, e2)   # sample size k = 500   # random samples s1 = random.sample(p1, k) s2 = random.sample(p2, k)   # merge samples merge = [] for i in range(k): if s1 and s2: merge.append(s1.pop() if random.random < len(p1) / float(len(p1)+len(p2)) else s2.pop()) elif s1: merge.append(s1.pop()) else: merge.append(s2.pop())   # Validate hist = numpy.histogram(merge, bins=[0,500000,1000000]) # The two bins should be roughly equal, i.e. the error should be small. print abs(hist[0][0] - hist[0][1]) / float(k)   # alternatively, use filter to count values below 500K print abs(len(filter(lambda x: x<500000, merge)) - 250) / 500.0

How to compute Fibonacci sequence in SQL

Inspired and simplified from a set of slides on using RDBMS for storing, managing, and querying graphs:

 WITH recursive fib(i,j) AS ( SELECT 0,1 UNION ALL SELECT j, i+j FROM fib WHERE j<1000 ) SELECT i FROM fib

Immerse yourself in the social graph of your email

https://immersion.media.mit.edu/