Scrapy is an excellent Python library for web scraping. For example, you could create an API with data that is populated via web scraping. This article covers some basic scrapy features, such as the shell and selectors.
Install scrapy in virtual environment on your machine:
$ virtualenv venv
$ source venv/bin/activate
$ pip install scrapy
To learn about scrapy, the shell is a good place to start, because it offers an interactive environment where you can try selectors on a concrete web page. Here is how to start the scrapy shell:
$ scrapy shell http://doc.scrapy.org/en/latest/topics/selectors.html
Now, try out different selections.
You can select elements on a page with CSS and XPath; these selectors can be stringed together. For example, use css to select
a tags and xpath to select the
href attribute of those tags:
>>> for link in response.css('a').xpath('@href').extract():
>>> print link
Now you are ready to head over to the documentation to read more about how to become great a using scrapy. Another tip is to follow the scrapinghub blog.
Pandas and SQLAlchemy offer powerful conversions between CSV files and tables in databases. Here is a small example:
import pandas as pd
from sqlalchemy import create_engine
df = pd.read_csv('mydata.csv')
engine = create_engine('sqlite:///mydata.db')
Given an AWS credentials file that looks like this:
aws_access_key_id = DEFAULT
aws_secret_access_key = SECRET1
aws_access_key_id = DEV
aws_secret_access_key = SECRET2
aws_access_key_id = PROD
aws_secret_access_key = SECRET3
You can use any profile, say dev, like this in Python:
dev = boto3.session.Session(profile_name='dev')
s3 = dev.resource('s3')
for bucket in s3.buckets.all():
You may ask yourself, if I’m a machine learning person then what kind am I? See for yourself in Jason Eisner’s Three Cultures of Machine Learning.
While Redshift does not offer native support for spatial data, indexes and functions, there exists a partial workaround. Redshift supports Python UDFs and can also load custom Python libraries. Out of the box, Redshift has numpy, scipy, pandas and many other useful Python libraries. For spatial functionality, one saving grace is the high quality spatial libraries that exist for Python, such as shapely. Of course, the alternative is to simply implement useful spatial functions in Python directly, which we will do here. The drawback is that this does not provide the means for spatial indexes or native spatial types in Redshift. As long as you are working mainly with point data, this should not be a huge obstacle. While polygons and operations on them are useful in many cases, a properly utilized GeoHash can usually do the trick.
So, let’s get into it! Connect to your Redshift cluster using a client of your choosing. I prefer SQLWorkbench/J. Properly connected, attempt to create the following UDF in Python, which implements the haversine formula using NumPy (thanks to jterrace for the solution).
CREATE OR REPLACE FUNCTION haversine (lat1 float, lon1 float, lat2 float, lon2 float)
RETURNS float IMMUTABLE AS
from math import radians, sin, cos, asin, sqrt, pi, atan2
import numpy as np
earth_radius_miles = 3956.0
def haversine(lat1, lon1, lat2, lon2):
"""Gives the distance between two points on earth.
lat1, lon1 = radians(lat1), radians(lon1)
lat2, lon2 = radians(lat2), radians(lon2)
dlat, dlon = (lat2 - lat1, lon2 - lon1)
a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
great_circle_distance = 2 * asin(min(1,sqrt(a)))
return earth_radius_miles * great_circle_distance
return haversine(lat1, lon1, lat2, lon2)
$$ LANGUAGE plpythonu;
Now, let’s use our new UDF to calculate the great-circle distance between a pair of points.
SELECT haversine(37.160316546736745, -78.75, 39.095962936305476, -121.2890625)
One very big drawback is that it is incredibly slow (an understatement). The following query computes the function just 100 times, which on my cluster took over 17.21 seconds (jeez!):
SELECT COUNT(haversine(37.160316546736745, -78.75, 39.095962936305476, lon2 % 360 - 180)) FROM generate_series(1, 100) lon2
Because the speed is so slow, I will investigate another way to achieve this goal with Redshift. Expect updates to this post.
You can use the shell to extract a random sample of lines from a file in *nix. The two commands you need are “shuf” and “head” (+ “tail” for CSV files with a header). The shuf command will randomly shuffle all the lines of its input. The head command will cut of the input after the first k lines. Examples for both general files and CSV files are given below.
To randomly sample 100 lines from any file in *nix:
shuf INPUT_FILE | head -n 100 > SAMPLE_FILE
Pattern for CSV
If you file is a CSV file, you probably want to extract the header and only sample the body. You can use the head and tail commands, respectively, to extract the header and sample the contents of the CSV file.
Extract the header of the CSV file:
head -1 INPUT_FILE.csv > SAMPLE_FILE.csv
Sample 100 lines from the body of the CSV file and append to sample file (notice “>” above versus “>>” below):
tail +2 INPUT_FILE.csv | shuf | head -100 >> SAMPLE_FILE.csv
Install dependencies on Mac
On Mac, the shuf command is not shipped with the OS. You can get it via brew. It will be named “gshuf”:
So, on Mac you should replace shuf with gshuf in the example above.
At Spark Summit Europe 2015, several presenters made use of Apache Zeppeling, which is a notebook (a la IPython) for Spark.
I immediately wanted to try it out myself. I also highly recommend you to download and try it out if you like Spark. But one note: download Zeppelin from GitHub rather than from the apache homepage. The GitHub one is significantly more up to date (today). You do not need to preinstall Spark (but you can if you want), because Zeppelin comes with a stand-alone installation of Spark.
pip install bybrain, the PyBrain the quick start essentially goes as follows:
from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure import TanhLayer
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer
# Create a neural network with two inputs, three hidden, and one output
net = buildNetwork(2, 3, 1, bias=True, hiddenclass=TanhLayer)
# Create a dataset that matches NN input/output sizes:
xor = SupervisedDataSet(2, 1)
# Add input and target values to dataset
# Values correspond to XOR truth table
xor.addSample((0, 0), (0,))
xor.addSample((0, 1), (1,))
xor.addSample((1, 0), (1,))
xor.addSample((1, 1), (0,))
trainer = BackpropTrainer(net, xor)
for epoch in range(1000):
However, it does not work, which can be seen by running the following test?
testdata = xor
trainer.testOnData(testdata, verbose = True) # Works if you are lucky!
Kristina Striegnitz code has written and published an XOR example that works more reliably. The code is effectively reproduced below, in case the original should disappear:
# ... continued from above
# Create a recurrent neural network with four hidden nodes (default is SigmoidLayer)
net = buildNetwork(2, 4, 1, recurrent = True)
# Train the network using arguments for learningrate and momentum
trainer = BackpropTrainer(net, xor, learningrate = 0.01, momentum = 0.99, verbose = True)
for epoch in range(1000):
# This should work every time...
trainer.testOnData(testdata, verbose = True)
Fing logger (finglogger.sh):
# append current public ip
echo `date +"%Y/%m/%d %T"`";publicip;"`curl -s ipecho.net/plain`";;;" >> $FING_LOG_FILE
# append current fing output
/usr/bin/fing -r1 -o log,csv,$FING_LOG_FILE,1000 --silent
Add to cron (run every hour):
0 * * * * /path/to/finglogger.sh