How to merge two disjoint random samples?

The problem: Given two random samples, s1 and s2, of size k over two disjoint populations, p1 and p2, how to combine the two k-sized random samples into one k-sized random sample over p1 ∪ p2?

The solution: k times, draw an element s1 ∪ s2; with probability d1 = |p1| / |p1 ∪ p2|, draw the next element from p1; with probability d2 = 1 – d1 draw the next element from p2.

(the solution was on stackoverflow)

In python:

import random
import numpy
 
# sizes
e1 = 1000
e2 = 1000000
 
# populations
p1 = xrange(e1)
p2 = xrange(e1, e2)
 
# sample size
k = 500
 
# random samples
s1 = random.sample(p1, k)
s2 = random.sample(p2, k)
 
# merge samples
merge = []
for i in range(k):
  if s1 and s2:
    merge.append(s1.pop() if random.random < len(p1) / float(len(p1)+len(p2)) else s2.pop())
  elif s1:
    merge.append(s1.pop())
  else:
    merge.append(s2.pop())
 
# Validate
hist = numpy.histogram(merge, bins=[0,500000,1000000])
# The two bins should be roughly equal, i.e. the error should be small.
print abs(hist[0][0] - hist[0][1]) / float(k)
 
# alternatively, use filter to count values below 500K
print abs(len(filter(lambda x: x<500000, merge)) - 250) / 500.0

Get Weather using JSON web service and Python

Get the current weather for Copenhagen:

import urllib2
import json
 
# hent vejret for Koebenhavn
url = 'http://api.openweathermap.org/data/2.5/weather?q=Copenhagen,dk'
response = urllib2.urlopen(url)
 
# parse JSON resultatet
data = json.load(response)
print 'Weather in Copenhagen:', data['weather'][0]['description']

Writing a parser in Python

This is my base pattern for writing a parser in Python by using the pyparsing library. It is slightly more complicated than a hello world in pyparsing, but I think it is more useful as a small example of writing a parser for a real grammar.

A base class PNode is used to provide utility functions to classes implementing parse tree nodes, e.g. turning a parse tree into the original string (except all whitespace is replaced by single space). It assumes that tokens in the input where separated by whitespace, and that all whitespace is the same.

For a particular grammar, I use Python classes to represent nodes in the parse tree; these classes get created by calling the setParseAction method on the corresponding BNF element. I like having these classes because it adds a nice structure to the parse tree.

from pyparsing import *
 
class PNode(object):
    """Base class for parser elements"""
    def __init__(self, tokens):
        super(PNode, self).__init__()
        self.tokens = tokens
 
    def __str__(self):
        return u" ".join(map(lambda x: unicode(x), self.tokens))
 
    def __repr__(self):
        return self.__str__()
 
# Target classes
 
class Integer(PNode):
    def __init__(self, tokens):
        super(Integer, self).__init__(tokens)
        self.value = int(tokens[0])
 
class Comma(PNode):
    def __init__(self, tokens):
        super(Comma, self).__init__(tokens)
 
class IntegerList(PNode):
    def __init__(self, tokens):
        super(IntegerList, self).__init__(tokens)
        self.integers = filter(lambda x: type(x) == Integer, tokens)
        #pdb.set_trace()
        #self.foo = 'bar'
 
# BNF
 
comma = Literal(',').setParseAction(Comma)
integer = Word(nums).setParseAction(Integer)
integer_list = (integer + ZeroOrMore(comma + integer)).setParseAction(IntegerList)
 
bnf = integer_list
bnf += StringEnd()
 
# Try parser
 
parsed_list = bnf.parseString('1,2,3')[0]
 
print parsed_list

Geocoding Python function for PostgreSQL

Gratefully making use of what others have provided, i.e. geopy, Google and plpythonu.

Type to hold result of geocoding:

CREATE TYPE geocoding AS (
  place text,
  latitude DOUBLE PRECISION,
  longitude DOUBLE PRECISION
);

Function that does the actual geocoding (to be extended with more vendors. Hint: look at geopy wiki). Takes an (arbitrary) input string to be geocoded:

CREATE OR REPLACE FUNCTION python_geocode
(
  input text,
  vendor text DEFAULT 'google'
) RETURNS SETOF geocoding AS
$$
  import time
  from geopy import geocoders
  # https://code.google.com/p/geopy/wiki/GettingStarted
 
  time.sleep(0.2)
  # TODO: Add other available vendors, e.g. Yahoo.
  if vendor.lower() == 'google':
    geocoder = geocoders.GoogleV3()
  else:
    raise ValueError("Invalid geocoder: %s" % vendor)
  try:
    for res in geocoder.geocode(input, exactly_one=False):
      yield {'place': res[0], 'latitude': res[1][0], 'longitude': res[1][1]}
  except:
    pass
$$ LANGUAGE plpythonu VOLATILE;

Example:

SELECT place, ST_SetSRID(ST_MakePoint(longitude, latitude), 4326)
FROM python_geocode('Kostas');

Things related to Docker

Docker is a cool idea and open-source product, that seems to be taking the tech community by storm. Wired will tell you why it is cool in a story titled The Man Who Built a Computer the Size of the Internet.

The short version goes: Docker is a way to deploy and move applications with dependencies between Linux servers, using a container concept. The idea is similar to how applications are installed on a Mac, i.e. “everything in a single package”.

There are a number of supporting and related technologies, which I will now list:

  • Google Borg/Apache Mesos are a related technologies, and perhaps Borg is the original rolemodel for Docker. Borg is apparently being replaced by a new system codenamed Omega (video). According to a Wired Story, it influenced Twitter to develop Mesos (originally developed by researchers at the University of California at Berkeley), now Apache Mesos, to do a similar thing as Borg. It might be fair to say that Docker is an easy version of Borg/Mesos/Omega, for non-geniuses (people generally hired by Google, Twitter etc).
  • CoreOS is a supporting technology, an OS designed for deploying containers such as Docker. As mentioned in Wired, the project is based on Google’s ChromeOS. According to the website of this operating system, CoreOS is “Linux kernel + systemd. That’s about it.”

This is it for now about Docker. Just heard about it a few hours ago in an email from a friend and supervisor.

Hello GNU profiling

The profiling tool in GNU is called gprof. Here is a short, boring example of how to use it.

1) Write hello world in C (hello.c)

#include <stdio.h>
 
int foo() {
  int b = 54324;
  int j;
  for (j=0; j < 1000000; j++) {
    b = b^j;
  }
  return b;
}
 
int main() {
  int a = 321782;
  int i;
  for(i=0; i<1000; i++) {
    a = a ^ foo();
  }
  printf("Hello foo: %d\n", a);
  return 0;
}
 
}

2) Compile with -pg option

gcc -pg hello.c

3) Run the program to generate profiling information

./a.out # this generates gmon.out file

4) Run gprof on the program and read output in less:

gprof a.out gmon.out | less