The problem: Given two random samples, s1 and s2, of size k over two disjoint populations, p1 and p2, how to combine the two k-sized random samples into one k-sized random sample over p1 ∪ p2?
The solution: k times, draw an element s1 ∪ s2; with probability d1 = |p1| / |p1 ∪ p2|, draw the next element from p1; with probability d2 = 1 – d1 draw the next element from p2.
(the solution was on stackoverflow)
In python:
import random
import numpy
# sizes
e1 = 1000
e2 = 1000000
# populations
p1 = xrange(e1)
p2 = xrange(e1, e2)
# sample size
k = 500
# random samples
s1 = random.sample(p1, k)
s2 = random.sample(p2, k)
# merge samples
merge = []
for i in range(k):
if s1 and s2:
merge.append(s1.pop() if random.random < len(p1) / float(len(p1)+len(p2)) else s2.pop())
elif s1:
merge.append(s1.pop())
else:
merge.append(s2.pop())
# Validate
hist = numpy.histogram(merge, bins=[0,500000,1000000])
# The two bins should be roughly equal, i.e. the error should be small.
print abs(hist[0][0] - hist[0][1]) / float(k)
# alternatively, use filter to count values below 500K
print abs(len(filter(lambda x: x<500000, merge)) - 250) / 500.0