How to fill missing dates in Pandas

Create a pandas dataframe with a date column:

import pandas as pd
import datetime
 
TODAY = datetime.date.today()
ONE_WEEK = datetime.timedelta(days=7)
ONE_DAY = datetime.timedelta(days=1)
 
df = pd.DataFrame({'dt': [TODAY-ONE_WEEK, TODAY-3*ONE_DAY, TODAY], 'x': [42, 45,127]})

The dates have gaps:

           dt    x
0  2018-11-19   42
1  2018-11-23   45
2  2018-11-26  127

Now, fill in the missing dates:

r = pd.date_range(start=df.dt.min(), end=df.dt.max())
df.set_index('dt').reindex(r).fillna(0.0).rename_axis('dt').reset_index()

Voila! The dataframe no longer has gaps:

          dt      x
0 2018-11-19   42.0
1 2018-11-20    NaN
2 2018-11-21    NaN
3 2018-11-22    NaN
4 2018-11-23   45.0
5 2018-11-24    NaN
6 2018-11-25    NaN
7 2018-11-26  127.0

Cosine similarity in Python

Cosine similarity is the normalised dot product between two vectors. I guess it is called “cosine” similarity because the dot product is the product of Euclidean magnitudes of the two vectors and the cosine of the angle between them. If you want, read more about cosine similarity and dot products on Wikipedia.

Here is how to compute cosine similarity in Python, either manually (well, using numpy) or using a specialised library:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
 
# vectors
a = np.array([1,2,3])
b = np.array([1,1,4])
 
# manually compute cosine similarity
dot = np.dot(a, b)
norma = np.linalg.norm(a)
normb = np.linalg.norm(b)
cos = dot / (norma * normb)
 
# use library, operates on sets of vectors
aa = a.reshape(1,3)
ba = b.reshape(1,3)
cos_lib = cosine_similarity(aa, ba)
 
print(
    dot,
    norma,
    normb,
    cos,
    cos_lib[0][0]
)

The values might differ a slight bit on the smaller decimals. On my computer I get:

  • 0.9449111825230682 (manual)
  • 0.9449111825230683 (library)

How to display a Choropleth map in Jupyter Notebook

Here is the code:

%matplotlib inline
import geopandas as gpd
import matplotlib as mpl  # make rcParams available (optional)
mpl.rcParams['figure.dpi']= 144  # increase dpi (optional)
 
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
world = world[world.name != 'Antarctica']  # remove Antarctica (optional)
world['gdp_per_person'] = world.gdp_md_est / world.pop_est
g = world.plot(column='gdp_per_person', cmap='OrRd', scheme='quantiles')
g.set_facecolor('#A8C5DD')  # make the ocean blue (optional)

Here is what the map looks like:

Dependencies:

pip install matplotlib
pip install geopandas
pip install pysal  # for scheme option

(Integer) Linear Programming in Python

Step one:

brew install glpk
pip install pulp

Step two:

from pulp import * 
 
prob = LpProblem("test1", LpMinimize) 
 
# Variables 
x = LpVariable("x", 0, 4, cat="Integer") 
y = LpVariable("y", -1, 1, cat="Integer") 
z = LpVariable("z", 0, cat="Integer") 
 
# Objective 
prob += x + 4*y + 9*z 
 
# Constraints 
prob += x+y <= 5 
prob += x+z >= 10 
prob += -y+z == 7 
 
GLPK().solve(prob) 
 
# Solution 
for v in prob.variables():
    print v.name, "=", v.varValue 
 
print "objective=", value(prob.objective)

In the documentation there are further examples, e.g. one to minimise the cost of producing cat food.

How to select top-k items from each group in SQL

Here is an analytical query that you (and I) will often need to do if you work in e-commerce, marketing or similar domain. It answers the question, within each group of items (e.g. partitioned by territory, age groups or something else) what are the top-k items for some utility function over the items (e.g. the number of units sold)?

Assume we have a table, aggregated_sales(id INT, item TEXT, country TEXT, total_sold INT). Here is a sample:

ID  ITEM              COUNTRY  TOTAL_SOLD
1   Basketball        DK       125
2   Basketball        US       432943
3   Leather football  FO       64773 
4   Leather football  DK       56230
5   Boxing gloves     CU       9812

In SQL, here is how to get the top-3 selling items by country:

SELECT id, item, country, total_sold
FROM (
  SELECT
    *,
    row_number() OVER (PARTITION BY country ORDER BY total_sold DESC) as rn
  FROM aggregated_sales
) t
WHERE rn = 1;

Bonus info: the table sample was generated from a CSV file with the command column -s , -t filename.csv.