Cosine similarity in Python

Cosine similarity is the normalised dot product between two vectors. It is called "cosine" similarity because the dot product of two vectors equals the product of their Euclidean magnitudes and the cosine of the angle between them, so the normalised dot product is exactly that cosine. If you want, read more about cosine similarity and dot products on Wikipedia.

Here is how to compute cosine similarity in Python, either manually (well, using numpy) or using a specialised library:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# vectors
a = np.array([1,2,3])
b = np.array([1,1,4])
# manually compute cosine similarity
dot =, b)
norma = np.linalg.norm(a)
normb = np.linalg.norm(b)
cos = dot / (norma * normb)
# use library, operates on sets of vectors
aa = a.reshape(1,3)
ba = b.reshape(1,3)
cos_lib = cosine_similarity(aa, ba)

The two values may differ slightly in the last decimals due to floating-point rounding. On my computer I get:

  • 0.9449111825230682 (manual)
  • 0.9449111825230683 (library)
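As a sanity check, here is a dependency-free version of the same computation using only the standard library (the function name is my own):

```python
from math import sqrt

def cosine(u, v):
    # dot product of u and v
    dot = sum(x * y for x, y in zip(u, v))
    # Euclidean norms of u and v
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

print(cosine([1, 2, 3], [1, 1, 4]))  # ~0.9449111825230682
```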

Terms used in shipping

Now that I work in shipping, it is necessary to learn a bunch of new terms. Shipping is regulated under Admiralty Law and there are traditional documents and parties involved. Knowing what these are is crucial to understanding shipping.

Legal documents

There are three key documents involved with shipping:

  • Bill of lading
  • Charter-party
  • Marine insurance policy

Parties

There are quite a few parties involved in shipping:

  • Carrier
  • Charterer
  • Consignee
  • Consignor
  • Shipbroker
  • Ship-manager
  • Ship-owner
  • Shipper
  • Stevedore

How to sample from softmax with temperature

Here is how to sample from a softmax probability vector at different temperatures.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
mpl.rcParams['figure.dpi']= 144
trials = 1000
softmax = [0.1, 0.3, 0.6]
def sample(softmax, temperature):
    EPSILON = 1e-16  # to avoid taking the log of zero
    softmax = np.asarray(softmax, dtype='float64') + EPSILON
    preds = np.log(softmax) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return probas[0]
temperatures = [(t or 1) / 100 for t in range(0, 101, 10)]
probas = [
    np.asarray([sample(softmax, t) for _ in range(trials)]).sum(axis=0) / trials
    for t in temperatures
plt.plot(temperatures, probas)

Notice how the sampling probabilities change with temperature. The softmax probabilities are [0.1, 0.3, 0.6]. At the lowest temperature, 0.01, the dominant index (value 0.6) is sampled with near 100% probability. At higher temperatures, the selection probabilities move towards the softmax values, e.g. 60% probability for the third index at temperature 1.
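The expected distribution at each temperature can also be computed directly, without sampling, by applying the same temperature rescaling to the probability vector (a small sketch; the helper name is mine):

```python
import numpy as np

def rescale(probs, temperature):
    # take the log, divide by temperature, then re-normalise with softmax
    preds = np.log(np.asarray(probs, dtype='float64')) / temperature
    exp_preds = np.exp(preds)
    return exp_preds / np.sum(exp_preds)

print(rescale([0.1, 0.3, 0.6], 1.0))   # ≈ [0.1, 0.3, 0.6], unchanged
print(rescale([0.1, 0.3, 0.6], 0.01))  # nearly one-hot on the last index
```

At temperature 1 the rescaling is the identity, and as the temperature approaches 0 the mass concentrates on the argmax, which matches the sampled frequencies above.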

How to display a Choropleth map in Jupyter Notebook

Here is the code:

%matplotlib inline
import geopandas as gpd
import matplotlib as mpl  # make rcParams available (optional)
mpl.rcParams['figure.dpi']= 144  # increase dpi (optional)
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
world = world[ != 'Antarctica']  # remove Antarctica (optional)
world['gdp_per_person'] = world.gdp_md_est / world.pop_est
g = world.plot(column='gdp_per_person', cmap='OrRd', scheme='quantiles')
g.set_facecolor('#A8C5DD')  # make the ocean blue (optional)

Here is what the map looks like:


To run the code, first install the dependencies:

pip install matplotlib
pip install geopandas
pip install pysal  # for scheme option

(Integer) Linear Programming in Python

Step one:

brew install glpk
pip install pulp

Step two:

from pulp import *
prob = LpProblem("test1", LpMinimize)
# Variables
x = LpVariable("x", 0, 4, cat="Integer")
y = LpVariable("y", -1, 1, cat="Integer")
z = LpVariable("z", 0, cat="Integer")
# Objective
prob += x + 4*y + 9*z
# Constraints
prob += x+y <= 5
prob += x+z >= 10
prob += -y+z == 7
# Solution
prob.solve()
for v in prob.variables():
    print(, "=", v.varValue)
print("objective =", value(prob.objective))

In the documentation there are further examples, e.g. one to minimise the cost of producing cat food.
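Because the feasible region here is tiny, the solver's answer can be sanity-checked with a brute-force search over the integer domain (I bound z at an arbitrary 20, which is safe because the equality constraint forces z = y + 7):

```python
from itertools import product

best = None
for x, y, z in product(range(0, 5), range(-1, 2), range(0, 21)):
    # keep only points that satisfy all three constraints
    if x + y <= 5 and x + z >= 10 and -y + z == 7:
        obj = x + 4*y + 9*z
        if best is None or obj < best[0]:
            best = (obj, x, y, z)

print(best)  # (54, 4, -1, 6): objective 54 at x=4, y=-1, z=6
```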

Things that are visible from space, the Garzweiler Surface Mine

I was looking at aerial photos of north-western Europe in Google Maps when I noticed a big white dot on the map!

I thought, what the hell? To satisfy my curiosity I decided to zoom in for further investigation.

It turns out that the big white dot is a giant surface mine. The 48 km² mine is operated by RWE and used for mining lignite, also known as brown coal.

Fun fact: 50% of Greece's power supply and 27% of Germany's come from burning lignite. Lignite also has innovative uses in farming and drilling.

Isn't the geometric juxtaposition of farmland, urban area and surface mine quite enchanting? To get a sense of the scale, take a look at the size of cars next to the big heavy machine; then try to find the big heavy machine on the zoomed out image.

Here is a video that displays the grotesque beauty of the place...

Create a European city map with population density


- Urban morphological zones 2000 (EU):
- Population count (World):
- Administrative regions (World):

The map is European since the "urban" data from the European Environmental Agency (EEA) only covers Europe.


The UMZ data ended up in PostGIS with srid 900914. You can use an online conversion service to translate the contents of a .prj file into an estimated SRID code. In this case the UMZ .prj file has the following contents:


Which translates to 3035 - ETRS89_LAEA_Europe.

How to create a world-wide PostgreSQL database of administrative regions

The GADM database contains geographical data for administrative regions, e.g. countries, regions and municipalities. As always, once you have the data in the right format, it is easy to import it into a database. The data is available from GADM in several formats. All data has a coordinate reference system in longitude/latitude with the WGS84 datum.


  1. Download data for the whole world or by country. For a change, I will use the GeoPackage format.
  2. Create a PostgreSQL database (assumed to exist)
  3. Import the data with ogr2ogr (see instructions below)

Import data instructions

Download data (example for Denmark):


Next, create a database called "gadm" on your local PostgreSQL server; of course, you can use another name if you prefer. Then install the PostGIS extension:

create extension postgis;

Finally, use ogr2ogr with the GPKG (GeoPackage) driver to import the data:

ogr2ogr -f PostgreSQL "PG:dbname=gadm" DNK_adm.gpkg

Now the data is imported and ready to be queried.

As a test, we can query the adm2 table (municipalities) with a coordinate inside the municipality of Copenhagen, Denmark.

SELECT name_2, ST_AsText(wkb_geometry)
FROM dnk_adm2
WHERE ST_Intersects(ST_SetSRID(ST_Point(12.563585, 55.690628), 4326), wkb_geometry)
-- AND ST_Point(12.563585, 55.690628) && wkb_geometry

You can view the selected well-known text (WKT) geometry in an online viewer, such as openstreetmap-wkt-playground. Other viewers are listed on stackexchange.

Alternative sources

For this post I really wanted a dataset of populated/urban areas. However, the GADM data I downloaded only contains adm0-adm2, which is a tessellation of the land area, i.e. it cannot be used to discriminate between urban and rural areas.

Other data sources are listed below:


From the rtwilson list, here are some specific datasets that indicate population density and urbanism:

- (does not cover Europe and North America)

How to assess computers on your local area network

I teach children how to program and do other things with technology in an organisation called Coding Pirates in Denmark, which aims to be a kind of scout movement for geeks. A best-seller among the kids is learning how to hack, and I see this as a unique opportunity to convey some basic human values in relation to something that can be potentially harmful.

Yesterday, one of the kids and I played with nmap, the network surveying tool, to investigate our local area network. The aim was to find information about the computers that were attached, such as operating system, system owner's first name (often part of the computer name) and whether any computer had open server ports (SSH, web etc.). We used nmap in combination with Wireshark.

  1. Tell another person about a fun website (any website will do)
  2. Use Wireshark to detect the IP address of any computer that accesses that website
  3. Use nmap to scan the IP address we found: nmap -vS

We also learned how to detect that someone logs into your computer and e.g. kick the person (assume an Ubuntu host):

# Monitor login attempts
tail -f /var/log/auth.log
# See active sessions
who
# List remote sessions
ps fax | grep 'pts/'
# Kill sessions
kill -9 [pid of bash processes connected to session]

Other tricks

List all hosts (ping scan) on your local area network:

nmap -sP 192.168.1.*

Find computers on your local area network that run an SSH server:

nmap -p 22 --open -sV 192.168.1.*
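Under the hood, a port scan like the one above just attempts TCP connections and records which ones succeed. Here is a minimal, illustrative version in Python (checking a single host and port, with no service detection; the function name is my own):

```python
import socket

def port_is_open(host, port, timeout=1.0):
    # try a TCP connection; connect_ex returns 0 on success
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

print(port_is_open("", 22))  # True only if an SSH server is listening locally
```

Looping this over ports or hosts gives you a crude scanner, but nmap is far faster and politer about it.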