How to scrape images from the web

I’m interested in object detection and other computer vision tasks. For example, I’m working on a teddy-bear detector with my son.

So, how do you quickly download images for a certain category? You can use this approach that I learned from a course on Udemy.

# pip install icrawler
from icrawler.builtin import GoogleImageCrawler

keywords = ['cat', 'dog']
for keyword in keywords:
    google_crawler = GoogleImageCrawler(
        parser_threads=2,
        downloader_threads=4,
        storage={'root_dir': 'images/{}'.format(keyword)}
    )
    google_crawler.crawl(
        keyword=keyword, max_num=10, min_size=(200, 200))

In the above example, the crawler will find images in two categories, cats and dogs, as if you had searched for ‘cat’ and ‘dog’ on Google Images and downloaded what you found.

Let’s walk through the parameters used in the code. First, there is the constructor, which is called with three arguments in the example. The most important parameter is storage, which specifies where the images will be stored. Second, there is the call to the crawl method: the max_num parameter specifies that at most 10 images per category should be downloaded, and the min_size argument requires the images to be at least 200 x 200 pixels.
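After the run, the downloads for each keyword end up in their own directory, images/cat and images/dog, as configured via root_dir; icrawler typically saves them as sequentially numbered files.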

That’s it. Happy downloading.

How to fill missing dates in Pandas

Create a pandas dataframe with a date column:

import pandas as pd
import datetime
 
TODAY = datetime.date.today()
ONE_WEEK = datetime.timedelta(days=7)
ONE_DAY = datetime.timedelta(days=1)
 
df = pd.DataFrame({'dt': [TODAY-ONE_WEEK, TODAY-3*ONE_DAY, TODAY], 'x': [42, 45,127]})

The dates have gaps:

           dt    x
0  2018-11-19   42
1  2018-11-23   45
2  2018-11-26  127

Now, fill in the missing dates:

r = pd.date_range(start=df.dt.min(), end=df.dt.max())
df.set_index('dt').reindex(r).fillna(0.0).rename_axis('dt').reset_index()

Voila! The dataframe no longer has gaps:

          dt      x
0 2018-11-19   42.0
1 2018-11-20    0.0
2 2018-11-21    0.0
3 2018-11-22    0.0
4 2018-11-23   45.0
5 2018-11-24    0.0
6 2018-11-25    0.0
7 2018-11-26  127.0
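
The fillna(0.0) call replaces the NaN values that reindex introduces for the new rows. If you would rather carry the last known value forward instead of filling with zeros, the same pipeline works with a forward fill:

df.set_index('dt').reindex(r).ffill().rename_axis('dt').reset_index()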

Cosine similarity in Python

Cosine similarity is the normalised dot product between two vectors. I guess it is called “cosine” similarity because the dot product equals the product of the Euclidean magnitudes of the two vectors and the cosine of the angle between them, so normalising by the magnitudes leaves just the cosine. If you want, read more about cosine similarity and dot products on Wikipedia.
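
In symbols, for two vectors a and b:

cos_sim(a, b) = a · b / (‖a‖ ‖b‖)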

Here is how to compute cosine similarity in Python, either manually (well, using numpy) or using a specialised library:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
 
# vectors
a = np.array([1,2,3])
b = np.array([1,1,4])
 
# manually compute cosine similarity
dot = np.dot(a, b)
norma = np.linalg.norm(a)
normb = np.linalg.norm(b)
cos = dot / (norma * normb)
 
# use library, operates on sets of vectors
aa = a.reshape(1,3)
ba = b.reshape(1,3)
cos_lib = cosine_similarity(aa, ba)
 
print(
    dot,
    norma,
    normb,
    cos,
    cos_lib[0][0]
)

The two values may differ slightly in the last few decimal places due to floating-point rounding. On my computer I get:

  • 0.9449111825230682 (manual)
  • 0.9449111825230683 (library)

Terms used in shipping

Now that I work in shipping, it is necessary to learn a bunch of new terms. Shipping is regulated under Admiralty Law and there are traditional documents and parties involved. Knowing what these are is crucial to understanding shipping.

Legal documents

There are three key documents involved with shipping:

  • Bill of lading
  • Charter-party
  • Marine insurance policy

Parties

There are quite a few parties involved in shipping:

  • Carrier
  • Charterer
  • Consignee
  • Consignor
  • Shipbroker
  • Ship-manager
  • Ship-owner
  • Shipper
  • Stevedore

How to sample from softmax with temperature

Here is how to sample from a softmax probability vector at different temperatures.
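
Sampling at temperature T amounts to rescaling the log-probabilities and re-normalising: each probability p_i is replaced by

p_i(T) = exp(log(p_i) / T) / Σ_j exp(log(p_j) / T)

which is the same as raising each p_i to the power 1/T and re-normalising. As T approaches 0, the largest probability dominates; at T = 1, the original distribution is recovered.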

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
 
mpl.rcParams['figure.dpi']= 144
 
trials = 1000
softmax = [0.1, 0.3, 0.6]
 
def sample(softmax, temperature):
    EPSILON = 10e-16  # to avoid taking the log of zero
    preds = np.asarray(softmax).astype('float64') + EPSILON
    # rescale the log-probabilities by the temperature and re-normalise
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # draw a single sample from the resulting distribution
    probas = np.random.multinomial(1, preds, 1)
    return probas[0]
 
temperatures = [(t or 1) / 100 for t in range(0, 101, 10)]  # 0.01, then 0.1 to 1.0
probas = [
    np.asarray([sample(softmax, t) for _ in range(trials)]).sum(axis=0) / trials
    for t in temperatures
]
 
sns.set_style("darkgrid")
plt.plot(temperatures, probas)
plt.show()

Notice how the probabilities change at different temperatures. The softmax probabilities are [0.1, 0.3, 0.6]. At the lowest temperature of 0.01, the dominant index (value 0.6) is sampled with near 100% probability. At higher temperatures, the selection probabilities move towards the softmax values, e.g. 60% probability for the third index.

How to display a Choropleth map in Jupyter Notebook

Here is the code:

%matplotlib inline
import geopandas as gpd
import matplotlib as mpl  # make rcParams available (optional)
mpl.rcParams['figure.dpi']= 144  # increase dpi (optional)
 
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
world = world[world.name != 'Antarctica']  # remove Antarctica (optional)
world['gdp_per_person'] = world.gdp_md_est / world.pop_est
g = world.plot(column='gdp_per_person', cmap='OrRd', scheme='quantiles')
g.set_facecolor('#A8C5DD')  # make the ocean blue (optional)

Here is what the map looks like:

Dependencies:

pip install matplotlib
pip install geopandas
pip install pysal  # for scheme option
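
Note that in more recent versions of geopandas the scheme option is backed by mapclassify, so if the plot call complains about a missing module, try pip install mapclassify instead.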

(Integer) Linear Programming in Python

Step one:

brew install glpk
pip install pulp

Step two:

from pulp import * 
 
prob = LpProblem("test1", LpMinimize) 
 
# Variables 
x = LpVariable("x", 0, 4, cat="Integer") 
y = LpVariable("y", -1, 1, cat="Integer") 
z = LpVariable("z", 0, cat="Integer") 
 
# Objective 
prob += x + 4*y + 9*z 
 
# Constraints 
prob += x+y <= 5 
prob += x+z >= 10 
prob += -y+z == 7 
 
GLPK().solve(prob) 
 
# Solution 
for v in prob.variables():
    print(v.name, "=", v.varValue)

print("objective=", value(prob.objective))

In the documentation there are further examples, e.g. one to minimise the cost of producing cat food.

Things that are visible from space, the Garzweiler Surface Mine

I was looking at aerial photos of north-western Europe in Google Maps when I noticed a big white dot on the map!

I thought, what the hell? To satisfy my curiosity I decided to zoom in for further investigation.

It turns out that the big white dot is a giant surface mine. The 48 km² mine is operated by RWE and used for mining lignite, also known as brown coal.

Fun fact: 50% of Greece’s power supply and 27% of Germany’s come from burning lignite. Lignite also has innovative uses in farming and drilling.

Isn’t the geometric juxtaposition of farmland, urban area and surface mine quite enchanting? To get a sense of the scale, take a look at the size of cars next to the big heavy machine; then try to find the big heavy machine on the zoomed out image.

Here is a video that displays the grotesque beauty of the place…

Create a European city map with population density

Datasets:

– Urban morphological zones 2000 (EU): https://www.eea.europa.eu/data-and-maps/data/urban-morphological-zones-2000-2
– Population count (World): http://sedac.ciesin.columbia.edu/data/set/gpw-v4-population-count-rev10/
– Administrative regions (World): http://gadm.org/

The map is European since the “urban” data from the European Environment Agency (EEA) only covers Europe.

Caveats

The UMZ data ended up in PostGIS with SRID 900914. You can use prj2epsg.org to convert the contents of a .prj file to an estimated SRID code. In this case, the UMZ .prj file has the following contents:

PROJCS["ETRS89_LAEA_Europe",GEOGCS["GCS_ETRS_1989",DATUM["D_ETRS_1989",SPHEROID["GRS_1980",6378137,298.257222101]],PRIMEM["Greenwich",0],UNIT["Degree",0.017453292519943295]],PROJECTION["Lambert_Azimuthal_Equal_Area"],PARAMETER["latitude_of_origin",52],PARAMETER["central_meridian",10],PARAMETER["false_easting",4321000],PARAMETER["false_northing",3210000],UNIT["Meter",1]]

This translates to EPSG:3035 (ETRS89_LAEA_Europe).
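
To apply the fix in practice, one option is to declare the correct CRS and reproject when loading the layer. Here is a minimal sketch with GeoPandas, assuming the UMZ polygons live in a file called umz2000.shp (the filename is illustrative):

import geopandas as gpd

# load the UMZ polygons (path is hypothetical)
umz = gpd.read_file('umz2000.shp')

# declare the CRS identified via prj2epsg.org ...
umz = umz.set_crs(epsg=3035, allow_override=True)

# ... and reproject to WGS84 (EPSG:4326) for use with web maps
umz_wgs84 = umz.to_crs(epsg=4326)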