Easy parallel HTTP requests with Python and asyncio

Python 3, and in particular Python 3.5 with its async/await syntax, has brought asynchronous programming to the language. While async can make code harder to read, there is a good and easy-to-understand use case for it that I use often: executing a batch of HTTP requests in parallel.

Before we look at asynchronous requests, let us look at the sequential case. This will give us something to compare with later. The code listing below is an example of how to make twenty synchronous HTTP requests:

# Example 1: synchronous requests
import requests

num_requests = 20

responses = [
    requests.get('http://example.org/')
    for i in range(num_requests)
]

How does the total completion time develop as a function of num_requests? The chart below shows my measurements. The curve is unsurprisingly linear:

Using synchronous requests, I was able to execute just five requests per second in my experiment.
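
If you want to reproduce the measurements, a simple timing loop is enough. Here is a minimal sketch (not the exact script behind the charts) that times the synchronous case with time.perf_counter; the helper name is just for illustration:

# Sketch: time synchronous requests for a given batch size
import time
import requests

def time_sync_requests(num_requests, url='http://example.org/'):
    start = time.perf_counter()
    for _ in range(num_requests):
        requests.get(url)
    return time.perf_counter() - start

for n in (1, 5, 10, 20):
    print(n, time_sync_requests(n))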

Next, let us see how we can make asynchronous HTTP requests with the help of asyncio. The code listing below is an example of how to make twenty asynchronous HTTP requests:

# Example 2: asynchronous requests
import asyncio
import requests

async def main():
    loop = asyncio.get_event_loop()
    futures = [
        loop.run_in_executor(
            None, 
            requests.get, 
            'http://example.org/'
        )
        for i in range(20)
    ]
    for response in await asyncio.gather(*futures):
        pass

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

The code is now more convoluted. But is it better? In the chart below, you will see the total completion time in seconds as a function of the number of asynchronous requests made.

The stepwise curve indicates that some requests are being executed in parallel. However, the curve is still asymptotically linear. Let’s find out why.

Increasing the number of threads

Notice the step pattern in the chart? The completion time is 1x for 1-5 requests, 2x for 6-10 requests, 3x for 11-15 requests, and so on. The reason we see this step pattern is that the default executor had an internal pool of just five worker threads on my machine, so while five requests can be executed in parallel, any remaining requests have to wait in line for a thread to become available.

So here is an idea. To minimize the total completion time, we could increase the size of the thread pool to match the number of requests we have to make. Luckily, this is easy to do as we will see next. The code listing below is an example of how to make twenty asynchronous HTTP requests with a thread pool of size twenty:

# Example 3: asynchronous requests with larger thread pool
import asyncio
import concurrent.futures
import requests

async def main():

    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:

        loop = asyncio.get_event_loop()
        futures = [
            loop.run_in_executor(
                executor, 
                requests.get, 
                'http://example.org/'
            )
            for i in range(20)
        ]
        for response in await asyncio.gather(*futures):
            pass


loop = asyncio.get_event_loop()
loop.run_until_complete(main())

Let us rerun the experiment from before.

Bingo. If the executor has a pool of 20 threads then 20 requests take roughly the same amount of time as 1 request.

The next obvious question to ask is: what is the limit of this approach? How does the total completion time grow for a more substantial number of asynchronous requests, with an equal number of threads in the executor pool, say 100 or 1000? Let’s find out.

We are back to linear – the flat kind. The chart shows that we can easily make 500+ asynchronous HTTP requests in the same time it takes to make 10 synchronous requests – or 300 requests per second. Surely a vast improvement. You might have noticed that the chart only goes to 720 requests, not 1000. The reason is that “something bad” happened when I reached 720 requests/threads and my timing program became very sad and stopped working. On a different computer, this limit will likely lie somewhere else.

Conclusion

First, you have seen that with asynchronous requests you can execute more requests per second than with synchronous requests: 300 requests per second compared to just 5 requests per second (in my experiments).

Second, you have seen how to achieve this well-known benefit of async I/O in Python 3.5 using just slightly convoluted code.

I would like to apologize to the people who host example.org for hammering their website during my experiment. I like to think that it was for a good cause.

(Tentative)

The symbiosis between humans and AI could transform humans into rational organisms (cf. Daniel Kahneman, who has shown that humans on their own are not rational organisms). How so? Our minute behavior is increasingly being tracked in all important areas of life. Artificial intelligence is getting better and better at estimating whether we are happy, healthy and wealthy from an analysis of all the traces we leave behind everywhere. We are now in a situation where we can (or soon will be able to) ask questions such as: how happy, healthy and wealthy was person X at time t? Which actions h1, h2, h3, … had person X taken (e.g. on Spotify, travel, job changes, doctor visits) leading up to that moment? How happy will X be at time t+1, t+10, t+1000 if everything continues as it is now? Which actions should X take to maximize his or her happiness at time t+1000?
In other words, there are complex areas of life where complex AI has the potential to maximize our long-term utility (e.g. our “joy of life” or our wealth 10 years from now). Imagine that a personal AI could
– Find your next home
– Find a school or leisure activity for your child
– Find investment opportunities
– Find love
– Find friends
– Find your next meal
– etc.

How to select top-k items from each group in SQL

Here is an analytical query that you (and I) will often need if you work in e-commerce, marketing or a similar domain. It answers the question: within each group of items (e.g. partitioned by territory, age group or something else), what are the top-k items according to some utility function over the items (e.g. the number of units sold)?

Assume we have a table, aggregated_sales(id INT, item TEXT, country TEXT, total_sold INT). Here is a sample:

ID  ITEM              COUNTRY  TOTAL_SOLD
1   Basketball        DK       125
2   Basketball        US       432943
3   Leather football  FO       64773 
4   Leather football  DK       56230
5   Boxing gloves     CU       9812

In SQL, here is how to get the top-3 selling items by country:

SELECT id, item, country, total_sold
FROM (
  SELECT
    *,
    row_number() OVER (PARTITION BY country ORDER BY total_sold DESC) as rn
  FROM aggregated_sales
) t
WHERE rn <= 3;

Bonus info: the table sample was generated from a CSV file with the command column -s , -t filename.csv.

How to get structured Wikipedia data via DBPedia

Wikipedia contains a wealth of knowledge. While some of that knowledge consists of natural language descriptions, a rich share of information on Wikipedia is encoded in machine-readable format, such as “infoboxes” and other specially formatted parts. An infobox is rendered as a table that you typically see on the right-hand side of an article.

Infobox

While you could download the page source for a Wikipedia article and extract the information yourself, there is a project called DBPedia that has done the hard work for you. That’s right: you can conveniently retrieve machine-readable data that stems from Wikipedia via the DBPedia API.

Example

Let us explore the DBPedia API by way of an example.

I like tennis data and most player pages on Wikipedia have an infobox that contains basic information about a player, such as age, hand, and current singles rank. Let’s try to retrieve information about the Italian tennis player, Matteo Donati, via the JSON resource exposed by DBPedia:

http://dbpedia.org/data/Matteo_Donati.json

In this example, we will fetch and process the JSON data with a small Python script.

# Python
import requests
 
data = requests.get('http://dbpedia.org/data/Matteo_Donati.json').json()
matteo = data['http://dbpedia.org/resource/Matteo_Donati']
 
# matteo is a dictionary with lots of keys
# that correspond to the player's properties.
# Each value is a list of dictionaries itself.
 
height = matteo['http://dbpedia.org/ontology/height'][0]['value']
# 1.88  (float)
birth_year = matteo['http://dbpedia.org/ontology/birthYear'][0]['value']
# '1995'  (string)
hand = matteo['http://dbpedia.org/ontology/plays'][0]['value']
# 'Right-handed (two-handed backhand)'  (string)
singles_rank = matteo['http://dbpedia.org/property/currentsinglesranking'][0]['value']
# 'No. 171'  (string)

The simple convention for URLs on DBPedia is that spaces in names are replaced by underscores, exactly like on Wikipedia. For example, if we wanted to look up Roger Federer, we would make a request to the resource:

http://dbpedia.org/data/Roger_Federer.json

Please note that, at the time of writing, DBPedia does not support HTTPS.
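
If you build these URLs programmatically, a tiny helper is enough. The function below is my own sketch based on the convention above (the name is made up, it is not part of any DBPedia client library):

def dbpedia_data_url(title):
    # Spaces become underscores, exactly like on Wikipedia
    return 'http://dbpedia.org/data/{}.json'.format(title.replace(' ', '_'))

dbpedia_data_url('Roger Federer')
# 'http://dbpedia.org/data/Roger_Federer.json'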

Redundancy and inconsistency

The data on Matteo Donati and other entities on DBPedia is both redundant and somewhat inconsistent. This can be seen if we enumerate the keys on Matteo Donati:

for key in sorted(matteo): print(key)
"""
http://dbpedia.org/ontology/Person/height
http://dbpedia.org/ontology/abstract
http://dbpedia.org/ontology/birthDate
http://dbpedia.org/ontology/birthPlace
http://dbpedia.org/ontology/birthYear
http://dbpedia.org/ontology/careerPrizeMoney
http://dbpedia.org/ontology/country
http://dbpedia.org/ontology/height
http://dbpedia.org/ontology/plays
http://dbpedia.org/ontology/residence
http://dbpedia.org/ontology/thumbnail
http://dbpedia.org/ontology/wikiPageID
http://dbpedia.org/ontology/wikiPageRevisionID
http://dbpedia.org/property/birthDate
http://dbpedia.org/property/birthPlace
http://dbpedia.org/property/caption
http://dbpedia.org/property/careerprizemoney
http://dbpedia.org/property/currentdoublesranking
http://dbpedia.org/property/currentsinglesranking
http://dbpedia.org/property/dateOfBirth
http://dbpedia.org/property/doublesrecord
http://dbpedia.org/property/doublestitles
http://dbpedia.org/property/highestdoublesranking
http://dbpedia.org/property/highestsinglesranking
http://dbpedia.org/property/name
http://dbpedia.org/property/placeOfBirth
http://dbpedia.org/property/plays
http://dbpedia.org/property/residence
http://dbpedia.org/property/shortDescription
http://dbpedia.org/property/singlesrecord
http://dbpedia.org/property/singlestitles
http://dbpedia.org/property/updated
http://dbpedia.org/property/usopenresult
http://dbpedia.org/property/wimbledonresult
http://purl.org/dc/elements/1.1/description
http://purl.org/dc/terms/subject
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://www.w3.org/2000/01/rdf-schema#comment
http://www.w3.org/2000/01/rdf-schema#label
http://www.w3.org/2002/07/owl#sameAs
http://www.w3.org/ns/prov#wasDerivedFrom
http://xmlns.com/foaf/0.1/depiction
http://xmlns.com/foaf/0.1/givenName
http://xmlns.com/foaf/0.1/isPrimaryTopicOf
http://xmlns.com/foaf/0.1/name
http://xmlns.com/foaf/0.1/surname
"""

You’ll notice that, e.g., the height of Matteo Donati is stored under two different keys:

  • http://dbpedia.org/ontology/Person/height
  • http://dbpedia.org/ontology/height

Luckily, both keys list Donati’s height as 1.88 m, albeit as a string and a numeric type, respectively. Other bits of information that are redundantly stored include his birth date, dominant hand (“plays”) and career prize money won so far.

With redundancy comes the possibility for inconsistency. In other words, there is no guarantee that redundant keys will keep identical values. For example, Matteo Donati is listed both as ‘Right-handed (two-handed backhand)’ and simply as ‘Right-handed’. While in this case the inconsistency is merely a matter of information detail, it can get a little confusing in general.
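
One way to guard against missing and redundant keys is to try a list of candidate keys in order and fall back to a default. The helper below is my own sketch, not part of the DBPedia API:

def first_value(entity, candidate_keys, default=None):
    # Return the value of the first candidate key that is present and non-empty
    for key in candidate_keys:
        if entity.get(key):
            return entity[key][0]['value']
    return default

height = first_value(matteo, [
    'http://dbpedia.org/ontology/height',
    'http://dbpedia.org/ontology/Person/height',
])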

Conclusion

DBPedia is a great way to access structured data from Wikipedia articles. While the information is machine-readable in a popular format, you will have to guard against missing keys, redundant keys and inconsistent values. I hope you enjoyed this quick introduction to DBPedia and that you will find good use for the information.

How to compute the pagerank of almost anything

Whenever two things have a directional relationship to each other, you can compute the pagerank of those things. For example, you can observe directional relationships between web pages that link to each other, scientists who cite each other, and chess players who beat each other. The relationship is directional because it matters in which direction the relationship points, e.g. who lost to whom in chess.

Intuitively, you may think of a directional relationship as a transfer of some abstract value between two parties. For example, when one chess player loses to another, the value (i.e. relative skill level) of the winner increases and the value of the loser decreases. Furthermore, the amount of value that is transferred depends on the starting value of each party. If a master loses to a novice, the relative skill level of the novice will increase dramatically. Conversely, if a novice loses to a master, that is to be expected, and the relative skill level of each player should remain roughly the same as before – the status quo.

Below you’ll see an illustration of a small graph with seven nodes and seven edges. The pagerank of each node is illustrated by shading it, where a darker color denotes a higher rank.

If you study this figure, you should notice that:

  • Nodes 1 through 4 all have low rank, because no other nodes point to them
  • Node 5 has a medium rank, because a low-rank node points to it
  • Node 6 has high rank, because many low-rank nodes point to it
  • Node 7 has the highest rank, because a high-rank node points to it, while it points to nothing

Compute pagerank with Python

The pageranks of the nodes in the example graph (see figure above) were computed in Python with the help of the networkx library, which can be installed with pip: pip install networkx. The code that creates the graph and computes pagerank is listed below:

import networkx as nx
 
# Initialize directed graph
G = nx.DiGraph()
 
# Add edges (implicitly adds nodes)
G.add_edge(1,6)
G.add_edge(2,6)
G.add_edge(3,6)
G.add_edge(4,6)
G.add_edge(5,6)
G.add_edge(4,5)
G.add_edge(6,7)
 
# Compute pagerank (keys are node IDs, values are pageranks)
pr = nx.pagerank(G)
"""
{
  1: 0.06242340798778012, 
  2: 0.06242340798778012, 
  3: 0.06242340798778012, 
  4: 0.06242340798778012, 
  5: 0.08895357136701444, 
  6: 0.32374552689540625, 
  7: 0.33760726978645894
}
"""

Notice that each node is represented by an integer ID, with no specific semantics tied to the nodes or the edges. In other words, the graph could equally well represent relationships between web pages, scientists or chess players (or something else entirely).

If your relationships can be assigned weights, e.g. the strength of a victory in chess or the prominence of a link on a web page, then you can add weights to the edges in the graph. Luckily, weighted edges can be easily added in networkx:

G.add_edge(1, 2, weight=0.5)
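
The pagerank computation will pick up these weights; by default, networkx looks for an edge attribute named weight. Here is a minimal sketch on a small, separate graph:

# Build a small weighted graph and compute weighted pagerank
G = nx.DiGraph()
G.add_edge(1, 2, weight=0.5)
G.add_edge(3, 2, weight=2.0)

pr = nx.pagerank(G, weight='weight')  # 'weight' is the default attribute name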

Dealing with time

You may ask yourself: should a chess game that took place last year impact a player’s rank as much as a game that was won or lost just last week? In many situations, the most meaningful answer would be no. A good way to represent the passing of time in a relationship graph is to use edge weights that decay over time by some function, so that relationships formed a long time ago carry much less weight than recently formed ones. A simple choice is an inverse decay, which can be achieved in Python with the ** operator and a negative exponent:

time_decayed_weight = max(.00001, time_passed) ** -1
G.add_edge(1, 2, weight=time_decayed_weight)

We use the trick max(.00001, time_passed) to ensure that we do not raise zero to the power of a negative number. The unit of time passed depends on the domain, and is not essential to the computation. For example, the unit could be milliseconds, years or millennia.
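
If you prefer a true exponential decay instead of the inverse decay above, math.exp does the job; the decay rate below is an arbitrary choice for illustration and should be tuned to your domain and time unit:

import math

decay_rate = 0.1  # illustrative choice
time_decayed_weight = math.exp(-decay_rate * time_passed)
G.add_edge(1, 2, weight=time_decayed_weight)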

To be continued…

Quick introduction to RabbitMQ and Celery

I like to code in Python. I also like the concept of asynchronous workers to build loosely coupled applications. Luckily, RabbitMQ and Celery can help me do exactly that.

This post is based on a very nice YouTube video by Max Mautner.

For easy repeatability, I have transcribed the video in this article, with some minor changes for Python 3.

Install RabbitMQ and Celery

I will assume that you are on a Mac. If not, you will have to think a bit to make the instructions work.

First, let us install RabbitMQ if you don’t already have it.

brew install rabbitmq
 
# Add the RabbitMQ tools to your PATH, however you prefer to do that.
# The tools are in /usr/local/sbin
echo 'export PATH="/usr/local/sbin:$PATH"' >> ~/.bash_profile

Next, I recommend that you create a virtual environment for this project and install Celery in there.

# create virtual environment
virtualenv -p python3 p3env
 
# activate environment
source p3env/bin/activate
 
# install celery in the environment
pip install celery
 
# Test that it worked
celery --help

Example application

This example is the simplest possible. It assumes the defaults for Celery, such as using a local RabbitMQ as the message broker. The application will distribute 10000 addition tasks to be executed by Celery.

We will need two files. One file that defines the Celery task (tasks.py) and one for the driver program (driver.py). Of course, you can call these files anything you want. Also, the driver program is just a simple way to push tasks to RabbitMQ (the Celery default), which will later be dequeued by the Celery workers.

First, let’s create tasks.py:

from celery import Celery
 
app = Celery()
 
@app.task
def add(x, y):
    return x + y

Next, let’s create driver.py:

from tasks import add
 
for i in range(10000):
    # The delay function was added by Celery, when we decorated the add function
    add.delay(i, 1)

As you can see, the driver program consists of a loop that calls add.delay(i, 1) 10000 times. We did not explicitly define the delay function. It was added automatically when we decorated the add function with the annotation @app.task. This means that the function call will be pushed to the message broker and executed asynchronously on the workers.
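
Note that with this setup the return values of add are thrown away. If you also want to read results back in the driver, you have to configure a result backend when creating the Celery app; the rpc:// backend below is just one option:

# tasks.py variant with a result backend configured
from celery import Celery

app = Celery(broker='amqp://', backend='rpc://')

@app.task
def add(x, y):
    return x + y

# In the driver you can then wait for a result:
# result = add.delay(2, 3)
# print(result.get(timeout=10))  # 5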

Run example

To run the example, first start the local RabbitMQ server in a new Terminal window:

# Start the message broker
rabbitmq-server

# If this does not work, check that /usr/local/sbin is in your PATH

In another Terminal window, start the Celery workers:

# activate the virtual env in the new window
source p3env/bin/activate
 
# start the workers
celery worker -A tasks -l INFO

Finally, run the driver program to create 10000 tasks:

# activate the virtual env in the new window, if needed
source p3env/bin/activate
 
# run the driver program
python driver.py

Now, in the Terminal window that is running the workers, you should see lines fly by as tasks are being executed asynchronously.

What to do next?

Celery fits a lot of use cases, from web scraping and API consumption to long-running web application tasks. The follow-up video by Max demonstrates a simplified web scraping use case. Like the first video, it is succinct and sufficient for a basic understanding.

How to Become a Web Scraping Pro with Python pt. 1

Scrapy is an excellent Python library for web scraping. For example, you could create an API with data that is populated via web scraping. This article covers some basic scrapy features, such as the shell and selectors.

Install scrapy in virtual environment on your machine:

$ virtualenv venv
$ source venv/bin/activate
$ pip install scrapy

To learn about scrapy, the shell is a good place to start, because it offers an interactive environment where you can try selectors on a concrete web page. Here is how to start the scrapy shell:

$ scrapy shell http://doc.scrapy.org/en/latest/topics/selectors.html

Selectors

Now, try out different selections.

You can select elements on a page with CSS and XPath, and these selectors can be strung together. For example, use CSS to select a tags and XPath to select the href attribute of those tags:

>>> for link in response.css('a').xpath('@href').extract():
...     print(link)
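
You can also do the whole selection in CSS with the ::attr() and ::text pseudo-elements, which is often more compact; this is just another way to write the same kind of extraction:

>>> response.css('a::attr(href)').extract()
>>> response.css('a::text').extract_first()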

Documentation

Now you are ready to head over to the documentation to read more about how to become great at using scrapy. Another tip is to follow the scrapinghub blog.

How to export CSV file to database with Python

Pandas and SQLAlchemy offer powerful conversions between CSV files and tables in databases. Here is a small example:

import pandas as pd
from sqlalchemy import create_engine
 
df = pd.read_csv('mydata.csv')
 
engine = create_engine('sqlite:///mydata.db')
df.to_sql('mytable', engine)
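
Going the other way is just as easy, and to_sql has optional parameters such as if_exists and index that you may want to tweak. A small sketch, assuming the table created above:

# Overwrite the table if it already exists and skip the DataFrame index
df.to_sql('mytable', engine, if_exists='replace', index=False)

# Read the table back into a DataFrame
df2 = pd.read_sql('SELECT * FROM mytable', engine)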


How to use non-default profile in boto3

Given an AWS credentials file that looks like this:

[default]
aws_access_key_id = DEFAULT
aws_secret_access_key = SECRET1
 
[dev]
aws_access_key_id = DEV
aws_secret_access_key = SECRET2
 
[prod]
aws_access_key_id = PROD
aws_secret_access_key = SECRET3

You can use any profile, say dev, like this in Python:

import boto3.session
 
dev = boto3.session.Session(profile_name='dev')
 
s3 = dev.resource('s3')
for bucket in s3.buckets.all():
    print(bucket.name)
print('')
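
The same session also works for low-level clients, if you prefer those over resources:

s3_client = dev.client('s3')
print(s3_client.list_buckets()['Buckets'])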