Category: Programming

  • How to sort numbers with an evolutionary algorithm (CMA-ES)

    Yes, this is clearly nonsense. Sorting is not a hard problem, and standard algorithms such as quicksort and mergesort run in O(n log n) time (quicksort degrades to O(n^2) in the worst case). But let me scratch this itch of sorting numbers using an evolutionary algorithm, specifically the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Technically, we will use what I think is the original library by the inventor of the method, Nikolaus Hansen.

    In python, we will make use of these two libraries:

    import cma  # pip install cma
    import numpy as np  # pip install numpy
    

    Solving without constraints

    CMA-ES, like other metaheuristics, uses the concept of a fitness function to search for good or optimal solutions to a problem. The algorithm does not need to know the structure of your problem, as all knowledge is encapsulated in the fitness function. The algorithm generates candidate solutions and evaluates them with the fitness function. The fitness of the solutions is used to generate the next batch of solutions, until convergence (fitness = 0).

    While you can define feasibility of solutions by providing constraints, this is not a requirement. Therefore, we will first try to solve the toy problem without constraints.

    Fitness function and initial solution

    For CMA-ES to work, you must provide a fitness function that is used to evaluate solutions. For sorting, we define our fitness function as the Euclidean distance between a solution x and the optimal solution xopt, which is a sorted list.

    Clearly, it is nonsense that we must first have the optimal solution in order to define the fitness function. Why search if we already have the answer? But again, this is just a simple example chosen so we can focus on the method, not the application.

    In addition to a fitness function, you must also provide a seed solution, which we will call x0. The algorithm will start from x0 and search for better solutions using that as a starting point. Conceptually, a bunch of "neighbours" are evaluated in each step and the direction of search is determined by computing their fitness. Most metaheuristics will intensify search in promising neighbourhoods and ignore the less promising ones. You can read more about CMA-ES on Wikipedia.

    Below we will set up the problem by defining the fitness function ff and an initial solution x0.

    # Optimal solution (used in fitness function)
    n = 40
    xopt = np.arange(n).astype(float)
    # [0, 1, ..., 38, 39]
    
    # fitness function, the euclidean distance x -> xopt
    ff = lambda x: np.linalg.norm(xopt-x)
    
    # Initial solution, a random permutation of the optimal solution
    x0 = np.random.permutation(xopt)
    # [26, 16, ..., 38, 12]
    
    # initial standard deviation
    sigma0 = 0.5
    

    Now that we have defined the fitness function, we can forget that we ever knew the optimal solution. It is however embedded in the fitness function. For your own problem, you would have some meaningful way of defining the fitness of a solution. Keep in mind that a value of 0 means perfect fitness, while larger values mean worse fitness.
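
    As a quick sanity check, we can evaluate the fitness function on the two solutions we already have (the exact value for x0 depends on the random permutation, but for n = 40 it is typically on the order of 100):

    print(ff(xopt))  # 0.0, perfect fitness
    print(ff(x0))    # a large positive value, i.e. far from sorted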

    Optimise: using wrapper API

    First, we can optimise using the wrapper functions provided by cma. For some reason, using these wrappers results in slower convergence and I don’t know why. There are several ways to use the wrapper API. Below you see two different ways, which I believe are equivalent:

    # method 1
    es = cma.CMAEvolutionStrategy(x0, sigma0)
    es.optimize(ff)
    xbest = es.result.xbest
    
    # method 2
    xbest, es = cma.fmin2(ff, x0, sigma0)
    
    print(xbest.round(0))
    

    Optimise: using stop-ask-tell

    Next, we will solve the problem without the wrappers. The cma library uses a stop-ask-tell protocol.

    • Stop: returns true if the algorithm has converged
    • Ask: the algorithm returns the current pool of solutions
    • Tell: the user provides a fitness value for each solution

    While the following code is slightly longer, it converges faster for some reason. Again, I don’t know why. It is however an equivalent way to solve the problem and also finds the optimal solution.

    es = cma.CMAEvolutionStrategy(x0, sigma0)
    
    fvals = []  # used for plotting later
    while not es.stop():
        solutions = es.ask()
        fitness = np.array([ff(x) for x in solutions])
        fvals.append(fitness.min())
        es.tell(solutions, fitness)
    
    xbest = es.result_pretty().xbest
    print(xbest.round(0))
    

    The program finds the solution in less than 1 second on my laptop (MacBook Pro M2). Is that impressive? Well, it is to some degree. The solution space is essentially any combination of 40 real numbers, since we did not specify any constraints on the values. You can specify constraints in the CMA library, which is exactly what we will try next.

    Adding constraints

    You may provide constraints to CMA-ES as a vector-valued function, g_i, which defines a solution x as feasible if and only if g_i(x) ≤ 0 for all i. The following code is based on the example notebook from the pycma website. The structure of the code is almost identical to what we had before. The only difference is that we now combine the old fitness function and the new constraint function into a "special" fitness function that is used during optimisation.

    For our problem of sorting numbers, we want to enforce the constraint that each number must be less than or equal to all numbers to the right of it. If you think about it, that is the same as saying the numbers must be sorted. This means that our initial solution, which is just a permutation of sorted numbers, will be infeasible with near 100% probability.

    Modified example that adds a constraint:

    # Create constraint function
    def constraints(x):
        # x_i must be less than or equal to all x's to the right of it,
        # i.e. x_i minus the minimum of the suffix x[i:] must be <= 0
        return [xi - x[i:].min() for i, xi in enumerate(x)]
    
    # Combine old fitness function and constraints
    # This is used in place of the old fitness function
    ffc = cma.ConstrainedFitnessAL(ff, constraints)
    
    es = cma.CMAEvolutionStrategy(x0, sigma0)
    
    while not es.stop():
        solutions = es.ask()
        fitness = np.array([ffc(x) for x in solutions])
        es.tell(solutions, fitness)
        ffc.update(es)  # adapt the augmented Lagrangian coefficients, as in the pycma example
    
    xbest = es.result.xbest
    

    For this particular problem, adding the constraint seemingly does not help the problem converge faster. Maybe it already converges as fast as it can, and the constraint just adds overhead and an initial scramble for a feasible solution?

    Early stopping

    Plotting the fitness value as it evolves over time, it is clear that we could have stopped earlier with a pretty good solution. Maybe a pretty good solution does not make sense for sorting, but it would make sense in many other scenarios, such as financial optimisation, where there is a significant amount of uncertainty.

    Plot fitness over time:

    import matplotlib.pyplot as plt
    
    plt.plot(fvals)
    plt.xlabel('Generation')
    plt.ylabel('Fitness')
    plt.title('Fitness over time')
    plt.show()
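
    If you actually wanted to stop early, one simple way is to break out of the ask-tell loop once the best fitness in a generation drops below a threshold you are happy with. Below is a minimal sketch; the threshold of 1.0 is my own arbitrary choice (pycma also has an ftarget option that, as far as I can tell, achieves much the same):

    es = cma.CMAEvolutionStrategy(x0, sigma0)
    
    good_enough = 1.0  # stop once we are this close to the sorted list
    while not es.stop():
        solutions = es.ask()
        fitness = [ff(x) for x in solutions]
        es.tell(solutions, fitness)
        if min(fitness) < good_enough:
            break  # a "pretty good" solution was found, stop early
    
    print(es.result.xbest.round(0))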
    

  • How to fill missing dates in Pandas

    Create a pandas dataframe with a date column:

    import pandas as pd
    import datetime
    
    TODAY = datetime.date.today()
    ONE_WEEK = datetime.timedelta(days=7)
    ONE_DAY = datetime.timedelta(days=1)
    
    df = pd.DataFrame({'dt': [TODAY-ONE_WEEK, TODAY-3*ONE_DAY, TODAY], 'x': [42, 45, 127]})
    

    The dates have gaps:

               dt    x
    0  2018-11-19   42
    1  2018-11-23   45
    2  2018-11-26  127
    

    Now, fill in the missing dates:

    r = pd.date_range(start=df.dt.min(), end=df.dt.max())
    df.set_index('dt').reindex(r).rename_axis('dt').reset_index()
    

    Voila! The dataframe no longer has gaps:

              dt      x
    0 2018-11-19   42.0
    1 2018-11-20    NaN
    2 2018-11-21    NaN
    3 2018-11-22    NaN
    4 2018-11-23   45.0
    5 2018-11-24    NaN
    6 2018-11-25    NaN
    7 2018-11-26  127.0
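
    The x values for the newly added dates come out as NaN. If you also want to fill those in, you can chain a fill step. A small sketch below; filling with zero is just one choice, and interpolation may fit your data better:

    filled = df.set_index('dt').reindex(r).rename_axis('dt').reset_index()
    filled['x'] = filled['x'].fillna(0.0)  # or: filled['x'].interpolate()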
    
  • Cosine similarity in Python

    Cosine similarity is the normalised dot product between two vectors. I guess it is called “cosine” similarity because the dot product is the product of the Euclidean magnitudes of the two vectors and the cosine of the angle between them, so dividing out the magnitudes leaves just the cosine. If you want, read more about cosine similarity and dot products on Wikipedia.

    Here is how to compute cosine similarity in Python, either manually (well, using numpy) or using a specialised library:

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    
    # vectors
    a = np.array([1,2,3])
    b = np.array([1,1,4])
    
    # manually compute cosine similarity
    dot = np.dot(a, b)
    norma = np.linalg.norm(a)
    normb = np.linalg.norm(b)
    cos = dot / (norma * normb)
    
    # use library, operates on sets of vectors
    aa = a.reshape(1,3)
    ba = b.reshape(1,3)
    cos_lib = cosine_similarity(aa, ba)
    
    print(
        dot,
        norma,
        normb,
        cos,
        cos_lib[0][0]
    )
    

    The values might differ slightly in the last decimals. On my computer I get:

    • 0.9449111825230682 (manual)
    • 0.9449111825230683 (library)
  • How to display a Choropleth map in Jupyter Notebook

    Here is the code:

    %matplotlib inline
    import geopandas as gpd
    import matplotlib as mpl  # make rcParams available (optional)
    mpl.rcParams['figure.dpi']= 144  # increase dpi (optional)
    
    world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
    world = world[world.name != 'Antarctica']  # remove Antarctica (optional)
    world['gdp_per_person'] = world.gdp_md_est / world.pop_est
    g = world.plot(column='gdp_per_person', cmap='OrRd', scheme='quantiles')
    g.set_facecolor('#A8C5DD')  # make the ocean blue (optional)
    

    Here is what the map looks like:

    Dependencies:

    pip install matplotlib
    pip install geopandas
    pip install pysal  # for scheme option
    
  • (Integer) Linear Programming in Python

    Step one:

    brew install glpk
    pip install pulp
    

    Step two:

    from pulp import * 
    
    prob = LpProblem("test1", LpMinimize) 
    
    # Variables 
    x = LpVariable("x", 0, 4, cat="Integer") 
    y = LpVariable("y", -1, 1, cat="Integer") 
    z = LpVariable("z", 0, cat="Integer") 
    
    # Objective 
    prob += x + 4*y + 9*z 
    
    # Constraints 
    prob += x+y <= 5 
    prob += x+z >= 10 
    prob += -y+z == 7 
    
    GLPK().solve(prob) 
    
    # Solution 
    for v in prob.variables():
        print(v.name, "=", v.varValue)
    
    print("objective=", value(prob.objective))
    

    In the documentation there are further examples, e.g. one to minimise the cost of producing cat food.

  • Python script for geocoding a text file

    Assume that you have a file with some locations as text with one location per line.

    For example, here are some school names in Copenhagen, Denmark, stored in schools.csv:

    Hyltebjerg Skole
    Heibergskolen
    Ellebjerg Skole
    Katrinedals Skole
    Peder Lykke Skolen
    Amager Fælled Skole
    Tingbjerg Heldagsskole
    Øster Farimagsgades Skole
    Sankt Annæ Gymnasiums Grundskole
    Lykkebo Skole
    Randersgades Skole
    Strandvejsskolen
    Sortedamskolen
    Grøndalsvængets Skole
    Sølvgades Skole
    Skolen ved Sundet
    Hanssted Skole
    Holbergskolen
    Den Classenske Legatskole
    Tove Ditlevsens Skole
    Lergravsparkens Skole
    Vigerslev Allés Skole
    Bavnehøj Skole
    Ålholm Skole
    Langelinieskolen
    Guldberg Skole
    Husum Skole
    Nyboder Skole
    Vanløse Skole
    Kirkebjerg Skole
    Christianshavns Skole
    Bellahøj Skole
    Kildevældsskolen
    Korsager Skole
    Nørrebro Park Skole
    Utterslev Skole
    Skolen på Islands Brygge
    Brønshøj Skole
    Kirsebærhavens Skole
    Rødkilde Skole
    Vesterbro Ny Skole
    Blågård Skole
    Sønderbro Skole
    Højdevangens Skole
    Oehlenschlægersgades Skole
    Vibenshus Skole
    Valby Skole
    Rådmandsgades Skole
    Lundehusskolen
    Tagensbo Skole
    

    Here is a script, geocode.py, that will attempt to geocode each location in an input stream. It prints CSV output to stdout with the fields input_line, input_line_no, result_no, place, latitude, longitude:

    from geopy import geocoders
    import sys
    import time
    
    geocoder = geocoders.GoogleV3()
    
    SEPARATOR='|'  # can also use tab. Comma is bad, since the place will most likely contain a comma.
    dummy = ['', ['', '']]
    i = 0
    
    header = ['input_line', 'input_line_no', 'result_no', 'place', 'latitude', 'longitude']
    print(SEPARATOR.join(header))
    
    for line in sys.stdin:
        line = line.strip()
        results = geocoder.geocode(line, exactly_one=False) or [dummy]
        for j, res in enumerate(results):
            place = res[0]
            lat = str(res[1][0])
            lon = str(res[1][1])
            out = SEPARATOR.join([line, str(i), str(j), place, lat, lon])
            print(out)
        time.sleep(0.05)
        i += 1
    

    Here is how you might use the script:

    cat schools.csv | python geocode.py 
    

    Tip: you might want to

  • Easy parallel HTTP requests with Python and asyncio

    Python 3.x, and in particular Python 3.5, natively supports asynchronous programming. While asynchronous code can be harder to read than synchronous code, there are many use cases where the added complexity is worthwhile. One such example is executing a batch of HTTP requests in parallel, which I will explore in this post. Additionally, the async-await paradigm used by Python 3.5 makes the code almost as easy to understand as synchronous code.

    Before we look at asynchronous requests, let us look at the sequential case. This will give us something to compare with later. The code listing below is an example of how to make twenty synchronous HTTP requests:

    # Example 1: synchronous requests
    import requests
    
    num_requests = 20
    
    responses = [
        requests.get('http://example.org/')
        for i in range(num_requests)
    ]
    

    How does the total completion time develop as a function of num_requests? The chart below shows my measurements. The curve is unsurprisingly linear:

    Using synchronous requests, I was able to execute just five requests per second in my experiment.

    Next, let us see how we can make asynchronous HTTP requests with the help of asyncio. The code listing below is an example of how to make twenty asynchronous HTTP requests in Python 3.5 or later:

    # Example 2: asynchronous requests
    import asyncio
    import requests
    
    async def main():
        loop = asyncio.get_event_loop()
        futures = [
            loop.run_in_executor(
                None, 
                requests.get, 
                'http://example.org/'
            )
            for i in range(20)
        ]
        for response in await asyncio.gather(*futures):
            pass
    
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    

    The code is now more convoluted. But is it better? In the chart below, you will see the total completion time in seconds as a function of the number of asynchronous requests made.

    The stepwise curve indicates that some requests are being executed in parallel. However, the curve is still asymptotically linear. Let’s find out why.

    Increasing the number of threads

    Notice the step pattern in the chart? The completion time is 1x for 1-5 requests, 2x for 6-10 requests, 3x for 11-15 requests, and so on. The reason we see this step pattern is that the default executor has an internal pool of five threads that execute the work. While five requests can be executed in parallel, any remaining requests have to wait for a thread to become available.

    So here is an idea. To minimize the total completion time, we could increase the size of the thread pool to match the number of requests we have to make. Luckily, this is easy to do as we will see next. The code listing below is an example of how to make twenty asynchronous HTTP requests with a thread pool of twenty worker threads:

    # Example 3: asynchronous requests with larger thread pool
    import asyncio
    import concurrent.futures
    import requests
    
    async def main():
    
        with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    
            loop = asyncio.get_event_loop()
            futures = [
                loop.run_in_executor(
                    executor, 
                    requests.get, 
                    'http://example.org/'
                )
                for i in range(20)
            ]
            for response in await asyncio.gather(*futures):
                pass
    
    
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    

    Let us rerun the experiment from before.

    Bingo. If the executor has a pool of 20 threads then 20 requests take roughly the same amount of time as 1 request.

    The next obvious question to ask is: what is the limit to this approach? How does the total completion time increase for a much larger number of asynchronous requests and an equal number of threads in the executor pool, say 100 or 1000? Let’s find out.

    We are back to linear – the flat kind. The chart shows that we can easily make 500+ asynchronous HTTP requests in the same time it takes to make 10 synchronous requests – or 300 requests per second. Surely a vast improvement. You might have noticed that the chart only goes to 720 requests, not 1000. The reason is that “something bad” happened when I reached 720 requests/threads and my timing program became very sad and stopped working. On a different computer than mine, this limit will likely be found somewhere else.

    Conclusion

    First, you have seen that with asynchronous requests you can execute more requests per second than with synchronous requests: 300 requests per second compared to just 5 requests per second (in my experiments).

    Second, you have seen how to achieve this well-known benefit of async I/O in Python 3.5 using just slightly convoluted code.

    I would like to apologize to the people who host example.org for hammering their website during my experiment. I like to think that it was for a good cause.

  • How to compute the pagerank of almost anything

    Whenever two things have a directional relationship to each other, you can compute the pagerank of those things. For example, you can observe directional relationships between web pages that link to each other, scientists that cite each other, and chess players that beat each other. The relationship is directional because it matters in which direction the relationship points, e.g. who lost to whom in chess.

    Intuitively, you may think of a directional relationship as a transfer of some abstract value between two parties. For example, when one chess player loses to another, the value (i.e. relative skill level) of the winner will increase and the value of the loser will decrease. Furthermore, the amount of value that is transferred depends on the starting value of each party. If a master chess player loses to a novice, then the relative skill level of the novice will dramatically increase. Conversely, if a novice loses to a master, then that is to be expected, and the relative skill level of each player should remain roughly the same as before – the status quo.

    Below you’ll see an illustration of a small graph with seven nodes and seven edges. The pagerank of each node is illustrated by shading it, where a darker color denotes a higher rank.

    If you study this figure, you should notice that:

    • Nodes 1 through 4 all have low rank, because no other nodes point to them
    • Node 5 has a medium rank, because a low-rank node points to it
    • Node 6 has high rank, because many low-rank nodes point to it
    • Node 7 has the highest rank, because a high-rank node points to it, while it points to nothing

    Compute pagerank with Python

    The pageranks of the nodes in the example graph (see figure above) were computed in Python with the help of the networkx library, which can be installed with pip: pip install networkx. The code that creates the graph and computes the pagerank is listed below:

    import networkx as nx
    
    # Initialize directed graph
    G = nx.DiGraph()
    
    # Add edges (implicitly adds nodes)
    G.add_edge(1,6)
    G.add_edge(2,6)
    G.add_edge(3,6)
    G.add_edge(4,6)
    G.add_edge(5,6)
    G.add_edge(4,5)
    G.add_edge(6,7)
    
    # Compute pagerank (keys are node IDs, values are pageranks)
    pr = nx.pagerank(G)
    """
    {
      1: 0.06242340798778012, 
      2: 0.06242340798778012, 
      3: 0.06242340798778012, 
      4: 0.06242340798778012, 
      5: 0.08895357136701444, 
      6: 0.32374552689540625, 
      7: 0.33760726978645894
    }
    """
    

    Notice that each node is represented by an integer ID, with no specific semantics tied to the nodes or the edges. In other words, the graph could equally well represent relationships between web pages, scientists or chess players (or something else entirely).
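
    To make the chess reading concrete, here is a minimal sketch in which the players and results are made up: each game adds an edge from the loser to the winner, and the pagerank then acts as a crude rating.

    chess = nx.DiGraph()
    
    # Each edge points from the loser to the winner of a game
    games = [('novice', 'master'), ('amateur', 'master'), ('novice', 'amateur')]
    chess.add_edges_from(games)
    
    rating = nx.pagerank(chess)
    # The master, who won every game, gets the highest rating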

    If your relationships can be assigned weights, e.g. the strength of a victory in chess or the prominence of a link on a web page, then you can add weights to the edges in the graph. Luckily, weighted edges can be easily added in networkx:

    G.add_edge(1, 2, weight=0.5)
    

    Dealing with time

    You may ask yourself, should a chess game that took place last year impact a player’s rank as much as a game that was won or lost just last week? In many situations, the most meaningful answer would be no. A good way to represent the passing of time in a relationship graph is to use edge weights that decay over time, so that relationships formed a long time ago carry less weight than recently formed ones. A simple choice is the reciprocal of the time passed, which can be written in Python with the ** operator and a negative exponent:

    time_decayed_weight = max(.00001, time_passed) ** -1
    G.add_edge(1, 2, weight=time_decayed_weight)
    

    We use the trick max(.00001, time_passed) to ensure that we do not raise zero to the power of a negative number. The unit of time passed depends on the domain, and is not essential to the computation. For example, the unit could be milliseconds, years or millennia.
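
    For example, if each relationship is timestamped, time_passed could simply be the number of days since the game was played. A small sketch with a made-up game date:

    import datetime
    
    game_date = datetime.date(2018, 1, 15)  # hypothetical date of the game
    time_passed = (datetime.date.today() - game_date).days
    
    time_decayed_weight = max(.00001, time_passed) ** -1
    G.add_edge(1, 2, weight=time_decayed_weight)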

    To be continued…

  • Quick introduction to RabbitMQ and Celery

    I like to code in Python. I also like the concept of asynchronous workers to build loosely coupled applications. Luckily, RabbitMQ and Celery can help me do exactly that.

    This post is based on a very nice YouTube video by Max Mautner (the one below).

    For easy repeatability, I have transcribed the video in this article, with some minor changes for Python 3.

    Install RabbitMQ and Celery

    I will assume that you are on a Mac. If not, you will have to think a bit to make the instructions work.

    First, let us install RabbitMQ if you don’t already have it.

    brew install rabbitmq
    
    # add the RabbitMQ tools to your path, however you want to do that.
    # The tools are in /usr/local/sbin
    echo 'export PATH="/usr/local/sbin:$PATH"' >> ~/.bash_profile
    

    Next, I recommend that you create a virtual environment for this project and install Celery in there.

    # create virtual environment
    virtualenv -p python3 p3env
    
    # activate environment
    source p3env/bin/activate
    
    # install celery in the environment
    pip install celery
    
    # Test that it worked
    celery --help
    

    Example application

    This example is the simplest possible. It assumes the defaults for Celery, such as using a local RabbitMQ as the message broker. The application will distribute 10000 addition tasks to be executed by Celery.

    We will need two files. One file that defines the Celery task (tasks.py) and one for the driver program (driver.py). Of course, you can call these files anything you want. Also, the driver program is just a simple way to push tasks to RabbitMQ (the Celery default), which will later be dequeued by the Celery workers.

    First, let’s create tasks.py:

    from celery import Celery
    
    app = Celery()
    
    @app.task
    def add(x, y):
        return x + y
    

    Next, let’s create driver.py:

    from tasks import add
    
    for i in range(10000):
        # The delay function was added by Celery, when we decorated the add function
        add.delay(i, 1)
    

    As you can see, the driver program consists of a loop that calls add.delay(i, 1) 10000 times. We did not explicitly define the delay function. It was added automatically when we decorated the add function with the annotation @app.task. This means that the function call will be pushed to the message broker and executed asynchronously on the workers.
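
    A side note: add.delay(...) returns an AsyncResult, but the bare Celery() app above has no result backend configured, so you cannot read the results back from the driver. If you want that, a minimal sketch is to configure a backend when creating the app in tasks.py (the RPC backend is just one option):

    from celery import Celery
    
    # Same tasks.py as before, but with a result backend configured
    app = Celery('tasks', backend='rpc://')
    
    @app.task
    def add(x, y):
        return x + y
    

    With a backend in place, the driver can wait for a result, e.g. print(add.delay(2, 3).get(timeout=10)).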

    Run example

    To run the example, first start the local RabbitMQ server in a new Terminal window:

    # Start message broker
    rabbitmq-server
    
    # if this does not work, check that /usr/local/sbin is in your PATH
    

    In another Terminal window, start the Celery workers:

    # activate the virtual env in the new window
    source p3env/bin/activate
    
    # start the workers
    celery -A tasks worker -l INFO
    

    Finally, run the driver program to create 10000 tasks:

    # activate the virtual env in the new window, if needed
    source p3env/bin/activate
    
    # run the driver program
    python driver.py
    

    Now, in the Terminal window that is running the workers, you should see lines fly by as tasks are being executed asynchronously.

    What to do next?

    Celery fits a lot of use cases, from web scraping and API consumption to long-running web application tasks. The follow-up video by Max demonstrates a simplified web scraping use case. Like the first video, it is succinct and sufficient for a basic understanding.

  • How to export CSV file to database with Python

    Pandas and SQLAlchemy offer powerful conversions between CSV files and tables in databases. Here is a small example:

    import pandas as pd
    from sqlalchemy import create_engine
    
    df = pd.read_csv('mydata.csv')
    
    engine = create_engine('sqlite:///mydata.db')
    df.to_sql('mytable', engine)
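
    The reverse direction is just as short. A quick sketch that reads the table back into a DataFrame, reusing the engine and table name from above:

    df2 = pd.read_sql('mytable', engine)
    print(df2.head())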
    

    Read more: