Category: Algorithms

  • Double Machine Learning for Dummies as explained by ChatGPT

    I asked ChatGPT to:

    explain double machine learning for dummies

    Here is the answer:

    Double machine learning (DML) is a statistical method used to estimate treatment effects in the presence of high-dimensional controls. To put it simply, it’s a way to answer questions about cause and effect using data and machine learning, even when there are lots of potentially relevant variables.

    Here is a more simplified explanation using a practical example:

    Let’s say you are a coach of a football team, and you want to determine the impact of a new training method on the performance of your team. However, there are many factors that could affect performance – things like diet, amount of sleep, weather conditions, player mood, previous experience, and more.

    With traditional methods, you would have to painstakingly adjust for all these factors, or "controls", to determine the real impact of the new training method. This can be time-consuming, difficult, and may not be accurate, especially if there are a lot of controls.

    Here’s where double machine learning comes in. DML uses machine learning twice (hence the "double" part) to solve this problem:

    1. First, it uses machine learning to predict the outcome (football performance) based on all the control variables (diet, sleep, weather, etc.), and then subtracts this prediction from the actual outcome. This gives you a "residual" outcome that can’t be explained by the controls.

    2. Then, it does the same thing for the treatment (the new training method): it predicts the treatment based on the controls, and subtracts this prediction from the actual treatment. This gives you a "residual" treatment that can’t be explained by the controls.

    After that, you simply correlate the "residual" outcome with the "residual" treatment to estimate the effect of the treatment. This gives you an estimate of the impact of the new training method, accounting for all the controls without needing to adjust for them one by one.

    In other words, double machine learning helps you figure out the cause-and-effect relationship between variables when there are many other variables that could potentially influence the result. It does this by using machine learning to "filter out" the effect of the other variables, leaving just the effect you’re interested in.
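
    To make the two residualisation steps concrete, here is a minimal sketch in Python using scikit-learn. This is my own illustration, not part of ChatGPT's answer: the data is simulated and the choice of random forests is arbitrary.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_predict
    
    # Toy data: X are the controls, t is the treatment, y is the outcome
    rng = np.random.default_rng(0)
    n = 2000
    X = rng.normal(size=(n, 20))                 # controls (diet, sleep, weather, ...)
    t = X[:, 0] + rng.normal(size=n)             # treatment depends on the controls
    y = 2.0 * t + X[:, 0] + rng.normal(size=n)   # true treatment effect is 2.0
    
    # Step 1: predict the outcome from the controls, keep the residual
    y_hat = cross_val_predict(RandomForestRegressor(), X, y, cv=5)
    y_res = y - y_hat
    
    # Step 2: predict the treatment from the controls, keep the residual
    t_hat = cross_val_predict(RandomForestRegressor(), X, t, cv=5)
    t_res = t - t_hat
    
    # Regress the residual outcome on the residual treatment
    effect = np.dot(t_res, y_res) / np.dot(t_res, t_res)
    print(effect)  # roughly 2.0, the simulated treatment effect

    Using cross_val_predict keeps each prediction out-of-fold, which is the usual cross-fitting trick to avoid overfitting bias in the residuals.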

  • How to sort numbers with an evolutionary algorithm (CMA-ES)

    Yes, this is clearly nonsense. Sorting is not a hard problem, and standard algorithms such as quicksort and mergesort run in O(n log n) time (O(n^2) for quicksort in the worst case). But let me scratch this itch of sorting numbers using an evolutionary algorithm, specifically the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Technically, we will use what I think is the original library by the inventor of the method, Nikolaus Hansen.

    In Python, we will make use of these two libraries:

    import cma  # pip install cma
    import numpy as np  # pip install numpy
    

    Solving without constraints

    CMA-ES, like other metaheuristics, uses the concept of a fitness function to search for good or optimal solutions to a problem. The algorithm does not need to know the structure of your problem, as all knowledge is encapsulated in the fitness function. The algorithm generates candidate solutions and evaluates them with the fitness function. The fitness of the solutions is used to generate the next batch of solutions, until convergence (fitness = 0).

    While you can define the feasibility of solutions by providing constraints, this is not a requirement. Therefore, we will first try to solve the toy problem without constraints.

    Fitness function and initial solution

    For CMA-ES to work, you must provide a fitness function that is used to evaluate solutions. For sorting, we define our fitness function as the Euclidean distance between a solution x and the optimal solution xopt, which is a sorted list.

    Clearly, it is nonsense that we must first have the optimal solution in order to define the fitness function. Why search if we already have the answer? But again, this is just a simple example chosen so we can focus on the method, not the application.

    In addition to a fitness function, you must also provide a seed solution, which we will call x0. The algorithm will start from x0 and search for better solutions using that as a starting point. Conceptually, a bunch of "neighbours" are evaluated in each step and the direction of search is determined by computing their fitness. Most metaheuristics will intensify search in promising neighbourhoods and ignore the less promising ones. You can read more about CMA-ES on Wikipedia.

    Below we will setup the problem by defining the fitness function ff and an initial solution x0.

    # Optimal solution (used in fitness function)
    n = 40
    xopt = np.arange(n).astype(float)
    # [0, 1, ..., 38, 39]
    
    # fitness function, the euclidean distance x -> xopt
    ff = lambda x: np.linalg.norm(xopt-x)
    
    # Initial solution, a random permutation of the optimal solution
    x0 = np.random.permutation(xopt)
    # [26, 16, ..., 38, 12]
    
    # initial standard deviation
    sigma0 = 0.5
    

    Now that we have defined the fitness function, we can forget that we ever knew the optimal solution. It is however embedded in the fitness function. For your problem, you would have some meaningful way of defining the fitness of a solution. Keep in mind that a value of 0 means perfect fitness, while larger values mean worse fitness.

    Optimise: using wrapper API

    First, we can optimise using the wrapper functions provided by cma. For some reason, using these wrappers results in slower convergence and I don’t know why. There are several ways to use the wrapper API. Below you see two different ways, which I believe are equivalent:

    # method 1
    es = cma.CMAEvolutionStrategy(x0, sigma0)
    es.optimize(ff)
    xbest = es.result.xbest
    
    # method 2
    xbest, es = cma.fmin2(ff, x0, sigma0)
    
    print(xbest.round(0))
    

    Optimise: using stop-ask-tell

    Next, we will solve the problem without the wrappers. The cma library uses a stop-ask-tell protocol.

    • Stop: returns true if the algorithm has converged
    • Ask: the algorithm returns the current pool of solutions
    • Tell: the user provides a fitness value for each solution

    While the following code is slightly longer, it converges faster for some reason. Again, I don’t know why. It is however an equivalent way to solve the problem and also finds the optimal solution.

    es = cma.CMAEvolutionStrategy(x0, sigma0)
    
    fvals = []  # used for plotting later
    while not es.stop():
        solutions = es.ask()
        fitness = np.array([ff(x) for x in solutions])
        fvals.append(fitness.min())
        es.tell(solutions, fitness)
    
    xbest = es.result_pretty().xbest
    print(xbest.round(0))
    

    The program finds the solution in less than 1 second on my laptop (MacBook Pro M2). Is that impressive? Well, it is to some degree. The solution space is essentially any combination of 40 real numbers, since we did not specify any constraints on the values. You can specify constraints in the cma library, which is exactly what we will do next.

    Adding constraints

    You may provide constraints to CMA-ES as a vector-valued function g, which defines a solution x as feasible if and only if g_i(x) ≤ 0 for all i. The following code is based on the example notebook from the pycma website. The structure of the code is almost identical to what we had before. The only difference is that we now combine the old fitness function and the new constraint function into a "special" fitness function that is used during optimisation.

    For our problem of sorting numbers, we want to enforce the constraint that each number must be less than or equal to all numbers to its right. If you think about it, that is the same as saying the numbers must be sorted. This means that our initial solution, which is just a random permutation of sorted numbers, will be infeasible with near 100% probability.

    Modified example that adds a constraint:

    # Create constraint function
    def constraints(x):
        # x_i must be less than or equal to all x's to the right,
        # i.e. x_i - min(x[i+1:]) <= 0 for every i but the last
        x = np.asarray(x)
        return [x[i] - x[i + 1:].min() for i in range(len(x) - 1)]
    
    # Combine old fitness function and constraints
    # This is used in place of the old fitness function
    ffc = cma.ConstrainedFitnessAL(ff, constraints)
    
    es = cma.CMAEvolutionStrategy(x0, sigma0)
    
    while not es.stop():
        solutions = es.ask()
        fitness = np.array([ffc(x) for x in solutions])
        es.tell(solutions, fitness)
    
    xbest = es.result.xbest
    

    For this particular problem, adding the constraint seemingly does not help the problem converge faster. Maybe it already converges as fast as it can, and the constraint just adds overhead and an initial scramble for a feasible solution?

    Early stopping

    Plotting the fitness value as it evolves over time, it is clear that we could have stopped earlier with a pretty good solution. Maybe a pretty good solution does not make sense for sorting, but it would make sense in many other scenarios, such as financial optimisation, where there is a significant amount of uncertainty.

    Plot fitness over time:

    import matplotlib.pyplot as plt
    
    plt.plot(fvals)
    plt.xlabel('Time')
    plt.ylabel('Fitness')
    plt.title('Fitness over time')
    plt.show()
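
    If a pretty good solution is all you need, the stop-ask-tell loop makes early stopping straightforward: break out of the loop once the best fitness in a generation drops below a tolerance. A minimal sketch, where the tolerance value is arbitrary and problem-dependent:

    es = cma.CMAEvolutionStrategy(x0, sigma0)
    
    tolerance = 1.0  # stop once the best solution is "close enough"
    while not es.stop():
        solutions = es.ask()
        fitness = [ff(x) for x in solutions]
        es.tell(solutions, fitness)
        if min(fitness) < tolerance:
            break
    
    xbest = es.result.xbest
    print(xbest.round(0))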
    

  • How to use bnlearn to learn causal structures

    This article on causal machine learning covers a practical example of how to learn structural causal models (SCMs) directly from data. We will use bnlearn, an open-source library for learning the graphical structure of Bayesian networks in Python. Check out my GitHub repo for additional code examples. For other frameworks, check out my page on causal stuff.

    Learning a Bayesian network can be split into structure learning and parameter learning, which are both implemented in bnlearn.

    • Structure learning: Given a set of data samples, estimate a DAG that captures the dependencies between the variables.
    • Parameter learning: Given a set of data samples and a DAG that captures the dependencies between the variables, estimate the (conditional) probability distributions of the individual variables.

    Libraries

    We will learn through a practical example and code. The following libraries are used to implement the example. Numpy and pandas are used for recreating a classic synthetic dataset often used in causal machine learning, the "sprinkler" dataset. BNLearn is then used to learn the causal structure among the variables in the dataset.

    You will need the following imports in Python:

    import numpy as np
    import pandas as pd
    import bnlearn as bn
    

    The sprinkler dataset



    Imagine a small world with a lawn that is sometimes wet. I bet you can smell that lawn just thinking about it. Only two things cause this lawn to be wet. If it rains or if the sprinkler is on. Otherwise the lawn is dry (i.e. ¬wet). While clouds are needed for rain, not all clouds carry rain. It may therefore be cloudy without rain. On sunny days the lawn might need some water and then the sprinkler is turned on. On other sunny days the lawn does not need water and the sprinkler is off. The sprinkler is never on when it is cloudy, because somehow clouds help the lawn stay moist if not wet.

    The lawn world implies four stochastic variables:

    • Cloudy (independent)
    • Rain (depends on Cloudy)
    • Sprinkler (depends on not-Cloudy)
    • Grass wet (depends on Rain and Sprinkler)

    The following code samples the four variables and creates the sprinkler dataset:

    n_samples = 10000
    cloudy = np.random.choice(2, p=[0.25, 0.75], size=n_samples)
    rain = cloudy * np.random.choice(2, p=[0.7, 0.3], size=n_samples)
    sprinkler = (1-rain) * (1-cloudy) * np.random.choice(2, p=[0.5, 0.5], size=n_samples)
    grass_wet = np.maximum(rain, sprinkler)
    data = np.column_stack((cloudy, rain, sprinkler, grass_wet))
    df = pd.DataFrame(data, columns=["cloudy", "rain", "sprinkler", "grass_wet"])
    

    The resulting dataset may look like this:

    Cloudy  Rain  Sprinkler  Grass wet
    1       0     0          0
    1       1     0          1
    0       0     0          0
    0       0     1          1
    Take a moment to verify that the observations are consistent with the story told above.

    Learning the causal structure

    The variables were created to have a specific causal structure. Examples of structures among the variables are shown below, where an arrow (X → Y) should be read as "X causes Y":

    • Cloudy → Rain → Grass wet (chain)
    • Rain → Grass wet ← Sprinkler (collider)
    • Sprinkler ← Cloudy → Rain (fork)

    Let’s see if we can learn this causal structure using bnlearn as shown in the following code snippet:

    model = bn.structure_learning.fit(df)
    model = bn.independence_test(model, df)
    

    Notice that we may learn the wrong causal relationships. For example, it may seem that turning the sprinkler off causes clouds to appear. This is because the sprinkler is never on while there are clouds and vice versa. However, any observations where the sprinkler is off and no clouds appear would be evidence to the contrary, which may or may not be present in the sample we generated above.
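
    Structure learning gives us the DAG. To also obtain the conditional probability tables described under parameter learning above, bnlearn can estimate them on top of the learned structure. A minimal sketch; see the bnlearn documentation for the available estimation methods:

    # Estimate the conditional probability distributions (CPDs)
    # of each variable, given the learned DAG
    model = bn.parameter_learning.fit(model, df)
    
    # Print the learned CPDs
    bn.print_CPD(model)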

    Visualising the causal DAG

    Because bnlearn uses networkx, we can visualise the learned graph with a single line of code:

    G = bn.plot(model)
    

    If all went well with the data generation and learning, the graph should look similar to this.

    If it does not, simply try to generate the data again, optionally increasing the number of samples.

    Conclusion

    BNLearn can be used to learn the causal relationships of variables directly from data. It does not always work and is somewhat sensitive to the sample drawn, as causal relationships may sometimes be misinterpreted if insufficient evidence exists in the sample to indicate otherwise.

  • Cosine similarity in Python

    Cosine similarity is the normalised dot product between two vectors. I guess it is called “cosine” similarity because the dot product equals the product of the Euclidean magnitudes of the two vectors and the cosine of the angle between them, so dividing by the magnitudes leaves just the cosine. If you want, read more about cosine similarity and dot products on Wikipedia.

    Here is how to compute cosine similarity in Python, either manually (well, using numpy) or using a specialised library:

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    
    # vectors
    a = np.array([1,2,3])
    b = np.array([1,1,4])
    
    # manually compute cosine similarity
    dot = np.dot(a, b)
    norma = np.linalg.norm(a)
    normb = np.linalg.norm(b)
    cos = dot / (norma * normb)
    
    # use library, operates on sets of vectors
    aa = a.reshape(1,3)
    ba = b.reshape(1,3)
    cos_lib = cosine_similarity(aa, ba)
    
    print(
        dot,
        norma,
        normb,
        cos,
        cos_lib[0][0]
    )
    

    The manual and library values may differ slightly in the last decimal places. On my computer I get:

    • 0.9449111825230682 (manual)
    • 0.9449111825230683 (library)

  • (Integer) Linear Programming in Python

    Step one:

    brew install glpk
    pip install pulp
    

    Step two:

    from pulp import * 
    
    prob = LpProblem("test1", LpMinimize) 
    
    # Variables 
    x = LpVariable("x", 0, 4, cat="Integer") 
    y = LpVariable("y", -1, 1, cat="Integer") 
    z = LpVariable("z", 0, cat="Integer") 
    
    # Objective 
    prob += x + 4*y + 9*z 
    
    # Constraints 
    prob += x+y <= 5 
    prob += x+z >= 10 
    prob += -y+z == 7 
    
    GLPK().solve(prob) 
    
    # Solution 
    for v in prob.variables():
        print(v.name, "=", v.varValue)
    
    print("objective=", value(prob.objective))
    

    In the documentation there are further examples, e.g. one to minimise the cost of producing cat food.

  • How to compute the pagerank of almost anything

    Whenever two things have a directional relationship to each other, you can compute the pagerank of those things. For example, you can observe directional relationships between web pages that link to each other, scientists who cite each other, and chess players who beat each other. The relationship is directional because it matters in which direction the relationship points, e.g. who lost to whom in chess.

    Intuitively, you may think of directional relationships as a transfer of some abstract value between two parties. For example, when one chess player loses to another, the value (i.e. relative skill level) of the winner increases and the value of the loser decreases. Furthermore, the amount of value that is transferred depends on the starting value of each party. If a master loses to a novice, the relative skill level of the novice will increase dramatically. Conversely, if a novice loses to a master, that is to be expected, and the relative skill level of each player should remain roughly the same as before – the status quo.

    Below you’ll see an illustration of a small graph with seven nodes and seven edges. The pagerank of each node is illustrated by shading it, where a darker color denotes a higher rank.

    If you study this figure, you should notice that:

    • Nodes 1 through 4 all have low rank, because no other nodes point to them
    • Node 5 has a medium rank, because a low-rank node points to it
    • Node 6 has high rank, because many low-rank nodes point to it
    • Node 7 has the highest rank, because a high-rank node points to it, while it points to nothing

    Compute pagerank with Python

    The pageranks of the nodes in the example graph (see figure above) were computed in Python with the help of the networkx library, which can be installed with pip: pip install networkx. The code that creates the graph and computes the pageranks is listed below:

    import networkx as nx
    
    # Initialize directed graph
    G = nx.DiGraph()
    
    # Add edges (implicitly adds nodes)
    G.add_edge(1,6)
    G.add_edge(2,6)
    G.add_edge(3,6)
    G.add_edge(4,6)
    G.add_edge(5,6)
    G.add_edge(4,5)
    G.add_edge(6,7)
    
    # Compute pagerank (keys are node IDs, values are pageranks)
    pr = nx.pagerank(G)
    """
    {
      1: 0.06242340798778012, 
      2: 0.06242340798778012, 
      3: 0.06242340798778012, 
      4: 0.06242340798778012, 
      5: 0.08895357136701444, 
      6: 0.32374552689540625, 
      7: 0.33760726978645894
    }
    """
    

    Notice that each node is represented by an integer ID, with no specific semantics tied to the nodes or the edges. In other words, the graph could equally well represent relationships between web pages, scientists or chess players (or something else entirely).

    If your relationships can be assigned weights, e.g. the strength of a victory in chess or the prominence of a link on a web page, then you can add weights to the edges in the graph. Luckily, weighted edges can be easily added in networkx:

    G.add_edge(1, 2, weight=0.5)
    

    Dealing with time

    You may ask yourself: should a chess game that took place last year impact a player’s rank as much as a game that was won or lost just last week? In many situations, the most meaningful answer would be no. A good way to represent the passing of time in a relationship graph is to use edge weights that decrease over time by some decay function, for example an inverse decay where relationships formed long ago carry much lower weight than recently formed ones. This can be achieved in Python with the ** operator and a negative exponent:

    time_decayed_weight = max(.00001, time_passed) ** -1
    G.add_edge(1, 2, weight=time_decayed_weight)
    

    We use the trick max(.00001, time_passed) to ensure that we do not raise zero to the power of a negative number. The unit of time passed depends on the domain, and is not essential to the computation. For example, the unit could be milliseconds, years or millennia.

    To be continued…

  • Running LP-solver in Postgres

    Having reinstalled PostgreSQL with support for Python, pointed at my non-system Python, it is time to test whether I can use the convex optimizer library I’ve installed in my Python 2.7 (pip install cvxopt).

    Install PL/Python if not already installed

    -- if not already installed. Doesn't hurt.
    create extension plpythonu;
    

    Create a function that imports cvxopt:

    CREATE OR REPLACE FUNCTION hello_cvxopt()
      RETURNS text
    AS $$
      import cvxopt
      return cvxopt.__doc__
    $$ LANGUAGE plpythonu IMMUTABLE;
    

    See if it works:

    select hello_cvxopt();
    -- should return a documentation string
    

    Try the linear programming example:

    CREATE OR REPLACE FUNCTION cvxopt_lp_example()
      RETURNS float[]
    AS $$
      from cvxopt import matrix, solvers
      A = matrix([ [-1.0, -1.0, 0.0, 1.0], [1.0, -1.0, -1.0, -2.0] ])
      b = matrix([ 1.0, -2.0, 0.0, 4.0 ])
      c = matrix([ 2.0, 1.0 ])
      solvers.options['show_progress'] = False
      sol=solvers.lp(c,A,b)
      return list(sol['x'])
    $$ LANGUAGE plpythonu IMMUTABLE;
    
    select cvxopt_lp_example();
    -- should return something like "{0.499999995215,1.49999999912}"
    


  • Clustering in Python

    In a project I’m going to use clustering algorithms implemented in Python, such as k-means.

    Clustering

    http://stackoverflow.com/questions/1545606/python-k-means-algorithm

    scipy.cluster has been reported to have some problems, so for now I’ll use PyCluster (following advice given on stackoverflow).

    Install PyCluster:

    pip install PyCluster
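
    For reference, a k-means call with PyCluster looks roughly like this. This is a sketch from memory with made-up data; check the PyCluster documentation for the full kcluster signature.

    import numpy as np
    from Pycluster import kcluster
    
    # Two obvious groups of 2D points
    data = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.1, 4.9]])
    
    # k-means with k=2; returns a cluster id per point, the within-cluster
    # error, and how often the best solution was found over the passes
    clusterid, error, nfound = kcluster(data, nclusters=2, npass=10)
    print(clusterid)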
    


  • Is there a need for a fast compression algorithm for geospatial data?

    Fast compression algorithms like Snappy, QuickLZ and LZ4 are designed for a general stream of bytes, and typically don’t treat byte-sequences representing numbers in any special way. Geospatial data is special in the sense that it often contains a large amount of numbers, like floats, representing coordinates.


  • Trying a Python R-tree implementation

    Rtree is a ctypes Python wrapper of libspatialindex that provides a number of advanced spatial indexing features for the spatially curious Python user.
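
    To give a flavour of the API, a minimal bounding-box index might look like this. The coordinates are made up; Rtree uses (left, bottom, right, top) bounds by default.

    from rtree import index
    
    # Create an in-memory R-tree index
    idx = index.Index()
    
    # Insert two bounding boxes: id, (left, bottom, right, top)
    idx.insert(0, (0.0, 0.0, 1.0, 1.0))
    idx.insert(1, (2.0, 2.0, 3.0, 3.0))
    
    # Find the ids of boxes intersecting a query window
    hits = list(idx.intersection((0.5, 0.5, 2.5, 2.5)))
    print(hits)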
