Category: Data science

  • Double Machine Learning for Dummies as explained by ChatGPT

    I asked ChatGPT to:

    explain double machine learning for dummies

    Here is the answer:

    Double machine learning (DML) is a statistical method used to estimate treatment effects in the presence of high-dimensional controls. To put it simply, it’s a way to answer questions about cause and effect using data and machine learning, even when there are lots of potentially relevant variables.

    Here is a more simplified explanation using a practical example:

    Let’s say you are a coach of a football team, and you want to determine the impact of a new training method on the performance of your team. However, there are many factors that could affect performance – things like diet, amount of sleep, weather conditions, player mood, previous experience, and more.

    With traditional methods, you would have to painstakingly adjust for all these factors, or "controls", to determine the real impact of the new training method. This can be time-consuming, difficult, and may not be accurate, especially if there are a lot of controls.

    Here’s where double machine learning comes in. DML uses machine learning twice (hence the "double" part) to solve this problem:

    1. First, it uses machine learning to predict the outcome (football performance) based on all the control variables (diet, sleep, weather, etc.), and then subtracts this prediction from the actual outcome. This gives you a "residual" outcome that can’t be explained by the controls.

    2. Then, it does the same thing for the treatment (the new training method): it predicts the treatment based on the controls, and subtracts this prediction from the actual treatment. This gives you a "residual" treatment that can’t be explained by the controls.

    After that, you simply correlate the "residual" outcome with the "residual" treatment to estimate the effect of the treatment. This gives you an estimate of the impact of the new training method, accounting for all the controls without needing to adjust for them one by one.

    In other words, double machine learning helps you figure out the cause-and-effect relationship between variables when there are many other variables that could potentially influence the result. It does this by using machine learning to "filter out" the effect of the other variables, leaving just the effect you’re interested in.
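    The two residualization steps and the final residual-on-residual regression can be sketched in a few lines. Below is a minimal simulation (not part of the ChatGPT answer; the simulated data, variable names, and the choice of random forests as the learner are illustrative assumptions), assuming scikit-learn is available:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))                 # controls (diet, sleep, weather, ...)
t = X[:, 0] + rng.normal(size=n)            # treatment depends on the controls
y = 2.0 * t + X[:, 0] - X[:, 1] + rng.normal(size=n)  # true treatment effect: 2.0

rf = RandomForestRegressor(n_estimators=50, random_state=0)

# Step 1: predict the outcome from the controls (cross-fitted) and subtract
y_res = y - cross_val_predict(rf, X, y, cv=5)

# Step 2: predict the treatment from the controls and subtract
t_res = t - cross_val_predict(rf, X, t, cv=5)

# Final step: regress the residual outcome on the residual treatment
theta = (t_res @ y_res) / (t_res @ t_res)
print(round(theta, 2))  # should land close to the true effect of 2.0
```

    The cross-fitted predictions (cross_val_predict) matter here: each observation's prediction comes from models that never saw it, which is what keeps the residuals honest.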

  • How to draw lines on map in Databricks

    Imports:

    import plotly.graph_objects as go

    Plot:

    fig = go.Figure()
    
    fig.add_trace(go.Scattermapbox(
        mode = "markers+lines",
        lon = [10, 20, 30],
        lat = [10, 15, 30],
        marker = {'size': 10}))
    
    fig.add_trace(go.Scattermapbox(
        mode = "markers+lines",
        lon = [-50, -60, 40],
        lat = [30, 10, -20],
        marker = {'size': 10}))
    
    fig.update_layout(
        margin = {'l': 0, 't': 0, 'b': 0, 'r': 0},
        mapbox = {
            'style': "carto-positron",
            'center': {'lon': -20, 'lat': -20},
            'zoom': 1})
    
    fig.show()

    Displays: an interactive world map with the two polylines drawn between the given coordinates.

  • Learn some machine learning fundamentals in an afternoon

    Here is a plan for learning ML fundamentals in an afternoon by watching some videos on YouTube:

    Follow this plan

    Machine learning fundamentals:

    [Stop and drink coffee, eat a snack]

    How to address bias and variance:

    Extra material:

    Test your knowledge

    • What is bias?
      • A: Bla
      • B: The inability of a machine learning model (e.g. linear regression) to express the true relationship between X and Y
      • C: Bla
    • What is variance?
      • A: The difference in how well a model fits different datasets (e.g. training and test)
      • B: Bla
      • C: Bla
    • What problem does regularization, bagging and boosting address?
      • A: Bla
      • B: Bla
      • C: Finds the sweet spot between simple and complicated models
    • What is regularization?
      • A: Bla
      • B: Bla
      • C: Bla
    • What is bagging?
      • A: Bla
      • B: Bla
      • C: Bla
    • What is boosting?
      • A: Bla
      • B: Bla
      • C: Bla
    • What is bootstrapping?
      • A: Repeat an experiment a bunch of times until we feel certain about the result
      • B: Repeatedly draw a random sample of size n (with replacement) from a set of n observations and build up a histogram of any statistic, e.g. the mean
      • C: Augment a small set of observations with synthetic samples to increase sample size
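    The bootstrapping idea described in answer B can be sketched in a few lines of Python (the dataset and numbers here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=50)   # a small set of n observations

# Repeatedly draw a random sample of size n WITH replacement
# and record the statistic of interest (here: the mean).
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(10_000)
])

# The spread of the bootstrap means estimates the standard error of the mean.
print(round(boot_means.std(), 2))
```

    The histogram of boot_means approximates the sampling distribution of the mean without any distributional assumptions.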
  • How to make interactive plots in Jupyter

    Python has great options for plotting data. However, sometimes you want to explore data by changing parameters and rerunning plots to see the effect of those changes, which slows down the cycle of exploration. Luckily, Jupyter offers a way to make your plots interactive, so you can see the effect of parameter changes immediately. Here is how to do it.

    Start Jupyter

    Before you proceed, start a Jupyter notebook with a Python kernel where you can type in the code.

    Next, you need a few imports:

    %matplotlib inline
    import numpy as np
    from matplotlib import pyplot as plt
    from ipywidgets import interactive, fixed

    Write a plotting function

    Next, you have to provide a function with the parameters you want to change. The function should plot your output. Here is a simple example with a noisy sine function, where we want the user to be able to change the noise level.

    def noisy_sine(alpha=0.0):
        n_samples = 500
        alpha = np.abs(alpha)                 # noise level must be non-negative
        x = np.linspace(0, 2 * np.pi, n_samples)
        y = np.sin(x) + np.random.random(n_samples) * alpha
        plt.plot(x, y)
        plt.show()
        return x, y

    Provide your function as input to interactive()

    Finally, pass your function to interactive() along with value ranges for the parameters you want to explore.

    interactive(noisy_sine, alpha=(0.0, 1.0))

    The result is the plot together with a slider for alpha; dragging the slider redraws the curve with the chosen noise level.
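    One of the imports above, fixed, was not used in the example. It lets you pin a parameter so that no widget is generated for it. Here is a hypothetical variant of the function with an extra n_samples parameter, pinned with fixed() (the variant function is an illustrative assumption, not from the original example):

```python
import numpy as np
from ipywidgets import interactive, fixed

# Hypothetical variant that also takes the number of samples.
def noisy_sine_n(alpha=0.0, n_samples=500):
    x = np.linspace(0, 2 * np.pi, n_samples)
    return x, np.sin(x) + np.random.random(n_samples) * np.abs(alpha)

# fixed(500) pins n_samples, so only alpha gets a slider.
w = interactive(noisy_sine_n, alpha=(0.0, 1.0), n_samples=fixed(500))
print(len(w.children))  # a slider for alpha plus the output area
```

    This is handy when a function has configuration arguments you do not want cluttering the widget panel.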