Author: kostas

  • How SSH Nuke in Matrix Reloaded works

    In the movie Matrix Reloaded, we see Trinity use SSH Nuke to break into a server. This tool exploits the SSH CRC-32 vulnerability, known as CVE-2001-0144, which affected older versions of the OpenSSH software. Here’s an explanation of how this vulnerability worked:

    Overview of the Vulnerability

    The SSH CRC-32 vulnerability exploited a flaw in the handling of CRC (Cyclic Redundancy Check) in the SSH protocol version 1. This vulnerability could allow an attacker to execute arbitrary code on a remote system running an affected SSH daemon (sshd).

    Technical Details

    1. CRC-32 Integrity Check:

      • The SSH protocol version 1 used a CRC-32 checksum to verify the integrity of data packets.
      • The purpose of CRC-32 was to detect accidental changes to raw data; the short sketch after this list illustrates this, and why CRC-32 alone offers no protection against deliberate tampering.
    2. Vulnerability in CRC-32 Compensation Attack Detector:

      • The vulnerability was found in the CRC-32 compensation attack detector function.
      • This function was supposed to prevent an attacker from tampering with the checksum to produce valid-looking but malicious packets.
    3. Exploiting the Buffer Overflow:

      • The flaw allowed a buffer overflow in the CRC-32 compensation attack detector: the detector stored a packet-derived size in a 16-bit variable, so a sufficiently large, specially crafted packet truncated that value and let the attacker index outside the detector's hash table.
      • An attacker could send such a packet past the integrity check and corrupt the server's memory.
      • By carefully controlling the data in the overflow, the attacker could overwrite memory and execute arbitrary code on the server.
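
    To make the integrity-check idea concrete, here is a minimal sketch in Python (using the standard zlib module, not the actual SSH1 wire format) of what CRC-32 does and does not give you: it detects accidental corruption, but it is not a cryptographic MAC, so anyone who can modify a packet can simply recompute a matching checksum.

    import zlib
    
    packet = b"some ssh1 payload"
    checksum = zlib.crc32(packet)
    
    # Accidental corruption is detected: the checksum no longer matches
    corrupted = b"some ssh1 pAyload"
    assert zlib.crc32(corrupted) != checksum
    
    # But CRC-32 is not cryptographic: an attacker who can alter the
    # packet can simply recompute a valid checksum for the altered data
    malicious = b"evil ssh1 payload"
    forged = zlib.crc32(malicious)  # passes any CRC-only integrity check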

    Steps of the Exploit

    1. Sending Malicious Packets:

      • The attacker sends a series of maliciously crafted packets to the SSH server.
      • These packets have incorrect CRC-32 checksums designed to exploit the overflow.
    2. Triggering the Overflow:

      • When the SSH server processes these packets, it incorrectly handles the CRC-32 checksums, leading to a buffer overflow.
    3. Executing Arbitrary Code:

      • The overflow allows the attacker to inject and execute arbitrary code on the SSH server.
      • This could provide the attacker with unauthorized access to the server, potentially with root privileges.

    Mitigation and Resolution

    • Patch and Update:

      • The vulnerability was addressed in later versions of OpenSSH.
      • Users were advised to upgrade to the latest version of OpenSSH to mitigate this vulnerability.
    • Switch to SSH Protocol Version 2:

      • The vulnerability only affected SSH protocol version 1.
      • Protocol version 2, which is more secure and does not have this flaw, became the recommended version to use.

    Conclusion

    The SSH CRC-32 vulnerability was a serious security flaw that could allow attackers to gain unauthorized access to systems by exploiting a buffer overflow in the SSH protocol version 1. The exploit involved crafting packets to bypass the integrity checks and trigger an overflow, leading to arbitrary code execution. This vulnerability highlights the importance of using updated and patched software versions and preferring more secure protocols, like SSH version 2.

  • How to call OpenAI’s ChatGPT API

    Here is how you can call OpenAI’s ChatGPT API, given that you have an API key. Follow these instructions to get one.

    import openai
    
    openai.api_key = ''  # your API key goes here
    
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"},
            {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
            {"role": "user", "content": "Where was it played?"}
        ]
    )
    print(completion)
    

    This will print out something like the following:

    <OpenAIObject chat.completion id=chatcmpl-xxx at xxx> JSON: {
      "id": "chatcmpl-xxx",
      "object": "chat.completion",
      "created": 1689160553,
      "model": "gpt-3.5-turbo-0613",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "The 2020 World Series was played at Globe Life Field in Arlington, Texas."
          },
          "finish_reason": "stop"
        }
      ],
      "usage": {
        "prompt_tokens": 53,
        "completion_tokens": 17,
        "total_tokens": 70
      }
    }
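
    To get just the assistant's reply instead of the whole response object, index into the choices. Assuming the call above was assigned to a variable named completion (as in the snippet), the legacy 0.x openai client supports attribute-style access:

    # continues from the call above
    print(completion.choices[0].message.content)
    # The 2020 World Series was played at Globe Life Field in Arlington, Texas.
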
  • How to reset your WordPress password via SQL

    You may need to reset your WordPress password, and here I’ll show you how to do that, assuming you have access to your site’s MySQL database.

    WordPress stores the password for your user in the wp_users table. You can write an MD5 hash into the user_pass column; WordPress accepts it and re-hashes it with its stronger scheme the next time you log in. Set a new password with the following query:

    UPDATE `wp_users` SET `user_pass` = MD5('mynewpassword')
    WHERE `user_login` = 'user_x';

    Just replace the user_login value ("user_x" in the query above) with your own username and the password ("mynewpassword" in the query above) with the password you want.

    To list users that exist in the database, you can use this simple query:

    SELECT * FROM wp_users;

    That’s all there is to it.

  • Double Machine Learning for Dummies as explained by ChatGPT

    I asked ChatGPT to:

    explain double machine learning for dummies

    Here is the answer:

    Double machine learning (DML) is a statistical method used to estimate treatment effects in the presence of high-dimensional controls. To put it simply, it’s a way to answer questions about cause and effect using data and machine learning, even when there are lots of potentially relevant variables.

    Here is a more simplified explanation using a practical example:

    Let’s say you are a coach of a football team, and you want to determine the impact of a new training method on the performance of your team. However, there are many factors that could affect performance – things like diet, amount of sleep, weather conditions, player mood, previous experience, and more.

    With traditional methods, you would have to painstakingly adjust for all these factors, or "controls", to determine the real impact of the new training method. This can be time-consuming, difficult, and may not be accurate, especially if there are a lot of controls.

    Here’s where double machine learning comes in. DML uses machine learning twice (hence the "double" part) to solve this problem:

    1. First, it uses machine learning to predict the outcome (football performance) based on all the control variables (diet, sleep, weather, etc.), and then subtracts this prediction from the actual outcome. This gives you a "residual" outcome that can’t be explained by the controls.

    2. Then, it does the same thing for the treatment (the new training method): it predicts the treatment based on the controls, and subtracts this prediction from the actual treatment. This gives you a "residual" treatment that can’t be explained by the controls.

    After that, you simply correlate the "residual" outcome with the "residual" treatment to estimate the effect of the treatment. This gives you an estimate of the impact of the new training method, accounting for all the controls without needing to adjust for them one by one.

    In other words, double machine learning helps you figure out the cause-and-effect relationship between variables when there are many other variables that could potentially influence the result. It does this by using machine learning to "filter out" the effect of the other variables, leaving just the effect you’re interested in.
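
    To make the two stages concrete, here is a minimal sketch in Python with synthetic data. Everything in it is illustrative and not part of ChatGPT's answer: the random forests are just one possible choice of learner, the "true" effect is set to 2.0 by construction, and cross_val_predict is used to get out-of-fold predictions, a simple stand-in for the cross-fitting that proper DML prescribes.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_predict
    
    rng = np.random.default_rng(0)
    n = 5000
    X = rng.normal(size=(n, 5))  # controls (diet, sleep, weather, ...)
    t = X @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) + rng.normal(size=n)  # treatment depends on controls
    y = 2.0 * t + X @ np.array([1.0, 0.5, -0.5, 0.2, 0.0]) + rng.normal(size=n)  # true effect is 2.0
    
    # Stage 1: predict outcome and treatment from the controls (out of fold),
    # then subtract to get the residuals described above
    y_res = y - cross_val_predict(RandomForestRegressor(), X, y, cv=5)
    t_res = t - cross_val_predict(RandomForestRegressor(), X, t, cv=5)
    
    # Stage 2: relate the residual outcome to the residual treatment
    effect = LinearRegression().fit(t_res.reshape(-1, 1), y_res).coef_[0]
    print(effect)  # should land close to 2.0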

  • How to Draw an Owl

    Taken from lecture 1 of the Statistical Rethinking course (around the 44-minute mark). The course material is also on GitHub.

    How to draw an "owl" version 1:

    1. Create generative simulation (GS)
    2. Write an estimator
    3. Validate estimator using simulated data
    4. Analyze real data: …
    5. Reuse 1 to compute hypothetical interventions

    How to draw an "owl" version 2:

    1. Theoretical estimand
    2. Scientific causal models
    3. Use 1 & 2 to build statistical models
    4. Simulate from 2 that 3 yields 1
    5. Analyze real data
  • How to sort numbers with an evolutionary algorithm (CMA-ES)

    Yes, this is clearly nonsense. Sorting is not a hard problem, and standard algorithms such as quicksort and mergesort run in O(n log n) time on average (quicksort degrades to O(n^2) in the worst case). But let me scratch this itch of sorting numbers using an evolutionary algorithm, specifically the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Technically, we will use what I think is the original library by the inventor of the method, Nikolaus Hansen.

    In python, we will make use of these two libraries:

    import cma  # pip install cma
    import numpy as np  # pip install numpy
    

    Solving without constraints

    CMA-ES, like other metaheuristics, uses the concept of a fitness function to search for good or optimal solutions to a problem. The algorithm does not need to know the structure of your problem, as all knowledge is encapsulated in the fitness function. The algorithm generates candidate solutions, evaluates them with the fitness function, and uses their fitness to generate the next batch of solutions until convergence (fitness = 0).

    While you can define the feasibility of solutions by providing constraints, this is not a requirement. Therefore, we will first try to solve the toy problem without constraints.

    Fitness function and initial solution

    For CMA-ES to work, you must provide a fitness function that is used to evaluate solutions. For sorting, we define our fitness function as the Euclidean distance between a solution x and the optimal solution xopt, which is a sorted list.

    Clearly, it is nonsense that we must first have the optimal solution in order to define the fitness function. Why search if we already have the answer? But again, this is just a simple example chosen so we can focus on the method, not the application.

    In addition to a fitness function, you must also provide a seed solution, which we will call x0. The algorithm will start from x0 and search for better solutions using that as a starting point. Conceptually, a bunch of "neighbours" are evaluated in each step and the direction of search is determined by computing their fitness. Most metaheuristics will intensify search in promising neighbourhoods and ignore the less promising ones. You can read more about CMA-ES on Wikipedia.

    Below we set up the problem by defining the fitness function ff and an initial solution x0.

    # Optimal solution (used in fitness function)
    n = 40
    xopt = np.arange(n).astype(float)
    # [0, 1, ..., 38, 39]
    
    # fitness function, the euclidean distance x -> xopt
    ff = lambda x: np.linalg.norm(xopt-x)
    
    # Initial solution, a random permutation of the optimal solution
    x0 = np.random.permutation(xopt)
    # [26, 16, ..., 38, 12]
    
    # initial standard deviation
    sigma0 = 0.5
    

    Now that we have defined the fitness function, we can forget that we ever knew the optimal solution; it is, however, embedded in the fitness function. For your problem, you would have some meaningful way of defining the fitness of a solution. Keep in mind that a value of 0 means perfect fitness, while larger values mean worse fitness.

    Optimise: using wrapper API

    First, we can optimise using the wrapper functions provided by cma. For some reason, using these wrappers results in slower convergence, and I don’t know why. There are several ways to use the wrapper API; below you see two different ways, which I believe are equivalent:

    # method 1
    es = cma.CMAEvolutionStrategy(x0, sigma0)
    es.optimize(ff)
    xbest = es.result.xbest
    
    # method 2
    xbest, es = cma.fmin2(ff, x0, sigma0)
    
    print(xbest.round(0))
    

    Optimise: using stop-ask-tell

    Next, we will solve the problem without the wrappers. The cma library uses a stop-ask-tell protocol.

    • Stop: returns true if the algorithm has converged
    • Ask: the algorithm returns the current pool of solutions
    • Tell: the user provides a fitness value for each solution

    While the following code is slightly longer, it converges faster for some reason. Again, I don’t know why. It is however an equivalent way to solve the problem and also finds the optimal solution.

    es = cma.CMAEvolutionStrategy(x0, sigma0)
    
    fvals = []  # used for plotting later
    while not es.stop():
        solutions = es.ask()
        fitness = np.array([ff(x) for x in solutions])  # ff takes a single candidate
        fvals.append(fitness.min())
        es.tell(solutions, fitness)
    
    es.result_pretty()  # prints a summary of the result
    xbest = es.result.xbest
    print(xbest.round(0))
    

    The program finds the solution in less than 1 second on my laptop (MacBook Pro M2). Is that impressive? Well, it is to some degree. The solution space is essentially any combination of 40 real numbers, since we did not specify any constraints on the values. You can specify constraints in the cma library, which is exactly what we will do next.

    Adding constraints

    You may provide constraints to CMA-ES as a vector-valued function g, which defines a solution x as feasible if and only if g_i(x) ≤ 0 for all i. The following code is based on the example notebook from the pycma website. The structure of the code is almost identical to what we had before. The only difference is that we now combine the old fitness function and the new constraint function into a "special" fitness function that is used during optimisation.

    For our problem of sorting numbers, we want to enforce the constraint that any number must be less than or equal to all numbers to the right of it. If you think about it, that is the same as saying the numbers must be sorted. This means that our initial solution, which is just a random permutation of sorted numbers, will be infeasible with near 100% probability (only 1 in 40! permutations of 40 distinct numbers is sorted).

    Modified example that adds a constraint:

    # Create constraint function
    def constraints(x):
        # x_i must be less than or equal to all x's to its right,
        # i.e. x[i] - min(x[i+1:]) <= 0 for every i but the last
        return [x[i] - x[i+1:].min() for i in range(len(x) - 1)]
    
    # Combine old fitness function and constraints
    # This is used in place of the old fitness function
    ffc = cma.ConstrainedFitnessAL(ff, constraints)
    
    es = cma.CMAEvolutionStrategy(x0, sigma0)
    
    while not es.stop():
        solutions = es.ask()
        fitness = np.array([ffc(x) for x in solutions])
        es.tell(solutions, fitness)
    
    xbest = es.result.xbest
    

    For this particular problem, adding the constraint seemingly does not help the problem converge faster. Maybe it already converges as fast as it can, and the constraint just adds overhead and an initial scramble for a feasible solution?

    Early stopping

    Plotting the fitness value as it evolves over time, it is clear that we could have stopped earlier with a pretty good solution. Maybe a pretty good solution does not make sense for sorting, but it would make sense in many other scenarios, such as financial optimisation, where there is a significant amount of uncertainty.

    Plot fitness over time:

    import matplotlib.pyplot as plt
    
    plt.plot(fvals)
    plt.xlabel('Generation')
    plt.ylabel('Fitness')
    plt.title('Fitness over time')
    plt.show()
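
    One simple way to act on this, sketched below by rerunning the ask-tell loop from above: break out once the best fitness in the current generation drops below a "good enough" threshold. The tol value here is an arbitrary illustration, not a recommendation.

    tol = 1.0  # hypothetical "good enough" fitness threshold
    
    es = cma.CMAEvolutionStrategy(x0, sigma0)
    while not es.stop():
        solutions = es.ask()
        fitness = np.array([ff(x) for x in solutions])
        es.tell(solutions, fitness)
        if fitness.min() < tol:
            break  # stop early and accept a pretty good solution
    
    print(es.result.xbest.round(0))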
    

  • How to draw lines on map in Databricks

    Imports:

    import plotly.graph_objects as go

    Plot:

    fig = go.Figure()
    
    fig.add_trace(go.Scattermapbox(
        mode = "markers+lines",
        lon = [10, 20, 30],
        lat = [10, 15, 30],
        marker = {'size': 10}))
    
    fig.add_trace(go.Scattermapbox(
        mode = "markers+lines",
        lon = [-50, -60, 40],
        lat = [30, 10, -20],
        marker = {'size': 10}))
    
    fig.update_layout(
        margin = {'l': 0, 't': 0, 'b': 0, 'r': 0},
        mapbox = {
            'center': {'lon': -20, 'lat': -20},
            'style': "carto-positron",
            'zoom': 1})
    
    fig.show()

    Displays the two traces as connected markers on a world map.

  • How to call an API from PySpark (in workers)

    Tested in Databricks

    import pyspark.sql.functions as F
    import requests
    
    # create dataframe
    pokenumbers = [(i,) for i in range(100)]
    cols = ["pokenum"]
    
    df_pokenums = spark.createDataFrame(data=pokenumbers, schema=cols)
    
    # call API
    def get_name(rows):
        # glom() yields each partition as a list of rows; take the
        # first item only, since the API doesn't support batch requests
        first = rows[0]
        url = f'https://pokeapi.co/api/v2/pokemon-form/{first.pokenum}'
        try:
            resp = requests.get(url)
            name = resp.json()['pokemon']['name']
            return resp.status_code, name
        except Exception:
            return None, 'did not work'
    
    # apply to partitions
    df_pokenums.repartition(10).rdd.glom().map(get_name).collect()
  • How to use bnlearn to learn causal structures

    This article on causal machine learning covers a practical example of how to learn structural causal models (SCM) directly from data. We will use bnlearn, which is an open-source library for learning the graphical structure of Bayesian networks in Python. Check out my GitHub repo for additional code examples. For other frameworks, check out my page on causal stuff.

    Learning a Bayesian network can be split into structure learning and parameter learning, both of which are implemented in bnlearn.

    • Structure learning: Given a set of data samples, estimate a DAG that captures the dependencies between the variables.
    • Parameter learning: Given a set of data samples and a DAG that captures the dependencies between the variables, estimate the (conditional) probability distributions of the individual variables.

    Libraries

    We will learn through a practical example and code. The following libraries are used to implement the example. NumPy and pandas are used to recreate a classic synthetic dataset often used in causal machine learning, the "sprinkler" dataset. bnlearn is then used to learn the causal structure among the variables in the dataset.

    You will need the following imports in Python:

    import numpy as np
    import pandas as pd
    import bnlearn as bn
    

    The sprinkler dataset


    Photo by Rémi Müller on Unsplash

    Imagine a small world with a lawn that is sometimes wet. I bet you can smell that lawn just thinking about it. Only two things cause this lawn to be wet. If it rains or if the sprinkler is on. Otherwise the lawn is dry (i.e. ¬wet). While clouds are needed for rain, not all clouds carry rain. It may therefore be cloudy without rain. On sunny days the lawn might need some water and then the sprinkler is turned on. On other sunny days the lawn does not need water and the sprinkler is off. The sprinkler is never on when it is cloudy, because somehow clouds help the lawn stay moist if not wet.

    The lawn world implies four stochastic variables:

    • Cloudy (independent)
    • Rain (depends on Cloudy)
    • Sprinkler (depends on not-Cloudy)
    • Grass wet (depends on Rain and Sprinkler)

    The following code samples the four variables and creates the sprinkler dataset:

    n_samples = 10000
    cloudy = np.random.choice(2, p=[0.25, 0.75], size=n_samples)
    rain = cloudy * np.random.choice(2, p=[0.7, 0.3], size=n_samples)
    sprinkler = (1-rain) * (1-cloudy) * np.random.choice(2, p=[0.5, 0.5], size=n_samples)
    grass_wet = np.maximum(rain, sprinkler)
    data = np.column_stack((cloudy, rain, sprinkler, grass_wet))
    df = pd.DataFrame(data, columns=["cloudy", "rain", "sprinkler", "grass_wet"])
    

    The resulting dataset may look like this:

    Cloudy  Rain  Sprinkler  Grass wet
         1     0          0          0
         1     1          0          1
         0     0          0          0
         0     0          1          1

    Take a moment to verify that the observations are consistent with the story told above.

    Learning the causal structure

    The variables were created to have a specific causal structure. Examples of structures among the variables are shown below, where an arrow (X → Y) should be read as "X causes Y":

    • Cloudy → Rain → Grass wet (chain)
    • Rain → Grass wet ← Sprinkler (collider)
    • Sprinkler ← Cloudy → Rain (fork)

    Let’s see if we can learn this causal structure using bnlearn as shown in the following code snippet:

    model = bn.structure_learning.fit(df)
    model = bn.independence_test(model, df)
    

    Notice that we may learn the wrong causal relationships. For example, it may seem that turning the sprinkler off causes clouds to appear. This is because the sprinkler is never on while there are clouds and vice versa. However, any observations where the sprinkler is off and no clouds appear would be evidence to the contrary, which may or may not be present in the sample we generated above.

    Visualising the causal DAG

    Because bnlearn includes networkx, we get the ability to visualise the graph that was learned. It’s a single line of code:

    G = bn.plot(model)
    

    If all went well with the data generation and learning, the graph should show Cloudy → Rain, Cloudy → Sprinkler, and Rain → Grass wet ← Sprinkler.

    If it does not, simply try to generate the data again, optionally increasing the number of samples.
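
    With a structure in hand, the second step mentioned at the top, parameter learning, is the natural follow-up. Here is a minimal sketch using bnlearn's parameter_learning module (the exact format of the printed CPDs may vary between versions):

    # Estimate the conditional probability distributions (CPDs) for the learned DAG
    model = bn.parameter_learning.fit(model, df)
    
    # Inspect the learned CPDs, e.g. P(rain | cloudy)
    bn.print_CPD(model)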

    Conclusion

    BNLearn can be used to learn the causal relationships of variables directly from data. It does not always succeed and is somewhat sensitive to the sample drawn: causal relationships may be misinterpreted when the sample lacks the evidence needed to rule out spurious structures.