Category: Uncategorized

  • How to Draw an Owl

    Taken from lecture 1 of the Statistical Rethinking course (around the 44-minute mark). The course material is also on GitHub.

    How to draw an "owl" version 1:

    1. Create generative simulation (GS)
    2. Write an estimator
    3. Validate estimator using simulated data
    4. Analyze real data: …
    5. Reuse 1 to compute hypothetical interventions

    How to draw an "owl" version 2:

    1. Theoretical estimand
    2. Scientific causal models
    3. Use 1 & 2 to build statistical models
    4. Simulate from 2 to validate that 3 yields 1
    5. Analyze real data
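    The version 1 workflow can be sketched with a toy example: estimating the mean of a normal distribution. The simulator and estimator below are my own illustrative stand-ins, not from the course.

```python
import random
import statistics

# 1. generative simulation (GS): draws from a normal with known parameters
def simulate(mu, sigma, n, seed=0):
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

# 2. estimator: here, simply the sample mean
def estimate(data):
    return statistics.mean(data)

# 3. validate the estimator on simulated data, where the answer is known
data = simulate(mu=5.0, sigma=1.0, n=10_000)
mu_hat = estimate(data)
assert abs(mu_hat - 5.0) < 0.1  # recovers the known parameter

# 4. analyzing real data means calling estimate() on observations;
# 5. the simulator from step 1 can then be reused to explore
#    hypothetical interventions (e.g. "what if mu shifted?").
```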
  • How to call an API from PySpark (in workers)

    Tested in Databricks

    import pyspark.sql.functions as F
    import requests
    
    # create dataframe
    pokenumbers = [(i,) for i in range(100)]
    cols = ["pokenum"]
    
    df_pokenums = spark.createDataFrame(data=pokenumbers, schema=cols)
    
    # call API
    def get_name(rows):
        # take the first item in the partition (the API doesn't support batch requests)
        if not rows:
            return None, 'empty partition'
        first = rows[0]
        url = f'https://pokeapi.co/api/v2/pokemon-form/{first.pokenum}'
        try:
            resp = requests.get(url)
            return resp.status_code, resp.json()['pokemon']['name']
        except requests.RequestException:
            return None, 'did not work'
    
    # apply to partitions
    df_pokenums.repartition(10).rdd.glom().map(get_name).collect()
  • Unusual instruments in Vietnam

    The nose flute

    The Duo A’Reng, a primitive talkbox!

  • Getting back into operations research

    For the last five years, I have been fascinated with machine learning techniques, but that fascination is slowly running out. I increasingly consider ML one tool in my toolbox among others, not a panacea for all problems. In particular, I’d like to return to other algorithmic techniques from my educational background, i.e. computer science. Besides classical algorithms like Dijkstra’s algorithm, I’d particularly like to pick up linear programming and operations research (OR) again.

    During my master’s studies (2006-2008) I was convinced that linear programming would make the world a better place. Human life, I was convinced, would become more economically efficient and more environmentally friendly, with more of the good stuff and less of the bad. Later I thought the exact same thing of machine learning. Yes, a hopeful young man I was, and a hopeful older man I remain. Perhaps now is the time for me to compare the methodologies and arrive at some personal conclusions.

    A (very) rough categorisation, from my point of view as a current practitioner:

    • Classical algorithms: require a deep understanding of the problem, as well as a deep understanding of the computational process that solves it. An example is the Shortest Path Problem and Dijkstra’s algorithm.
    • Linear programming: requires a deep understanding of the problem and the ability to model it mathematically; does not require deep knowledge of any computational process beyond the general Simplex algorithm (for non-integer problems).
    • Machine learning: requires neither a deep understanding of the problem nor of the computational process used to solve it. An example is using a Random Forest to solve the Digit Recognition Problem.
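    The shortest-path example from the first bullet is compact enough to show in full. A minimal Dijkstra over an adjacency-list graph, as my own sketch of the "deep understanding of the computational process" that classical algorithms demand:

```python
import heapq

def dijkstra(graph, source):
    # graph: {node: [(neighbor, weight), ...]} with non-negative weights
    dist = {source: 0}
    heap = [(0, source)]  # priority queue of (distance, node)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue  # stale queue entry, a shorter path was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

graph = {'a': [('b', 1), ('c', 4)], 'b': [('c', 2)], 'c': []}
print(dijkstra(graph, 'a'))  # {'a': 0, 'b': 1, 'c': 3}
```

    Note how the route a→b→c (cost 3) beats the direct edge a→c (cost 4): exactly the relaxation step that makes the algorithm work.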

    At work, I’m blessed with colleagues who like to discuss and knowledge share about these topics.

    Bibliography of recent interest

    1. Optimization Beyond Prediction: Prescriptive Price Optimization
  • Good Book, Bad Program

    Programming is like writing a book. Both programs and books are written in a language, e.g. Datalog or Tagalog, but the similarity goes deeper than that. Imagine that you must write a book with the subject "a man walks his dog". There are endless ways to write that book, but only a few of them will become good books. Similarly, imagine that you must write a program that adds two numbers together. There are infinitely many programs that accomplish the task, but only a few of them are good programs. We cannot judge a programmer’s abilities solely on whether the unit tests pass, nor can we judge an author’s abilities solely on whether the book adheres to the subject.

  • Create a European city map with population density

    Datasets:

    – Urban morphological zones 2000 (EU): https://www.eea.europa.eu/data-and-maps/data/urban-morphological-zones-2000-2
    – Population count (World): http://sedac.ciesin.columbia.edu/data/set/gpw-v4-population-count-rev10/
    – Administrative regions (World): http://gadm.org/

    The map is European since the “urban” data from the European Environmental Agency (EEA) only covers Europe.

    Caveats

    The UMZ data ended up in PostGIS with SRID 900914. You can use prj2epsg.org to convert the contents of a .prj file to an estimated SRID code. In this case, the UMZ .prj file has the contents:

    PROJCS["ETRS89_LAEA_Europe",GEOGCS["GCS_ETRS_1989",DATUM["D_ETRS_1989",SPHEROID["GRS_1980",6378137,298.257222101]],PRIMEM["Greenwich",0],UNIT["Degree",0.017453292519943295]],PROJECTION["Lambert_Azimuthal_Equal_Area"],PARAMETER["latitude_of_origin",52],PARAMETER["central_meridian",10],PARAMETER["false_easting",4321000],PARAMETER["false_northing",3210000],UNIT["Meter",1]]
    

    Which translates to 3035 - ETRS89_LAEA_Europe.
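    If you hit the same situation, the wrong SRID tag can be corrected in place with PostGIS. A sketch, assuming the UMZ geometries sit in a table umz with geometry column geom (both names are hypothetical):

```sql
-- Re-tag the geometry column with the correct SRID
-- (this changes the metadata only; no reprojection happens)
SELECT UpdateGeometrySRID('umz', 'geom', 3035);
```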

  • How to create a world-wide PostgreSQL database of administrative regions

    The GADM database contains geographical data for administrative regions, e.g. countries, regions and municipalities. As always, once you have the data in the right format, it is easy to import it into a database. The data is available from GADM in several formats. All data uses a longitude/latitude coordinate reference system with the WGS84 datum.

    Step-by-step:

    1. Download data for the whole world or by country. For a change, I will use the GeoPackage format.
    2. Create a PostgreSQL database (the server is assumed to exist)
    3. Import the data with ogr2ogr (see instructions below)

    Import data instructions

    Download data (example for Denmark):

    wget http://biogeo.ucdavis.edu/data/gadm2.8/gpkg/DNK_adm_gpkg.zip
    unzip DNK_adm_gpkg.zip
    

    Next, create a database called “gadm” on your local PostgreSQL server; of course you can use another name if you prefer. Then install the PostGIS extension:

    CREATE EXTENSION postgis;
    

    Finally, use ogr2ogr with the GPKG (GeoPackage) driver to import the data:

    ogr2ogr -f PostgreSQL "PG:dbname=gadm" DNK_adm.gpkg
    

    Now the data is imported and ready to be queried.

    As a test, we can query the adm2 table (municipalities) with a coordinate inside the municipality of Copenhagen, Denmark.

    SELECT name_2, ST_AsText(wkb_geometry)
    FROM dnk_adm2
    WHERE ST_Intersects(ST_SetSRID(ST_Point(12.563585, 55.690628), 4326), wkb_geometry)
    -- AND ST_Point(12.563585, 55.690628) && wkb_geometry
    

    You can view the selected well-known text (WKT) geometry in an online viewer, such as openstreetmap-wkt-playground. Other viewers are listed on Stack Exchange.

    Alternative sources

    For this post I really wanted a dataset of populated/urban areas. However, the GADM data I downloaded only contains adm0-adm2, which is a tessellation of the land area, i.e. it cannot be used to discriminate between urban and rural areas.

    Other data sources are listed below:

    – http://www.naturalearthdata.com/downloads/
    – https://data.humdata.org
    – https://freegisdata.rtwilson.com/

    From the rtwilson list, here are some specific datasets that indicate population density and urbanism:

    – http://sedac.ciesin.columbia.edu/data/collection/gpw-v4/sets/browse
    – https://www.eea.europa.eu/data-and-maps/data/urban-morphological-zones-2000-2
    – http://www.worldpop.org.uk/ (does not cover Europe and North America)
    – https://nordpil.com/resources/world-database-of-large-cities/

  • How to assess computers on your local area network

    I teach children how to program and do other things with technology in an organisation called Coding Pirates in Denmark, which aims to be a kind of scout movement for geeks. A best-seller among the kids is learning how to hack, and I see this as a unique opportunity to convey some basic human values in relation to something that can be potentially harmful.

    Yesterday, one of the kids and I played with nmap, the network surveying tool, to investigate our local area network. The aim was to find information about the attached computers, such as the operating system, the owner’s first name (often part of the computer name) and whether any computer had open server ports (SSH, web etc.). We used nmap in combination with Wireshark.

    1. Tell another person about a fun website (any website will do)
    2. Use Wireshark to detect the IP address (e.g. 192.168.85.116) of any computer that accesses that website
    3. Use nmap to scan the IP address we found: nmap -vS 192.168.85.116

    We also learned how to detect that someone is logged into your computer and, e.g., kick the person (assuming an Ubuntu host):

    # Monitor login attempts
    tail -f /var/log/auth.log
    # See active sessions
    who
    # List remote sessions
    ps fax | grep 'pts/'
    # Kill sessions
    kill -9 [pid of bash processes connected to session]
    

    Other tricks

    List all hosts (ping scan) on your local area network:

    nmap -sP 192.168.1.*
    

    Find computers on your local area network that run an SSH server:

    nmap -p 22 --open -sV 192.168.1.*
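    The core of such a port check can also be sketched in Python with only the standard library. This is a toy illustration of what a TCP connect scan does, not a replacement for nmap (which adds service detection, timing control and much more):

```python
import socket

def is_port_open(host, port, timeout=1.0):
    # attempt a TCP connection; success means something is listening
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. check whether the local machine runs an SSH server
print(is_port_open("localhost", 22))
```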
    
  • Urban Mining – Gold from Airbags

    Airbag sensors

    There is a small amount of gold inside the airbag sensor: it contains a small gold-plated marble, which amounts to only a tiny quantity of gold.

    Integrated circuits

    One kilogram of chips contains 1-8 grams of gold (according to the Archimedes Channel on YouTube). Judging from the video, they look for gold in the processor-type chips (CPU, DSP, etc.).