No one in ad tech needs to know your name

I work in the ad tech industry, which means that I track people online for a living. Mainly, I do it because the industry has interesting computer science problems and because the job pays well.
I will not defend ad tech here, mainly because ad tech is not important enough to humanity to defend. I do, however, believe that ad tech’s algorithms are important to humanity, because they can be applied to areas that matter, such as your health, personal finance and education. But I have a different point to make today.

I have a subtle point about privacy. I have noticed that at no point does the ad tech industry need to know who you really are. Ad tech does not need to know your real name, your parents’ real names, your actual street address or any other piece of information that identifies you as you to another human being. It is a little bit hard to explain, but I will try. Ad tech is powered by algorithms, and these algorithms operate in an abstract space where your true identity is not important. Most ad tech knows you by a random number that was assigned to you. All your interests are also represented by random numbers. The place you live is yet another. Ad tech algorithms only care about the relationships between these numbers, not what the numbers actually represent in the real world.

Here is how it works. You get assigned a random number, e.g. 123, to represent you. Then, ad tech will attempt to link your number, 123, with the numbers of boxes that represent products or services that you might be interested in. For example, box A could be people who need a vacation and box B could be people who could be tempted to buy a new BMW. Ideally, if you really need a vacation and someone really wants to sell you that vacation, then a connection between 123 and A should be made. From ad tech’s perspective, the number 123 is linked to the box A. The algorithm does not need labels like “Alice Anderson” or “Bob Biermann”; arbitrary numbers like 123 get the job done just fine from a mathematical point of view.
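To make this concrete, here is a minimal sketch in Python of the kind of bookkeeping described above. The IDs, segment names and linking rule are invented for illustration; real systems are of course far more elaborate.

# Minimal sketch: pseudonymous IDs linked to interest segments (all names here are made up)
import random

def new_visitor_id():
    # A random, pseudonymous ID; no name or address is involved
    return random.getrandbits(64)

# Segments ("boxes") are themselves just numbers with no real-world meaning to the algorithm
SEGMENT_NEEDS_VACATION = 1001
SEGMENT_BMW_PROSPECT = 1002

# The only state the algorithm keeps: which visitor IDs are linked to which segment IDs
links = {}

def link(visitor_id, segment_id):
    links.setdefault(visitor_id, set()).add(segment_id)

# A visitor browses travel sites, so the system links the ID to the vacation segment
visitor = new_visitor_id()
link(visitor, SEGMENT_NEEDS_VACATION)
print(visitor, links[visitor])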

At some point your true identity becomes interesting, long after ad tech has left the scene. At some point, somebody (a human being or a robot) might need to print your real name and street address on a cardboard box, put the product you ordered inside and ship it via DHL. Up until that exact point, your name, street address and any other personally identifiable information is utterly unimportant to anybody. Nobody cares and no advertisement algorithm needs to know. I think this is an important point.
Ad-tech algorithms, if not ad tech itself, can have a massive and positive impact on areas of life that you probably care about. For example, algorithms can help you with your health, personal finance, insurance, education, whether you should buy Bitcoin or Ether today, whether you should attend job interview A instead of job interview B, or whether your kids should attend school X or Y. In these areas, relatively unaltered algorithms from ad tech can help. It is important to keep in mind that, again, no algorithm needs to know your name in order to work. Not even if that algorithm is looking through your medical record and correlating your stats with the stats of millions of other patient records.
Of course it is true that your real identity can be learned from seemingly anonymised data. It might even be fairly trivial to do so, using good old detective skills; the research around de-anonymisation and differential privacy has produced some fairly hard results in that area. However, the main point is that someone has to make a conscious decision to look into the data on a mission to find you, and possibly design a new algorithm for that purpose.

Now I get to my main point. Yes, ad tech CAN know who you are with some detective work. However, ad tech does not NEED to know who you are in order to work. This is so important because it means that we can potentially harness the power of algorithms in areas of life that matter — without compromising the privacy of anybody. It is not going to be easy to obtain the granular and self-controlled privacy that is needed, but it is worthwhile. And that is why I joined ad tech in the first place, because the computer science problems are interesting and important — and well, interesting and important things tend to pay well.

My recipe for dal

I love dal. Here is my simple recipe, which draws its inspiration from both India and the Mediterranean region.

Ingredients:

– 200-300 g split red lentils
– 1 onion
– 2-3 cloves of garlic
– 1 can of crushed tomatoes
– coriander seeds
– cumin
– fresh mint
– fresh coriander
– water
– salt and pepper

Fry finely-chopped onion and garlic in olive oil in a pot. Rinse the split red lentils in water and add them to the pot. It is ok if they fry a bit with the onion and garlic while you get the next ingredient ready. Add a can of crushed tomatoes. Refill the empty can with water and pour it into the pot. Refill the can and pour it in one more time. Bring it to a boil and then turn down the heat. Add some crushed coriander seeds and cumin. Also add some salt. Boil on low heat until the water has been absorbed and the lentils have become al dente. Grind some black pepper into the pot and turn off the heat. Chop some fresh mint (not too much, because the taste can become overpowering) and some fresh coriander and add them to the pot. Let it settle and cool down a bit. Eat it with bread. Enjoy.

Neural networks on GPUs: cost of DIY vs. Amazon

I like to dabble with machine learning, and specifically neural networks. However, I don’t like waiting for exorbitant amounts of time. Since my laptop does not have a graphics card that is supported by the neural network frameworks I use, I have to wait for a long time while my models get fitted. This is a problem.

The solution to the problem is to get access to a computer with a supported Nvidia GPU. Two approaches are to either get my own rig or rent one from Amazon. Which is cheaper?

Cost analysis

I will assume that I will train models on my machine (whether local or at Amazon) for two hours every day.

The Amazon p2 range of EC2 machines comes with Nvidia K80 cards, which cost about 50.000 DKK apiece. Already this analysis is going to be difficult; I will not buy a computer that costs 50.000 DKK just to train NN models. So, in this analysis I will be comparing apples to oranges, but that is how it is.

Cost of Amazon

The p2.xlarge EC2 instance has a single K80 GPU, which is at least as good as any rig I would consider buying.

The on-demand price is $0.9/hour; the spot price is about five times lower. Usage for two hours every day for a whole year costs roughly 4.500 DKK on-demand and roughly 900 DKK on spot instances. However, p2 instances are sometimes unavailable in the European spot markets.
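For transparency, here is the back-of-the-envelope calculation behind those numbers as a small Python sketch. The exchange rate is an assumption on my part (roughly 6.9 DKK per USD).

# Rough yearly cost of a p2.xlarge used 2 hours/day (assumed exchange rate: ~6.9 DKK/USD)
USD_TO_DKK = 6.9  # assumption
HOURS_PER_DAY = 2
DAYS_PER_YEAR = 365

on_demand_usd_per_hour = 0.9
spot_usd_per_hour = on_demand_usd_per_hour / 5  # "about five times lower"

def yearly_cost_dkk(usd_per_hour):
    return usd_per_hour * HOURS_PER_DAY * DAYS_PER_YEAR * USD_TO_DKK

print(round(yearly_cost_dkk(on_demand_usd_per_hour)))  # ~4500 DKK
print(round(yearly_cost_dkk(spot_usd_per_hour)))       # ~900 DKK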

Cost of DIY

What is the best GPU to get for a DIY machine learning rig? In 2016, Quora answers suggested that the Nvidia cards Titan X and GTX980TI would be best. Let’s go with that.

Both cards cost quite a bit more than 4.500 DKK, and that is only for the graphics card. The finished rig would probably cost around 15.000 DKK (Titan X) or 10.000 DKK (GTX 980 Ti).

The electricity also has to be factored in, and the cards are basically slower than the K80 anyway.

Best choice for increased usage

With increased usage, the DIY approach becomes cheaper than Amazon, albeit still the slower option. At a usage of 5 to 7 hours/day, the DIY approaches break even after about a year.
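Here is a rough sketch of that break-even calculation in Python. I am assuming the comparison is against on-demand pricing and the rig estimates above; the exchange rate is again my own assumption.

# Break-even time for a DIY rig vs. Amazon on-demand, for a given daily usage
# Assumptions: on-demand price $0.9/hour, ~6.9 DKK/USD, rig prices as estimated above
USD_TO_DKK = 6.9

def breakeven_days(rig_cost_dkk, hours_per_day, usd_per_hour=0.9):
    daily_amazon_cost_dkk = usd_per_hour * hours_per_day * USD_TO_DKK
    return rig_cost_dkk / daily_amazon_cost_dkk

print(breakeven_days(10000, 5))  # GTX rig at 5 hours/day: ~322 days
print(breakeven_days(15000, 7))  # Titan rig at 7 hours/day: ~345 days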

Further reading

Build a deep learning rig for $800.

Ether mining – first attempt

This guide outlines my first (and failed) attempt at mining Ether on AWS. I will first show how I set up a GPU instance to mine Ether. Then I will assess the profitability of the whole endeavour in a simple way, i.e. profit = [value of generated Ether] – [AWS cost].

The guide is based on this, this, this and this.

Specs:

Machine p2.xlarge (spot instance)
AMI ami-2cbf3e44
Miner cpp-ethereum
Pool ethermine.org
Result -$0.19/hour

Download blockchain

Log into AWS in us-east-1.

Launch a p2.xlarge EC2 machine as a spot instance. Set the bid price to $0.9. Use the image ami-2cbf3e44, which comes with the NVIDIA drivers etc. The instance should have 100 GB of disk space to hold the blockchain.
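If you prefer to script the launch rather than use the console, a boto3 sketch along these lines should work. The key pair and security group names are placeholders, and the root device name may differ for your AMI.

# Sketch: request a p2.xlarge spot instance with 100 GB of disk (assumes boto3 and AWS credentials are set up)
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

response = ec2.request_spot_instances(
    SpotPrice='0.9',
    InstanceCount=1,
    LaunchSpecification={
        'ImageId': 'ami-2cbf3e44',
        'InstanceType': 'p2.xlarge',
        'KeyName': 'my-key-pair',            # placeholder
        'SecurityGroups': ['my-ssh-group'],  # placeholder
        'BlockDeviceMappings': [
            {'DeviceName': '/dev/sda1',      # root device name may differ per AMI
             'Ebs': {'VolumeSize': 100, 'VolumeType': 'gp2'}},
        ],
    },
)
print(response['SpotInstanceRequests'][0]['SpotInstanceRequestId'])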

(SSH into the machine)

Install Ethereum:

sudo apt-get install software-properties-common
sudo add-apt-repository -y ppa:ethereum/ethereum-qt
sudo add-apt-repository -y ppa:ethereum/ethereum
sudo add-apt-repository -y ppa:ethereum/ethereum-dev
sudo apt-get update
sudo apt-get install ethereum

Start a screen session:

screen

Run geth in the screen session, which begins to download the blockchain:

geth

You can now detach from the screen session by hitting ctrl-a and d and disconnect from the machine. Periodically check back on the progress of geth downloading the blockchain.

(After a while)

Reattach to the screen session and check on the progress:

screen -r

Watch the output from geth. When geth has caught up with the blockchain, you will see messages arriving at a rate of roughly one block per line.

Mine Ether

(When the blockchain has caught up)

Now we will mine Ether using the default ethminer. This is not recommended according to ethermine.org, but we will see how bad it is.

Install cpp-ethereum, which provides ethminer:

sudo apt-get update
sudo apt-get install cpp-ethereum

Benchmark the miner to measure its hash rate. You will need to run it twice: the first run builds a DAG, and the second run performs the actual benchmark:

# Run twice
ethminer -G -M

I got about 19 MH/s (i.e. 19 million hashes per second) on the p2.xlarge instance.

Create a new screen window by hitting ctrl-a and c.

Now, mine some Ether in the new window. Use the -F option to send it to your account via ethermine.org:

ethminer -F http://eu1.ethermine.org:5555/0x7B194c41B9B5325ae4225Af86Ba4a3a2cdc6Bf4D.rig1 -G

The long string that begins with ‘0x7B…’ and ends right before ‘.’ is my Ether wallet. You should obviously use your own.

Now, you can detach from the screen session with ctrl-a and d and check back once in a while to see if it is still running. Also check whether Ether accumulates in your wallet. Finally, you can check the dashboard (mine) to see the progress.

Results

Now, let’s tally the Ether that was generated and subtract the cost for the AWS instance.

The AWS spot price for one p2.xlarge is about $0.2/hour. According to ethermine, my rig generates about $0.01 worth of Ether per hour. That gives a negative result of $0.01 – $0.2 = -$0.19/hour.

The AWS p2.xlarge instance would need to be 20x more effective to break even. Ethermine reports my effective hashrate as 19 MH/s. In conclusion, we need to squeeze at least 380 MH/s out of the p2.xlarge or find something else to do.
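For completeness, the profitability arithmetic as a tiny Python sketch, using the numbers above:

# Profitability of mining Ether on a p2.xlarge spot instance, using the numbers above
aws_cost_per_hour = 0.20     # USD, spot price
ether_value_per_hour = 0.01  # USD, as reported by ethermine
hashrate_mhs = 19            # MH/s, measured on the p2.xlarge

print(ether_value_per_hour - aws_cost_per_hour)  # -0.19 USD/hour

# How much more effective would the instance need to be just to break even?
required_factor = aws_cost_per_hour / ether_value_per_hour
print(required_factor)                 # 20x
print(required_factor * hashrate_mhs)  # 380 MH/s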

Python script for geocoding a text file

Assume that you have a file with some locations as text with one location per line.

For example, here are some school names in Copenhagen, Denmark, stored in schools.csv:

Hyltebjerg Skole
Heibergskolen
Ellebjerg Skole
Katrinedals Skole
Peder Lykke Skolen
Amager Fælled Skole
Tingbjerg Heldagsskole
Øster Farimagsgades Skole
Sankt Annæ Gymnasiums Grundskole
Lykkebo Skole
Randersgades Skole
Strandvejsskolen
Sortedamskolen
Grøndalsvængets Skole
Sølvgades Skole
Skolen ved Sundet
Hanssted Skole
Holbergskolen
Den Classenske Legatskole
Tove Ditlevsens Skole
Lergravsparkens Skole
Vigerslev Allés Skole
Bavnehøj Skole
Ålholm Skole
Langelinieskolen
Guldberg Skole
Husum Skole
Nyboder Skole
Vanløse Skole
Kirkebjerg Skole
Christianshavns Skole
Bellahøj Skole
Kildevældsskolen
Korsager Skole
Nørrebro Park Skole
Utterslev Skole
Skolen på Islands Brygge
Brønshøj Skole
Kirsebærhavens Skole
Rødkilde Skole
Vesterbro Ny Skole
Blågård Skole
Sønderbro Skole
Højdevangens Skole
Oehlenschlægersgades Skole
Vibenshus Skole
Valby Skole
Rådmandsgades Skole
Lundehusskolen
Tagensbo Skole

Here is a script, geocode.py, that will attempt to geocode each location in an input stream. It prints CSV output to stdout with the fields input_line, input_line_no, result_no, place, latitude, longitude:

from geopy import geocoders
import sys
import time
 
geocoder = geocoders.GoogleV3()
 
# '|' is a safe separator; you could also use a tab. A comma is bad, since the place will most likely contain a comma.
SEPARATOR = '|'
# Placeholder result used when the geocoder returns nothing for a line.
dummy = ['', ['', '']]
i = 0
 
header = ['input_line', 'input_line_no', 'result_no', 'place', 'latitude', 'longitude']
print(SEPARATOR.join(header))
 
for line in sys.stdin:
    line = line.strip()
    results = geocoder.geocode(line, exactly_one=False) or [dummy]
    for j, res in enumerate(results):
        place = res[0]
        lat = str(res[1][0])
        lon = str(res[1][1])
        out = SEPARATOR.join([line, str(i), str(j), place, lat, lon])
        print(out)
    time.sleep(0.05)  # be gentle with the geocoding service
    i += 1

Here is how you might use the script:

cat schools.csv | python geocode.py 

Tip: you might want to

How AI, robotics and advanced manufacturing could impact everybody’s life on Earth

What if everybody could live a wealthy, healthy, job-less and creative life in a post-scarcity Universe? Are we currently on a trajectory to this new reality and what are the obstacles we may face on the way? What are the important game-changing technologies?

TODO: create an agenda (very tentative):

1) contrast current life circumstances with a potential future
2) identify the key problems that we could solve with technology
3) review the players in society that will take part in this change
4) contrast views on the opportunities and threats of these technologies
5) …

Our future life conditions here on Earth might soon be impacted by game-changing advancements in artificial intelligence, robotics, manufacturing and genetics; at least if you ask people like Elon Musk, Andrew Ng and Ray Kurzweil. What are the most important technologies and what impact might they have? What are the dangers? Opinions differ, so the intention here is to review and contrast what leading fiction writers, scientists, visionaries and entrepreneurs think about the question: how will AI, robots and advanced manufacturing impact everybody’s life circumstances here on Earth?

Fiction to be reviewed

Post-scarcity:
– The Culture series

AI:
– Asimov

The Human-Computer Cortex:
– That Swedish guy who wrote sci-fi computer implants in the 70’s

Non-fiction to be reviewed

AI:
– Douglas Hofstadter: GEB

Videos to be reviewed

AI:


The Human-Computer Cortex:

News articles to be reviewed

AI:
– https://aifuture2016.stanford.edu/
– http://fortune.com/2016/06/15/future-of-work-2/
– http://www.businessinsider.com/researchers-predictions-future-artificial-intelligence-2015-10?r=US&IR=T&IR=T

3D printing:
– https://hbr.org/2013/03/3-d-printing-will-change-the-world

Easy parallel HTTP requests with Python and asyncio

Python 3.x, and in particular Python 3.5, natively supports asynchronous programming. While asynchronous code can be harder to read than synchronous code, there are many use cases where the added complexity is worthwhile. One such example is executing a batch of HTTP requests in parallel, which I will explore in this post. Additionally, the async-await paradigm used by Python 3.5 makes the code almost as easy to understand as synchronous code.

Before we look at asynchronous requests, let us look at the sequential case. This will give us something to compare with later. The code listing below is an example of how to make twenty synchronous HTTP requests:

# Example 1: synchronous requests
import requests

num_requests = 20

responses = [
    requests.get('http://example.org/')
    for i in range(num_requests)
]

How does the total completion time develop as a function of num_requests? The chart below shows my measurements. The curve is unsurprisingly linear:

Using synchronous requests, I was able to execute just five requests per second in my experiment.

Next, let us see how we can make asynchronous HTTP requests with the help of asyncio. The code listing below is an example of how to make twenty asynchronous HTTP requests in Python 3.5 or later:

# Example 2: asynchronous requests
import asyncio
import requests

async def main():
    loop = asyncio.get_event_loop()
    futures = [
        loop.run_in_executor(
            None, 
            requests.get, 
            'http://example.org/'
        )
        for i in range(20)
    ]
    for response in await asyncio.gather(*futures):
        pass

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

The code is now more convoluted. But is it better? In the chart below, you will see the total completion time in seconds as a function of the number of asynchronous requests made.

The stepwise curve indicates that some requests are being executed in parallel. However, the curve is still asymptotically linear. Let’s find out why.

Increasing the number of threads

Notice that there is a step pattern in the chart? The completion time is 1x for 1-5 requests, 2x for 6-10 requests, 3x for 11-15 requests, and so on. The reason that we see this step pattern is that the default Executor has an internal pool of five threads that execute work. While five requests can be executed in parallel, any remaining requests will have to wait for a thread to become available.

So here is an idea. To minimize the total completion time, we could increase the size of the thread pool to match the number of requests we have to make. Luckily, this is easy to do as we will see next. The code listing below is an example of how to make twenty asynchronous HTTP requests with a thread pool of twenty worker threads:

# Example 3: asynchronous requests with larger thread pool
import asyncio
import concurrent.futures
import requests

async def main():

    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:

        loop = asyncio.get_event_loop()
        futures = [
            loop.run_in_executor(
                executor, 
                requests.get, 
                'http://example.org/'
            )
            for i in range(20)
        ]
        for response in await asyncio.gather(*futures):
            pass


loop = asyncio.get_event_loop()
loop.run_until_complete(main())

Let us rerun the experiment from before.

Bingo. If the executor has a pool of 20 threads then 20 requests take roughly the same amount of time as 1 request.

The next obvious question to ask is: what is the limit to this approach? How does the total completion time increase for a more substantial amount of asynchronous requests and equal amount of threads in the executor pool, say 100 or 1000? Let’s find out.

We are back to linear – the flat kind. The chart shows that we can easily make 500+ asynchronous HTTP requests in the same time it takes to make 10 synchronous requests – or 300 requests per second. Surely a vast improvement. You might have noticed that the chart only goes to 720 requests, not 1000. The reason is that “something bad” happened when I reached 720 requests/threads and my timing program became very sad and stopped working. On a different computer than mine, this limit will likely be found somewhere else.

Conclusion

First, you have seen that with asynchronous requests you can execute more requests per second than with synchronous requests: 300 requests per second compared to just 5 requests per second (in my experiments).

Second, you have seen how to achieve this well-known benefit of async I/O in Python 3.5 using just slightly convoluted code.

I would like to apologize to the people who host example.org for hammering their website during my experiment. I like to think that it was for a good cause.

(Tentative)

The symbiosis between humans and AI could transform humans into rational organisms (cf. Daniel Kahneman, who has shown that humans on their own are not rational organisms). How so? Our minute behaviour is increasingly being tracked in all significant areas of life. Artificial intelligence is getting better and better at estimating whether we are happy, healthy and wealthy from an analysis of all the traces we leave behind everywhere. We are now in a situation where we can, or soon will be able to, ask questions such as: how happy, healthy and wealthy was person X at time t? Which actions h1, h2, h3, … had person X taken (e.g. on Spotify, travel, job changes, doctor’s visits) leading up to that moment? How happy will X be at time t+1, t+10, t+1000 if everything continues as it is now? Which actions should X take to maximise their happiness at time t+1000?
In other words, there are complex areas of life where complex AI has the potential to maximise our long-term utility (e.g. our “joy of life” or our net worth in 10 years). Imagine that a personal AI could
– Find your next home
– Find a school or leisure activity for your child
– Find investment opportunities
– Find love
– Find friends
– Find your next meal
– etc.

How to select top-k items from each group in SQL

Here is an analytical query that you (and I) will often need if you work in e-commerce, marketing or a similar domain. It answers the question: within each group of items (e.g. partitioned by territory, age group or something else), what are the top-k items according to some utility function over the items (e.g. the number of units sold)?

Assume we have a table, aggregated_sales(id INT, item TEXT, country TEXT, total_sold INT). Here is a sample:

ID  ITEM              COUNTRY  TOTAL_SOLD
1   Basketball        DK       125
2   Basketball        US       432943
3   Leather football  FO       64773 
4   Leather football  DK       56230
5   Boxing gloves     CU       9812

In SQL, here is how to get the top-3 selling items by country:

SELECT id, item, country, total_sold
FROM (
  SELECT
    *,
    row_number() OVER (PARTITION BY country ORDER BY total_sold DESC) as rn
  FROM aggregated_sales
) t
WHERE rn <= 3;

Bonus info: the table sample was generated from a CSV file with the command column -s , -t filename.csv.

How to get structured Wikipedia data via DBPedia

Wikipedia contains a wealth of knowledge. While some of that knowledge consists of natural language descriptions, a rich share of information on Wikipedia is encoded in machine-readable format, such as “infoboxes” and other specially formatted parts. An infobox is rendered as a table that you typically see on the right-hand side of an article.

Infobox

While you could download the page source for a Wikipedia article and extract the information yourself, there is a project called DBPedia that has done the hard work for you. That’s right, you can conveniently retrieve machine-readable data that stems from Wikipedia via the DBPedia API.

Example

Let us explore the DBPedia API by way of an example.

I like tennis data and most player pages on Wikipedia have an infobox that contains basic information about a player, such as age, hand, and current singles rank. Let’s try to retrieve information about the Italian tennis player, Matteo Donati, via the JSON resource exposed by DBPedia:

http://dbpedia.org/data/Matteo_Donati.json

In this example, we will fetch and process the JSON data with a small Python script.

# Python
import requests
 
data = requests.get('http://dbpedia.org/data/Matteo_Donati.json').json()
matteo = data['http://dbpedia.org/resource/Matteo_Donati']
 
# matteo is a dictionary with lots of keys
# that correspond to the player's properties.
# Each value is a list of dictionaries itself.
 
height = matteo['http://dbpedia.org/ontology/height'][0]['value']
# 1.88  (float)
birth_year = matteo['http://dbpedia.org/ontology/birthYear'][0]['value']
# '1995'  (string)
hand = matteo['http://dbpedia.org/ontology/plays'][0]['value']
# 'Right-handed (two-handed backhand)'  (string)
singles_rank = matteo['http://dbpedia.org/property/currentsinglesranking'][0]['value']
# 'No. 171'  (string)

The simple convention for URLs on DBPedia is that spaces in names are replaced by underscores, exactly like on Wikipedia. For example, if we wanted to look up Roger Federer, we would make a request to the resource:

http://dbpedia.org/data/Roger_Federer.json

Please note that, at the time of writing, DBPedia does not support https.
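If you want to build these URLs programmatically, a small helper along the following lines should do; the function name is my own invention.

# Sketch: build the DBPedia JSON data URL for a Wikipedia article title
def dbpedia_data_url(article_title):
    # Spaces become underscores, exactly like in Wikipedia URLs
    return 'http://dbpedia.org/data/{}.json'.format(article_title.replace(' ', '_'))

print(dbpedia_data_url('Roger Federer'))
# http://dbpedia.org/data/Roger_Federer.json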

Redundancy and inconsistency

The data on Matteo Donati and other entities on DBPedia is both redundant and somewhat inconsistent. This can be seen if we enumerate the keys on Matteo Donati:

for key in sorted(matteo): print(key)
"""
http://dbpedia.org/ontology/Person/height
http://dbpedia.org/ontology/abstract
http://dbpedia.org/ontology/birthDate
http://dbpedia.org/ontology/birthPlace
http://dbpedia.org/ontology/birthYear
http://dbpedia.org/ontology/careerPrizeMoney
http://dbpedia.org/ontology/country
http://dbpedia.org/ontology/height
http://dbpedia.org/ontology/plays
http://dbpedia.org/ontology/residence
http://dbpedia.org/ontology/thumbnail
http://dbpedia.org/ontology/wikiPageID
http://dbpedia.org/ontology/wikiPageRevisionID
http://dbpedia.org/property/birthDate
http://dbpedia.org/property/birthPlace
http://dbpedia.org/property/caption
http://dbpedia.org/property/careerprizemoney
http://dbpedia.org/property/currentdoublesranking
http://dbpedia.org/property/currentsinglesranking
http://dbpedia.org/property/dateOfBirth
http://dbpedia.org/property/doublesrecord
http://dbpedia.org/property/doublestitles
http://dbpedia.org/property/highestdoublesranking
http://dbpedia.org/property/highestsinglesranking
http://dbpedia.org/property/name
http://dbpedia.org/property/placeOfBirth
http://dbpedia.org/property/plays
http://dbpedia.org/property/residence
http://dbpedia.org/property/shortDescription
http://dbpedia.org/property/singlesrecord
http://dbpedia.org/property/singlestitles
http://dbpedia.org/property/updated
http://dbpedia.org/property/usopenresult
http://dbpedia.org/property/wimbledonresult
http://purl.org/dc/elements/1.1/description
http://purl.org/dc/terms/subject
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://www.w3.org/2000/01/rdf-schema#comment
http://www.w3.org/2000/01/rdf-schema#label
http://www.w3.org/2002/07/owl#sameAs
http://www.w3.org/ns/prov#wasDerivedFrom
http://xmlns.com/foaf/0.1/depiction
http://xmlns.com/foaf/0.1/givenName
http://xmlns.com/foaf/0.1/isPrimaryTopicOf
http://xmlns.com/foaf/0.1/name
http://xmlns.com/foaf/0.1/surname
"""

You’ll notice that, e.g., the height of Matteo Donati is stored under two different keys:

  • http://dbpedia.org/ontology/Person/height
  • http://dbpedia.org/ontology/height

Luckily, both keys list Donati’s height as 1.88 m, albeit as a string type and a numeral type respectively. Other bits of information that are redundantly stored include his birth date, dominant hand (“plays”) and career prize money won so far.

With redundancy comes the possibility for inconsistency. In other words, there is no guarantee that redundant keys will keep identical values. For example, Matteo Donati is listed both as ‘Right-handed (two-handed backhand)’ and simply as ‘Right-handed’. While in this case the inconsistency is merely a matter of information detail, it can get a little confusing in general.
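One way to cope with the redundancy is to look a property up under several candidate keys and take the first value that is present. Here is a minimal sketch that reuses the matteo dictionary from above; the helper name is my own.

# Sketch: return the first available value among several redundant DBPedia keys
def first_value(entity, candidate_keys, default=None):
    for key in candidate_keys:
        if key in entity and entity[key]:
            return entity[key][0]['value']
    return default

height = first_value(matteo, [
    'http://dbpedia.org/ontology/height',
    'http://dbpedia.org/ontology/Person/height',
])
print(height)  # 1.88 (the exact type depends on which key was found)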

Conclusion

DBPedia is a great way to access structured data from Wikipedia articles. While the information is machine-readable in a popular format, you will have to guard against missing keys, redundant keys and inconsistent values. I hope you enjoyed this quick introduction to DBPedia and that you will find good use for the information.