Do What You Want

The message of this movie is “do what you want”. The video is a meditative portrait of a man who was once a successful medical doctor, and who, at 60+, lives in a small studio in Los Angeles (I think). Every day, he goes out to skate the beaches of Southern California with a big smile on his face. He does not seem crazy, although I don’t know for sure. He tells us that he had an opportunity and took it. He decided to stop being an asshole, and start being spiritual.

If, like me, you are a parent of a two-year-old and a four-year-old, I hope you will interleave the “do what you want” part with an adequate dose, or more, of “take care of your children”. Another message of the video is that it is never too late to realize yourself, even when you are 60+ years old. Until then, happy grinding!

Slomo from The New York Times – Video on Vimeo.

What Goes Around Comes Around

Today I read the What Goes Around Comes Around chapter from the “Red Book” by Michael Stonebraker and Joseph M. Hellerstein. The chapter (or paper if you will) is a summary of 35 years of data model proposals, grouped into 9 different eras. This post is a kind of cheat sheet to the lessons learned in the chapter.

The paper surveyed three decades of data model thinking. It is clear that we have come “full circle”. We started off with a complex data model (Hierarchical/Network model), which was followed by a great debate between a complex model and a much simpler one (Relational model). The simpler one was shown to be advantageous in terms of understandability and its ability to support data independence.

Then, a substantial collection of additions was proposed, none of which gained substantial market traction, largely because they failed to offer substantial leverage in exchange for the increased complexity. The only ideas that got market traction were user-defined functions and user-defined access methods (both from the Object-Relational model), and these were performance constructs, not data model constructs. The current proposal is now a superset of the union of all previous proposals. I.e. we have navigated a full circle.

Hierarchical Data Model (IMS)

Late 1960’s and 1970’s

  • Lesson 1: Physical and logical data independence are highly desirable
  • Lesson 2: Tree structured data models are very restrictive
  • Lesson 3: It is a challenge to provide sophisticated logical reorganizations of tree structured data
  • Lesson 4: A record-at-a-time user interface forces the programmer to do manual query optimization, and this is often hard. (Key-Value stores anyone?)

Network Data Model (CODASYL)


  • Lesson 5: Networks are more flexible than hierarchies but more complex
  • Lesson 6: Loading and recovering networks is more complex than hierarchies

Relational Data Model

1970’s and early 1980’s

  • Lesson 7: Set-at-a-time languages are good, regardless of the data model, since they offer much improved physical data independence
  • Lesson 8: Logical data independence is easier with a simple data model than with a complex one
  • Lesson 9: Technical debates are usually settled by the elephants of the marketplace, and often for reasons that have little to do with the technology (Key-Value stores anyone?)
  • Lesson 10: Query optimizers can beat all but the best record-at-a-time DBMS application programmers (Key-Value stores anyone?)
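
Lessons 4, 7 and 10 can be made concrete with a small sketch. It uses Python's built-in sqlite3 module and a made-up supply table (the table, its columns and both function names are my own assumptions, not something from the chapter). The first function mimics record-at-a-time access, where the programmer decides how to walk the data; the second hands a set-at-a-time query to the DBMS and lets it pick the access path.

```python
import sqlite3

# Hypothetical toy data: (supplier, part, quantity) rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE supply (supplier TEXT, part TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO supply VALUES (?, ?, ?)",
    [("s1", "bolt", 100), ("s1", "nut", 50), ("s2", "bolt", 30), ("s3", "nut", 70)],
)


def total_record_at_a_time(part):
    """The programmer picks the access path: visit every row, filter by hand."""
    total = 0
    for _supplier, row_part, qty in conn.execute(
        "SELECT supplier, part, qty FROM supply"
    ):
        if row_part == part:
            total += qty
    return total


def total_set_at_a_time(part):
    """Declarative query: the DBMS decides how to find the matching rows."""
    row = conn.execute(
        "SELECT COALESCE(SUM(qty), 0) FROM supply WHERE part = ?", (part,)
    ).fetchone()
    return row[0]


print(total_record_at_a_time("bolt"), total_set_at_a_time("bolt"))
```

Both functions return the same answer. The difference is that adding an index on part later can speed up the second one without touching the application code, which is the physical data independence the lessons talk about.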

Entity-Relationship Data Model


  • Lesson 11: Functional dependencies are too difficult for mere mortals to understand

Extended Relational Data Model


  • Lesson 12: Unless there is a big performance or functionality advantage, new constructs will go nowhere

Semantic Data Model

Late 1970’s and 1980’s

Innovation: classes, multiple inheritance.

No lessons learned, but the model failed for the same reasons as the Extended Relational Data Model.

Object-oriented: late 1980’s and early 1990’s

Beginning in the mid 1980’s there was a “tidal wave” of interest in Object-oriented DBMSs (OODB). Basically, this community pointed to an “impedance mismatch” between relational data bases and languages like C++.

Impedance mismatch: In practice, relational data bases had their own naming systems, their own data type systems, and their own conventions for returning data as a result of a query. Whatever programming language was used alongside a relational data base also had its own version of all of these facilities. Hence, to bind an application to the data base required a conversion from “programming language speak” to “data base speak” and back. This was like “gluing an apple onto a pancake”, and was the reason for the so-called impedance mismatch.
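
A minimal sketch of what that conversion looks like in practice, using Python and the built-in sqlite3 module. The Employee class, the emp table and the two helper functions are hypothetical names I made up for illustration; the point is only that every object has to be flattened into a row on the way in and rebuilt on the way out.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Employee:  # "programming language speak"
    name: str
    salary: float


def to_row(emp):
    """Flatten an object into the tuple the database driver expects."""
    return (emp.name, emp.salary)


def from_row(row):
    """Rebuild an object from a result tuple."""
    return Employee(name=row[0], salary=row[1])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary REAL)")  # "data base speak"
conn.execute("INSERT INTO emp VALUES (?, ?)", to_row(Employee("Ada", 5000.0)))

loaded = [from_row(r) for r in conn.execute("SELECT name, salary FROM emp")]
print(loaded)
```

The to_row and from_row helpers are the glue between the apple and the pancake. OODBs promised to make them disappear.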

  • Lesson 13: Packages will not sell to users unless they are in “major pain”
  • Lesson 14: Persistent languages will go nowhere without the support of the programming language community


Object-Relational Data Model

Late 1980’s and early 1990’s

The Object-Relational (OR) era was motivated by the need to index and query geographical data (using e.g. an R-tree access method), since two dimensional search is not supported by existing B-tree access methods.

As a result, the OR proposal added:

  • user-defined data types
  • user-defined operators
  • user-defined functions
  • user-defined access methods

  • Lesson 14: The major benefits of OR are two-fold: putting code in the data base (and thereby blurring the distinction between code and data) and user-defined access methods
  • Lesson 15: Widespread adoption of new technology requires either standards and/or an elephant pushing hard
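
To get a feel for “putting code in the data base”, here is a small sketch that registers a user-defined function with SQLite through Python's sqlite3 module. The place table, its coordinates and the distance function are assumptions of mine for illustration. Note that this only shows the user-defined function half of the OR story: the query below still scans every row, since no user-defined access method (such as an R-tree) is involved.

```python
import math
import sqlite3


def distance(x1, y1, x2, y2):
    """Euclidean distance between two points; called from inside the query."""
    return math.hypot(x2 - x1, y2 - y1)


conn = sqlite3.connect(":memory:")
# Register the Python function under the SQL name "distance" with 4 arguments.
conn.create_function("distance", 4, distance)

conn.execute("CREATE TABLE place (name TEXT, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO place VALUES (?, ?, ?)",
    [("a", 0.0, 0.0), ("b", 3.0, 4.0), ("c", 10.0, 10.0)],
)

# The filter runs inside the DBMS instead of in application code.
near = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM place WHERE distance(x, y, 0, 0) <= 5 ORDER BY name"
    )
]
print(near)
```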

Semi-structured (XML)

Late 1990’s to the present

There are two basic points that this class of work exemplifies: (1) schema last and (2) complex network-oriented data model.

  • Lesson 16: Schema-last is probably a niche market
  • Lesson 17: XQuery is pretty much OR SQL with a different syntax
  • Lesson 18: XML will not solve the semantic heterogeneity either inside or outside the enterprise

Get Weather using JSON web service and Python

Get the current weather for Copenhagen:

import json
from urllib.request import urlopen

# fetch the weather for Copenhagen
url = ',dk'
response = urlopen(url)
# parse the JSON result
data = json.load(response)
print('Weather in Copenhagen:', data['weather'][0]['description'])

Linked Data: First Blood

Knowing a lot about something makes me more prone to appreciating its value. Unfortunately, I know very little about Linked Data. For this reason, I’ve had a very biased and shamefully low opinion of the concept. I’ve decided to change this.

A repository of linked data that I’ve recently taken an interest in is DBPedia. DBPedia is a project that extracts structured data (linked data) from Wikipedia and exposes it via a SPARQL endpoint. With the interest in DBPedia come the first sparks (pun intended) of interest in RDF endpoints and, in particular, SPARQL.

The brilliant thing about DBPedia (and SPARQL) is that it makes it possible to query a vast repository of information, originally in raw text form, using a proper query language. It’s Wikipedia with a nerd boner on.

So what can you do with SPARQL and DBPedia? There are several examples on the DBPedia homepage.

Here is one (slightly modified): Find all people born in Copenhagen before 1900 (the link points to a page that executes the query):

PREFIX dbo: <>
SELECT ?name ?birth ?death ?person WHERE {
     ?person dbo:birthPlace :Copenhagen .
     ?person dbo:birthDate ?birth .
     ?person foaf:name ?name .
     ?person dbo:deathDate ?death .
     FILTER (?birth < "1900-01-01"^^xsd:date) .
}
ORDER BY ?name

Looking at the names that are returned, I believe that those are names of people born in Copenhagen before 1900. A test probe looking up one of the people on the list confirms it. According to Wikipedia, Agnes Charlotte Dagmar Adler was a pianist born in Copenhagen in 1865.

Ok, the hello world of linked data has been committed to this blog. This will NOT be the last thing I write about Linked Data… I’ve seen the light.

This blog post is dedicated to Anders Friis-Christensen, who tried (without luck) to get me interested in Linked Data two years ago. I might be a bit slow, but I eventually get it 🙂

The scale of the Danish cyber effort

How much money does Denmark spend on cyber defense compared to the U.S., in total and per citizen? This is what I’ll look at in this post. I’ll also try to get an initial idea of what is going on. Why am I doing this? Actually, just out of curiosity, and to kill some time before I have my hair cut.

Picking up the paper-paper (Politiken) this morning, I read a short opinion piece about the intelligence branch of the Danish armed forces (FE: Forsvarets Efterretningstjeneste), and in particular the new Center for Cybersecurity. The concern is that this new center is going to spy on ordinary Danish citizens (NSA-style). It made me curious, and I decided to investigate for myself.

Web soldiers during combat. Not entirely sure that’s not World of Warcraft.

The center was established in 2011 with a fairly modest budget of 35 million DKK a year (out of a 90 million DKK budget in 2014 for cyber efforts by the Danish Ministry of Defense; increased to 150 million DKK in 2016). This is a modest budget, given what truly skilled IT professionals charge per hour and what IT equipment costs in general, but also compared to what other institutions in Denmark receive. For example, the Danish Geodata Agency, which I’ve had the great pleasure of working for, has an annual budget of more than 200 million DKK.

So 90 million DKK for cyber defense versus 200 million DKK for geographical data (2014).

In the United States, the Defense Department allocates $4.7 billion in its annual budget to “cyber efforts”. Converting the currency, that is 25 billion DKK versus 90 million DKK, a ratio of roughly 278:1.

Red square is Danish budget, blue square is U.S. budget:

The population of the United States is 313 million people. The population ratio between the U.S. and Denmark is approximately 62:1. The United States thus spends roughly 4.5 times more money per capita on cyber efforts than Denmark.
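
For anyone who wants to check the arithmetic, here is a quick sketch in Python. It only uses the figures quoted in this post (the converted U.S. budget, the Danish 2014 budget and the population ratio); the variable names are mine.

```python
# Figures as stated in the post (2014 budgets, in DKK).
us_budget_dkk = 25e9   # $4.7 billion converted to DKK
dk_budget_dkk = 90e6
population_ratio = 62  # U.S. : Denmark, as given above

# How many times larger the U.S. budget is in absolute terms.
budget_ratio = us_budget_dkk / dk_budget_dkk

# Dividing by the population ratio gives the per-capita spending ratio.
per_capita_ratio = budget_ratio / population_ratio

print(round(budget_ratio), round(per_capita_ratio, 1))  # prints: 278 4.5
```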

Dollars spent on cyber efforts per person in the U.S and in Denmark:

When trying to understand the motivation for national cyber efforts, Danish independent media seems to focus on the threat posed by industrial espionage (from abroad?) against Danish companies (1, 2, 3). This is surely a real threat, and should be a primary mission IMO.

The stated mission of the center, as described on the homepage for the center, is a bit more vague. It goes something like this:

Strengthen Denmark’s resilience against threats directed at information and communication technology (ICT) that is critical to society; ensure the prerequisites for a robust ICT infrastructure in Denmark; warn of and counter cyberattacks with a view to strengthening the protection of Danish interests.

I’m not really sure what that means concretely. What the paper-paper (Politiken) is concerned about is that the center is going to spy on Danish and foreign citizens. Given the modest annual budget and the usual burn rate in public administration, I think this is going to be a rather weak threat to our privacy. Another question is, what should the primary mission of the center be, and how should that mission be accomplished? In any event, 90 million DKK does not go a long way towards anything. That being said, I’m certainly curious about what the money IS spent on. If I learn, I’m not sure I’ll post it on my personal blog, so don’t hold your breath.

This was primarily a way to pass some time before I have my hair cut (in five minutes).

Finding the haystack

Until this summer, many believed they knew the truth: that it is hard to find the needle in the haystack. Only very few knew that the proverb has been turned on its head in the intelligence world. There, the saying goes: “to find the needle, we need the haystack”.

Writing a parser in Python

This is my base pattern for writing a parser in Python by using the pyparsing library. It is slightly more complicated than a hello world in pyparsing, but I think it is more useful as a small example of writing a parser for a real grammar.

A base class PNode is used to provide utility functions to classes implementing parse tree nodes, e.g. turning a parse tree back into the original string (except that all whitespace is replaced by a single space). It assumes that tokens in the input were separated by whitespace, and that all whitespace is the same.

For a particular grammar, I use Python classes to represent nodes in the parse tree; these classes get created by calling the setParseAction method on the corresponding BNF element. I like having these classes because it adds a nice structure to the parse tree.

from pyparsing import Literal, StringEnd, Word, ZeroOrMore, nums

class PNode(object):
    """Base class for parse tree nodes"""
    def __init__(self, tokens):
        super().__init__()
        self.tokens = tokens
    def __str__(self):
        # Join the string forms of all child tokens with single spaces
        return " ".join(str(x) for x in self.tokens)
    def __repr__(self):
        return self.__str__()

# Target classes
class Integer(PNode):
    def __init__(self, tokens):
        super().__init__(tokens)
        self.value = int(tokens[0])

class Comma(PNode):
    def __init__(self, tokens):
        super().__init__(tokens)

class IntegerList(PNode):
    def __init__(self, tokens):
        super().__init__(tokens)
        # Keep only the Integer nodes, dropping the Comma separators
        self.integers = [t for t in tokens if isinstance(t, Integer)]

# Grammar: each BNF element builds its node class via setParseAction
comma = Literal(',').setParseAction(Comma)
integer = Word(nums).setParseAction(Integer)
integer_list = (integer + ZeroOrMore(comma + integer)).setParseAction(IntegerList)
bnf = integer_list + StringEnd()

# Try parser
parsed_list = bnf.parseString('1,2,3')[0]
print(parsed_list)