Linked Data: First Blood

Knowing a lot about something makes me more prone to appreciating its value. I unfortunately know very little about linked data, and for this reason I've had a very biased and shamefully low opinion of the concept. I've decided to change this.

A repository of linked data that I've recently taken an interest in is DBPedia, a project that extracts structured data (linked data) from Wikipedia and exposes it via a SPARQL endpoint. With the interest in DBPedia come the first sparks (pun intended) of interest in RDF endpoints, and in SPARQL in particular.

The brilliant thing about DBPedia (and SPARQL) is that it makes it possible to query a vast repository of information, originally in raw text form, using a proper query language. It’s Wikipedia with a nerd boner on.

So what can you do with SPARQL and DBPedia? There are several examples on the DBPedia homepage.

Here is one of them, slightly modified: find all people born in Copenhagen before 1900 (the link points to a page that executes the query):

PREFIX : <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
 
SELECT ?name ?birth ?death ?person WHERE {
     ?person dbo:birthPlace :Copenhagen .
     ?person dbo:birthDate ?birth .
     ?person foaf:name ?name .
     ?person dbo:deathDate ?death .
     FILTER (?birth < "1900-01-01"^^xsd:date) .
}
ORDER BY ?name

Looking at the names returned, they do appear to be people born in Copenhagen before 1900. A test probe, looking up one of the people on the list, confirms it: according to Wikipedia, Agnes Charlotte Dagmar Adler was a pianist born in Copenhagen in 1865.
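If you'd rather run the query programmatically than via the link, here is a minimal sketch using the SPARQLWrapper Python library against the public endpoint (the library choice is mine; it is not part of the original example):

from SPARQLWrapper import SPARQLWrapper, JSON  # pip install SPARQLWrapper

# Public DBPedia SPARQL endpoint; the query is the same one as above.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX : <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    SELECT ?name ?birth ?death ?person WHERE {
        ?person dbo:birthPlace :Copenhagen .
        ?person dbo:birthDate ?birth .
        ?person foaf:name ?name .
        ?person dbo:deathDate ?death .
        FILTER (?birth < "1900-01-01"^^xsd:date) .
    }
    ORDER BY ?name
""")

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["name"]["value"], row["birth"]["value"])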

Ok, the hello world of linked data has been committed to this blog. This will NOT be the last thing I write about Linked Data… I've seen the light.

This blog post is dedicated to Anders Friis-Christensen, who tried (without luck) to get me interested in Linked Data two years ago. I might be a bit slow, but I eventually get it 🙂

The scale of the Danish cyber effort

How much money does Denmark spend on cyber defense compared to the U.S., in total and per citizen? That is what I'll look at in this post. I'll also try to get an initial idea of what is going on. Why am I doing this? Just out of curiosity, and to kill some time before I have my hair cut.

Picking up the paper-paper (Politiken) this morning, I read a short opinion piece about the intelligence branch of the Danish armed forces (FE: Forsvarets Efterretningstjeneste), and in particular its new Center for Cybersecurity. The concern is that this new center is going to spy on ordinary Danish citizens (NSA-style). It made me curious, and I decided to investigate for myself.

Web soldiers during combat. Not entirely sure that’s not World of Warcraft.

The center was established in 2011 with a fairly modest annual budget of 35 million DKK (out of a 90 million DKK budget in 2014 for cyber efforts by the Danish Ministry of Defense; increased to 150 million DKK in 2016). This is a modest budget, given what truly skilled IT professionals charge an hour and what IT equipment costs in general, but also compared to what other institutions in Denmark receive. For example, the Danish Geodata Agency, which I've had the great pleasure of working for, has an annual budget of more than 200 million DKK.

So 90 million DKK for cyber defense versus 200 million DKK for geographical data (2014).

In the United States, the Defense Department allocates $4.7 billion in its annual budget for “cyber efforts”. Converting to Danish kroner (at roughly 5.3 DKK to the dollar), that is about 25 billion DKK versus 90 million DKK, a ratio of roughly 277:1.

Red square is Danish budget, blue square is U.S. budget:

The population of the United States is 313 million, against Denmark's roughly 5.6 million, a population ratio of approximately 56:1. The United States thus spends roughly five times more money per capita on cyber efforts than Denmark.
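For transparency, here is the back-of-the-envelope arithmetic as a few lines of Python (the exchange rate of about 5.3 DKK per dollar and the Danish population of 5.6 million are assumptions implied by the numbers above):

# Back-of-the-envelope comparison of U.S. and Danish cyber budgets.
usd_to_dkk = 5.3                       # assumed 2014-ish exchange rate
us_budget_dkk = 4.7e9 * usd_to_dkk     # ~25 billion DKK
dk_budget_dkk = 90e6                   # 90 million DKK

budget_ratio = us_budget_dkk / dk_budget_dkk        # ~277
population_ratio = 313e6 / 5.6e6                    # ~56
per_capita_ratio = budget_ratio / population_ratio  # ~5

print(round(budget_ratio), round(population_ratio), round(per_capita_ratio))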

Dollars spent on cyber efforts per person in the U.S and in Denmark:

When trying to understand the motivation for national cyber efforts, the independent Danish media seem to focus on the threat of industrial espionage (from abroad?) against Danish companies (1, 2, 3). This is surely a real threat, and IMO countering it should be a primary mission.

The stated mission of the center, as described on its homepage, is a bit more vague. Translated from the Danish, it goes something like this:

Strengthen Denmark's resilience against threats directed at information and communication technology (ICT) of vital importance to society; secure the preconditions for a robust ICT infrastructure in Denmark; warn of and counter cyber attacks in order to strengthen the protection of Danish interests.

I'm not really sure what that means concretely. What the paper-paper (Politiken) is concerned about is that the center is going to spy on Danish and foreign citizens. Given the modest annual budget and the usual burn rate in public administration, I think it will pose a rather weak threat to our privacy. Another question is what the primary mission of the center should be, and how that mission should be accomplished. In any event, 90 million DKK does not go a long way towards anything. That being said, I'm certainly curious about what the money IS spent on. If I find out, I'm not sure I'll post it on my personal blog, so don't hold your breath.

This was primarily a way to pass some time before I have my hair cut (in five minutes).

Finding the haystack

Until this summer, many people believed they knew the truth of the saying that it is hard to find the needle in the haystack. Only very few knew that in the intelligence world the proverb has been turned on its head. There, the saying goes: »to find the needle, we need the haystack«.

http://www.information.dk/471968

Writing a parser in Python

This is my base pattern for writing a parser in Python using the pyparsing library. It is slightly more complicated than a pyparsing hello world, but I think it is more useful as a small example of writing a parser for a real grammar.

A base class, PNode, provides utility functions to the classes implementing parse tree nodes, e.g. turning a parse tree back into the original string (except that all whitespace is replaced by a single space). It assumes that tokens in the input were separated by whitespace, and that all whitespace is equivalent.

For a particular grammar, I use Python classes to represent nodes in the parse tree; instances of these classes get created by passing the class to the setParseAction method on the corresponding BNF element. I like having these classes because they add a nice structure to the parse tree.

from pyparsing import *

class PNode(object):
    """Base class for parse tree nodes."""
    def __init__(self, tokens):
        super().__init__()
        self.tokens = tokens

    def __str__(self):
        # Reassemble this subtree from its tokens, separated by single spaces.
        return " ".join(str(x) for x in self.tokens)

    def __repr__(self):
        return self.__str__()

# Target classes

class Integer(PNode):
    def __init__(self, tokens):
        super().__init__(tokens)
        self.value = int(tokens[0])

class Comma(PNode):
    def __init__(self, tokens):
        super().__init__(tokens)

class IntegerList(PNode):
    def __init__(self, tokens):
        super().__init__(tokens)
        # Keep only the Integer nodes; the Comma nodes are just separators.
        self.integers = [t for t in tokens if isinstance(t, Integer)]

# BNF

comma = Literal(',').setParseAction(Comma)
integer = Word(nums).setParseAction(Integer)
integer_list = (integer + ZeroOrMore(comma + integer)).setParseAction(IntegerList)

bnf = integer_list + StringEnd()

# Try parser

parsed_list = bnf.parseString('1,2,3')[0]

print(parsed_list)
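The payoff of the node classes shows up when you use the parse tree afterwards; continuing the snippet above:

print(parsed_list)                              # 1 , 2 , 3  (the PNode round-trip, commas as separate tokens)
print([i.value for i in parsed_list.integers])  # [1, 2, 3]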

When to be most careful about catching the flu?

Continuing my blogification of Peter Norvig's excellent talk, the question is: when should you watch out for the flu, e.g. if you live in Denmark?

1) Go to www.google.com/trends/
2) Type in the word “influenza”
3) Select your geographical region (Denmark in my case)
4) Limit the data to before 2008, to avoid the graph being squished by the A(H1N1) outbreak (which made unusually many people search for the flu)

Turns out the answer is: watch out in October and February.
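If you'd rather script the lookup than click through the UI, something like the following sketch should reproduce it. The unofficial pytrends library is my assumption here; it is not part of the original recipe:

from pytrends.request import TrendReq  # pip install pytrends (unofficial Google Trends client)

pytrends = TrendReq()
pytrends.build_payload(["influenza"], geo="DK", timeframe="2004-01-01 2008-12-31")
df = pytrends.interest_over_time()

# Average search interest per calendar month; the peaks should land in October and February.
monthly = df["influenza"].groupby(df.index.month).mean()
print(monthly.sort_values(ascending=False))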

Geocoding Python function for PostgreSQL

Gratefully making use of what others have provided, i.e. geopy, Google and plpythonu.

A type to hold the result of geocoding:

CREATE TYPE geocoding AS (
  place text,
  latitude DOUBLE PRECISION,
  longitude DOUBLE PRECISION
);

A function that does the actual geocoding (to be extended with more vendors; hint: look at the geopy wiki). It takes an (arbitrary) input string to be geocoded:

CREATE OR REPLACE FUNCTION python_geocode
(
  input text,
  vendor text DEFAULT 'google'
) RETURNS SETOF geocoding AS
$$
  import time
  from geopy import geocoders
  # https://code.google.com/p/geopy/wiki/GettingStarted

  # Crude rate limiting, to stay friendly with the vendor's usage terms.
  time.sleep(0.2)
  # TODO: Add other available vendors, e.g. Yahoo.
  if vendor.lower() == 'google':
    geocoder = geocoders.GoogleV3()
  else:
    raise ValueError("Invalid geocoder: %s" % vendor)
  try:
    # geocode() may return None when there is no match; guard before iterating.
    for res in geocoder.geocode(input, exactly_one=False) or []:
      # res is a (place, (latitude, longitude)) pair.
      yield {'place': res[0], 'latitude': res[1][0], 'longitude': res[1][1]}
  except Exception:
    # Swallow geocoder errors so a bad address yields no rows instead of failing the query.
    pass
$$ LANGUAGE plpythonu VOLATILE;

Example:

SELECT place, ST_SetSRID(ST_MakePoint(longitude, latitude), 4326)
FROM python_geocode('Kostas');
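And a sketch of how this might be used to geocode a whole column at once, assuming a hypothetical table addresses(id, address):

-- LATERAL requires PostgreSQL 9.3+; addresses is a hypothetical table.
SELECT a.id, g.place,
       ST_SetSRID(ST_MakePoint(g.longitude, g.latitude), 4326) AS geom
FROM addresses a
CROSS JOIN LATERAL python_geocode(a.address) AS g;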