How to get structured Wikipedia data via DBPedia

Wikipedia contains a wealth of knowledge. While some of that knowledge consists of natural language descriptions, a large share of the information on Wikipedia is encoded in a machine-readable format, such as “infoboxes” and other specially formatted parts. An infobox is rendered as a table that you typically see on the right-hand side of an article.

Infobox

While you could download the page source for a Wikipedia article and extract the information yourself, there is a project called DBPedia that has done the hard work for you. That’s right, you can conveniently retrieve machine-readable data that stems from Wikipedia via the DBPedia API.

Example

Let us explore the DBPedia API by way of an example.

I like tennis data and most player pages on Wikipedia have an infobox that contains basic information about a player, such as age, hand, and current singles rank. Let’s try to retrieve information about the Italian tennis player, Matteo Donati, via the JSON resource exposed by DBPedia:

http://dbpedia.org/data/Matteo_Donati.json

In this example, we will fetch and process the JSON data with a small Python script.

# Python
import requests
 
data = requests.get('http://dbpedia.org/data/Matteo_Donati.json').json()
matteo = data['http://dbpedia.org/resource/Matteo_Donati']
 
# matteo is a dictionary with lots of keys
# that correspond to the player's properties.
# Each value is a list of dictionaries itself.
 
height = matteo['http://dbpedia.org/ontology/height'][0]['value']
# 1.88  (float)
birth_year = matteo['http://dbpedia.org/ontology/birthYear'][0]['value']
# '1995'  (string)
hand = matteo['http://dbpedia.org/ontology/plays'][0]['value']
# 'Right-handed (two-handed backhand)'  (string)
singles_rank = matteo['http://dbpedia.org/property/currentsinglesranking'][0]['value']
# 'No. 171'  (string)
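
The lookups above raise a KeyError (or IndexError) as soon as a player lacks one of the properties. Here is a minimal sketch of a more forgiving accessor. The helper name first_value and the sample dictionary are my own illustration, not part of DBPedia; the sample only mimics the shape of the JSON described above.

```python
def first_value(entity, key, default=None):
    """Return the 'value' of the first entry stored under key, or default."""
    entries = entity.get(key) or []
    if not entries:
        return default
    return entries[0].get('value', default)


# Hypothetical sample shaped like the DBPedia JSON for one resource.
sample = {
    'http://dbpedia.org/ontology/birthYear': [
        {'type': 'literal', 'value': '1995'},
    ],
}

birth_year = first_value(sample, 'http://dbpedia.org/ontology/birthYear')
death_year = first_value(sample, 'http://dbpedia.org/ontology/deathYear', 'unknown')
print(birth_year, death_year)
```

Returning a default instead of raising keeps a batch job over many players running even when individual infoboxes are incomplete.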

The simple convention for URLs on DBPedia is that spaces in names are replaced by underscores, exactly like on Wikipedia. For example, if we wanted to look up Roger Federer, we would make a request to the resource:

http://dbpedia.org/data/Roger_Federer.json

Please note that, at the time of writing, DBPedia does not support HTTPS.
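
The naming convention can be wrapped in a small helper. This is only a sketch: the function name dbpedia_json_url is made up for this post, and percent-encoding the remaining special characters with urllib.parse.quote is my assumption about what the server accepts, not something DBPedia documents here.

```python
from urllib.parse import quote


def dbpedia_json_url(name):
    """Build the JSON resource URL for a Wikipedia-style article name."""
    # Spaces become underscores, exactly like on Wikipedia.
    resource = name.strip().replace(' ', '_')
    # Assumption: any other special characters are percent-encoded.
    return 'http://dbpedia.org/data/' + quote(resource) + '.json'


print(dbpedia_json_url('Roger Federer'))
print(dbpedia_json_url('Matteo Donati'))
```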

Redundancy and inconsistency

The data on Matteo Donati and other entities on DBPedia is both redundant and somewhat inconsistent. This can be seen if we enumerate the keys on Matteo Donati:

for key in sorted(matteo): print(key)
"""
http://dbpedia.org/ontology/Person/height
http://dbpedia.org/ontology/abstract
http://dbpedia.org/ontology/birthDate
http://dbpedia.org/ontology/birthPlace
http://dbpedia.org/ontology/birthYear
http://dbpedia.org/ontology/careerPrizeMoney
http://dbpedia.org/ontology/country
http://dbpedia.org/ontology/height
http://dbpedia.org/ontology/plays
http://dbpedia.org/ontology/residence
http://dbpedia.org/ontology/thumbnail
http://dbpedia.org/ontology/wikiPageID
http://dbpedia.org/ontology/wikiPageRevisionID
http://dbpedia.org/property/birthDate
http://dbpedia.org/property/birthPlace
http://dbpedia.org/property/caption
http://dbpedia.org/property/careerprizemoney
http://dbpedia.org/property/currentdoublesranking
http://dbpedia.org/property/currentsinglesranking
http://dbpedia.org/property/dateOfBirth
http://dbpedia.org/property/doublesrecord
http://dbpedia.org/property/doublestitles
http://dbpedia.org/property/highestdoublesranking
http://dbpedia.org/property/highestsinglesranking
http://dbpedia.org/property/name
http://dbpedia.org/property/placeOfBirth
http://dbpedia.org/property/plays
http://dbpedia.org/property/residence
http://dbpedia.org/property/shortDescription
http://dbpedia.org/property/singlesrecord
http://dbpedia.org/property/singlestitles
http://dbpedia.org/property/updated
http://dbpedia.org/property/usopenresult
http://dbpedia.org/property/wimbledonresult
http://purl.org/dc/elements/1.1/description
http://purl.org/dc/terms/subject
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://www.w3.org/2000/01/rdf-schema#comment
http://www.w3.org/2000/01/rdf-schema#label
http://www.w3.org/2002/07/owl#sameAs
http://www.w3.org/ns/prov#wasDerivedFrom
http://xmlns.com/foaf/0.1/depiction
http://xmlns.com/foaf/0.1/givenName
http://xmlns.com/foaf/0.1/isPrimaryTopicOf
http://xmlns.com/foaf/0.1/name
http://xmlns.com/foaf/0.1/surname
"""

You’ll notice that, e.g., the height of Matteo Donati is stored under two different keys:

  • http://dbpedia.org/ontology/Person/height
  • http://dbpedia.org/ontology/height

Luckily, both keys list Donati’s height as 1.88 m, albeit as a string and a number respectively. Other bits of information that are stored redundantly include his birth date, dominant hand (“plays”) and career prize money won so far.

With redundancy comes the possibility for inconsistency. In other words, there is no guarantee that redundant keys will keep identical values. For example, Matteo Donati is listed both as ‘Right-handed (two-handed backhand)’ and simply as ‘Right-handed’. While in this case the inconsistency is merely a matter of information detail, it can get a little confusing in general.
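
One way to spot such disagreements is to group the keys by the last segment of their URI and compare the values. The sketch below does that; the function names and the sample data are hypothetical, with the sample modelled on the values quoted in this post. Note that it only catches keys that share a final segment (ignoring case), so a pair like birthDate and dateOfBirth would slip through.

```python
from collections import defaultdict


def local_name(key):
    """Return the last path or fragment segment of a property URI, lower-cased."""
    return key.replace('#', '/').rstrip('/').rsplit('/', 1)[-1].lower()


def find_conflicts(entity):
    """Return groups of same-named keys whose values disagree."""
    grouped = defaultdict(dict)
    for key, entries in entity.items():
        # Compare values as strings so '1.88' and 1.88 count as equal.
        grouped[local_name(key)][key] = [str(e.get('value')) for e in entries]

    conflicts = {}
    for name, by_key in grouped.items():
        distinct = {tuple(values) for values in by_key.values()}
        if len(by_key) > 1 and len(distinct) > 1:
            conflicts[name] = by_key
    return conflicts


# Hypothetical sample in the shape of the DBPedia JSON.
sample = {
    'http://dbpedia.org/ontology/Person/height': [{'value': '1.88'}],
    'http://dbpedia.org/ontology/height': [{'value': 1.88}],
    'http://dbpedia.org/ontology/plays': [
        {'value': 'Right-handed (two-handed backhand)'},
    ],
    'http://dbpedia.org/property/plays': [{'value': 'Right-handed'}],
}

for name, by_key in find_conflicts(sample).items():
    print(name, by_key)
```

Comparing as strings is a deliberate simplification: it treats the two height entries as consistent, but it would still flag values that differ only in formatting.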

Conclusion

DBPedia is a great way to access structured data from Wikipedia articles. While the information is machine-readable in a popular format, you will have to guard against missing keys, redundant keys and inconsistent values. I hope you enjoyed this quick introduction to DBPedia and that you will find good use for the information.

Linked Data: First Blood

Knowing a lot about something makes me more prone to appreciating its value. Unfortunately, I know very little about Linked Data. For this reason, I’ve had a very biased and shamefully low opinion of the concept of Linked Data. I’ve decided to change this.

A repository of Linked Data that I’ve recently taken an interest in is DBPedia. DBPedia is a project about extracting structured data (Linked Data) from Wikipedia and exposing it via a SPARQL endpoint. With the interest in DBPedia come the first sparks (pun intended) of interest in RDF endpoints and in particular SPARQL.

The brilliant thing about DBPedia (and SPARQL) is that it makes it possible to query a vast repository of information, originally in raw text form, using a proper query language. It’s Wikipedia with a nerd boner on.

So what can you do with SPARQL and DBPedia? There are several examples on the DBPedia homepage.

Here is one (slightly modified): Find all people born in Copenhagen before 1900 (the link points to a page that executes the query):

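# The DBPedia endpoint predefines these prefixes; they are declared
# here so that the query is self-contained.
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX : <http://dbpedia.org/resource/>

SELECT ?name ?birth ?death ?person WHERE {
     ?person dbo:birthPlace :Copenhagen .
     ?person dbo:birthDate ?birth .
     ?person foaf:name ?name .
     # Requires a death date, so people without one are not returned.
     ?person dbo:deathDate ?death .
     FILTER (?birth < "1900-01-01"^^xsd:date) .
}
ORDER BY ?name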

Looking at the names that are returned, I believe that those are names of people born in Copenhagen before 1900. A test probe looking up one of the people on the list confirms it. According to Wikipedia, Agnes Charlotte Dagmar Adler was a pianist born in Copenhagen in 1865.
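
The same query can also be sent from a script. The following is a rough sketch using only the Python standard library. It assumes the public endpoint lives at http://dbpedia.org/sparql and honours the standard Accept header for SPARQL JSON results; the function names and the sample payload are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = 'http://dbpedia.org/sparql'  # assumed endpoint URL

QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX : <http://dbpedia.org/resource/>

SELECT ?name ?birth ?death ?person WHERE {
    ?person dbo:birthPlace :Copenhagen .
    ?person dbo:birthDate ?birth .
    ?person foaf:name ?name .
    ?person dbo:deathDate ?death .
    FILTER (?birth < "1900-01-01"^^xsd:date) .
}
ORDER BY ?name
"""


def run_query(query, endpoint=ENDPOINT, timeout=30):
    """Send a SPARQL query via HTTP GET and return the decoded JSON results."""
    url = endpoint + '?' + urlencode({'query': query})
    request = Request(url, headers={'Accept': 'application/sparql-results+json'})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def rows(payload):
    """Flatten SPARQL JSON results into a list of plain dictionaries."""
    return [
        {var: cell['value'] for var, cell in binding.items()}
        for binding in payload['results']['bindings']
    ]


# Made-up payload in the SPARQL JSON results format, to show the shape.
sample_payload = {
    'head': {'vars': ['name', 'person']},
    'results': {'bindings': [
        {'name': {'type': 'literal', 'value': 'Example Person'},
         'person': {'type': 'uri', 'value': 'http://dbpedia.org/resource/Example_Person'}},
    ]},
}
print(rows(sample_payload))

# Against the live endpoint (needs network access):
# people = rows(run_query(QUERY))
```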

OK, the hello world of Linked Data has been committed to this blog. This will NOT be the last thing I write about Linked Data… I’ve seen the light.

This blog post is dedicated to Anders Friis-Christensen, who tried (without luck) to get me interested in Linked Data two years ago. I might be a bit slow, but I eventually get it 🙂

Finding the haystack

Until this summer, many believed they knew the truth: that it is hard to find the needle in the haystack. Only very few knew that the intelligence world has turned the proverb on its head. There, the saying goes that “to find the needle, we need the haystack”.

http://www.information.dk/471968

Free map tiles

Map Tile Sources

Here is a list of free sources for map tiles. I’ll expand this list as I find more free tiles. Notice that some websites require an attribution. I’ll perhaps update this post with the attribution line for each.

Stamen “Toner”

World-wide, clean B/W theme
