Wikipedia contains a wealth of knowledge. While some of that knowledge consists of natural language descriptions, a large share of the information on Wikipedia is encoded in a machine-readable format, such as “infoboxes” and other specially formatted parts. An infobox is rendered as a table that you typically see on the right-hand side of an article.
While you could download the page source for a Wikipedia article and extract the information yourself, a project called DBPedia has already done the hard work for you. That's right: you can conveniently retrieve machine-readable data derived from Wikipedia via the DBPedia API.
Let us explore the DBPedia API by way of an example.
I like tennis data, and most player pages on Wikipedia have an infobox with basic information about the player, such as age, playing hand, and current singles rank. Let’s try to retrieve information about the Italian tennis player Matteo Donati via the JSON resource exposed by DBPedia:
In this example, we will fetch and process the JSON data with a small Python script.
import requests

# Fetch the JSON document for the resource and parse it.
data = requests.get('http://dbpedia.org/data/Matteo_Donati.json').json()
matteo = data['http://dbpedia.org/resource/Matteo_Donati']
# matteo is a dictionary with lots of keys
# that correspond to the player's properties.
# Each value is itself a list of dictionaries,
# so we read the 'value' field of the first entry.
height = matteo['http://dbpedia.org/ontology/height'][0]['value']
# 1.88 (float)
birth_year = matteo['http://dbpedia.org/ontology/birthYear'][0]['value']
# '1995' (string)
hand = matteo['http://dbpedia.org/ontology/plays'][0]['value']
# 'Right-handed (two-handed backhand)' (string)
singles_rank = matteo['http://dbpedia.org/property/currentsinglesranking'][0]['value']
# 'No. 171' (string)
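Since each value is a list of dictionaries, and not every player will have every property, it can help to wrap the lookup in a small helper. The following is only a minimal sketch: `first_value` is a hypothetical helper name, and the sample dictionary is hand-written to mimic the shape described above rather than taken from a real response.

```python
def first_value(entity, key, default=None):
    """Return the 'value' of the first entry stored under key, or default."""
    entries = entity.get(key) or []
    if not entries:
        return default
    return entries[0].get("value", default)


# Hand-written sample that only mimics the structure of a DBPedia entity.
sample = {
    "http://dbpedia.org/ontology/height": [{"type": "literal", "value": 1.88}],
}

height = first_value(sample, "http://dbpedia.org/ontology/height")
# The ranking key is absent from the sample, so the default is returned.
rank = first_value(
    sample, "http://dbpedia.org/property/currentsinglesranking", "unknown"
)
```

With a helper like this, a missing property yields a default value instead of raising a `KeyError`.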
The simple convention for URLs on DBPedia is that spaces in names are replaced by underscores, exactly like on Wikipedia. For example, if we wanted to look up Roger Federer, we would make a request to the resource:
Please note that, at the time of writing, DBPedia does not support HTTPS.
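The naming convention can be sketched as a pair of small helpers. The function names are hypothetical, and the URL patterns are simply the ones used in the Matteo Donati example; names containing special characters may need extra handling that this sketch does not attempt.

```python
DATA_BASE = "http://dbpedia.org/data/"
RESOURCE_BASE = "http://dbpedia.org/resource/"


def to_slug(name):
    """Replace spaces with underscores, following the Wikipedia convention."""
    return name.strip().replace(" ", "_")


def json_url(name):
    """URL of the JSON document for the named entity (pattern assumed from the example)."""
    return DATA_BASE + to_slug(name) + ".json"


def resource_key(name):
    """Key under which the entity appears inside the JSON document."""
    return RESOURCE_BASE + to_slug(name)


federer_url = json_url("Roger Federer")
federer_key = resource_key("Roger Federer")
```

The first value is what you would pass to `requests.get`, and the second is the key to look up in the parsed response.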
Redundancy and inconsistency
The data on Matteo Donati and other entities on DBPedia is both redundant and somewhat inconsistent. This can be seen if we enumerate the keys on Matteo Donati:
for key in sorted(matteo):
    print(key)
You’ll notice that, e.g., the height of Matteo Donati is stored under two different keys:
Luckily, both keys list Donati’s height as 1.88 m, albeit as a string and as a number, respectively. Other pieces of information that are stored redundantly include his birth date, dominant hand (“plays”), and career prize money won so far.
With redundancy comes the possibility of inconsistency. In other words, there is no guarantee that redundant keys will hold identical values. For example, Matteo Donati is listed both as ‘Right-handed (two-handed backhand)’ and simply as ‘Right-handed’. While in this case the inconsistency is merely a matter of how much detail is given, it can get a little confusing in general.
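One rough way to spot such redundancy is to group the keys by the last segment of their URL and keep the names that show up more than once. This is a heuristic sketch, not a complete solution: `redundant_keys` is a hypothetical helper, and the sample keys below are hand-written assumptions, in particular the `property/height` key.

```python
from collections import defaultdict


def redundant_keys(entity):
    """Group keys by their last URL segment; keep names that occur more than once."""
    groups = defaultdict(list)
    for key in entity:
        # Take the text after the final slash and compare case-insensitively.
        local_name = key.rstrip("/").rsplit("/", 1)[-1].lower()
        groups[local_name].append(key)
    return {name: sorted(keys) for name, keys in groups.items() if len(keys) > 1}


# Hand-written sample; the keys are assumptions used for illustration only.
sample = {
    "http://dbpedia.org/ontology/height": [{"value": 1.88}],
    "http://dbpedia.org/property/height": [{"value": "1.88"}],
    "http://dbpedia.org/ontology/birthYear": [{"value": "1995"}],
}

duplicates = redundant_keys(sample)
```

Keys that share a name in this sense are only candidates for redundancy, so it is still worth checking by hand that they describe the same property.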
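To check whether a set of redundant keys agrees, you can collect the distinct values stored under them. The sketch below compares values as stripped strings, which is a simplifying assumption; `distinct_values` is a hypothetical helper, and the `property/plays` key in the sample is assumed for illustration.

```python
def distinct_values(entity, keys):
    """Collect the distinct values stored under the given keys, as stripped strings."""
    seen = []
    for key in keys:
        for entry in entity.get(key, []):
            text = str(entry.get("value")).strip()
            if text not in seen:
                seen.append(text)
    return seen


# Hand-written sample mimicking the two "plays" values mentioned above.
sample = {
    "http://dbpedia.org/ontology/plays": [
        {"value": "Right-handed (two-handed backhand)"}
    ],
    "http://dbpedia.org/property/plays": [{"value": "Right-handed"}],
}

values = distinct_values(sample, list(sample))
is_consistent = len(values) <= 1
```

When more than one distinct value comes back, you have to decide which one to trust, for example by preferring the more detailed string.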
DBPedia is a great way to access structured data from Wikipedia articles. While the information is machine-readable in a popular format, you will have to guard against missing keys, redundant keys and inconsistent values. I hope you enjoyed this quick introduction to DBPedia and that you will find good use for the information.