How to get structured Wikipedia data via DBPedia

Wikipedia contains a wealth of knowledge. While some of that knowledge consists of natural language descriptions, a rich share of information on Wikipedia is encoded in machine-readable format, such as “infoboxes” and other specially formatted parts. An infobox is rendered as a table that you typically see on the right-hand side of an article.

Infobox

While you could download the page source for a wikipedia article and extract the information yourself, there is a project called DBPedia that has done the hard work for you. That right, you can conveniently retrieve machine-readable data that stems from Wikipedia via the DBPedia API.

Example

Let us explore the DBPedia API by way of an example.

I like tennis data and most player pages on Wikipedia have an infobox that contains basic information about a player, such as age, hand, and current singles rank. Let’s try to retrieve information about the Italian tennis player, Matteo Donati, via the JSON resource exposed by DBPedia:

http://dbpedia.org/data/Matteo_Donati.json

In this example, we will fetch and process the JSON data with a small Python script.

# Python
import requests
 
data = requests.get('http://dbpedia.org/data/Matteo_Donati.json').json()
matteo = data['http://dbpedia.org/resource/Matteo_Donati']
 
# matteo is a dictionary with lots of keys
# that correspond to the player's properties.
# Each value is a list of dictionaries itself.
 
height = matteo['http://dbpedia.org/ontology/height'][0]['value']
# 1.88  (float)
birth_year = matteo['http://dbpedia.org/ontology/birthYear'][0]['value']
# '1995'  (string)
hand = matteo['http://dbpedia.org/ontology/plays'][0]['value']
# 'Right-handed (two-handed backhand)'  (string)
singles_rank = matteo['http://dbpedia.org/property/currentsinglesranking'][0]['value']
# 'No. 171'  (string)

The simple convention for URLs on DBPedia is that spaces in names are replaced by underscores, exactly like on Wikipedia. For example, if we wanted to look up Roger Federer, we would make a request to the resource:

http://dbpedia.org/data/Roger_Federer.json

Please note, that at the time of writing, DBPedia does not support https.

Redundancy and inconsistency

The data on Matteo Donati and other entities on DBPedia is both redundant and somewhat inconsistent. This can be seen if we enumerate the keys on Matteo Donati:

for key in sorted(matteo): print(key)
"""
http://dbpedia.org/ontology/Person/height
http://dbpedia.org/ontology/abstract
http://dbpedia.org/ontology/birthDate
http://dbpedia.org/ontology/birthPlace
http://dbpedia.org/ontology/birthYear
http://dbpedia.org/ontology/careerPrizeMoney
http://dbpedia.org/ontology/country
http://dbpedia.org/ontology/height
http://dbpedia.org/ontology/plays
http://dbpedia.org/ontology/residence
http://dbpedia.org/ontology/thumbnail
http://dbpedia.org/ontology/wikiPageID
http://dbpedia.org/ontology/wikiPageRevisionID
http://dbpedia.org/property/birthDate
http://dbpedia.org/property/birthPlace
http://dbpedia.org/property/caption
http://dbpedia.org/property/careerprizemoney
http://dbpedia.org/property/currentdoublesranking
http://dbpedia.org/property/currentsinglesranking
http://dbpedia.org/property/dateOfBirth
http://dbpedia.org/property/doublesrecord
http://dbpedia.org/property/doublestitles
http://dbpedia.org/property/highestdoublesranking
http://dbpedia.org/property/highestsinglesranking
http://dbpedia.org/property/name
http://dbpedia.org/property/placeOfBirth
http://dbpedia.org/property/plays
http://dbpedia.org/property/residence
http://dbpedia.org/property/shortDescription
http://dbpedia.org/property/singlesrecord
http://dbpedia.org/property/singlestitles
http://dbpedia.org/property/updated
http://dbpedia.org/property/usopenresult
http://dbpedia.org/property/wimbledonresult
http://purl.org/dc/elements/1.1/description
http://purl.org/dc/terms/subject
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://www.w3.org/2000/01/rdf-schema#comment
http://www.w3.org/2000/01/rdf-schema#label
http://www.w3.org/2002/07/owl#sameAs
http://www.w3.org/ns/prov#wasDerivedFrom
http://xmlns.com/foaf/0.1/depiction
http://xmlns.com/foaf/0.1/givenName
http://xmlns.com/foaf/0.1/isPrimaryTopicOf
http://xmlns.com/foaf/0.1/name
http://xmlns.com/foaf/0.1/surname
"""

You’ll notice that, e.g., the height of Matteo Donati is stored under two different keys:

  • http://dbpedia.org/ontology/Person/height
  • http://dbpedia.org/ontology/height

Luckily, both keys list Donati’s height as 1.88 m, albeit as a string type and numeral type respectively. Other bits of information that is redundantly stored include his birth date, dominant hand (“plays”) and career prize money won so far.

With redundancy comes the possibility for inconsistency. In other words, there is no guarantee that redundant keys will keep identical values. For example, Matteo Donati is listed both as ‘Right-handed (two-handed backhand)’ and simply as ‘Right-handed’. While in this case the inconsistency is merely a matter of information detail, it can get a little confusing in general.

Conclusion

DBPedia is a great way to access structured data from Wikipedia articles. While the information is machine-readable in a popular format, you will have to guard against missing keys, redundant keys and inconsistent values. I hope you enjoyed this quick introduction to DBPedia and that you will find good use for the information.

How to Become a Web Scraping Pro with Python pt. 1

Scrapy is an excellent Python library for web scraping. For example, you could create an API with data that is populated via web scraping. This article covers some basic scrapy features, such as the shell and selectors.

Install scrapy in virtual environment on your machine:

$ virtualenv venv
$ source venv/bin/activate
$ pip install scrapy

To learn about scrapy, the shell is a good place to start, because it offers an interactive environment where you can try selectors on a concrete web page. Here is how to start the scrapy shell:

$ scrapy shell http://doc.scrapy.org/en/latest/topics/selectors.html

Selectors

Now, try out different selections.

You can select elements on a page with CSS and XPath; these selectors can be stringed together. For example, use css to select a tags and xpath to select the href attribute of those tags:

>>> for link in response.css('a').xpath('@href').extract():
>>>   print link

Documentation

Now you are ready to head over to the documentation to read more about how to become great a using scrapy. Another tip is to follow the scrapinghub blog.