Category: Web Scraping

  • How to get structured Wikipedia data via DBPedia

    Wikipedia contains a wealth of knowledge. While some of that knowledge consists of natural language descriptions, a rich share of information on Wikipedia is encoded in machine-readable format, such as “infoboxes” and other specially formatted parts. An infobox is rendered as a table that you typically see on the right-hand side of an article.

    [Figure: an example of a Wikipedia infobox]

    While you could download the page source for a Wikipedia article and extract the information yourself, a project called DBPedia has already done the hard work for you. That’s right: you can conveniently retrieve machine-readable data that stems from Wikipedia via the DBPedia API.

    Example

    Let us explore the DBPedia API by way of an example.

    I like tennis data, and most player pages on Wikipedia have an infobox that contains basic information about a player, such as age, dominant hand, and current singles rank. Let’s try to retrieve information about the Italian tennis player Matteo Donati via the JSON resource exposed by DBPedia:

    http://dbpedia.org/data/Matteo_Donati.json

    In this example, we will fetch and process the JSON data with a small Python script.

    # Python
    import requests
    
    data = requests.get('http://dbpedia.org/data/Matteo_Donati.json').json()
    matteo = data['http://dbpedia.org/resource/Matteo_Donati']
    
    # matteo is a dictionary with lots of keys
    # that correspond to the player's properties.
    # Each value is a list of dictionaries itself.
    
    height = matteo['http://dbpedia.org/ontology/height'][0]['value']
    # 1.88  (float)
    birth_year = matteo['http://dbpedia.org/ontology/birthYear'][0]['value']
    # '1995'  (string)
    hand = matteo['http://dbpedia.org/ontology/plays'][0]['value']
    # 'Right-handed (two-handed backhand)'  (string)
    singles_rank = matteo['http://dbpedia.org/property/currentsinglesranking'][0]['value']
    # 'No. 171'  (string)
    

    The simple convention for URLs on DBPedia is that spaces in names are replaced by underscores, exactly like on Wikipedia. For example, if we wanted to look up Roger Federer, we would make a request to the resource:

    http://dbpedia.org/data/Roger_Federer.json
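
    Since the convention is so simple, URL construction is easy to automate. Here is a tiny sketch; the dbpedia_data_url helper is my own name, not part of any library:

    # Python
    def dbpedia_data_url(title):
        # DBPedia follows Wikipedia's convention: spaces become underscores.
        return 'http://dbpedia.org/data/' + title.replace(' ', '_') + '.json'

    print(dbpedia_data_url('Roger Federer'))
    # http://dbpedia.org/data/Roger_Federer.json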

    Please note that, at the time of writing, DBPedia does not support HTTPS.

    Redundancy and inconsistency

    The data on Matteo Donati and other entities on DBPedia is both redundant and somewhat inconsistent. This can be seen if we enumerate the keys on Matteo Donati:

    for key in sorted(matteo): print(key)
    """
    http://dbpedia.org/ontology/Person/height
    http://dbpedia.org/ontology/abstract
    http://dbpedia.org/ontology/birthDate
    http://dbpedia.org/ontology/birthPlace
    http://dbpedia.org/ontology/birthYear
    http://dbpedia.org/ontology/careerPrizeMoney
    http://dbpedia.org/ontology/country
    http://dbpedia.org/ontology/height
    http://dbpedia.org/ontology/plays
    http://dbpedia.org/ontology/residence
    http://dbpedia.org/ontology/thumbnail
    http://dbpedia.org/ontology/wikiPageID
    http://dbpedia.org/ontology/wikiPageRevisionID
    http://dbpedia.org/property/birthDate
    http://dbpedia.org/property/birthPlace
    http://dbpedia.org/property/caption
    http://dbpedia.org/property/careerprizemoney
    http://dbpedia.org/property/currentdoublesranking
    http://dbpedia.org/property/currentsinglesranking
    http://dbpedia.org/property/dateOfBirth
    http://dbpedia.org/property/doublesrecord
    http://dbpedia.org/property/doublestitles
    http://dbpedia.org/property/highestdoublesranking
    http://dbpedia.org/property/highestsinglesranking
    http://dbpedia.org/property/name
    http://dbpedia.org/property/placeOfBirth
    http://dbpedia.org/property/plays
    http://dbpedia.org/property/residence
    http://dbpedia.org/property/shortDescription
    http://dbpedia.org/property/singlesrecord
    http://dbpedia.org/property/singlestitles
    http://dbpedia.org/property/updated
    http://dbpedia.org/property/usopenresult
    http://dbpedia.org/property/wimbledonresult
    http://purl.org/dc/elements/1.1/description
    http://purl.org/dc/terms/subject
    http://www.w3.org/1999/02/22-rdf-syntax-ns#type
    http://www.w3.org/2000/01/rdf-schema#comment
    http://www.w3.org/2000/01/rdf-schema#label
    http://www.w3.org/2002/07/owl#sameAs
    http://www.w3.org/ns/prov#wasDerivedFrom
    http://xmlns.com/foaf/0.1/depiction
    http://xmlns.com/foaf/0.1/givenName
    http://xmlns.com/foaf/0.1/isPrimaryTopicOf
    http://xmlns.com/foaf/0.1/name
    http://xmlns.com/foaf/0.1/surname
    """
    

    You’ll notice that, for example, the height of Matteo Donati is stored under two different keys:

    • http://dbpedia.org/ontology/Person/height
    • http://dbpedia.org/ontology/height

    Luckily, both keys list Donati’s height as 1.88 m, albeit as a string and as a float, respectively. Other bits of information that are redundantly stored include his birth date, dominant hand (“plays”), and career prize money won so far.
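
    Continuing with the matteo dictionary from the script above, the two keys can be compared directly:

    matteo['http://dbpedia.org/ontology/Person/height'][0]['value']
    # '1.88'  (string)
    matteo['http://dbpedia.org/ontology/height'][0]['value']
    # 1.88  (float)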

    With redundancy comes the possibility of inconsistency. In other words, there is no guarantee that redundant keys will hold identical values. For example, Matteo Donati is listed both as ‘Right-handed (two-handed backhand)’ and simply as ‘Right-handed’. While in this case the inconsistency is merely a matter of level of detail, it can get confusing in general.
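
    A simple way to cope is to probe a list of candidate keys and take the first value found. Here is a minimal sketch; the first_value helper is my own, not part of any API:

    # Python
    def first_value(entity, keys, default=None):
        # Return the value of the first candidate key that is present
        # and non-empty, or the default if none of them is.
        for key in keys:
            values = entity.get(key)
            if values:
                return values[0]['value']
        return default

    height = first_value(matteo, [
        'http://dbpedia.org/ontology/height',
        'http://dbpedia.org/ontology/Person/height',
    ])
    # 1.88  (from the first key that is present)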

    Conclusion

    DBPedia is a great way to access structured data from Wikipedia articles. While the information is machine-readable in a popular format, you will have to guard against missing keys, redundant keys, and inconsistent values. I hope you enjoyed this quick introduction to DBPedia and that you will find good use for the information.

  • How to Become a Web Scraping Pro with Python pt. 1

    Scrapy is an excellent Python library for web scraping. You could, for example, build an API whose data is populated via web scraping. This article covers some basic Scrapy features, such as the shell and selectors.

    Install Scrapy in a virtual environment on your machine:

    $ virtualenv venv
    $ source venv/bin/activate
    $ pip install scrapy
    

    To learn about Scrapy, the shell is a good place to start, because it offers an interactive environment where you can try selectors on a concrete web page. Here is how to start the Scrapy shell:

    $ scrapy shell http://doc.scrapy.org/en/latest/topics/selectors.html
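
    Inside the shell, the fetched page is bound to a response object, so you can try selectors on it right away. For example, this returns a list containing the page’s title text (::text is Scrapy’s CSS extension for selecting text nodes):

    >>> response.css('title::text').extract()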
    

    Selectors

    Now, try out different selections.

    You can select elements on a page with CSS and XPath, and the two kinds of selectors can be chained together. For example, use CSS to select the a tags and XPath to select the href attribute of those tags:

    >>> for link in response.css('a').xpath('@href').extract():
    ...     print(link)
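
    The same selector logic carries over unchanged from the shell to a spider. As a rough sketch (the file name and spider name are my own), a minimal spider that yields every link on the page could look like this:

    # links_spider.py
    import scrapy

    class LinksSpider(scrapy.Spider):
        name = 'links'
        start_urls = ['http://doc.scrapy.org/en/latest/topics/selectors.html']

        def parse(self, response):
            # The same chained selection as in the shell session above.
            for href in response.css('a').xpath('@href').extract():
                yield {'href': href}

    You can run it without creating a full project:

    $ scrapy runspider links_spider.py -o links.json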
    

    Documentation

    Now you are ready to head over to the documentation to read more about how to become great at using Scrapy. Another tip is to follow the Scrapinghub blog.