Category: Data

  • Yummy 3D plots

    Very nice interactive 3D plots with Plotly.

    import plotly.graph_objects as go
    import numpy as np
    import pandas as pd
    
    # Read data from a csv
    Z = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/api_docs/mt_bruno_elevation.csv').values
    
    # Actually not necessary to provide X and Y, but if you do,
    # x must match the columns of Z and y the rows
    X = np.linspace(0, 1000, Z.shape[1])
    Y = np.linspace(0, 1000, Z.shape[0])
    
    fig = go.Figure(data=[go.Surface(x=X, y=Y, z=Z)])
    
    fig.update_layout(title='Mt Bruno Elevation', autosize=False,
                      width=500, height=500,
                      margin=dict(l=65, r=50, b=65, t=90))
    
    fig.show()
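
    By default, fig.show() opens the plot in a browser or notebook. If you want to keep the interactive plot around as a file, Plotly can also write it out as standalone HTML (the file name below is just an example):

    # Save the interactive figure as a standalone HTML file
    fig.write_html('mt_bruno.html')
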
  • How to scrape images from the web

    I’m interested in object detection and other computer vision tasks. For example, I’m working on a teddy-bear detector with my son.

    So, how do you quickly download images for a certain category? You can use this approach that I learned from a course on Udemy.

    # pip install icrawler
    from icrawler.builtin import GoogleImageCrawler
    
    keywords = ['cat', 'dog']
    for keyword in keywords:
        google_crawler = GoogleImageCrawler(
            parser_threads=2,
            downloader_threads=4,
            storage={'root_dir': 'images/{}'.format(keyword)}
        )
        google_crawler.crawl(
            keyword=keyword, max_num=10, min_size=(200, 200))

    In the above example, the crawler will find images in two categories, cats and dogs, as if you had searched for ‘cat’ and ‘dog’ on Google Images and downloaded what you found.

    Let’s walk through the parameters used in the code. First, there is the constructor, which is called with three arguments in the example. The most important argument is storage, which specifies where the images will be stored. Second, there is the call to the crawl function. Here, the max_num parameter specifies that at most 10 images per category should be downloaded, and the min_size argument requires each image to be at least 200 x 200 pixels.
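
    The same pattern works with the other crawlers that ship with icrawler. As a rough sketch, here is the Bing variant (the keyword and folder below are just examples):

    from icrawler.builtin import BingImageCrawler

    bing_crawler = BingImageCrawler(
        downloader_threads=4,
        storage={'root_dir': 'images/teddy_bear'}
    )
    bing_crawler.crawl(keyword='teddy bear', max_num=10, min_size=(200, 200))
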

    That’s it. Happy downloading.

  • How to get structured Wikipedia data via DBPedia

    Wikipedia contains a wealth of knowledge. While some of that knowledge consists of natural language descriptions, a rich share of information on Wikipedia is encoded in machine-readable format, such as “infoboxes” and other specially formatted parts. An infobox is rendered as a table that you typically see on the right-hand side of an article.

    (Figure: an example infobox)

    While you could download the page source for a Wikipedia article and extract the information yourself, there is a project called DBPedia that has done the hard work for you. That’s right: you can conveniently retrieve machine-readable data that stems from Wikipedia via the DBPedia API.

    Example

    Let us explore the DBPedia API by way of an example.

    I like tennis data and most player pages on Wikipedia have an infobox that contains basic information about a player, such as age, hand, and current singles rank. Let’s try to retrieve information about the Italian tennis player, Matteo Donati, via the JSON resource exposed by DBPedia:

    http://dbpedia.org/data/Matteo_Donati.json

    In this example, we will fetch and process the JSON data with a small Python script.

    # Python
    import requests
    
    data = requests.get('http://dbpedia.org/data/Matteo_Donati.json').json()
    matteo = data['http://dbpedia.org/resource/Matteo_Donati']
    
    # matteo is a dictionary with lots of keys
    # that correspond to the player's properties.
    # Each value is a list of dictionaries itself.
    
    height = matteo['http://dbpedia.org/ontology/height'][0]['value']
    # 1.88  (float)
    birth_year = matteo['http://dbpedia.org/ontology/birthYear'][0]['value']
    # '1995'  (string)
    hand = matteo['http://dbpedia.org/ontology/plays'][0]['value']
    # 'Right-handed (two-handed backhand)'  (string)
    singles_rank = matteo['http://dbpedia.org/property/currentsinglesranking'][0]['value']
    # 'No. 171'  (string)
    

    The simple convention for URLs on DBPedia is that spaces in names are replaced by underscores, exactly like on Wikipedia. For example, if we wanted to look up Roger Federer, we would make a request to the resource:

    http://dbpedia.org/data/Roger_Federer.json

    Please note that, at the time of writing, DBPedia does not support HTTPS.
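
    To make the convention concrete, here is a small helper (the function name is my own invention) that turns a display name into a DBPedia data URL and fetches the JSON:

    import requests

    def fetch_dbpedia(name):
        # Spaces become underscores, exactly like in Wikipedia URLs
        url = 'http://dbpedia.org/data/{}.json'.format(name.replace(' ', '_'))
        return requests.get(url).json()

    data = fetch_dbpedia('Roger Federer')
    federer = data['http://dbpedia.org/resource/Roger_Federer']
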

    Redundancy and inconsistency

    The data on Matteo Donati and other entities on DBPedia is both redundant and somewhat inconsistent. This can be seen if we enumerate the keys on Matteo Donati:

    for key in sorted(matteo): print(key)
    """
    http://dbpedia.org/ontology/Person/height
    http://dbpedia.org/ontology/abstract
    http://dbpedia.org/ontology/birthDate
    http://dbpedia.org/ontology/birthPlace
    http://dbpedia.org/ontology/birthYear
    http://dbpedia.org/ontology/careerPrizeMoney
    http://dbpedia.org/ontology/country
    http://dbpedia.org/ontology/height
    http://dbpedia.org/ontology/plays
    http://dbpedia.org/ontology/residence
    http://dbpedia.org/ontology/thumbnail
    http://dbpedia.org/ontology/wikiPageID
    http://dbpedia.org/ontology/wikiPageRevisionID
    http://dbpedia.org/property/birthDate
    http://dbpedia.org/property/birthPlace
    http://dbpedia.org/property/caption
    http://dbpedia.org/property/careerprizemoney
    http://dbpedia.org/property/currentdoublesranking
    http://dbpedia.org/property/currentsinglesranking
    http://dbpedia.org/property/dateOfBirth
    http://dbpedia.org/property/doublesrecord
    http://dbpedia.org/property/doublestitles
    http://dbpedia.org/property/highestdoublesranking
    http://dbpedia.org/property/highestsinglesranking
    http://dbpedia.org/property/name
    http://dbpedia.org/property/placeOfBirth
    http://dbpedia.org/property/plays
    http://dbpedia.org/property/residence
    http://dbpedia.org/property/shortDescription
    http://dbpedia.org/property/singlesrecord
    http://dbpedia.org/property/singlestitles
    http://dbpedia.org/property/updated
    http://dbpedia.org/property/usopenresult
    http://dbpedia.org/property/wimbledonresult
    http://purl.org/dc/elements/1.1/description
    http://purl.org/dc/terms/subject
    http://www.w3.org/1999/02/22-rdf-syntax-ns#type
    http://www.w3.org/2000/01/rdf-schema#comment
    http://www.w3.org/2000/01/rdf-schema#label
    http://www.w3.org/2002/07/owl#sameAs
    http://www.w3.org/ns/prov#wasDerivedFrom
    http://xmlns.com/foaf/0.1/depiction
    http://xmlns.com/foaf/0.1/givenName
    http://xmlns.com/foaf/0.1/isPrimaryTopicOf
    http://xmlns.com/foaf/0.1/name
    http://xmlns.com/foaf/0.1/surname
    """
    

    You’ll notice that, e.g., the height of Matteo Donati is stored under two different keys:

    • http://dbpedia.org/ontology/Person/height
    • http://dbpedia.org/ontology/height

    Luckily, both keys list Donati’s height as 1.88 m, albeit as a string and as a number, respectively. Other bits of information that are stored redundantly include his birth date, dominant hand (“plays”) and career prize money won so far.

    With redundancy comes the possibility for inconsistency. In other words, there is no guarantee that redundant keys will keep identical values. For example, Matteo Donati is listed both as ‘Right-handed (two-handed backhand)’ and simply as ‘Right-handed’. While in this case the inconsistency is merely a matter of information detail, it can get a little confusing in general.
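
    One way to cope is to read properties defensively. Here is a minimal sketch (the helper is my own, not part of any DBPedia client) that tries a list of candidate keys and tolerates missing ones:

    def get_property(entity, *keys, default=None):
        # Try each candidate key in order; return the first value found
        for key in keys:
            values = entity.get(key)
            if values:
                return values[0]['value']
        return default

    height = get_property(matteo,
                          'http://dbpedia.org/ontology/height',
                          'http://dbpedia.org/ontology/Person/height')
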

    Conclusion

    DBPedia is a great way to access structured data from Wikipedia articles. While the information is machine-readable in a popular format, you will have to guard against missing keys, redundant keys and inconsistent values. I hope you enjoyed this quick introduction to DBPedia and that you will find good use for the information.

  • Linked Data: First Blood

    Knowing a lot about something makes me more prone to appreciating its value. Unfortunately, I know very little about Linked Data. For this reason, I’ve had a very biased and shamefully low opinion of the concept. I’ve decided to change this.

    A repository of linked data that I’ve recently taken an interest in is DBPedia. DBPedia is a project that extracts structured data (linked data) from Wikipedia and exposes it via a SPARQL endpoint. With the interest in DBPedia come the first sparks (pun intended) of interest in RDF endpoints and in particular SPARQL.

    The brilliant thing about DBPedia (and SPARQL) is that it makes it possible to query a vast repository of information, originally in raw text form, using a proper query language. It’s Wikipedia with a nerd boner on.

    So what can you do with SPARQL and DBPedia? There are several examples on the DBPedia homepage.

    Here is one, slightly modified: find all people born in Copenhagen before 1900 (the link points to a page that executes the query):

    PREFIX dbo: <http://dbpedia.org/ontology/>
    
    SELECT ?name ?birth ?death ?person WHERE {
         ?person dbo:birthPlace :Copenhagen .
         ?person dbo:birthDate ?birth .
         ?person foaf:name ?name .
         ?person dbo:deathDate ?death .
         FILTER (?birth < "1900-01-01"^^xsd:date) .
    }
    ORDER BY ?name
    

    Looking at the names that are returned, I believe that those are names of people born in Copenhagen before 1900. A test probe looking up one of the people on the list confirms it. According to Wikipedia, Agnes Charlotte Dagmar Adler was a pianist born in Copenhagen in 1865.
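
    If you would rather run the query from code than from the web form, something along these lines should work against the public endpoint (assuming it still lives at http://dbpedia.org/sparql and returns SPARQL JSON results):

    import requests

    query = '''
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?name ?birth ?death ?person WHERE {
         ?person dbo:birthPlace :Copenhagen .
         ?person dbo:birthDate ?birth .
         ?person foaf:name ?name .
         ?person dbo:deathDate ?death .
         FILTER (?birth < "1900-01-01"^^xsd:date) .
    }
    ORDER BY ?name
    '''

    response = requests.get('http://dbpedia.org/sparql',
                            params={'query': query,
                                    'format': 'application/sparql-results+json'})
    for row in response.json()['results']['bindings']:
        print(row['name']['value'], row['birth']['value'])
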

    Ok, the hello world of linked data has been committed to this blog. This will NOT be the last thing I write about Linked Data... I've seen the light.

    This blog post is dedicated to Anders Friis-Christensen, who tried (without luck) to get me interested in Linked Data two years ago. I might be a bit slow, but I eventually get it :-)

  • Finding the haystack

    Until this summer, many believed they knew the truth that it is hard to find the needle in the haystack. Only a few knew that the proverb has been turned on its head in the intelligence world. There, the saying goes: “to find the needle, we need the haystack”.

    http://www.information.dk/471968

  • Free map tiles

    Map Tile Sources

    Here is a list of free sources for map tiles. I'll expand the list as I find more. Note that some sources require attribution; I may update this post with the attribution line for each.

    Stamen "Toner"

    World-wide, clean B/W theme


  • Danish Government Basic Data program 2012-2016

    The Agency for Digitisation in Denmark has just published its agenda for the publication of so-called "Basic Data". The following page is in English:

    Good Basic Data for Everyone – a Driver for Growth and Efficiency