The purpose of language by Chomsky

In the following Google video, Noam Chomsky raises and answers an interesting question: what amazing insights into language has linguistics revealed that the public does not know about?

He answers that human natural language probably developed to support the human thinking process, not to serve as a means of communication. He believes that language might have evolved long before it was first used for communication. He goes so far as to say that the design of human natural language makes it unfit for communication.

I find his language-is-for-thinking point very interesting. I’m currently finishing a PhD, and it would explain the difficulties I sometimes have when trying to convert from language for thinking into language for communicating my thoughts. There is even a PhD comic about it.

As so often with Chomsky, the talk weaves in and out between political and linguistic topics. Interestingly enough, he does not shy away from mentioning and criticizing Google’s part in state oppression through cooperation with the NSA. That might seem like a breach of some sort of social etiquette; however, he was strongly encouraged to “speak truth to power” by the person introducing him. Be careful what you ask for.

Recursive relationship between humans, computers and human societies

This post is influenced by a talk I had with Marcos Vaz Salles and a debate that took place between Foucault and Chomsky in 1971.

The relationship between humans and societies is recursive. Human beings influence societies, and societies in turn influence human beings. Humans then influence the very societies that they themselves have been influenced by. Total entanglement. A composite and recursive organism.

Recently, we have added a new recursive layer to the already recursive organism of humans plus society, namely the computer. When computers were first created, the relationship between humans and computers seemed non-recursive. Naïvely, in the good old days, humans coded computers, not the other way around. That may no longer be true, and perhaps it never was. Increasingly, computer algorithms are influencing the structure of human societies, e.g. through algorithmically controlled social networks. By transitivity, the influence that computers have on societies is propagated to humans. Furthermore, computers have recently gained the ability to code human beings directly. Computer algorithms are now used to synthesize new gene sequences for human beings, some of whom are actually born. These human beings can in turn code computers, and again we come full circle. At this point in history we are a three-way recursive organism: humans plus computers plus societies.

In a debate between Foucault and Chomsky, Foucault raises the question of whether we can discover and encode the system of regularity and constraints that makes science possible, outside the human mind. This question was preceded by the consensus that the human creative process can achieve complex results exactly because it is limited and governed by finite rules. Furthermore, it was agreed that humans, because we are limited, can only formulate certain theories. Do societies have the ability to construct classes of theories that human individuals cannot, and what happens when we add the computer to the recursive definition? If so, can these otherwise unreachable theories be codified in a way that humans can understand? Can humans instruct computers to use theories that we do not have the ability to discover, or even understand, ourselves?

1971 debate between Noam Chomsky and Michel Foucault

Chomsky has written and said many things, particularly on the topics of linguistics and politics. In an attempt to get an overview of it all, I searched for the term “overview of chomsky’s work” and found a post on ZNet called A Brief Review of the Work of Professor Noam Chomsky. Just what I wanted. One sentence mentions a television debate between Chomsky and Foucault from 1971, and luckily that video was available on YouTube. I decided to watch it, because it might give a more focused and deeper glimpse of some of Chomsky’s work, to balance the more general overview I initially wanted.

Word-count exercise with Spark on Amazon EMR

This is a mini-workshop that shows you how to work with Spark on Amazon Elastic MapReduce (EMR); it’s a kind of hello world of Spark on EMR. We will solve a simple problem: using Spark and Amazon EMR to count the words in a text file stored in S3.

To follow along you will need the following:

Create some test data in S3

We will count the words in the U.S. Constitution; more specifically, we will count the words in a text file that I found online. Step one is to upload this file to Amazon S3, so that the Spark cluster (created in the next section) can access it.

Download the file locally first:

wget http://www.usconstitution.net/const.txt

Create a bucket to hold the data on S3:

aws s3 mb s3://[your-bucket-name]

Finally, upload the file to S3:

aws s3 mv const.txt s3://[your-bucket-name]/us-constitution.txt

Create Spark cluster on AWS EMR

To create a Spark cluster on Amazon EMR, we need to pick an instance type for the machines. For this small toy example we will use three m3.xlarge instances. You can consult the Amazon EMR price list for an overview of all supported instance types on Amazon EMR.

Launch a Spark 0.8.1 cluster with three m3.xlarge instances on Amazon EMR:

elastic-mapreduce --create --alive --name "Spark/Shark Cluster"  \
--bootstrap-action s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh \
--bootstrap-name "Spark/Shark"  --instance-type m3.xlarge --instance-count 3

If everything worked, the command returns a job flow ID, e.g. a message saying something like “Created job flow j-1R2OWN88UD8ZC”.

It will take a few minutes before the cluster is in the “WAITING” state, which means that it is ready to accept queries. We can check that the cluster is in the “WAITING” state using the --list option to elastic-mapreduce:

elastic-mapreduce --list j-1R2OWN88UD8ZC
# replace j-1R2OWN88UD8ZC with the ID you got when launching the cluster

When the cluster has status “WAITING”, connect to the master node of the Spark cluster using SSH:

elastic-mapreduce --ssh j-1R2OWN88UD8ZC
# replace j-1R2OWN88UD8ZC with the ID you got when launching the cluster

You should now be connected to the master node of your Spark cluster…

Run query in Spark shell

To run the word-count query, we will enter the Spark shell installed on the master node. Since the text file is really unstructured, it is perfect for a map-reduce type query. Once in the shell, we will express the word-count query in the Scala programming language.

Enter spark shell:

SPARK_MEM="2g" /home/hadoop/spark/spark-shell

(In Spark shell) load U.S. constitution text file:

val file = sc.textFile("s3://[your-bucket-name]/us-constitution.txt")

(In Spark shell) count words in file, replacing dots and commas with space:

// remove linebreaks before pasting...
val counts = file
  .flatMap(line => line
    .toLowerCase()
    .replace(".", " ")
    .replace(",", " ")
    .split(" "))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)

(In Spark shell) Inspect the ten most frequent words (using unary minus to invert the sort order, i.e. descending):

val sorted_counts = counts.collect().sortBy(wc => -wc._2)
sorted_counts.take(10).foreach(println)
// prints lines containing (word, count) pairs

Save the sorted counts in S3:

sc.parallelize(sorted_counts).saveAsTextFile("s3://[your-bucket-name]/wordcount-us-constitution")

(Back on local machine) remember to terminate cluster when done:

elastic-mapreduce --terminate j-1R2OWN88UD8ZC
# replace j-1R2OWN88UD8ZC with the ID you got when launching the cluster

If you’ve forgotten the Cluster ID, you can get a list of active clusters using the --list command:

elastic-mapreduce --list --active

Caveats

When first drafting this example, I was tempted to use a cheaper instance type, m1.small. While Amazon EMR officially supports this instance type (tagged as “General Purpose – Previous Generation”), the word-count example didn’t work for me on it. When I switched to the more recent and “beefier” instance type, m3.xlarge, everything worked fine.

I also tried to bootstrap the instances with the latest version of Spark (1.0.0 at the time of writing). This failed to even launch on the m1.small instance. Note that the install script for 1.0.0 is a Ruby script (s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb) instead of the 0.8.1 shell script (s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh). It would be worth trying the example above with Spark 1.0.0 on a current instance type, e.g. m3.xlarge.

For more examples, check the Spark examples section, which includes the wordcount example that I’ve adapted a bit.

Do What You Want

The message of this movie is “do what you want”. The video is a meditative portrait of a man who was once a successful medical doctor and who, now 60+, lives in a small studio in Los Angeles (I think). Every day, he goes out to skate the beaches of Southern California with a big smile on his face. He does not seem crazy, although I don’t know for sure. He tells us that he had an opportunity and took it. He decided to stop being an asshole and start being spiritual.

If, like me, you are a parent of a two-year-old and a four-year-old, I hope you will interleave the “do what you want” part with an adequate (or larger) dose of “take care of your children”. Another message of the video is that it is never too late to realize yourself, even when you are 60+ years old. Until then, happy grinding!

Slomo from The New York Times – Video on Vimeo.

What Goes Around Comes Around

Today I read the What Goes Around Comes Around chapter from the “Red Book” by Michael Stonebraker and Joseph M. Hellerstein. The chapter (or paper if you will) is a summary of 35 years of data model proposals, grouped into 9 different eras. This post is a kind of cheat sheet to the lessons learned in the chapter.

The paper surveyed three decades of data model thinking. It is clear that we have come “full circle”. We started off with a complex data model (Hierarchical/Network model), which was followed by a great debate between a complex model and a much simpler one (Relational model). The simpler one was shown to be advantageous in terms of understandability and its ability to support data independence.

Then, a substantial collection of additions were proposed, none of which gained substantial market traction, largely because they failed to offer substantial leverage in exchange for the increased complexity. The only ideas that got market traction were user-defined functions (Object-Relational model) and user-defined access methods (Object-Relational model), and these were performance constructs not data model constructs. The current proposal is now a superset of the union of all previous proposals. I.e. we have navigated a full circle.

Hierarchical Data Model (IMS)

Late 1960’s and 1970’s

  • Lesson 1: Physical and logical data independence are highly desirable
  • Lesson 2: Tree structured data models are very restrictive
  • Lesson 3: It is a challenge to provide sophisticated logical reorganizations of tree structured data
  • Lesson 4: A record-at-a-time user interface forces the programmer to do manual query optimization, and this is often hard. (Key-Value stores anyone?)

Network Data Model (CODASYL)

1970’s

  • Lesson 5: Networks are more flexible than hierarchies but more complex
  • Lesson 6: Loading and recovering networks is more complex than hierarchies

Relational Data Model

1970’s and early 1980’s

  • Lesson 7: Set-at-a-time languages are good, regardless of the data model, since they offer much improved physical data independence
  • Lesson 8: Logical data independence is easier with a simple data model than with a complex one
  • Lesson 9: Technical debates are usually settled by the elephants of the marketplace, and often for reasons that have little to do with the technology (Key-Value stores anyone?)
  • Lesson 10: Query optimizers can beat all but the best record-at-a-time DBMS application programmers (Key-Value stores anyone?)
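
To make the record-at-a-time versus set-at-a-time distinction concrete, here is a minimal sketch of my own (not from the paper), using Python and the built-in sqlite3 module with a made-up employees/departments example. In the first half the programmer navigates records one at a time and effectively acts as the query optimizer; in the second, a single declarative query leaves the plan to the DBMS:

import sqlite3

# toy data: departments and employees (made-up example)
depts = [(1, 'Engineering'), (2, 'Sales')]
emps = [('Ada', 1), ('Bob', 2), ('Eve', 1)]

# record-at-a-time: the programmer walks the records one at a time and
# decides the "join strategy" (which collection is outer, how to match) by hand
for emp_name, emp_dept in emps:
    for dept_id, dept_name in depts:
        if emp_dept == dept_id:
            print emp_name, 'works in', dept_name

# set-at-a-time: declare the desired result; the DBMS picks the plan
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE dept (id INTEGER, name TEXT)')
db.execute('CREATE TABLE emp (name TEXT, dept_id INTEGER)')
db.executemany('INSERT INTO dept VALUES (?, ?)', depts)
db.executemany('INSERT INTO emp VALUES (?, ?)', emps)
for emp_name, dept_name in db.execute(
        'SELECT e.name, d.name FROM emp e JOIN dept d ON e.dept_id = d.id'):
    print emp_name, 'works in', dept_name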

Entity-Relationship Data Model

1970’s

  • Lesson 11: Functional dependencies are too difficult for mere mortals to understand

Extended Relational Data Model

1980’s

  • Lesson 12: Unless there is a big performance or functionality advantage, new constructs will go nowhere

Semantic Data Model

Late 1970’s and 1980’s. Innovation: classes, multiple inheritance.

No lessons learned, but the model failed for the same reasons as the Extended Relational Data Model.

Object-oriented: late 1980’s and early 1990’s

Beginning in the mid 1980’s there was a “tidal wave” of interest in Object-oriented DBMSs (OODB). Basically, this community pointed to an “impedance mismatch” between relational data bases and languages like C++.

Impedance mismatch: In practice, relational data bases had their own naming systems, their own data type systems, and their own conventions for returning data as a result of a query. Whatever programming language was used alongside a relational data base also had its own version of all of these facilities. Hence, binding an application to the data base required a conversion from “programming language speak” to “data base speak” and back. This was like “gluing an apple onto a pancake”, and was the reason for the so-called impedance mismatch.
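
As a rough illustration of that conversion layer, here is a sketch of my own (not from the paper) of the back-and-forth between “programming language speak” and “data base speak”, using Python and the built-in sqlite3 module; the Employee class and the column names are invented for the example:

import sqlite3

# "programming language speak": an application-level class
class Employee(object):
    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE emp (name TEXT, salary REAL)')

# object -> row: flatten the object into the data base's type system
e = Employee('Ada', 100000.0)
db.execute('INSERT INTO emp VALUES (?, ?)', (e.name, e.salary))

# row -> object: re-wrap each returned tuple as an application object
employees = [Employee(name, salary)
             for name, salary in db.execute('SELECT name, salary FROM emp')]
print employees[0].name, employees[0].salary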

  • Lesson 13: Packages will not sell to users unless they are in “major pain”
  • Lesson 14: Persistent languages will go nowhere without the support of the programming language community

Object-relational

Late 1980’s and early 1990’s

The Object-Relational (OR) era was motivated by the need to index and query geographical data (using e.g. an R-tree access method), since two dimensional search is not supported by existing B-tree access methods.

As a result, the OR proposal added:

  • user-defined data types
  • user-defined operators
  • user-defined functions
  • user-defined access methods
  • Lesson 14: The major benefits of OR are two-fold: putting code in the data base (and thereby blurring the distinction between code and data) and user-defined access methods (see the sketch after this list)
  • Lesson 15: Widespread adoption of new technology requires either standards and/or an elephant pushing hard
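
Python’s built-in sqlite3 module happens to offer a tiny analogue of “putting code in the data base”: create_function registers an application function that SQL queries can then call. A minimal sketch of my own (not from the paper), with a made-up emp table:

import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE emp (name TEXT, salary REAL)')
db.execute("INSERT INTO emp VALUES ('Ada', 120000.0)")

# register application code as a function callable from SQL
def monthly(salary):
    return salary / 12.0

db.create_function('monthly', 1, monthly)

# the user-defined function now runs as part of the query
for name, m in db.execute('SELECT name, monthly(salary) FROM emp'):
    print name, 'earns', m, 'per month'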

Semi-structured (XML)

Late 1990’s to the present

There are two basic points that this class of work exemplifies: (1) schema last and (2) complex network-oriented data model.

  • Lesson 16: Schema-last is probably a niche market
  • Lesson 17: XQuery is pretty much OR SQL with a different syntax
  • Lesson 18: XML will not solve the semantic heterogeneity either inside or outside the enterprise

Get Weather using JSON web service and Python

Get the current weather for Copenhagen:

import urllib2
import json
 
# fetch the weather for Copenhagen
url = 'http://api.openweathermap.org/data/2.5/weather?q=Copenhagen,dk'
response = urllib2.urlopen(url)
 
# parse the JSON result
data = json.load(response)
print 'Weather in Copenhagen:', data['weather'][0]['description']
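
As a small extension of the script above, the same response should also let you print the temperature. This is a sketch under two assumptions: that the JSON carries a main.temp field in Kelvin (the OpenWeatherMap default, as far as I can tell), and that the endpoint still answers without an API key (newer versions of the API may require a free key passed as an appid parameter):

import urllib2
import json

url = 'http://api.openweathermap.org/data/2.5/weather?q=Copenhagen,dk'
data = json.load(urllib2.urlopen(url))

# main.temp is assumed to be in Kelvin (the API default); convert to Celsius
celsius = data['main']['temp'] - 273.15
print 'Weather in Copenhagen: %s, %.1f C' % (data['weather'][0]['description'], celsius)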

Linked Data: First Blood

Knowing a lot about something makes me more prone to appreciate its value. Unfortunately, I know very little about linked data. For this reason, I’ve had a very biased and shamefully low opinion of the concept of linked data. I’ve decided to change this.

A repository of linked data that I’ve recently taken an interest in is DBPedia. DBPedia is a project that extracts structured data (linked data) from Wikipedia and exposes it via a SPARQL endpoint. With the interest in DBPedia come the first sparks (pun intended) of interest in RDF endpoints and in particular SPARQL.

The brilliant thing about DBPedia (and SPARQL) is that it makes it possible to query a vast repository of information, originally in raw text form, using a proper query language. It’s Wikipedia with a nerd boner on.

So what can you do with SPARQL and DBPedia? There are several examples on the DBPedia homepage.

Here is a slightly modified one: find all people born in Copenhagen before 1900 (the link points to a page that executes the query):

PREFIX dbo: <http://dbpedia.org/ontology/>
 
SELECT ?name ?birth ?death ?person WHERE {
     ?person dbo:birthPlace :Copenhagen .
     ?person dbo:birthDate ?birth .
     ?person foaf:name ?name .
     ?person dbo:deathDate ?death .
     FILTER (?birth < "1900-01-01"^^xsd:date) .
}
ORDER BY ?name

Looking at the names that are returned, I believe that those are names of people born in Copenhagen before 1900. A test probe looking up one of the people on the list confirms it. According to Wikipedia, Agnes Charlotte Dagmar Adler was a pianist born in Copenhagen in 1865.
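
If you would rather run the query from code than from the web form, something like the following sketch should work. It assumes that the public endpoint at http://dbpedia.org/sparql accepts query and format parameters and returns standard SPARQL JSON results (which is how Virtuoso-backed endpoints usually behave); the prefixes are spelled out so the query does not depend on the endpoint’s defaults:

import urllib
import urllib2
import json

query = '''
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?name ?birth WHERE {
     ?person dbo:birthPlace :Copenhagen .
     ?person dbo:birthDate ?birth .
     ?person foaf:name ?name .
     FILTER (?birth < "1900-01-01"^^xsd:date) .
}
ORDER BY ?name
LIMIT 10
'''

params = urllib.urlencode({'query': query, 'format': 'application/sparql-results+json'})
response = urllib2.urlopen('http://dbpedia.org/sparql?' + params)
results = json.load(response)

# print one (name, birth date) pair per result binding
for row in results['results']['bindings']:
    print row['name']['value'], row['birth']['value']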

Ok, the hello world of linked data has been committed to this blog. This will NOT be the last thing I write about Linked Data… I’ve seen the light.

This blog post is dedicated to Anders Friis-Christensen, who tried (without luck) to get me interested in Linked Data two years ago. I might be a bit slow, but I eventually get it 🙂