Apache Zeppelin (incubator) rocks!

At Spark Summit Europe 2015, several presenters made use of Apache Zeppelin, which is a notebook (a la IPython) for Spark.

I immediately wanted to try it out myself, and I highly recommend downloading and trying it if you like Spark. One note: download Zeppelin from GitHub rather than from the Apache homepage. The GitHub version is significantly more up to date (as of today). You do not need to preinstall Spark (though you can if you want), because Zeppelin comes with a standalone installation of Spark.
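At the time of writing, getting Zeppelin up from the GitHub sources might look roughly like this. This is a sketch, not official instructions: the Maven flags and repository name can change between versions, so check the project README before building.

```shell
# clone the latest sources (more up to date than the release downloads)
git clone https://github.com/apache/incubator-zeppelin.git
cd incubator-zeppelin

# build; the bundled standalone Spark is included by default
# (exact profiles/flags vary by version -- see the README)
mvn clean package -DskipTests

# start the Zeppelin daemon, then point a browser at http://localhost:8080
./bin/zeppelin-daemon.sh start
```

Stopping it again is `./bin/zeppelin-daemon.sh stop`.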

How long is the Doom Loop cycle currently?

Take a look at this Chomsky presentation, at around 46:30. It seems that the most rational prediction is that we are heading for another financial crisis, since financial systems are running a so-called “Doom Loop”: make huge gambles, and either make huge gains or fail. In the case of failure, get bailed out. This pattern of behaviour is rational from the point of view of the financial sector, given the current environment. So the good question is: what would the rational course of action be for us, the citizens, given that the financial sector is apparently acting, fully rationally, inside a Doom Loop?

The rational question is: when is the next financial crisis coming? Given a good prediction of that point in time, how should we act rationally, e.g. in the real estate market? If we aspire to make rational decisions, we should not hope that another financial crisis will be avoided. We should expect it, and make rational decisions based upon it. For our own gain, if we so desire. Now, how do you do that? That is another question. It seems obvious that decisions in many areas should be influenced by this apparent fact, e.g. decisions about real estate, entrepreneurship, and family planning. If there is money to be made, somehow, in betting on the next financial crisis, maybe that would be the rational thing to do.

The purpose of language by Chomsky

In the following Google video, Noam Chomsky raises and answers an interesting question: what amazing insights into language has linguistics revealed that the public does not know about?

He answers that human natural language was probably developed to support the human thinking process, not to serve as a means of communication. He believes that language might have evolved long before it was first used for communication. He goes as far as saying that the design of human natural language makes it unfit for communication.

I find his language-is-for-thinking point very interesting. I’m currently finishing a PhD, and it would explain the difficulties I sometimes have when trying to convert language for thinking into language for communicating my thoughts. There is even a PhD comic about it.

As so often with Chomsky, the talk weaves back and forth between political and linguistic topics. Interestingly enough, he does not shy away from mentioning and criticizing Google’s part in state oppression through cooperation with the NSA. That might seem like a breach of some sort of social etiquette; however, he was strongly encouraged to “speak truth to power” by the person introducing him. Be careful what you ask for.

What Goes Around Comes Around

Today I read the What Goes Around Comes Around chapter from the “Red Book” by Michael Stonebraker and Joseph M. Hellerstein. The chapter (or paper if you will) is a summary of 35 years of data model proposals, grouped into 9 different eras. This post is a kind of cheat sheet to the lessons learned in the chapter.

The paper surveyed three decades of data model thinking. It is clear that we have come “full circle”. We started off with a complex data model (Hierarchical/Network model), which was followed by a great debate between a complex model and a much simpler one (Relational model). The simpler one was shown to be advantageous in terms of understandability and its ability to support data independence.

Then a substantial collection of additions was proposed, none of which gained substantial market traction, largely because they failed to offer enough leverage in exchange for the increased complexity. The only ideas that did gain traction were user-defined functions and user-defined access methods (both from the Object-Relational model), and these were performance constructs, not data model constructs. The current proposal is now a superset of the union of all previous proposals; that is, we have come full circle.

Hierarchical Data Model (IMS)

Late 1960’s and 1970’s

  • Lesson 1: Physical and logical data independence are highly desirable
  • Lesson 2: Tree structured data models are very restrictive
  • Lesson 3: It is a challenge to provide sophisticated logical reorganizations of tree structured data
  • Lesson 4: A record-at-a-time user interface forces the programmer to do manual query optimization, and this is often hard. (Key-Value stores anyone?)

Network Data Model (CODASYL)

1970’s
  • Lesson 5: Networks are more flexible than hierarchies but more complex
  • Lesson 6: Loading and recovering networks is more complex than hierarchies

Relational Data Model

1970’s and early 1980’s

  • Lesson 7: Set-a-time languages are good, regardless of the data model, since they offer much improved physical data independence
  • Lesson 8: Logical data independence is easier with a simple data model than with a
    complex one
  • Lesson 9: Technical debates are usually settled by the elephants of the marketplace, and often for reasons that have little to do with the technology (Key-Value stores anyone?)
  • Lesson 10: Query optimizers can beat all but the best record-at-a-time DBMS application programmers (Key-Value stores anyone?)

Entity-Relationship Data Model

1970’s
  • Lesson 11: Functional dependencies are too difficult for mere mortals to understand

Extended Relational Data Model

1980’s
  • Lesson 12: Unless there is a big performance or functionality advantage, new constructs will go nowhere

Semantic Data Model

Late 1970’s and 1980’s. Innovation: classes, multiple inheritance.

No lessons learned, but the model failed for the same reasons as the Extended Relational Data Model.

Object-oriented

Late 1980’s and early 1990’s

Beginning in the mid 1980’s there was a “tidal wave” of interest in Object-oriented DBMSs (OODB). Basically, this community pointed to an “impedance mismatch” between relational data bases and languages like C++.

Impedance mismatch: in practice, relational data bases had their own naming systems, their own data type systems, and their own conventions for returning data as a result of a query. Whatever programming language was used alongside a relational data base also had its own version of all of these facilities. Hence, binding an application to the data base required a conversion from “programming language speak” to “data base speak” and back. This was like “gluing an apple onto a pancake”, and was the reason for the so-called impedance mismatch.

  • Lesson 13: Packages will not sell to users unless they are in “major pain”
  • Lesson 14: Persistent languages will go nowhere without the support of the programming language community


Object-Relational

Late 1980’s and early 1990’s

The Object-Relational (OR) era was motivated by the need to index and query geographical data (using e.g. an R-tree access method), since two-dimensional search is not supported by existing B-tree access methods.

As a result, the OR proposal added:

  • user-defined data types
  • user-defined operators
  • user-defined functions
  • user-defined access methods

  • Lesson 14: The major benefits of OR are two-fold: putting code in the data base (thereby blurring the distinction between code and data) and user-defined access methods
  • Lesson 15: Widespread adoption of new technology requires standards and/or an elephant pushing hard

Semi-structured (XML)

Late 1990’s to the present

There are two basic points that this class of work exemplifies: (1) schema last and (2) complex network-oriented data model.

  • Lesson 16: Schema-last is probably a niche market
  • Lesson 17: XQuery is pretty much OR SQL with a different syntax
  • Lesson 18: XML will not solve the semantic heterogeneity problem, either inside or outside the enterprise

Who gains from blocked content on YouTube?

When I want to hear a particular rap song from 1992 on YouTube, the video service shows me this:

Yeah yeah, copyright or whatever, but what is the point? Who gains what exactly? By the way, if you can hear the song in the country you’re in, then fuck you 🙂

Who’s involved? Let’s see: me (the user/customer), Google (the owner of YouTube), the EU (maker of regional copyright laws), Sony (the copyright holder), and CMW (the artist).

Does Google, the owner of YouTube, win?

No. Google loses straight away, because I can hear the song on GrooveShark just fine (albeit without the video):

Does the EU win?

No. The EU might gain a little bit, because CMW is an American band, so chances are that I’ll listen to an EU artist like Dizzie Rascal instead:

But that’s not going to happen, because I wanted to listen to CMW, and I’ve already found the song on another service, GrooveShark.

Does Sony Music Entertainment win?

No. I already bought the song on iTunes a couple of days ago. If I hadn’t bought it, I would have downloaded it with a torrent. Whether the YouTube video was there or not did not factor into my decision to buy the song. I bought the song because it was insanely easy to do on my iPhone. Period. In fact, I might choose not to buy a song in the future if it’s owned by SME.

Does the artist win?

Hardly; in fact, they lose. I’m sure they appreciate that I bought the song, though I’m sure Sony Music Entertainment appreciates it a hell of a lot more, if I know anything about royalty splits! And the song being blocked on YouTube did not make me buy it, as I’ve already said. I was about to make CMW more famous by linking their video on my blog, but couldn’t. Sorry, CMW.

Do concerned mothers win?

Does the fictional organization “concerned mothers against gangster rap” gain anything from a blocked gangster rap tune on the internet? Sure, but that is mere coincidence; it could just as well have been a song about flowers or teddy bears, or praise for “concerned mothers against gangster rap”.

By the way, you may check out the song CMW sampled on “N 2 Deep”. It’s by Lyn Collins, and features the distinct sound of the JB’s. Apparently the copyright holder (Polydor) is not insane:


I find that this blog post and video on innovation from Edinburgh by Ed Parsons is somehow related to this issue.

By the way, Ed: if you see the pingback, sorry that I stole your look for WordPress. I kinda liked it, and I do listen to gangster rap occasionally, so my morals are questionable.

Opening and closing ports on EC2 instances

Assuming the EC2 tools have been installed as described in a previous post, opening and closing ports is done with the ec2-authorize and ec2-revoke commands, respectively. These commands work on security groups rather than on individual instances. Recall that a set of instances belongs to a security group.

Opening port 80 on EC2 instances in the ‘default’ security group:

ec2-authorize default -p 80

Closing port 80 on EC2 instances in the ‘default’ security group:

ec2-revoke default -p 80
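With no further flags, this opens the TCP port to the world. The tools also accept a protocol and a source CIDR, which is usually what you want for ports like SSH. A sketch (the CIDR below is just an example network):

```shell
# allow SSH (TCP port 22) only from one specific network
ec2-authorize default -P tcp -p 22 -s 203.0.113.0/24

# inspect the current permissions of the 'default' group
ec2-describe-group default
```

ec2-describe-group is a handy way to double-check what a group allows before and after a change.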

See also the Amazon command reference for the EC2 API.

Hints for managing Amazon Linux on EC2

I’m using Mac OS X and running instances in the EU West Region. My instances are of the Amazon Linux AMI.

Installing the EC2 command line tools

Having the command-line tools installed is a supplement to the AWS Management Console found online. I found a good tutorial on how to get started with the tools for EC2 on Mac OS X.

After downloading the tools from the Amazon download site, the tutorial describes how to set environment variables, how to create X.509 certificates, etc.

The only detail missing was that I’m running my instances in the EU West region. I found a hint in another tutorial on setting an additional environment variable. My resulting .profile file looks like this:

# Setup Amazon EC2 Command-Line Tools
export EC2_HOME=~/.ec2
export PATH=$PATH:$EC2_HOME/bin
export EC2_PRIVATE_KEY=`ls $EC2_HOME/pk-*.pem`
export EC2_CERT=`ls $EC2_HOME/cert-*.pem`
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home/
# This line is from second tutorial, for use with EU West Region:
export EC2_URL=https://eu-west-1.ec2.amazonaws.com
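With the variables in place, one way to sanity-check the setup might look like this (assuming `$EC2_HOME/bin` made it onto the PATH as above):

```shell
# reload the profile in the current shell
source ~/.profile

# prints the tools version; fails loudly if EC2_HOME or JAVA_HOME is off
ec2-version

# lists the regions and their endpoints; with EC2_URL set as above,
# subsequent commands will target eu-west-1
ec2-describe-regions
```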

The first tutorial shows many examples of using the command-line tools to start instances, open ports, etc.

Package manager for Amazon Linux AMI

Maybe the EC2 tools can be used to install packages on an Amazon Linux AMI instance, but you can also use a package manager.

The Amazon Linux AMI comes with the yum package manager installed. A tutorial specifically aimed at installing PHP on an Amazon Linux AMI instance also gives a quick tour of yum. Basically you do this:

$ sudo yum install <PACKAGE_NAME>
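Beyond install, a few standard yum subcommands cover most day-to-day use (httpd here is just an example package name):

```shell
# search the repositories for a package
yum search httpd

# show metadata for a package before installing it
yum info httpd

# list everything that is already installed
yum list installed

# apply available updates to all installed packages
sudo yum update
```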

Installing Apache Web Server

As an example of using the EC2 tools and the yum package manager together, let’s install the Apache Web Server. The command ec2-describe-instances lists the running instances in the region given by the environment variable EC2_URL.

$ ec2-describe-instances
RESERVATION	r-xxxxxxxx	xxxxxxxxxxxxx	default
INSTANCE	i-xxxxxxxx	ami-xxxxxxx	ec2-xx-xxx-xx-xx.eu-west-1.compute.amazonaws.com

default is the name of the security group for the instance. You may have used a different security group name. Security groups make it easier to apply a set of permissions to a range of instances. The command ec2-authorize applies a permission to a security group, such as opening up port 80 for httpd.

# open up port 80 on instances belonging to security group 'default'
$ ec2-authorize default -p 80
PERMISSION  default  ALLOWS  tcp  80 80  FROM  CIDR

Log into the instance with ssh, then use the package manager to install httpd:

# use the key pair that you used when launching your instance
$ ssh -i ~/.ec2/ec2-keypair ec2-user@ec2-xx-xxx-xx-xx.eu-west-1.compute.amazonaws.com
# install httpd - starts an install process
$ sudo yum install httpd
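Installing httpd does not start it. On the Amazon Linux AMI (SysV init at the time of writing), starting the server and enabling it at boot might look like this, run on the instance:

```shell
# start the web server now
sudo service httpd start

# make httpd start automatically on reboot
sudo chkconfig httpd on

# quick local check that the server answers on port 80
curl -I http://localhost/
```

With port 80 opened via ec2-authorize as above, the default Apache test page should also be reachable from the outside on the instance’s public hostname.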