Screen capturing with PhantomJS

PhantomJS is a headless browser that you can use, e.g., to test web UIs and to screen-capture webpages. I will focus on the latter use case.

Since PhantomJS knows how to execute Javascript, it can create a screen shot of most webpages, even those that render part of their GUI using Javascript.

Installing

To get started with PhantomJS, download and unzip a PhantomJS binary for your system. In the unzipped directory structure you’ll find bin/phantomjs, which is a ready-to-use binary. You can add that directory to your PATH if you like.
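
On a Mac, the steps could look roughly like this (the file name and version below are just examples; use whatever archive you downloaded for your system):

# assuming you downloaded phantomjs-1.9.7-macosx.zip; adjust the name/version to your system
unzip phantomjs-1.9.7-macosx.zip
export PATH=$PATH:$(pwd)/phantomjs-1.9.7-macosx/bin
phantomjs --version
# should print a version number, e.g. 1.9.7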

PhantomJS is controlled by Javascript. The script rasterize.js is a useful multi-purpose script for creating screen shots. We will use this script, so download and store it somewhere convenient.
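
To give an idea of what such a control script looks like, here is a minimal sketch of my own (this is not rasterize.js, just about the smallest script that loads a page and renders it to a file):

// capture.js -- minimal PhantomJS screen capture (a sketch, not rasterize.js)
var page = require('webpage').create();
page.open('http://skipperkongen.dk/', function (status) {
  if (status !== 'success') {
    console.log('Failed to load the page');
    phantom.exit(1);
  }
  // the file extension decides the output format (png, jpg, pdf, ...)
  page.render('screenshot.png');
  phantom.exit();
});

You would run it with phantomjs capture.js.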

Hello world

I have created a simple test page that partly produces the page content using Javascript. If Javascript is enabled, the page will read “Hello Javascript”. Otherwise, the page reads “Hello”. Let us now screen capture this page using PhantomJS:

# Copy paste everything into a terminal window and run it
# You need to specify the right paths to:
# - phantomjs (e.g. add phantom "bin" dir to PATH)
# - rasterize.js (e.g. run below command in dir containing script)
phantomjs rasterize.js http://skipperkongen.dk/files/hello_javascript.html hello_javascript.pdf

If that went well, you should now have a PDF file called hello_javascript.pdf in the directory where you ran the command. Open the PDF and confirm that it contains the text “Hello Javascript” just like the web page does.

Screen capturing a real blog post

Hopefully, the above experiment worked. However, the content in the generated PDF was not too interesting. Let’s repeat the above experiment with a real blog post, namely the first blog post I ever wrote on skipperkongen.dk:

# Copy paste everything into a terminal window and run it
# You need to specify the right paths to:
# - phantomjs (e.g. add phantom "bin" dir to PATH)
# - rasterize.js (e.g. run below command in dir containing script)
phantomjs rasterize.js \
http://skipperkongen.dk/2010/11/14/hard-to-less-hard/ skipperkongen.pdf

If you open the generated PDF you will see that it is not the prettiest sight. The PDF has only a passing resemblance to what the original blog post looks like if you open it in a “normal” browser. This is perhaps all according to specifications, but I (and I’m guessing you) would like a more aesthetically pleasing result.

Inspecting the generated PDF

Before we try to understand why the generated PDF looks the way it does, let us describe what we are seeing. So what does the PDF look like?

First, the generated PDF is missing the content header found on the web page. Second, the rendered PDF has an incredibly narrow page layout, or uses a very big font size. Third, on my Mac there is a weird “private use” symbol in several places in the PDF. Regarding the third issue, there is a fun discussion over at the StackExchange for Mac OS X about the “private use” symbol, with some interesting background information.

Why does the generated PDF look this way?

In order to understand why PhantomJS renders a page in a certain way, the natural place to look is the PhantomJS documentation.

There is honestly not a lot of content there, so let’s try to analyze the issues ourselves. Regarding the missing header, the HTML source code for the blog post specifies a “print” CSS style with the following definition:

<style type="text/css" media="print">#wpadminbar { display:none; }</style>

Regarding the missing content header, it seems that PhantomJS uses the “print” CSS style, if available, when generating a PDF.

Regarding the narrow layout, recall that we used rasterize.js as the control script for PhantomJS. The code in that script has a big impact on what we are seeing, including the layout. Inside the rasterize.js script we find the following line:

page.viewportSize = { width: 600, height: 600 };

That partly explains the narrow layout. If we change these settings to width: 1800 and height: 1000 in a copy of the file (rasterize2.js) and rerun the screen capture, we get a wider PDF canvas. However, this only partly fixes the actual content layout. A full solution will require more work, e.g. adjusting the page CSS.
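
For reference, the change in the copy (rasterize2.js) could look like this; the paperSize line is my own addition to illustrate another knob that affects PDF output (if I remember correctly, rasterize.js can also take a paper size or format as an optional command-line argument):

// rasterize2.js -- wider viewport
page.viewportSize = { width: 1800, height: 1000 };
// optionally also set the PDF paper size explicitly
page.paperSize = { format: 'A3', orientation: 'landscape', margin: '1cm' };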

In the next part of this post, I’ll dig more into the PhantomJS API.

Twitter HyperLogLog monoids in Spark

Want to count unique elements in a stream without blowing up memory? More specifically, do you want to use a HyperLogLog counter in Spark? Until today, I’d never heard the word “monoid” before. However, Twitter’s Algebird project contains a collection of monoids, including a HyperLogLog monoid, which can be used to aggregate a stream into an approximate count of its unique elements. The code looks like this:

import com.twitter.algebird._
// 12 bits per sketch; more bits => lower error, more memory
val aggregator = new HyperLogLogMonoid(12)
// assumes inputData is a pair RDD whose values are already HLL sketches
// (created with aggregator.create(bytes)); plus merges two sketches into one
inputData.reduceByKey(aggregator.plus(_, _))
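
The snippet above assumes the values in inputData already are HLL sketches. Starting from raw values, a slightly fuller sketch could look like this (the variable names are mine, and I haven’t battle-tested it):

// hedged sketch: (key, value) pairs -> approximate number of distinct values per key
val sketches = inputData.mapValues(v => aggregator.create(v.toString.getBytes("UTF-8")))
val merged = sketches.reduceByKey(aggregator.plus(_, _))
// approximateSize returns an Approximate[Long]; estimate is the point estimate
val distinctPerKey = merged.mapValues(_.approximateSize.estimate)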

This young man tells you all about it, and then some:

The video also mentions another Twitter project, the Storehaus project, which can be used to integrate Spark with a lot of NoSQL databases like DynamoDB. Looks very useful indeed.

And just to go completely crazy with the Twitter project references, the talk also brings up Summingbird. The Twitter team has a separate blog post about using Summingbird with Spark Streaming.

Easiest way to install a PostgreSQL/PostGIS database on Mac

Installing Postgres+PostGIS has never been easier on Mac. In fact, it is now an app! You download the app-file from postgresapp.com, place it in your Applications folder, and you’re done. Really.

If you think that was over too fast

If you think that was over too fast, there is one more thing you can do: add the Postgres.app “bin” directory to your PATH.

# open your shell profile
vi ~/.bash_profile

# add this line:
export PATH=$PATH:/Applications/Postgres.app/Contents/Versions/9.3/bin

The next time you open a terminal, you will be able to execute all of the following commands:

PostgreSQL:

clusterdb createdb createlang createuser dropdb droplang
dropuser ecpg initdb oid2name pg_archivecleanup 
pg_basebackup pg_config pg_controldata pg_ctl pg_dump 
pg_dumpall pg_receivexlog pg_resetxlog pg_restore 
pg_standby pg_test_fsync pg_test_timing pg_upgrade 
pgbench postgres postmaster psql reindexdb vacuumdb 
vacuumlo

PROJ.4:

cs2cs geod invgeod invproj nad2bin proj

GDAL:

gdal_contour gdal_grid gdal_rasterize gdal_translate 
gdaladdo gdalbuildvrt gdaldem gdalenhance gdalinfo 
gdallocationinfo gdalmanage gdalserver gdalsrsinfo 
gdaltindex gdaltransform gdalwarp nearblack ogr2ogr 
ogrinfo ogrtindex testepsg

PostGIS:

pgsql2shp raster2pgsql shp2pgsql
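
As a quick sanity check that both PostgreSQL and PostGIS actually work, you can create a spatially enabled database from the terminal (the database name is just an example):

createdb geotest
psql -d geotest -c "CREATE EXTENSION postgis;"
psql -d geotest -c "SELECT PostGIS_version();"
# should print the installed PostGIS version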

That is pretty f’ing awesome!!

Poor man’s wget

The command wget is useful, but unfortunately doesn’t come preinstalled on a Mac. Yes, you can install it of course, but if you’re doing it from source, the process has a few steps to satisfy all the dependencies: start by running ./configure and make on the wget source, then work your way backwards through the missing dependencies until ./configure runs without hiccups.

This is how to get a poor man’s wget, or simply realize that you can use curl -O, unless you’re getting content via https.

alias wget="curl -O"
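
One caveat with the alias: unlike wget, curl does not follow redirects by default. If that bites you, a variant with -L should do the trick:

alias wget="curl -O -L"
# usage example
wget http://www.usconstitution.net/const.txt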

The purpose of language by Chomsky

In the following Google video, Noam Chomsky raises and answers an interesting question: what amazing insights into language has linguistics revealed that the public does not know about?

He answers that human natural language was probably developed to support the human thinking process, not to serve as a means of communication. He believes that language might have evolved long before it was first used for communication. He goes as far as saying that the design of human natural language makes it unfit for communication.

I find his language-is-for-thinking point very interesting. I’m currently finishing a PhD, and it would explain the difficulties I sometimes have when trying to translate from language for thinking into language for communicating my thoughts. There is even a PhD comic about it.

As so often with Chomsky, the talk weaves in and out of political and linguistic topics. Interestingly enough, he does not shy away from mentioning and criticizing Google’s part in state oppression through cooperation with the NSA. That might seem like a breach of some sort of social etiquette; however, he was strongly encouraged to “speak truth to power” by the person introducing him. Be careful what you ask for.

Recursive relationship between humans, computers and human societies

This post is influenced by a talk I had with Marcos Vaz Salles and a debate that happened between Foucault and Chomsky in 1970.

The relationship between humans and societies is a recursive relationship. Human beings influence societies and societies in turn influence human beings. Next, humans are influencing the societies that they themselves have been influenced by. Total entanglement. A composite and recursive organism.

Recently, we have added a new recursive layer to the already recursive organism of humans plus society, namely the computer. When computers were first created, the relationship between humans and computers seemed non-recursive. Naïvely, in the good old days, humans coded computers, not the other way around. That may no longer be true, and perhaps it never was. Increasingly, computer algorithms are influencing the structure of human societies, e.g. through algorithmically controlled social networks. By transitivity, the influence that computers have on societies is propagated to humans. Furthermore, computers have recently gained the ability to code human beings directly. Computer algorithms are now used to synthesize new gene sequences for human beings, some of which are actually born. These human beings in turn can code computers, and again we come full circle. At this point in history we are a three-way recursive organism: humans plus computers plus societies.

In a debate between Foucault and Chomsky, Foucault raises the question whether we can discover and encode the system of regularity and constraints that makes science possible, outside the human mind. This question was preceded by the consensus that the human creative process can achieve complex results exactly because it is limited and governed by finite rules. Furthermore, it was agreed that humans, because we are limited, can only formulate certain theories. Do societies have the ability to construct classes of theories that human individuals can not, and what happens when we add the computer to the recursive definition? If so, can these otherwise unreachable theories be codified in a way so they can be understood by humans? Can humans instruct computers to use theories that we do not have the ability to discover or even understand ourselves?

1970 debate between Noam Chomsky and Michel Foucault

Chomsky has written and said many things, particularly on the topics of linguistics and politics. In an attempt to get an overview of it all, I searched for the term “overview of chomsky’s work” and found a post on ZNet called A Brief Review of the Work of Professor Noam Chomsky. Just what I wanted. One sentence mentions a television debate between Chomsky and Foucault from 1970, and luckily that video was available on YouTube. I decided to watch it, because it might give a more focused and deeper glimpse of some of Chomsky’s work, to balance the more general overview I initially wanted to get.

Word-count exercise with Spark on Amazon EMR

This is a mini-workshop that shows you how to work with Spark on Amazon Elastic MapReduce (EMR); it’s a kind of hello world of Spark on EMR. We will solve a simple problem, namely using Spark and Amazon EMR to count the words in a text file stored in S3.

To follow along you will need an AWS account, plus the aws and elastic-mapreduce command-line tools installed and configured.

Create some test data in S3

We will count the words in the U.S. constitution, more specifically in a text file version that I found online. Step one is to upload this file to Amazon S3, so that the Spark cluster (created in the next section) can access it.

Download the file locally first:

wget http://www.usconstitution.net/const.txt

Create a bucket to hold the data on S3:

aws s3 mb s3://[your-bucket-name]

Finally, upload the file to S3:

aws s3 mv const.txt s3://[your-bucket-name]/us-constitution.txt

Create Spark cluster on AWS EMR

To create a Spark cluster on Amazon EMR, we need to pick an instance type for the machines. For this small toy example we will use three m3.xlarge instances. You can consult the Amazon EMR price list for an overview of all supported instance types on Amazon EMR.

Launch a Spark 0.8.1 cluster with three m3.xlarge instances on Amazon EMR:

elastic-mapreduce --create --alive --name "Spark/Shark Cluster"  \
--bootstrap-action s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh \
--bootstrap-name "Spark/Shark"  --instance-type m3.xlarge --instance-count 3

If everything worked, the command returns a job flow ID, e.g. a message saying something like “Created job flow j-1R2OWN88UD8ZC”.

It will take a few minutes before the cluster is in the “WAITING” state, which means that it is ready to accept queries. We can check that the cluster is in the “WAITING” state using the --list option to elastic-mapreduce:

elastic-mapreduce --list j-1R2OWN88UD8ZC
# replace j-1R2OWN88UD8ZC with the ID you got when launching the cluster

When the cluster has status “WAITING”, connect to the master node of the Spark cluster using SSH:

elastic-mapreduce --ssh j-1R2OWN88UD8ZC
# replace j-1R2OWN88UD8ZC with the ID you got when launching the cluster

You should now be connected to the master node of your Spark cluster…

Run query in spark shell

To run the word-count query, we will enter the Spark shell installed on the master node. Since the text file is really unstructured, it is perfect for a map-reduce type query. Once in the shell, we will express the word-count query in the Scala programming language.

Enter spark shell:

SPARK_MEM="2g" /home/hadoop/spark/spark-shell

(In Spark shell) load U.S. constitution text file:

val file = sc.textFile("s3://[your-bucket-name]/us-constitution.txt")

(In Spark shell) count words in file, replacing dots and commas with space:

// remove linebreaks before pasting...
val counts = file
  .flatMap(line => line
    .toLowerCase()
    .replace(".", " ")
    .replace(",", " ")
    .split(" "))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)

(In Spark shell) Inspect the ten most frequent words (using unary minus to invert the sort order, i.e. descending):

val sorted_counts = counts.collect().sortBy(wc => -wc._2)
sorted_counts.take(10).foreach(println)
// prints lines containing (word, count) pairs

Save the sorted counts in S3:

sc.parallelize(sorted_counts).saveAsTextFile("s3://[your-bucket-name]/wordcount-us-consitution")
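
Note that saveAsTextFile writes the result as a directory of part files under that prefix. You can verify from your local machine that the output landed in S3 with the AWS CLI:

aws s3 ls s3://[your-bucket-name]/wordcount-us-consitution/
# lists part-00000, part-00001, ... files containing the sorted counts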

(Back on local machine) remember to terminate cluster when done:

elastic-mapreduce --terminate j-1R2OWN88UD8ZC
# replace j-1R2OWN88UD8ZC with the ID you got when launching the cluster

If you’ve forgotten the cluster ID, you can get a list of active clusters using the --list option:

elastic-mapreduce --list --active

Caveats

When first drafting this example, I was tempted to use a cheaper instance, i.e. m1.small. While Amazon EMR officially supports this instance type (tagged as “General Purpose – Previous Generation”), the word-count example didn’t work for me using this instance type. When I switched to the more recent and “beefier” instance type, m3.xlarge, everything worked out fine.

I also tried to bootstrap the instances with the latest version of Spark (1.0.0 at the time of writing). This failed to even launch on the m1.small instance. Note that the install script for 1.0.0 is a Ruby script (s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb) instead of the 0.8.1 shell script (s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh). It would be worth trying the example above with Spark 1.0.0 on a current instance type, e.g. m3.xlarge.

For more examples, check the Spark examples section, which includes the wordcount example that I’ve adapted a bit.