Screen capturing with PhantomJS

PhantomJS is a headless browser that you can use, e.g., to test web UIs and to screen capture webpages. I will focus on the latter use case.

Since PhantomJS knows how to execute JavaScript, it can create a screen shot of most webpages, even those that render part of their GUI using JavaScript.

Installing

To get started with PhantomJS, download and unzip a PhantomJS binary for your system. In the unzipped directory structure you’ll find bin/phantomjs, which is a ready-to-use binary program. You can add that directory to your PATH if you like.

PhantomJS is controlled by JavaScript. The script rasterize.js is a useful multi-purpose script for creating screen shots. We will use this script, so download it and store it somewhere convenient.
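To give a feel for what “controlled by JavaScript” means, here is a minimal sketch of a control script. It is not rasterize.js; the file name capture.js and the helper parseArgs are made up for illustration. It only uses the webpage and system modules and the phantom object that PhantomJS provides, and the capture part runs only when the script is executed by PhantomJS.

```javascript
// capture.js: a minimal sketch of a PhantomJS control script (hypothetical file name).
// Usage: phantomjs capture.js <url> <output file>

// Hypothetical helper: turn the command-line arguments into capture options.
// Returns null when the url or the output file is missing.
function parseArgs(args) {
  if (args.length < 2) {
    return null;
  }
  return { url: args[0], output: args[1] };
}

// PhantomJS defines the global `phantom` object; skip the capture elsewhere.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var page = require('webpage').create();
  // system.args[0] is the script name, so drop it before parsing.
  var options = parseArgs(system.args.slice(1));

  if (!options) {
    console.log('Usage: phantomjs capture.js <url> <output file>');
    phantom.exit(1);
  } else {
    page.open(options.url, function (status) {
      if (status !== 'success') {
        console.log('Unable to load ' + options.url);
        phantom.exit(1);
      } else {
        // The output format is chosen from the file extension (e.g. .pdf, .png).
        page.render(options.output);
        phantom.exit();
      }
    });
  }
}
```

The pattern is always the same: create a page, open a URL, wait for the callback, render, and exit explicitly. Without the call to phantom.exit() the process keeps running.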

Hello world

I have created a simple test page that produces part of its content using JavaScript. If JavaScript is enabled, the page will read “Hello Javascript”. Otherwise, the page reads “Hello”. Let us now screen capture this page using PhantomJS:

# Copy paste everything into a terminal window and run it
# You need to specify the right paths to:
# - phantomjs (e.g. add phantom "bin" dir to PATH)
# - rasterize.js (e.g. run below command in dir containing script)
phantomjs rasterize.js http://skipperkongen.dk/files/hello_javascript.html hello_javascript.pdf

If that went well, you should now have a PDF file called hello_javascript.pdf in the directory where you ran the command. Open the PDF and confirm that it contains the text “Hello Javascript” just like the web page does.

Screen capturing a real blog post

Hopefully, the above experiment worked. However, the content in the generated PDF was not too interesting. Let’s repeat the above experiment with a real blog post, namely the first blog post I ever wrote on skipperkongen.dk:

# Copy paste everything into a terminal window and run it
# You need to specify the right paths to:
# - phantomjs (e.g. add phantom "bin" dir to PATH)
# - rasterize.js (e.g. run below command in dir containing script)
phantomjs rasterize.js \
http://skipperkongen.dk/2010/11/14/hard-to-less-hard/ skipperkongen.pdf

If you open the generated PDF you will see that it is not the prettiest sight. The PDF has only a passing resemblance to what the original blog post looks like if you open it in a “normal” browser. This is perhaps all according to specifications, but I (and I’m guessing you) would like a more aesthetically pleasing result.

Inspecting the generated PDF

Before we try to understand why the generated PDF looks the way it does, let us describe what we are seeing. So what does the PDF look like?

First, the generated PDF is missing the content header found on the web page. Second, the rendered PDF has an incredibly narrow page layout or uses a very big font size. Third, on my Mac there is a weird “private use” symbol in several places in the PDF. Regarding the third issue, there is a fun discussion over at StackExchange for Mac OS X about the “private use” symbol with some interesting background information.

Why does the generated PDF look this way?

In order to understand why PhantomJS renders a page in a certain way, it is relevant to look at the following pages:

There is honestly not a lot of content there, so let’s try to analyze the issues ourselves. Regarding the missing header, the HTML source code for the blog post specifies a “print” CSS style with the following CSS definition:

<style type="text/css" media="print">#wpadminbar { display:none; }</style>

It seems that PhantomJS uses the “print” CSS style, if available, when generating a PDF, which would explain the missing content header.
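One way to check which print styles a page declares is to ask PhantomJS itself. The following is a rough sketch under my own assumptions: the file name printstyles.js and the helper printOnly are made up for illustration, and I have not verified what it reports for the blog post above. It loads a page, collects the media attribute of every style sheet inside the page, and prints those that mention “print”.

```javascript
// printstyles.js: hypothetical script that lists the print style sheets of a page.
// Usage: phantomjs printstyles.js <url>

// Hypothetical helper: keep only style sheets whose media list mentions "print".
function printOnly(sheets) {
  return sheets.filter(function (sheet) {
    return /\bprint\b/.test(sheet.media);
  });
}

// PhantomJS defines the global `phantom` object; skip the page load elsewhere.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var page = require('webpage').create();

  page.open(system.args[1], function (status) {
    if (status !== 'success') {
      console.log('Unable to load page');
      phantom.exit(1);
      return;
    }
    // This function runs inside the loaded page and returns plain data.
    var sheets = page.evaluate(function () {
      return Array.prototype.map.call(document.styleSheets, function (s) {
        return { media: s.media.mediaText, source: s.href || 'inline <style>' };
      });
    });
    printOnly(sheets).forEach(function (sheet) {
      console.log(sheet.media + ': ' + sheet.source);
    });
    phantom.exit();
  });
}
```

Note that page.evaluate runs its function in the context of the web page, so it can only hand back simple values such as the objects built here.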

Regarding the narrow layout, recall that we used rasterize.js as the control script for phantomjs. The code in the script will have a big impact on what we are seeing, which could include layout. Inside the rasterize.js script we find the following line:

page.viewportSize = { width: 600, height: 600 };

That partly explains the narrow layout. If we change these settings to width: 1800 and height: 1000 in a copy of the file (rasterize2.js) and rerun the screen capture, we get a wider PDF canvas. However, this only partly fixes the actual content layout. A full solution will require more, e.g. working with the page CSS.
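Instead of hard-coding a new size in a copy of the script, the viewport can also be passed on the command line. The following is a minimal sketch, assuming a hypothetical file name viewport.js, a made-up helper parseViewport, and a WIDTHxHEIGHT argument format that I chose for illustration; none of this is taken from rasterize.js apart from the 600 by 600 default quoted above.

```javascript
// viewport.js: hypothetical script that captures a page with a configurable viewport.
// Usage: phantomjs viewport.js <url> <output file> [WIDTHxHEIGHT]

// Hypothetical helper: parse a string such as "1800x1000" into a viewport object.
// Falls back to 600x600, the value found in rasterize.js, when the string is
// missing or malformed.
function parseViewport(spec) {
  var match = /^(\d+)x(\d+)$/.exec(spec || '');
  if (!match) {
    return { width: 600, height: 600 };
  }
  return { width: parseInt(match[1], 10), height: parseInt(match[2], 10) };
}

// PhantomJS defines the global `phantom` object; skip the capture elsewhere.
if (typeof phantom !== 'undefined') {
  var system = require('system');

  if (system.args.length < 3) {
    console.log('Usage: phantomjs viewport.js <url> <output file> [WIDTHxHEIGHT]');
    phantom.exit(1);
  } else {
    var page = require('webpage').create();
    // Set the viewport before opening the page so the layout uses it.
    page.viewportSize = parseViewport(system.args[3]);

    page.open(system.args[1], function (status) {
      if (status === 'success') {
        page.render(system.args[2]);
      } else {
        console.log('Unable to load ' + system.args[1]);
      }
      phantom.exit(status === 'success' ? 0 : 1);
    });
  }
}
```

For example, phantomjs viewport.js http://skipperkongen.dk/2010/11/14/hard-to-less-hard/ skipperkongen_wide.pdf 1800x1000 would repeat the experiment with the larger viewport.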

In the next part of this post, I’ll dig more into the PhantomJS API.