Creating a word cloud from PDF documents

Warning: This is not the hardest way to create a word cloud from PDF documents, but it’s up there.

Say you have a directory containing PDF documents:

$ ls
a.pdf
b.pdf
c.pdf
...

Say you want a word cloud of the words contained in the PDF documents, and you want to build it on the Linux command line. Say also that you are only interested in words occurring between ABSTRACT and INTRODUCTION.

A word cloud is something that looks like this:

[Image: Wordle word cloud, "highscalability"]

Step 1: Extract all content from the PDF documents as HTML using find and pdftohtml. (I got a suggestion to use pdftotext instead. In that case it might be possible to skip the next step, i.e. using lynx to strip the tags.)

find . -name "*.pdf" -print0 | xargs -0 -n1 pdftohtml -stdout >> all.html

This produces a single file containing multiple HTML documents.
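
The pdftotext suggestion mentioned in Step 1 might look roughly like the sketch below. It assumes pdftotext (from poppler-utils) is installed, and the function name extract_text is made up for illustration. Whether the resulting text is clean enough to skip the lynx step would need checking on your own documents.

```shell
# Hypothetical helper: dump the text of every PDF below a directory to stdout.
# Assumes pdftotext (poppler-utils) is on the PATH; "-" as the output file
# name tells pdftotext to write to stdout instead of a .txt file.
extract_text() {
    find "$1" -name '*.pdf' -exec pdftotext {} - \;
}

# Usage (plain text output, so there are no HTML tags to strip afterwards):
# extract_text . > all.txt
```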

Step 2: Strip HTML tags using lynx:

lynx -dump all.html >> all.txt

This produces a rather noisy text file.

Step 3: Remove non-printable characters using perl:

perl -lpe 's/[^[:print:]]+//g' all.txt >> clean.txt

This produces a text file that is still noisy, but free of non-printable characters.

Step 4: Keep only the sections of text between ABSTRACT and INTRODUCTION (each occurring multiple times in an alternating fashion):

sed -n '/ABSTRACT/,/INTRODUCTION/p' < clean.txt | \
grep -v -w INTRODUCTION > abstracts.txt
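
The range expression can be tried out on a few made-up lines first. This is only a sketch; the function name extract_abstracts and the sample text are invented for illustration.

```shell
# Hypothetical helper: same sed range and grep as in Step 4, reading stdin.
extract_abstracts() {
    sed -n '/ABSTRACT/,/INTRODUCTION/p' | grep -v -w INTRODUCTION
}

# Two fake documents back to back. Only the ABSTRACT lines and the lines
# following them (up to INTRODUCTION) should come out.
printf '%s\n' \
    'Title A' 'ABSTRACT' 'we study clouds' 'INTRODUCTION' 'body text' \
    'Title B' 'ABSTRACT' 'words matter' 'INTRODUCTION' 'more body' |
    extract_abstracts
```

The ABSTRACT marker lines themselves are still in the output at this point; Step 6 removes them.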

Step 5: Download a stopwords file:

curl -o stopwords.txt http://skipperkongen.dk/files/english-stopwords-short.txt

Step 6: Keep only letters, make them lower case, put each word on its own line, and remove stopwords and some garbage (empty lines and single-letter words). Sort the result, so that identical words end up next to each other:

grep -v -w ABSTRACT < abstracts.txt | \
sed 's/[^a-zA-Z]/ /g' | \
tr '[:upper:]' '[:lower:]' | \
tr ' ' '\n' | \
sed '/^$/d' | \
sed '/^[a-z]$/d' | \
grep -v -w -f stopwords.txt | \
sort > words.txt
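
Here is the same chain of filters run on one made-up sentence, as a rough sketch. The function name normalize_words and the two-word stopword list are invented for illustration; the real stopwords file is much longer.

```shell
# Hypothetical helper: the Step 6 filters, reading text on stdin.
# $1 is the path to a stopwords file with one word per line.
normalize_words() {
    sed 's/[^a-zA-Z]/ /g' |          # everything that is not a letter becomes a space
        tr '[:upper:]' '[:lower:]' | # lower-case everything
        tr ' ' '\n' |                # one word per line
        sed -e '/^$/d' -e '/^[a-z]$/d' | # drop empty lines and single letters
        grep -v -w -f "$1" |         # drop stopwords
        sort
}

stop=$(mktemp)
printf 'the\nof\n' > "$stop"
printf 'The Cloud, the cloud: a B-tree index!\n' | normalize_words "$stop"
rm -f "$stop"
```

On this input, "the" is removed as a stopword, "a" and "b" are removed as single letters, and the remaining words come out sorted with duplicates kept.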

Step 7: At this point the file words.txt could be plugged into a piece of word cloud software like www.wordle.net.

If you want, you can also create a frequency file now with the following command:

uniq -c < words.txt | sort -r -n > frequencies.txt

Step 8: To create a word cloud for only the 500 most common terms, create a “go-file” from the frequencies file:

head -n500 < frequencies.txt | awk '{print $2}' > go-file.txt
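
One thing to watch out for: uniq -c pads the count with a varying number of leading spaces, so the word does not sit at a fixed position when the line is split on single spaces. A sketch that sidesteps this by letting awk split on runs of whitespace (the function name top_words and the sample words are made up):

```shell
# Hypothetical helper: print the N most frequent words from a sorted
# one-word-per-line stream on stdin. $1 is N.
top_words() {
    uniq -c |            # count adjacent duplicates: "<padding><count> <word>"
        sort -r -n |     # highest count first
        head -n "$1" |
        awk '{print $2}' # awk ignores the leading padding; $2 is the word
}

printf 'cloud\ncloud\ncloud\nindex\nindex\ntree\n' | top_words 2
```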

Step 9: Filter words.txt by the 500 most common terms:

grep -w -f go-file.txt < words.txt > commonwords.txt

If you don’t like what you see, you can revisit your stopwords file and enter more terms.
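
Since both words.txt and the go-file hold exactly one word per line, the filter in Step 9 could also be written with fixed-string, whole-line matching. This is a sketch of an alternative, not what the steps above use; the function name keep_common and the sample data are made up.

```shell
# Hypothetical helper: keep only the stdin lines that exactly equal
# one of the lines in the go-file ($1).
#   -F  treat the patterns as plain strings, not regular expressions
#   -x  a pattern must match the whole line
keep_common() {
    grep -F -x -f "$1"
}

go=$(mktemp)
printf 'cloud\nindex\n' > "$go"
printf 'cloud\ncloud\nclouds\nindex\ntree\n' | keep_common "$go"
rm -f "$go"
```

With purely alphabetic words the result should be the same as with -w; the fixed-string variant just avoids interpreting the words as patterns.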
