Warning: This is not the hardest way to create a word cloud from pdf-documents, but it’s up there.
Say you have a directory containing pdf documents:
$ ls
a.pdf b.pdf c.pdf ...
Say you want a word cloud of the words contained in the pdf documents, and you want to use the Linux command line. Say that you are only interested in words occurring between ABSTRACT and INTRODUCTION.
A word cloud is something that looks like this:
Step 1: Extract all content from the pdf documents as HTML using find and pdftohtml. (I got a suggestion to use pdftotext instead. In that case, it might be possible to skip the next step, i.e. using lynx to strip the tags.)
find . -name "*.pdf" -print0 | xargs -0 -n1 pdftohtml -stdout >> all.html
This produces a single file containing multiple HTML documents.
Step 2: Strip HTML tags using lynx:
lynx -dump all.html >> all.txt
This produces a rather noisy text file.
Step 3: Remove non-printable characters using perl:
perl -lpe 's/[^[:print:]]+//g' all.txt >> clean.txt
This produces a text file that is still noisy, but free of non-printable characters.
Step 4: Keep only the sections of text between ABSTRACT and INTRODUCTION (each occurring multiple times in an alternating fashion):
sed -n '/ABSTRACT/,/INTRODUCTION/p' < clean.txt | \
  grep -v -w INTRODUCTION > abstracts.txt
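The range expression in Step 4 is easier to follow on a toy input. The sketch below uses invented file names (sample.txt, sample-abstracts.txt) and invented text. It prints every line from a line matching ABSTRACT through the next line matching INTRODUCTION, then drops the INTRODUCTION marker lines.

```shell
# Hypothetical sample input standing in for clean.txt.
printf '%s\n' \
  'Some Title' \
  'ABSTRACT' \
  'We study word clouds.' \
  'INTRODUCTION' \
  'Body text we do not want.' \
  'ABSTRACT' \
  'A second abstract.' \
  'INTRODUCTION' \
  'More body text.' > sample.txt

# -n suppresses default output; p prints only the lines inside each range.
sed -n '/ABSTRACT/,/INTRODUCTION/p' < sample.txt |
  grep -v -w INTRODUCTION > sample-abstracts.txt

cat sample-abstracts.txt
```

The ABSTRACT marker lines are still present at this point. They are removed in a later step.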
Step 5: Download a stopwords file:
curl -o stopwords.txt http://skipperkongen.dk/files/english-stopwords-short.txt
Step 6: Keep only letters, make them lower case, put each word on a line, remove stopwords and some garbage (empty lines and single-letter leftovers). Sort the words so that the frequency count in the next step works:
grep -v -w ABSTRACT < abstracts.txt | \
  sed 's/[^a-zA-Z]/ /g' | \
  tr '[:upper:]' '[:lower:]' | \
  tr ' ' '\n' | \
  sed '/^$/d' | \
  sed '/^[a-z]$/d' | \
  grep -v -w -f stopwords.txt | \
  sort > words.txt
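Here is a sketch of the Step 6 pipeline on a tiny made-up input, so each stage can be checked by eye. The sample sentence, the two-word stopword list and the sample-* file names are all invented for illustration. The second tr call translates spaces to newlines, which is what "put each word on a line" requires.

```shell
# Hypothetical inputs: a short abstract and a tiny stopword list.
printf '%s\n' 'ABSTRACT' 'The R-tree is a spatial index; the index is fast.' > sample-abstracts.txt
printf '%s\n' 'the' 'is' > sample-stopwords.txt

# Stages, in order:
#   1. drop the ABSTRACT marker lines
#   2. turn every non-letter into a space
#   3. lower-case everything
#   4. put each word on its own line
#   5. delete empty lines
#   6. delete single-letter leftovers
#   7. delete stopwords
#   8. sort
grep -v -w ABSTRACT < sample-abstracts.txt |
  sed 's/[^a-zA-Z]/ /g' |
  tr '[:upper:]' '[:lower:]' |
  tr ' ' '\n' |
  sed '/^$/d' |
  sed '/^[a-z]$/d' |
  grep -v -w -f sample-stopwords.txt |
  sort > sample-words.txt

cat sample-words.txt
```

Note that "R-tree" is split into "r" and "tree", and the lone "r" is then dropped by the single-letter filter.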
Step 7: At this point the file words.txt could be plugged into a piece of word cloud software like www.wordle.net.
If you want, you can also create a frequency file now with the following command:
uniq -c < words.txt | sort -r -n > frequencies.txt
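The frequency command only works on sorted input, because uniq -c counts adjacent duplicate lines. A small sketch with invented data (the file names sample-unsorted.txt and sample-counts.txt are hypothetical) shows the difference. awk is used here only to strip the padding that uniq -c puts in front of each count.

```shell
# Hypothetical unsorted word list.
printf '%s\n' tree index tree index tree > sample-unsorted.txt

# Unsorted: no two equal lines are adjacent, so every count is 1.
uniq -c < sample-unsorted.txt | awk '{print $1, $2}'

# Sorted first: equal words sit next to each other and are counted together.
sort sample-unsorted.txt | uniq -c | sort -r -n |
  awk '{print $1, $2}' > sample-counts.txt

cat sample-counts.txt
```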
Step 8: Only create a word cloud for the 500 most common terms. Create a “go-file” from the frequencies file.
head -n500 < frequencies.txt | awk '{print $2}' > go-file.txt
Step 9: Filter words.txt by the 500 most common terms:
grep -w -f go-file.txt words.txt > commonwords.txt
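Steps 8 and 9 can be tried end to end on a small made-up word list. In this sketch the sample-* file names are invented, and the cut-off is 2 terms instead of 500 so the effect is visible. awk is used to pick out the word column because uniq -c pads the counts with a varying number of leading spaces, which makes a fixed field number with cut unreliable.

```shell
# Hypothetical sorted word list standing in for words.txt.
printf '%s\n' fast index index spatial tree tree tree > sample-words.txt

# Count occurrences, most frequent first.
uniq -c < sample-words.txt | sort -r -n > sample-frequencies.txt

# Keep the 2 most frequent terms. awk splits on runs of whitespace,
# so the padding in front of the counts does not matter.
head -n 2 < sample-frequencies.txt | awk '{print $2}' > sample-go-file.txt

# Keep only the occurrences of those terms.
grep -w -f sample-go-file.txt sample-words.txt > sample-commonwords.txt

cat sample-commonwords.txt
```

The result still contains one line per occurrence, so word cloud software that sizes words by repetition can use it directly.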
If you don’t like what you see, you can revisit your stopwords file and enter more terms.