Warning: This is not the hardest way to create a word cloud from PDF documents, but it's up there.
Say you have a directory containing PDF documents:
$ ls
a.pdf
b.pdf
c.pdf
...
Say you want a word cloud of the words contained in the PDF documents, and you want to use the Linux command line. Say that you are only interested in words occurring between ABSTRACT and INTRODUCTION.
Install some tools first if you're on a Mac:
brew install poppler
Step 1: Extract all content from the PDF documents as HTML using find and pdftohtml (I got a suggestion to use pdftotext instead, in which case it might be possible to skip the next step, i.e. using lynx to strip the tags; a sketch of that variant follows after this step):
find . -name "*.pdf" | xargs -n1 pdftohtml -stdout >> all.html
This produces a single file containing multiple HTML documents.
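If you go with the pdftotext suggestion instead, something along these lines should work and lets you skip Step 2 entirely (a sketch, not tested against these documents; continue from Step 3 with the resulting all.txt):
find . -name "*.pdf" | xargs -I{} pdftotext {} - >> all.txt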
Step 2: Strip HTML tags using lynx:
lynx -dump all.html >> all.txt
This produces a rather noisy text file.
Step 3: Remove non-printable characters using perl:
perl -lpe 's/[^[:print:]]+//g' all.txt >> clean.txt
This produces a noisy text file but sans non-printable characters.
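If you would rather avoid perl, a tr one-liner should do roughly the same thing (untested sketch; it deletes every byte that is neither printable nor a newline):
tr -cd '[:print:]\n' < all.txt > clean.txt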
Step 4: Keep only the sections of text between ABSTRACT and INTRODUCTION (each occurring multiple times in an alternating fashion):
sed -n '/ABSTRACT/,/INTRODUCTION/p' < clean.txt | \
grep -v -w INTRODUCTION > abstracts.txt
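The grep -v is needed because sed's range print includes the closing marker line itself; a toy example (made-up input) shows this:
printf 'x\nABSTRACT\nfoo\nINTRODUCTION\ny\n' | sed -n '/ABSTRACT/,/INTRODUCTION/p'
This prints ABSTRACT, foo and INTRODUCTION, so the closing marker still has to be filtered out here (the opening marker ABSTRACT is dropped later, in step 6).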
Step 5: Download a stopwords file:
curl -o stopwords.txt http://skipperkongen.dk/files/english-stopwords-short.txt
Step 6: Keep only letters, make them lower case, put each word on its own line, remove stopwords and stray single letters. Sort the result for good measure:
grep -v -w ABSTRACT < abstracts.txt | \
sed 's/[^a-zA-Z]/ /g' | \
tr '[:upper:]' '[:lower:]' | \
tr ' ' '\n' | \
sed '/^$/d' | \
sed '/^[a-z]$/d' | \
grep -v -w -f stopwords.txt | \
sort > words.txt
Step 7: At this point the file words.txt could be plugged into a piece of word cloud software like www.wordle.net.
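If you would rather stay on the command line, the Python wordcloud package ships a small command-line tool that can render an image directly from words.txt (assuming pip is available; flag names as documented by that package):
pip install wordcloud
wordcloud_cli --text words.txt --imagefile cloud.png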
If you want, you can also create a frequency file now with the following command:
uniq -c < words.txt | sort -r -n > frequencies.txt
Step 8: Restrict the word cloud to the 500 most common terms by creating a "go-file" from the frequencies file:
head -n500 < frequencies.txt | cut -f3 -d' ' > go-file.txt
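Note that uniq -c pads the count with a variable number of leading spaces, so the cut field index may be off on some systems; an awk variant that splits on runs of whitespace is more forgiving:
head -n 500 frequencies.txt | awk '{print $2}' > go-file.txt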
Step 9: Filter words.txt by the 500 most common terms:
cat words.txt | grep -w -f go-file.txt > commonwords.txt
If you don't like what you see, you can revisit your stopwords file and enter more terms.
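For example, append the offending terms (these three are just placeholders) and redo the filtering:
printf 'et\nal\nfig\n' >> stopwords.txt
grep -v -w -f stopwords.txt words.txt | grep -w -f go-file.txt > commonwords.txt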
Nice tutorial! One small comment: in step 2 you can also strip HTML tags using sed -e 's/<[^>]*>//g' all.html
Thanks again for putting this together, Dipankar
Hi Dipankar
You’re welcome, and thanks for the tip :-)
Kostas
Fantastic! Thank you.