The following assumes a Linux command line (or, in my case, a Mac OS X terminal).
I want to wrangle text from the internet, turn it into JSON data, and ultimately stick it in CouchDB. Here I’m trying to turn a random text file containing prime numbers into structured JSON data: a single array of numbers, [2,3,5,7,...,7919].
The original file is here: http://primes.utm.edu/lists/small/1000.txt. It is fairly structured to begin with, but it’s not JSON. It starts like this:
The First 1,000 Primes
(the 1,000th is 7919)
For more information on primes see http://primes.utm.edu/
2 3 5 7 11 13 17 19 23 29
31 37 41 43 47 53 59 61 67 71
73 79 83 89 97 101 103 107 109 113
The following command turns it into JSON:
curl http://primes.utm.edu/lists/small/1000.txt | \
tail +4 | \
tr -cs "[:digit:]" "," | \
sed -e 's/^,/\[/' -e 's/,$/\]/' \
> primes.json
Let’s look at it with cat to make sure:
$ cat primes.json
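Eyeballing a thousand numbers is error-prone, so a quick element count can back up the visual check. This is only a minimal sketch: it assumes the file is one flat array of numbers, and the `sample` variable is a made-up stand-in for the contents of primes.json.

```shell
# Hypothetical short sample standing in for the contents of primes.json.
sample='[2,3,5,7,11]'

# Keep only the commas, count them, and strip the padding some wc versions add.
commas=$(printf '%s' "$sample" | tr -cd ',' | wc -c | tr -d ' ')

# A flat array has one more element than it has commas.
count=$((commas + 1))
echo "$count"
```

For the real file, setting `sample=$(cat primes.json)` should give a count of 1000.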
Explanation of the command
curl downloads the file and prints it on standard output, which here is piped into the next command. With no options, it issues an HTTP GET for http://primes.utm.edu/lists/small/1000.txt.
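tail +4 prints the input starting at line 4, so it discards the first three lines (the header). The blank line that follows the header is kept, which is harmless because the next step turns it into a comma. The bare +4 form is an older spelling that not every version of tail accepts; tail -n +4 does the same thing and is the portable way to write it.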
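The line offset is easy to get wrong by one, so here is a small sketch on made-up input (the five-line sample is an assumption, not the real file). It uses the `-n +4` spelling of the same option.

```shell
# Five made-up lines: three header lines, one blank line, one data line.
input='title
subtitle
more info

2 3 5'

# Start output at line 4: the blank line and everything after it.
rest=$(printf '%s\n' "$input" | tail -n +4)
printf '%s\n' "$rest"
```

The three header lines are gone, while the blank line and the data line remain.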
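tr -cs "[:digit:]" "," replaces every character that is not a digit with a comma (-c selects the complement of the digit set) and squeezes each run of commas into a single one (-s). The result is the numbers separated by commas, with one comma before the first number and one after the last. No linebreaks or spaces: ,2,3,5,7...,7919,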
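To see the two flags at work, here is a minimal sketch on a made-up fragment (a blank line followed by two rows of padded numbers, loosely imitating the layout of the real file):

```shell
# Made-up input: a blank line, then digits separated by spaces and newlines.
csv=$(printf '\n  2  3  5\n  7 11 13\n' | tr -cs '[:digit:]' ',')

# Every run of non-digits has collapsed into exactly one comma.
echo "$csv"   # prints ,2,3,5,7,11,13,
```

The leading comma comes from the blank line and the padding before the first number, the trailing one from the final newline.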
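sed -e 's/^,/\[/' -e 's/,$/\]/' is perhaps a bit hard to read. It runs two substitutions on the single line that tr produced: the first replaces the comma before the first digit with '[' (^ anchors the match to the start of the line), and the second replaces the comma after the last digit with ']' ($ anchors it to the end).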
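The steps after the download can be tried offline by feeding them a small stand-in for the file. The sketch below makes a few assumptions: the function name `to_json` is hypothetical, the sample input is made up, `tail -n +4` is used as the portable spelling of `tail +4`, and the brackets in the sed replacements are written without backslashes. The fake header deliberately contains digits to show why it has to be cut off before tr runs.

```shell
# Hypothetical helper: everything in the pipeline after the download.
to_json() {
  tail -n +4 | tr -cs '[:digit:]' ',' | sed -e 's/^,/[/' -e 's/,$/]/'
}

# Made-up stand-in for the file: three header lines, a blank line, two data rows.
json=$(printf 'Title 1,000\n(note 7919)\nsee site\n\n  2  3  5\n  7 11\n' | to_json)
echo "$json"   # prints [2,3,5,7,11]
```

With the real file, the same function could be fed from curl and its output redirected to primes.json.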