The following assumes a Linux command line (or a Mac OS X terminal, in my case).
I want to wrangle text from the internet, turn it into JSON data, and ultimately stick it in CouchDB. Here I'm trying to turn a random text file containing prime numbers into structured JSON data that looks like this:
    [2, 3, 5, 7, ...]
The original file is here: http://primes.utm.edu/lists/small/1000.txt. It is fairly structured to begin with, but it's not JSON.
    The First 1,000 Primes
    (the 1,000th is 7919)
    For more information on primes see http://primes.utm.edu/

    2 3 5 7 11 13 17 19 23 29
    31 37 41 43 47 53 59 61 67 71
    73 79 83 89 97 101 103 107 109 113
    ...
    end.
The following command turns it into JSON:
    curl http://primes.utm.edu/lists/small/1000.txt | \
      tail +4 | \
      tr -cs "[:digit:]" "," | \
      sed -e 's/^,/\[/' -e 's/,$/\]/' \
      > primes.json
Let's look at it with cat to make sure:
    $ cat primes.json
    [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,...
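Eyeballing the output works for a quick look, but a pattern check is easy to add. The sketch below is only an illustration: looks_like_int_array is a made-up helper name, and the regular expression is a rough shape test (a single-line array of non-negative integers with no whitespace), not a real JSON parser.

```shell
# Hypothetical helper: reads text on stdin and succeeds only if it is
# one line of the form [n,n,...,n] with plain non-negative integers.
looks_like_int_array() {
  grep -Eq '^\[(0|[1-9][0-9]*)(,(0|[1-9][0-9]*))*\]$'
}

# A well-formed array passes the check.
printf '[2,3,5,7]\n' | looks_like_int_array && echo "looks like a JSON array"

# Text with stray leading and trailing commas does not.
printf ',2,3,5,7,\n' | looks_like_int_array || echo "not a JSON array"
```

Against the real file, the same check would be run as looks_like_int_array < primes.json.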
Explanation of the command
curl is used to download the file and print it on standard output in the terminal. With no options, it issues an HTTP GET for http://primes.utm.edu/lists/small/1000.txt.
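tail +4 starts printing at line 4, so it discards the first three lines (the header). On systems that reject the old +4 form, tail -n +4 does the same thing.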
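To see the tail step in isolation, here is a small offline sketch. The sample text is invented for illustration (it only mimics the shape of the real file), and it uses the -n +4 spelling, which means the same as +4: start output at line 4.

```shell
# Made-up stand-in for the download: three header lines, one blank
# line, then two rows of numbers.
sample=$(printf 'Title line\n(subtitle line)\nSee some URL\n\n2 3 5 7\n11 13 17 19\n')

# -n +4 starts printing at line 4, so header lines 1-3 are dropped.
rest=$(printf '%s\n' "$sample" | tail -n +4)

# Prints the blank line followed by the two number rows.
printf '%s\n' "$rest"
```

The blank line survives this step; it is the tr step that later folds it into a comma.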
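tr -cs "[:digit:]" "," replaces every run of non-digit characters with a single comma: -c selects the complement of the digit class (everything that is not a digit), and -s squeezes repeated commas into one. The new text has a comma before the first number and a comma after the last one, with no line breaks or spaces: ,2,3,5,7...,7919,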
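The tr step is easier to follow on a tiny input. This sketch feeds it a few made-up lines (a blank line, two rows of numbers, and a trailing "end." line, loosely imitating the real file) and captures the result in a variable named commas, which is just an illustrative name.

```shell
# Every run of non-digits (newlines, spaces, "end.") becomes one comma.
commas=$(printf '\n  2  3  5\n  7 11 13\nend.\n' | tr -cs "[:digit:]" ",")

# Prints: ,2,3,5,7,11,13,
printf '%s\n' "$commas"
```

The leading comma comes from the blank line and indentation before the first number, and the trailing one from the "end." line.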
sed -e 's/^,/\[/' -e 's/,$/\]/' is perhaps a bit hard to read. It replaces the comma before the first digit with '[', and replaces the comma after the last digit with ']'.
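Run on a short string, the two sed substitutions look like this. The input is a made-up stand-in for the tr output, and bracketed is only an illustrative variable name; the expressions themselves are the ones from the command above.

```shell
# First expression: comma at the start of the line becomes "[".
# Second expression: comma at the end of the line becomes "]".
bracketed=$(printf ',2,3,5,7,' | sed -e 's/^,/\[/' -e 's/,$/\]/')

# Prints: [2,3,5,7]
printf '%s\n' "$bracketed"
```

Only the first and last commas are touched, because ^ and $ anchor the patterns to the start and end of the line; the commas between the numbers stay as they are.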