In this example I had a big log file (many millions of lines) that I wanted to split into smaller log files (one million lines each) for processing on Elastic MapReduce.
-rw-r--r-- 1 kostas staff 543067012012 Oct 11 13:45 huge_logfile
This is a job for the split command. Because individual lines in the log file must be kept intact, the -l option is used to specify the number of lines per output file. In this example, lines matching a pattern are first selected with grep (discarding the rest), to show how split is used when data is piped in:
grep 'some-pattern' huge_logfile | split -a 6 -l 1000000 - log_
The dash in the split command tells it to read from standard input, while log_ is the prefix for the generated filenames. The -a 6 option tells split to use a six-character suffix after the prefix when naming the output files. The output looks like this:
huge_logfile log_aaaaaa log_aaaaab log_aaaaac log_aaaaad log_aaaaae ...
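As a quick sanity check (assuming the same log_ prefix used above), you can count the lines in each piece:

wc -l log_*

Each piece should report 1000000 lines, except possibly the last one, which holds whatever remainder is left over. And because the alphabetic suffixes sort lexically, cat log_* reassembles the pieces in their original order if you ever need the filtered data back in one file.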