How to randomly sample k lines from a file in *nix

You can use the shell to extract a random sample of lines from a file in *nix. The two commands you need are "shuf" and "head" (+ "tail" for CSV files with a header). The shuf command will randomly shuffle all the lines of its input. The head command will cut of the input after the first k lines. Examples for both general files and CSV files are given below.

General pattern

To randomly sample 100 lines from any file in *nix:

shuf INPUT_FILE | head -n 100 > SAMPLE_FILE

Pattern for CSV

If you file is a CSV file, you probably want to extract the header and only sample the body. You can use the head and tail commands, respectively, to extract the header and sample the contents of the CSV file.

Extract the header of the CSV file:

head -1 INPUT_FILE.csv > SAMPLE_FILE.csv

Sample 100 lines from the body of the CSV file and append to sample file (notice ">" above versus ">>" below):

tail +2 INPUT_FILE.csv | shuf | head -100 >> SAMPLE_FILE.csv

Install dependencies on Mac

On Mac, the shuf command is not shipped with the OS. You can get it via brew. It will be named "gshuf":

brew install coreutils

So, on Mac you should replace shuf with gshuf in the example above.

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.