Helpful Linux commands when working with large datasets.


The penguin is our best friend.

Whenever I work with files whose contents I cannot easily preview because their size (usually gigabytes or more) prevents the data from being quickly loaded or processed locally, GNU/Linux is my best friend. Below I demonstrate a set of very useful GNU tools on macOS.

First, open a Bash shell and install the GNU coreutils. If you already have Homebrew, a popular OS X package manager, you can just run:

$ brew install coreutils

This installs the GNU versions of common tools alongside the default BSD versions. Note that a ‘g’ prefix is added to the names of commands that already exist on the system, to avoid conflicts; GNU split, for example, becomes gsplit. If you are curious about why GNU versus BSD, you can learn more here.

Now, let’s make good use of these tools. Suppose that you have a very large file and you’d like to see what is in the first 20 lines, for example. In the shell, you can simply run:

$ head -n 20 filename.csv

The same can be accomplished using the sed command:

$ sed 20q filename.csv > outputfile.txt  # q means "quit" after line 20

The optional last part, “> outputfile.txt”, writes the lines to an output file instead of printing them on the screen.
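
As a quick check of the two commands above, here is a minimal sketch on a generated 100-line file. The scratch directory and file names are made up for illustration. The last command is an extra, not from the text above: the same sed idea printing only lines 21 to 40 and quitting right after.

```shell
# Build a small 100-line sample file in a scratch directory (hypothetical names).
workdir=$(mktemp -d)
seq 1 100 > "$workdir/sample.txt"

# Both commands stop reading after line 20.
head -n 20 "$workdir/sample.txt" > "$workdir/head_out.txt"
sed 20q "$workdir/sample.txt" > "$workdir/sed_out.txt"

# cmp -s is silent and succeeds only if the two files are byte-for-byte identical.
cmp -s "$workdir/head_out.txt" "$workdir/sed_out.txt" && echo "head and sed agree"

# Preview a slice in the middle: -n suppresses automatic printing, "21,40p"
# prints lines 21 to 40, and "40q" quits so the rest of the file is not read.
sed -n '21,40p;40q' "$workdir/sample.txt" > "$workdir/slice.txt"
```

Quitting early matters on very large files, since neither tool has to read past the lines you asked for.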

If the file hasn’t been downloaded from its source URL, then you can quickly get the first 20 lines using curl:

$ curl <> | head -n 20 > output.txt

The command “curl” fetches the raw content at a URL using a protocol such as HTTP and writes it to standard output. For a web page, it’s essentially the same as “viewing the source”; for a data file, it is the file’s contents.

In case you want to know the number of lines in the file, you can quickly and easily do so as well:

$ wc -l filename.csv

“wc” prints the newline, word, and byte counts, but “-l” restricts the output to just the number of lines. Note that this command counts line feed characters (\n), so it works when lines end with a line feed, but not when they are separated only by a carriage return (\r).
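
To illustrate the line-ending caveat, here is a small sketch using a made-up three-record file whose lines end in carriage returns only. Translating \r to \n with tr is one possible workaround, offered here as an assumption rather than as part of the original workflow.

```shell
# Three records terminated by carriage returns only (hypothetical sample data).
workdir=$(mktemp -d)
printf 'a\rb\rc\r' > "$workdir/cr_only.txt"

# wc -l counts line feed characters, and this file contains none.
cr_count=$(wc -l < "$workdir/cr_only.txt")

# Translating each carriage return into a line feed makes the records countable.
lf_count=$(tr '\r' '\n' < "$workdir/cr_only.txt" | wc -l)

echo "before tr: $cr_count, after tr: $lf_count"
```

Files with Windows-style endings (\r\n) still contain a line feed per line, so wc -l counts them correctly without this step.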

What if you want to break the file into smaller pieces? In this case, you could use split:

$ split -l 1000000 -a 4 filename.csv file_out.  # split by number of lines

In the example above, the input “filename.csv” is split into files of 1 million lines each. Each file is named “file_out.” plus a 4-letter suffix, as set by the option “-a”. Note that the header is preserved only in the first piece. To split by number of bytes instead, replace “-l” with “-b”, in which case lines of data may be broken across files.
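
To see the naming scheme and piece sizes on something small, here is a minimal sketch that splits a generated 2500-line file into pieces of 1000 lines. The file names and the “part_” prefix are made up for illustration.

```shell
# Generate a 2500-line file in a scratch directory (hypothetical names).
workdir=$(mktemp -d)
seq 1 2500 > "$workdir/numbers.txt"

# Pieces of 1000 lines each, with 4-letter suffixes: part_aaaa, part_aaab, ...
split -l 1000 -a 4 "$workdir/numbers.txt" "$workdir/part_"

# Show how many lines ended up in each piece; the last one holds the remainder.
wc -l "$workdir"/part_*
```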

In order to split the file and preserve the header in each output, one idea is to create a BASH function to store the header and apply it as a filter:

$ keep_header() { { head -n 1 filename.csv; cat; } > "$FILE"; };
$ export -f keep_header;
$ tail -n +2 filename.csv | gsplit --lines=1000 --filter=keep_header - temp_

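The first line defines a function named “keep_header”, which writes the first line of “filename.csv” (the header), followed by whatever it receives on standard input, to the file named by the variable “FILE”; split sets “FILE” to the name of each output file before running the filter. The second line exports the function (-f) to the environment so that the shell started by split can call it. Finally, in the third line, “tail -n +2” passes everything from line 2 onward of “filename.csv” to split, which cuts it into pieces of 1000 lines each and sends each piece through the filter, so every output file starts with the header. The prefix for the output file names is “temp_”.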
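
A possible variant of the same idea, sketched here on a tiny generated CSV, passes the filter command to split directly instead of exporting a function. It assumes GNU split (installed as gsplit by Homebrew, plain split on most Linux systems). The file names, the “chunk_” prefix, and the SRC variable are made up for illustration.

```shell
# A header plus 25 data rows in a scratch directory (hypothetical sample data).
workdir=$(mktemp -d)
{ echo "id,value"; seq 1 25 | sed 's/$/,x/'; } > "$workdir/data.csv"

# Prefer gsplit where Homebrew installed it; otherwise assume split is GNU split.
split_cmd=$(command -v gsplit || command -v split)

# Exported so the shell that split starts for each piece can read it.
export SRC="$workdir/data.csv"

# For each 10-line piece, split sets $FILE to the output name and runs the filter,
# which writes the header first and then the piece arriving on standard input.
# SHELL=/bin/sh makes split run the filter with a POSIX shell.
tail -n +2 "$SRC" |
    SHELL=/bin/sh "$split_cmd" --lines=10 \
        --filter='{ head -n 1 "$SRC"; cat; } > "$FILE"' - "$workdir/chunk_"

wc -l "$workdir"/chunk_*
```

Because the filter is a plain command string, it does not depend on Bash-specific function exporting; the trade-off is that anything it refers to has to reach it through the environment.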

Concatenating the split files back into one file can be done with the “cat” command, which writes the content of each specified input file, in order, to standard output:

$ cat file_out.* > concat_file.csv
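
If each piece carries its own header, a plain cat would repeat the header in the middle of the result. One way to keep only the first header is sketched below with awk, which is a swapped-in tool rather than something used elsewhere in this post. The two small pieces and their names are made up for illustration.

```shell
# Two pieces that each start with the same header (hypothetical sample data).
workdir=$(mktemp -d)
printf 'id,value\n1,a\n2,b\n' > "$workdir/temp_aa"
printf 'id,value\n3,c\n' > "$workdir/temp_ab"

# FNR restarts at 1 for every input file, while NR keeps counting across files,
# so the condition skips the first line of every file except the very first one.
awk 'FNR == 1 && NR != 1 { next } { print }' "$workdir"/temp_* > "$workdir/merged.csv"

cat "$workdir/merged.csv"
```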

Finally, obtaining a random sample from a file can be done with the “shuf” command (“gshuf” when installed through Homebrew). For example, to get a random sample of 50000 lines from “filename.csv” and output it as “subset_50k.csv”, you can run the following:

$ gshuf -n 50000 filename.csv -o subset_50k.csv

Again, the header is unlikely to be preserved: shuf treats it like any other line, so it is either left out of the sample or ends up at a random position.
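
One way to keep the header on top, sketched here on a small generated file, is to copy the header separately and shuffle only the data rows. The file names and the sample size of 50 are made up for illustration, and the sketch assumes shuf is available either as gshuf or as shuf.

```shell
# A header plus 1000 data rows in a scratch directory (hypothetical sample data).
workdir=$(mktemp -d)
{ echo "id"; seq 1 1000; } > "$workdir/data.csv"

# Prefer gshuf where Homebrew installed it; otherwise fall back to shuf.
shuf_cmd=$(command -v gshuf || command -v shuf)

# Write the header first, then a random sample of 50 rows taken from line 2 onward.
{
    head -n 1 "$workdir/data.csv"
    tail -n +2 "$workdir/data.csv" | "$shuf_cmd" -n 50
} > "$workdir/subset.csv"

head -n 5 "$workdir/subset.csv"
```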

The commands shown above can save a significant amount of time when performing initial checks on large datasets. If you have other useful suggestions, please share them in the comments section.

