Compbio 002: The command line is your friend

For those of you who don't know what the command line is, it is simply a text-based way of interacting with your computer. No mouse, no icons to click, no drop-down menus to select from. Sounds awful, right? Well, over the past three years I have come to love the command line, but when I first encountered it, it was painful and I didn't quite see the point. With practice, though, many things are actually quicker and easier to do on the command line than they would be in a purpose-built graphical program.

A dataset with hundreds of thousands of rows would be a nightmare to work with in Excel, but on the command line it is not much of a problem at all. Most biological datasets are just some form of plain text file. Whether it is a fasta file (.fasta/.fa) for storing sequences, a fastq file (.fastq/.fq) for storing reads from a sequencing run, or a genome annotation file (.GTF/.GFF), they are all plain text files. The UNIX command line has many powerful tools for dealing with text files, so it is perfect for a computational biologist. Many of the tasks you might want to learn programming for can actually be done with simple command line tools, without writing a single program.

There are plenty of places online to get used to working with the command line interface in a UNIX(-like) environment. I would suggest these as good places to start. This post isn't a tutorial per se, but a taster to try to get you in the right mindset before starting down the rabbit hole.

Codecademy (Interactive)
https://www.codecademy.com/en/courses/learn-the-command-line/lessons/environment/exercises/bash-profile

The Command Line Crash Course (Non-interactive)
http://cli.learncodethehardway.org/book/

The first thing that you should realise is that a table in Excel is basically just a giant text file with extra annotations. To make a table, you don't need Excel. All you need are values, with special characters (usually commas or tabs) to separate the columns, and a new line character to separate the rows. Here we will imagine you are working with a tab-delimited text file (.tsv/.tab/.txt) with transcript expression information.
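To make that concrete, here is a minimal sketch that builds a three-line table with nothing but printf, where \t is a tab and \n is a new line. The file name tiny_table.tsv is made up for illustration, and the values are borrowed from the example file used later in this post.

```shell
# printf reuses the format for every group of three values,
# so each group becomes one row: tab, tab, new line.
# The > sends the result into a file instead of the screen.
printf '%s\t%s\t%s\n' \
  Transcript_ID Gene_name Change \
  T_0001 SMG1 Down \
  T_0003 RS2Z37 Up > tiny_table.tsv

# cut splits each line on tabs by default; -f2 keeps only the second column.
cut -f2 tiny_table.tsv
```

If you opened tiny_table.tsv in Excel, you would see the same three columns and three rows, because the tabs and new lines are all the structure a table needs.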

To access the command line, you need a terminal program running a shell such as bash. These come built in on Macs and Linux systems. You can emulate them on Windows with programs like Cygwin, and Windows now has its own native Linux terminal with Ubuntu on Windows (https://www.microsoft.com/store/apps/9nblggh4msv6). The text file used in the examples below is available here. But only attempt to follow along at home after taking some lessons from the above resources.

It is usually a good idea to have a look at the first few lines of your file to get a feel for it. You can do this with the command head:

$ head Example_transcripts.txt
Transcript_ID Gene_name Untreated_abundance Treated_abundance Change
T_0001 SMG1 200 2 Down
T_0002 SMG1 50 0.5 Down
T_0003 RS2Z37 50 150 Up
T_0004 RS2Z37 250 250 No_change
T_0005 RS2Z37 50 50 No_change
T_0006 TOPLESS 25 25 No_change
T_0007 TOPLESS 100 100 No_change
T_0008 EF1alpha 500 500 No_change
T_0009 RS2Z38 25 100 Up

You can see that the first 10 lines of this file are printed (the header plus nine transcripts). It might not be neatly formatted and spaced like in Excel, but that is something to get used to. To see the last 10 lines, use tail:

$ tail Example_transcripts.txt
T_0006 TOPLESS 25 25 No_change
T_0007 TOPLESS 100 100 No_change
T_0008 EF1alpha 500 500 No_change
T_0009 RS2Z38 25 100 Up
T_0010 RS2Z38 100 100 No_change
T_0011 ANR 300 300 No_change
T_0012 PEX5 100 100 No_change
T_0013 eIF5L1 100 250 Up
T_0014 SMG7-2 50 200 Up
T_0015 LUG 125 125 No_change

Notice how there is an overlap of the lines shown by both head and tail. That's because the file is very small, with only 16 lines (including the header). We can count the exact number of lines with the simple but useful command: wc -l.

$ wc -l Example_transcripts.txt
16 Example_transcripts.txt
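Ten lines is only the default. As a rough sketch, head and tail both take a -n option to set how many lines you get back. The file demo_transcripts.txt below is a made-up stand-in containing the header and the first five rows of the example file, so that the commands can be run on their own.

```shell
# Build a small stand-in file: a header plus five transcripts, tab-separated.
printf '%s\t%s\t%s\t%s\t%s\n' \
  Transcript_ID Gene_name Untreated_abundance Treated_abundance Change \
  T_0001 SMG1 200 2 Down \
  T_0002 SMG1 50 0.5 Down \
  T_0003 RS2Z37 50 150 Up \
  T_0004 RS2Z37 250 250 No_change \
  T_0005 RS2Z37 50 50 No_change > demo_transcripts.txt

# Print only the first 3 lines (the header and two transcripts).
head -n 3 demo_transcripts.txt

# Print only the last 2 lines.
tail -n 2 demo_transcripts.txt

# With a plus sign, tail starts AT that line number instead:
# this prints everything from line 2 onwards, i.e. the file without its header.
tail -n +2 demo_transcripts.txt
```

The last form is handy whenever you want to work with the data rows only and leave the header out.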

Now this is where the real fun begins: we can start to filter this file. What if we only wanted to view the lines (transcripts) of a particular gene? Well, we can use grep. grep is a very powerful UNIX tool for matching strings (sequences of text characters):

$ grep "RS2Z37" Example_transcripts.txt
T_0003 RS2Z37 50 150 Up
T_0004 RS2Z37 250 250 No_change
T_0005 RS2Z37 50 50 No_change

Now maybe, instead of printing the filtered lines to the screen, we wanted to count the number of transcripts annotated as our gene of interest. On the command line, we can link our programs together: the output of one (grep) can become the input of another (wc -l) by piping. To pipe the output of one command into another we use the | symbol.

$ grep "RS2Z37" Example_transcripts.txt | wc -l
3

It is easy for us to count to three, but when working with a much larger file with more transcripts and genes, this comes in handy.
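Two grep options are worth knowing at this point, sketched below on a made-up stand-in file (demo_genes.txt, holding the RS2Z37 and RS2Z38 rows from the example file). The -c option makes grep do the counting itself, and -w, available in GNU and BSD grep, only accepts matches that are whole words. The second matters because grep matches any part of a line, so a search string that is too short can pick up more genes than you intended.

```shell
# Stand-in file: header plus the RS2Z37 and RS2Z38 transcripts, tab-separated.
printf '%s\t%s\t%s\t%s\t%s\n' \
  Transcript_ID Gene_name Untreated_abundance Treated_abundance Change \
  T_0003 RS2Z37 50 150 Up \
  T_0004 RS2Z37 250 250 No_change \
  T_0005 RS2Z37 50 50 No_change \
  T_0009 RS2Z38 25 100 Up \
  T_0010 RS2Z38 100 100 No_change > demo_genes.txt

# -c prints the number of matching lines, same result as grep ... | wc -l.
grep -c "RS2Z37" demo_genes.txt      # prints 3

# A shorter string matches inside both gene names, so the count goes up.
grep -c "RS2Z3" demo_genes.txt       # prints 5

# -w only counts lines where the string appears as a whole word.
grep -c -w "RS2Z37" demo_genes.txt   # prints 3
```

On a real annotation with thousands of similarly named genes, it is worth checking a count like this against what you expect before trusting a filter.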

You can also do a grep to find lines that do NOT include your string with the -v option.

$ grep -v "RS2Z37" Example_transcripts.txt
Transcript_ID Gene_name Untreated_abundance Treated_abundance Change
T_0001 SMG1 200 2 Down
T_0002 SMG1 50 0.5 Down
T_0006 TOPLESS 25 25 No_change
T_0007 TOPLESS 100 100 No_change
T_0008 EF1alpha 500 500 No_change
T_0009 RS2Z38 25 100 Up
T_0010 RS2Z38 100 100 No_change
T_0011 ANR 300 300 No_change
T_0012 PEX5 100 100 No_change
T_0013 eIF5L1 100 250 Up
T_0014 SMG7-2 50 200 Up
T_0015 LUG 125 125 No_change

Now imagine that you do not want to print the results of a grep to the screen, but instead want to make a new file. You can do that with a redirect, >.

$ grep "RS2Z37" Example_transcripts.txt > output_file.txt

This produces a new file (notice how there is no output printed; making the new file is the output). Now let's say we want to filter on two things before we make our new file. We can do this by piping the output of one grep into another grep, before finally redirecting it away from the screen and into a new text file. If we only wanted RS2Z37 transcripts that are Up, we can do this:

$ grep "RS2Z37" Example_transcripts.txt | grep "Up" > output_file_2.txt
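One thing to be aware of is that > replaces whatever was already in the target file, while >> adds to the end of it. The sketch below shows both, again on a made-up stand-in file (demo_filter.txt) with a handful of rows from the example file; up_only.txt is likewise just an illustrative name.

```shell
# Stand-in file: header plus a few transcripts, tab-separated.
printf '%s\t%s\t%s\t%s\t%s\n' \
  Transcript_ID Gene_name Untreated_abundance Treated_abundance Change \
  T_0003 RS2Z37 50 150 Up \
  T_0004 RS2Z37 250 250 No_change \
  T_0009 RS2Z38 25 100 Up \
  T_0010 RS2Z38 100 100 No_change > demo_filter.txt

# Two filters chained with a pipe, then written to a new file.
# > creates up_only.txt, or overwrites it if it already exists.
grep "RS2Z37" demo_filter.txt | grep "Up" > up_only.txt

# >> appends to the end of the file instead of overwriting it.
grep "RS2Z38" demo_filter.txt | grep "Up" >> up_only.txt

# cat prints the whole file so you can check what was written.
cat up_only.txt
```

Here up_only.txt ends up with two lines, T_0003 and T_0009. Had the second command used > instead of >>, only T_0009 would be left.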

There are many other UNIX tools out there, with many uses. In my next post, I discuss the very powerful command line tool AWK. AWK is an incredible filterer; however, I did not find it an intuitive one to learn. But after getting some of the basics, I have found that by writing a simple AWK one-liner (a command all on one line of text), I can replace some of my Python programs. This saves me time coding and debugging the syntax mistakes I inevitably make, and the AWK one-liners are even faster than my Python code. So, once you have mastered the basics of bash, a new world of super-powerful and super-fast commands will be at your fingertips.
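As a small taste of what such a one-liner can look like, here is a sketch of the earlier double grep written in AWK instead. This is only one way to do it, and the stand-in file demo_awk.txt is made up for illustration. The difference from grep is that AWK can test specific columns rather than the whole line.

```shell
# Stand-in file: header plus the three RS2Z37 transcripts, tab-separated.
printf '%s\t%s\t%s\t%s\t%s\n' \
  Transcript_ID Gene_name Untreated_abundance Treated_abundance Change \
  T_0003 RS2Z37 50 150 Up \
  T_0004 RS2Z37 250 250 No_change \
  T_0005 RS2Z37 50 50 No_change > demo_awk.txt

# -F'\t' tells awk the columns are separated by tabs.
# $2 is the second column (gene name), $5 the fifth (change);
# lines where both tests are true are printed.
awk -F'\t' '$2 == "RS2Z37" && $5 == "Up"' demo_awk.txt
```

Only the T_0003 line is printed, since it is the only RS2Z37 transcript whose Change column is exactly Up.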

British geneticist interested in splicing, RNA decay, and synthetic biology. This is my blog focusing on my adventures in computational biology.

Sep 1 Compbio 002: The command line is your friend
