British geneticist interested in splicing, RNA decay, and synthetic biology. This is my blog focusing on my adventures in computational biology. 

Compbio 002: The command line is your friend

For those of you who don't know what the command line is, it is simply a text based way of interacting with your computer. No mouse to click on icons and select from a drop-down menu with. Sounds awful, right? Well over the past three years, I have come to love the command line, but when I first encountered it, it was painful and I didn't quiet see the point. But with practice, it is actually quicker and easier to do many things in the command line than it would be if someone designed a graphical program instead. 

A dataset with hundreds of thousands of rows would be be a nightmare to work with in Excel, but in the command line, not much of a problem at all. Most biological datasets are just some form of a plain text file. Whether it is a fasta file (.fasta/.fa) for storing sequences in, a fastq file for storing reads from a sequence run in (.fastq/.fq), or a genome annotation file (.GTF/.GFF), all plain text files. The UNIX command line has many powerful tools for dealing with text files. So it is perfect for a computational biologist. Many of the tasks you might want to learn programming for can actually be implemented with simple tools from the command line, without a single bit of programming. 

There are plenty of places online to get used to working with the command line interface in a UNIX(-like) environment online. I would suggest these as good places to start. This post isn't a tutorial per se, but is a taster to try to get you in the right mindset before starting down the rabbit hole. 

Codecademy (Interactive)
https://www.codecademy.com/en/courses/learn-the-command-line/lessons/environment/exercises/bash-profile

The Command Line Crash Course (Non-interactive)
http://cli.learncodethehardway.org/book/

The first thing that you should realise is that a table in Excel is basically just a giant text file with extra annotations. To make a table, you don't need Excel, all you need are values, with special characters (usually commas or tabs) to separate the columns, and a new line symbol to separate the rows. Here we will imagine you are working with a tab-delimited text file (.tsv/.tab/.txt) with transcript expression information. 

To access the command line, you need a program to open a terminal in the bash language. These are native to Macs and Linux systems. You can emulate them on Windows with programs like Cygwin, and now Windows has its own native Linux terminal with Ubuntu on Windows (https://www.microsoft.com/store/apps/9nblggh4msv6). The text file used in the below examples is available here. But only attempt to follow along at home after taking some lessons from the above resources. 

It is usually a good idea to have a look at the first few lines of your file to get a feel for it, you can do this with the command head

$ head Example_transcripts.txt
Transcript_ID   Gene_name       Untreated_abundance     Treated_abundance       Change
T_0001  SMG1    200     2       Down
T_0002  SMG1    50      0.5     Down
T_0003  RS2Z37  50      150     Up
T_0004  RS2Z37  250     250     No_change
T_0005  RS2Z37  50      50      No_change
T_0006  TOPLESS 25      25      No_change
T_0007  TOPLESS 100     100     No_change
T_0008  EF1alpha        500     500     No_change
T_0009  RS2Z38  25      100     Up

You can see that the first 10 lines of this file will be printed. It might not be neatly formatted and spaced like in Excel, but that is something to get used to. To see the last 10 lines use tail:

$ tail Example_transcripts.txt
T_0006  TOPLESS 25      25      No_change
T_0007  TOPLESS 100     100     No_change
T_0008  EF1alpha        500     500     No_change
T_0009  RS2Z38  25      100     Up
T_0010  RS2Z38  100     100     No_change
T_0011  ANR     300     300     No_change
T_0012  PEX5    100     100     No_change
T_0013  eIF5L1  100     250     Up
T_0014  SMG7-2  50      200     Up
T_0015  LUG     125     125     No_change

Notice how there is an overlap of the lines shown by both head and tail. That's because the file is very small, with only 16 lines (including the header). We can count the exact number of lines with the simple but useful command: wc -l

$ wc -l Example_transcripts.txt
16 Example_transcripts.txt

Now this is where the real fun begins, now we can start to filter this file. If we only wanted to view lines (transcripts) of a particular gene? Well we can use grep. grep is a very powerful UNIX tool. It is a string (text character) matching tool:

$ grep "RS2Z37" Example_transcripts.txt
T_0003  RS2Z37  50      150     Up
T_0004  RS2Z37  250     250     No_change
T_0005  RS2Z37  50      50      No_change

Now maybe instead of printing the filtered lines to the screen, we wanted count the number of transcripts annotated as our gene of interest. On the command line, we can link our programs together, the output of one (grep) can become the input of another (wc -l) by piping. To pipe the output of one command into another we use the | symbol. 

$ grep "RS2Z37" Example_transcripts.txt | wc -l
3

It is easy for us to count to three, but when working with a much larger file with more transcripts and genes, this comes in handy. 

You can also do a grep to find lines that do NOT include your string with the -v option. 

$ grep -v "RS2Z37" Example_transcripts.txt
Transcript_ID   Gene_name       Untreated_abundance     Treated_abundance       Change
T_0001  SMG1    200     2       Down
T_0002  SMG1    50      0.5     Down
T_0006  TOPLESS 25      25      No_change
T_0007  TOPLESS 100     100     No_change
T_0008  EF1alpha        500     500     No_change
T_0009  RS2Z38  25      100     Up
T_0010  RS2Z38  100     100     No_change
T_0011  ANR     300     300     No_change
T_0012  PEX5    100     100     No_change
T_0013  eIF5L1  100     250     Up
T_0014  SMG7-2  50      200     Up
T_0015  LUG     125     125     No_change

Now imagine that you do not want to print the results of a grep to the screen, but instead you wanted to make a new file. You can do that with a redirect >. 

$ grep "RS2Z37" Example_transcripts.txt > output_file.txt

This here produces a new file (notice how there is no output printed, making the new file is the output). Now let's say we want to filter two things before we make our new file. We can do this with piping the output of one grep into another grep, before finally redirecting it away from printing it to the screen and into a new text file. If we only wanted SRSF6 transcripts that are Up, we can do this: 

$ grep "RS2Z37" Example_transcripts.txt | grep "Up" > output_file_2.txt 

There are many other UNIX tools out there and with many uses. In my next post, I discuss the very powerful command line tool AWK. AWK is an incredible filterer, however, I did not find it an intuitive one to learn. But after getting some of the basics, I have found that by writing a simple AWK one-liner (a command all on one line of text), I can replace some of my Python programs. This saves me time coding, debugging the syntax mistake I inevitably make, and the AWK one-liners are even faster than my Python code. So, once you have mastered the basics of bash, a new world of super-powerful and super-fast commands will be at your fingertips. 

Compbio 003: A biologist's guide to AWK

Compbio 001: A what is programming guide for a biologist