British geneticist interested in splicing, RNA decay, and synthetic biology. This is my blog focusing on my adventures in computational biology. 

Compbio 003: A biologist's guide to AWK

There are many times when analysing large datasets when one might want to remove many rows from the table because the contents do not match what you want. Or you want to count all of the line with specific values, like all the occurrences of a specific gene in the table. When I first started doing computational biology, I wrote lots of short Python scripts to handle this. But this lead to have dozens of single-use Python scripts cluttering my computer, adding up to to a reasonable amount of time writing each one. Mistakes always find their way in. Whether it was forgetting to turn "22" from a string (text value) to an integer (number value) or forgetting a colon, there was always some debugging to do. The more code one writes, the more opportunity for error. 

Then I discovered AWK. To say it changed my life might be overstatement, but not by a lot. With AWK, you can replace a lot of simple filtering scripts with a one-liner in the command line. Less code, less mistakes. Not only that, but it is FAST. Learning the syntax for AWK is the biggest challenge. 

The basic syntax of AWK is:

awk <command> infile.txt > outfile.txt

This will make an output file, removing any of the lines filtered out by the <command>. The syntax for the <command> is not the most straightforward but the rest of the post will take you through it with lots of examples. One note before we get into how to filter with AWK: AWK will separate columns using any whitespace (spaces or tabs). You can force it to pick a separator with -F. AWK -F "\t" forces AWK to select tabs as the separator of columns, to avoid confusion with other whitespace (like spaces). 

AWK works on a 1-index, so the first column is number 1 rather than 0 (like in Python). So, if you wanted to just print out the values of the first column, you can do that as so (we will work with the same file from the last post, which you can download from here):

$ awk -F "\t" '{print $1}' Example_transcripts.txt
Transcript_ID
T_0001
T_0002
T_0003
T_0004
T_0005
T_0006
T_0007
T_0008
T_0009
T_0010
T_0011
T_0012
T_0013
T_0014
T_0015

The $ symbol, within the command, indicates the column to be selected (it treats it as a shell variable). Notice the single quotations ('') and the curly parentheses ({}) around print and the column number. These are needed for the command to work. Let's say we wanted to print all columns except the 3rd from this file which has 5 total:

$ awk -F "\t" '{print $1,$2,$4,$5}' Example_transcripts.txt
Transcript_ID Gene_name Treated_abundance Change
T_0001 SMG1 2 Down
T_0002 SMG1 0.5 Down
T_0003 RS2Z37 150 Up
T_0004 RS2Z37 250 No_change
T_0005 RS2Z37 50 No_change
T_0006 TOPLESS 25 No_change
T_0007 TOPLESS 100 No_change
T_0008 EF1alpha 500 No_change
T_0009 RS2Z38 100 Up
T_0010 RS2Z38 100 No_change
T_0011 ANR 300 No_change
T_0012 PEX5 100 No_change
T_0013 eIF5L1 250 Up
T_0014 SMG7-2 200 Up
T_0015 LUG 125 No_change

This is rather useful if you quickly need to delete a column from your table; easier than opening it up in Excel and doing it there. 

But the real power of AWK is in its ability to filter files for specific values in specified columns. A GTF file typically contains many different genomic features (an example of a zipped GTF file can be downloaded here). Here is a GTF-like file to play with (download this here): 

$ head -20 exon_example.txt
1       gene    1000    2000    "gene_id ""GOI1""; exon_number ""3"";"
1       transcript      1000    2000    "gene_id ""GOI1""; transcript_id ""GOI1.1""; exon_number ""3"";"
1       transcript      1000    2000    "gene_id ""GOI1""; transcript_id ""GOI1.2""; exon_number ""2"";"
1       exon    1000    1300    "gene_id ""GOI1""; transcript_id ""GOI1.1""; exon_number ""1"";"
1       exon    1400    1500    "gene_id ""GOI1""; transcript_id ""GOI1.1""; exon_number ""2"";"
1       exon    1600    2000    "gene_id ""GOI1""; transcript_id ""GOI1.1""; exon_number ""3"";"
1       exon    1000    1300    "gene_id ""GOI1""; transcript_id ""GOI1.2""; exon_number ""1"";"
1       exon    1600    2000    "gene_id ""GOI1""; transcript_id ""GOI1.2""; exon_number ""2"";"
1       gene    5000    7000    "gene_id ""GOI2""; exon_number ""3"";"
1       transcript      5000    7000    "gene_id ""GOI2""; transcript_id ""GOI2.1""; exon_number ""3"";"
1       transcript      5000    6500    "gene_id ""GOI2""; transcript_id ""GOI2.2""; exon_number ""2"";"
1       exon    5000    5500    "gene_id ""GOI2""; transcript_id ""GOI2.1""; exon_number ""1"";"
1       exon    5600    5900    "gene_id ""GOI2""; transcript_id ""GOI2.1""; exon_number ""2"";"
1       exon    6000    7000    "gene_id ""GOI2""; transcript_id ""GOI2.1""; exon_number ""3"";"
1       exon    5000    5500    "gene_id ""GOI2""; transcript_id ""GOI2.2""; exon_number ""1"";"
1       exon    6000    6500    "gene_id ""GOI2""; transcript_id ""GOI2.2""; exon_number ""2"";"

But if you were only interested in the exon features, we can selected for the presence of "exon" in column 2 of the GTF-like file: 

$ awk -F "\t" '$2 == "exon"' exon_example.txt
1       exon    1000    1300    "gene_id ""GOI1""; transcript_id ""GOI1.1""; exon_number ""1"";"
1       exon    1400    1500    "gene_id ""GOI1""; transcript_id ""GOI1.1""; exon_number ""2"";"
1       exon    1600    2000    "gene_id ""GOI1""; transcript_id ""GOI1.1""; exon_number ""3"";"
1       exon    1000    1300    "gene_id ""GOI1""; transcript_id ""GOI1.2""; exon_number ""1"";"
1       exon    1600    2000    "gene_id ""GOI1""; transcript_id ""GOI1.2""; exon_number ""2"";"
1       exon    5000    5500    "gene_id ""GOI2""; transcript_id ""GOI2.1""; exon_number ""1"";"
1       exon    5600    5900    "gene_id ""GOI2""; transcript_id ""GOI2.1""; exon_number ""2"";"
1       exon    6000    7000    "gene_id ""GOI2""; transcript_id ""GOI2.1""; exon_number ""3"";"
1       exon    5000    5500    "gene_id ""GOI2""; transcript_id ""GOI2.2""; exon_number ""1"";"
1       exon    6000    6500    "gene_id ""GOI2""; transcript_id ""GOI2.2""; exon_number ""2"";"

Or you can select all lines that do not contain exon in them and output it into a new file with a redirect (>): 

$ awk -F "\t" '$2 != "exon"' exon_example.txt > outfile.txt

Simply trying to do this with grep (as covered here) isn't an option as each line has "exon" present in the final column of most GTF files. Therefore, limiting the search for a match to the contents of a column is essential. 

But here is my favourite part: AWK can also filter numerical values. No need to convert the number values from strings to integers/floats like you do in Python; AWK does that for you! If you only want values greater or equal to 100 in column 4, you can just do as below (notice that there are no quotation marks ("") around the value to be filtered as with text values): 

$ awk -F "\t" '$4 >= 100' Example_transcripts.txt
Transcript_ID   Gene_name       Untreated_abundance     Treated_abundance       Change
T_0003  RS2Z37  50      150     Up
T_0004  RS2Z37  250     250     No_change
T_0007  TOPLESS 100     100     No_change
T_0008  EF1alpha        500     500     No_change
T_0009  RS2Z38  25      100     Up
T_0010  RS2Z38  100     100     No_change
T_0011  ANR     300     300     No_change
T_0012  PEX5    100     100     No_change
T_0013  eIF5L1  100     250     Up
T_0014  SMG7-2  50      200     Up
T_0015  LUG     125     125     No_change

But if you want to filter on a couple of columns at once, this can be done. If you needed both columns 3 and 4 to be over 100, you can use an AND statement (&&):  

$ awk -F "\t" '$3 > 100 && $4 > 100' Example_transcripts.txt
Transcript_ID   Gene_name       Untreated_abundance     Treated_abundance       Change
T_0004  RS2Z37  250     250     No_change
T_0008  EF1alpha        500     500     No_change
T_0011  ANR     300     300     No_change
T_0015  LUG     125     125     No_change

But perhaps you only need one of these values to be true, either in column 3 or 4, to be over 100. In which case you can use the OR statement (||):

$ awk -F "\t" '$3 > 100 || $4 > 100' Example_transcripts.txt
Transcript_ID   Gene_name       Untreated_abundance     Treated_abundance       Change
T_0001  SMG1    200     2       Down
T_0003  RS2Z37  50      150     Up
T_0004  RS2Z37  250     250     No_change
T_0008  EF1alpha        500     500     No_change
T_0011  ANR     300     300     No_change
T_0013  eIF5L1  100     250     Up
T_0014  SMG7-2  50      200     Up
T_0015  LUG     125     125     No_change

For many files, you will want to keep the header. While not intuitive, you can do this with AWK by adding NR==1 with the OR operator || following it. Then enter the other filtering commands that you want to perform after the OR:

$  awk -F "\t" 'NR==1 || $3 > 100 || $4 > 100' Example_transcripts.txt
Transcript_ID   Gene_name       Untreated_abundance     Treated_abundance       Change
T_0001  SMG1    200     2       Down
T_0003  RS2Z37  50      150     Up
T_0004  RS2Z37  250     250     No_change
T_0008  EF1alpha        500     500     No_change
T_0011  ANR     300     300     No_change
T_0013  eIF5L1  100     250     Up
T_0014  SMG7-2  50      200     Up
T_0015  LUG     125     125     No_change

Now we can unleash the power of AWK. With this set of commands modified to your particular problem, you can replace short filtering scripts or opening up Excel to delete columns. 

Sometimes it might be easier to stack multiple commands (including AWK and grep) to filter for different requirements together. You might want to count the number of lines that meet your filtering criteria rather than print it (or save it in a new file (you can do that with wc -l). Below is an example of stacking multiple commands together by piping (|) the output of one into the next, to get a count of transcripts that meet your criteria: 

$ grep "Up" Example_transcripts.txt | awk -F "\t" '$3 > 100 || $4 > 100' | wc -l
3

Writing short AWK one-liners will save you a lot of time - once you get used to the syntax. Finally, here are some links to resources I used to learn AWK that you will want to check out: 

Bioinformatics one-liners
https://github.com/stephenturner/oneliners#readme

AWK GTF! How to Analyze a Transcriptome Like a Pro
http://reasoniamhere.com/2013/09/16/awk-gtf-how-to-analyze-a-transcriptome-like-a-pro-part-1/

Awk - A useful little language
https://dev.to/rrampage/awk---a-useful-little-language-2fhf

Compbio 004: Practical Python for biologists - Use dictionaries

Compbio 002: The command line is your friend