
Compbio 019: Getting the most out of your data on the command line

The command line (terminal) is not just a place to run bioinformatics software. The UNIX command line has had years of development, much of it focused on processing plain text files. As most file formats in bioinformatics are actually plain text (FASTA, FASTQ, GTF etc; link here), we can take advantage of these fast tools to manipulate them. Simple tools like head, tail, wc and grep have been covered here previously, and AWK is a powerful tool for filtering and manipulating text files (link here). This post will go further with some of these tools and introduce others to help you work efficiently with your data. 


Working with .gz files

Often, you will be working with compressed files in the form of gzipped (.gz) files; instead of All_your_reads.fastq you might have All_your_reads.fastq.gz. This saves lots of space but cannot be read directly by tools like head or wc. You could unzip these files and then read the non-compressed versions, but that takes up massive amounts of storage space and means that you have a second file to track. 

Instead, you can simply read these files with zcat each time you want to work with them. By piping the output of zcat into the tool you want to run, you can read the file directly; it will be a little slower, but it saves on storage space. For example, to peek at the first record:

$ zcat All_your_reads.fastq.gz | head -4
@SRR99999999.1 1/1
GTTGGAACTAACAAAATGTACTTGAAATAGTAATATTGTTTACATTAAAC
+
BCCA@GDGGGGGGGGGGGGGGGGGGGGGGGGGDGDGEFDEGGGGGGGGGG
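
The same trick works for counting. In the standard format each FASTQ record is exactly four lines, so you can count reads in the compressed file without ever unzipping it; this prints the total line count, which you then divide by four (the file name is just a placeholder):

$ zcat All_your_reads.fastq.gz | wc -l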


Become a cut master

Often, you might have a VERY wide table (an abundance of columns). At any particular moment, you are probably only interested in a subset of these columns. So, you need to become a cut master. With the cut command, you can select just the columns you want, like the feature start positions from this GTF: 

$ head -5 Ppatens_318_v3.3.gene_exons.gtf | cut -f 4
8931
10057
10338
10624
10892

This command will only show the 4th column. As in AWK and R, but unlike Python, the first column is numbered 1; this makes things very simple for me, but I am sure others feel differently. 

But what if you want to view more than one column at a time? Again very simple: use a comma-separated list (e.g. cut -f 1,4,5,7):

$ head -5 Ppatens_318_v3.3.gene_exons.gtf | cut -f 1,4,5,7
Chr01   8931    9859    +
Chr01   10057   10173   +
Chr01   10338   10492   +
Chr01   10624   10698   +
Chr01   10892   11182   +

Now we have the chromosome, feature start, feature stop and strand, without any of the other information we do not need at the time. We can do this in AWK, as I covered in my AWK introduction (link here), but the syntax is not as simple as with cut -f, so I find it much easier just to use cut. 
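
One thing to keep in mind: cut expects tab-delimited input by default, which suits GTF files nicely. For a comma-separated table you need to set the delimiter with -d. A quick sketch with a hypothetical CSV of counts (the file name and column choice are just placeholders):

$ cut -d',' -f 1,3 my_counts.csv | head -5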


Sorting out your life

From a GTF, we can easily extract the chromosome, start, stop and strand with cut -f, but that file would be very large and would have many redundant rows containing the same information. With the sort -u command, we can remove those repeated lines and be left with a smaller and more manageable file. 

$ cut -f 1,4,5,7 Ppatens_318_v3.3.gene_exons.gtf | sort -u > new_file.txt 

The sort -u command is very useful and worth committing to memory. 
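
sort has more to offer than -u. The -k option picks which column to sort on and n makes the sort numeric, so you can order the unique features by their start coordinate; a sketch using the file we just made (column 2 holds the start position):

$ sort -k2,2n new_file.txt | head -5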

 

Removing a header row

Headers can be really useful, but also really frustrating to remove when working on the command line. There are numerous ways to do this; the best way I have seen came to my attention through Pall Melsted on Twitter (@pmelsted). 

This method is so useful and simple: tail -n+2 prints the file from line 2 onwards, skipping the header row:

$ tail -n+2 014_UPF1_outfile.txt | wc -l
5

Now we get an accurate count of our data rows rather than including the header row. When filtering a file, the header might be lost by the nature of the filtering, so to compare against the true total of data rows you need to account for the header, and this is a great way to do it. 
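
A related idiom, for when you want to sort the data rows but keep the header in place: print the header first, then sort everything from line 2 onwards. A sketch, writing to a hypothetical output file:

$ (head -n 1 014_UPF1_outfile.txt; tail -n +2 014_UPF1_outfile.txt | sort) > 014_UPF1_sorted.txt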


grep your way to success

I covered the basics of grep before (link here). grep is a great pattern-matching program: simply enter a pattern (such as a gene name) and every line containing that pattern is extracted:

$ grep "Pp3c1_20" Ppatens_318_v3.3.gene_exons.gtf | wc -l
1984

In this case, we get MANY matches (1984). This is because the pattern Pp3c1_20 also matches any gene name that contains it as a substring, like Pp3c1_20580 and Pp3c1_20430. How do you avoid this problem? With the -w option: 

$ grep -w "Pp3c1_20" Ppatens_318_v3.3.gene_exons.gtf | wc -l
28

Now we only have lines in which the pattern matches a whole word. For grep, a word is made up of letters, digits and underscores, so the quotation marks around the gene name within the GTF file act as word boundaries, which is really lucky for us. 
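
As an aside, if a count of matching lines is all you are after, grep can do the wc -l step itself with the -c option; this should give the same number as the pipeline above:

$ grep -cw "Pp3c1_20" Ppatens_318_v3.3.gene_exons.gtf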

Sometimes, you might want to see the lines around the row in which your pattern is found. This is simple with grep. You can use the -A option to set how many rows after the matching row you want to see (-A, after, get it???), and the -B option to set how many rows before the matching row you want to see (-B, before, get it???). Let's put these to the test. If we wanted to see the sequence of a particular transcript from a FASTA file, we could search for the transcript ID in the FASTA header. Notice this is a gzipped file; we could call zcat on it first, as above, but there is a version of grep designed for this very case: zgrep (thanks to @piper_jason for the knowhow). So we can run zgrep instead of both zcat and grep in sequence:

$ zgrep "Pp3s350_10V3.2" Ppatens_318_v3.3.transcript.fa.gz
>Pp3s350_10V3.2 pacid=32899361 locus=Pp3s350_10 annot-version=v3.3

However, as you can see, all we get back is the header. But with -A and a little trial and error, we can get the sequence too. Let's start off with zgrep -A 10:

$ zgrep -A 10 "Pp3s350_10V3.2" Ppatens_318_v3.3.transcript.fa.gz
>Pp3s350_10V3.2 pacid=32899361 locus=Pp3s350_10 annot-version=v3.3
GCACCATGGCCAAGCTTTTGTTCCTCGTTGCTGTTCTCGTGATAGCTGCAGCCGTGGCGGCTGATGCTCACGTAGCACCT
GCTCTCGCTCCTGCCGCAGCGAAGGCTCCTGCGCCGACTCCGGTTCCCGCACCTGCTCGTGCCCCCTTCGTGAAGAAGGA
TTCGAGCTTCTCGCTGAAGTGCGTCATTCACAACGTGGCGACGAGGGACGTAGCGGTGAAGCTGGTCGTTCGGGGCAAGG
ACGCGGCGAAGCGTGTGGTGGCAAAGTCTGGGCAGTGGACCGCCATCGGGCCCGTGAAGCTGGGGTGCTCCGTTCCGGTG
GTGCACGTGTGGGTGACAGTGAAGAACAACAGGGGCTGGGCTGTGACGAAGAGCTTCCCGATCAGGTTGGCCGAGTTTCT
GAAGAACGTCGTGCCCAAGTGCACGAAGGTGCTGGCGCTGACGGCTTTTGAGAAGGCGAAGGCGAGGGGGAGCGAGAGGT
GTCTGGTGGTGAAGGTCGGCAAGAAGGAGGTCCTGACCGTGAAGTACTAGATTCCAGATCCCAGTCAAGGATTGTTGACT
TGTTGGAGTCTCTGCAATCGGGCTGTGAGAATGGGGACACGCTCTCCACTGTAGGAGTTGATCTGATCGGGTCGGGTGGT
GCCAGCACCGGAAGTCTCGTTCAAGAGACATTGTGTGATGCACATAGGAGCGGGAACATGGATTGTAGGAGATATTAGTT
ATAGGTGCGTCGGAAGCGACGCAGTTGGGGATGTCGAGGACGTCGGCAGCTTGATTGTGTTTGTGGGAGGACCACTTCGT

Not quite enough yet, so let's try zgrep -A 12:

$ zgrep -A 12 "Pp3s350_10V3.2" Ppatens_318_v3.3.transcript.fa.gz
>Pp3s350_10V3.2 pacid=32899361 locus=Pp3s350_10 annot-version=v3.3
GCACCATGGCCAAGCTTTTGTTCCTCGTTGCTGTTCTCGTGATAGCTGCAGCCGTGGCGGCTGATGCTCACGTAGCACCT
GCTCTCGCTCCTGCCGCAGCGAAGGCTCCTGCGCCGACTCCGGTTCCCGCACCTGCTCGTGCCCCCTTCGTGAAGAAGGA
TTCGAGCTTCTCGCTGAAGTGCGTCATTCACAACGTGGCGACGAGGGACGTAGCGGTGAAGCTGGTCGTTCGGGGCAAGG
ACGCGGCGAAGCGTGTGGTGGCAAAGTCTGGGCAGTGGACCGCCATCGGGCCCGTGAAGCTGGGGTGCTCCGTTCCGGTG
GTGCACGTGTGGGTGACAGTGAAGAACAACAGGGGCTGGGCTGTGACGAAGAGCTTCCCGATCAGGTTGGCCGAGTTTCT
GAAGAACGTCGTGCCCAAGTGCACGAAGGTGCTGGCGCTGACGGCTTTTGAGAAGGCGAAGGCGAGGGGGAGCGAGAGGT
GTCTGGTGGTGAAGGTCGGCAAGAAGGAGGTCCTGACCGTGAAGTACTAGATTCCAGATCCCAGTCAAGGATTGTTGACT
TGTTGGAGTCTCTGCAATCGGGCTGTGAGAATGGGGACACGCTCTCCACTGTAGGAGTTGATCTGATCGGGTCGGGTGGT
GCCAGCACCGGAAGTCTCGTTCAAGAGACATTGTGTGATGCACATAGGAGCGGGAACATGGATTGTAGGAGATATTAGTT
ATAGGTGCGTCGGAAGCGACGCAGTTGGGGATGTCGAGGACGTCGGCAGCTTGATTGTGTTTGTGGGAGGACCACTTCGT
TATTCATCATAATGTGCATCGGTCGAGATCAGCGTCCATTGTTGCGACGCTGAAGGCCGCTACCTGCAAACTTGTTGTGC
TTGGGTGGTCAAACGTGTGGTGTGGATGGCTGTGGATTGATGTGATTCTG

Perfect, now we have every line of sequence for this transcript. To see rows above the matching line, it is a simple change (zgrep -B 2): 

$ zgrep -B 2 "Pp3s350_10V3.2" Ppatens_318_v3.3.transcript.fa.gz
TTCGTTATTCATCATAATGTGCATCGGTCGAGATCAGCGTCCATTGTTGCGACGCTGAAGGCCGCTACCTGCAAACTTGT
TGTGCTTGGGTGGTCAAAC
>Pp3s350_10V3.2 pacid=32899361 locus=Pp3s350_10 annot-version=v3.3
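
There is also -C (context), which shows rows on both sides of the match in one go; for example, two lines either side of the header:

$ zgrep -C 2 "Pp3s350_10V3.2" Ppatens_318_v3.3.transcript.fa.gz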

Magical. But grep has even more to offer. One of the most common things I need to do is to find rows that match multiple patterns. Thankfully, grep can be made to perform AND and OR searches. We can start off with a simple grep search for this transcript (PAC:32968375): 

$ grep "PAC:32968375" Ppatens_318_v3.3.gene_exons.gtf | wc -l
14

To simulate an AND search with grep, you can pipe the output of one search into a second:  

$ grep "PAC:32968375" Ppatens_318_v3.3.gene_exons.gtf | grep "exon" | wc -l
7

To perform an OR search, you can add the -E option to grep and then enter the two patterns, separated by a |:

$ grep -E "PAC:32968375|PAC:32968376" Ppatens_318_v3.3.gene_exons.gtf | wc -l
28
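
If your OR list gets long (say, a whole set of gene IDs), the -f option lets grep read its patterns from a file, one per line, rather than from the command line. A sketch, assuming a hypothetical file my_genes.txt with one gene ID per line:

$ grep -w -f my_genes.txt Ppatens_318_v3.3.gene_exons.gtf | wc -l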

Finally, we can perform a NOT search with the -v option. A NOT search will reveal any rows where the pattern is not found. This can be really useful. 

$ grep -v "exon" Ppatens_318_v3.3.gene_exons.gtf | wc -l
528900
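
NOT searches combine happily with the other options; for example, -v together with -E drops rows matching either of two patterns (the count will of course depend on your annotation):

$ grep -vE "exon|CDS" Ppatens_318_v3.3.gene_exons.gtf | wc -l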


End and further reading

I have barely scratched the surface of what tools on the command line can do for you in computational biology. Here is a list of further reading; perhaps you will find something that will change your entire workflow! 

Fantastic list of bash one-liners by Stephen Turner (@strnr):
https://github.com/stephenturner/oneliners

For a look at the powerful filtering tool AWK:
https://www.badgrammargoodsyntax.com/compbio/2017/9/4/compbio-003-a-biologists-guide-to-awk

Some really cool uses here:
https://astrobiomike.github.io/bash/six_commands
