I really love AWK (see here), but one thing I haven't been able to figure out is how to filter or combine the contents of two files with AWK. So if the key information on a set of genes was split between two files, we can instead do it with Python.
So we can run through a couple of examples with two (input) files. The first (download here), has a list of genes with expression values in two treatments:
$ head Input1_expression.txt
Gene_ID Treatment_1 Treatment_2
G_001 10.10 11.41
G_002 5.10 4.95
G_003 55.00 54.99
G_004 30.78 65.86
The second file has the same genes, but has gene lengths and gene types (download here):
$ head Input2_GeneInfo.txt
Gene_ID Length Gene_type
G_001 1100 protein_coding
G_002 1903 protein_coding
G_003 2010 protein_coding
G_004 700 transposable_element
So, the first thing that we need to do is read in both of the inputs into memory (more detailed guide here). The first input file:
$ infile1 = "/home/jamesl/Bad_grammar_good_syntax/Two_inputs/Input1_expression.txt" #Set the path of 1st input file
$ infile1 = open(infile1, 'rU') #Opens file for reading
$ header1 = infile1.readline() #Removes the header
$ exp_lines = infile1.readlines() #Makes list of lines
The second input file:
$ infile2 = "/home/jamesl/Bad_grammar_good_syntax/Two_inputs/Input2_GeneInfo.txt" #Set the path of 2nd input file
$ infile2 = open(infile2, 'rU') #Open the file for reading
$ header2 = infile2.readline() #Removes the header
$ info_lines = infile2.readlines() #Makes list of lines
In the first example, all we want to do is filter out any gene that isn't protein-coding. We'll create the output file to store what we write:
$ outfile = "/home/jamesl/Bad_grammar_good_syntax/Two_inputs/Output1_Filtering.txt" #Setting path to the output file
$ outfile = open(outfile, 'w') #Creating output file
$ outfile.write(header1) #Making the header
Now we can run through the second file (the one containing gene types) and make a list of all protein-coding genes:
$ protein_coding = []
$ for info_line in info_lines:
$ info_line = info_line.strip() #Remove newline
$ info_line = info_line.split("\t") #Line into list
$ if info_line[2] == "protein_coding": #Checks last column is protein_coding
$ protein_coding.append(info_line[0]) #Take 1st line element & add to list
$ else:
$ continue #Move to the next line
$ infile2.close() #Closes the file
We have a list with all of the IDs of protein-coding genes. Now we can run through the other file (with the expression values), ask if the first column (containing the gene IDs) is present in this list and, if true, then we will print this line. If not, we skip it:
$ for exp_line in exp_lines:
$ exp_line = exp_line.strip() #Remove newline symbol
$ exp_line = exp_line.split("\t") #Line into list
$ if exp_line[0] in protein_coding: #Gene in list?
$ outfile.write('\t'.join(exp_line) + '\n') #Write to the outfile
$ else:
$ continue #Move to the next line
$ infile1.close() #Closes the file
$ outfile.close() #Closes the file
After ensuring that we closed all of the files we were working with, we can look at the new file with non-protein-coding genes removed:
$ head Output1_Filtering.txt
Gene_ID Treatment_1 Treatment_2
G_001 10.10 11.41
G_002 5.10 4.95
G_003 55.00 54.99
That's great. But perhaps we realise that the response of transposable elements might also be of interest - now we want to keep this gene in the analysis. But this time we want to include the gene lengths. We will append the gene length to the end of each line in the first file (expression). First, we open the input files again (as above), then we need to make a new output file:
$ outfile = "/home/jamesl/Bad_grammar_good_syntax/Two_inputs/Output2_Lengths.txt" #Setting path to the output file
$ outfile = open(outfile, 'w') #Creating output file
$ header1 = header1.strip() #Remove newline
$ outfile.write(header1 + "\t" + "Length" + "\n") #Making the header
While going through the second file (the one containing gene types), we can store the gene lengths in a dictionary (more details on dictionaries here). The gene ID will act as the key, and lengths as the value:
$ lengths_dict = {} #Making the dictonary
$ for info_line in info_lines:
$ info_line = info_line.strip() #Remove newline
$ info_line = info_line.split("\t") #Line into list
$ lengths_dict[info_line[0]] = info_line[1]
$ infile2.close() #Closes the file
$ print(lengths_dict)
{'G_003': '2010', 'G_002': '1903', 'G_001': '1100', 'G_004': '700'}
Now we have out dictionary with gene ID-length, key-value pairs. We can go through the expression file and match gene IDs in that file to gene lengths stored in the dictionary:
$ for exp_line in exp_lines:
$ exp_line = exp_line.strip() #Remove newline
$ exp_line = exp_line.split("\t") #Line into list
$ if exp_line[0] in lengths_dict: #Gene in dict?
$ exp_line.append(lengths_dict[exp_line[0]])
$ outfile.write('\t'.join(exp_line) + '\n') #Write to the outfile
$ else:
$ continue #Move to the next line
$ infile1.close() #Closes the file
$ outfile.close() #Closes the file
$ head Output2_Lengths.txt
Gene_ID Treatment_1 Treatment_2 Length
G_001 10.10 11.41 1100
G_002 5.10 4.95 1903
G_003 55.00 54.99 2010
G_004 30.78 65.86 700
We did it, we have been able to filter a file based on the contents of another file, and appended information from one file to the contents to another, based on a common key (gene ID). Now you are ready to take on most challenges in Python for computational biology. For a copy of the Jupyter notebook used for this post, download here.
If anyone can tell me how to do this simply and easily in AWK, I will be eternally grateful to them.