British geneticist interested in splicing, RNA decay, and synthetic biology. This is my blog focusing on my adventures in computational biology. 

Compbio 013: Practical Python for Biologists - Opening and saving files

Traditional resources for learning Python are great, but a biologist who wants to get into their own data and learn by doing may find that reading and writing files has not yet been covered. Here is a quick guide to opening (reading) and saving (writing) files in Python with a couple of simple functions. There are more flexible ways to do this in a script that is run on the command line, but the method here works both in a Python script and in a Jupyter notebook. The trade-off is that you need to "hard code" the location of the input and output files into your code each time you want to run it on a new file. 
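
For comparison, here is a minimal sketch of the command-line alternative mentioned above, using the standard argparse module so the paths do not have to be hard coded. The function name parse_paths and the argument names are my own placeholders, not part of the walkthrough in this post.

```python
import argparse


def parse_paths(argv=None):
    """Return (input_path, output_path) taken from the command line."""
    parser = argparse.ArgumentParser(
        description="Add a product column to a tab-delimited file."
    )
    parser.add_argument("infile", help="path to the tab-delimited input file")
    parser.add_argument("outfile", help="path of the new file to write")
    args = parser.parse_args(argv)  # argv=None means "use the real command line"
    return args.infile, args.outfile


# Simulates running: python script.py input013.txt output013.txt
in_path, out_path = parse_paths(["input013.txt", "output013.txt"])
print(in_path, out_path)
```

In a real script you would call parse_paths() with no argument and pass the two returned paths to open().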

Let's say we have an input file called "input013.txt" that we want to read into Python (download here). It is a tab-delimited text file with three columns and no header. It has the name of a model organism in the first column, an imaginary metric in the second (popularity, for example), and another imaginary metric in the third (perhaps how cool a model system it is on a scale of 0-1). 

$ head input013.txt
Arabidopsis     15      0.3
Physcomitrella  5       0.9
Drosophila      30      0.3

With a file like this, you might want to manipulate values within the file and write out a new tab-delimited text file with a new name. We can do this in a script (I like to use the Sublime Text editor) or we can do it in a Jupyter notebook (download notebook here). The core code will be the same. First we need to open/read the input file. We start by setting the Python variable "infile" to a string containing the path to the input file on your system: 

$ infile = "/home/jamesl/Bad_grammar_good_syntax/input013.txt"

To open the file, we use the open() function, setting the second parameter (the mode) to 'r' for read (rather than 'w' for write, which would save over the contents of the file). Sometimes files have a different newline symbol, like the one used by Windows, and this can cause problems. In Python 3, opening a file in 'r' mode already recognises these newlines and converts them to '\n' as the file is read. In Python 2 you had to ask for this with 'rU' instead of 'r'; the 'U' flag is deprecated in Python 3 and was removed in Python 3.11, so avoid it in new code. 

$ infile = open(infile, 'r') #In Python 3, 'r' already copes with non-UNIX newline symbols. 
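
To see how Python 3 treats Windows newlines, here is a small sketch. The demo file and its path are made up for illustration and written to a temporary folder; the file is written in binary mode only so that the "\r\n" endings are stored exactly as typed.

```python
import os
import tempfile

# Hypothetical demo file with Windows-style "\r\n" line endings.
demo_path = os.path.join(tempfile.mkdtemp(), "windows_newlines.txt")
handle = open(demo_path, "wb")  # binary mode: bytes are written unchanged
handle.write(b"Arabidopsis\t15\t0.3\r\nDrosophila\t30\t0.3\r\n")
handle.close()

# Plain 'r' (text mode) translates "\r\n" to "\n" while reading.
infile = open(demo_path, "r")
lines = infile.readlines()
infile.close()

print(lines)
# ['Arabidopsis\t15\t0.3\n', 'Drosophila\t30\t0.3\n']
```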

Then we need to get the contents of the file into Python in a form we can manipulate. Saving each line as an element of a list seems like a good idea, and the readlines() method is great for this: 

$ infile_lines = infile.readlines() #Makes list of all the lines
$ print(infile_lines)
['Arabidopsis\t15\t0.3\n', 'Physcomitrella\t5\t0.9\n', 'Drosophila\t30\t0.3\n']

Now we can work on the contents of this file. Let's say that we want to make a new outfile with a weighted score, the product of columns two and three, as a new fourth column. First, we need to open a new output file on our system, using 'w' instead of 'r' so we can save to the new file we are opening: 

$ outfile = "/home/jamesl/Bad_grammar_good_syntax/output013.txt"
$ outfile = open(outfile, 'w') 

Now we go through each line (list element) one at a time: we split the line into a list, do the calculation we need, and append the result to the end of that list. Finally, we write the extended line to the output file. 

$ for line in infile_lines:
$     line = line.strip() #Remove newline symbol
$     line = line.split("\t") #Line into list
$     line[1] = int(line[1]) #Value to an integer
$     line[2] = float(line[2]) #Value to float number
$     line.append(str(line[1]*line[2])) #Multiplies values & appends the result to the list as a string
$     line[1] = str(line[1]) #Integer to string
$     line[2] = str(line[2]) #Float number to string
$     outfile.write('\t'.join(line) + '\n') #Write the line to the outfile
$ infile.close() #Closes the file (must do)
$ outfile.close() #Closes the file (must do)
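
The loop above converts the numbers back to strings before joining them. As a quick sketch of why, here is what happens when a list that still contains numbers is passed to join(); the example list is made up to mirror the first line of the input file.

```python
line = ["Arabidopsis", 15, 0.3]  # second and third values are still numbers

try:
    joined = "\t".join(line)
except TypeError as error:
    joined = None
    print("join failed:", error)

# Converting every element to a string first makes the join work.
joined = "\t".join(str(value) for value in line)
print(repr(joined))
# 'Arabidopsis\t15\t0.3'
```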

An important thing to remember is that when you write the outfile, every value in the list needs to be a string. If you leave values as integers or floating point numbers, '\t'.join() cannot join them and the script will stop with a TypeError instead of writing the line. This code is purposefully verbose so anyone can follow along, but we can write it with fewer lines of code if you want: 

$ for line in infile_lines:
$     line = line.strip() 
$     line = line.split("\t") 
$     line.append(str(int(line[1])*float(line[2])))
$     outfile.write('\t'.join(line) + '\n')
$ infile.close() 
$ outfile.close() 
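
Another common way to write the same loop is with a "with" statement, which closes both files for you even if an error occurs part way through. This is a sketch rather than the method used in this post: the paths point to a temporary folder and the input file is recreated in the code so that the example runs anywhere.

```python
import os
import tempfile

# Hypothetical paths in a temporary folder so the sketch runs anywhere.
folder = tempfile.mkdtemp()
in_path = os.path.join(folder, "input013.txt")
out_path = os.path.join(folder, "output013.txt")

# Recreate the three-line example input.
with open(in_path, "w") as handle:
    handle.write("Arabidopsis\t15\t0.3\nPhyscomitrella\t5\t0.9\nDrosophila\t30\t0.3\n")

# Both files are closed automatically when the with block ends.
with open(in_path, "r") as infile, open(out_path, "w") as outfile:
    for line in infile:  # a file object yields one line at a time
        fields = line.strip().split("\t")
        fields.append(str(int(fields[1]) * float(fields[2])))
        outfile.write("\t".join(fields) + "\n")

# Read the result back to check it.
with open(out_path, "r") as handle:
    result = handle.read()
print(result)
```

Looping over the file object directly also avoids holding every line in memory at once, which helps with large files.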

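The write() method belongs to the file object we opened (outfile), so we call it as outfile.write(). It works the same way in Python 2 and Python 3; only the print-based alternative changed, from the Python 2 form print >>outfile, ... to print(..., file=outfile) in Python 3. Unlike print, write() does not add a newline, which is why we add '\n' ourselves. We use '\t'.join() to join the list (line) together with tabs so that the output file is tab-delimited, as the input file was. 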
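
As a small sketch of the two Python 3 ways to send a line to a file, the example below writes the same list twice, once with write() and once with print(). An in-memory io.StringIO object stands in for a real output file here, purely so the example needs no file on disk.

```python
import io

line = ["Arabidopsis", "15", "0.3", "4.5"]
outfile = io.StringIO()  # in-memory stand-in for an open output file

outfile.write("\t".join(line) + "\n")  # write() needs the newline added by hand
print(*line, sep="\t", file=outfile)   # print() adds the newline itself

written = outfile.getvalue()
print(repr(written))
# 'Arabidopsis\t15\t0.3\t4.5\nArabidopsis\t15\t0.3\t4.5\n'
```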

The output file has been made:

$ head output013.txt
Arabidopsis     15      0.3     4.5
Physcomitrella  5       0.9     4.5
Drosophila      30      0.3     9.0

With this approach you can read in all sorts of text files, from tab-delimited tables to FASTA files, and manipulate them in Python. 

To read more about working with files, I recommend Python for Biologists:
https://pythonforbiologists.com/working-with-files/
