British geneticist interested in splicing, RNA decay, and synthetic biology. This is my blog focusing on my adventures in computational biology. 

Compbio 009: Practical Python for biologists - What are modules?

Hopefully you have some familiarity with Python at this point, but you may not yet have come across modules (often loosely called packages; strictly speaking, a package is a collection of modules). When you write something like "import re" or "import numpy" in Python, you are importing a module. Don't worry if you have not had to do this yet; this post will explain what you need to know.

Python comes with built-in functions such as print(), len(), and str(). You can write your own functions in Python using the def keyword. A module is a Python file that you can load from within Python (with import); it contains functions and variables that expand the functionality of Python. Some modules are included when you first download Python; others you will have to download and install yourself. There are a number of ways to add a new Python package to your system, and the right one will depend on your setup. I will not cover that in detail here; instead I will introduce what a module is and how to import one.
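To make the difference concrete, here is a minimal sketch of a home-made function sitting next to a function borrowed from a module. The function name gc_content and the example sequence are made up for illustration; math is part of Python's standard library, so nothing needs installing.

```python
import math  # a standard-library module that ships with Python


def gc_content(seq):
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


fraction = gc_content("ATGCGC")          # our own function, defined with def
rounded_up = math.ceil(fraction * 100)   # ceil() lives inside the math module
print(fraction, rounded_up)
```

The point is only that gc_content() exists because we wrote it, while ceil() exists because we imported the module that contains it.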

Why have modules? Well, the core development team of Python cannot write every function that anyone could possibly want. Also, why fill your system up with lots of files containing functions that you will never use? Packaging variables and functions up in a module allows users to download only the ones they want, often developed by people with specific interests. The module matplotlib is a beautiful plotting package used to make great graphs and figures. The package numpy provides fast arrays and numerical functions. The package pandas provides a dataframe (table-like data structure) similar to the one used in R. Not all Python users will use all of these, but some will find them indispensable.

There are a number of ways of importing a module. Importing is when the module's functions and variables are loaded into Python so that you can call them. Start Python, for example by typing the following in your UNIX(-like) terminal:

$ python3

You can then load the packages you need. If you do not have the module installed already, you can install it from the command line (not from inside Python) using pip or conda (e.g. pip3 install numpy). Let's load the module numpy and use the array function (which creates a data structure similar to a list):

>>> import numpy
>>> a = numpy.array([2,3,4])
>>> a
array([2, 3, 4])

We imported numpy with the simple "import numpy" command. But notice that when we wanted to use numpy's function array(), we needed to add "numpy." before array(). This lets Python know that we want the array function from numpy specifically, since another package could in theory also use the name "array". But writing out numpy every time can get a little boring. So what many people do instead is this:

>>> import numpy as np
>>> a = np.array([2,3,4])
>>> a
array([2, 3, 4])

We can choose the name that we import numpy as. By convention, numpy is imported as np, and matplotlib's plotting submodule pyplot is imported as plt (import matplotlib.pyplot as plt). A third way to import numpy is this:

>>> from numpy import *
>>> a = array([2,3,4])
>>> a
array([2, 3, 4])

Here, we are simply importing all of the public functions and variables within the module in one go. This means that when we call a function, we do not need to specify which module it came from. This can save on typing but can cause conflicts if two packages use the same name for a function, and it can even hide Python's built-in functions (numpy defines its own sum, for example), so try to avoid it if possible. Generally I stick to the second of these methods.
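To see how a name conflict plays out, here is a small sketch using two standard-library modules, math and cmath, which both happen to define a function called sqrt. I have used them instead of numpy only so that the example runs without installing anything. It also uses the "from module import name" form, which pulls in a single named function in place of everything at once.

```python
import math
import cmath

# Both modules define a function called sqrt. With the module prefix
# there is no ambiguity about which one runs.
real_root = math.sqrt(4)        # 2.0
complex_root = cmath.sqrt(-4)   # 2j

# Importing the bare name from both modules: the later import silently wins.
from math import sqrt
from cmath import sqrt

clash = sqrt(4)   # this is now cmath.sqrt, so the result is complex: (2+0j)
print(real_root, complex_root, clash)
```

Python gives no warning when the second import replaces the first, which is why keeping the module prefix (or a short alias like np) is the safer habit.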

I highly recommend checking out modules like re (regular expressions; great for string searching, so perfect for working with motifs), numpy (numerical functions), matplotlib (plotting functions), scipy (scientific computing, including functions for statistical testing) and pandas (for making and working with dataframes).
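As a taster of why re is handy for motifs, here is a minimal sketch that finds every occurrence of the EcoRI recognition site (GAATTC) in a sequence. The sequence itself is made up for illustration, and the variable names are my own choice.

```python
import re  # regular expressions, part of the standard library

seq = "ATGGAATTCCGTAGAATTCTAA"   # made-up example sequence
motif = re.compile("GAATTC")     # EcoRI recognition site

# finditer() yields one match object per hit; start() is the 0-based position
positions = [match.start() for match in motif.finditer(seq)]
print(positions)  # [3, 13]
```

Remember that Python counts from zero, so a position of 3 means the motif starts at the fourth base.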
 
