British geneticist interested in splicing, RNA decay, and synthetic biology. This is my blog focusing on my adventures in computational biology. 

CompBio 028: Python vs R: an endless war

People love to have huge disagreements over small differences. Some people love the programming language Python and hate another called R. Some people praise R and trash Python. It doesn’t end there: others hate on both and insist that for the best code you need to write in C/C++/Rust, and the wars between C++ and Rust programmers look entertaining. Don’t even get me started on whether you should indent code with tabs or spaces (just watch this brilliant clip from the HBO comedy Silicon Valley). Even which text editor to use in the terminal (vim vs nano) causes debate. Given how passionate people get over these small differences (see Narcissism of Small Differences), I should stay well out of this topic. But here I go against my better judgement…

This post is intended to give a personal view of Python and R as someone who went from the wetlab to coding. To be clear, I am not trying to claim that one of these languages is innately superior. I wrote it because I think that there are some interesting differences between the two, and in how people are introduced to them, which can make a big difference to how people think about programming.

A cartoon blue and yellow snake and a female pirate with a sword saying "Rrrr".

Before I get into the core of this post, some readers may be asking “but if they are both Turing-complete languages, then you can do anything in either, so why does this debate even exist? Isn’t it all about personal preference?” And yes, it partly is. Any computation can be done in either language, given that they are both Turing-complete, but how easy it is to do the things that biologists care about does differ. So does the community/culture surrounding each language. These two things are not independent: the culture/community drives how the language and its packages/modules are designed, and the design of the language in turn influences how the community/culture sees it and teaches it. Below are some examples from my experience learning both. (Full disclosure: Python was my first language and I did a course in it. I tried to pick up R at the same time with online courses, but it didn’t take, and I had to return to R later without much formal education. I am sure that will have biased me.)

Strengths of Python: learning basic principles of coding

When I think of programming, I immediately think of for loops and if statements. A for loop allows you to go through a list of values, like each row in a table, one at a time. When you are going through a loop and need to make a decision (“yes, I want to keep this row in the output” or “no, I want to discard this row immediately”), an if statement lets you check whether a specific value is present at a given position in the list and act on it accordingly.
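As a minimal sketch of those two ideas (the little table and the threshold of 10 are made up purely for illustration), a for loop can walk through the rows one at a time while an if statement decides which ones to keep:

```python
# A tiny hypothetical table: each row is [gene name, read count].
rows = [
    ["geneA", 25],
    ["geneB", 3],
    ["geneC", 10],
    ["geneD", 7],
]

kept = []
for row in rows:        # visit each row, one at a time
    name, count = row
    if count >= 10:     # "yes, I want to keep this row in the output"
        kept.append(row)
    # rows that fail the check are simply not copied across

print(kept)
```

Only geneA and geneC end up in `kept`, because they are the only rows whose count is at least 10.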

When you start learning Python, these are some of the first concepts you will come across. For me, it changed how I think about coding and broke it down into simple steps that I could chain together to build something that was functional and where I understood every step. Like placing one Lego brick at a time and ending up with an awesome model, you can build up a Python script bit by bit until you have a beautiful program that does multiple small actions. I can build great complexity, step by step, from the first principles of for loops and if statements. It can take a while to write it all out, line by line, but I not only get a sense of smug satisfaction when it is done, I can also predict unintended behaviours from unexpected inputs and quickly update the code to accommodate the unusual data type (usually with another well-placed if statement).
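To illustrate what I mean by “another well-placed if statement”, here is a rough sketch. The input values, and the choice to treat "NA" and empty strings as missing, are my own assumptions for the example rather than any standard:

```python
# Hypothetical raw values as read from a text file; "NA" and "" are the surprises.
raw_counts = ["25", "3", "NA", "10", ""]

counts = []
for value in raw_counts:
    if value in ("NA", ""):    # the extra if statement for unexpected input
        counts.append(None)    # record the gap instead of crashing on int()
    else:
        counts.append(int(value))

print(counts)
```

Without the first branch, `int("NA")` would raise a `ValueError` and stop the script part way through the data.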

R, in contrast, does not like you to use for loops or if statements (I checked this with someone who likes R, and a DuckDuckGo search confirms it, e.g. this). R is a language for getting stuff done (see below) and doesn’t want you wasting your precious time learning these basic concepts or using its slow implementation of for loops.

Strengths of R: just getting stuff done with data

Doing things my preferred way in Python, building each functional part of the script myself, comes at a cost. More lines of code means more time and more chance for bugs. The number of errors (bugs) is correlated with the number of lines of code (see here), so the more you have to write, the more chance for errors. R beats this because you can get your table (from a text file or Excel) into R as a dataframe with one line of code, and then immediately start to work on it. Base Python does not even have the notion of a table. Instead you need to rely on lists of lists or dictionaries to make tables (see an old blog post by me with an example here), or install the popular pandas package to replicate R-like dataframes in Python. So R becomes immediately useful for people wanting to manipulate data in tables, which is pretty common for biologists. Want to remove all lines where the value in the 2nd column is lower than 10? You can do that in one line with R, no extra packages required. In base Python, I would need to write multiple lines of code (to set up a for loop, then if statements to check the value, and then to write to an out file). So for just getting things done, R is a champion (a lot is happening “under the hood” so you do not need to worry about it). But what if you need to do something you’ve not seen in R? In Python, I just imagine the logical way I would process the data in my mind and then write out the process, using for loops and if statements. In R, I need to rote learn all these different functions that act on dataframes, which isn’t fun or easy for me. In reality, I just ask fellow UWA researcher Brady Johnston (@bradyajohnston Twitter; bradyajohnston@aus.social Mastodon; @bradyajohnston.bsky.social Bluesky) for help.
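For comparison, here is roughly what the “remove rows where the 2nd column is lower than 10” task looks like in base Python. Treat it as a sketch: the function name `drop_low_rows`, the tab-separated layout with a header line, and the example data are all assumptions of mine, and I use in-memory text in place of real input and output files so that it runs on its own.

```python
import csv
import io

# Hypothetical tab-separated input; in practice this would be a file on disk.
raw_text = "gene\tcount\ngeneA\t25\ngeneB\t3\ngeneC\t10\ngeneD\t7\n"


def drop_low_rows(in_handle, out_handle, threshold=10):
    """Copy rows to out_handle, skipping rows whose 2nd column is below threshold."""
    reader = csv.reader(in_handle, delimiter="\t")
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(next(reader))        # copy the header line unchanged
    for row in reader:                   # the for loop
        if float(row[1]) >= threshold:   # the if statement
            writer.writerow(row)         # the write to the out file


out_handle = io.StringIO()
drop_low_rows(io.StringIO(raw_text), out_handle)
filtered_text = out_handle.getvalue()
print(filtered_text)
```

With real files you would pass handles from `open(...)` instead of the `io.StringIO` objects. Either way it is a dozen or so lines for something a dataframe does in one.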

Of course you COULD write out for loops and if statements in R, but I am yet to find an online tutorial that helps you with that. The expectation is that you use a pre-written function that does all the for loops for you, and you just edit that one line of code. R is powerful. I recently had a complex column in a table that I needed to split into multiple new columns. I knew how to do this in Python (painfully, as I’d done it before), taking into account the many different special characters used to divide up the data within this column. Eventually I figured it out in R, with a lot of trial and error, and it was simple (once you knew how).
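To give a flavour of the Python side of that column-splitting problem, here is a small sketch. The column values and the three separators (`;`, `|` and `,`) are invented for the example and are not the real data I was working with:

```python
import re

# Hypothetical messy column values that mix several separators.
column = ["chr1;100|200,+", "chr2|300;400,-"]

# One pattern covering every separator we assume can appear.
separators = re.compile(r"[;|,]")

split_rows = []
for value in column:
    parts = separators.split(value)   # break the value at any separator
    split_rows.append(parts)          # each part becomes its own new column

print(split_rows)
```

Each original value becomes a list of four pieces, which can then be written out as four separate columns.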

One way vs many ways

In Python, there are usually multiple ways that you can write code to achieve one good outcome, but the philosophy of the language is that you do it in the simplest, most “Pythonic” way. This has led to most Google answers for Python coding questions giving similar single answers to help you (in my experience at least). In R, you have almost the exact opposite approach: it has no philosophy other than the mantra of getting stuff done quickly. When you look up how to do something in R, you will generally find many websites with tutorials that each offer multiple different approaches to solve the exact same problem, for the user to pick their favourite from. Often these will cover how to do it in base R as well as with different packages that you can download to expand the language (see tidyverse discussion below). While I do kind of appreciate the flexibility built into this approach, I find that it adds so much time to figuring out how to do the one simple thing that you want to do.

The philosophy and community of R are so fragmented that there is even its own “sub-language” called the tidyverse, described on its own website as follows: “The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures”. It is responsible for some things that I love, like ggplot2, and some things that I hate, like the faux dataframes called tibbles, which are apparently better but lack some functions. Why is the tidyverse part of R rather than its own language? I do not know. History, I guess? Because there is no law stopping them from doing this? The tidyverse has a cult-like following of devotees that does not count me among them. It can be useful, but after struggling to learn base R’s syntax to get by, I find it obnoxious to then have to learn a 2nd, unintuitive syntax to use the tidyverse. I could simply ignore the tidyverse, but when I search for solutions to my R woes, soooo many of the answers on online forums use the tidyverse and are unintelligible to me (I like to understand what I am doing with my code, without learning a secret second language within a language, and I feel like an old man yelling at a cloud).

Screenshot from the Simpsons: a newspaper clipping of Grandpa Simpson with his fist raised, with the headline "OLD MAN YELLS AT CLOUD".

A final comment on R, which might get its own blogpost one day, is that many computational tools to analyse your data, such as those for differential gene/transcript expression from RNA-seq, are now R packages rather than standalone command line programs, and are usually hosted on Bioconductor. Using tools like these is what forced me to learn R, and the beauty of ggplot2 is what got me to stick around. But I miss running one-line commands in the terminal to analyse my data, rather than dozens of interactive commands in R (or a very long R script). To me this seems to go against one of the strengths of R, which is getting a lot done with a single line of code. With tools like Cufflinks, I can do a whole analysis with one command, but with sleuth it takes far more.

Summary

Both languages are great and have their strengths. I personally prefer the logic and philosophy of Python and I would like to do everything in Python all the time. But I found that R is just a lot better for making nice pretty plots and quickly working with data in tables. I will still often write a single standalone Python script when I need to take a large input file and process it in a single complex way that I know I will need to repeat a few times. I choose to use R when I am working with the output data and want to turn it into pretty plots or quickly modify the table in a relatively simple way. Some people use the pandas module for Python to give them R-like control over dataframes within Python, and use multiple different plotting modules to make beautiful figures. I personally found that I did not like pandas and it was just easier to learn R and do my plotting with its ggplot2 library. Each to their own, but hopefully this has given you some appreciation of the debate and how to consider each language when you start. My intention wasn’t to help you pick a first language, but to explain why picking your first language may have unforeseen consequences. I’m glad I went with Python and learnt those core skills first before moving onto R, but everyone needs to forge their own path.

PS A recent pet peeve is that R tries to gaslight me, as some functions act in unexpected ways, like read_excel() importing as a tibble and not a dataframe, or rownames() not working on a tibble when it does on a traditional dataframe, or the merge() function creating new row indexes rather than keeping the old ones or preserving the order (see here). Thanks tidyverse 🙃.

CompBio 027: Is AI generated code here to replace computational biologists?