Choosing a language

I’ve been trying to make up my mind about which programming languages I should know. I’ve come to the conclusion that I must know what matters to me. That’s not a very exciting conclusion.

But when you’re starting off in a new field, you are very curious about choosing the right tool for the job. You wander around on the internet, read some pros and cons of a particular language, start to learn it, read some more pros and cons, switch to a new language, repeat.

When I began to learn some data science tools on Coursera two years ago, I did not know which language to pick. They introduced R, so I stuck with R. And I came to love it. I learned about the most useful packages and all that. Learned some graphics techniques too [1]. This time was great. I was not comparing R to other languages. I knew only R. I was not saying “Oh, I wish R could do that!” or “If only R were as clean as Python”. I stuck to R, and learned a ton. It reminds me of the period of life when people usually learn the most: childhood. Just thinking about the mass of information we have to face as a baby is incredible. The process of learning a first programming language is really similar. We have to face a lot of information simultaneously: what is a for loop [2], what is a terminal, what is an IDE, what is a function, where is my data, how do I learn about X, what was the name of that function again, I wish I knew of a function that could do this and that, what are the names of this function’s arguments, …

And then, when I was getting familiar with common hacking techniques, I grew curious about what was on the other side. Maybe Python is cool too. I checked Python. What? No library(foo)? Python3 is so different from Python2 that I have to use a virtualenv? What is a virtualenv? How do I set up all this? R was so much simpler. I guess there are other languages that are closer to R and simpler to use than Python. Well, Julia is. Let’s learn some Julia. Oh but wait, Julia is only at 0.3.2. I cannot use Julia for now. Nobody knows it. But hey, I learned about execution speed. I now know that R is slow. Maybe I can learn a compiled language like C? Maybe it would be a good idea to find out what the community thinks is a good first language to pick? (Go on Stack Overflow). Ok, what? Structure and Interpretation of Computer Programs? This is the second time I hear this name. I know Hadley suggests it as a good introduction to functional programming in R. Let’s read it. Scheme? Parentheses everywhere. Simple syntax. Lego for the mind. Ok then let’s learn Scheme. But Scheme is nowhere to be seen in a production environment. What the … is a production environment? Maybe let’s learn some Clojure then. Ok so I need to learn about the JVM? Clojure is functional programming on the JVM, like Scala. Scala? Syntax similar to R, good constructs and all. But Scala is so Java-like that you have to use a dedicated IDE. And I love Emacs. Well then not Scala. Maybe Clojure? But wait.

What’s the point?

What do I need from a programming language?

I need it to be simple to use. I do not want to be a software engineer. I want to be able to analyse data. I want to be able to look at a biological data set and extract meaning from it. I do not need to know about GUIs or stuff like that. Maybe R was a good choice after all. Maybe I could check which languages are most used in my field? Ok then, it’s R and Python. Well, let’s stick to R, and learn some Python to wrangle some sequences.

Now I’m glad I did all that. It took me a year or so. But now I know why I am at my computer. I learned some pretty amazing stuff. And now I can focus on what is important to me as a data scientist: I want an analysis environment that is simple to use, closer to the data than to the metal, in which I can quickly abstract ideas [3], and that I can easily make faster if needed [4].

R is all that. You can interface it with C++ seamlessly through a dedicated package, to make it really, really fast. Yet you can stay in your REPL all along, never worrying about implementation details: where your class is defined, which button of this IDE did this and which button of that other IDE did that, which software version you are using, or whether the code you develop will be easy to deploy to another computer that does not run macOS like yours. R is a great environment in this regard. It abstracts many things away from the data analyst, and makes their life really easy.

Use R. Learn Python. Learn C++

A final piece of advice to anybody reading this who, like me, is trying to find their place in this world of software engineers and bioinformaticians: you don’t want to be a software engineer. You want to analyse data. It is not your job to build the algorithm behind that particular piece of software, or to work out the implementation details of a particular class or object. You want to know what this gene is doing. You want to know why this bacterium is always associated with this particular plant. You want to know why one bacterium can have a GC content of 13%, and another of 76%. There is data associated with each of these questions.

Use Python to make your data available to R. Python is very good at wrangling data, like Perl was in its glory days. When the data is in R, use the right tool for the job. If a function is slow, try another approach. Try to vectorize it. Try to make it simpler. Use optimized functions. When it’s still slow, use C++. Those are the three tools one needs to turn most analyses from an idea into a reality.
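To make the first step concrete, here is a minimal sketch of the kind of wrangling I have in mind, assuming the raw data is a FASTA file of sequences. The function names (read_fasta, gc_percent, fasta_to_csv) and the example records are made up for illustration; only the Python standard library is used. It computes the GC content of each sequence and writes a small CSV table.

```python
import csv
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_percent(sequence):
    """Return the percentage of G and C characters in a sequence."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def fasta_to_csv(fasta_handle, csv_handle):
    """Write one CSV row per sequence: name, length, GC percentage."""
    writer = csv.writer(csv_handle, lineterminator="\n")
    writer.writerow(["name", "length", "gc_percent"])
    for name, seq in read_fasta(fasta_handle):
        writer.writerow([name, len(seq), round(gc_percent(seq), 2)])


if __name__ == "__main__":
    # Hypothetical in-memory records standing in for a real FASTA file.
    example = io.StringIO(">seq_a made-up record\nGGCC\nGGCC\n>seq_b\nATGC\n")
    out = io.StringIO()
    fasta_to_csv(example, out)
    print(out.getvalue())
```

With real files you would pass open file handles instead of the in-memory streams, and the resulting table can then be loaded on the R side with something like read.csv().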

  1. ggplot2 to the rescue!
  2. Not in R of course ;)
  3. Functional programming was a good idea after all…
  4. Looking at you, Rcpp and Cython