Choosing a language
I’ve been trying to make up my mind about which programming languages I should know. I’ve come to the conclusion that I must know what matters to me. That’s not a very exciting conclusion.
But when you’re starting off in a new field, you are very curious about choosing the right tool for the job. You wander around on the internet, read some pros and cons of some particular language, start to learn about it, read some more pros and cons, switch to a new language, repeat.
When I began to learn some data science tools on Coursera two years ago, I did not know which language to pick. They introduced R, so I stuck with R. And I came to love it. I learned about the most useful packages and all that. Learned some graphics techniques too[1]. This time was great. I was not comparing R to other languages. I knew only about R. I was not saying “Oh, I wish R could do that!” or “If only R was as clean as Python”. I stuck to R, and learned a ton. It reminds me of the period of life where people usually learn the most: childhood. Just thinking about the mass of information we have to face as a baby is incredible. The process of learning a first programming language is really similar. We have to face a lot of information simultaneously: what is a for loop[2], what is a terminal, what is an IDE, what is a function, where are my data, how do I learn about X, what was the name of this function again, I wish I knew of a function that could do this and that, what are the names of these functions’ arguments, …
And then, when I was getting familiar with common hacking techniques, I grew curious about what was on the other side. Maybe Python is cool too. I checked Python. What? No library(foo)? Python3 is so different from Python2 that I have to use a virtualenv? What is a virtualenv? How do I set up all this? R was so much simpler. I guess there are other languages that are closer to R and simpler to use than Python. Well, Julia is. Let’s learn some Julia. Oh but wait, Julia is only 0.3.2. I cannot use Julia for now. Nobody knows it. But hey, I learned about execution speed. I now know that R is slow. Maybe I can learn a compiled language like C? Maybe it would be a good idea to know what the community thinks is a good first language to pick? (Go on Stack Overflow.) Ok, what? Structure and Interpretation of Computer Programs? This is the second time I hear this name. I know Hadley suggests it as a good introduction to functional programming in R. Let’s read it. Scheme? Parentheses everywhere. Simple syntax. Lego for the mind. Ok then let’s learn Scheme. But Scheme is nowhere to be seen in a production environment. What the … is a production environment? Maybe let’s learn some Clojure then. Ok so I need to learn about the JVM? Clojure is functional programming on the JVM, like Scala. Scala? Syntax similar to R, good constructs and all. But Scala is so Java-like that you have to use a dedicated IDE. And I love emacs. Well then not Scala. Maybe Clojure? But wait.
What’s the point?
What do I need from a programming language?
I need it to be simple to use. I do not want to be a software engineer. I want to be able to analyse data. I want to be able to look at a biological data set, and get meaning from it. I do not need to know about GUIs or stuff like that. Maybe R was a good choice after all. Maybe I could check what the most used language in my field is? Ok then it’s R and Python. Well, let’s stick to R, and learn some Python to wrangle some sequences.
Now I’m glad I did all that. It took me a year or so. But now I know why I am at my computer. I learned some pretty amazing stuff, and now I can focus on what is important to me as a data scientist: I want an analysis environment that is simple to use, closer to the data than to the metal, in which I can quickly abstract ideas[3], and that, if needed, I can easily make faster[4].
R is all that. You can interface it seamlessly with C++ through a dedicated package, to make it really really fast. Yet you can stay in your REPL all along, never worrying about implementation details, never worrying about where your class is defined, what was the button of that IDE that did this, where was the button of that other IDE that did that, which software version am I using, if I develop some code, will it be easy to deploy to another computer, not using macOS like me? R is a great environment in this regard. It abstracts many things away from the data analyst, and makes his life really easy.
Use R. Learn Python. Learn C++
As a final piece of advice to anybody reading this who, like me, is trying to find their place in this world of software engineers and bioinformaticians: you don’t want to be a software engineer. You want to analyse data. It is not your job to build the algorithm of that particular piece of software, or the details of the implementation of a particular class or object. You want to know what this gene is doing. You want to know why this bacterium is always associated with this particular plant. You want to know why one bacterium can have a GC content of 13%, and another 76%. There is some data associated with it.
Use Python to make your data available to R. Python is very good at wrangling data, like Perl was in its glory days. When the data is in R, use the right tool for the job. If a function is slow, try another approach. Try to vectorize it. Try to make it simpler. Use optimized functions. When it’s still slow, use C++. Those are the three tools one needs to turn most analyses from an idea into a reality.
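To make the "Python wrangles, R analyses" idea a bit more concrete, here is a minimal sketch of what the Python side could look like. It reads sequences in FASTA format, computes the GC fraction of each one, and writes a small CSV table. The function names (parse_fasta, gc_content, fasta_to_csv), the column names and the toy sequences are all made up for this illustration, and the parser is deliberately naive: it assumes plain FASTA text with no quality scores or odd characters.

```python
import csv
import io


def parse_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Keep only the first word of the header as the record name.
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_content(sequence):
    """Return the fraction of G and C bases in a sequence (0.0 if empty)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def fasta_to_csv(fasta_handle, csv_handle):
    """Write one CSV row per sequence: name, length, GC fraction."""
    writer = csv.writer(csv_handle)
    writer.writerow(["name", "length", "gc"])
    for name, sequence in parse_fasta(fasta_handle):
        writer.writerow([name, len(sequence), round(gc_content(sequence), 4)])


if __name__ == "__main__":
    # Toy input (hypothetical data), kept in memory so the sketch is self-contained.
    example = io.StringIO(">seq_a toy record\nATGC\nGGCC\n>seq_b\nATATATAT\n")
    out = io.StringIO()
    fasta_to_csv(example, out)
    print(out.getvalue())
```

With real data you would pass open file handles instead of the in-memory strings, write to a file such as gc.csv (a hypothetical name), and pick the table up on the R side with read.csv("gc.csv"). From there on, everything happens in R.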