Functional programming is a style of programming which place the emphasis on functions. Functions helps to clearly express the intent of a piece of code. To me, R is a functional language. Its deep roots goes back to Scheme, a functional LISP dialect. The imperative part of R are to be found in its S legacy. In this post, I want to describe how I use functions in R, mixing imperative and purely functional programming.

I will take as an example an analysis I’ve done in the past few months. It was a quick one day long analysis, but I think it would have been much longer if I had not used R1.

I work in biology. In november, we received results of a run of Sanger sequencing. It was done in 96 well plates on 800bp PCR amplicons. We quickly found out that some sequences showed traces of contamination : we had secondary peak specifically on our divergence markers. We had reasons to believe that sequences with secondary peak would be distributed randomly over the 96 well plate if it was not a sign of contamination. I decided to test the random distribution using a quick simulation.

The simulation would do the following :

  • generate 96 well plates with as much contaminated well as there were in our plates.
  • generate a lot of those plates.
  • per well, count the average number of contaminated well, and average this count over the plate.
  • compare this simulated statistic with the observed one in our plate.

We first need to load some packages.

library(assertthat)
library(ggplot2)
library(dplyr)

Functions express intent

We want a simple function to sample a number at random between 1 and n.

random_well <- function(n) sample(1:n, 1)

This function does only one simple thing. But it has a name that clearly express what it does and abstracts the sampling process, so simple as it is.

We also want a simple function to generate an empty plate, ie a plate where no well is contaminated. A 96 well plate can easily be represented as a 8 by 12 matrix.

initiate_plate <- function() matrix(rep(0, 96), nrow = 8, ncol = 12)

In the same manner, the function is really simple, but now I can just call initiate_plate and be done with it : I now have an 8 by 12 matrix with 0 all over, ready to be filled up with simulated contaminations !

Recursion shortens the distance from mind to code

We now want to contaminate n well inside of this empty plate.

plate_generate <- function(plate = initiate_plate(), n) {
    assert_that(is.numeric(n), n < 96)

    ## this function contaminate a plate well if it is not already contaminated.
    ## it uses recursion to do so.
    contaminer <- function(mat) {
        x <- random_well(8) # sample one line at random
        y <- random_well(12) # sample one column at random

        if (mat[x, y] != 1) {
            mat[x, y] <- 1
            mat
        } else {
            contaminer(mat)
        }
    }

    ## if the plate already have *n* contaminated wells, return it.
    ## otherwise create a plate with *n - 1* contaminated wells.
    ## this kind of recursion is particularly useful to think about.
    if (sum(plate < n)) {
        ## if the plate does not have n wells contaminated
        contaminer(plate_generate(plate, n - 1))
        ## generate another plate with *n - 1* contaminated wells,
        ## and contaminate another random well.
    } else {
        plate
    }
}

plate_generator uses recursion. It is a style of function calling that I believe to be discouraged in R. But as my stack cannot exceed 96, I can use recursion without worrying too much about overflowing the R stack.

I have enclosed a function inside of it, since I do not need it outside of this context. This function, contaminer, set the well / matrix value to 1 at a random coordinate if the well is not already at 1. If the well is already contaminated (equals to 1), I call contaminer on the plate again : it will pick another random well, and contaminate it. If the well is already contaminated (equals to 1), I call contaminer on the plate again : it will pick another random well, and contaminate it. If the well is already … I think you begin to understand how recursion works.

plate_generator uses contaminer to create a matrix with n contamination inside, ie n wells with a 1 value and 96 - n wells with a 0 value. It does so by first checking if the total count of plate is equals to n. If not, we can generate a plate with n - 1 contamination, and add another contamination with contaminer. How can we generate a plate with n - 1 contamination ? It would be great if we had a function that generate a plate with the desired amount of contamination. Well, it appears we have. This function is called plate_generator. If we call plate_generator with n = n - 1, it will contaminate another well ; the plate sum is now equal to n and we can return it. But generating a plate with n - 1 contaminated well implies that we have a plate with n - 2 contaminated well and so on. This is recursion. I have one function that does one thing and call itself to reach its goal.

The plate_generator function is useful in the sense that I did not have to think about it really well. I know it is not optimised. I know that I can shorten the time of execution by first excluding wells that are already contaminated from the parameters space of contaminer, instead of picking / checking / contaminating a well randomly. The function would gain in speed. But I can generate ten thousands plate on my computer in less than 2 minutes. The function did not take that long to develop.

The previous code is maybe a little bit tedious to understand. Anyway, once it is tested, I do not have to care about how it is done. I can just call the function, I know it will always give the same result, even if the implementation change. It is a level of abstraction that clearly helps to develop clean and clever code by iterative steps. I am not afraid of breaking everything up : I know my functions does one thing and does it well. It has no side-effects. This function is pure : it does not modify anything outside its scope.

We now want to generate many random plates, with a finite number of contamination inside of it.

Running the simulation

n_plate_generate <- function(number_of_plate, number_of_conta) {
    lapply(1:number_of_plate, function(i) plate_generate(n = number_of_conta))
}

We now can generate three plates with three contaminated wells inside of it.

n_plate_generate(3, 3)
## [[1]]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,]    0    0    0    0    0    0    0    0    0     0     1     0
## [2,]    0    0    0    0    0    0    0    0    0     0     0     0
## [3,]    0    0    0    0    0    0    1    0    0     0     0     0
## [4,]    0    0    0    0    0    0    0    0    0     0     0     0
## [5,]    0    0    0    0    0    0    0    0    0     0     0     0
## [6,]    1    0    0    0    0    0    0    0    0     0     0     0
## [7,]    0    0    0    0    0    0    0    0    0     0     0     0
## [8,]    0    0    0    0    0    0    0    0    0     0     0     0
##
## [[2]]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,]    0    0    0    0    0    0    0    0    0     0     0     0
## [2,]    0    0    0    0    0    0    0    0    0     0     0     0
## [3,]    0    0    1    0    0    0    0    0    0     0     0     0
## [4,]    0    0    0    0    0    0    0    0    0     0     0     0
## [5,]    0    1    0    0    0    0    0    0    0     0     0     0
## [6,]    0    0    0    0    0    0    0    0    0     0     0     0
## [7,]    0    0    0    0    1    0    0    0    0     0     0     0
## [8,]    0    0    0    0    0    0    0    0    0     0     0     0
##
## [[3]]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,]    0    0    0    0    0    0    0    0    0     0     0     0
## [2,]    0    0    0    0    0    0    0    0    0     0     0     0
## [3,]    0    0    0    0    0    0    1    0    0     0     0     0
## [4,]    0    0    0    0    0    0    0    0    0     0     0     0
## [5,]    0    0    0    0    0    0    0    0    0     0     0     0
## [6,]    0    0    0    0    0    0    0    0    0     0     0     0
## [7,]    0    0    0    0    0    0    0    0    0     0     0     0
## [8,]    0    1    0    1    0    0    0    0    0     0     0     0

We then need to count, per well, the number of neighbouring contaminated well. This function is kinda tricky to explain, and demands some kind of expertise with typical biological plates. I had to hardcode many stuff, so I guess this is not the cleaner way to do it. But at least is works.

has_nearby_conta <- function(plate, line, column) {
    assert_that(is.matrix(plate), is.numeric(line), is.numeric(column))

    x      <- line
    y      <- column
    p      <- plate
    nearby <- c()

    bord_gauche     <- function(col) ifelse(col == 1 , TRUE, FALSE)
    bord_droite     <- function(col) ifelse(col == 12, TRUE, FALSE)
    bord_sup        <- function(row) ifelse(row == 1 , TRUE, FALSE)
    bord_inf        <- function(row) ifelse(row == 8 , TRUE, FALSE)

    coin_sup_gauche <- function(row, col) ifelse(bord_gauche(col) & bord_sup(row), TRUE, FALSE)
    coin_inf_gauche <- function(row, col) ifelse(bord_gauche(col) & bord_inf(row), TRUE, FALSE)
    coin_inf_droite <- function(row, col) ifelse(bord_droite(col) & bord_inf(row), TRUE, FALSE)
    coin_sup_droite <- function(row, col) ifelse(bord_droite(col) & bord_sup(row), TRUE, FALSE)

    if      (coin_sup_gauche(x, y)) { nearby <- c(nearby, p[x, y+1], p[x+1, y+1], p[x+1, y])}
    else if (coin_inf_gauche(x, y)) { nearby <- c(nearby, p[x-1, y], p[x-1, y+1], p[x, y+1])}
    else if (coin_sup_droite(x, y)) { nearby <- c(nearby, p[x, y-1], p[x+1, y-1], p[x+1, y])}
    else if (coin_inf_droite(x, y)) { nearby <- c(nearby, p[x, y-1], p[x-1, y-1], p[x-1, y])}
    else if (bord_gauche(y))        { nearby <- c(nearby, p[x-1, y], p[x-1, y+1], p[x, y+1], p[x+1, y+1], p[x+1, y])}
    else if (bord_droite(y))        { nearby <- c(nearby, p[x-1, y], p[x-1, y-1], p[x, y-1], p[x+1, y-1], p[x+1, y])}
    else if (bord_sup(x))           { nearby <- c(nearby, p[x, y-1], p[x+1, y-1], p[x+1, y], p[x+1, y+1], p[x, y+1])}
    else if (bord_inf(x))           { nearby <- c(nearby, p[x, y-1], p[x-1, y-1], p[x-1, y], p[x-1, y+1], p[x, y+1])}
    else {
        nearby <- c(nearby,
                    p[x-1, y-1], p[x, y-1], p[x+1, y-1],
                    p[x-1, y  ],            p[x+1, y  ],
                    p[x-1, y+1], p[x, y+1], p[x+1, y+1] )
    }

    ## normalise to 1 if > 1
    nearby[nearby > 0] <- 1

    ## return the mean of the nearby matrix.
    mean(nearby)
}

We apply this function to a plate, to count the mean number of well that have at least one well contaminated.

count_nearby_conta_well <- function(plate) {
    locate_conta_well <- function(plate) {
        which(plate != 0, arr.ind = TRUE)
    }

    ## stores the positions of contaminated wells inside a list of the form
    ## ( (row . column)
    ##   (row . column) )
    well_list <- locate_conta_well(plate)

    mean(
        apply(well_list, 1,
              function(x) {
            has_nearby_conta(plate, line = x[1], column = x[2])
        }))
}

During the debugging of this function, it was particulary useful to have clean and expressive names to decrypt the traceback() output.

We need to count the number of contaminated well inside our plate.

count_conta <- function(plate) length(which(plate != 0))

Let’s set N the number of plate in the monte carlo simulation.

n_plate <- 1000

We can then generate 1000 plates with C contamination inside of it, randomly distributed on the plate.

# generate `n_plate` with C contaminated well
random_weak <- n_plate_generate(n_plate, 35)
## count the average number of nearby contaminated well per well and average it
## over the plate. Do it to all the plate in random_weak.
random_weak_mean_list <- lapply(random_weak, count_nearby_conta_well)

Just so that you can see the data :

head(random_weak, n = 3)
## [[1]]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,]    1    1    0    0    0    1    1    1    0     0     0     0
## [2,]    1    0    1    0    1    1    0    1    0     0     1     0
## [3,]    0    0    0    0    0    1    1    1    1     0     0     1
## [4,]    1    1    0    0    0    0    0    0    1     0     0     0
## [5,]    1    0    0    1    0    1    1    0    1     0     0     0
## [6,]    0    0    1    0    0    1    0    0    0     0     1     1
## [7,]    0    0    0    0    0    1    0    1    0     0     1     0
## [8,]    0    1    0    1    1    0    0    0    1     0     0     0
##
## [[2]]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,]    1    0    0    0    0    0    1    1    1     0     1     0
## [2,]    1    0    0    0    0    1    0    0    0     0     1     1
## [3,]    0    1    1    0    0    0    0    1    0     1     0     1
## [4,]    0    1    1    0    0    0    1    0    0     0     1     0
## [5,]    0    1    0    0    1    1    0    1    0     1     1     0
## [6,]    0    1    1    0    1    0    0    0    0     0     0     0
## [7,]    0    0    0    0    1    1    1    1    0     1     0     1
## [8,]    1    0    0    0    0    0    0    0    0     1     0     0
##
## [[3]]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,]    0    0    0    1    1    0    0    0    0     0     0     1
## [2,]    0    0    1    0    1    1    1    0    0     1     0     1
## [3,]    0    1    0    0    1    0    0    0    0     1     1     1
## [4,]    0    0    0    0    1    1    0    1    0     1     1     0
## [5,]    1    0    0    0    1    0    0    0    1     0     0     0
## [6,]    0    0    0    1    0    0    0    1    0     0     0     1
## [7,]    0    0    0    0    1    0    0    1    0     0     1     0
## [8,]    1    1    1    0    1    1    0    0    1     0     1     0
head(random_weak_mean_list, n = 3)
## [[1]]
## [1] 0.3440476
##
## [[2]]
## [1] 0.3188095
##
## [[3]]
## [1] 0.3590476

For the first plate, the 96 wells have a mean of 0.3440476 wells contaminated.

plot_mean_distrib <- function(random_mean_list) {

    random_mean_list %>%
        unlist() %>%
        data.frame(wellcount = .) %>%
        ggplot(aes(wellcount)) +
        geom_histogram(binwidth = 0.005, fill = "gray") +
        labs(x = "Mean number of nearby contaminated well per well per plate") +
        theme_minimal() +
        theme(panel.ontop = TRUE,
              panel.grid.major.y =
                  element_line(colour = "white", size = 0.5, linetype = "dotted"),
              panel.grid.minor.y =
                  element_line(colour = "white", size = 0.5, linetype = "dotted"))

}

A first plot :

plot_mean_distrib(random_weak_mean_list)

plot of chunk 2016-07-05-randomweakmean

Factorial in R :

I think this function is the cleanest and fastest way to express the factorial function in R. And yes, * is a function.

factorial <- function(n) {
    Reduce(f = `*`, x = 1:n, init = 1)
}

  1. Actually it is the language I am the most comfortable with, so it is not really a good comparison… [return]