Saturday 6 July 2013

Idiom in R: results you can C

Computing for Data Analysis was a pretty good introduction to R, but it did not really cover R idiom, which can make the difference between code that runs and code that runs quickly. Here is a basic example.

Using sprintf to format filenames: Consider a series of files. The goal is to read them all into R, but the filenames contain a fixed-width, zero-padded number: we want to load files such as ./data/001.csv and ./data/011.csv, up to ./data/999.csv. How do we construct the name strings in R?

The numbers in the file names need to be padded and converted to the appropriate strings. Here are three ways of doing the padding.

The R way:

# setup
directory <- "data"
id <- 1:999

# method 1
pad.R <- function(id) {
    num <- sprintf("%03d", as.integer(id))
    path <- paste("./", directory, "/", num, ".csv", sep = "")
    return(path)
}
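
As an aside, sprintf is vectorised over its arguments, so the whole path can be built in a single call. Here is a minimal sketch of that idea. The name pad.onecall is made up for this illustration, and unlike the functions in this post it takes the directory as an argument (defaulting to "data") rather than reading the global variable.

```r
# illustrative variant: build the full path in one sprintf call
pad.onecall <- function(id, directory = "data") {
    # %s inserts the directory name; %03d zero-pads the integer to width 3
    sprintf("./%s/%03d.csv", directory, as.integer(id))
}

pad.onecall(c(3, 13, 103))
## [1] "./data/003.csv" "./data/013.csv" "./data/103.csv"
```

I have not timed this variant here, so treat it as a readability option rather than a speed claim.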

A brute-force method:

# method 2
pad.brute <- function(id) {
    num <- rep("", length(id))
    for (n in seq_along(id)) {
        if (id[n] < 10) num[n] <- paste("00", id[n], sep = "")
        else if (id[n] < 100) num[n] <- paste("0", id[n], sep = "")
        else num[n] <- as.character(id[n])
    }

    path <- paste("./", directory, "/", num, ".csv", sep = "")
    return(path)
}

… but we know that for-loops are notoriously slow in R, so we could take a hybrid approach and define a function that takes a single number as input and converts it. That function can then be used with one of R's apply functions to convert the whole vector in one go.

A hybrid method:

# method 3
padder <- function(num) {
    if (num < 10) return(paste("00", num, sep = ""))
    else if (num < 100) return(paste("0", num, sep = ""))
    else return(as.character(num))
}

pad.hybrid <- function(id) {
    num <- sapply(id, padder)
    path <- paste("./", directory, "/", num, ".csv", sep = "")
    return(path)
}

Comparison

These approaches all give the same results, but their run times are noticeably different.

system.time(path <- pad.R(id))
##    user  system elapsed 
##   0.001   0.000   0.001
system.time(path <- pad.brute(id))
##    user  system elapsed 
##   0.007   0.000   0.008
system.time(path <- pad.hybrid(id))
##    user  system elapsed 
##   0.004   0.000   0.004
path[c(3, 13, 103)]
## [1] "./data/003.csv" "./data/013.csv" "./data/103.csv"

In terms of speed, they are equivalent when run on one or two elements at a time. However, when run on the full 999-element vector as shown here, the 'brute force' and 'hybrid' methods took roughly eight and four times as long as sprintf, respectively, in this single timing.
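
A single system.time call on a job this small sits close to the timer's resolution, so one way to get steadier numbers is to repeat each call many times and compare the totals. The sketch below restates methods 1 and 3 in compact form so that it runs on its own. The repetition count of 200 is an arbitrary choice of mine, and I give no expected output because the timings depend on the machine.

```r
# self-contained restatement of methods 1 and 3, for timing only
directory <- "data"
id <- 1:999

pad.R <- function(id) {
    paste("./", directory, "/", sprintf("%03d", as.integer(id)), ".csv", sep = "")
}

padder <- function(num) {
    if (num < 10) paste("00", num, sep = "")
    else if (num < 100) paste("0", num, sep = "")
    else as.character(num)
}

pad.hybrid <- function(id) {
    paste("./", directory, "/", sapply(id, padder), ".csv", sep = "")
}

# check that the two methods agree before comparing speed
stopifnot(identical(pad.R(id), pad.hybrid(id)))

# repeat each call so the totals are well above the timer resolution
reps <- 200  # arbitrary choice
t.R <- system.time(for (i in seq_len(reps)) pad.R(id))
t.hybrid <- system.time(for (i in seq_len(reps)) pad.hybrid(id))

# elapsed seconds for all repetitions of each method
c(sprintf = t.R[["elapsed"]], hybrid = t.hybrid[["elapsed"]])
```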

The discussions on the course forums did give a different perspective. Several self-identified 'professional programmers' preferred the if / else if / else approach I've used in both methods 2 and 3. They considered it more readable and thus more maintainable.

I don't think this is the best approach. If you are a professional programmer, you are familiar with idiom in whatever language you work in. You know that there are readable, maintainable ways of doing what needs to be done efficiently. At its root, deep down underneath, R is in the C family of languages. Its sprintf is a wrapper for the C function of the same name, declared in the C standard library header <stdio.h>. The professional way to use R is to use that R idiom efficiently and in a way that other R programmers will understand.

So learn your sprintf formatting codes. They may look like magic numbers the first time you meet them, but they are systematic and ubiquitous. They will be useful in many other contexts, including modern languages like Python and Java, and therefore even Scala and Clojure. They will also speed up your code, and don't worry: most other professionals will understand them.
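
To show how systematic the codes are, here are a few common ones in R. The values are arbitrary examples of mine, and the trailing | in the last two is only there to make the padding visible.

```r
sprintf("%d", 7L)        # plain integer
## [1] "7"
sprintf("%3d", 7L)       # padded with spaces to width 3
## [1] "  7"
sprintf("%03d", 7L)      # padded with zeros to width 3
## [1] "007"
sprintf("%.2f", pi)      # fixed-point with two decimal places
## [1] "3.14"
sprintf("%5s|", "ab")    # string right-aligned in width 5
## [1] "   ab|"
sprintf("%-5s|", "ab")   # string left-aligned in width 5
## [1] "ab   |"
```

The pattern is always the same: a percent sign, optional flags (0 or -), an optional width, an optional precision, and a letter for the type.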
