Friday, 28 February 2014

Intro to GitHub (for scientists)

The problem

I talked to a young woman yesterday who is a bio-engineering postdoc at Stanford. She has some code that she'd like to 'upload' to GitHub. She admitted that, well, actually, she hadn't managed to get any of her code onto her GitHub account yet, and she looked so overwhelmed and dejected that I felt bad. I know that frustration.
So here is yet another blog post to try and help with the learning curve that is git. I'm writing this for the scientist who has written some code and wants to share it on GitHub. Most scientists write code in order to accomplish a particular task, and thus are not familiar with professional programming practices, including writing documentation, unit testing, SCM (source control management) and version control. There are great benefits to learning these techniques, and their usefulness is becoming more and more apparent to academics as research relies more and more on computer programs. In fact, the table of contents for the journal Nature Methods just arrived in my inbox with a lead editorial on reproducible research. I quote:

Nature Methods strongly encourages researchers to take advantage of the opportunity that code repositories, such as GitHub, provide to improve a software tool before submission. Even if others do not examine and test the code, the act of preparing the code and necessary documentation for deposit and use by others will help avoid delays in publication.
In short, if you are coding and publishing work derived from your code, the process of uploading your code for collaboration will help bring it up to a publishable quality. And yes, it's even more important than making the figures pretty.

GitHub == collaboration

Firstly, if you just want to upload some code, you need to take a step back, take a deep breath and 'think different'. There is no 'upload' button on GitHub for a good reason. It is not built for uploading code and leaving it to rot on a server, but for fostering collaboration between programmers. Thus, in order to share your code on GitHub, you first need to get it ready for collaboration. To do that, you need to set up version control. This is somewhat more complicated than finding the 'track changes' setting in Microsoft Word, but it is also far more useful.
As you try to think different, keep in mind that GitHub was built with a particular set of workflows in mind. Those workflows have to do with managing a code-base that is constantly being updated by multiple contributors. GitHub tries to ease the difficulties of this type of collaboration by bringing several things together.
  • version control -- similar to 'track changes' in a word processor, but for whole projects
  • cloud-based storage -- simultaneously backup and share your code
  • user accounts -- keep it private, share with a team or make it public as your needs change
  • social -- there's messaging, so you can talk, argue, document, discuss.... collaborate!
But version control is the main thing that sets GitHub apart from any number of social sharing platforms, and makes it so powerful for people who code. To use GitHub effectively you therefore need to understand the basics of version control with git. It's hairy and scary at first, but no worse than... well, ok, it is worse than a lot of things, but sometimes a learning curve is the price you pay for really capable tools. So, put a set of bookmarks in your browser, make a cheatsheet and keep it handy. A local cheatsheet with a good searchable title does wonders. If you're thinking 'Oh, I'll just use it once' or 'I'll remember', well, git is for professionals. Are you a professional?

Git == version control

Version control is to 'track changes' for a text document as Superman is to Tarzan, as Microsoft Word is to TextEdit, as New York City is to Detroit. Firstly, with version control, you are tracking changes to an entire project, not a single document. Changes are tracked line by line, with comments and attribution, through time as the project grows, changes, and splits; as subroutines develop into full projects of their own; as new owners take control of the code base. It is flexible and thorough and reliable. You should learn to use it.
Use case: You've got some code you developed for your thesis that you want to upload to GitHub. Maybe someone else will find it useful. There are GUI front ends to git, which may help with many tasks, but git was designed to be run from the command line, and this simple use case is not too difficult to master at that level, so let's just go for it.

Download and install git

Walk through the steps in GitHub's set-up guide. Today, they suggest that you download their native app, but note that this only manages part of the workflow. The steps you have to do to get git working still involve the terminal. Be brave. The steps are:
  1. get a GitHub account
  2. download and install git
  3. link your local system to your GitHub account
    • tell git your name, email, GitHub login information
    • set up security keys so that GitHub knows you are you
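The 'tell git your name and email' part of step 3 can be sketched as follows. This is a minimal example, assuming git is already installed. The name and email are placeholders, and the first command points HOME at a scratch directory purely so that the demo leaves your real settings alone. Setting up security keys is covered in GitHub's own guide and is not shown here.

```shell
# Demo only: use a scratch HOME so your real ~/.gitconfig is left alone.
# Skip this line when you configure your own machine.
export HOME="$(mktemp -d)"

# Tell git who you are (placeholder values; substitute your own).
git config --global user.name "Your Name"
git config --global user.email "you@example.org"

# Read the settings back to confirm they were stored.
git config --global --get user.name
git config --global --get user.email
```

Git attaches this name and email to every commit you make, which is how collaborators can see who changed what.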

Prepare your codebase

There are a couple of adjustments you probably want to make before releasing your code in the wild. GitHub recommends that all code comes with a License, a Readme and a .gitignore file.
  • License If your code comes with a license, it's easier for collaborators to re-use the code and build on it. Specifically, it makes clear what they are allowed to do. It may not be important to you, but your code will be easier to share if you make it clear what the rules are.
  • Readme
    You have to explain your work at some point. If you do it in a file named README, GitHub will automatically put it on the front page of the repository. This is very helpful for anyone trying to understand what you did and why. The README can be just a text file. If it is in markdown, perhaps with additional flavoring GitHub will render it with headings and styles, which is much nicer for the reader and not difficult for the writer.
  • .gitignore
    Some files don't need to be tracked. For instance, some old Mac directories contain .DS_Store files with directory display information for the Finder app. That doesn't need to be part of the repository. So here's my .gitignore for an old MatLab project:
      $cat .gitignore
      .DS_Store 
    
    Pretty simple. You might also add *.log or tmp/ to the .gitignore file, depending on your context. Basically, any files that are automatically generated or updated on compile should not be tracked.
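If you want to see a .gitignore in action before trusting it with your own project, here is a small sketch, assuming git is installed. It works in a throwaway directory, and the file names (run.log, analysis.m, tmp/scratch.txt) are made up for illustration.

```shell
set -e
demo="$(mktemp -d)"   # throwaway directory for the demo
cd "$demo"
git init -q

# Write the ignore rules discussed above, one pattern per line.
printf '%s\n' '.DS_Store' '*.log' 'tmp/' > .gitignore

# Create some files: three that should be ignored and one that should not.
touch .DS_Store run.log analysis.m
mkdir tmp
touch tmp/scratch.txt

# Ask git which rule matches each ignored path.
git check-ignore -v .DS_Store run.log tmp/scratch.txt

# Only .gitignore and analysis.m should be listed as untracked.
git status --short
```

The .gitignore file itself is tracked like any other file, so collaborators get the same rules when they copy your repository.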

Make a local repository

The project you want to get onto GitHub probably consists of a directory or directory tree containing a series of text files, and possibly some image files or data files. In order for git to track changes to this project, you have to put these files into a repository. This is simple to do once you've prepared it for sharing.
Find your command line. On a Mac, you can use the Terminal app. Navigate to the directory holding your project. If you have never ever used the command line before, this might be challenging. If you want to dive in, you can certainly do so with three little commands: ls, cd, pwd. You can look at the man pages for these commands by typing, for example $man ls, or you could try a crash course in using the command line.
Once you get to your project directory, type:
$git init   
You should see a message from git, something like:
Initialized empty git repository in /Users/suz/programming/octave/OrX/.git/
This initialises an empty repository, which lives in a hidden directory named .git. You can check that it's there by typing
$ls -a
Git gives some feedback about what it has done, but I often find it useful to check with
$git status
after each command to see what has happened.
Now we can get the project into the repository. To do this, first type
$git add .   
This prepares git to add your files to the repository, a process known as 'staging'.
Note: The . tells git to stage all the files in the directory tree to the repository. This is very handy when we add files because they don't all have to be specified by name. On the other hand, it isn't ideal, because there are often binary files or .log or even image files that update automatically. You won't want to keep track of those changes. Fortunately, git will automatically look for the .gitignore file we already prepared to get the list of exceptions.

Ready for some commitment? Type:
$git commit -m 'initial commit' 
You should get a full list of the files that git is committing to the repository. Check the status when it's done and you should get a reassuring message:
# On branch master
nothing to commit (working directory clean)

Success! Now your project is actually in the repository and git can track any changes to the files. The repository will also keep all the messages that you put with each commit. Always use the -m flag and take the time to add a meaningful message.

Congratulations! Your code is now in a git repository, under version control. You are ready to collaborate.
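The whole init, add, commit sequence can be rehearsed safely before you try it on real work. The sketch below assumes git is installed and uses a throwaway directory with one made-up file (analysis.m) and a demo identity set for this repository only.

```shell
set -e
demo="$(mktemp -d)"   # throwaway stand-in for your project directory
cd "$demo"
echo "x = 1;" > analysis.m
echo ".DS_Store" > .gitignore

git init -q                                # create the .git directory
git config user.name "Demo User"           # demo identity, this repository only
git config user.email "demo@example.org"

git status --short                         # files show up as untracked (??)
git add .                                  # stage everything not ignored
git status --short                         # files now show up as added (A)
git commit -q -m 'initial commit'          # record the snapshot with a message

git status                                 # should report nothing to commit
git log --oneline                          # one line per commit so far
```

Checking git status between steps, as suggested above, is the easiest way to see what each command changed.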

Share your work


Make a repository on GitHub

  1. Log into your GitHub account
  2. Create a new repository on GitHub
    On your profile page, in the 'Repositories' tab is a list of repositories that you've contributed to. At the upper right should be a bright green 'New' button.
  3. Follow the directions in the form, adding an informative description so others know what treasure they have found.
Congratulations! You have a GitHub repository to share your code from!

Link the repositories

In git terminology, the current state of your code is the head of a branch. By default, the first branch is called 'master', and your initial commit lives on it. You can make other local branches, and probably should, to try out new features. A copy of the repository hosted somewhere else is called a 'remote', and it has branches of its own. At this point, your new GitHub repository is essentially an empty remote. By custom, this remote is referred to as 'origin'. To point git to it, type the following (on one line), with your own username and project title:

$git remote add origin https://github.com/username/project.git

This command translates roughly as "Dear git; Please be so kind as to add a connection to a remote repository. I will be referring to the remote repository as 'origin' in our future correspondence. It can be found at https://...... Thank you for your kind assistance in this matter. Sincerely, yours truly, esq."
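To check that the link was recorded, you can list the remotes git knows about. A minimal sketch, assuming git is installed; it uses a throwaway repository and the same placeholder URL as above, and it needs no network access because adding a remote only stores the address.

```shell
set -e
demo="$(mktemp -d)"   # throwaway repository for the demo
cd "$demo"
git init -q

# Register the remote under the conventional name 'origin'.
# username/project is a placeholder; use your own account and repository.
git remote add origin https://github.com/username/project.git

# List configured remotes with their fetch and push URLs.
git remote -v
```

If the URL has a typo, git will not complain until you try to push, so it is worth a glance now.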

Upload your code

Ok, ok, there is no upload on GitHub, but it is payoff time. Once you have a local repository linked to a remote repository, you can just push the code from one to the other.

$ git push -u origin master

Translation: "Dear git; Please push the commits on my local branch, known as master, to the remote repository known as origin. Also, please remember origin's master as the 'upstream' of this branch, so that a plain git push or git pull knows where to go in future. Thank you again for your kind assistance. I am forever in your debt. Sincerely, Yours truly, esq. and, etc."

Success!!! You have now successfully pushed your code to GitHub.
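If you would like a dress rehearsal without touching GitHub, the sketch below pushes to a local 'bare' repository that stands in for the GitHub one. That substitution is an assumption made so the example runs offline; with a real GitHub repository the push command is the same, only the remote's address differs. The file name and identity are made up.

```shell
set -e
work="$(mktemp -d)"     # stands in for your project directory
remote="$(mktemp -d)"   # stands in for the GitHub repository

# An empty bare repository plays the role of the new GitHub repo.
git init -q --bare "$remote"

cd "$work"
git init -q
git config user.name "Demo User"          # demo identity, this repository only
git config user.email "demo@example.org"
echo "x = 1;" > analysis.m
git add .
git commit -q -m 'initial commit'
git branch -M master    # make sure the branch is named 'master', as in this post

git remote add origin "$remote"
git push -q -u origin master

# The remote should now hold the same commit as the local branch:
git ls-remote origin master   # commit id stored on the remote
git rev-parse HEAD            # commit id of the local branch
```

The two commit ids printed at the end should match, which is git's way of saying both repositories hold the same history.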

Or at least I hope you were successful. If not, if you've tried to follow this post and the directions at GitHub and you still feel lost, there is more help out there. Many universities are running Software Carpentry bootcamps to help students and faculty develop more professional programming skills. The skills taught aim to improve software collaboration and impart the skills needed to carry out reproducible research. Two key tools they teach are version control with git and collaboration via GitHub.
Live long and collaborate!

Wednesday, 29 January 2014

UV reactive bead spectra

Background

I was interested in designing some educational experiments using UV-reactive beads to teach about the presence/absence and intensity of UV light under different atmospheric conditions. The beads turn from clear or colorless to various bright colors when exposed to UV light. Unfortunately, these beads are just too sensitive – they even react strongly to the stray 400 nm light coming through glass windows, and they reach full color intensity within a minute, even at 51° N in February! It's fun to watch, but not very useful for detecting physiological UV conditions for vitamin D synthesis.

I don't have a lot of scientific information about these beads, so the range of UV light needed for the color change is not well characterized, and the resultant absorbance spectra of the beads (related to their apparent colors) are not readily available. I decided to study them with a reflectance spectrophotometer to see what I could learn.

It may not be useful for the educational project I had in mind, but I'll write it up here anyway, along with the R-code used to analyze the data….

Experimental

I ordered UV Reactive beads from UV gear, UK.

I used a table in the garden on a sunny day for the experiments. The sky was open directly above and direct sunlight came from the south. There were a few leafless tree branches in the way, and some buildings nearby, so it was not full sky exposure, but at least there was full direct sunlight. The table was covered with a small white blanket to provide a consistent background. The beads equilibrated for 10-15 min in direct sunlight before spectral measurements were made.

I used an Ocean Optics model USB2000+ spectrometer connected to a 1 meter UV-Vis fiber optic cable (part number QP400-1-UV-VIS from Ocean Optics) with a cosine correction filter at the end (Ocean Optics part CC-3-UV-S) to minimize stray light. The spectrometer was connected by USB to a MacBook Pro (OS X Mountain Lion 10.8.1) running SpectraSuite software for data collection. Spectra were measured with 30 averages in reflectance mode and saved as tab-delimited text files. The reflectance measurements required a 'reference spectrum' (light spectrum) of the white blanket and a second, 'dark' reference, which was obtained by blocking the end of the cosine correction filter. To collect the reflectance spectra, I pointed the fiber optic light pipe at individual beads from a distance of < 0.5 cm.

Spectral analysis can be done in the SpectraSuite software; however, many basic functions had not (yet?) been implemented in the OS X version of the software. Also, I wanted to practice my R skills, so I decided to load the data into R and see what I could learn. R has multiple spectroscopy packages, and in fact a new one has come out since I started this project. I looked into using hyperSpec, but the data structure seemed overly complicated for the simple analysis I had in mind. So what follows is my simple R analysis of reflectance data on sunlight-exposed UV beads.

Load in the Data

# change working directory
try(setwd("/Users/suz/Documents/vitD Schools/UV bead exp 19022013/spectrometer data/"))

# data is tab-separated, with a header. The end of file wasn't recognized
# by R automatically, but a read function can specify the number of rows
# to read.
spec.read <- function(spec.name) {
    read.table(spec.name, sep = "\t", skip = 17, nrows = 2048)
}

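# the function can then read in all the files ending in '.txt' and put
# them in a list. 'pattern' is a regular expression, so escape the dot
# and anchor the match at the end of the file name.
spec.files <- list.files(pattern = "\\.txt$")
df.list <- lapply(spec.files, spec.read)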

# convert the list to a matrix: the first column is the wavelengths and
# the other columns are experimental data -- one reflectance measurement
# at each wavelength.
spec.mat <- matrix(df.list[[1]][, 1], nrow = 2048, ncol = (1 + length(spec.files)))
spec.mat[, 2:11] <- sapply(df.list, "[[", 2)

matplot(spec.mat[, 1], spec.mat[, 2:11], type = "l", lty = 1, lwd = 3, xlab = "wavelength (nm)", 
    ylab = "reflectance", main = "UV color changing beads, after UV exposure")
text(500, 25000, "Raw Data")

Looks like we need to do some clean-up!

Clean up the data

Baselines and edges

At the edges of the spectral range, the reflectance data is dominated by noise, so it isn't useful. The baselines for the different spectra also need aligning, and we'll scale them to the same intensity for comparison. The intensity range observed in the data depends somewhat on the angle of the probe, even with a cos filter in place.

# define terms for the processing:
nm.min <- 400
nm.max <- 800  # edges of the displayed spectrum
base.min <- 720
base.max <- 900  # define baseline correction range
peak.range.min <- 420
peak.range.max <- 680  # where to find peaks for scaling.

# remove ends of the data range that consist of noise
spec.mat <- spec.mat[(spec.mat[, 1] > nm.min) & (spec.mat[, 1] < nm.max), ]

# normalize baselines, set baseline range = 0.
spec.base <- colMeans(spec.mat[(spec.mat[, 1] > base.min) & (spec.mat[, 1] < 
    base.max), ])
spec.base[1] <- 0  # don't shift the wavelengths
spec.mat <- scale(spec.mat, center = spec.base, scale = FALSE)

Choose colors for the plot by relating the file names to R's built-in color names.

bead.col <- sapply(strsplit(spec.files, " bead"), "[[", 1)
# replace un-recognized colors with r-recognized colors (see 'colors()')
bead.col <- gsub("darker pink", "magenta", bead.col)
bead.col <- gsub("dk ", "dark", bead.col)
bead.col <- gsub("lt ", "light", bead.col)
bead.col <- gsub("lighter ", "light", bead.col)

# plot corrected data
matplot(spec.mat[, 1], spec.mat[, 2:11], type = "l", lty = 1, lwd = 3, col = bead.col, 
    xlab = "wavelength (nm)", ylab = "reflectance", main = "UV color changing beads, after UV exposure")
text(500, 10, "Baseline Corrected")

From this plot, we can see that the lighter colored beads have smaller peaks than the darker beads. The lighter color probably represents less dye in the beads. There seems to be a lot of variation in the peak intensity of some beads, particularly the yellow beads and the dark blue beads. Based on the width of the peaks, the purple and magenta beads appear to have mixtures of dyes for both pink and blue colors. The dark blue beads appear to be either a mixture of all the dye colors or a mixture of pink and blue, but much more dye is used than for the paler pink or blue beads. The yellow bead spectrum is oddly shaped on the short wavelength side, probably due to instrument cutoffs around 400 nm.

Scaling and smoothing

Now I'll scale the data to the same range. It turns out that the R command 'scale' is perfect for this.

# scale the peaks based on the min reflected intensity
peak.range <- which((spec.mat[, 1] > peak.range.min) & (spec.mat[, 1] < peak.range.max))
spec.min <- apply(spec.mat[peak.range, ], 2, min)
spec.min[1] <- 1  # don't scale the wavelengths
spec.mat <- scale(spec.mat, center = FALSE, scale = abs(spec.min))

The spectra are also jittery due to noise. This can be reduced by smoothing; the code below applies a 10-point moving average to each spectrum.

data.mat <- spec.mat[, 2:11]
dataf.mat <- apply(data.mat, 2, filter, rep(1, 10))
dataf.mat <- dataf.mat/10
specf.mat <- matrix(c(spec.mat[, 1], dataf.mat), nrow = dim(spec.mat)[1], ncol = dim(spec.mat)[2], 
    byrow = FALSE)
matplot(specf.mat[, 1], specf.mat[, 2:11], type = "l", lty = 1, lwd = 3, col = bead.col, 
    xlab = "wavelength (nm)", ylab = "reflectance", main = "UV color changing beads, after UV exposure")
text(500, 0.2, "Scaled and Smoothed")

I could attempt to prettify the spectra further by using actual colors from pictures of the beads. It looks like the Bioconductor project has a package 'EBImage' that should be just what I want, but it looks like I need to update to R version 3.0.1 in order to run it. So instead I picked some spot colors from a photo using ImageJ.

bead.yellow <- rgb(185, 155, 85, maxColorValue = 255)
bead.orange <- rgb(208, 155, 82, maxColorValue = 255)
bead.purple <- rgb(110, 25, 125, maxColorValue = 255)
bead.pink <- rgb(190, 100, 120, maxColorValue = 255)
bead.dkblu <- rgb(22, 18, 120, maxColorValue = 255)
bead.ltblu <- rgb(130, 135, 157, maxColorValue = 255)
bead.dkpink <- rgb(200, 25, 140, maxColorValue = 255)

bead.col2 <- c(bead.dkpink, bead.dkblu, bead.dkblu, bead.pink, bead.ltblu, bead.orange, 
    bead.pink, bead.purple, bead.yellow, bead.yellow)

matplot(specf.mat[, 1], specf.mat[, 2:11], type = "l", lty = 1, lwd = 3, col = bead.col2, 
    xlab = "wavelength (nm)", ylab = "reflectance", main = "UV color changing beads, after UV exposure")
text(500, 0.2, "Colors from Photo")

Analysis

I can carry this analysis further by quantifying the wavelength of the peak absorbance and the peak width (usually Full Width at Half Maximum – FWHM) for each spectrum. This could be useful in further analyses, reports, or as a feature in a machine learning approach.

To extract this information, I could try to fit a series of gaussians to the peak, representing the fraction of pink, blue or yellow dye present, but the quality of the data doesn't really justify this, particularly as I don't have a good shape for the yellow absorbance peak or adequate reference data for each of the dyes. As a quick and dirty method, I'll take the median position of the data values that are at 95% of peak. That should give something in the center of the peak. Since the peaks have all been scaled, that corresponds to the center of the data values < -0.95.

Likewise, the usual peak width (FWHM) would be the range of values < -0.5; however, the poor baseline at shorter wavelengths makes this range more or less unusable. For this reason, we can take a better-behaved approximate peak width measurement as the range of reflectance values < -0.8.

# approximate peak wavelength
indcs <- apply(specf.mat[, 2:11], 2, function(x) median(which(x <= (-0.95))))
pks <- specf.mat[indcs, 1]

# approximate peak width
lowidx <- apply(specf.mat[, 2:11], 2, function(x) min(which(x <= (-0.8))))
highidx <- apply(specf.mat[, 2:11], 2, function(x) max(which(x <= (-0.8))))

pkwidth80 <- (specf.mat[240, 1] - specf.mat[140, 1])/100 * (highidx - lowidx)

features <- data.frame(pks, pkwidth80, bead.col, bead.col2)
features <- features[order(pks), ]
features
##      pks pkwidth80  bead.col bead.col2
## 10 436.4     75.21    yellow   #B99B55
## 6  449.8     88.86    orange   #D09B52
## 9  451.7     98.07    yellow   #B99B55
## 7  529.3    111.35      pink   #BE6478
## 4  533.6     99.18 lightpink   #BE6478
## 1  549.8    110.61   magenta   #C8198C
## 8  568.7    133.10    purple   #6E197D
## 3  581.1    179.93  darkblue   #161278
## 2  585.4    136.79  darkblue   #161278
## 5  603.0     76.32 lightblue   #82879D

# save your work!
write.csv(features, "UVbead data features.csv", quote = TRUE)

One side effect of this is that we now have a small table of metrics that describe the relatively large original data set reasonably accurately and could be used to classify new data. In current data analytics parlance, this is known as 'feature extraction'. In traditional science, these characterizing features could be combined with others from different studies (crystallography, electrochemistry, UV-vis, IR, Raman, NMR, …) to help predict the effects of a chemical change or a different chemical environment on the behaviour of the molecule. Such studies are traditionally used to help direct synthetic chemists toward better products.

Conclusions

From this analysis, we can see that the blue beads are absorbing at longer wavelengths than the yellow, and the pink and purple beads have absorbances in between. The darker colors have broader absorbance peaks than the lighter colors, with the darkest blue having a range that appears to cover the purple, yellow and blue regions. These darker beads probably contain combinations of the dyes used for the different colors, and not just more of the dye used for the light blue beads.

The reflectance spectra are not as observant as our eyes. For instance, the yellow and orange beads are readily distinguished by eye, but not so clearly in the observed spectra or the extracted features. This may be due to the 400 nm cutoff of the light pipe, which distorts the peak shapes for the yellow and orange beads. Ideally, we could observe the changes in the absorption spectra over the whole range 280-800 nm as a function of time. The current reflectance spectrometer setup, however, is only capable of capturing the 400-750 nm range. This means that we do not have access to the interesting behavior of these dyes at shorter wavelengths.

Chemically, I expect that absorption of the UV light in the 300-360 nm range causes a reversible conformation change in the dye molecules, as has long been known for the azo-benzenes and their derivatives. After the conformation change, the absorption maximum of the dye is shifted to a much longer wavelength. If the UV dyes in these beads are very closely related to each other, it is likely that the yellow beads absorb at relatively short UV wavelengths and the blue beads at longer wavelengths before the transition; however, without measuring the UV absorption spectra we cannot know this. It is quite possible that their UV absorption spectra are nearly indistinguishable and the effects of their chemical differences are only apparent in the visible spectra. Without access to better hardware or more information about the molecules involved, there is no way to know.

Next steps

We do have access to one thing that varies in the appropriate way. The experiments shown here were taken at relatively low UV light levels (February in England). During the summer, the angle of variation of the sun is much larger. At dawn, it is at the horizon, while at noon, it is about 60° higher. Since light scattering in the atmosphere is a strong function of wavelength, the relative intensities of light at 300, 360, and 400 nm will vary with the angle of the sun. If the beads have different absorption peaks in the UV range, the time dependence of their color changes should vary by the time of day.

The simplest way to measure this is not with a spectrometer, but probably by following color changes in video images.

Tuesday, 10 December 2013

NHS hack day thoughts

What a glorious weekend! The sun finally came out, and while my family was out enjoying the countryside and biking to the beach, I spent it in central London at the NHS Hackday.

Not that I'm complaining. I had a great time. There were some really talented people there, and some very committed parents, doctors, programmers and just plain technically minded people. It was a really interesting weekend.

The hackers present were a pleasant mix of odd-ball grad student types, young doctors, programmers, developers and random 'IT' people, all with an interest in trying to contribute something to the efficiency, ability and smooth running of the NHS. The elephant in the room is whether any of these projects will ever become useful. Several applications had very useful ideas. The winning app 'Waitless' aimed to provide an SMS service in which people could send an SMS and get information about the distance to local NHS services and the likely waiting time once they get there. This way, someone with an earache could make an informed choice to go to their local walk-in clinic instead of the A&E department depending on wait times, opening hours, and distance.

The progress made on these apps over the course of the weekend was astounding, with several nearly becoming usable services, and certainly good proof of principle demos by the time of presentation at 3pm on Sunday. It is truly amazing what a team of 8 can get done in a weekend with modern developer tools, available APIs, open source software, and online services. Wow.

And me? Well, not so much...
I ended up working on the aptly named FAIL project: 'Fatal Accident Inquiry Learning', which attempted to apply machine learning techniques to Scottish Fatal Accident Inquiry (FAI) Reports. Unfortunately, I spent most of the first day struggling to get nltk and various supporting technologies installed on my system and most of the second day learning the very basics of working with these technologies.  Carl had somewhat more success in getting some snippets of the reports into Carrot2, but the results were less than impressive.

Challenges:
1) I need more experience with Python
Everything I know about Python, I learned from Codecademy and from my homework for 'Intro to Data Science' on Coursera. The homework was a pretty good introduction for this project, as it involved sentiment analysis of a tweet stream. I was able to do some basic filtering and use the structure of the base homework code. There were some difficulties in translating the learning from the homework analyses of short tweets to these much larger, richer records, and I struggled to create a workflow between parts of the analysis.

2) I need more time with natural language processing
The FAIs are long legal text documents. It is possible to extract text from them, but it isn't easy. The text has some consistent elements, but is not in a consistent form. Some dates are in 'day month year' format, while others are given as 'month day, year' and still others are 'day of the month of year'. This makes it somewhat difficult to even extract basic information such as the age of the victim. It should be possible to get this by using a grammar with NLTK, but well, I didn't manage to come to grips with it in the 45 minutes I gave it. Perhaps not surprising. Similarly with bi-grams, tri-grams, collocations and pointwise mutual information (PMI) to collect information on phrases and unusual words. I learned a lot, but wasn't able to put much of it into use. Yet.

...hopefully I'll find some time to try out topic modelling on this data at some point.

3) Relevance of the project
Ideally, we'd like to analyse these reports and make some inferences that are relevant for the NHS and that could lead to improvements in quality of care. Unfortunately, the data set we are looking at is not like hospital episode statistics -- it is not a statistic. Although there were some 1652 fatal accidents in Scotland in 2011, only 28 FAI reports were published that year. Our dataset consists of the 82 such published inquiry results from the last few years. Some inquiries are published long after the incident, but this indicates that inquiries are held for less than 2% of fatal accidents.

Inquiries can be called whenever there are unusual circumstances. They are required in some circumstances, such as when a death occurs in custody. By definition, then, these accidents are the outliers. Some of them are candidates for the 'Darwin Awards': tragedies begot by stupidity. Others are simply tragedies.

The Scottish authorities hold these inquiries with an eye toward preventing further accidents, and such investigations do have impact on our daily lives. Protocols for how often the highway lines are repainted, police guidelines for how people in custody are transported, and yes, even those ubiquitous labels: this is not a toy; not for children under 3 yrs of age; do not play on or around. FAIs are the fault-checking analyses that lead to health and safety advice.

So we tried various approaches to extracting information and comparing text in the reports, but ultimately we did not come up with a truly compelling use case for the data or inferences from it.

Observations on the data:
  • Accident statistics are interesting reading.
  • Each of these accidents is a story of its own. 
  • Men are in more fatal accidents than women. For all ages, nearly 65% of the accident victims are men.  For men < 65 years of age, the ratio climbs to nearly 75%. 
  • Fatal accidents are more common in older people. In 2011, 57% of the male accident victims were over 65 yo. At the same time, among female victims, 76% were over 65 yo. 
  • In younger age groups, poisoning is the most common cause of accidental fatalities. In 2011, there were no poisonings in children < 15 yo. The statistics include alcohol poisoning.
  • Falls are the most common fatal accident type for people over 65 years, and over 60% of the victims are women, showing a sharp contrast with all other accident types and ages. 
Observations on the hack day:
  • don't forget your coffee cup
  • wander around and see what different groups are doing  -- don't wait until the pub afterwards to find the person with a degree in computational linguistics!
  • what's your goal? if it's social, be social. If it's coding, join a group
  • you will probably learn more from a larger team with more varied skills
  • how competitive do you want to be? 
  • a video of a good use case is impressive
  • it's often more efficient to ask for help


 

Thursday, 7 November 2013

How to run a MOOC: a student perspective

Over the course of the past year, I've followed nine MOOCs on Coursera. I've also used other online learning tools, including Codecademy and Khan Academy. I've enjoyed this period of learning and updated a number of skills, which I hope will be useful in the future.

The MOOCs I've taken have ranged from the wonderfully organised Machine Learning course taught by Coursera co-founder Andrew Ng to the extravagantly disorganised Startup Engineering course, which was primarily led by another Stanford lecturer, statistician, and cofounder of Counsyl, Balaji S. Srinivasan. As more professors become involved in the MOOC phenomenon and try to gain audiences on YouTube and other social media, I thought I'd write up some of my experiences as a student and make some recommendations, or maybe it's a wish-list.

In most MOOCs, the course staff appear to be under-prepared for the demands of the platform. This is a recurring theme, particularly with newly offered classes, so if you are thinking of offering a MOOC, please, please have a beer with someone who has run one and get the full story. It is clearly not an easy thing to do, particularly when the student numbers get large. Coursera courses regularly have > 100,000 students. Jeff Leek's and Roger Peng's post mortem of the course 'Data Analysis' might be a good place to start.

In general, MOOCs tend to have a better student experience when the professor has taught the course material many times before and is not straying too far into new territory. This is particularly true for a first attempt at a MOOC. Bill Howe's course 'Introduction to Data Science' was not a particularly good student experience. I think this was because he tried to add too much to what he had taught before. It's better to be focussed and have only 50,000 happy, learning students than to try to do too much and have 120,000 frustrated, failing students. At least to me. You can plan to change an assignment in the second offering to incorporate new technology.

Recommendations:
  • Lectures: Have every lecture planned very well ahead of time, preferably before the course even starts. Lectures of 7-10 minutes work well. Some students like them longer, but others don't. Leave yourself plenty of time for technology hiccups -- estimate the time it will take, then multiply by 2 and change the units. Hours become days.
  • Resources: If you see 3-5 similar forum threads running simultaneously, each with > 200 comments of people trying to help each other out, you have failed to get the message across clearly in lecture. A few links to additional introductory material can do wonders.
  • Forums: These will be going 24/7. There will always be complaints about the level of the assignment or the language used or something. Some of it will also be interesting discussion that you want to stay on top of - 24/7. Keeping up with the forums can be daunting, so a community TA or two can be useful. Forum organisation is important. As the course progresses, important threads get buried, and useful information is often hidden at the bottom of a long thread. It's useful to have a section of the forum for each separate assignment as well as a section for software issues, platform issues, deadline problems and general discussion. A TA who can summarize important points regularly and point up useful posts is very helpful.
  • Extra interactions: Some students find local Meetups or study groups to be invaluable. 
  • Community TA's: These people are volunteers. Most community TA's appear to be more interested in interacting with the more advanced students and furthering their own learning than in supporting students who are having difficulties. Please review comments made by your community TA's. A few will fall into using snark to glorify themselves. The best ones will highlight useful forum contributions and links to help other students.
  • Assignments: Students have different expectations of assignments. One of the main advantages of an online system is instant feedback. I like assignments that contain questions of different difficulty levels. This lets me identify where I could spend more time and also how solid my knowledge is. I use the feedback from incorrect answers, so having two chances is useful, but a bit stressful. Having five chances is better. It's fine if the maximum number of points gained is reduced when more than two chances are used (e.g. automatic 20% reduction at the 3rd submission). This can be invaluable for international students who may have difficulty interpreting the questions. 
  • Deadlines: MOOC students are often unable to accommodate deadlines. Reasons vary. For me, I can put in 8-10 hours / week when my kids are in school, but during Fall break week, I might manage 2-3 hours. Other students have occasional work deadlines, or a long-planned vacation. Most MOOC students are managing to set aside a few hours for studying from otherwise busy lives, but those lives occasionally interfere. One useful approach is to have 10 late days that can be applied at any time. This means that if I join the MOOC late and miss the 1st deadline by 2 days, I can use two of my late days. If I have to miss a deadline because much of my weekend was taken up throwing a birthday party for my 5 year old, I can use another one. If the online system we were supposed to use is too overloaded and breaks down, students can apply late days to shift the deadline to a time when the system is less busy (and therefore functioning). This gives flexibility and responsibility to the students, which is really nice. Some teachers disagree, though.
  • Timezones: Your students will not only come from all over the world, as in a modern classroom, but will actually be all over the world. This means that you must be aware of time differences. Time zones for deadlines and for release of new lectures make a difference, but more importantly, if you require students to do an online collaboration of some sort, allow them to log in at different times. Some students can find hours in the middle of the day, while others only find them late at night, and those times are staggered all around the globe. Consider grouping time-zone regions so students can choose to participate at a convenient time. (Hint: 2am in China is not convenient.)
  • SNAFU's:  Things won't go as planned the first time. It will reflect better on you as a teacher and on your institution if you can adjust as needed. Jeff Leek had to drop a code reproducibility portion of his grading, and Antonio Rangel had to drop an experimental interactive market. Please be flexible in your use of new technology. By all means, try it out, but be aware of student needs, which will vary -- not everyone has an American credit card, which some web services require for registration, even if nothing will be charged. 
It will be interesting to see how MOOCs develop. These new online platforms are effective for learning, and education as we know it is clearly changing. So, as soon as my 10 year old finishes his decimal math on Khan Academy, I'm back to Financial Accounting.

Monday, 30 September 2013

Coursera: Introduction to Data Science (Course Review)

I finished this a few months ago, but it will probably be offered again, so here's a review. This course was taught by Bill Howe at University of Washington and offered as a MOOC on Coursera.

Course Description: 
The Coursera description promises newbie to data ninja in 8 weeks. Workload 8-10 hours/ week. Those of us who have finished our statistical mechanics homework at 2 AM know that such promises are not only empty, but rather a guarantee of a course that is over-ambitious. As the description implies, this is an overview course that tries to do too much. Every student should realise this at the outset: an introductory course that claims to cover everything is certain to be a rough ride. (I know. I've taught some.)

Lectures: 
The lectures covered some really interesting content, and the lecturer appears to know the industry very well, particularly the Microsoft perspective and tools. I assume that many of his continuing education students are Microsoft employees who want to update particular skills. Such students are not the run-of-the-mill for UW.

When I was a TA at UW, the undergraduates needed slow feeding with very small spoons. Lecture halls were filled with slumped bodies under baseball caps. The evening courses, in contrast, were filled with lively, interested, adults who learned independently and came to class with lists of questions. This course is aimed at those active, interested adult learners. That said, the number of hours listed for this class is a gross underestimate for the material covered and the assignments given.

Professor Howe's lecture style is not always engaging, and a lot of material is covered. There were often over 3 hours worth of lecture material to review during the week. Along with following links and reading supporting papers, this left very little time for the assignments themselves. Prof. Howe did a good job of introducing and comparing a range of current technology choices (particularly the comparison of different database technologies). As a data science newbie, I would have liked a bit more information and emphasis on use cases for different types of databases.

The database parts of the course were well presented, and this covered subjects that I hadn't seen in my other work. The data analysis elements were not so clearly taught, though, and there are better (slower) ways to learn this material on Coursera. If you have time, Jeff Leek's course 'Data Analysis' covers this much more thoroughly. Andrew Ng's now legendary Machine Learning course is also good, although more mathematically oriented, with less emphasis on organisation, data munging techniques, and communicating results.

Later lectures in this Intro. to Data Science course appeared to have incorrect answers in the in-lecture questions. I got bored of trying to keep track of the errors and inconsistencies in the course. The material needs a thorough editing before the next showing.

Assignments:  
The lectures were not particularly good preparation for the homework assignments. A lot of independent learning was required to make progress in the course. The assignments were also relatively difficult compared to what I expected from the course description. The first assignment was a sentiment analysis of a Tweet stream written in Python. I have a pretty good programming background, having started with Basic back in 1983, visiting Fortran, MATLAB, Igor, Unix utilities, C, Ruby and continuing to Objective-C, R and functional programming in Scala. I pick things up quickly. The course description did not require a programming background, yet I had to spend several hours learning the ins and outs of Python from Codecademy before I could get a handle on the assignments.

The level of the first assignments was not commensurate with expectations from the course description. I learned a lot, more than I expected, in fact, and I can now implement a matrix multiplication in Python, SQL, or MapReduce based on the homework assignments. The auto-grader for the 1st assignment never did accept my answer for the final part. It also didn't give sufficient feedback for me to solve the problem, which probably had something to do with text encoding, but was very frustrating nonetheless. This sort of issue doesn't teach much. Save your perseverance for things that matter.

Overall, the assignments were challenging. I learned a lot, but not always what the point of the assignment was. I think there were a lot of complaints (more than normal) about the difficulty of the assignments, and later assignments were quite a bit easier than earlier ones. Assignments covered:

  • Python:   Tweet stream sentiment analysis
  • SQL:   Queries, tables and matrix multiplication
  • Tableau Visualization:    FAA Bird Strike Data 
    • write-up and peer assessment
    • note: I had to use Tableau via Amazon web services as it only runs on Windows.
  • MapReduce:   data joins, basic network analysis, matrix multiplication
  • Kaggle: Take part in a competition (I did facial keypoints detection)
    • write-up and peer assessment: ranking on the leaderboard did not matter.
A couple of the assignment deadlines were changed after the deadline had passed. This is very unfair to people who have worked hard to make the deadline, although it was reasonable in the case of the MapReduce homework where we were using a new web system that was supposed to be able to handle the volume of students. This is a continuing problem with MOOCs that have > 100k students enrolled. Any time the professor makes an assignment that will run on new technology, be prepared for a very frustrating experience. In my opinion, new web-based technology should not be used for graded assignments in MOOCs. They should be tested first as an optional assignment or a staged assignment so that 100k students are not accessing it in the same week.

Overall Recommendation:
Students:  I hope that the professor will offer this course in a pared-down form. As it is, if you're already awesome at Python and SQL, go ahead and dive in. Everyone else should consider this a taster course and audit only, at least with the current assignments. Be selective about which parts you choose to look at. If you experience slowdowns or poor behaviour with particular technologies in the assignments, put it aside and try again when the course is over or the deadline is passed. It seems like a class, but it's a free platform and you get what you pay for. A lot of professors are using this to try out new technologies, so don't expect it to all work as advertised.

Sunday, 7 July 2013

Biking with Children

Spurred by some recent Twitter discussions linking to this StackExchange discussion, it's time to finally write this post. I've been thinking about it for an age.

My son was born in Amsterdam, and my daughter in London. We have been, well, not serious bicyclists, but commuters and heavy bicycle users for many years. This didn't stop when we had kids. In Amsterdam, we lived in the center and did everything by bicycle. We had no car, and didn't really miss it. Over our 8 years in London, we've spent 2 years car free, doing the shopping, day trips, etc. by bicycle or by public transit. We aren't sport bicyclists, bent on training and performance, but, well, maybe we're some shade of green.

Early days: < 6 months
Since bicycling is such an integral part of life in Amsterdam, within a couple days of birth, I found myself carrying my baby in a sling while riding my very stable Dutch bike around Amsterdam at slow speeds. This got us to the first health appointments at the GGD, but wasn't really a safe solution. My knees bumped the sling, too, so it wasn't very comfortable. Every time I got back home, I felt like I'd 'gotten away with it' again.

We needed to be able to get around, and a car just wasn't a useful solution, so we got a baby-mee. Baby G's nice, safe carseat clicked into a somewhat springy metal rack attached to my bike rack. Now we could put him in his car seat and continue our bicycle explorations. We got to see some nice places that way.

This was a great way to travel. I could get out of the house really easily with the new baby. He  wasn't heavy yet, so didn't add any instability to the bicycle, and he was strapped into a car seat with decent side impact protection. A low speed fall to the side would be safe. We had a rain cover that zipped over the top when needed. It could also be moved from one bike to another fairly easily, which was great for when my husband was in charge. Mostly it lived on my bike, though, and it was brilliant for trips to the doctor, the zoo, the park.

But eventually, it was time to move on. For one thing, we needed to carry the shopping, too. And a full shopping bag goes on the back. Yes, Dutch people frequently can be seen with two or more large shopping bags dangling from the handlebars, but I learned early: never follow a Dutch person on a bicycle. The frailest of old ladies will turn across the path of an oncoming tram or down a tricky cobblestone alley full of pedestrians. We won't talk about traffic signals. I'm just not at that level; I know myself, and I don't bike well with shopping on the front.

young, but able:  6 - 15 months
From the time a baby can sit up, the Dutch tend to put them in a 'voorzitje', or front seat. Ours hung on the handlebars with two hooks and had a clamp around the stem. Most newer models have an updated attachment that clamps directly onto the stem. The hooks had a disadvantage in that as the child got heavier, so did the steering, but it was fabulous for several months. As far as I could tell, the babies all love these. It was always exciting, and G would stay awake for at least 30 minutes before starting to nod off, which was fine for dashing about town. We also had plenty of room on the back for groceries. The baby's feet are in foot rests, and you can get a good windscreen to keep out the rain and weather. Our model didn't work with drop handlebars, and if the child fell asleep on the way home from the zoo, it wasn't very comfortable; we were constantly trying to support a lolling head with an elbow while riding. This also made it less useful for longer rides.

For this age, it can also be convenient to have a rack attachment that takes a small folding stroller or pushchair. That way you have something to push around at your destination, which can be a lifesaver if you're headed to a garden, shopping mall, museum or similar walking-heavy destination. I didn't have problems with my knees hitting the voorzitje (pronounced for-zit-ya), but my husband had to spread his knees around it, which isn't the best riding position. If you can, try it out before you buy it.

toddlers:  15 months - 3 years
But after about 15 mo, baby G just grew to be too big. The mini front seats are quite close to the handlebars, so there just isn't room for a larger child. We moved him to a seat on the back. Why not a trailer? Well... we did try out a trailer in Amsterdam once; it was hell. It wasn't the trailer's fault, it was just the nature of Dutch bicycle traffic and Dutch bike infrastructure. The trailer was a bit too wide, a bit too long, and a bit too unexpected to be a comfortable riding experience. One thing that makes a bicycle so useful in Amsterdam is that it is narrow. In a town of narrow, medieval streets, this is essential. Cars generally allowed us about 2 inches of space, because that's how much space a Dutch bicyclist needs, and that's how much room there was on the streets. There were constant obstructions: delivery vans or moving vans or just something going on along a canal.  To get anywhere, the bicyclist has to swerve through the bollards, through a few parking places or along the sidewalks to pass by the obstruction. Then through the bollards again to regain the street. The trailer just didn't have the manoeuvrability we needed.

The Bobike seat, after 5 years in the shed.
It was worse at intersections. Bicycle traffic is chaotic. There is no lane control, no signalling, and if anyone can help it, no braking. Pulling a long, slow load into the midst of 10 or 20 bicyclists all trying to go their separate ways just wasn't a pleasant experience. It didn't feel socially responsible. Everyone else was carrying multiple children on the nice small footprint of their own bicycle, while we had an extraordinary, traffic clogging load. I wanted a lead bike out front warning everyone we were coming through.

But that was Amsterdam; our later trailer experiences were different (see below). In Amsterdam, G sat in a Bobike seat on the back, attached to the rear stays, not the rack. This, again, constricted the room available for groceries, but I put a small basket on the front and did most of my grocery shopping by foot. If I leant forward, I could wear a backpack, but if I sat upright, it was right in the baby's face.

From 15 months  to 3 years old, this was fine, but this guy grew a bit fast, and by 3 years, the seat wasn't working out any more. The clamps holding it to the stays would slip and *bump* he would come down on my fender, providing a very effective brake. I would get off, unload, move the clamps back into place, screw them down as hard as I could and a bit later, *bump*, down they would come again.

Another child seat option that remains useful.
One cause of this difficulty was that we moved to London shortly after G turned 2. He was riding in the same Dutch seat, but the surface was different. When we took the Thames Path to Kew Gardens, it was just too bumpy, with gravel and potholes and rough edges. He was too heavy, and the seat design wasn't up to it. We considered other seats, but none of them really seemed to solve all the problems. A rack-mounted seat certainly would have been better, but we also wanted to go for some longer rides and have something comfortable for G to nap in; we wanted to swap it between bikes; we wanted it to be a bit safer.


small child:  3 - 6 years
So when we saw a second-hand trailer at a boot sale, we got it. (boot sale in British is equivalent to flea market in American English). It wasn't a high end trailer, but it was functional. It had a clamp that tightened onto the bottom stay, with a safety backup strap and a flexible connector to the trailer towing bar. We could lay the bike down and pick it up again without upsetting the trailer. G could nap in it. It was built for two, so he could take a friend. We could swap it between bicycles. (It was more or less equivalent to this one). I put a mirror on my helmet so I could see the traffic behind and keep an eye on the little one.

At 3 yrs old, G started going to nursery 3 days a week. The nursery was 4 miles away, so I biked it with him in the trailer. He could stay cosy and dry while I battled the hill and got a bit of exercise back into my life. We could stop in the park on the way home if the weather was nice. The trailer was very useful, but it lagged: it didn't start smoothly from a stop, but would pull behind in an uncomfortable oscillation. It wasn't dangerous, just unpleasant. My husband liked the functionality of the trailer, but not the execution: too heavy for longer rides.

The response of traffic is mixed, but most people will give you a lot of room. When people see the trailer, they generally say 'Awww' or 'Wow' or 'what a way to travel'... we get a lot of comments. There is a real feeling of support for this approach. London traffic treats you differently as a bicyclist or as a bicyclist with a trailer. On most roads, I didn't feel pressured to move out of the way or to get a move on. Sometimes the white van guys will even wave at my passenger. There is (slightly) more understanding that you are carrying a load, and the load is precious.

We were wanting some longer rides. We organised a trip to Southern France with some friends, and we bought a Burley. The first several times I pulled it, I kept checking to see that it was actually behind me. It felt extremely smooth and light compared to the other trailer.

G was comfortable and very happy with his space in the trailer. On a long ride, there wasn't so much for him to do, but we took frequent breaks and found a lot of waterfalls. He got more chocolate croissants than any child should expect. We could hand him long pieces of grass and things to look at and hold out into the breeze. He would nap and re-arrange his environment. This is the kind of biking that you cannot do with a child in a seat.

The Burley folded flat and fit into a luggage rack on the Eurostar and TGV trains. It was easy enough to convert that we rode it across Paris from Gare du Nord to Gare de l'Est and folded it again for the second train.

second time around:
And then came P. We live near a busy train track, and the crossing bars are down a lot. During the week it's easily 40 minutes out of the hour. Since G's school and the local shops are on the other side, most of our daily trips involve stairs. I walked everywhere when P was small. I could carry her lightweight pushchair over the stairs, where I wouldn't have been able to manage either a bike with seat or a trailer. I didn't have the baby-mee, but I occasionally put her carseat in the trailer to ride somewhere. There wasn't a very good attachment for this, but I could strap it in for stability, and she was reasonably well-protected. She would eventually complain if the road was bumpy, so we took the speed bumps very slowly and tried to stay off of the Thames Path.

She has missed out on the fun of the voorzitje, but she enjoys going for bike rides. She has her own space in the trailer, and often doesn't want to get out when we arrive. She brings along a stuffed animal and can often be heard making up a song or game while we travel. We haven't considered getting a bike seat for her. Actually, I just cleared out the old rear seat from the shed.


Minimalist child seat.
As a sturdy 4 1/2 year old, P is now going to nursery about 1/2 mile away. We have tried out all the possible methods of transit; some days we even seem to crawl. On the bike, there are three possibilities: the trailer, the back rack or the top tube. One boy at her nursery arrives on a seat on his father's top tube (pictured). When she's in front of me, my knees have to go wide around her, and neither of us is comfortable for longer rides. On the touring bike, I am leaned forward over her, and this riding position is quite cosy. She is forced to lean over as well, just to fit into the available space. On the other hand, being up high and in front is exciting. She has to face the weather. For the past couple of weeks, though, the trailer has been the method of choice.

more than one:
typical scene in Amsterdam, shamelessly borrowed from this blog
Since G was already nearly 6 when P was born, we didn't have to do a lot of riding while carrying two children. I did hand on our 1st trailer to a nearby family who used it happily for the school run. This can be a tricky time - when the younger child is in nursery, there is about a year with three school runs a day, and many young children just aren't up to that much walking. A two seat trailer is a good solution, and generally cheaper and more flexible than a box-bike or tricycle.

Although Dutch towns are filled with scenes of whole families riding on one bike, it can be difficult to manage. The front-seat + back-seat combination is the most common. This works reasonably well when the children are small, but the problem is always how to load it safely. Basically, a toddler's weight on a leaning bike is already half-way to an accident. The best solution seems to be to load the baby in front first, stabilize the bicycle, and have the toddler climb into their seat. Invest in a really solid kick stand that will give more than one point of support, and hold the brakes tightly on to keep the bike from rolling while they climb up. A stop or velcro strap that keeps the front wheel from turning can also help. Even when the children are in place, however, you still have to clip the toddler's straps, which requires an extra hand. I witnessed more than one fall, and more than one utterly frazzled mother. If you're just headed to the park with your partner, distribute the weight and put a seat on each bike.

If you have to lean the bike to get your leg over the top bar, be extremely cautious. Having all the children's weight at the top of the bike makes it very unstable. Holding onto a brake will keep it from rolling, but... it's just a tippy machine at this point.

Summary:
There are many options for riding with your children. Which one(s) you choose will depend on the children's ages and your own bicycling style and needs. We found that seats worked very well in the confines of medieval streets in Amsterdam and also for errands and short trips in London. They work better with a step-through frame (no or low top-tube) and an upright riding position. Outside of Amsterdam, the trailer was the way to go for us. It requires more storage space and actual, normal sized roads. It can carry a heavy load without any instability. We take it on and off the bicycle for each trip, which requires some attention to detail -- never leave off the safety strap! This takes less than 1 minute after a bit of practice. Since we use the trailer for vacations and daily errands, including shopping runs, it has definitely been worth the price.

Right now we are packing to move back to the US, and we'll be taking the Burley with us.










Saturday, 6 July 2013

Idiom in R: results you can C

Computing for Data Analysis was a pretty good introduction to R, but did not really talk about R idiom, which can make the difference between code that runs and code that runs quickly. Here is a basic example.

Using sprintf statements for formatting filenames: Consider a series of files. The goal is to read them all into R, but the filenames include a constant width variable: we're looking to load filenames such as ./data/001.csv and ./data/011.csv up to ./data/999.csv. How do we construct the name strings in R?

The numbers in the file names need to be padded and converted to the appropriate strings. Here are three ways of doing the padding.

The R way:

# setup
directory <- "data"
id <- 1:999

# method 1
pad.R <- function(id) {
    num <- sprintf("%03d", as.integer(id))
    path <- paste("./", directory, "/", num, ".csv", sep = "")
    return(path)
}
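
Because sprintf is vectorised over all of its arguments, the padding and the path construction can also be folded into a single call. The following is only a sketch: the names pad.single and pad.formatC are hypothetical, and the "data" directory is the same assumed value as in the setup above. formatC and file.path are base R functions that give a second spelling of the same idea.

```r
# Assumed setup, mirroring the values used above.
directory <- "data"
id <- 1:999

# Hypothetical helper: one vectorised sprintf call builds the whole path.
# The length-1 `directory` is recycled across every element of `id`.
pad.single <- function(id) {
    sprintf("./%s/%03d.csv", directory, as.integer(id))
}

# Hypothetical alternative: formatC() does the zero padding and
# file.path() joins the pieces with "/".
pad.formatC <- function(id) {
    num <- formatC(as.integer(id), width = 3, flag = "0")
    file.path(".", directory, paste0(num, ".csv"))
}

paths <- pad.single(id)
paths[c(3, 13, 103)]
## [1] "./data/003.csv" "./data/013.csv" "./data/103.csv"
```

Both helpers return the same character vector, so the choice between them is mostly a matter of which reads better to the people maintaining the code.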

A brute-force method:

# method 2
pad.brute <- function(id) {
    num <- rep("", length(id))
    for (n in seq_along(id)) {
        if (id[n] < 10) num[n] <- paste("00", id[n], sep = "")
        else if (id[n] < 100) num[n] <- paste("0", id[n], sep = "")
        else num[n] <- as.character(id[n])
    }

    path <- paste("./", directory, "/", num, ".csv", sep = "")
    return(path)
}

… but we know that for-loops are notoriously slow in R, so we could take a hybrid approach and define a function to take a single number as input and convert it. Then that function could be used with one of R's apply methods to convert the vector in one go.

A hybrid method:

# method 3
padder <- function(num) {
    if (num < 10) return(paste("00", num, sep = ""))
    else if (num < 100) return(paste("0", num, sep = ""))
    else return(as.character(num))
}

pad.hybrid <- function(id) {
    num <- sapply(id, padder)
    path <- paste("./", directory, "/", num, ".csv", sep = "")
    return(path)
}

Comparison

These approaches all give the same results, but their run times are noticeably different.

system.time(path <- pad.R(id))
##    user  system elapsed 
##   0.001   0.000   0.001
system.time(path <- pad.brute(id))
##    user  system elapsed 
##   0.007   0.000   0.008
system.time(path <- pad.hybrid(id))
##    user  system elapsed 
##   0.004   0.000   0.004
path[c(3, 13, 103)]
## [1] "./data/003.csv" "./data/013.csv" "./data/103.csv"

In terms of speed, the three are equivalent when run on one or two elements at a time. However, when run on the full 999-element vector as shown here, both the 'brute force' and 'hybrid' methods are several times slower than sprintf.
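
With only 999 elements the timings are close to the resolution of system.time, so the gap is easier to see on a longer vector. Here is a minimal sketch under stated assumptions: the input is simply the same ids repeated 100 times, and pad.vec, pad.one and pad.each are hypothetical names that stand in for the sprintf and if / else-if approaches. Actual timings will vary by machine, so none are shown.

```r
# Assumed larger input: ids 1..999 repeated 100 times (99,900 elements).
ids <- rep(1:999, times = 100)

# Vectorised idiom: a single sprintf call pads the whole vector.
pad.vec <- function(id) sprintf("%03d", as.integer(id))

# Element-by-element padding, same logic as the if / else-if chain.
pad.one <- function(num) {
    if (num < 10) paste0("00", num)
    else if (num < 100) paste0("0", num)
    else as.character(num)
}
pad.each <- function(id) vapply(id, pad.one, character(1))

t.vec  <- system.time(a <- pad.vec(ids))
t.each <- system.time(b <- pad.each(ids))

# Both give the same strings; only the elapsed time differs.
stopifnot(identical(a, b))
c(vectorised = t.vec[["elapsed"]], elementwise = t.each[["elapsed"]])
```

vapply is used here instead of sapply only because it checks that every result is a single string; it still calls pad.one once per element, which is where the time goes.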

The discussions on the course forums did give a different perspective. Several self-identified 'professional programmers' preferred the if, if-else, else approach I've used in both methods 2 and 3. They considered it more readable and thus more maintainable.

I don't think this is the best approach. If you are a professional programmer, you are familiar with idiom, in whatever language you work in. You know that there are readable, maintainable ways of doing what needs to be done efficiently. At its root, deep down underneath, R is in the C family of languages. The basic in/out is based on the C standard library <stdio.h>. The professional way to use R is to use that R idiom efficiently and in a way that other R programmers will understand.

So learn your sprintf formatting codes. They may look like magic numbers the first time you meet them, but they are systematic and ubiquitous. They will be useful in many other contexts, including modern languages like Python and Java and therefore even Scala and Clojure. They will also speed up your code, and don't worry, most other professionals will understand them.
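
As a starting point, here is a short, non-exhaustive tour of some common formatting codes, as a sketch. The values are made up for illustration, and the variable names exist only so the results can be inspected.

```r
# %d   integer; "03" means pad with zeros to a width of 3
zero.padded <- sprintf("%03d", 7L)          # "007"

# %f   fixed-point; "5.2" means width 5 with 2 decimal places
fixed.point <- sprintf("%5.2f", pi)         # " 3.14"

# %s   string; "-8" means left-justify in a field of width 8
left.just   <- sprintf("%-8s|", "abc")      # "abc     |"

# %x   integer in hexadecimal
hex.value   <- sprintf("%x", 255L)          # "ff"

# %+d  always show the sign
signed      <- sprintf("%+d", 5L)           # "+5"

# %%   a literal percent sign
percent     <- sprintf("%5.1f%%", 12.345)   # " 12.3%"

# Several codes can be mixed in one template.
message.txt <- sprintf("%s has %d rows", "001.csv", 42L)
message.txt
## [1] "001.csv has 42 rows"
```

The same templates carry over almost unchanged to C's printf and to Python's % operator, which is a large part of why they are worth learning once.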