Friday 4 April 2014

Course Review: Data Analysis by Jeff Leek

I took Data Analysis on Coursera in spring 2013. It was offered again in October 2013, but it may have been supplanted by the new Data Science specialisation, which starts in just a couple of days.

tl;dr: 5/5, don't miss out.

Overview:

This is a very well-run course that builds on Roger Peng's Computing for Data Analysis (in R) course. Professor Leek does a very good job of introducing tools and habits for 'reproducible research', including ways to organize the data munging problem for reproducibility. Professors Peng and Leek, both of the Johns Hopkins Bloomberg School of Public Health, are leaders in this area, and discuss the challenges and progress on their excellent blog, Simply Statistics. They are also at the core of the newly offered Data Science specialization on Coursera, a nice series of courses that build on each other and cover the various aspects of data analysis in good detail. It's also by far the most accessible data science study track available.

Reproducible research is very important for statistical science in general, and promises to become more important over time. Consequently, this course is really a great opportunity. Professional programmers eventually learn the value of unit testing, documentation, and other approaches that make their code more usable. Professors Peng and Leek are at the forefront of developing the professional statistical analysis standards that will become the TDD + version control + issue tracking of data analysis. In other words: you must take this course if you are interested in professional-level data analysis. Really. Well, actually, you could take all nine courses…

The Data Analysis course is 8 weeks long. In the first offering, it had about 2 hours of lectures each week, plus weekly homework assignments that could add up to several hours. As usual, the Coursera recommendation of 3-5 hours per week is on the low side for most students with little experience. There are also two longer data analysis assignments using real (web-based) data. Be aware that the due date for the first analysis falls just after many of the better statistical analysis tools are introduced in lecture. Both analyses are peer-graded by 5 other students, and the mean of the central 3 scores is used. Peer grading may change in later offerings of the course, as this is one area where Jeff Leek was not sure how the course should be organised. There's a full discussion of this in a Simply Statistics podcast. Also, as one might expect from statisticians, there's data: an interactive graph of completion rates for various online programs. This course had a completion rate of 5.4%, and many of the students active on the forums had done a lot of data analysis before.
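To make the peer-grading rule concrete (5 graders, mean of the central 3 scores), here is a minimal sketch of how I understand it. It is written in Python rather than the course's R, the function name `peer_grade` is my own, and the scores are made up for illustration; this is not the actual Coursera implementation.

```python
from statistics import mean


def peer_grade(scores):
    """Mean of the central three of five peer scores (hypothetical helper)."""
    if len(scores) != 5:
        raise ValueError("expected exactly five peer scores")
    # Sort, then drop the single lowest and single highest score.
    central = sorted(scores)[1:-1]
    return mean(central)


# Made-up scores, not course data: the 40 and the 100 are discarded.
example_scores = [40, 72, 75, 80, 100]
print(peer_grade(example_scores))
```

The point of trimming is that one very harsh or very generous grader cannot move your mark on their own, though it does nothing if most of your reviewers misread the report.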

Recommendations:

Make sure to set aside time for the data analysis assignments. These will easily take 8-10 hours and could take up to 20, depending on how obsessive you become. The data analysis grade is based on peer review, so your reviewers may not understand your analysis, even if it is perfectly correct. There was grousing about this on the forums, but I think the point is clear: the peer graders are not experts in data analysis, and you should never write a report aimed at experts in data analysis anyway, so take that into account when you are writing the report. I had trouble with this on the second data analysis. It involved classifying accelerometer data by the action performed (walk, run, walk up stairs, etc.). The data had been expanded into a set of features derived through filtering, Fourier analysis and other transformations. In my analysis, I used the language of signal processing and sampling to describe this feature set, but this language was not familiar to the peer reviewers, so they didn't realise that I was describing the spacing and number of data points very precisely.