Sunday 30 September 2012

Learning Some R

I'm following the free course Programming for Data Analysis, taught by Roger D. Peng of the Johns Hopkins Bloomberg School of Public Health and the Simply Statistics blog. It's offered through Coursera.org.

I should be over-prepared for this course. It's supposed to be completely introductory, with only limited programming or statistics background required, but I'm interested in learning R. So this provides an interesting introduction where I can find out about MOOCs and how they work. My latest app update is 'waiting for review' on iTunes, so I should be able to find a couple of hours a week to put into it.

The first week's lectures covered downloading and installing R, how to get help, some basic data types, selecting data from vectors, lists, etc., and data input and output. Although this is a programming class, the lectures were presented as slide decks with voice-over. The slides could be downloaded as PDFs, and translations were available as subtitles. Each slide had 2-5 sentences with some information about an R command, possibly including an example. I found this easy to follow, but low on information density. I listened to the lectures while making cupcakes for a bake sale. Then I increased the speed to 1.5x. I'll get more from referring to the PDF slides while doing the exercises.
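For anyone who wants to try the "selecting data" material at the R command line, here is a minimal sketch of the kind of thing the slides cover. The vector `x` and the list `person` are made-up examples of mine, not taken from the course.

```r
# A small numeric vector (values invented for illustration)
x <- c(10, 20, 30, 40, 50)

second      <- x[2]         # a single element; R indexes from 1
first_third <- x[c(1, 3)]   # several elements by position
large       <- x[x > 25]    # elements selected by a logical condition
all_but_1st <- x[-1]        # a negative index drops that position

# A list can hold elements of different types (hypothetical data)
person <- list(name = "Ada", scores = c(90, 85))

who      <- person$name          # extract an element by name
marks    <- person[["scores"]]   # double brackets also extract the element
sub_list <- person["scores"]     # single brackets return a list, not the element
```

The difference between the last two lines is the sort of detail that is easy to miss on a slide and easy to see by typing `marks` and `sub_list` at the prompt.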

There were no suggestions for outside materials, although some resources were referred to in the 'getting help' lecture video. 

Some 30,000+ students have apparently signed up for the course. The discussion board has several pages' worth of questions. The introduce-yourself thread has the most views, with nearly 5000. The other most popular threads have 1100 views or so at this point, so there is quite a bit of activity. Several of the "students" are already quite accomplished with R, and they are posting visualizations and code. This is a great learning resource. The rest of us are trying to share resources we find on the web:

Links to free R resources


It's not clear how many students will make it through the 1st quiz, much less the 1st assignment, but complaints on the discussion forum are significant, and are mostly coming from people without much programming background. Since the 1st assignment is not due until the end of week 2, the 1st week's lectures did not cover all the relevant material, and many students are feeling lost. The 'Not sure where to start…' thread has 2100 views. It would be nice if the 1st lectures were designed to allow you to get started writing code. That's not really necessary to get started using R, but it is necessary to do the course. Fortunately, the boards are monitored, and the lectures for the 2nd week were released a bit early to help with this.

I do wish the lectures had a more theoretical foundation. This is one of my pet peeves about the Unix world in general, although I don't think I'm alone. Nothing ever seems to be related to anything else. Although there is a full lecture on the history of R, it goes through where R was developed and who developed it, but not what its theoretical bases were or why it was made the way it was. With every language there are underlying assumptions about what forms of data are important and how they should be saved and treated. Understanding these can make learning and using the language much easier. No such luck here. Is read.table fundamental? How does it relate to read in Unix or C or other common languages of the early R days? This seems like arcane knowledge, but it is the kind of thing that forms an actual education. If programming languages are taught not as isolated functions and control characters, but as a historical web of intellectual developments, students have a framework for learning. I'm not sure Professor Peng, as a statistician, has this framework himself (it isn't encouraged outside of liberal arts schools), so I'm probably just shouting into the wind.

If you are used to learning in a classroom, the MOOC may seem very unusual. It is somewhere between independent learning and actually going to class. So far, the materials presented in the lectures have not been enough to even get full marks on the 1st quiz. (There's a gotcha about vector recycling in R which will not be apparent unless you either know it already, get lucky, or try it at the R command line.) The best way to learn is probably to have the R command line open and try things out during the lectures. Even using the command line is not made easy, however, because the lectures are not structured as problem, discussion, resolution, but rather as a list of things you might find useful. You have to come up with the examples yourself, and most people will need outside help.
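To show what vector recycling looks like in general (this is my own illustration, not the quiz question), here is a small sketch you can paste into the R command line. The variable names and values are invented.

```r
# Adding vectors of different lengths: R reuses ("recycles") the
# shorter vector until it matches the length of the longer one.
a <- c(1, 2, 3, 4)
b <- c(10, 20)

even <- a + b   # b is treated as 10, 20, 10, 20; result is 11 22 13 24

# If the longer length is not a multiple of the shorter one, R still
# recycles, but it also issues a warning.
uneven <- c(1, 2, 3) + c(10, 20)   # result is 11 22 13, plus a warning
```

The first case is the sneaky one: it runs silently, so if you expected an error for mismatched lengths you will not get one.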

So, many students are pooling expertise on the discussion forum to figure it out. This includes posts that benchmark possible solutions, and a lot of hints on how to get started. The great majority of answers are helpful and supportive. Perusing the forum will make the assignments doable. 

So overall, I'm not particularly impressed with the lecture style or the structure of the teaching and information. On the other hand, I'm impressed with the breadth of the emerging student community. The assignments are challenging, and the outcome should be a reasonably good understanding. This learning won't come from the lectures, though. In other words, follow the excellent suggestions on how to succeed in a MOOC.

And if you think that my complaints must be unique to this course, try looking at this blog post.