Monday 30 September 2013

Coursera: Introduction to Data Science (Course Review)

I finished this a few months ago, but it will probably be offered again, so here's a review. This course was taught by Bill Howe at the University of Washington and offered as a MOOC on Coursera.

Course Description: 
The Coursera description promises newbie to data ninja in 8 weeks, with a workload of 8-10 hours/week. Those of us who have finished our statistical mechanics homework at 2 AM know that such promises are not just empty; they all but guarantee an over-ambitious course. As the description implies, this is an overview course that tries to do too much. Every student should realise this at the outset: an introductory course that claims to cover everything is certain to be a rough ride. (I know. I've taught some.)

Lectures: 
The lectures covered some really interesting content, and the lecturer appears to know the industry very well, particularly the Microsoft perspective and tools. I assume that many of his continuing education students are Microsoft employees who want to update particular skills. Such students are not run-of-the-mill for UW.

When I was a TA at UW, the undergraduates needed slow feeding with very small spoons. Lecture halls were filled with slumped bodies under baseball caps. The evening courses, in contrast, were filled with lively, interested adults who learned independently and came to class with lists of questions. This course is aimed at those active, interested adult learners. That said, the number of hours listed for this class is a gross underestimate for the material covered and the assignments given.

Professor Howe's lecture style is not always engaging, and a lot of material is covered. There were often over 3 hours' worth of lecture material to review during the week. Along with following links and reading supporting papers, this left very little time for the assignments themselves. Prof. Howe did a good job of introducing and comparing a range of current technology choices (particularly the comparison of different database technologies). As a data science newbie, I would have liked a bit more information and emphasis on use cases for different types of databases.

The database parts of the course were well presented and covered subjects that I hadn't seen in my other work. The data analysis elements were not so clearly taught, though, and there are better (slower) ways to learn this material on Coursera. If you have time, Jeff Leek's course 'Data Analysis' covers this much more thoroughly. Andrew Ng's now legendary Machine Learning course is also good, although more mathematically oriented, with less emphasis on organisation, data munging techniques, and communicating results.

Later lectures in this Intro. to Data Science course appeared to have incorrect answers in the in-lecture questions. I got bored of trying to keep track of the errors and inconsistencies in the course. The material needs a thorough editing before the next showing.

Assignments:  
The lectures were not particularly good preparation for the homework assignments. A lot of independent learning was required to make progress in the course. The assignments were also relatively difficult compared to what I expected from the course description. The first assignment was a sentiment analysis of a Tweet stream, written in Python. I have a pretty good programming background, having started with BASIC back in 1983, visiting Fortran, MATLAB, Igor, Unix utilities, C and Ruby, and continuing to Objective-C, R and functional programming in Scala. I pick things up quickly. The course description did not require a programming background, yet I had to spend several hours learning the ins and outs of Python from Codecademy before I could get a handle on the assignments.

The level of the first assignments was not commensurate with expectations from the course description. I learned a lot, more than I expected, in fact, and I can now implement a matrix multiplication in Python, SQL, or MapReduce based on the homework assignments. The auto-grader for the first assignment never did accept my answer for the final part. It also didn't give enough feedback for me to solve the problem, which probably had something to do with text encoding but was very frustrating nonetheless. This sort of issue doesn't teach much. Save your perseverance for things that matter.

Overall, the assignments were challenging. I learned a lot, though not always what the assignment was meant to teach. I think there were a lot of complaints (more than normal) about the difficulty of the assignments, and later assignments were quite a bit easier than earlier ones. Assignments covered:

  • Python: Tweet stream sentiment analysis
  • SQL: Queries, tables and matrix multiplication
  • Tableau Visualization: FAA Bird Strike Data
    • write-up and peer assessment
    • note: I had to use Tableau via Amazon Web Services as it only runs on Windows.
  • MapReduce: data joins, basic network analysis, matrix multiplication
  • Kaggle: Take part in a competition (I did facial keypoints detection)
    • write-up and peer assessment: ranking on the leaderboard did not matter.
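
For readers wondering what "matrix multiplication in MapReduce" looks like, here is a minimal sketch in plain Python. It is not the course's framework or my submitted homework; the record layout (matrix name, row, column, value) and the function names map_phase, shuffle, reduce_phase and matmul are my own assumptions for illustration. The idea is that each matrix entry is sent to every output cell it contributes to, and each cell then sums the matching products.

```python
from collections import defaultdict


def map_phase(records, n_rows_a, n_cols_b):
    """Emit ((i, k), (matrix, j, value)) for every output cell an entry feeds."""
    for matrix, row, col, value in records:
        if matrix == "a":
            # A[i][j] is needed by every cell in output row i.
            for k in range(n_cols_b):
                yield (row, k), ("a", col, value)
        else:
            # B[j][k] is needed by every cell in output column k.
            for i in range(n_rows_a):
                yield (i, col), ("b", row, value)


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Pair A and B entries on the shared index j and sum their products."""
    a_vals = {j: v for m, j, v in values if m == "a"}
    b_vals = {j: v for m, j, v in values if m == "b"}
    return key, sum(a_vals[j] * b_vals[j] for j in a_vals if j in b_vals)


def matmul(records, n_rows_a, n_cols_b):
    """Run map, shuffle and reduce in sequence; returns {(i, k): value}."""
    groups = shuffle(map_phase(records, n_rows_a, n_cols_b))
    return dict(reduce_phase(key, values) for key, values in groups.items())
```

Feeding in the entries of A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] as ("a", row, col, value) and ("b", row, col, value) tuples, matmul(records, 2, 2) returns a dictionary with (0, 0): 19, (0, 1): 22, (1, 0): 43 and (1, 1): 50. Zero entries can simply be left out of the input, which is why this formulation suits sparse matrices.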
A couple of the assignment deadlines were changed after the deadline had passed. This is very unfair to people who have worked hard to make the deadline, although it was reasonable in the case of the MapReduce homework, where we were using a new web system that was supposed to be able to handle the volume of students. This is a continuing problem with MOOCs that have more than 100k students enrolled. Any time the professor sets an assignment that will run on new technology, be prepared for a very frustrating experience. In my opinion, new web-based technology should not be used for graded assignments in MOOCs. It should be tested first as an optional assignment or a staged assignment so that 100k students are not accessing it in the same week.

Overall Recommendation:
Students: I hope that the professor will offer this course in a pared-down form. As it is, if you're already awesome at Python and SQL, go ahead and dive in. Everyone else should consider this a taster course and audit only, at least with the current assignments. Be selective about which parts you choose to look at. If you experience slowdowns or poor behaviour with particular technologies in the assignments, put them aside and try again when the course is over or the deadline has passed. It seems like a class, but it's a free platform and you get what you pay for. A lot of professors are using this to try out new technologies, so don't expect it to all work as advertised.