Tuesday 16 April 2013

Big Data Hackathon London: A few lessons learned

I spent most of my weekend at the Big Data Hackathon London. I'm not hard-core, and I didn't pull an all-nighter, but then, even the winning visualisation team said the code written between 4 and 7 am was rubbish. Better to get some sleep. This was my first ever 'hackathon', and part of the fun was just observing the phenomenon.

The basics: 

The hackathon was organised by Data Science London, and I found out about it through the Data Science Meetup group. I highly recommend this group if you are interested in learning about current methods and tools in Data Science. Their meetings are very interesting and educational, but you have to be very quick with the RSVP -- there's a lot of competition for the limited spaces. As always, the organisers did a great job. I didn't manage to take home any of the swag or awards, but I certainly drank my share of the coffee. And the whole weekend was totally free. Well done!

The hackathon took place at The Hub Westminster, which was a very nice, light, open space. The talk space holds about 100 people, and there is desk space, plus a stand-up area where the food is served, for milling around and meeting people. The space is well organised, with good provision for internet and power. A pleasure to work in.

  • Lesson learned: Bring your own mug to cut down on waste 

The hackathon had three categories of challenge that teams could enter:
  1. data science challenge 
  2. data visualisation challenge 
  3. free-style data challenge 
Most people who came did not have a team lined up. The winning team in the data visualisation challenge got together when two of them carried signs around saying 'Node.js' and 'd3'; the other two thought this was a good idea and joined them. One of the team members later said that their goal had simply been to improve their JavaScript skills. The visualisation was quite lovely, and should be showing up in a 'major UK publication' someday soon.
  • Lesson learned: MongoDB + Node.js + d3 = powerful stuff 
  • Lesson learned: Connect a team through the technology you want to learn 
The hackathon started out with a presentation on Microsoft Azure and the suggestion that we use a free trial account (good for 3 months) to do our analysis.

After the talk, someone asked about setting up R on the system. I approached them, and that was the beginning of a team. Our team, 'State of the A[R]t', set up an Azure account, and we got R working on an Ubuntu virtual machine without much difficulty. Wenming Ye's blog was helpful for this; it's probably even more helpful if you want to use Python. The Kaggle assessment of the data science submissions relied on the ROCR package, which in turn relied on gplots, so we had to build R 3.0 from source. Fortunately, one of our team was ace at this, and we had it running quite quickly. Meanwhile, the rest of us were looking at the data.
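
Once R was running, the scoring itself is straightforward. Here's a minimal sketch of computing AUC with ROCR, using made-up labels and predicted probabilities -- the actual evaluation was Kaggle's, not this:

```r
# Minimal ROCR sketch: compute the area under the ROC curve.
# The labels and scores below are invented for illustration.
library(ROCR)

labels <- c(0, 1, 1, 0, 1)             # true outcomes
scores <- c(0.2, 0.8, 0.6, 0.3, 0.9)   # predicted probabilities

pred <- prediction(scores, labels)     # pair predictions with labels
auc  <- performance(pred, measure = "auc")
auc@y.values[[1]]                      # the AUC as a single number
```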

  • Lesson learned: Technology is broad and deep. Someone will like doing the parts you hate. Let them do it. (I have to re-learn this continuously. I try to do too much on my own.) 

The hackathon had a tight schedule. There were talks all afternoon, and if I had gone to all of them, I would not have made much progress with the data. However, missing all the talks probably wasn't the best strategy either. Next time, I'll try to keep my head up and look around for the talks that are truly interesting. The talks were presented by the hackathon sponsors, so they highlighted the sponsors' newest technologies. I can only hope that the talks will be posted so I can catch up on the parts I missed.

  • Lesson learned: It's about learning. Ask yourself: what are the learning opportunities today? Will the talks be available tomorrow? 

On Sunday, the data analysis winners each gave a brief account of what they did. We ended up 89th overall, and we only did that well because one of my team-mates took a careful look at the original benchmark code. I don't think we were alone in this, as 15 teams finished within 0.00001 of us. Reassuringly, though, we were working along similar lines to the winning team.

  • Lesson learned: Find a good starting place. 

The benchmark was not quite the simple logistic regression we expected from the description. We would have done much better if we had taken the time to look at the benchmark code as a first step.
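
For what it's worth, this is the sort of simple benchmark we had in mind -- a sketch only, with invented file and column names ('train.csv', 'test.csv', 'outcome'), since the real benchmark did more than this:

```r
# A plain logistic-regression baseline of the kind we expected.
# File and column names are hypothetical.
train <- read.csv("train.csv")
test  <- read.csv("test.csv")

fit   <- glm(outcome ~ ., data = train, family = binomial)
probs <- predict(fit, newdata = test, type = "response")
```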

  • Lesson learned: Work efficiently -- write functions or scripts for each step. 

At 12:32 on Sunday, I had a model that resembled the winning model. Maybe it would have done better than 89th, but I didn't get a chance to find out: it took me too long to turn the model into a submittable prediction. I should have anticipated this, because the first submission also took ages. If I had written some of the steps into functions, it could have been much faster, and the team would have done better.
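
The kind of helper I wish I had written up front looks something like this -- the 'id' column and CSV layout here are assumptions, not the competition's actual spec, but the idea is one call from fitted model to submission file:

```r
# Turn a fitted model into a submission file in one step.
# The 'id' column and output layout are assumptions for illustration.
write_submission <- function(model, newdata, file) {
  probs <- predict(model, newdata = newdata, type = "response")
  out   <- data.frame(id = newdata$id, prediction = probs)
  write.csv(out, file = file, row.names = FALSE)
}

# e.g. write_submission(fit, test, "submission.csv")
```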

Overall, it was great fun. I met some lovely people and I learned a lot. Coursera's offerings, including Jeff Leek's 'Data Analysis' and Roger Peng's 'Computing for Data Analysis', gave me a good background for taking part in this event. Hopefully, learning some Network Analysis and a bit of Scala will prove useful for the next one.