Science Soup: 2013

Tuesday, 10 December 2013

NHS hack day thoughts

What a glorious weekend! The sun finally came out, and while my family was out enjoying the countryside and biking to the beach, I spent it in central London at the NHS Hackday.

Not that I'm complaining. I had a great time. There were some really talented people there, and some very committed parents, doctors, programmers and just plain technically minded people. It was a really interesting weekend.

The hackers present were a pleasant mix of odd-ball grad student types, young doctors, programmers, developers and random 'IT' people, all with an interest in trying to contribute something to the efficiency, ability and smooth running of the NHS. The elephant in the room is whether any of these projects will ever become useful. Several applications had very useful ideas. The winning app 'Waitless' aimed to provide an SMS service in which people could send an SMS and get information about the distance to local NHS services and the likely waiting time once they get there. This way, someone with an earache could make an informed choice to go to their local walk-in clinic instead of the A&E department depending on wait times, opening hours, and distance.

The progress made on these apps over the course of the weekend was astounding, with several nearly becoming usable services, and certainly good proof of principle demos by the time of presentation at 3pm on Sunday. It is truly amazing what a team of 8 can get done in a weekend with modern developer tools, available APIs, open source software, and online services. Wow.

And me? Well, not so much...
I ended up working on the aptly named FAIL project: 'Fatal Accident Inquiry Learning', which attempted to apply machine learning techniques to Scottish Fatal Accident Inquiry (FAI) Reports. Unfortunately, I spent most of the first day struggling to get NLTK and various supporting technologies installed on my system and most of the second day learning the very basics of working with these technologies. Carl had somewhat more success in getting some snippets of the reports into Carrot2, but the results were less than impressive.

Challenges:
1) I need more experience with Python
Everything I know about Python, I learned from Codecademy and from my homework for 'Intro to Data Science' on Coursera. The homework was a pretty good introduction for this project, as it involved sentiment analysis of a tweet stream. I was able to do some basic filtering and use the structure of the base homework code. There were some difficulties in translating the learning from the homework analyses of short tweets to these much larger, richer records, and I struggled to create a workflow between parts of the analysis.

2) I need more time with natural language processing
The FAIs are long legal text documents. It is possible to extract text from them, but it isn't easy. The text has some consistent elements, but is not in a consistent form. Some dates are in 'day month year' format, while others are given as 'month day, year' and still others as 'day of the month of year'. This makes it somewhat difficult to extract even basic information such as the age of the victim. It should be possible to get this by using a grammar with NLTK, but, well, I didn't manage to come to grips with it in the 45 minutes I gave it. Perhaps not surprising. The same goes for bi-grams, tri-grams, collocations and pointwise mutual information (PMI), which can be used to collect information on phrases and unusual words. I learned a lot, but wasn't able to put much of it into use. Yet.
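As a rough illustration of the date problem, here is a minimal sketch that uses only Python's standard library rather than an NLTK grammar. The three patterns are my reading of the formats mentioned above (the third is assumed to look like '2nd day of February 2010'), and the helper names find_dates and age_at are made up for this example. Real FAI reports would almost certainly need more patterns than this.

```python
import re
from datetime import date

# Month names mapped to month numbers (January -> 1, ...).
MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

_MONTH = "(?P<month>" + "|".join(MONTHS) + ")"
_DAY = r"(?P<day>\d{1,2})(?:st|nd|rd|th)?"
_YEAR = r"(?P<year>\d{4})"

PATTERNS = [
    # 'day month year', e.g. "3 March 1950"
    re.compile(rf"\b{_DAY}\s+{_MONTH},?\s+{_YEAR}\b", re.IGNORECASE),
    # 'month day, year', e.g. "July 14, 2009"
    re.compile(rf"\b{_MONTH}\s+{_DAY},?\s+{_YEAR}\b", re.IGNORECASE),
    # Assumed form of 'day of the month of year', e.g. "2nd day of February 2010"
    re.compile(rf"\b{_DAY}\s+(?:day\s+)?of\s+{_MONTH}(?:\s+of)?,?\s+{_YEAR}\b",
               re.IGNORECASE),
]


def find_dates(text):
    """Return every recognised date in the text, in order of appearance."""
    found = {}
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            try:
                d = date(int(m.group("year")),
                         MONTHS[m.group("month").lower()],
                         int(m.group("day")))
            except ValueError:
                continue  # skip impossible dates such as 31 February
            found.setdefault(m.start(), d)
    return [found[pos] for pos in sorted(found)]


def age_at(born, died):
    """Age in whole years on the day of death."""
    before_birthday = (died.month, died.day) < (born.month, born.day)
    return died.year - born.year - before_birthday
```

Calling find_dates on a sentence that contains a birth date and a date of death, then passing the two results to age_at, gives the victim's age, provided the report states both dates in one of these forms.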

...hopefully I'll find some time to try out topic modelling on this data at some point.
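In the meantime, the PMI idea is simple enough to sketch without any libraries. This is only a toy version under my own assumptions: the tokeniser is a crude regular expression, the function name bigram_pmi and the min_count cut-off are invented for the example, and probabilities are estimated from raw counts.

```python
import math
import re
from collections import Counter


def bigram_pmi(text, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from counts.
    Pairs seen fewer than min_count times are dropped, because PMI is
    unreliable for very rare pairs.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bigrams
        p_x = unigrams[w1] / n_tokens
        p_y = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    # Highest-scoring (most strongly associated) pairs first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Pairs such as 'fatal accident' should score higher than pairs involving very common words like 'the', which is the whole point of the measure. NLTK provides the same thing ready-made through BigramCollocationFinder and BigramAssocMeasures in nltk.collocations, which is what I would reach for on the real reports.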

3) Relevance of the project
Ideally, we'd like to analyse these reports and make some inferences that are relevant for the NHS and that could lead to improvements in quality of care. Unfortunately, the data set we are looking at is not like hospital episode statistics -- it is not a statistic. Although there were some 1652 fatal accidents in Scotland in 2011, only 28 FAI reports were published that year. Our dataset consists of the 82 such published inquiry results from the last few years. Some inquiries are published long after the incident, but this indicates that inquiries are held for less than 2% of fatal accidents.

Inquiries can be called whenever there are unusual circumstances. They are required in some circumstances, such as when a death occurs in custody. By definition, then, these accidents are the outliers. Some of them are candidates for the 'Darwin Awards': tragedies begot by stupidity. Others are simply tragedies.

The Scottish authorities hold these inquiries with an eye toward preventing further accidents, and such investigations do have impact on our daily lives. Protocols for how often the highway lines are repainted, police guidelines for how people in custody are transported, and yes, even those ubiquitous labels: this is not a toy; not for children under 3 yrs of age; do not play on or around. FAIs are the fault-checking analyses that lead to health and safety advice.

So we tried various approaches to extracting information and comparing text in the reports, but ultimately we did not come up with a truly compelling use case for the data or inferences from it.

Observations on the data:

Accident statistics are interesting reading.
Each of these accidents is a story of its own.
Men are in more fatal accidents than women. Across all ages, nearly 65% of the accident victims are men. Among victims under 65 years of age, the share climbs to nearly 75%.
Fatal accidents are more common in older people. In 2011, 57% of the male accident victims were over 65 yo. At the same time, among female victims, 76% were over 65 yo.
In younger age groups, poisoning is the most common cause of accidental fatalities. In 2011, there were no poisonings in children < 15 yo. The statistics include alcohol poisoning.
Falls are the most common fatal accident type for people over 65 years, and over 60% of the victims are women, showing a sharp contrast with all other accident types and ages.

Observations on the hack day:

don't forget your coffee cup
wander around and see what different groups are doing -- don't wait until the pub afterwards to find the person with a degree in computational linguistics!
what's your goal? if it's social, be social. If it's coding, join a group
you will probably learn more from a larger team with more varied skills
how competitive do you want to be?
a video of a good use case is impressive
it's often more efficient to ask for help

Thursday, 7 November 2013

How to run a MOOC: a student perspective

Over the course of the past year, I've followed nine MOOCs on Coursera. I've also used other online learning tools, including Codecademy and Khan Academy. I've enjoyed this period of learning and updated a number of skills, which I hope will be useful in the future.

The MOOCs I've taken have ranged from the wonderfully organised Machine Learning course taught by Coursera founder Andrew Ng to the extravagantly disorganised Startup Engineering course which was primarily led by another Stanford lecturer, statistician, and cofounder of Counsyl, Balaji S. Srinivasan. As more professors become involved in the MOOC phenomenon and try to gain audiences on YouTube and other social media, I thought I'd write up some of my experiences as a student and make some recommendations, or maybe it's a wish-list.

In most MOOCs, the course staff appear to be under-prepared for the demands of the platform. This is a recurring theme, particularly with newly offered classes, so if you are thinking of offering a MOOC, please, please have a beer with someone who has run one and get the full story. It is clearly not an easy thing to do, particularly when the student numbers get large. Coursera courses regularly have > 100,000 students. Jeff Leek's and Roger Peng's post mortem of the course 'Data Analysis' might be a good place to start.

In general, MOOCs tend to have a better student experience when the professor has taught the course material many times before and is not straying too far into new territory. This is particularly true for a first attempt at a MOOC. Bill Howe's course 'Introduction to Data Science' was not a particularly good student experience. I think this was because he tried to add too much to what he had taught before. It's better to be focussed and have only 50,000 happy, learning students than to try to do too much and have 120,000 frustrated, failing students. At least to me. You can plan to change an assignment in the second offering to incorporate new technology.

Recommendations:

Lectures: Have every lecture planned very well ahead of time, preferably before the course even starts. Lectures of 7-10 minutes work well. Some students like them longer, but others don't. Leave yourself plenty of time for technology hiccups -- estimate the time it will take, then multiply x2 and change the units. Hours become days.
Resources: If you see 3-5 similar forum threads running simultaneously, each with > 200 comments of people trying to help each other out, you have failed to get the message across clearly in lecture. A few links to additional introductory material can do wonders.
Forums: These will be going 24/7. There will always be complaints about the level of the assignment or the language used or something. Some of it will also be interesting discussion that you want to stay on top of - 24/7. Keeping up with the forums can be daunting, so a community TA or two can be useful. Forum organisation is important. As the course progresses, important threads get buried, and useful information is often buried at the bottom of a long thread. It's useful to have a section of the forum for each separate assignment as well as a section for software issues, platform issues, deadline problems and general discussion. A TA who can summarize important points regularly and point up useful posts is very helpful.
Extra interactions: Some students find local Meetups or study groups to be invaluable.
Community TA's: These people are volunteers. Most community TA's appear to be more interested in interacting with the more advanced students and furthering their own learning than in supporting students who are having difficulties. Please review comments made by your community TA's. A few will fall into using snark to glorify themselves. The best ones will highlight useful forum contributions and links to help other students.
Assignments: Students have different expectations of assignments. One of the main advantages of an online system is instant feedback. I like assignments that contain questions of different difficulty levels. This lets me identify where I could spend more time and also how solid my knowledge is. I use the feedback from incorrect answers, so having two chances is useful, but a bit stressful. Having five chances is better. It's fine if the maximum number of points gained is reduced when more than two chances are used (e.g. automatic 20% reduction at the 3rd submission). This can be invaluable for international students who may have difficulty interpreting the questions.
Deadlines: MOOC students are often unable to accommodate deadlines. Reasons vary. For me, I can put in 8-10 hours / week when my kids are in school, but during Fall break week, I might manage 2-3 hours. Other students have occasional work deadlines, or a long-planned vacation. Most MOOC students are managing to set aside a few hours for studying from otherwise busy lives, but those lives occasionally interfere. One useful approach is to have 10 late days that can be applied at any time. This means that if I join the MOOC late and miss the 1st deadline by 2 days, I can use two of my late days. If I have to miss a deadline because much of my weekend was taken up throwing a birthday party for my 5 year old, I can use another one. If the online system we were supposed to use is too overloaded and breaks down, students can apply late days to shift the deadline to a time when the system is less busy (and therefore functioning). This gives flexibility and responsibility to the students, which is really nice. Some teachers disagree, though.
Timezones: Your students will not only come from all over the world, as in a modern classroom, but will actually be all over the world. This means that you must be aware of time differences. Time zones for deadlines and for release of new lectures make a difference, but more importantly, if you require students to do an online collaboration of some sort, allow them to log in at different times. Some students can find hours in the middle of the day, while others only find them late at night, and those times are staggered all around the globe. Consider grouping time-zone regions so students can choose to participate at a convenient time. (Hint: 2am in China is not convenient.)
SNAFU's: Things won't go as planned the first time. It will reflect better on you as a teacher and on your institution if you can adjust as needed. Jeff Leek had to drop a code reproducibility portion of his grading, and Antonio Rangel had to drop an experimental interactive market. Please be flexible in your use of new technology. By all means, try it out, but be aware of student needs, which will vary -- not everyone has an American credit card, which some web services require for registration, even if nothing will be charged.
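For anyone building a grading system, the late-day and resubmission ideas in the list above are easy to express in code. This is a minimal sketch under my own assumptions, not how Coursera actually implements it: the names LateDayBank and max_score_fraction are invented, and I've read the '20% reduction at the 3rd submission' as a single flat cap that applies from the third attempt onward.

```python
from dataclasses import dataclass


@dataclass
class LateDayBank:
    """A per-student pool of late days that can be spent on any deadline."""
    remaining: int = 10  # the suggestion above: 10 late days per course

    def use(self, days_late: int) -> bool:
        """Spend late days; return True if the submission is accepted."""
        if days_late <= 0:
            return True  # on time, nothing to spend
        if days_late > self.remaining:
            return False  # not enough late days left
        self.remaining -= days_late
        return True


def max_score_fraction(attempt: int, free_attempts: int = 2,
                       penalty: float = 0.20, max_attempts: int = 5) -> float:
    """Fraction of full marks still available on a given attempt (1-based).

    Assumption: attempts up to free_attempts are unpenalised; later
    attempts are capped at (1 - penalty) of the full score.
    """
    if not 1 <= attempt <= max_attempts:
        raise ValueError("attempt must be between 1 and max_attempts")
    return 1.0 if attempt <= free_attempts else 1.0 - penalty
```

A student who joins late and misses the first deadline by two days would call use(2) and have eight late days left for birthday parties and overloaded servers.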

It will be interesting to see how MOOCs develop. These new online platforms are effective for learning, and education as we know it is clearly changing. So, as soon as my 10 year old finishes his decimal math on Khan Academy, I'm back to Financial Accounting.

Monday, 30 September 2013

Coursera: Introduction to Data Science (Course Review)

I finished this a few months ago, but it will probably be offered again, so here's a review. This course was taught by Bill Howe at University of Washington and offered as a MOOC on Coursera.

Course Description:
The Coursera description promises newbie to data ninja in 8 weeks. Workload 8-10 hours/ week. Those of us who have finished our statistical mechanics homework at 2 AM know that such promises are not only empty, but rather a guarantee of a course that is over-ambitious. As the description implies, this is an overview course that tries to do too much. Every student should realise this at the outset: an introductory course that claims to cover everything is certain to be a rough ride. (I know. I've taught some.)

Lectures:
The lectures covered some really interesting content, and the lecturer appears to know the industry very well, particularly the Microsoft perspective and tools. I assume that many of his continuing education students are Microsoft employees who want to update particular skills. Such students are not the run-of-the-mill for UW.

When I was a TA at UW, the undergraduates needed slow feeding with very small spoons. Lecture halls were filled with slumped bodies under baseball caps. The evening courses, in contrast, were filled with lively, interested, adults who learned independently and came to class with lists of questions. This course is aimed at those active, interested adult learners. That said, the number of hours listed for this class is a gross underestimate for the material covered and the assignments given.

Professor Howe's lecture style is not always engaging, and a lot of material is covered. There were often over 3 hours worth of lecture material to review during the week. Along with following links and reading supporting papers, this left very little time for the assignments themselves. Prof. Howe did a good job of introducing and comparing a range of current technology choices (particularly the comparison of different database technologies). As a data science newbie, I would have liked a bit more information and emphasis on use cases for different types of databases.

The database parts of the course were well presented, and this covered subjects that I hadn't seen in my other work. The data analysis elements were not so clearly taught, though, and there are better (slower) ways to learn this material on Coursera. If you have time, Jeff Leek's course 'Data Analysis' covers this much more thoroughly. Andrew Ng's now legendary Machine Learning course is also good, although more mathematically oriented, with less emphasis on organisation, data munging techniques, and communicating results.

Later lectures in this Intro. to Data Science course appeared to have incorrect answers in the in-lecture questions. I got bored of trying to keep track of the errors and inconsistencies in the course. The material needs a thorough editing before the next showing.

Assignments:
The lectures were not particularly good preparation for the homework assignments. A lot of independent learning was required to make progress in the course. The assignments were also relatively difficult compared to what I expected from the course description. The first assignment was a sentiment analysis of a Tweet stream written in Python. I have a pretty good programming background, having started with BASIC back in 1983, visiting Fortran, MATLAB, Igor, Unix utilities, C, Ruby and continuing to Objective-C, R and functional programming in Scala. I pick things up quickly. The course description did not require a programming background, yet I had to spend several hours learning the ins and outs of Python from Codecademy before I could get a handle on the assignments.

The level of the first assignments was not commensurate with expectations from the course description. I learned a lot, more than I expected, in fact, and I can now implement a matrix multiplication in Python, SQL, or MapReduce based on the homework assignments. The auto-grader for the 1st assignment never did accept my answer for the final part. It also didn't give sufficient feedback for me to solve the problem, which probably had something to do with text encoding, but was very frustrating nonetheless. This sort of issue doesn't teach much. Save your perseverance for things that matter.
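For the curious, the MapReduce version of matrix multiplication looks roughly like this. It is a sketch in plain Python that imitates the map, shuffle and reduce steps in memory; it is not the framework used in the course, and the record layout (matrix name, row, column, value) and the function names are my own choices for the example.

```python
from collections import defaultdict


def mapper(record, n_rows_a, n_cols_b):
    """Emit each matrix entry once for every output cell it contributes to."""
    name, row, col, value = record
    if name == "A":
        # A[row][col] contributes to every cell in output row `row`.
        for k in range(n_cols_b):
            yield (row, k), ("A", col, value)
    else:
        # B[row][col] contributes to every cell in output column `col`.
        for i in range(n_rows_a):
            yield (i, col), ("B", row, value)


def reducer(key, values):
    """Join the A and B entries on their shared index and sum the products."""
    a = {idx: v for name, idx, v in values if name == "A"}
    b = {idx: v for name, idx, v in values if name == "B"}
    return key, sum(a[idx] * b[idx] for idx in a if idx in b)


def matmul_mapreduce(records, n_rows_a, n_cols_b):
    """Run map, group by key (the 'shuffle'), then reduce, all in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record, n_rows_a, n_cols_b):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())
```

The input is a flat list of non-zero entries from both matrices, and the result is a dictionary keyed by (row, column) of the product. Missing entries are treated as zero, which is why the sparse representation works.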

Overall, the assignments were challenging. I learned a lot, but not always what the point of the assignment was. I think there were a lot of complaints (more than normal) about the difficulty of the assignments, and later assignments were quite a bit easier than earlier ones. Assignments covered:

Python: Tweet stream sentiment analysis
SQL: Queries, tables and matrix multiplication
Tableau Visualization: FAA Bird Strike Data
    write-up and peer assessment
    note: I had to use Tableau via Amazon Web Services as it only runs on Windows.
MapReduce: data joins, basic network analysis, matrix multiplication
Kaggle: Take part in a competition (I did facial keypoints detection)
    write-up and peer assessment: ranking on the leaderboard did not matter.

A couple of the assignment deadlines were changed after the deadline had passed. This is very unfair to people who have worked hard to make the deadline, although it was reasonable in the case of the MapReduce homework where we were using a new web system that was supposed to be able to handle the volume of students. This is a continuing problem with MOOCs that have > 100k students enrolled. Any time the professor makes an assignment that will run on new technology, be prepared for a very frustrating experience. In my opinion, new web-based technology should not be used for graded assignments in MOOCs. It should be tested first as an optional assignment or a staged assignment so that 100k students are not accessing it in the same week.

Overall Recommendation:
Students: I hope that the professor will offer this course in a pared-down form. As it is, if you're already awesome at Python and SQL, go ahead and dive in. Everyone else should consider this a taster course and audit only, at least with the current assignments. Be selective about which parts you choose to look at. If you experience slowdowns or poor behaviour with particular technologies in the assignments, put it aside and try again when the course is over or the deadline is passed. It seems like a class, but it's a free platform and you get what you pay for. A lot of professors are using this to try out new technologies, so don't expect it to all work as advertised.

Sunday, 7 July 2013

Biking with Children

Spurred by some recent twitter discussions linking to this StackExchange discussion, it's time to finally write this post. I've been thinking about it for an age.

My son was born in Amsterdam, and my daughter in London. We have been, well, not serious bicyclists, but commuters and heavy bicycle users for many years. This didn't stop when we had kids. In Amsterdam, we lived in the center and did everything by bicycle. We had no car, and didn't really miss it. Over our 8 years in London, we've spent 2 years car free, doing the shopping, day trips, etc. by bicycle or by public transit. We aren't sport bicyclists, bent on training and performance, but, well, maybe we're some shade of green.

Early days: < 6 months
Since bicycling is such an integral part of life in Amsterdam, within a couple days of birth, I found myself carrying my baby in a sling while riding my very stable Dutch bike around Amsterdam at slow speeds. This got us to the first health appointments at the GGD, but wasn't really a safe solution. My knees bumped the sling, too, so it wasn't very comfortable. Every time I got back home, I felt like I'd 'gotten away with it' again.

We needed to be able to get around, and a car just wasn't a useful solution, so we got a baby-mee. Baby G's nice, safe carseat clicked into a somewhat springy metal rack attached to my bike rack. Now we could put him in his car seat and continue our bicycle explorations. We got to see some nice places that way.

This was a great way to travel. I could get out of the house really easily with the new baby. He wasn't heavy yet, so didn't add any instability to the bicycle, and he was strapped into a car seat with decent side impact protection. A low speed fall to the side would be safe. We had a rain cover that zipped over the top when needed. It could also be moved from one bike to another fairly easily, which was great for when my husband was in charge. Mostly it lived on my bike, though, and it was brilliant for trips to the doctor, the zoo, the park.

But eventually, it was time to move on. For one thing, we needed to carry the shopping, too. And a full shopping bag goes on the back. Yes, Dutch people frequently can be seen with two or more large shopping bags dangling from the handlebars, but I learned early: never follow a Dutch person on a bicycle. The frailest of old ladies will turn across the path of an oncoming tram or down a tricky cobblestone alley full of pedestrians. We won't talk about traffic signals. I'm just not at that level; I know myself, and I don't bike well with shopping on the front.

young, but able: 6 - 15 months
From the time a baby can sit up, the Dutch tend to put them in a 'voorzitje', or front seat. Ours hung on the handlebars with two hooks and had a clamp around the stem. Most newer models have an updated attachment that clamps directly onto the stem. The hooks had a disadvantage in that as the child got heavier, so did the steering, but it was fabulous for several months. As far as I could tell, the babies all love these. It was always exciting, and G would stay awake for at least 30 minutes before starting to nod off, which was fine for dashing about town. We also had plenty of room on the back for groceries. The baby's feet are in foot rests, and you can get a good windscreen to keep out the rain and weather. Our model didn't work with drop handlebars, and if the child fell asleep on the way home from the zoo, it wasn't very comfortable; we were constantly trying to support a lolling head with an elbow while riding. This also made it less useful for longer rides.

For this age, it can also be convenient to have a rack attachment that takes a small folding stroller or pushchair. That way you have something to push around at your destination, which can be a lifesaver if you're headed to a garden, shopping mall, museum or similar walking-heavy destination. I didn't have problems with my knees hitting the voorzitje (pronounced for-zit-ya), but my husband had to spread his knees around it, which isn't the best riding position. If you can, try it out before you buy it.

toddlers: 15 months - 3 years
But after about 15 mo, baby G just grew to be too big. The mini front seats are quite close to the handlebars, so there just isn't room for a larger child. We moved him to a seat on the back. Why not a trailer? Well... we did try out a trailer in Amsterdam once; it was hell. It wasn't the trailer's fault, it was just the nature of Dutch bicycle traffic and Dutch bike infrastructure. The trailer was a bit too wide, a bit too long, and a bit too unexpected to be a comfortable riding experience. One thing that makes a bicycle so useful in Amsterdam is that it is narrow. In a town of narrow, medieval streets, this is essential. Cars generally allowed us about 2 inches of space, because that's how much space a Dutch bicyclist needs, and that's how much room there was on the streets. There were constant obstructions: delivery vans or moving vans or just something going on along a canal. To get anywhere, the bicyclist has to swerve through the bollards, through a few parking places or along the sidewalks to pass by the obstruction. Then through the bollards again to regain the street. The trailer just didn't have the manoeuvrability we needed.

The Bobike seat, after 5 years in the shed.

It was worse at intersections. Bicycle traffic is chaotic. There is no lane control, no signalling, and if anyone can help it, no braking. Pulling a long, slow load into the midst of 10 or 20 bicyclists all trying to go their separate ways just wasn't a pleasant experience. It didn't feel socially responsible. Everyone else was carrying multiple children on the nice small footprint of their own bicycle, while we had an extraordinary, traffic clogging load. I wanted a lead bike out front warning everyone we were coming through.

But that was Amsterdam; our later trailer experiences were different (see below). In Amsterdam, G sat in a Bobike seat on the back, attached to the rear stays, not the rack. This, again, constricted the room available for groceries, but I put a small basket on the front and did most of my grocery shopping by foot. If I leant forward, I could wear a backpack, but if I sat upright, it was right in the baby's face.

From 15 months to 3 years old, this was fine, but this guy grew a bit fast, and by 3 years, the seat wasn't working out any more. The clamps holding it to the stays would slip and *bump* he would come down on my fender, providing a very effective brake. I would get off, unload, move the clamps back into place, screw them down as hard as I could and a bit later, *bump*, down they would come again.

Another child seat option that remains useful.

One cause of this difficulty was that we moved to London shortly after G turned 2. He was riding in the same Dutch seat, but the surface was different. When we took the Thames Path to Kew Gardens, it was just too bumpy, with gravel and potholes and rough edges. He was too heavy, and the seat design wasn't up to it. We considered other seats, but none of them really seemed to solve all the problems. A rack-mounted seat certainly would have been better, but we also wanted to go for some longer rides and have something comfortable for G to nap in; we wanted to swap it between bikes; we wanted it to be a bit safer.

small child: 3 - 6 years
So when we saw a second-hand trailer at a boot sale, we got it. (A boot sale in British English is roughly equivalent to a flea market in American English.) It wasn't a high end trailer, but it was functional. It had a clamp that tightened onto the bottom stay, with a safety backup strap and a flexible connector to the trailer towing bar. We could lay the bike down and pick it up again without upsetting the trailer. G could nap in it. It was built for two, so he could take a friend. We could swap it between bicycles. (It was more or less equivalent to this one). I put a mirror on my helmet so I could see the traffic behind and keep an eye on the little one.

At 3 yrs old, G started going to nursery 3 days a week. The nursery was 4 miles away, so I biked it with him in the trailer. He could stay cosy and dry while I battled the hill and got a bit of exercise back into my life. We could stop in the park on the way home if the weather was nice. The trailer was very useful, but it lagged: it didn't start smoothly from a stop, but would pull behind in an uncomfortable oscillation. It wasn't dangerous, just unpleasant. My husband liked the functionality of the trailer, but not the execution: too heavy for longer rides.

The response of traffic is mixed, but most people will give you a lot of room. When people see the trailer, they generally say 'Awww' or 'Wow' or 'What a way to travel'... we get a lot of comments. There is a real feeling of support for this approach. London traffic treats you differently as a bicyclist or as a bicyclist with a trailer. On most roads, I didn't feel pressured to move out of the way or to get a move on. Sometimes the white van guys will even wave at my passenger. There is (slightly) more understanding that you are carrying a load, and the load is precious.

We were wanting some longer rides. We organised a trip to Southern France with some friends, and we bought a Burley. The first several times I pulled it, I kept checking to see that it was actually behind me. It felt extremely smooth and light compared to the other trailer.

G was comfortable and very happy with his space in the trailer. On a long ride, there wasn't so much for him to do, but we took frequent breaks and found a lot of waterfalls. He got more chocolate croissants than any child should expect. We could hand him long pieces of grass and things to look at and hold out into the breeze. He would nap and re-arrange his environment. This is the kind of biking that you cannot do with a child in a seat.

The Burley folded flat and fit into a luggage rack on the Eurostar and TGV trains. It was easy enough to convert that we rode it across Paris from Gare du Nord to Gare de l'Est and folded it again for the second train.

second time around:
And then came P. We live near a busy train track, and the crossing bars are down a lot. During the week it's easily 40 minutes out of the hour. Since G's school and the local shops are on the other side, most of our daily trips involve stairs. I walked everywhere when P was small. I could carry her lightweight pushchair over the stairs, where I wouldn't have been able to manage either a bike with seat or a trailer. I didn't have the baby-mee, but I occasionally put her carseat in the trailer to ride somewhere. There wasn't a very good attachment for this, but I could strap it in for stability, and she was reasonably well-protected. She would eventually complain if the road was bumpy, so we took the speed bumps very slowly and tried to stay off of the Thames Path.

She has missed out on the fun of the voorzitje, but she enjoys going for bike rides. She has her own space in the trailer, and often doesn't want to get out when we arrive. She brings along a stuffed animal and can often be heard making up a song or game while we travel. We haven't considered getting a bike seat for her. Actually, I just cleared out the old rear seat from the shed.

Minimalist child seat.

As a sturdy 4 1/2 year old, P is now going to nursery about 1/2 mile away. We have tried out all the possible methods of transit; some days we even seem to crawl. On the bike, there are three possibilities: the trailer, the back rack or the top tube. One boy at her nursery arrives on a seat on his father's top tube (pictured). When she's in front of me, my knees have to go wide around her, and neither of us is comfortable for longer rides. On the touring bike, I am leaned forward over her, and this riding position is quite cosy. She is forced to lean over as well, just to fit into the available space. On the other hand, being up high and in front is exciting. She has to face the weather. For the past couple of weeks, though, the trailer has been the method of choice.

more than one:

typical scene in Amsterdam, shamelessly borrowed from this blog.

Since G was already nearly 6 when P was born, we didn't have to do a lot of riding while carrying two children. I did hand on our 1st trailer to a nearby family who used it happily for the school run. This can be a tricky time - when the younger child is in nursery, there is about a year with three school runs a day, and many young children just aren't up to that much walking. A two seat trailer is a good solution, and generally cheaper and more flexible than a box-bike or tricycle.

Although Dutch towns are filled with scenes of whole families riding on one bike, it can be difficult to manage. The front-seat + back-seat combination is the most common. This works reasonably well when the children are small, but the problem is always how to load it safely. Basically, a toddler's weight on a leaning bike is already half-way to an accident. The best solution seems to be to load the baby in front first, stabilize the bicycle, and have the toddler climb into their seat. Invest in a really solid kick stand that will give more than one point of support, and hold the brakes tightly on to keep the bike from rolling while they climb up. A stop or velcro strap that keeps the front wheel from turning can also help. Even when the children are in place, however, you still have to clip the toddler's straps, which requires an extra hand. I witnessed more than one fall, and more than one utterly frazzled mother. If you're just headed to the park with your partner, distribute the weight and put a seat on each bike.

If you have to lean the bike to get your leg over the top bar, be extremely cautious. Having all the children's weight at the top of the bike makes it very unstable. Holding onto a brake will keep it from rolling, but... it's just a tippy machine at this point.

Summary:
There are many options for riding with your children. Which one(s) you choose will depend on the children's ages and your own bicycling style and needs. We found that seats worked very well in the confines of medieval streets in Amsterdam and also for errands and short trips in London. They work better with a step-through frame (no or low top-tube) and an upright riding position. Outside of Amsterdam, the trailer was the way to go for us. It requires more storage space and actual, normal sized roads. It can carry a heavy load without any instability. We take it on and off the bicycle for each trip, which requires some attention to detail -- never leave off the safety strap! This takes less than 1 minute after a bit of practice. Since we use the trailer for vacations and daily errands, including shopping runs, it has definitely been worth the price.

Right now we are packing to move back to the US, and we'll be taking the Burley with us.

Saturday, 6 July 2013

Idiom in R: results you can C

Computing for Data Analysis was a pretty good introduction to R, but did not really talk about R idiom, which can make the difference between code that runs and code that runs quickly. Here is a basic example.

Using sprintf statements for formatting filenames: Consider a series of files. The goal is to read them all into R, but the filenames include a constant width variable: we're looking to load filenames such as ./data/001.csv and ./data/011.csv up to ./data/999.csv. How do we construct the name strings in R?

The numbers in the file names need to be padded and converted to the appropriate strings. Here are three ways of doing the padding.

The R way:

# setup
directory <- "data"
id <- 1:999

# method 1
pad.R <- function(id) {
    num <- sprintf("%03d", as.integer(id))
    path <- paste("./", directory, "/", num, ".csv", sep = "")
    return(path)
}

A brute-force method:

# method 2
pad.brute <- function(id) {
    num <- rep("", length(id))
    for (n in seq_along(id)) {
        if (id[n] < 10) num[n] <- paste("00", id[n], sep = "")
        else if (id[n] < 100) num[n] <- paste("0", id[n], sep = "")
        else num[n] <- as.character(id[n])
    }

    path <- paste("./", directory, "/", num, ".csv", sep = "")
    return(path)
}

… but we know that for-loops are notoriously slow in R, so we could take a hybrid approach and define a function to take a single number as input and convert it. Then that function could be used with one of R's apply methods to convert the vector in one go.

A hybrid method:

# method 3
padder <- function(num) {
    if (num < 10) return(paste("00", num, sep = ""))
    else if (num < 100) return(paste("0", num, sep = ""))
    else return(as.character(num))
}

pad.hybrid <- function(id) {
    num <- sapply(id, padder)
    path <- paste("./", directory, "/", num, ".csv", sep = "")
    return(path)
}

Comparison

These approaches all give the same results, but their run times are noticeably different.

system.time(path <- pad.R(id))

##    user  system elapsed 
##   0.001   0.000   0.001

system.time(path <- pad.brute(id))

##    user  system elapsed 
##   0.007   0.000   0.008

system.time(path <- pad.hybrid(id))

##    user  system elapsed 
##   0.004   0.000   0.004

path[c(3, 13, 103)]

## [1] "./data/003.csv" "./data/013.csv" "./data/103.csv"

For speed, they are equivalent when run on one or two elements at a time. However, when run on the full 999-element vector as shown here, both the 'brute force' and 'hybrid' methods are several times slower than sprintf (roughly 8x and 4x in the timings above).
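Timings of a few milliseconds are noisy, so here is a minimal sketch of the same comparison on a longer vector. The names ids, pad_sprintf and pad_loop are made up for this illustration, and the 100-fold repetition is an arbitrary choice; the actual numbers will vary from machine to machine.

```r
# Assumption: a longer, made-up id vector so the gap is easier to measure
ids <- rep(1:999, times = 100)

# Vectorised padding with a format code
pad_sprintf <- function(id) sprintf("%03d", as.integer(id))

# The same padding done one element at a time in a for-loop
pad_loop <- function(id) {
    out <- character(length(id))
    for (n in seq_along(id)) {
        if (id[n] < 10) out[n] <- paste0("00", id[n])
        else if (id[n] < 100) out[n] <- paste0("0", id[n])
        else out[n] <- as.character(id[n])
    }
    out
}

# Check that both give the same strings before comparing speed
stopifnot(identical(pad_sprintf(ids), pad_loop(ids)))

# Elapsed seconds for each approach
t_sprintf <- system.time(pad_sprintf(ids))["elapsed"]
t_loop <- system.time(pad_loop(ids))["elapsed"]
print(c(sprintf = unname(t_sprintf), loop = unname(t_loop)))
```

Checking that the outputs are identical first keeps the timing comparison honest: a fast function that pads incorrectly is no use.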

The discussions on the course forums did give a different perspective. Several self-identified 'professional programmers' preferred the if, if-else, else approach I've used in both methods 2 and 3. They considered it more readable and thus more maintainable.

I don't think this is the best approach. If you are a professional programmer, you are familiar with idiom, in whatever language you work in. You know that there are readable, maintainable ways of doing what needs to be done efficiently. At its root, deep down underneath, R is in the C family of languages. The basic in/out is based on the C standard library <stdio.h>. The professional way to use R is to use that R idiom efficiently and in a way that other R programmers will understand.

So learn your sprintf formatting codes. They may look like magic numbers the first time you meet them, but they are systematic and ubiquitous. They will be useful in many other contexts, including modern languages like Python and Java and therefore even Scala and Clojure. They will also speed up your code, and don't worry, most other professionals will understand them.
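As a starting point, here is a short sketch of a few common format codes in R. The variable names are only for illustration.

```r
# %03d: integer, width 3, padded with leading zeros
padded <- sprintf("%03d", c(3L, 13L, 103L))

# %5d: integer, width 5, padded with leading spaces
spaced <- sprintf("%5d", 42L)

# %.2f: floating point number with two decimal places
rounded <- sprintf("%.2f", pi)

# %s: string; sprintf is vectorised, so this builds three paths at once
paths <- sprintf("./%s/%03d.csv", "data", 1:3)

print(padded)
print(spaced)
print(rounded)
print(paths)
```

The last line replaces both the padding step and the paste call from the examples above with a single format string.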

Sunday, 5 May 2013

Hexaflexagon Madness

No, I can't hold a candle to Vi Hart's description. Go and enjoy. Then have some nice Mexican food.

But when you stop rolling on the floor holding your stomach, you might ask yourself 'where did the 3rd side come from'? and, well, I might have an answer.

A hexaflexagon is a 2 dimensional object in some sense. As you fold it up into a triangle, preparing to turn, it becomes 3 dimensional, and when it does, you can open it up. What you find is that the triangular pockets formed by the folds held the 3rd side. This 3rd side is inaccessible until you fold it, but when you open it, you are opening those pockets, revealing the hidden triangles. The former front becomes a symmetrically inverted back, and the former back side moves to the inside of the newly-formed pockets.

Another side-effect of the pockets is that if you keep folding and unfolding, effectively turning it inside out over and over again, it will rotate in the plane, without you turning it.

If you're trying to fold one, here's one tip:

You can estimate a 60° angle by carefully lining up the top corner with the bottom edge of the paper strip, as in the pink circle above. At other angles, the corner is either onto the paper or hanging off the edge, but at 60°, it will line up exactly, so long as the sides are straight.

Chirality is important. Make sure you've got three diamonds visible. If you don't, there is probably an up where there should be a down or vice versa.

I haven't quite got the hexa-hexaflexagon down yet, but we'll get there. The description at Hexaflexagon portal is very helpful, particularly the variation A hexa-hexa-flexagon, available as a PDF.

Oh, yeah, and while you're contemplating your notebook paper:

You didn't think that 8 1/2 x 11 inches was an international standard, did you? Guess what. Most of the rest of the world uses a different 'system', um, like an actual systematic system. The equivalent 'letter' size is A4, but we also have A1 (poster sized), A7 (index card), and other variously-sized characters in between. These sizes have a nicely chosen aspect ratio, $L/W = \sqrt{2} \approx 1.414 $... but that would be irrational, so they have to round off a bit.

This doesn't look very useful until you fold the sheet in half across its long side. If the short side is $1$ and the long side is $\sqrt{2}$, the folded sheet has sides $1$ and $\frac{\sqrt{2}}{2}$. Remember how to divide fractions involving roots? The bit you need to recall is that $2$ is just $\sqrt{2}\times\sqrt{2}$. This means that $$ \frac{\sqrt{2}}{2} = \frac{\sqrt{2}}{\sqrt{2}\sqrt{2}} = \frac1{\sqrt{2}} $$ when you simplify by canceling the common factor of $\sqrt{2}$ on top and bottom. Then, taking the ratio of the long to short sides of the folded sheet gives $$ 1: \frac1{\sqrt{2}} $$ Multiplying both sides by $\sqrt{2}$ gives a simpler form, which happens to be the same ratio as the original, large piece of paper: $$ \sqrt{2}:1$$ And no matter what the paper size, the math still works. Now that's a system. Folding an A4 in half and rotating 90° gives an A5, etc. As always, Wikipedia is your friend.
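The halving argument can be checked numerically. This is a small sketch that starts from A0 (841 x 1189 mm) and halves the long side repeatedly, rounding down to whole millimetres; the matrix name sizes and its column labels are just choices made for this example.

```r
# One row per paper size, A0 to A5; columns are short and long side in mm
sizes <- matrix(NA_integer_, nrow = 6, ncol = 2,
                dimnames = list(paste0("A", 0:5), c("short_mm", "long_mm")))
sizes[1, ] <- c(841L, 1189L)  # A0

# Halve the long side (rounding down); the old short side becomes the new long side
for (i in 2:6) {
    sizes[i, ] <- c(sizes[i - 1, 2] %/% 2L, sizes[i - 1, 1])
}

# Aspect ratio of every size, next to sqrt(2) for comparison
ratio <- sizes[, "long_mm"] / sizes[, "short_mm"]
print(sizes)
print(round(ratio, 4))
print(sqrt(2))
```

The A4 row comes out as 210 x 297 mm, and every ratio stays close to $\sqrt{2}$ despite the rounding.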

A4 is 21.0 x 29.7 cm, so it's narrower than US Letter paper by about 0.23 inches and longer by about 0.69 inches. Which is a perfectly sized strip for hexaflexagon folding.

I haven't decided on whether or not to hold a hexaflexagon party. It might have to wait until next October.

Tuesday, 16 April 2013

Big Data Hackathon London: A few lessons learned

I spent most of my weekend at the Big Data Hackathon London. I'm not hard-core, and I didn't pull an all-nighter, but then, even the winning visualisation team said the code written between 4 and 7 am was rubbish. Better to get some sleep. This was my first ever 'hackathon', and part of the fun was just observing the phenomenon.

The basics:

The hackathon was organised by Data Science London, and I found out about it through the Data Science Meetup group. I highly recommend this group if you are interested in learning about current methods and tools in Data Science. Their meetings are very interesting and educational, but you have to be very quick with the RSVP -- there's a lot of competition for the limited spaces. As always, the organisers did a great job. I didn't manage to take home any of the swag or awards, but I certainly drank my share of the coffee. And the whole weekend was totally free. Well done!

The hackathon took place at The Hub Westminster, which was a very nice, light, open space. The talk space holds about 100 or so people, and there is desk space plus a stand-up area, where the food is served, for milling around and meeting people. The space is well organized with good systems for internet and power. A pleasure to work in.

Lesson learned: Bring your own mug to cut down on waste

The hackathon had three different categories of challenges that teams could submit.

data science challenge
data visualization challenge
free-style data challenge

Most people who came did not have a team lined up. The winning team in the data visualisation challenge got together when two of them carried signs around saying 'Node.js' and 'd3'. The other two thought this was a good idea, and a winning team was formed. One of the team members later said that their goal had just been to improve their JavaScript skills. The visualisation was quite lovely, and should be showing up in a 'major UK publication' someday soon.

Lesson learned: MongoDB + Node.js + d3 = powerful stuff
Lesson learned: Connect a team through the technology you want to learn

The hackathon started out with a presentation on Microsoft Azure and the suggestion that we use a free trial account (good for 3 months) to do our analysis.

After the talk, someone asked about setting up R on the system. I approached them afterwards, and that was the beginning of a team. Our team, 'State of the A[R]t', set up an Azure account, and we were able to get R working on an Ubuntu virtual machine without much difficulty. Wenming Ye's blog was helpful for this. It's probably even more helpful if you want to use Python. The Kaggle assessment of the data science submissions relied on the ROCR package, and this relied on gplots, which required us to build R 3.0 from source. Fortunately, one of our team was ace at this and we had it running quite quickly. Meanwhile, the rest of us were looking at the data.

Lesson learned: Technology is broad and deep. Someone will like doing the parts you hate. Let them do it. (I have to re-learn this continuously. I try to do too much on my own.)

The hackathon had a tight schedule. There were talks all afternoon, and if I had gone to all the talks, I would not have made much progress with the data. However, missing all the talks probably wasn't the best strategy either. Next time, I'll try to keep my head up and look around for which talks are truly interesting. Talks were presented by the hackathon sponsors, so they highlighted their newest technologies. I can only hope that the talks will be posted so I can catch up with the parts I missed.

Lesson learned: It's about learning. What are the learning opportunities today? Will the talks be available tomorrow?

On Sunday, the data analysis winners each gave a brief indication of what they did. We ended up 89th overall, and we only did that well because one of my team-mates took a careful look at the original benchmark code. I don't think we were alone in this, as 15 teams finished within 0.00001 of us. Reassuringly, though, we were working along similar lines to the winning team.

Lesson learned: Find a good starting place.

The benchmark was not quite the simple logistic regression we expected from the description. We would have done much better if we had taken the time to look at the code for the benchmark as a first step.

Lesson learned: Work efficiently -- write functions or scripts for each step.

At 12:32 on Sunday, I had a model that resembled the winning model. Maybe it would have done better than 89th, but I didn't get a chance to find out. It took me too long to make the model into a submittable prediction! I should have anticipated this, because the 1st submission also took ages. If I had written some of the steps into functions, it could have been much faster, and the team would have done better.
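For next time, this is the kind of helper I have in mind. It is only a sketch: write_submission, the id and prediction column names, and the toy data are all made up for illustration, and the real competition format may have been different.

```r
# Hypothetical helper: turn a fitted model plus test data into a submission file
write_submission <- function(model, test_data, id_col, file) {
    # Predicted probabilities for each test row
    pred <- predict(model, newdata = test_data, type = "response")
    submission <- data.frame(id = test_data[[id_col]], prediction = pred)
    write.csv(submission, file, row.names = FALSE)
    invisible(submission)
}

# Toy usage with simulated data (not the hackathon data)
set.seed(1)
train <- data.frame(id = 1:50, x = rnorm(50))
train$y <- rbinom(50, 1, plogis(train$x))
test <- data.frame(id = 51:60, x = rnorm(10))

fit <- glm(y ~ x, data = train, family = binomial)
out_file <- tempfile(fileext = ".csv")
out <- write_submission(fit, test, "id", out_file)
print(head(out))
```

With a function like this written after the first submission, a new model only needs one more call to become a submittable file.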

Overall, it was great fun. I met some lovely people and I learned a lot. Coursera's offerings, including Jeff Leek's 'Data Analysis' and Roger Peng's 'Computing for Data Analysis', gave me a good background for taking part in this event. Hopefully, learning some Network Analysis and a bit of Scala will prove useful for the next one.

Saturday, 2 February 2013

Programming with Mommy

Sometimes the things you do turn around and bite you, and sometimes they make you smile.

So this afternoon I was watching this video in which Greg Wilson talks about programming techniques, programming fashions and the importance of evidence in deciding what to do and how to go about it. There's a section in the middle about the "why-women-can't-be-good-programmers" debate, and he mentions this book, which discusses it at length, with evidence.

So I got to thinking about coding and myself and my daughter. And it just so happens that we were chatting about Angry Birds this morning:

P: Mommy, did you have Angry Birds when you were little?
S: No... no, we didn't have anything like Angry Birds. We could listen to music on tapes or records; we could watch television. There weren't many computers. There weren't any videos or CDs. I remember the 1st video game. It came out when I was about 15. Actually, I can show you what it looked like...

So we went and looked at 'Paddle Ball' at Khan Academy.

It's not the original Pong (nor is it the version that I remember seeing at a friend's house -- that was probably on an Atari VCS). It is close enough to that game that she could get the idea: not Angry Birds. And she could get another idea -- there was the code on the left side of the screen, and we could change it. We could make the ball pink, the background red, the paddle purple. We could change the sizes of the objects, and their speeds. We could interact with the game in a different way, and we did.

So my daughter got her introduction to programming at age 4.