Thursday, 12 June 2014

Ozone Weather

An open source weather app that calculates UV light intensity for the user's current time and location and suggests how long it will take to make vitamin D. It also forecasts UV light intensities for the coming week. The code is up on GitHub.


Background:

This project is based on a tutorial from Ray Wenderlich's great website (sample app: SimpleWeather). It adds ozone data from the TEMIS database, using libxml2 and Hpple to parse the .html, following another tutorial on how to parse HTML on iOS. It also includes my astronomy code from previous projects for calculating solar angles and estimating UV intensities.


Technologies

The iOS best practices tutorial included a number of interesting technologies. I've ignored the image blur filter they use in the main view, but I kept these:

  • cocoapods -- package manager and dependency tracker software. Basic usage:
    • > edit Podfile -- CocoaPods searches for a file named 'Podfile'. List the dependencies to be included with the project there (see the example in the directory).
    • > pod install fetches the dependencies and creates a Pods/ directory to hold them. It also creates a .xcworkspace file that holds the project together with its dependencies. Use this workspace in Xcode.
    • > pod update updates dependencies to their latest versions. I had to do this in this project because the TSMessages project had suffered from a problem at GitHub, and the bug fix they had to put in broke my older version. CocoaPods to the rescue!
  • Mantle -- a project from GitHub for creating data models: it aids conversions between JSON <--> NSObject (very handy with a JSON data feed, as we'll see; there's a sketch after this list). I only wish there were more useful tools for directly converting a dictionary into an object. I didn't come up with a good solution for the ozone data, but at least it works, and it's pretty easy to see how it could be improved.
  • ReactiveCocoa -- allows you to use functional programming constructions in iOS apps. Ever run into a spaghetti of callbacks with KVO? Functional Reactive Programming may not be the ultimate solution, but it certainly provides a different paradigm that applies to many common situations. Wow. If you haven't used it, you've got to try this stuff.
  • hpple -- I needed an HTML parser for iOS. There are a lot of choices, but this one had a decent tutorial at raywenderlich.com and seemed easy to use (and it was). The parsing problem is pretty small, as I only want to handle a single page that hasn't changed in years. A regex would have worked, but I wanted to learn how to do it better.
  • TSMessages -- a ticker-style alert message system.
  • git for version control -- the technology driving collaborative development on GitHub. Can't say I've mastered the learning curve yet. This is one of those situations where a video explanation really helps. Jessica Kerr (@jessitron) does a great one, git happens -- with sticky notes! (This particular presentation is for coders coming from a background in Subversion, but I've also seen Jessica do a great intro for novices who haven't heard of version control at all. Just search for git happens.)
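
To make the Mantle point concrete, here is a minimal sketch of the kind of model class I mean. The OWCondition name, the properties and the JSON keys are hypothetical placeholders for illustration, not the app's actual model:

#import <Mantle/Mantle.h>

// Hypothetical Mantle model: maps a JSON weather record onto typed properties
@interface OWCondition : MTLModel <MTLJSONSerializing>
@property (nonatomic, strong) NSDate *date;
@property (nonatomic, strong) NSNumber *temperature;
@end

@implementation OWCondition

// JSON key paths -> property names (the keys here are made up)
+ (NSDictionary *)JSONKeyPathsByPropertyKey {
    return @{ @"date" : @"dt",
              @"temperature" : @"main.temp" };
}

// convert the feed's unix timestamp into an NSDate
+ (NSValueTransformer *)dateJSONTransformer {
    return [MTLValueTransformer transformerWithBlock:^(NSNumber *unixTime) {
        return [NSDate dateWithTimeIntervalSince1970:unixTime.doubleValue];
    }];
}

@end

A dictionary parsed from the feed then becomes a model object with something like [MTLJSONAdapter modelOfClass:[OWCondition class] fromJSONDictionary:json error:&error].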

As with any newly hatched project, there's the question of what to add to .gitignore. For a project using CocoaPods, see the pros and cons at guides.cocoapods.org. For this project, I want to keep it lightweight but also keep track of the dependencies. To do this, I'll add the Pods/ directory to .gitignore, but keep the Podfile, Podfile.lock and other files under version control. I used a Stack Overflow post to get an appropriate .gitignore file.
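
For what it's worth, the CocoaPods part of the resulting .gitignore looks roughly like this (a sketch; the exact contents depend on the project):

# CocoaPods: don't track the installed pods themselves...
Pods/
# ...but DO keep Podfile and Podfile.lock under version control,
# so anyone can reproduce the dependencies with 'pod install'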


TEMIS

The ozone data I'm using basically comes from scraping a website. It is an old website, run by the Tropospheric Emission Monitoring Internet Service (TEMIS), and it lacks a friendly API. Raw science. To get the column ozone for a location, I need to query the website with a URL string of the form http://www.temis.nl/uvradiation/nrt/uvindex.php?lon=5.18&lat=52.1, where the lon and lat values are provided by my code. The response is only available as .html, so I need to fetch it and parse it to extract the desired column ozone values. I'll use hpple to parse the HTML response and its XPath query system to walk the DOM and extract values from the relevant table.
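
Building that query string in code is simple enough. Here is a sketch; the method name and the use of CLLocationCoordinate2D are my own illustration rather than the app's actual interface:

#import <CoreLocation/CoreLocation.h>

// Hypothetical helper: build the TEMIS query URL for a given coordinate
- (NSURL *)temisURLForCoordinate:(CLLocationCoordinate2D)coordinate {
    NSString *urlString = [NSString stringWithFormat:
        @"http://www.temis.nl/uvradiation/nrt/uvindex.php?lon=%.2f&lat=%.2f",
        coordinate.longitude, coordinate.latitude];
    return [NSURL URLWithString:urlString];
}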

The entire page is formatted as a series of nested tables. The page header is one table; a second table holds the body of the page, with one column holding the frame to the left, a blank column, and a column holding the data table I'm interested in.

html -> body -> 2nd table -> tbody -> tr -> 3rd td -> dl ->
dd -> table -> tbody ->
    tr -> td -> <h2> location </h2>
    tr -> 3x td -> (headers tagged <i>) Date, UV index, ozone
    tr -> 3x td -> (data values) day Month year, .1f, .1f DU

There are a lot of XML parsers and JSON parsers available. In my perfect world, .html files could be parsed as XML, but it doesn't work out that way. Many normal tags in .html are not XML compliant, so most XML parsers break down right away, including the NSXMLParser included in iOS. Parsing .html is not an uncommon problem, though, so there are a number of libraries on GitHub that people have used. I was able to combine the Hpple parser library with ReactiveCocoa to create a pipeline straight from the .html response to my model objects.
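
As a rough illustration of the hpple end of that pipeline (the XPath string below is a simplified stand-in for the full nested-table path sketched above, and the method is hypothetical):

#import "TFHpple.h"

// Hypothetical helper: pull the text of the inner data-table cells out of the TEMIS .html
- (NSArray *)ozoneTableCellTextsFromHTMLData:(NSData *)htmlData {
    TFHpple *doc = [TFHpple hppleWithHTMLData:htmlData];
    // each row of the inner table holds a date, a UV index and a column ozone value
    NSArray *cells = [doc searchWithXPathQuery:@"//dl/dd/table//tr/td"];
    NSMutableArray *texts = [NSMutableArray array];
    for (TFHppleElement *cell in cells) {
        NSString *text = [[cell text] stringByTrimmingCharactersInSet:
                          [NSCharacterSet whitespaceAndNewlineCharacterSet]];
        if (text.length > 0) {
            [texts addObject:text];
        }
    }
    return texts;
}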

There are still a few wrinkles that could use ironing. When dealing with locations, time zones and solar data, times and dates become difficult to handle, and the native date handling in iOS doesn't make it easier. In particular, the UV index published by TEMIS is for solar noon at the lat,lon of the location. There is no accurate way to capture this date using only iOS internals, particularly once daylight saving time in different jurisdictions is taken into account. Astronomy calculations are needed to assign these values to correct times.


Lessons

DateFormats:

'HH' parses 24-hour time; 'hh' parses 12-hour (am/pm) time, so 'hh' won't parse 17:20.
'YYYY' doesn't mean 2014 -- it is the week-based year. For normal calendar years you need 'yyyy'.
The full spec for iOS 7 is at unicode.org
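
A quick NSDateFormatter sanity check of both gotchas (the date strings are just illustrative):

NSDateFormatter *formatter = [[NSDateFormatter alloc] init];

formatter.dateFormat = @"yyyy-MM-dd HH:mm";                     // 24-hour clock, calendar year
NSDate *good = [formatter dateFromString:@"2014-06-12 17:20"];  // parses as expected

formatter.dateFormat = @"yyyy-MM-dd hh:mm";                     // 'hh' expects a 01-12 hour
NSDate *bad = [formatter dateFromString:@"2014-06-12 17:20"];   // fails -- as noted above, 'hh' won't parse 17:20

formatter.dateFormat = @"YYYY-MM-dd";  // 'YYYY' is the week-based year: fine most of the year,
                                       // silently wrong near New Year -- use 'yyyy' instead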

ReactiveCocoa:

NSLog is useful for getting info on intermediate stages.
There are some mysteries about what filter: and ignore: do that I should get a grip on.
There is a lot of potential in this library, and many functions to explore. map: is your friend. Use it flexibly.

There are many useful discussions in the issues on GitHub and on SO. One word of warning: this library is changing rapidly -- 2.0 was recently released and 3.0 is being crafted. The terminology is shifting, even for core ideas. Older posts may contain outdated code, and that's likely to change even faster with Apple's introduction of Swift!

  • Prefer RACSignal over RACSequence.
  • Prefer 1-way binding with RAC over 2-way binding (see the discussion of the issue proposing to drop RACChannel, and the sketch after this list).
  • Avoid subscribeNext:, doError:, and doNext:. These break the functional paradigm. See this StackOverflow question on fetching a collection. Note: RACAble was replaced with RACObserve in 2.0.
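
As a sketch of what that 1-way binding style looks like in RAC 2.x (the property names here are made up for illustration):

#import <ReactiveCocoa/ReactiveCocoa.h>

// Hypothetical binding: keep a label in sync with the view model's ozone value
RAC(self.ozoneLabel, text) = [RACObserve(self, viewModel.columnOzone)
    map:^(NSNumber *ozone) {
        return [NSString stringWithFormat:@"%.1f DU", ozone.doubleValue];
    }];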

I'm finding it somewhat difficult to chain my signals together in the correct way, probably because I have some processing to do with different parts of the ozone signal.

Here's a helpful bit of code from techsfo.com/blog/2013/08 for managing nested asynchronous network calls in which each call depends on the results of the previous one. Note: this is from August, so before RAC 2.0. I believe the weak self reference is now handled with the @weakify and @strongify macros.


__weak id weakSelf = self;
[[[[[self signalGetNetworkStep1] flattenMap:^RACStream *(id x) {
   // perform your custom business logic
   return [weakSelf signalGetNetworkStep2:x];
}] flattenMap:^RACStream *(id y) {
   // perform additional business logic
   return [weakSelf signalGetNetworkStep3:y];
}] flattenMap:^RACStream *(id z) {
   // more business logic
   return [weakSelf signalGetNetworkStep4:z];
}] subscribeNext:^(id w) {
   // last business logic
}];

Unit Testing

The code coverage isn't good, but I did create some logic tests for my astronomy code using XCTest. My test class couldn't call the private methods of the class I was testing, and I didn't want to move those function definitions into the public interface. Luckily I found another way: declare a category in your test class (Stack Overflow).

// category used to test private methods
@interface OWSolarWrapper (Test)

// private functions from OWSolarWrapper.m
// date functions
-(NSNumber *) julianDateFor:(NSDate *)date;
-(NSNumber *) julianCenturyForJulianDate:(NSNumber *)julianDate;
-(NSNumber *) julianDateRelative2003For:(NSDate *)date;

// basic astronomy calcs
-(NSNumber *) equationOfTimeFor:(NSDate *)date;
-(NSDictionary *) solarParametersForDate:(NSDate *)date;

@end
Beware of floatValue:

[[NSNumber numberWithDouble:3.14159283] floatValue]
Write an extension so this raises an error! floatValue carries only a 24-bit significand, so you get roughly 7 significant digits before it starts rounding. This introduced a bizarre rounding error in my astronomy code.
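
A quick illustration of the precision loss (printed values are approximate):

NSNumber *value = [NSNumber numberWithDouble:3.14159283];
NSLog(@"float:  %.8f", [value floatValue]);   // prints roughly 3.14159274 -- single precision
NSLog(@"double: %.8f", [value doubleValue]);  // prints 3.14159283 -- keep doubles for astronomy math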

Location Testing:

Nice comment under the original tutorial:

If you want a specific location not included in xcode, you can create a gpx file for any location. You then import it into xcode to include it in your xcode location toggles.
        - marciokoko on March 4, 2014, 9:18 AM

But to calculate solar positions, I really need time zone information along with my testing locations, which .gpx files don't include.

TODO

  • use different sky images for the background to reflect the weather prediction.
  • The original background of the table view had a blur filter attached. This was accomplished with a library, but I think something similar is possible with CALayer. It might be good to explore a CALayer filter on the table-view cells or the underlying scroll view, with the filter responding to the scroll position.
  • UV information is currently encoded only as uvIndex ranges that set a color on the icon background. That's a start, but I'd like to do some calculation and distinguish two kinds of risk:
    1. overall intensity -- risk of acute sunburn, and
    2. relatively high ratios of UVA at high intensity -- risk of deep damage that is harder for the body to repair.

    To do this, I need to work more at combining the weather signals, possibly changing the model significantly. Thank goodness for git branches!

Friday, 4 April 2014

Course Review: Data Analysis by Jeff Leek

I took Data Analysis on Coursera in the spring of 2013. It was offered again in October 2013, but it may have been supplanted by the new Data Science specialisation, which starts in just a couple of days.

tl/dr: 5/5 don't miss out.

Overview:

This is a very well-run course that builds on Roger Peng's Computing for Data Analysis (in R) course. Professor Leek does a very good job of introducing tools and habits for 'reproducible research', including suggesting ways to organize the data munging problem for reproducibility. Professors Peng and Leek, both of the Johns Hopkins School of Medicine, are leaders in this area, and discuss the challenges and progress on their excellent blog, Simply Statistics. They are also at the core of the newly offered Data Science specialization on Coursera, which offers a nice series of courses that build on each other and give good detail on the various aspects of data analysis. It's also by far the most accessible data science study track available.

Reproducible research is very important for statistical science in general, and promises to be increasingly important over time. Consequently, this course is really a great opportunity. Professional programmers eventually learn the value of unit testing, documentation, and other approaches that make their code more usable. Professors Peng and Leek are at the forefront of developing the professional statistical analysis standards that will become the TDD + version control + issue tracking of data analysis. In other words: you must take this course if you are interested in professional-level data analysis. Really. Well, actually, you could take all nine courses…

The Data Analysis course is 8 weeks long. In the first offering, it had weekly lectures of about 2 hours as well as weekly homework assignments that could add up to several hours. As usual, the Coursera recommendation of 3-5 hours/week is on the low side for most students with little experience. Additionally, there are two longer data analysis assignments with real (web-based) data, and the due date for the first analysis falls just after many of the better statistical analysis tools are introduced in lecture. Both analyses are peer-graded by 5 other students, and the mean of the central 3 scores is used. Peer grading may change in future offerings of the course, as this is one area where Jeff Leek was not sure how the course should be organised. There's a full discussion of this in a Simply Statistics podcast. Also, as one might expect from statisticians, there's data: an interactive graph of completion rates for various online programs. This course had a completion rate of 5.4%, and many of the students active on the forums had done a lot of data analysis before.

Recommendations:

Make sure to set aside time for the data analysis assignments. These will easily take 8-10 hours and could take up to 20, depending on how obsessive you become. The data analysis grade is based on peer review, so your reviewers may not understand your analysis even if it is perfectly correct. There was grousing about this on the forums, but I think the point is clear: the peer graders are not experts in data analysis, and you should never write a report aimed at experts in data analysis, so take that into account when writing the report. I had trouble with this on the second data analysis. It involved classifying accelerometer data by the action performed (walk, run, walk up stairs, etc.). The data had been expanded into a set of features including some filtering, Fourier analysis and other transformations. In my analysis I used the language of signal processing and sampling to describe this feature set, but this language was not familiar to the peer reviewers, so they didn't realise that I was describing the spacing and number of data points very precisely.

Friday, 28 February 2014

Intro to GitHub (for scientists)

The problem

I talked to a young woman yesterday who is a bio-engineering postdoc at Stanford. She has some code that she'd like to 'upload' to GitHub. She admitted that, well, actually, she hadn't managed to get any of her code onto her GitHub account yet, and she looked so overwhelmed and dejected that I felt bad. I know that frustration.
So here is yet another blog post to try and help with the learning curve that is git. I'm writing this for the scientist who has written some code and wants to share it on GitHub. Most scientists write code in order to accomplish a particular task, and thus are not familiar with professional programming practices, including writing documentation, unit testing, SCM (source control management) and version control. There are great benefits to learning these techniques, and their usefulness is becoming more and more apparent to academics as research relies more and more on computer programs. In fact, the table of contents for the journal Nature Methods just arrived in my inbox with a lead editorial on reproducible research. I quote:

Nature Methods strongly encourages researchers to take advantage of the opportunity that code repositories, such as GitHub, provide to improve a software tool before submission. Even if others do not examine and test the code, the act of preparing the code and necessary documentation for deposit and use by others will help avoid delays in publication.
In short, if you are coding and publishing work derived from your code, the process of uploading your code for collaboration will help bring it up to a publishable quality. And yes, it's even more important than making the figures pretty.

GitHub == collaboration

Firstly, if you just want to upload some code, you need to take a step back, take a deep breath and 'think different'. There is no 'upload' button on GitHub for a good reason. It is not built for uploading code and leaving it to rot on a server, but for fostering collaboration between programmers. Thus, in order to share your code on GitHub, you first need to get it ready for collaboration. To do that, you need to set up version control. This is somewhat more complicated than finding the 'track changes' setting in Microsoft Word, but it is also far more useful.
As you try to think different, keep in mind that GitHub was built with a particular set of workflows in mind. Those workflows have to do with managing a code-base that is constantly being updated by multiple contributors. GitHub tries to ease the difficulties of this type of collaboration by bringing several things together.
  • version control -- similar to 'track changes' in a word processor, but for whole projects
  • cloud-based storage -- simultaneously backup and share your code
  • user accounts -- keep it private, share with a team or make it public as your needs change
  • social -- there's messaging, so you can talk, argue, document, discuss.... collaborate!
But version control is the main thing that sets GitHub apart from any number of social sharing platforms, and makes it so powerful for people who code. To use GitHub effectively you therefore need to understand the basics of version control with git. It's hairy and scary at first, but no worse than... well, ok, it is worse than a lot of things, but sometimes a learning curve is the price you pay for really capable tools. So, put a set of bookmarks in your browser, make a cheatsheet and keep it handy. A local cheatsheet with a good searchable title does wonders. If you're thinking 'Oh, I'll just use it once' or 'I'll remember', well, git is for professionals. Are you a professional?

Git == version control

Version control is to 'track changes' for a text document, as Superman is to Tarzan; as Microsoft Word is to TextEdit, as New York City is to Detroit. Firstly, with version control, you are tracking changes to an entire project, not a single document. Changes are tracked line by line with comments and attribution through time as the project grows, changes, and splits; as subroutines develop into full projects of their own; as new owners take control of the code base. It is flexible and thorough and reliable. You should learn to use it.
Use case: You've got some code you developed for your thesis that you want to upload to GitHub. Maybe someone else will find it useful. There are GUI front ends to git, which may help with many tasks, but git was designed to be run from the command line, and this simple use case is not too difficult to master at that level, so let's just go for it.

Download and install git

Walk through the steps at github set-up. Today, they suggest that you download their native app, but note that this only manages part of the workflow. The steps you have to do to get git working still involve the terminal. Be brave. The steps are:
  1. get a GitHub account
  2. download and install git
  3. link your local system to your GitHub account
    • tell git your name, email, GitHub login information
    • set up security keys so that GitHub knows you are you

Prepare your codebase

There are a couple of adjustments you probably want to make before releasing your code in the wild. GitHub recommends that all code comes with a License, a Readme and a .gitignore file.
  • License If your code comes with a license, it's easier for collaborators to re-use the code and build on it. Specifically, it makes clear what they are allowed to do. It may not be important to you, but your code will be easier to share if you make it clear what the rules are.
  • Readme
    You have to explain your work at some point. If you do it in a file named README, GitHub will automatically put it on the front page of the repository. This is very helpful for anyone trying to understand what you did and why. The README can be just a text file. If it is in Markdown, perhaps with additional GitHub flavoring, GitHub will render it with headings and styles, which is much nicer for the reader and not difficult for the writer.
  • .gitignore
    Some files don't need to be tracked. For instance, some old Mac directories contain .DS_Store files with directory display information for the Finder app. That doesn't need to be part of the repository. So here's my .gitignore for an old MatLab project:
      $cat .gitignore
      .DS_Store 
    
    Pretty simple. You might also add *.log or tmp/ to the .gitignore file, depending on your context. Basically, any files that are automatically generated or updated on compile should not be tracked.

Make a local repository

The project you want to get onto GitHub probably consists of a directory or directory tree containing a series of text files, and possibly some image files or data files. In order for git to track changes to this project, you have to put these files into a repository. This is simple to do once you've prepared it for sharing.
Find your command line. On a Mac, you can use the Terminal app. Navigate to the directory holding your project. If you have never ever used the command line before, this might be challenging. If you want to dive in, you can certainly do so with three little commands: ls, cd, pwd. You can look at the man pages for these commands by typing, for example $man ls, or you could try a crash course in using the command line.
Once you get to your project directory, type:
$git init   
You should see a message from git, something like:
Initialized empty git repository in /Users/suz/programming/octave/OrX/.git/
This initialises an empty repository, which lives in a hidden directory named .git. You can check that it's there by typing
$ls -a
Git gives some feedback about what it has done, but I often find it useful to check with
$git status
after each command to see what has happened.
Now we can get the project into the repository. To do this, first type
$git add .   
This prepares git to add your files to the repository, a process known as 'staging'.
Note: The . tells git to stage all the files in the directory tree to the repository. This is very handy when we add files because they don't all have to be specified by name. On the other hand, it isn't ideal, because there are often binary files or .log or even image files that update automatically. You won't want to keep track of those changes. Fortunately, git will automatically look for the .gitignore file we already prepared to get the list of exceptions.

Ready for some commitment? Type:
$git commit -m 'initial commit' 
You should get a full list of the files that git is committing to the repository. Check the status when it's done and you should see a reassuring message:
# On branch master
nothing to commit (working directory clean)

Success! Now your project is actually in the repository, and git can track any changes to the files. The repository will also keep all the messages that you put with each commit. Always use the -m flag and take the time to add a meaningful message.

Congratulations! Your code is now in a git repository, under version control. You are ready to collaborate.

Share your work


Make a repository on GitHub

  1. Log into your GitHub account
  2. Create a new repository on GitHub
    On your profile page, in the 'Repositories' tab is a list of repositories that you've contributed to. At the upper right should be a bright green 'New' button.
  3. Follow the directions in the form, adding an informative description so others know what treasure they have found.
Congratulations! You have a GitHub repository to share your code from!

Link the repositories

In git terminology, the current state of your code is the head of a branch. By default, the branch created by your initial commit is called 'master'. You can make other local branches, and probably should, to try out new features. You can also have remote branches. At this point, your new GitHub repository is essentially an empty remote repository. By convention, this remote will be referred to as 'origin'. To point git to it, type (on one line), with your appropriate username and project title:

$git remote add origin https://github.com/username/project.git

This command translates roughly as "Dear git; Please be so kind as to add a connection to a remote repository. I will be referring to the remote repository as 'origin' in our future correspondence. It can be found at https://...... Thank you for your kind assistance in this matter. Sincerely, yours truly, esq."

Upload your code

Ok, ok, there is no upload on GitHub, but it is payoff time. Once you have a local repository linked to a remote repository, you can just push the code from one to the other.

$ git push -u origin master

Translation: "Dear git; Please push the recent changes I committed to my local repository, known as master, into the remote repository known as origin. Also, please be aware that I'd like this to be the default remote repository, sometimes referred to as 'upstream'. Thank you again for your kind assistance. I am forever in your debt. Sincerely, Yours truly, esq. and, etc."

Success!!! You have now successfully pushed your code to GitHub.

Or at least I hope you were successful. If not, if you've tried to follow this post and the directions at GitHub and you still feel lost, there is more help out there. Many universities are running Software Carpentry bootcamps to help students and faculty develop more professional programming skills. The skills taught aim to improve software collaboration and impart the skills needed to carry out reproducible research. Two key tools they teach are version control with git and collaboration via GitHub.
Live long and collaborate!

Wednesday, 29 January 2014

UV reactive bead spectra


Background

I was interested in designing some educational experiments using UV-reactive beads to teach about the presence/absence and intensity of UV light under different atmospheric conditions. The beads turn from colorless to various bright colors when exposed to UV light. Unfortunately, these beads are just too sensitive – they even react strongly to the stray 400 nm light coming through glass windows, and they reach full color intensity within a minute, even at 51° N in February! It's fun to watch, but not very useful for detecting physiologically relevant UV conditions for vitamin D synthesis.

I don't have a lot of scientific information about these beads, so the range of UV light needed for the color change is not well characterized, and the resulting absorbance spectra of the beads (related to their apparent colors) are not readily available. I decided to study them with a reflectance spectrophotometer to see what I could learn.

It may not be useful for the educational project I had in mind, but I'll write it up here anyway, along with the R-code used to analyze the data….

Experimental

I ordered UV Reactive beads from UV gear, UK.

I used a table in the garden on a sunny day for the experiments. The sky was open directly above and direct sunlight came from the south. There were a few leafless tree branches in the way, and some buildings nearby, so it was not full sky exposure, but at least there was full direct sunlight. The table was covered with a small white blanket to provide a consistent background. The beads equilibrated for 10-15 min in direct sunlight before spectral measurements were made.

I used an Ocean Optics model USB2000+ spectrometer connected to a 1 meter UV-Vis fiber optic cable (part number QP400-1-UV-VIS from Ocean Optics) with a cosine correction filter at the end (Ocean Optics part CC-3-UV-S) to minimize stray light. The spectrometer was connected by USB to a MacBook Pro (OS X Mountain Lion 10.8.1) running SpectraSuite software for data collection. Spectra were measured with 30 averages in reflectance mode and saved as tab-delimited text files. The reflectance measurements required a 'reference spectrum' (light spectrum) of the white blanket and a second, 'dark' reference, which was obtained by blocking the end of the cosine correction filter. To collect the reflectance spectra, I pointed the fiber optic light pipe at individual beads from a distance of < 0.5 cm.

Spectral analysis can be done in the SpectraSuite software; however, many basic functions had not (yet?) been implemented in the OS X version of the software. Also, I wanted to practice my R skills, so I decided to load the data into R and see what I could learn. R has multiple spectroscopy packages, and in fact a new one has come out since I started this project. I looked into using hyperSpec, but the data structure seemed overly complicated for the simple analysis I had in mind. So what follows is my simple R analysis of reflectance data on sunlight-exposed UV beads.

Load in the Data

# change working directory
try(setwd("/Users/suz/Documents/vitD Schools/UV bead exp 19022013/spectrometer data/"))

# data is tab-separated, with a header. The end of file wasn't recognized
# by R automatically, but a read function can specify the number of rows
# to read.
spec.read <- function(spec.name) {
    read.table(spec.name, sep = "\t", skip = 17, nrows = 2048)
}

# the function can then read in all the files ending in '.txt', put them
# in a list.
spec.files <- list.files(pattern = ".txt")
df.list <- lapply(spec.files, spec.read)

# convert the list to a matrix: the first column is the wavelengths and
# the other columns are experimental data -- one reflectance measurement
# at each wavelength.
spec.mat <- matrix(df.list[[1]][, 1], nrow = 2048, ncol = (1 + length(spec.files)))
spec.mat[, 2:11] <- sapply(df.list, "[[", 2)

matplot(spec.mat[, 1], spec.mat[, 2:11], type = "l", lty = 1, lwd = 3, xlab = "wavelength (nm)", 
    ylab = "reflectance", main = "UV color changing beads, after UV exposure")
text(500, 25000, "Raw Data")

Looks like we need to do some clean-up!

Clean up the data

Baselines and edges

At the edges of the spectral range, the reflectance data is dominated by noise, so it isn't useful. The baselines for the different spectra also need aligning, and we'll scale them to the same intensity for comparison. The intensity range observed in the data depends somewhat on the angle of the probe, even with a cos filter in place.

# define terms for the processing:
nm.min <- 400
nm.max <- 800  # edges of the displayed spectrum
base.min <- 720
base.max <- 900  # define baseline correction range
peak.range.min <- 420
peak.range.max <- 680  # where to find peaks for scaling.

# remove ends of the data range that consist of noise
spec.mat <- spec.mat[(spec.mat[, 1] > nm.min) & (spec.mat[, 1] < nm.max), ]

# normalize baselines, set baseline range = 0.
spec.base <- colMeans(spec.mat[(spec.mat[, 1] > base.min) & (spec.mat[, 1] < 
    base.max), ])
spec.base[1] <- 0  # don't shift the wavelengths
spec.mat <- scale(spec.mat, center = spec.base, scale = FALSE)

Choose colors for the plot by relating the file names to R's built-in color names.

bead.col <- sapply(strsplit(spec.files, " bead"), "[[", 1)
# replace un-recognized colors with r-recognized colors (see 'colors()')
bead.col <- gsub("darker pink", "magenta", bead.col)
bead.col <- gsub("dk ", "dark", bead.col)
bead.col <- gsub("lt ", "light", bead.col)
bead.col <- gsub("lighter ", "light", bead.col)

# plot corrected data
matplot(spec.mat[, 1], spec.mat[, 2:11], type = "l", lty = 1, lwd = 3, col = bead.col, 
    xlab = "wavelength (nm)", ylab = "reflectance", main = "UV color changing beads, after UV exposure")
text(500, 10, "Baseline Corrected")

From this plot, we can see that the lighter colored beads have smaller peaks than the darker beads. The lighter color probably represents less dye in the beads. There seems to be a lot of variation in the peak intensity of some beads, particularly the yellow beads and the dark blue beads. Based on the width of the peaks, the purple and magenta beads appear to have mixtures of dyes for both pink and blue colors. The dark blue beads appear to be either a mixture of all the dye colors or a mixture of pink and blue, but much more dye is used than for the paler pink or blue beads. The yellow bead spectrum is oddly shaped on the short wavelength side, probably due to the instrument cutoff around 400 nm.

Scaling and smoothing

Now I'll scale the data to the same range. It turns out that the R command 'scale' is perfect for this.

# scale the peaks based on the min reflected intensity
peak.range <- which((spec.mat[, 1] > peak.range.min) & (spec.mat[, 1] < peak.range.max))
spec.min <- apply(spec.mat[peak.range, ], 2, min)
spec.min[1] <- 1  # don't scale the wavelengths
spec.mat <- scale(spec.mat, center = FALSE, scale = abs(spec.min))

The spectra are also jittery due to noise, which can be removed by filtering. The following filters over a window of 10 points.

data.mat <- spec.mat[, 2:11]
dataf.mat <- apply(data.mat, 2, filter, rep(1, 10))
dataf.mat <- dataf.mat/10
specf.mat <- matrix(c(spec.mat[, 1], dataf.mat), nrow = dim(spec.mat)[1], ncol = dim(spec.mat)[2], 
    byrow = FALSE)
matplot(specf.mat[, 1], specf.mat[, 2:11], type = "l", lty = 1, lwd = 3, col = bead.col, 
    xlab = "wavelength (nm)", ylab = "reflectance", main = "UV color changing beads, after UV exposure")
text(500, 0.2, "Scaled and Smoothed")

I could attempt to prettify the spectra further by using actual colors from pictures of the beads. It looks like the Bioconductor project has a package, 'EBImage', that should be just what I want, but it looks like I need to update to R version 3.0.1 in order to run it. So I'll just pick some spot colors with ImageJ.

bead.yellow <- rgb(185, 155, 85, maxColorValue = 255)
bead.orange <- rgb(208, 155, 82, maxColorValue = 255)
bead.purple <- rgb(110, 25, 125, maxColorValue = 255)
bead.pink <- rgb(190, 100, 120, maxColorValue = 255)
bead.dkblu <- rgb(22, 18, 120, maxColorValue = 255)
bead.ltblu <- rgb(130, 135, 157, maxColorValue = 255)
bead.dkpink <- rgb(200, 25, 140, maxColorValue = 255)

bead.col2 <- c(bead.dkpink, bead.dkblu, bead.dkblu, bead.pink, bead.ltblu, bead.orange, 
    bead.pink, bead.purple, bead.yellow, bead.yellow)

matplot(specf.mat[, 1], specf.mat[, 2:11], type = "l", lty = 1, lwd = 3, col = bead.col2, 
    xlab = "wavelength (nm)", ylab = "reflectance", main = "UV color changing beads, after UV exposure")
text(500, 0.2, "Colors from Photo")

Analysis

I can carry this analysis further by quantifying the wavelength of the peak absorbance and the peak width (usually the Full Width at Half Maximum – FWHM) for each spectrum. This could be useful in further analyses, reports, or as a feature in a machine learning approach.

To extract this information, I could try to fit a series of gaussians to the peak, representing the fraction of pink, blue or yellow dye present, but the quality of the data doesn't really justify this, particularly as I don't have a good shape for the yellow absorbance peak or adequate reference data for each of the dyes. As a quick and dirty method, I'll take the median position of the data values that are at 95% of peak. That should give something in the center of the peak. Since the peaks have all been scaled, that corresponds to the center of the data values < -0.95.

Likewise, the usual peak width (FWHM) would be the range of values < -0.5; however, the poor baseline at shorter wavelengths makes this range more or less unusable. For this reason, we can take a better-behaved approximate peak width as the range of reflectance values < -0.8.

# approximate peak wavelength
indcs <- apply(specf.mat[, 2:11], 2, function(x) median(which(x <= (-0.95))))
pks <- specf.mat[indcs, 1]

# approximate peak width
lowidx <- apply(specf.mat[, 2:11], 2, function(x) min(which(x <= (-0.8))))
highidx <- apply(specf.mat[, 2:11], 2, function(x) max(which(x <= (-0.8))))

pkwidth80 <- (specf.mat[240, 1] - specf.mat[140, 1])/100 * (highidx - lowidx)

features <- data.frame(pks, pkwidth80, bead.col, bead.col2)
features <- features[order(pks), ]
features
##      pks pkwidth80  bead.col bead.col2
## 10 436.4     75.21    yellow   #B99B55
## 6  449.8     88.86    orange   #D09B52
## 9  451.7     98.07    yellow   #B99B55
## 7  529.3    111.35      pink   #BE6478
## 4  533.6     99.18 lightpink   #BE6478
## 1  549.8    110.61   magenta   #C8198C
## 8  568.7    133.10    purple   #6E197D
## 3  581.1    179.93  darkblue   #161278
## 2  585.4    136.79  darkblue   #161278
## 5  603.0     76.32 lightblue   #82879D

# save your work!
write.csv(features, "UVbead data features.csv", quote = TRUE)

One side effect of this is that we now have a small table of metrics that describe the relatively large original data set reasonably accurately and could be used to classify new data. In current data analytics parlance, this is known as 'feature extraction'. In traditional science, these characterizing features could be combined with others from different studies (crystallography, electrochemistry, UV-vis, IR, Raman, NMR, …) to help predict the effects of a chemical change or a different chemical environment on the behaviour of the molecule. Such studies are traditionally used to help direct synthetic chemists toward better products.

Conclusions

From this analysis, we can see that the blue beads are absorbing at longer wavelengths than the yellow, and the pink and purple beads have absorbances in between. The darker colors have broader absorbance peaks than the lighter colors, with the darkest blue having a range that appears to cover the purple, yellow and blue regions. These darker beads probably contain combinations of the dyes used for the different colors, and not just more of the dye used for the light blue beads.

The reflectance spectra do not discriminate colors as well as our eyes do. For instance, the yellow and orange beads are readily distinguished by eye, but not so clearly in the observed spectra or the extracted features. This may be due to the 400 nm cutoff of the light pipe, which distorts the peak shapes for the yellow and orange beads. Ideally, we could observe the changes in the absorption spectra over the whole 280-800 nm range as a function of time. The current reflectance spectrometer setup, however, only captures the 400-750 nm range. This means that we do not have access to the interesting behavior of these dyes at shorter wavelengths.

Chemically, I expect that absorption of UV light in the 300-360 nm range causes a reversible conformation change in the dye molecules, as has long been known for the azobenzenes and their derivatives. After the conformation change, the absorption maximum of the dye is shifted to a much longer wavelength. If the UV dyes in these beads are closely related to each other, it is likely that the yellow beads absorb at relatively short UV wavelengths and the blue beads at longer wavelengths before the transition; however, without measuring the UV absorption spectra we cannot know this. It is quite possible that their UV absorption spectra are nearly indistinguishable and the effects of their chemical differences are only apparent in the visible spectra. Without access to better hardware or more information about the molecules involved, there is no way to know.

Next steps

We do have access to one thing that varies in the appropriate way. The experiments shown here were taken at relatively low UV light levels (February in England). During the summer, the range of solar elevation is much larger: at dawn the sun is at the horizon, while at noon it is about 60° higher. Since light scattering in the atmosphere is a strong function of wavelength, the relative intensities of light at 300, 360, and 400 nm will vary with the angle of the sun. If the beads have different absorption peaks in the UV range, the time dependence of their color changes should vary with the time of day.

The simplest way to measure this is not with a spectrometer, but probably by following color changes in video images.

Tuesday, 10 December 2013

NHS hack day thoughts

What a glorious weekend! The sun finally came out, and while my family was out enjoying the countryside and biking to the beach, I spent it in central London at the NHS Hackday.

Not that I'm complaining. I had a great time. There were some really talented people there, and some very committed parents, doctors, programmers and just plain technically minded people. It was a really interesting weekend.

The hackers present were a pleasant mix of odd-ball grad student types, young doctors, programmers, developers and random 'IT' people, all with an interest in trying to contribute something to the efficiency, ability and smooth running of the NHS. The elephant in the room is whether any of these projects will ever become useful. Several applications had very useful ideas. The winning app 'Waitless' aimed to provide an SMS service in which people could send an SMS and get information about the distance to local NHS services and the likely waiting time once they get there. This way, someone with an earache could make an informed choice to go to their local walk-in clinic instead of the A&E department depending on wait times, opening hours, and distance.

The progress made on these apps over the course of the weekend was astounding, with several nearly becoming usable services, and certainly good proof of principle demos by the time of presentation at 3pm on Sunday. It is truly amazing what a team of 8 can get done in a weekend with modern developer tools, available APIs, open source software, and online services. Wow.

And me? well, not so much...
I ended up working on the aptly named FAIL project: 'Fatal Accident Inquiry Learning', which attempted to apply machine learning techniques to Scottish Fatal Accident Inquiry (FAI) reports. Unfortunately, I spent most of the first day struggling to get NLTK and various supporting technologies installed on my system, and most of the second day learning the very basics of working with them. Carl had somewhat more success in getting some snippets of the reports into Carrot2, but the results were less than impressive.

Challenges:
1) I need more experience with Python
Everything I know about Python, I learned from Code Academy and from my homework for 'Intro to Data Science' on Coursera. The homework was a pretty good introduction for this project, as it involved sentiment analysis of a tweet stream. I was able to do some basic filtering and reuse the structure of the base homework code. There were some difficulties in translating what I had learned from the homework analyses of short tweets to these much larger, richer records, and I struggled to create a workflow between parts of the analysis.

2) I need more time with natural language processing
The FAIs are long legal text documents. It is possible to extract text from them, but it isn't easy. The text has some consistent elements, but is not in a consistent form. Some dates are in 'day month year' format, while others are given as 'month day, year' and still others as 'day of the month of year'. This makes it somewhat difficult to extract even basic information such as the age of the victim. It should be possible to get this by using a grammar with NLTK, but, well, I didn't manage to come to grips with it in the 45 minutes I gave it. Perhaps not surprising. Similarly with bigrams, trigrams, collocations and pointwise mutual information (PMI) for collecting information on phrases and unusual words. I learned a lot, but wasn't able to put much of it into use. Yet.

...hopefully I'll find some time to try out topic modelling on this data at some point.

3) Relevance of the project
Ideally, we'd like to analyse these reports and make some inferences that are relevant for the NHS and that could lead to improvements in quality of care. Unfortunately, the data set we are looking at is not like hospital episode statistics -- it is not a statistical sample. Although there were some 1652 fatal accidents in Scotland in 2011, only 28 FAI reports were published that year. Our dataset consists of the 82 such published inquiry results from the last few years. Some inquiries are published long after the incident, but even so, inquiries are held for less than 2% of fatal accidents.

Inquiries can be called whenever there are unusual circumstances. They are required in some circumstances, such as when a death occurs in custody. By definition, then, these accidents are the outliers. Some of them are candidates for the 'Darwin Awards': tragedies begot by stupidity. Others are simply tragedies.

The Scottish authorities hold these inquiries with an eye toward preventing further accidents, and such investigations do have impact on our daily lives. Protocols for how often the highway lines are repainted, police guidelines for how people in custody are transported, and yes, even those ubiquitous labels: this is not a toy; not for children under 3 yrs of age; do not play on or around. FAIs are the fault-checking analyses that lead to health and safety advice.

So we tried various approaches to extracting information and comparing text in the reports, but ultimately we did not come up with a truly compelling use case for the data or inferences from it.

Observations on the data:
  • Accident statistics are interesting reading.
  • Each of these accidents is a story of its own. 
  • Men are in more fatal accidents than women. For all ages, nearly 65% of the accident victims are men.  For men < 65 years of age, the ratio climbs to nearly 75%. 
  • Fatal accidents are more common in older people. In 2011, 57% of the male accident victims were over 65 yo. At the same time, among female victims, 76% were over 65 yo. 
  • In younger age groups, poisoning is the most common cause of accidental fatalities. In 2011, there were no poisonings in children < 15 yo. The statistics include alcohol poisoning.
  • Falls are the most common fatal accident type for people over 65 years, and over 60% of the victims are women, showing a sharp contrast with all other accident types and ages. 
Observations on the hack day:
  • don't forget your coffee cup
  • wander around and see what different groups are doing  -- don't wait until the pub afterwards to find the person with a degree in computational linguistics!
  • what's your goal? if it's social, be social. If it's coding, join a group
  • you will probably learn more from a larger team with more varied skills
  • how competitive do you want to be? 
  • a video of a good use case is impressive
  • it's often more efficient to ask for help


 

Thursday, 7 November 2013

How to run a MOOC: a student perspective

Over the course of the past year, I've followed nine MOOCs on Coursera. I've also used other online learning tools, including Code Academy and Khan Academy. I've enjoyed this period of learning and updated a number of skills, which I hope will be useful in the future.

The MOOC's I've taken have ranged from the wonderfully organised Machine Learning course taught by Coursera founder Andrew Ng to the extravagantly disorganised Startup Engineering course which was primarily led by another Stanford lecturer, statistician, and cofounder of Counsyl, Balaji S. Srinivasan. As more professors become involved in the MOOC phenomenon and try to gain audiences on YouTube and other social media, I thought I'd write up some of my experiences as a student and make some recommendations, or maybe it's a wish-list.

In most MOOCs, the course staff appear to be under-prepared for the demands of the platform. This is a recurring theme, particularly with newly offered classes, so if you are thinking of offering a MOOC, please, please have a beer with someone who has run one and get the full story. It is clearly not an easy thing to do, particularly when the student numbers get large. Coursera courses regularly have > 100,000 students. Jeff Leek's and Roger Peng's post mortem of the course 'Data Analysis' might be a good place to start.

In general, MOOCs tend to provide a better student experience when the professor has taught the course material many times before and is not straying too far into new territory. This is particularly true for a first attempt at a MOOC. Bill Howe's course 'Introduction to Data Science' was not a particularly good student experience; I think this was because he tried to add too much to what he had taught before. It's better to be focussed and have only 50,000 happy, learning students than to try to do too much and have 120,000 frustrated, failing students. At least to me. You can always plan to change an assignment in the second offering to incorporate new technology.

Recommendations:
  • Lectures: Have every lecture planned very well ahead of time, preferably before the course even starts. Lectures of 7-10 minutes work well. Some students like them longer, but others don't. Leave yourself plenty of time for technology hiccups -- estimate the time it will take, then multiply x2 and change the units. Hours become days. 
  • Resources: If you see 3 -5 similar forum threads running simultaneously, each with > 200 comments of people trying to help each other out, you have failed to get the message across clearly in lecture. A few links to additional introductory material can do wonders.
  • Forums: These will be going 24/7. There will always be complaints about the level of the assignments or the language used or something. Some of it will also be interesting discussion that you want to stay on top of -- 24/7. Keeping up with the forums can be daunting, so a community TA or two can be useful. Forum organisation is important. As the course progresses, important threads get buried, and useful information is often buried at the bottom of a long thread. It's useful to have a section of the forum for each separate assignment as well as sections for software issues, platform issues, deadline problems and general discussion. A TA who can regularly summarize important points and highlight useful posts is very helpful.
  • Extra interactions: Some students find local Meetups or study groups to be invaluable. 
  • Community TA's: These people are volunteers. Most community TA's appear to be more interested in interacting with the more advanced students and furthering their own learning than in supporting students who are having difficulties. Please review comments made by your community TA's. A few will fall into using snark to glorify themselves. The best ones will highlight useful forum contributions and links to help other students.
  • Assignments: Students have different expectations of assignments. One of the main advantages of an online system is instant feedback. I like assignments that contain questions of different difficulty levels. This lets me identify where I could spend more time and also how solid my knowledge is. I use the feedback from incorrect answers, so having two chances is useful, but a bit stressful. Having five chances is better. It's fine if the maximum number of points gained is reduced when more than two chances are used (e.g. automatic 20% reduction at the 3rd submission). This can be invaluable for international students who may have difficulty interpreting the questions. 
  • Deadlines: MOOC students are often unable to accommodate deadlines. Reasons vary. For me, I can put in 8-10 hours / week when my kids are in school, but during Fall break week, I might manage 2-3 hours. Other students have occasional work deadlines, or a long-planned vacation. Most MOOC students are managing to set aside a few hours for studying from otherwise busy lives, but those lives occasionally interfere. One useful approach is to have 10 late days that can be applied at any time. This means that if I join the MOOC late and miss the 1st deadline by 2 days, I can use two of my late days. If I have to miss a deadline because much of my weekend was taken up throwing a birthday party for my 5 year old, I can use another one. If the online system we were supposed to use is too overloaded and breaks down, students can apply late days to shift the deadline to a time when the system is less busy (and therefore functioning). This gives flexibility and responsibility to the students, which is really nice. Some teachers disagree, though.
  • Timezones: Your students will not only come from all over the world, as in a modern classroom, but will actually be all over the world. This means that you must be aware of time differences. Time zones for deadlines and for release of new lectures makes a difference, but more importantly if you require students to do an online collaboration of some sort, allow them to log in at different times. Some students can find hours in the middle of the day, while others only find them late at night, and those times are staggered all around the globe. Consider grouping time-zone regions so students can choose to participate at a convenient time. (Hint: 2am in China is not convenient.)
  • SNAFU's:  Things won't go as planned the first time. It will reflect better on you as a teacher and on your institution if you can adjust as needed. Jeff Leek had to drop a code reproducibility portion of his grading, and Antonio Rangel had to drop an experimental interactive market. Please be flexible in your use of new technology. By all means, try it out, but be aware of student needs, which will vary -- not everyone has an American credit card, which some web services require for registration, even if nothing will be charged. 
It will be interesting to see how MOOCs develop. These new online platforms are effective for learning, and education as we know it is clearly changing. So, as soon as my 10 year old finishes his decimal math on Khan Academy, I'm back to Financial Accounting.

Monday, 30 September 2013

Coursera: Introduction to Data Science (Course Review)

I finished this a few months ago, but it will probably be offered again, so here's a review. This course was taught by Bill Howe at University of Washington and offered as a MOOC on Coursera.

Course Description: 
The Coursera description promises newbie to data ninja in 8 weeks, with a workload of 8-10 hours/week. Those of us who have finished our statistical mechanics homework at 2 AM know that such promises are not only empty, but rather a guarantee of a course that is over-ambitious. As the description implies, this is an overview course that tries to do too much. Every student should realise this at the outset: an introductory course that claims to cover everything is certain to be a rough ride. (I know. I've taught some.)

Lectures: 
The lectures covered some really interesting content, and the lecturer appears to know the industry very well, particularly the Microsoft perspective and tools. I assume that many of his continuing education students are Microsoft employees who want to update particular skills. Such students are not run-of-the-mill for UW.

When I was a TA at UW, the undergraduates needed slow feeding with very small spoons. Lecture halls were filled with slumped bodies under baseball caps. The evening courses, in contrast, were filled with lively, interested, adults who learned independently and came to class with lists of questions. This course is aimed at those active, interested adult learners. That said, the number of hours listed for this class is a gross underestimate for the material covered and the assignments given.

Professor Howe's lecture style is not always engaging, and a lot of material is covered. There were often over 3 hours worth of lecture material to review during the week. Along with following links and reading supporting papers, this left very little time for the assignments themselves. Prof. Howe did a good job of introducing and comparing a range of current technology choices (particularly the comparison of different database technologies). As a data science newbie, I would have liked a bit more information and emphasis on use cases for different types of databases.

The database parts of the course were well presented, and this covered subjects that I hadn't seen in my other work. The data analysis elements were not so clearly taught, though, and there are better (slower) ways to learn this material on Coursera. If you have time, Jeff Leek's course 'Data Analysis' covers this much more thoroughly. Andrew Ng's now legendary Machine Learning course is also good, although more mathematically oriented, with less emphasis on organisation, data munging techniques, and communicating results.

Later lectures in this Intro to Data Science course appeared to have incorrect answers in the in-lecture questions. I got bored of trying to keep track of the errors and inconsistencies in the course. The material needs a thorough editing before the next offering.

Assignments:  
The lectures were not particularly good preparation for the homework assignments. A lot of independent learning was required to make progress in the course. The assignments were also relatively difficult compared to what I expected from the course description. The first assignment was a sentiment analysis of a Tweet stream written in Python. I have a pretty good programming background, having started with Basic back in 1983, visited Fortran, MatLab, Igor, Unix utilities, C and Ruby, and continued on to Objective-C, R and functional programming in Scala. I pick things up quickly. Yet although the course description did not require a programming background, I had to spend several hours learning the ins and outs of Python from Code Academy before I could get a handle on the assignments.

The level of the first assignments was not commensurate with expectations from the course description. I learned a lot, more than I expected, in fact, and I can now implement a matrix multiplication in Python, SQL, or MapReduce based on the homework assignments. The auto-grader for the 1st assignment never did accept my answer for the final part. It also didn't give sufficient feedback for me to solve the problem, which probably had something to do with text encoding, but was very frustrating nonetheless. This sort of issue doesn't teach much. Save your perseverance for things that matter.

Overall, the assignments were challenging. I learned a lot, though not always what the assignment was meant to teach. I think there were a lot of complaints (more than normal) about the difficulty of the assignments, and later assignments were quite a bit easier than earlier ones. Assignments covered:

  • Python:   Tweet stream sentiment analysis
  • SQL:   Queries, tables and matrix multiplication
  • Tableau Visualization:    FAA Bird Strike Data 
    • write-up and peer assessment
    • note: I had to use Tableau via Amazon web services as it only runs on Windows.
  • MapReduce:   data joins, basic network analysis, matrix multiplication
  • Kaggle: Take part in a competition (I did facial keypoints detection)
    • write-up and peer assessment: ranking on the leaderboard did not matter.
A couple of the assignment deadlines were changed after the deadline had passed. This is very unfair to people who have worked hard to make the deadline, although it was reasonable in the case of the MapReduce homework, where we were using a new web system that was supposed to be able to handle the volume of students. This is a continuing problem with MOOCs that have > 100k students enrolled. Any time the professor makes an assignment that will run on new technology, be prepared for a very frustrating experience. In my opinion, new web-based technology should not be used for graded assignments in MOOCs. It should be tested first as an optional or staged assignment so that 100k students are not accessing it in the same week.

Overall Recommendation:
Students: I hope that the professor will offer this course in a pared-down form. As it is, if you're already awesome at Python and SQL, go ahead and dive in. Everyone else should consider this a taster course and audit only, at least with the current assignments. Be selective about which parts you choose to look at. If you experience slowdowns or poor behaviour with particular technologies in the assignments, put them aside and try again when the course is over or the deadline has passed. It seems like a class, but it's a free platform and you get what you pay for. A lot of professors are using this to try out new technologies, so don't expect it all to work as advertised.