Thursday 12 June 2014

Ozone Weather

An open source weather app that calculates UV light intensity for the user's current time and location and gives a suggestion for how long it will take to make vitamin D. It also forecasts UV light intensities for the coming week. The code is up on GitHub.


Background:

This project is based on a tutorial from Ray Wenderlich's great website (sample app: SimpleWeather). It adds ozone data from the TEMIS database, using libxml2 and Hpple to parse the .html, following another tutorial on how to parse HTML on iOS. It also includes my astronomy code from previous projects for calculating solar angles and estimating UV intensities.


Technologies

The iOS best practices tutorial included a number of interesting technologies. I've ignored the image blur filter they use in the main view, but I kept these:

  • cocoapods -- package manager and dependency tracker software. Basic usage:
    • > edit Podfile -- CocoaPods searches for a file named 'Podfile'. List the dependencies to be included with the project there (see the example in the directory).
    • > pod install -- obtains the dependencies and creates a Pods/ directory to hold them. It also creates a .xcworkspace file that holds the project together with its dependencies. Use this workspace in Xcode.
    • > pod update -- updates a dependency to the latest version. I had to do this in this project because the TSMessages project had suffered from a problem at GitHub and needed a bug fix, which broke my older version. Cocoapods to the rescue!
  • Mantle -- a project from GitHub for creating data models: aids conversions between JSON <--> NSObject (very handy with a JSON data feed, as we'll see!). I only wish there were more useful tools for directly converting a dictionary into an object. I didn't come up with a good solution for the ozone data, but at least it works, and it's pretty easy to see how it could be improved. (A minimal Mantle model is sketched just after this list.)
  • ReactiveCocoa -- allows you to use functional programming constructions in iOS apps. Ever run into a spaghetti of callbacks with KVO? Functional Reactive Programming may not be the ultimate solution, but it certainly provides a different paradigm that applies to many common situations. Wow. If you haven't used it, you gotta try this stuff.
  • hpple -- I needed an HTML parser for iOS. There are a lot of choices, but this one had a decent tutorial at RayWenderlich.com and seemed easy to use (and it was). The parsing problem is pretty small, as I only want to do a single page that hasn't changed in years. Regex would have worked, but I wanted to learn how to do it better.
  • TSMessages -- a ticker-style alert message system.
  • git for version control -- the technology driving collaborative development on GitHub. Can't say I've mastered the learning curve yet. This is one of those situations where a video explanation really helps. Jessica Kerr (@jessitron) does a great one, git happens -- with sticky notes! (This particular presentation is for coders coming from a background in subversion, but I've also seen Jessica do a great intro for novices who haven't heard of version control at all. Just search for git happens.)
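
As a taste of what Mantle buys you, here's a minimal sketch of a model for the ozone data. The class name, property names, and JSON keys are made up for illustration; they're not the exact ones in the app:

#import <Mantle/Mantle.h>

// Minimal Mantle model: subclass MTLModel and describe how keys map to properties.
@interface OWOzoneEntry : MTLModel <MTLJSONSerializing>
@property (nonatomic, copy, readonly) NSString *dateString;
@property (nonatomic, copy, readonly) NSNumber *uvIndex;
@property (nonatomic, copy, readonly) NSNumber *columnOzone;
@end

@implementation OWOzoneEntry
// map property names to the keys in the incoming dictionary
+ (NSDictionary *)JSONKeyPathsByPropertyKey {
    return @{ @"dateString": @"date", @"uvIndex": @"uv_index", @"columnOzone": @"ozone" };
}
@end

// usage: turn a dictionary into a typed model object
// OWOzoneEntry *entry = [MTLJSONAdapter modelOfClass:[OWOzoneEntry class]
//                                 fromJSONDictionary:json error:&error];
// (a real model would also register a value transformer to turn the date string into an NSDate)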

As with any newly hatched project, there's the question of what to add to .gitignore. For a project using cocoapods, see the pros and cons at guides.cocoapods.org. For this project, I want to keep it lightweight, but also keep track of the dependencies. To do this, I'll add the Pods/ directory to .gitignore, but keep the Podfile, Podfile.lock and other files under version control. I used a Stack Overflow post to get an appropriate .gitignore file.
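
The resulting .gitignore ends up looking roughly like this (the Xcode entries are typical ones from that Stack Overflow template, not an exact copy):

$cat .gitignore
# OS and Xcode noise
.DS_Store
build/
xcuserdata/
# CocoaPods: ignore the installed pods, keep the Podfile and Podfile.lock under version control
Pods/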


TEMIS

The ozone data I'm using basically comes from scraping a website. It is an old website, run by the Tropospheric Emission Monitoring Internet Service (TEMIS), and it lacks a friendly API. Raw science. To get the column ozone for a location, I need to query the website with a URL of the form http://www.temis.nl/uvradiation/nrt/uvindex.php?lon=5.18&lat=52.1, where the lon and lat values are provided by my code. The response is only available as .html, so I need to fetch it and parse it to extract the desired column ozone values. I'll use hpple to parse the HTML response and its XPath query system to walk the DOM and extract values from the relevant table.

The entire page is formatted as a series of nested tables. The page header is one table; a second table holds the body of the page, with one column holding the frame to the left, a blank column, and a column holding the data table I'm interested in.

html -> body -> 2nd table -> tbody -> tr -> 3rd td -> dl ->
dd -> table -> tbody ->
    tr -> td -> <h2> location </h2>
    tr -> 3x td -> (headers tagged <i>) Date, UV index, ozone
    tr -> 3x td -> (data values) day Month year, .1f, .1f DU

There are a lot of XML parsers and JSON parsers available. In my perfect world, .html files could be parsed as XML, but it doesn't work out that way. Many normal tags in .html are not XML compliant, so most XML parsers break down right away, including the NSXMLParser included in iOS. Parsing .html is not an uncommon problem, so there are a number of libraries on GitHub that people have used. I was able to combine the Hpple parser library with ReactiveCocoa to create a pipeline straight from the .html response to my model objects.
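
To make that concrete, here's a minimal sketch of fetching and querying the page with TFHpple. The method name is made up, and the XPath query is a simplified stand-in for the real walk down the DOM path sketched above:

#import "TFHpple.h"

// Hypothetical helper: fetch the TEMIS page for a lat/lon and log the table cells.
- (void)logOzoneTableForLatitude:(double)lat longitude:(double)lon {
    NSString *urlString = [NSString stringWithFormat:
        @"http://www.temis.nl/uvradiation/nrt/uvindex.php?lon=%.2f&lat=%.2f", lon, lat];
    // synchronous fetch, just for illustration
    NSData *htmlData = [NSData dataWithContentsOfURL:[NSURL URLWithString:urlString]];

    TFHpple *doc = [TFHpple hppleWithHTMLData:htmlData];
    // simplified query: grab every table cell; the real code follows the nested-table path above
    NSArray *cells = [doc searchWithXPathQuery:@"//table//td"];
    for (TFHppleElement *cell in cells) {
        NSLog(@"%@", [cell text]);   // the date, UV index, and ozone values appear among these
    }
}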

There are still a few wrinkles that could use ironing. When dealing with locations, time zones, and solar data, times and dates become difficult to handle. The native date handling in iOS doesn't make it easier. In particular, the UV index published by TEMIS is for solar noon at the lat/lon of the location. There is no accurate way to use only iOS internals to capture this date correctly, particularly when daylight saving time is taken into account in different jurisdictions. Astronomy calculations are needed to assign these values to correct times.


Lessons

DateFormats:

'HH' parses 24-hour time; 'hh' parses 12-hour (am/pm) time. 'hh' won't parse 17:20.
'YYYY' doesn't mean 2014; it's the week-based year. For normal dates you need 'yyyy'.
The full spec for iOS 7 is at unicode.org
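
A quick illustration of both gotchas (the en_US_POSIX locale is Apple's recommendation for fixed-format dates):

NSDateFormatter *formatter = [[NSDateFormatter alloc] init];
formatter.locale = [[NSLocale alloc] initWithLocaleIdentifier:@"en_US_POSIX"]; // ignore the user's 12/24-hour setting

formatter.dateFormat = @"yyyy-MM-dd HH:mm";   // calendar year + 24-hour clock
NSDate *good = [formatter dateFromString:@"2014-06-12 17:20"];   // parses

formatter.dateFormat = @"yyyy-MM-dd hh:mm";   // 12-hour clock
NSDate *bad = [formatter dateFromString:@"2014-06-12 17:20"];    // nil -- 17 isn't a valid 12-hour value

formatter.dateFormat = @"YYYY-MM-dd HH:mm";   // 'YYYY' is the week-based year: wrong around New Year
NSLog(@"good: %@, bad: %@", good, bad);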

ReactiveCocoa:

NSLog is useful for getting info on intermediate stages.
There are some mysteries about what filter: and ignore: do that I should get a grip on.
There is a lot of potential in this library, and many functions to explore. map: is your friend. Use it flexibly.

There are many useful discussions in the issues on GitHub and on SO. One word of warning: this library is changing rapidly -- 2.0 was recently released and 3.0 is being crafted. The terminology is shifting, even for core ideas. Older posts may contain outdated code, and that's likely to change even faster with Apple's introduction of Swift!

  • Prefer RACSignal over RACSequence.
  • Prefer 1-way binding with RAC over 2-way binding (see the discussion of the issue proposing to drop RACChannel).
  • Avoid subscribeNext:, doError:, and doNext:. These break the functional paradigm. See this StackOverflow question on fetching a collection. Note: RACAble was replaced with RACObserve in 2.0. (A small binding example follows this list.)
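
For example, instead of subscribing and mutating state by hand, a one-way binding through map: looks roughly like this (the label and view-model properties are invented for illustration):

#import <ReactiveCocoa/ReactiveCocoa.h>

// One-way binding: the label text tracks the model's uvIndex, transformed on the way through.
- (void)bindUVLabel {
    RAC(self.uvLabel, text) = [RACObserve(self.viewModel, uvIndex)
        map:^id(NSNumber *uvIndex) {
            return [NSString stringWithFormat:@"UV index: %.1f", uvIndex.floatValue];
        }];
}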

I'm finding it somewhat difficult to chain my signals together in the correct way, probably because I have some processing to do with different parts of the ozone signal.

Here's a helpful bit of code from techsfo.com/blog/2013/08 for managing nested asynchronous network calls in which one call depends on the results of the previous. Note: this is from August, so before RAC 2.0. I believe the weakSelf pattern is now handled with the @weakify and @strongify macros (a variant using them is sketched after the block).


__weak id weakSelf = self;
[[[[[self signalGetNetworkStep1] flattenMap:^RACStream *(id x) {
   // perform your custom business logic
   return [weakSelf signalGetNetworkStep2:x];
}] flattenMap:^RACStream *(id y) {
   // perform additional business logic
   return [weakSelf signalGetNetworkStep3:y];
}] flattenMap:^RACStream *(id z) {
   // more business logic
   return [weakSelf signalGetNetworkStep4:z];
}] subscribeNext:^(id w) {
   // last business logic
}];
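
For what it's worth, here's roughly how the start of that chain looks with @weakify/@strongify (the macros come from libextobjc; ReactiveCocoa 2.x bundles equivalents). This is my adaptation, not code from the original post:

#import "EXTScope.h"   // from libextobjc; ReactiveCocoa 2.x ships equivalent macros

@weakify(self);
[[[self signalGetNetworkStep1] flattenMap:^RACStream *(id x) {
    @strongify(self);                       // a strong reference again, scoped to this block
    return [self signalGetNetworkStep2:x];
}] subscribeNext:^(id y) {
    // handle the result
}];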

Unit Testing

The code coverage isn't good, but I did create some logic tests for my astronomy code using XCTest. My test class wouldn't let me call any private methods in the class I was testing, and I didn't want to move those function definitions into the public interface. Luckily I found another way: create a category in your test class (Stack Overflow).

// category used to test private methods
@interface OWSolarWrapper (Test)

// private functions from OWSolarWrapper.m
// date functions
-(NSNumber *) julianDateFor:(NSDate *)date;
-(NSNumber *) julianCenturyForJulianDate:(NSNumber *)julianDate;
-(NSNumber *) julianDateRelative2003For:(NSDate *)date;

// basic astronomy calcs
-(NSNumber *) equationOfTimeFor:(NSDate *)date;
-(NSDictionary *) solarParametersForDate:(NSDate *)date;

@end
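
With the category declared in the test file, the tests can call the private methods directly. A minimal example, assuming julianDateFor: returns the standard astronomical Julian date (the accuracy threshold is arbitrary):

#import <XCTest/XCTest.h>
#import "OWSolarWrapper.h"

@interface OWSolarWrapperTests : XCTestCase
@end

@implementation OWSolarWrapperTests

- (void)testJulianDateForUnixEpoch {
    OWSolarWrapper *solar = [[OWSolarWrapper alloc] init];
    NSDate *epoch = [NSDate dateWithTimeIntervalSince1970:0];    // 1 Jan 1970, 00:00 UTC
    NSNumber *jd = [solar julianDateFor:epoch];                  // callable thanks to the category above
    XCTAssertEqualWithAccuracy(jd.doubleValue, 2440587.5, 1e-6); // Julian date of the Unix epoch
}

@end
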
Beware of floatValue:

[[NSNumber numberWithDouble:3.14159283] floatValue]
Write an extension so this raises an error! A float has only a 24-bit significand, so you get roughly 7 significant decimal digits and the rest is truncated. This introduced a bizarre rounding error in my astronomy code.
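
A quick illustration of the precision loss (the exact float digits may vary in the last place):

NSNumber *pi = [NSNumber numberWithDouble:3.14159283];
NSLog(@"as double: %.8f", [pi doubleValue]);   // 3.14159283
NSLog(@"as float:  %.8f", [pi floatValue]);    // roughly 3.14159274 -- only ~7 significant digits survive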

Location Testing:

Nice comment under the original tutorial:

If you want a specific location not included in xcode, you can create a gpx file for any location. You then import it into xcode to include it in your xcode location toggles.
        - marciokoko on March 4, 2014, 9:18 AM

But to calculate solar positions, I really need time zone information along with my testing locations, which .gpx doesn't include.
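
For reference, a minimal .gpx waypoint file that Xcode will accept looks something like this (the coordinates are just the lat/lon from the TEMIS example above; the name is arbitrary):

<?xml version="1.0" encoding="UTF-8"?>
<gpx version="1.1" creator="OzoneWeather">
    <!-- one waypoint = one simulated location; note there is no time zone field anywhere -->
    <wpt lat="52.10" lon="5.18">
        <name>TestLocation</name>
    </wpt>
</gpx>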

TODO

  • use different sky images for the background to reflect the weather prediction.
  • The original background of the table view had a blur filter attached. This was accomplished with a library, but I think something similar is possible with CALayer. Might be good to explore a CALayer filter on the table-view cells or the underlying scrollview. The filter would respond to scrollview position.
  • UV information is currently encoded only as uvIndex ranges giving a color on the icon background. That's a start, but I'd like to do some calculation and determine two kinds of risk:
    1. overall intensity -- risk of acute sunburn, and
    2. relatively high ratios of UVA at high intensity -- risk of deep damage that is harder for the body to repair.

    To do this, I need to work more at combining the weather signals, possibly changing the model significantly. Thank goodness for git branches!

Friday 4 April 2014

Course Review: Data Analysis by Jeff Leek

I took Data Analysis on Coursera in the spring of 2013. It was offered again in October 2013, but it may have been supplanted by the new Data Science specialisation, which starts in just a couple of days.

tl;dr: 5/5, don't miss out.

Overview:

This is a very well-run course that builds on Roger Peng's Computing for Data Analysis (in R) course. Professor Leek does a very good job of introducing tools and habits for 'reproducible research', including suggesting ways to organize the data munging problem for reproducibility. Professors Peng and Leek, both at Johns Hopkins, are leaders in this area, and discuss the challenges and progress on their excellent blog, Simply Statistics. They are also at the core of the newly offered Data Science specialization on Coursera, which offers a nice series of courses that build on each other and give good detail on the various aspects of data analysis. It's also by far the most accessible data science study track available.

Reproducible research is very important for statistical science in general, and promises to be increasingly important over time. Consequently, this course is really a great opportunity. Professional programmers eventually learn the value of unit testing, documentation, and other approaches that make their code more usable. Professors Peng and Leek are at the forefront of developing the professional statistical analysis standards that will become the TDD + version control + issue tracking of data analysis. In other words: you must take this course if you are interested in professional-level data analysis. Really. Well, actually, you could take all nine courses…

The Data Analysis course is 8 weeks long. In the first offering, it had weekly lectures of about 2 hours as well as weekly homework assignments that could add up to several hours. As usual, the Coursera recommendation of 3-5 hours / week is on the low side for most students with little experience. There are also two longer data analysis assignments with real (web-based) data, and the due date for the first analysis falls just after many of the better statistical analysis tools are introduced in lecture. Both analyses are peer-graded by 5 other students, and the mean of the central 3 scores is used. Peer grading may change in different offerings of the course, as this is one area where Jeff Leek was not sure how the course should be organised. There's a full discussion of this in a Simply Statistics podcast. Also, as one might expect from statisticians, there's data: an interactive graph of completion rates for various online programs. This course had a completion rate of 5.4%, and many of the students active on the forums had done a lot of data analysis before.

Recommendations:

Make sure to set aside time for the data analysis assignments. These will easily take 8-10 hours and could take up to 20 depending on how obsessive you become. The data analysis grade is based on a peer-review, so your reviewers may not understand your analysis, even if it is perfectly correct. There was grousing about this on the forums, but I think the point is clear: the peer graders are not experts in data analysis, and you should never write a report aimed at experts in data analysis, so take it into account when you are writing the report. I had trouble with this on the second data analysis. It involved classifying accelerometer data by action performed (walk, run, walk up stairs, etc). The data had been expanded into a set of features including some filtering, Fourier analysis and other transformations. In my analysis, I used the language of signal processing and sampling to describe this feature set, but this language was not familiar to the peer reviewers, so they didn't realise that I was describing the spacing and number of data points very precisely.

Friday 28 February 2014

Intro to GitHub (for scientists)

The problem

I talked to a young woman yesterday who is a bio-engineering postdoc at Stanford. She has some code that she'd like to 'upload' to GitHub. She admitted that, well, actually, she hadn't managed to get any of her code onto her GitHub account yet, and she looked so overwhelmed and dejected that I felt bad. I know that frustration.
So here is yet another blog post to try and help with the learning curve that is git. I'm writing this for the scientist who has written some code and wants to share it on GitHub. Most scientists write code in order to accomplish a particular task, and thus are not familiar with professional programming practices, including writing documentation, unit testing, SCM (source control management) and version control. There are great benefits to learning these techniques, and their usefulness is becoming more and more apparent to academics as research relies more and more on computer programs. In fact, the table of contents for the journal Nature Methods just arrived in my inbox with a lead editorial on reproducible research. I quote:

Nature Methods strongly encourages researchers to take advantage of the opportunity that code repositories, such as GitHub, provide to improve a software tool before submission. Even if others do not examine and test the code, the act of preparing the code and necessary documentation for deposit and use by others will help avoid delays in publication.
In short, if you are coding and publishing work derived from your code, the process of uploading your code for collaboration will help bring it up to a publishable quality. And yes, it's even more important than making the figures pretty.

GitHub == collaboration

Firstly, if you just want to upload some code, you need to take a step back, a deep breath, and 'think different'. There is no 'upload' button on GitHub for a good reason. It is not built for uploading code and leaving it to rot on a server, but for fostering collaboration between programmers. Thus, in order to share your code on GitHub, you first need to get it ready for collaboration. To do that, you need to set up version control. This is somewhat more complicated than finding the 'track changes' setting in Microsoft Word, but it is also far more useful.
As you try to think different, remember that GitHub was built with a particular set of workflows in mind. Those workflows have to do with managing a code-base that is constantly being updated by multiple contributors. GitHub tries to ease the difficulties of this type of collaboration by bringing several things together.
  • version control -- similar to 'track changes' in a word processor, but for whole projects
  • cloud-based storage -- simultaneously backup and share your code
  • user accounts -- keep it private, share with a team or make it public as your needs change
  • social -- there's messaging, so you can talk, argue, document, discuss.... collaborate!
But version control is the main thing that sets GitHub apart from any number of social sharing platforms, and makes it so powerful for people who code. To use GitHub effectively you therefore need to understand the basics of version control with git. It's hairy and scary at first, but no worse than... well, ok, it is worse than a lot of things, but sometimes a learning curve is the price you pay for really capable tools. So, put a set of bookmarks in your browser, make a cheatsheet and keep it handy. A local cheatsheet with a good searchable title does wonders. If you're thinking 'Oh, I'll just use it once' or 'I'll remember', well, git is for professionals. Are you a professional?

Git == version control

Version control is to 'track changes' for a text document as Superman is to Tarzan, as Microsoft Word is to TextEdit, as New York City is to Detroit. Firstly, with version control, you are tracking changes to an entire project, not a single document. Changes are tracked line by line, with comments and attribution, through time as the project grows, changes, and splits; as subroutines develop into full projects of their own; as new owners take control of the code base. It is flexible and thorough and reliable. You should learn to use it.
Use case: You've got some code you developed for your thesis that you want to upload to GitHub. Maybe someone else will find it useful. There are GUI front ends to git, which may help with many tasks, but git was designed to be run from the command line, and this simple use case is not too difficult to master at that level, so let's just go for it.

Download and install git

Walk through the steps at github set-up. Today, they suggest that you download their native app, but note that this only manages part of the workflow. The steps you have to do to get git working still involve the terminal. Be brave. The steps are:
  1. get a GitHub account
  2. download and install git
  3. link your local system to your GitHub account
    • tell git your name, email, GitHub login information
    • set up security keys so that GitHub knows you are you

Prepare your codebase

There are a couple of adjustments you probably want to make before releasing your code in the wild. GitHub recommends that all code comes with a License, a Readme and a .gitignore file.
  • License
    If your code comes with a license, it's easier for collaborators to re-use the code and build on it. Specifically, it makes clear what they are allowed to do. It may not be important to you, but your code will be easier to share if you make it clear what the rules are.
  • Readme
    You have to explain your work at some point. If you do it in a file named README, GitHub will automatically put it on the front page of the repository. This is very helpful for anyone trying to understand what you did and why. The README can be just a text file. If it is in markdown, perhaps with additional GitHub flavoring, GitHub will render it with headings and styles, which is much nicer for the reader and not difficult for the writer.
  • .gitignore
    Some files don't need to be tracked. For instance, some old Mac directories contain .DS_Store files with directory display information for the Finder app. That doesn't need to be part of the repository. So here's my .gitignore for an old MatLab project:
      $cat .gitignore
      .DS_Store 
    
    Pretty simple. You might also add *.log or tmp/ to the .gitignore file, depending on your context. Basically, any files that are automatically generated or updated on compile should not be tracked.

Make a local repository

The project you want to get onto GitHub probably consists of a directory or directory tree containing a series of text files, and possibly some image files or data files. In order for git to track changes to this project, you have to put these files into a repository. This is simple to do once you've prepared it for sharing.
Find your command line. On a Mac, you can use the Terminal app. Navigate to the directory holding your project. If you have never ever used the command line before, this might be challenging. If you want to dive in, you can certainly do so with three little commands: ls, cd, pwd. You can look at the man pages for these commands by typing, for example $man ls, or you could try a crash course in using the command line.
Once you get to your project directory, type:
$git init   
You should see a message from git, something like:
Initialized empty git repository in /Users/suz/programming/octave/OrX/.git/
This initialises an empty repository, which lives in a hidden directory named .git. You can check that it's there by typing
$ls -a
Git gives some feedback about what it has done, but I often find it useful to check with
$git status
after each command to see what has happened.
Now we can get the project into the repository. To do this, first type
$git add .   
This prepares git to add your files to the repository, a process known as 'staging'.
Note: The . tells git to stage all the files in the directory tree to the repository. This is very handy when we add files because they don't all have to be specified by name. On the other hand, it isn't ideal, because there are often binary files or .log or even image files that update automatically. You won't want to keep track of those changes. Fortunately, git will automatically look for the .gitignore file we already prepared to get the list of exceptions.

Ready for some commitment? Type:
$git commit -m 'initial commit' 
You should get a full list of the files that git is committing to the repository. Check the status when it's done and you should get a reassuring message:
# On branch master
nothing to commit (working directory clean)

Success! Now your project is actually in the repository and git can track any changes to the files. The repository will also keep all the messages that you put with each commit. Always use the -m flag and take the time to add a meaningful message.

Congratulations! Your code is now in a git repository, under version control. You are ready to collaborate.

Share your work


Make a repository on GitHub

  1. Log into your GitHub account
  2. Create a new repository on GitHub
    On your profile page, in the 'Repositories' tab is a list of repositories that you've contributed to. At the upper right should be a bright green 'New' button.
  3. Follow the directions in the form, adding an informative description so others know what treasure they have found.
Congratulations! You have a GitHub repository to share your code from!

Link the repositories

In git terminology, the current state of your code is the head of a branch. By default, the branch holding your initial commit is called 'master'. You can make other local branches, and probably should to try out new features. You can also make remote branches. At this point, your new GitHub repository is essentially an empty remote branch. By custom, this remote will be referred to as 'origin'. To point git to it, type (on one line), with your appropriate username and project title:

$git remote add origin https://github.com/username/project.git

This command translates roughly as "Dear git; Please be so kind as to add a connection to a remote repository. I will be referring to the remote repository as 'origin' in our future correspondence. It can be found at https://...... Thank you for your kind assistance in this matter. Sincerely, yours truly, esq."

Upload your code

Ok, ok, there is no upload on GitHub, but it is payoff time. Once you have a local repository linked to a remote repository, you can just push the code from one to the other.

$ git push -u origin master

Translation: "Dear git; Please push the recent changes I committed to my local repository, known as master, into the remote repository known as origin. Also, please be aware that I'd like this to be the default remote repository, sometimes referred to as 'upstream'. Thank you again for your kind assistance. I am forever in your debt. Sincerely, Yours truly, esq. and, etc."

Success!!! You have now successfully pushed your code to GitHub.

Or at least I hope you were successful. If not, if you've tried to follow this post and the directions at GitHub and you still feel lost, there is more help out there. Many universities are running Software Carpentry bootcamps to help students and faculty develop more professional programming skills. The skills taught aim to improve software collaboration and impart the skills needed to carry out reproducible research. Two key tools they teach are version control with git and collaboration via GitHub.
Live long and collaborate!

Wednesday 29 January 2014

UV reactive bead spectra


Background

I was interested in designing some educational experiments using UV-reactive beads to teach about the presence/absence and intensity of UV light under different atmospheric conditions. The beads turn from clear or colorless to various bright colors when exposed to UV light. Unfortunately, these beads are just too sensitive -- they even react strongly to the stray 400 nm light coming through glass windows, and they reach full color intensity within a minute, even at 51° N in February! It's fun to watch, but not very useful for detecting physiological UV conditions for vitamin D synthesis.

I don't have a lot of scientific information about these beads, so the range of UV light needed for the color change is not well characterized, and the resulting absorbance spectra of the beads (related to their apparent colors) are not readily available. I decided to study them with a reflectance spectrophotometer to see what I could learn.

It may not be useful for the educational project I had in mind, but I'll write it up here anyway, along with the R-code used to analyze the data….

Experimental

I ordered UV Reactive beads from UV gear, UK.

I used a table in the garden on a sunny day for the experiments. The sky was open directly above and direct sunlight came from the south. There were a few leafless tree branches in the way, and some buildings nearby, so it was not full sky exposure, but at least there was full direct sunlight. The table was covered with a small white blanket to provide a consistent background. The beads equilibrated 10-15 min in direct sunlight before spectral measurements were made.

I used an Ocean Optics model USB2000+ spectrometer connected to a 1 meter UV-Vis fiber optic cable (part number QP400-1-UV-VIS from Ocean Optics) with a cosine correction filter at the end (Ocean Optics part CC-3-UV-S) to minimize stray light. The spectrometer was connected by USB to a MacBook Pro (OS X mountain lion 10.8.1) running SpectraSuite software for data collection. Spectra were measured with 30 averages in reflectance mode and saved as tab-delimited text files. The reflectance measurements required a 'reference spectrum' (light spectrum) of the white blanket and a second, 'dark' reference, which was obtained by blocking the end of the cosine correction filter. To collect the reflectance spectra, I pointed the fiber optic light pipe at individual beads from a distance of < 0.5 cm.

Spectral analysis can be done in the SpectraSuite software; however, many basic functions had not (yet?) been implemented in the OS X version of the software. Also, I wanted to practice my R skills, so I decided to load the data into R and see what I could learn. R has multiple spectroscopy packages, and in fact a new one has come out since I started this project. I looked into using hyperSpec, but the data structure seemed overly complicated for the simple analysis I had in mind. So what follows is my simple R analysis of reflectance data on sunlight-exposed UV beads.

Load in the Data

# change working directory
try(setwd("/Users/suz/Documents/vitD Schools/UV bead exp 19022013/spectrometer data/"))

# data is tab-separated, with a header. The end of file wasn't recognized
# by R automatically, but a read function can specify the number of rows
# to read.
spec.read <- function(spec.name) {
    read.table(spec.name, sep = "\t", skip = 17, nrows = 2048)
}

# the function can then read in all the files ending in '.txt', put them
# in a list.
spec.files <- list.files(pattern = ".txt")
df.list <- lapply(spec.files, spec.read)

# convert the list to a matrix: the first column is the wavelengths and
# the other columns are experimental data -- one reflectance measurement
# at each wavelength.
spec.mat <- matrix(df.list[[1]][, 1], nrow = 2048, ncol = (1 + length(spec.files)))
spec.mat[, 2:11] <- sapply(df.list, "[[", 2)

matplot(spec.mat[, 1], spec.mat[, 2:11], type = "l", lty = 1, lwd = 3, xlab = "wavelength (nm)", 
    ylab = "reflectance", main = "UV color changing beads, after UV exposure")
text(500, 25000, "Raw Data")

Looks like we need to do some clean-up!

Clean up the data

Baselines and edges

At the edges of the spectral range, the reflectance data is dominated by noise, so it isn't useful. The baselines for the different spectra also need aligning, and we'll scale them to the same intensity for comparison. The intensity range observed in the data depends somewhat on the angle of the probe, even with a cosine filter in place.

# define terms for the processing:
nm.min <- 400
nm.max <- 800  # edges of the displayed spectrum
base.min <- 720
base.max <- 900  # define baseline correction range
peak.range.min <- 420
peak.range.max <- 680  # where to find peaks for scaling.

# remove ends of the data range that consist of noise
spec.mat <- spec.mat[(spec.mat[, 1] > nm.min) & (spec.mat[, 1] < nm.max), ]

# normalize baselines, set baseline range = 0.
spec.base <- colMeans(spec.mat[(spec.mat[, 1] > base.min) & (spec.mat[, 1] < 
    base.max), ])
spec.base[1] <- 0  # don't shift the wavelengths
spec.mat <- scale(spec.mat, center = spec.base, scale = FALSE)

Choose colors for the plot by relating the file names to R's built-in color names.

bead.col <- sapply(strsplit(spec.files, " bead"), "[[", 1)
# replace un-recognized colors with r-recognized colors (see 'colors()')
bead.col <- gsub("darker pink", "magenta", bead.col)
bead.col <- gsub("dk ", "dark", bead.col)
bead.col <- gsub("lt ", "light", bead.col)
bead.col <- gsub("lighter ", "light", bead.col)

# plot corrected data
matplot(spec.mat[, 1], spec.mat[, 2:11], type = "l", lty = 1, lwd = 3, col = bead.col, 
    xlab = "wavelength (nm)", ylab = "reflectance", main = "UV color changing beads, after UV exposure")
text(500, 10, "Baseline Corrected")

From this plot, we can see that the lighter colored beads have smaller peaks than the darker beads. The lighter color probably represents less dye in the beads. There seems to be a lot of variation in the peak intensity of some beads, particularly the yellow beads and the dark blue beads. Based on the width of the peaks, the purple and magenta beads appear to have mixtures of dyes for both pink and blue colors. The dark blue beads appear to be either a mixture of all the dye colors or a mixture of pink and blue, but much more dye is used than for the paler pink or blue beads. The yellow bead spectra are oddly shaped on the short wavelength side, probably due to instrument cutoffs around 400 nm.

Scaling and smoothing

Now I'll scale the data to the same range. It turns out that the R command 'scale' is perfect for this.

# scale the peaks based on the min reflected intensity
peak.range <- which((spec.mat[, 1] > peak.range.min) & (spec.mat[, 1] < peak.range.max))
spec.min <- apply(spec.mat[peak.range, ], 2, min)
spec.min[1] <- 1  # don't scale the wavelengths
spec.mat <- scale(spec.mat, center = FALSE, scale = abs(spec.min))

The spectra are also jittery due to noise. This can be removed by filtering; the code below applies a 10-point running mean.

data.mat <- spec.mat[, 2:11]
dataf.mat <- apply(data.mat, 2, filter, rep(1, 10))
dataf.mat <- dataf.mat/10
specf.mat <- matrix(c(spec.mat[, 1], dataf.mat), nrow = dim(spec.mat)[1], ncol = dim(spec.mat)[2], 
    byrow = FALSE)
matplot(specf.mat[, 1], specf.mat[, 2:11], type = "l", lty = 1, lwd = 3, col = bead.col, 
    xlab = "wavelength (nm)", ylab = "reflectance", main = "UV color changing beads, after UV exposure")
text(500, 0.2, "Scaled and Smoothed")

I could attempt to prettify the spectra further by using actual colors from pictures of the beads. It looks like the Bioconductor project has a package 'EBImage' that should be just what I want, but it looks like I need to update to R version 3.0.1 in order to run it. So I guess I'll get some spot colors from ImageJ.

bead.yellow <- rgb(185, 155, 85, maxColorValue = 255)
bead.orange <- rgb(208, 155, 82, maxColorValue = 255)
bead.purple <- rgb(110, 25, 125, maxColorValue = 255)
bead.pink <- rgb(190, 100, 120, maxColorValue = 255)
bead.dkblu <- rgb(22, 18, 120, maxColorValue = 255)
bead.ltblu <- rgb(130, 135, 157, maxColorValue = 255)
bead.dkpink <- rgb(200, 25, 140, maxColorValue = 255)

bead.col2 <- c(bead.dkpink, bead.dkblu, bead.dkblu, bead.pink, bead.ltblu, bead.orange, 
    bead.pink, bead.purple, bead.yellow, bead.yellow)

matplot(specf.mat[, 1], specf.mat[, 2:11], type = "l", lty = 1, lwd = 3, col = bead.col2, 
    xlab = "wavelength (nm)", ylab = "reflectance", main = "UV color changing beads, after UV exposure")
text(500, 0.2, "Colors from Photo")

Analysis

I can carry this analysis further by quantifying the wavelength of the peak absorbance and the peak width (usually Full Width at Half Maximum -- FWHM) for each spectrum. This could be useful in further analyses, reports, or as a feature in a machine learning approach.

To extract this information, I could try to fit a series of gaussians to the peak, representing the fraction of pink, blue or yellow dye present, but the quality of the data doesn't really justify this, particularly as I don't have a good shape for the yellow absorbance peak or adequate reference data for each of the dyes. As a quick and dirty method, I'll take the median position of the data values that are at 95% of peak. That should give something in the center of the peak. Since the peaks have all been scaled, that corresponds to the center of the data values < -0.95.

Likewise, the usual peak width (FWHM) would be the range of values < -0.5; however, the poor baseline at shorter wavelengths makes this range more or less unusable. For this reason, we can take a better-behaved approximate peak width measurement as the range of reflectance values < -0.8.

# approximate peak wavelength
indcs <- apply(specf.mat[, 2:11], 2, function(x) median(which(x <= (-0.95))))
pks <- specf.mat[indcs, 1]

# approximate peak width
lowidx <- apply(specf.mat[, 2:11], 2, function(x) min(which(x <= (-0.8))))
highidx <- apply(specf.mat[, 2:11], 2, function(x) max(which(x <= (-0.8))))

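# average nm per data point (estimated over 100 points) times the index range gives the width in nm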
pkwidth80 <- (specf.mat[240, 1] - specf.mat[140, 1])/100 * (highidx - lowidx)

features <- data.frame(pks, pkwidth80, bead.col, bead.col2)
features <- features[order(pks), ]
features
##      pks pkwidth80  bead.col bead.col2
## 10 436.4     75.21    yellow   #B99B55
## 6  449.8     88.86    orange   #D09B52
## 9  451.7     98.07    yellow   #B99B55
## 7  529.3    111.35      pink   #BE6478
## 4  533.6     99.18 lightpink   #BE6478
## 1  549.8    110.61   magenta   #C8198C
## 8  568.7    133.10    purple   #6E197D
## 3  581.1    179.93  darkblue   #161278
## 2  585.4    136.79  darkblue   #161278
## 5  603.0     76.32 lightblue   #82879D

# save your work!
write.csv(features, "UVbead data features.csv", quote = TRUE)

One side effect of this is that we now have a small table of metrics that describe the relatively large original data set reasonably accurately and could be used to classify new data. In current data analytics parlance, this is known as 'feature extraction'. In traditional science, these characterizing features could be combined with others from different studies (crystallography, electrochemistry, UV-vis, IR, Raman, NMR, …) to help predict the effects of a chemical change or a different chemical environment on the behaviour of the molecule. Such studies are traditionally used to help direct synthetic chemists toward better products.

Conclusions

From this analysis, we can see that the blue beads are absorbing at longer wavelengths than the yellow, and the pink and purple beads have absorbances in between. The darker colors have broader absorbance peaks than the lighter colors, with the darkest blue having a range that appears to cover the purple, yellow and blue regions. These darker beads probably contain combinations of the dyes used for the different colors, and not just more of the dye used for the light blue beads.

The reflectance spectra are not as discriminating as our eyes. For instance, the yellow and orange beads are readily distinguished by eye, but not so clearly in the observed spectra or the extracted features. This may be due to the 400 nm cutoff of the light pipe, which distorts the peak shapes for the yellow and orange beads. Ideally, we could observe the changes in the absorption spectra over the whole 280-800 nm range as a function of time. The current reflectance spectrometer setup, however, is only capable of capturing the 400-750 nm range. This means that we do not have access to the interesting behavior of these dyes at shorter wavelengths.

Chemically, I expect that absorption of the UV light in the 300-360 nm range causes a reversible conformation change in the dye molecules, as has long been known for the azobenzenes and their derivatives. After the conformation change, the absorption maximum of the dye is shifted to a much longer wavelength. If the UV dyes in these beads are very closely related to each other, it is likely that the yellow beads absorb at relatively short UV wavelengths and the blue beads at longer wavelengths before the transition; however, without measuring the UV absorption spectra we cannot know this. It is quite possible that their UV absorption spectra are nearly indistinguishable and the effects of their chemical differences are only apparent in the visible spectra. Without access to better hardware or more information about the molecules involved, there is no way to know.

Next steps

We do have access to one thing that varies in the appropriate way. The experiments shown here were taken at relatively low UV light levels (February in England). During the summer, the sun's elevation varies over a much larger range: at dawn it is at the horizon, while at noon it is about 60° higher. Since light scattering in the atmosphere is a strong function of wavelength, the relative intensities of light at 300, 360, and 400 nm will vary with the angle of the sun. If the beads have different absorption peaks in the UV range, the time dependence of their color changes should vary through the day.

The simplest way to measure this is not with a spectrometer, but probably by following color changes in video images.
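
As a rough sketch of how that could work in R, assuming the video has been exported to numbered JPEG frames and using a made-up crop region for a single bead:

# Sketch: track the color of one bead over time from exported video frames.
# Assumes frames saved as frame001.jpg, frame002.jpg, ... and that the bead sits
# inside the pixel region rows 100:140, columns 200:240 of each frame.
library(jpeg)

frame.files <- list.files(pattern = "^frame.*\\.jpg$")

bead.rgb <- t(sapply(frame.files, function(f) {
    img <- readJPEG(f)                  # array: height x width x 3 (R, G, B in 0..1)
    region <- img[100:140, 200:240, ]   # crop to the bead
    apply(region, 3, mean)              # mean R, G, B over the crop
}))
colnames(bead.rgb) <- c("R", "G", "B")

# plot each channel against frame number to watch the color develop over time
matplot(seq_len(nrow(bead.rgb)), bead.rgb, type = "l", lty = 1, lwd = 2,
    col = c("red", "green", "blue"), xlab = "frame number",
    ylab = "mean channel intensity", main = "Bead color change over time")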