Care.data

If you’re in the UK at all, you may have heard some discussion of something called care.data. The general idea is that all healthcare data is centralised and that this repository would be made available to researchers. Such a repository would be massively useful for healthcare researchers.

So far so good. As someone with a great deal of interest in data, and how it can be best used to advance human society, you’d think I’d be wild about this idea. I’m not wild about the implementation and this is a pity.

The data, we are told, will be pseudonymised. This is the number one problem I have with it – it’s not actually properly anonymised. It comes with postcode data and NHS number. In the UK, postcode data can in a lot of cases be personally identifiable. This is wrong.

This is before you start asking questions about who gets to use the data. Plus, given the changes to the NHS organisation in the UK courtesy of the current government, you’d have to ask whether the data is even going to be as useful as it might have been 10 years ago under a centralised system.

So okay, I can knock it and be concerned. But I do believe something akin to it would be useful. Not necessarily directly profitable, but useful. So how could we implement it?

Well, there’s no reason why we can’t ask, straight out, why postcode is relevant. It provides regional variation information. So one of the things we need to do is provide geographically classified data. Using postcodes to create a geographic classification which does not include the postcode itself is, or should be, straightforward enough. Ergo, the postcode issue can be dealt with.
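As a rough sketch of what I mean, assuming the postcode area (the one or two leading letters) is a coarse enough classification: the function names and record layout here are my own invention, and a real system would more likely map postcodes to health regions or census areas and check that no area holds too few people.

```python
import re


def postcode_area(postcode: str) -> str:
    """Return the postcode area: the one or two letters at the start."""
    cleaned = postcode.strip().upper()
    match = re.match(r"[A-Z]{1,2}", cleaned)
    if match is None:
        raise ValueError(f"not a recognisable postcode: {postcode!r}")
    return match.group(0)


def generalise(record: dict) -> dict:
    """Copy a record, swapping the full postcode for its area only."""
    # Hypothetical record layout: a dict with a "postcode" field.
    out = {key: value for key, value in record.items() if key != "postcode"}
    out["area"] = postcode_area(record["postcode"])
    return out
```

The full postcode never leaves the function that generalises the record, which is the point: researchers still get regional variation, just not the street.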

The NHS number can be replaced with a different primary key. A conversion table mapping that key back to the NHS number would exist alongside the original data, but would not be made available as part of the care.data database. Again, depending on the actual implementation of the data structures, this should be straightforward.
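A minimal sketch of that idea, assuming records arrive as dictionaries with an identifier field. The field names and the choice of random hex keys are mine, not anything taken from the actual care.data design.

```python
import secrets


def pseudonymise(records, id_field="nhs_number"):
    """Split records into a research copy and a separately held lookup table."""
    lookup = {}    # original identifier -> surrogate key; never released
    research = []  # what researchers would actually see
    for record in records:
        original = record[id_field]
        if original not in lookup:
            # Random key, so it cannot be reversed without the lookup table.
            lookup[original] = secrets.token_hex(8)
        clean = {k: v for k, v in record.items() if k != id_field}
        clean["patient_key"] = lookup[original]
        research.append(clean)
    return research, lookup
```

The same patient keeps the same key across records, so researchers can still follow a patient's history, while only whoever holds the lookup table can get back to the real number.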

This deals with the data privacy side of things and one of the big huge issues I have with the current idea.

After that, we need to be aware that more data doesn’t always make for better or more accurate detail. Large datasets can amplify statistical errors which, given we are talking about health data sets, matter a lot. They affect real people.

These errors are the type of errors where, for example, 1 in 100 cases might be misdiagnosed because a particular test isn’t 100% accurate.
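To put rough numbers on that, here is a small sketch. The 1 in 100 figure is the one from the example above; the prevalence and sensitivity values are purely illustrative assumptions, not real figures for any test.

```python
def expected_errors(n_tests: int, error_rate: float) -> float:
    """Expected number of wrong results among n_tests."""
    return n_tests * error_rate


def share_of_positives_that_are_wrong(prevalence: float,
                                      sensitivity: float,
                                      false_positive_rate: float) -> float:
    """Fraction of all positive results that belong to healthy people."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return false_positives / (true_positives + false_positives)


# A 1% error rate over a million test results is about ten thousand errors.
print(round(expected_errors(1_000_000, 0.01)))

# Illustrative assumption: a condition affecting 0.5% of people, a test that
# catches 99% of real cases and wrongly flags 1% of healthy people.
print(f"{share_of_positives_that_are_wrong(0.005, 0.99, 0.01):.0%}")
```

Under those assumed numbers, roughly two thirds of the positive results would be wrong, which is why a small error rate matters more, not less, once the dataset covers everybody.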

Ultimately, I’m strongly in favour of this project, or, more to the point a project like it, provided it comes with built in data protection concerns and is implemented to benefit health care rather than, for example, corporate health business interests. As matters stand, I’m inclined to feel that there are lacunae here at the moment.

 

Finding ships on the web

I had a text message from a friend yesterday. He was in Calais, waiting to make the Channel Crossing with P&O Ferries. The company had texted him earlier to warn him of disruption but information was otherwise thin on the ground.

In fact, yesterday, if you were a P&O customer looking for information about how your crossing was going to go, from them, your information was limited to phoning a number which was on their site. In this day and age, I think that’s crazy.

My background is in the aviation sector and I know, for example, that you can, for most flight companies, get flight information through their websites and through their mobile applications. Some companies will flag serious delays via their twitter feeds and updated news information on the website. Granted, not all of them. But to a great extent, there is a recognition there that technology could be used to enhance airline to passenger communications.

My friend was told his ship was likely to be 45 minutes late when he got to check in. There wasn’t any sign of a ship though.

I see things like this as a challenge. And unlike him, I happened to be sitting at a computer.

P&O’s most useful tweets amounted to “Your ship will be late, but check in on time anyway”. They have no useful operational information on their website. So I abandoned them and went for the shipping equivalent of FlightRadar24, a website called MarineTraffic. It’s a really nifty site and from it I could identify the two P&O ferries which were en route from Dover to Calais and made an educated guess which one he would be most likely to be getting on. From that, I could tell him how far away his ferry was and at approximately what time it would arrive in Calais.
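The sum behind that estimate is nothing more than distance divided by speed. A sketch, assuming you read the ship's position and speed off the map by hand: nothing here calls MarineTraffic, and the harbour coordinates in the usage note are approximate values of my own.

```python
import math

EARTH_RADIUS_NM = 3440.065  # mean Earth radius in nautical miles


def distance_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance in nautical miles (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(a))


def minutes_to_arrival(distance, speed_knots):
    """Minutes until arrival; a knot is one nautical mile per hour."""
    if speed_knots <= 0:
        raise ValueError("ship is not under way")
    return distance / speed_knots * 60
```

For instance, `distance_nm(51.13, 1.31, 50.97, 1.85)` gives a rough Dover to Calais distance, and a ship 5 nautical miles out doing 20 knots is `minutes_to_arrival(5, 20)` minutes away. It ignores harbour manoeuvring and queueing for a berth, so treat it as a lower bound.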

If you’re sitting on a quay in the wind and rain, and all you have to go on is your ship will probably be leaving 45 minutes late, and someone sends you a text and says “your ship is 15 minutes away from Calais at the moment”, this helps a lot. A few weeks ago, my friend did likewise for me when I was sitting in the gate area at Charles de Gaulle in Paris waiting to come back to Dublin; he checked what time the outbound flight from Dublin had left and we estimated what time the return would be leaving. People hate delays at the best of times, but delays with no concrete information, that’s really not great either.

There are two messages I take away from this.

  1. P&O could do a lot more to inform their passengers both via twitter and via their website. People don’t tend to want to call automated phone lines for operational information any more. It’s old technology and it’s largely been superseded.
  2. Very often, even though the information is not straightforwardly available, it can be found somewhere. I found it and passed it on, up to and including what the state of play was later with the number of ferries arriving in Dover at approximately the same time (there were three queued up at one point).

So what would I suggest? Well, P&O need to feed operational information regarding delays to their twitter feed. My guess is that they don’t want to do this in case people just show up late to check in, but people’s time is money. Perhaps we need to find a way to accommodate late check in when we know a ship is going to be late. And they probably need to feed that information onto their own website, although I have to say, I hope they are considering a website redesign. The site could do with it.

Marinetraffic.com has an iPad application which was very useful later when I was AFK. I’m not sure how it would behave on a phone but it definitely was better than the website is through a browser. It could well be worth the investment if you’re travelling by sea during the winter when delays and cancellations are more likely.

 

Food poverty in Ireland

The Journal website published a story on Food Poverty in Ireland which you can find here. Its main selling point is that the numbers apparently at risk of food poverty are shown on a map of Ireland. It was based on a release by Unite and Mandate and you can find a text from them on the subject here. It’s titled Hungry for Action – Mapping Food Poverty in Ireland.

The map, I suppose, is eye catching. But I don’t think it really works to communicate the problems. For me, the first issue is that the absolute figures are fairly meaningless. Yes, it’s horrible that 112,300 people are at risk of food poverty in Dublin, but if you take the food poverty figures and calculate them as a proportion of the population for each county, you realise that, proportionally, the risk of food poverty is lowest in Dublin.

foodpoverty

Dublin has the lowest proportion of its population at risk of food poverty when we split the population as a whole according to counties.
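The calculation is simple enough to sketch for the three counties whose figures are quoted in The Journal’s piece. The at-risk numbers are the report’s; the populations are rounded Census 2011 totals as I recall them, so treat them as assumptions (the report itself says it used 2010 population estimates).

```python
# At-risk figures as quoted from the Hungry for Action map.
at_risk = {"Dublin": 112_300, "Cork": 50_500, "Galway": 25_300}

# Assumption: rounded Census 2011 county populations, not the report's inputs.
population = {"Dublin": 1_273_000, "Cork": 519_000, "Galway": 251_000}


def rate_per_county(at_risk, population):
    """Share of each county's population estimated to be at risk."""
    return {county: at_risk[county] / population[county] for county in at_risk}


rates = rate_per_county(at_risk, population)
for county, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{county:8s} {rate:.1%}")
```

Even on these rough populations the ordering flips: Dublin has the biggest absolute number and the smallest share of the three.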

And that’s before you even look at how the figures are generated in the first place.

The most recent figures available are for 2010 and the figures which the Hungry for Action report provides are extrapolated from these figures:

‘Constructing a Food Poverty Indicator for Ireland’, a study published by the Department
of Social Protection, found that one-in-ten people experienced food poverty in 2010, or
approximately 457,000 people.

The following attempts to estimate the level of food poverty in each county. This is only
an approximation as the study does not provide this data. The approximation factors in
variations in income level (and assumes that income levels will alter the percentage in
food poverty) and 2010 population estimates. Therefore, these figures should be treated
as indicative.

So here’s one big problem already. We can’t really rely on the figures. They are based, to some extent, on guess work, and on an assumption relating to income levels. And they come with the following health warning: They are based on a government study from 2010 which did not break down its global figure according to county.

It should be noted that the above estimates are likely to be conservative.
As stated above, these estimates are based on 2010 data. In 2011 (the last year we
have data for) general deprivation rose by 8 percent. Further, the ESRI described
subsequent budgets as ‘regressive’.

What worries me here is that the figures, as provided in the map graphic, are akin to guess work. They are a breakdown of a national figure with no corresponding data available at county level to even base an extrapolation on. Put simply, we actually don’t know how many people are at risk of food poverty and certainly not split according to county. I’m not sure I agree with releasing a report like this given the caution and caveats with which the figures need to be approached.

However, if we leave that aside, I’m still unhappy with the use of a map to highlight this issue because it doesn’t tell you anything much in relative terms about which parts of the country are particularly affected. The piece by The Journal focuses on the absolute numbers:

The map shows Dublin fares worst with 112,300 people suffering food poverty. Larger counties like Cork and Galway follow close behind, with 50,500 and 25,300 people in need of assistance respectively.

The key problem with this is that – if we accept the figures as being in any way indicative, which I have doubts about – Dublin’s figure as a proportion of the population is actually the lowest. It’s highlighted in the graphic above. It’s scant comfort to anyone in that situation of course, but comparatively, the worst off counties are Kerry, Kilkenny, Longford, Monaghan, Donegal and Offaly. Dublin is actually the best off. In certain respects, the map doesn’t tell us the story of food poverty in Ireland in real terms.

So what am I saying here?

Well, a couple of things:

  1. Choice of graphic is very important. This one does not bring any more to the story I think than a table of figures would have. There’s no way by which you can seriously – and justifiably – compare what’s happening in Dublin with, say, what’s happening in Donegal.
  2. The figures are not based on any meaningful data. This is evident because they took a national figure from 2010 and extrapolated it out into county subdivisions using a weighting which is not made clear in the document provided.
  3. I went and plotted live register numbers per county as a proportion of population (NB, not as a proportion of available labour force because I don’t have those figures to hand at the moment) on an expectation that the shape of the graph would be broadly similar. It’s not.

Live Register Numbers

 

In other words, I don’t think the story is as simple as a map of the country with numbers on each county would have you think.

Some notes about input to this piece:

Am I saying that we can dismiss the issue of food poverty in Ireland, yerra it’s all grand really? No. I’m not saying that. What I think needs to be done is actual research into the area. I don’t know how you measure it. Maybe you talk to the food support charities, the soup kitchens, the soup banks and you get a sense of changes to demands on their infrastructure in a scientific way rather than relying on anecdote. When society has a problem, it’s best to know as much as possible about that problem rather than guessing the extent of it. Ireland is not a big country. 

What I do feel, however, is that this document, this piece which Mandate/Unite have pushed out is not the best way to do it. I’m not criticising the need to raise awareness of issues surrounding food poverty in a developed country. I’m just suggesting that using a bunch of figures which appear to have little underlying data is not really the best way to go about it. If they have data, it would be nice to see it.

Changes in Life expectancy in Ireland between 1926 and 2006

If everything had gone according to plan, you’d be looking at some initial work regarding the popularity of various baby names in Ireland, mainly because although the data for the UK is easy to get at, it’s also, by and large, not very interesting to play with. The figures for Ireland are not straightforward to haul out of the CSO’s website so instead, I decided to look at mortality rates, on the grounds that a comparison of 1926 and 2006 is a dream opportunity to display the information in the form of a slope graph.

You can find the data here and it was last updated in 2010. It might be interesting to see more recent figures if they are available but at this point, I’m not too bothered. The data is straightforward enough: it’s a 2×7 matrix, there is no cleaning to be done and no major analysis or fluttering around to be done with it. For this reason, I brought it into Excel, exploited these instructions here, and got slightly frustrated with various aspects of trying to do this in Excel (my next bid will be to have a look at options to do this in R, which has to be easier to mess around with, but I haven’t looked at whether ggplot2 does slopegraphs or not).

There’s a very simple story underlying these data: typically, in Ireland, life expectancy has increased a lot in the 80 years between 1926 and 2006.

As a general explanation of how the data works, the chart tells you on average, how many more years you are likely to live if you reach the age noted. For this reason, as you get older, the number of years you’re likely to live on from that point drops.

Okay.

Here’s the first graph:

 

Males

This shows changes in life expectancy for males in the period 1926 to 2006. We have only two time points so I can’t comment on odd variances in there – it is highly unlikely that the line of change is dead on straight.

Key take away points from this:

  • A baby boy born in 2006 is likely to live almost twenty years longer than a baby boy who was born in 1926.
  • The biggest increases in life expectancy are for the cohorts from birth up to somewhere between 35 and 55 years. After that, the gains are nowhere near as steep. Part of this is explained by major improvements in infant mortality.

Here is the corresponding graph for females:

Females

 

This graph is largely similar in shape to that for males – higher gains the younger you are, again linked to changes in infant mortality rates. However, females got a much bigger push at birth than males did over the period. In 1926, females could expect to live on average half a year longer than males from birth. In 2006, the difference was almost 5 years. This can probably be partially explained by changes in maternal mortality over the period.

What is interesting about this in social terms is that although women are likely to live longer than men, they are less likely to be adequately provided for pension wise because inter alia, they haven’t paid into professional plans as long or haven’t made as many social welfare contributions, linked to a) not having worked through marriage (Ireland had a marriage bar until sometime in the 1970s so that married women weren’t taking jobs from family men) b) having taken time off to have children.

I’m interested in obtaining similar data for other countries – anecdotally I seem to remember reading that life expectancy at birth for males in France in 1900 was around 35 years for example.

With respect to the actual graphing of the data, doing it in Excel is easy in some respects: you wind up with a graph that’s reasonably coherent scale wise. It’s just that prettifying it is a bit time consuming and not very fun. I brought it into Photoshop mainly to get it out as a decent enough jpeg. None of the relevant templates were exactly as I wanted so I need to look into building standard templates. I will, however, have a look at drawing these things in R.

In the meantime, for a later project I will look at sorting these out in Adobe Illustrator at some stage as well – it is frustrating not having access to simple things like rulers to line stuff up when you’re moving it around the plot.

Cloud versus local

I had an odd dialog come up in Microsoft Word this morning. The document I wanted to open, it said, had an issue syncing, and the SkyDrive and local copies of the document were different. I needed to choose which to retain.

I was not, it must be said, very happy about this, but chose local as I assumed that on the half dozen occasions I hit save last night, it saved locally first.

This turned out to be a mistake. The most recent local version was saved 3 hours before I finished work.

I was doing a lot of messing with dropping image files into that document last night so it was regularly saved. It was also saved when I shut up shop yesterday evening but none of those saves appeared to get written to local disc.

This is a huge problem for me. I have an always on connection so connectivity isn’t generally an issue. I’m the only person accessing the Skydrive, and I do it from two computers, both of which only I use. MS’s dialog told me another user had updated the document last night. That other user was me, on the same computer as I am using now.

I’m not going to complain bitterly about the problems this is causing me, suffice to say my day has suddenly become a whole lot worse than it was before I discovered this. But I do have to say this.

  • The dialog box, on telling me another user has updated the file, needs to tell me who that user is. I know in this case it was me, but in SkyDrive’s case, that’s not always going to be true. With shared documents, it’s almost guaranteed not to be me.
  • The dialog box, on telling me there’s an issue, needs at least to tell me which file is older. This really should be obvious to anyone.

I ran a completely unscientific straw poll this morning. On balance, more people expected the local copy to be more recent than the cloud copy with some comments about exceptions around documents stored in a browser. So I have to say, the assumption that the local file was the most recent was not particularly inane – it’s what most people expect.

I’m not sure what the problem is but the evidence I have right now is that it’s tied to something Microsoft have done between SkyDrive and Office. I only know this because the folder concerned included other files, not from MS applications, which did get saved locally and did get synchronised correctly.

Right now, I’m faced with replicating a whole pile of work which is not ideal. It’s only three hours and it’s write up and it’s possible it will take me significantly less time to do it as I have most of the output, or can get it very easily as I have the scripts generating it (and some of that will have to be done unfortunately).

The take away message from this is:

  • most people expect local versions of files to be updated before cloud versions, particularly if they are editing in locally installed software
  • if you’re telling them that their files are out of sync because another user has updated, you must tell them who that user is and you must give them the time stamps of both versions
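Neither of those is hard to do. A sketch of the sort of message I would have wanted, assuming the sync client already knows where each copy lives, when it was last saved and by whom. The `Version` structure and the wording are mine, not anything Microsoft exposes.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Version:
    location: str      # e.g. "local" or "cloud"
    modified: datetime  # time of the last successful save
    modified_by: str    # account or machine that made the save


def describe_conflict(a: Version, b: Version) -> str:
    """Say which copy is newer, when each was saved, and by whom."""
    newer, older = (a, b) if a.modified >= b.modified else (b, a)
    return (
        f"Newer: {newer.location} copy, saved {newer.modified:%Y-%m-%d %H:%M} "
        f"by {newer.modified_by}\n"
        f"Older: {older.location} copy, saved {older.modified:%Y-%m-%d %H:%M} "
        f"by {older.modified_by}"
    )
```

Two lines of text like that and I would have picked the right copy without having to guess.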

I find it hard to believe that this occurred to no one working on this in Microsoft.

I could live with the cloud version being the more recent version if I was told that it was. Instead, the utterly useless dialog box I got didn’t tell me this. I know the other user involved in this case was me, on the same computer, and I can’t see why MS’s dialog can’t communicate this.

SNCF Hackathon Transilien

SNCF, the French national rail company, ran an open data hackathon last weekend. I didn’t know about it in advance and anyway the schedule at the moment was such that there was no way I could have made it without a lot more notice but it struck me as interesting that they did it.

You can find the SNCF’s open data page here. Transport for London do something similar and I’ve seen some very interesting map projects come out of that. I’ve also plaintively wailed for similar access to Dublin Bus’s data.

There’s a lot of things different people can do with different data and they don’t always work for your company. I think it’s interesting to see the transport companies doing something, and some very interesting stuff has been done with available data in the aviation sector too (Flight Radar 24 is a key example).

However, I hadn’t come across a company actually running an open hackathon on their data, so this weekend’s event in Paris – it focussed on Transilien services, which is commuter rail in the Île-de-France area around Paris – was an interesting development. I’d keep an eye open for similar events and try to get there in the future if I can hang the small details together.

 

What do you love about programming?

Via a tweet from, I think, Kathy Sierra, in which she said this was the one interview question she had never been asked.

I started programming, a bit, when I was 13 and did it on and off until I was about 16. And then I stopped for 10 years. In 1999 I did an interview with a major Irish company which was looking for IT staff but who did not, for various reasons, have to have a degree in computer science. I got through that process and despite expecting to be put working on web technologies, I was sent for assembler training and then spent the next chunk of my life as an assembler programmer. Since then I have programmed a bit in Java, some in VB, some in R and now, occasionally in Python and again in Java.

Programming is an interesting activity. I love starting off with a problem to solve, and I love thinking about how I might solve the problem given the available tools. When you’re learning a language, this leads to various interesting algorithms as you code around a lack of knowledge. Sometimes it leads to massively inelegant solutions, other times it leads to things of pure beauty. I love programming purely for the problem resolution aspect of it, the fact that I can sit down with nothing but a piece of paper and a task to accomplish. For me, programming is more the side of working out how to accomplish something rather than purely executing it in code. There are, if you like, many ways to do that – the hard bit is the working out not necessarily the coding.

I don’t, in general, mind debugging my own code mainly because I generally understand what it is I was trying to accomplish. You learn a lot from the way you look at problems when you’re trying to identify where you went wrong in trying to solve them. In this respect, programming is always a learning process.

What I love about coding is that, typically, it opens up the possible. What can we achieve tomorrow that we could not do today?

Comparative infographics

There is a massive growth in the production of infographics of varying quality and if you’re interested, there’s a tumblr full of dodgy ones here. Mostly, that focuses on the quality of the graphic design and whether it accurately portrays the underlying data. 

However, I want to consider one particular type of infographic and that is an infographic that purports to compare two entities. I find they can be problematic even if they are beautifully designed. The main underlying issue is data quality.

They can be done according to a lot of useful rules such as citing the source of the data you are using for comparison – but if they miss a key component of a comparative infographic, then no matter how beautiful they are, they are still of questionable merit. Each comparison must be a like with like comparison.

So, for example, a graphic seeking to compare social media penetration doesn’t get it right if it’s comparing Facebook page loads with Flickr image uploads. That’s beyond unfair, and misleading. My favourite one lately has been a comparison of London and Paris in which the cost of an average dinner out was compared with the cost of dinner in one of Paris’s more exclusive restaurants, the greater London area was not compared with the greater Paris area, the cost of a family ticket in Disneyland was compared with the annual number of visitors to Harry Potter World, and the prices cited were in two different currencies, making the comparison almost meaningless.

Ultimately, I have to ask how highly we can praise an infographic for being graphically beautiful but not informative because the underlying data is not useful for comparison’s sake. I would say the value in an infographic is linked to how informative the underlying data is and, where comparative graphics are concerned, whether suitable comparative datapoints have been used.
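The like with like test is mechanical enough that you could check it before drawing anything. A sketch, with a made-up `DataPoint` structure of my own: refuse to compare two values unless they measure the same thing in the same unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPoint:
    place: str
    metric: str   # what is being measured, e.g. "average dinner for two"
    value: float
    unit: str     # e.g. "EUR", "GBP", "visitors per year"


def compare(a: DataPoint, b: DataPoint) -> float:
    """Return a.value - b.value, but only for a like with like comparison."""
    problems = []
    if a.metric != b.metric:
        problems.append(f"different metrics: {a.metric!r} vs {b.metric!r}")
    if a.unit != b.unit:
        problems.append(f"different units: {a.unit!r} vs {b.unit!r}")
    if problems:
        raise ValueError("; ".join(problems))
    return a.value - b.value
```

A price in pounds against a price in euro fails on unit; a ticket price against a visitor count fails on both. It won’t catch everything (an exclusive restaurant against an average one needs a human), but it catches the worst of it.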

Ben Shneiderman’s 8 Golden Rules of Data Science

This popped up in my twitter feed today – it’s a photograph of a slide from a talk given by Ben Shneiderman. I’m not sure I’d call them golden rules per se, but they are definitely a very decent framework to follow:

Preparation

  • Choose actionable problems and appropriate theories
  • Consult domain experts and generalists

Exploration

  • Examine data in isolation and contextually
  • Keep cleaning and add related data
  • Apply visualization and statistics: patterns, clusters, gaps, outliers, missing and uncertain data

Decision

  • Evaluate your efficacy, refine your theory
  • Take responsibility, own your failures
  • World is complex, proceed with humility.

Professor Shneiderman’s home page is here. The link to the tweet I picked all this up from is here, via Kirk Borne and Seth Grimes.

Changing times…

Via Damien Mulley’s fluffy links the other day I found myself perusing the Irish Motor Directory and Motor Annual 1911-1912 late last night. The directory itself can be found here, hosted by Lurgan Ancestry and while we’re at it, a shout out to My Kerry Ancestors who are talking about this link too. Okay, that’s the commercials out of the way.

I decided to see from it who was the first person in the “I come from a small, small town” where I grew up to register a car, and glanced down through the list looking at the addresses.

The first owner registered in the town where I grew up was my great grandfather, and it looks like he registered a motorbike. My mother was stunned, but was pretty certain that it was him, so I went to the 1911 census to check who of the relevant surname was living on the street concerned at the time, and by process of very simple elimination confirmed that yes, the named owner in question was her grandfather. In 1911, he was 27.

So I could write a bit about the family background but this is a data/tech blog and actually I’m going to write about changes in society.

If you have a look at the Lurgan website above, it’s actually interesting in the questions it leaves unanswered.

  • The register classifies vehicles by type – car, charabanc, bicycle, tricar, steam car, steam lorry, dogcart, steam plough. It would be enthralling to know who manufactured these things.
  • The register provides the registration numbers and some address information.
  • The addresses are interestingly diverse – for example, because I grew up in Cork, I was looking at the IF register – but a number of the addresses are in Dublin and the UK, for example.
  • In 1911-1912, there are 239 cars and 146 bikes registered in Cork, but the highest registration number is IF 434 as far as I can see. So I’m interested to see what the gaps are.
  • There are county and borough register authorities – I don’t know enough about local government organisation in Ireland in the early 20th century (but then, who does?).
  • This document was a reference handbook for motorists. So it was openly available.
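On the gaps: 239 cars plus 146 bikes is 385 vehicles, against a highest number of 434, so at least 49 numbers are unaccounted for, assuming the series starts at IF 1 and each vehicle holds exactly one number. If the directory were transcribed, finding them would be a few lines. The function names here are mine and the transcription is hypothetical.

```python
def parse_reg(reg: str, prefix: str = "IF") -> int:
    """Turn a registration such as 'IF 434' into its number, 434."""
    text = reg.strip().upper()
    if not text.startswith(prefix):
        raise ValueError(f"{reg!r} is not in the {prefix} series")
    return int(text[len(prefix):])


def missing_numbers(registered, highest=None):
    """Numbers from 1 up to the highest seen that have no entry."""
    seen = set(registered)
    top = highest if highest is not None else max(seen)
    return [n for n in range(1, top + 1) if n not in seen]


# The arithmetic from the directory's own totals for Cork.
print(434 - (239 + 146))
```

Feeding `missing_numbers` the parsed IF entries would list exactly which numbers are absent, which is where the interesting questions start: scrapped vehicles, owners who moved away, or simply entries the compiler never got.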

That last bit is the bit that interests me. Any motorist in Ireland could have had a list of all the car owners in Ireland, known their names and where they lived, sorted by registration number. This doesn’t happen today and I don’t know if it could. I just googled my own car reg and Motorcheck came up with a background check for the car – but it will not give any personal details about the owner of the car or the address at which they live.

The Reference book for 1911-1912 suggests that there are 9169 vehicles listed in it, split slightly in favour of cars. Registrations would have started in 1903 when the registration system was implemented first (citation – Wikipedia but I don’t think there’s much arguing here). The series for Cork, IF, started being used in 1903 and eventually ran out in 1935. The number/index letters were reversed and used again later between 1975 and 1976. So the only conclusion that I can draw about my great grandfather’s bike is that it was registered at some stage between 1903 and 1911, and the likelihood, I suspect, is that it was closer to 1903 than 1911 based on the numbers.

For comparison, 86,932 new cars were registered in Ireland in 2012. (Summary of Statistical Yearbook of Ireland, 2012). The 1911-1912 Reference Book was compiled by Henry G. Tempest and given the available communications options, it’s fair to say that to compile and print that information for over 9,000 vehicles was an achievement, but I couldn’t see him doing it for nearly 90,000 new cars, never mind all the cars still on the road from prior to 2012.

And times have changed. We are more concerned about personal data. For years, people have been exercising their right not to be listed in the phone directory, and I’m not sure anyone would want their address details along with details about their car in any easily accessible database for various reasons including, no doubt, not wanting to have their movements identified too easily, or not wanting to be easy prey for thieves.

Of course, I wouldn’t be me if I wasn’t thinking of ways I could analyse this data in more detail and wondering what other extraneous data sources could be used to enhance it (and not just, for example, the 1911 census).