Care.data – Musings on Languages, IT and other stuff

If you’re in the UK at all, you may have heard some discussion of something called care.data. The general idea is that all healthcare data is centralised and that this repository is made available to researchers. Such a repository would be massively useful for healthcare research.

So far so good. As someone with a great deal of interest in data, and how it can be best used to advance human society, you’d think I’d be wild about this idea. I’m not wild about the implementation and this is a pity.

The data, we are told, will be pseudonymised. This is the number one problem I have with it – it’s not actually properly anonymised. It comes with postcode data and NHS number. In the UK, a postcode can in a lot of cases be enough to identify an individual. This is wrong.

This is before you start asking questions about who gets to use the data. Plus, given the changes to the NHS organisation in the UK courtesy of the current government, you’d have to ask whether the data is even going to be as useful as it might have been 10 years ago under a centralised system.

So okay, I can knock it and be concerned. But I do believe something akin to it would be useful. Not necessarily directly profitable, but useful. So how could we implement it?

Well, there’s no reason why we can’t ask, straight out, why postcode is relevant. It provides regional variation information. So one of the things we need to do is provide geographically classified data. Using postcodes to create a geographic classification which does not include the postcode itself is, or should be, straightforward enough. Ergo, the postcode issue can be dealt with.
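To make that concrete, here is a minimal sketch of the idea, not a description of how care.data does or should work. It assumes that the postcode area (the leading one or two letters) is a coarse enough geographic class, which would need checking against real population figures. The function names `postcode_area` and `generalise` and the record fields are made up for illustration.

```python
import re


def postcode_area(postcode: str) -> str:
    """Return the postcode area: the leading one or two letters.

    Assumption: the area alone is a coarse enough geographic class.
    """
    match = re.match(r"\s*([A-Za-z]{1,2})\d", postcode)
    if match is None:
        raise ValueError(f"unrecognised postcode: {postcode!r}")
    return match.group(1).upper()


def generalise(record: dict) -> dict:
    """Copy a record, replacing the full postcode with its area."""
    out = {k: v for k, v in record.items() if k != "postcode"}
    out["region"] = postcode_area(record["postcode"])
    return out


# Hypothetical record; the field names are illustrative only.
print(generalise({"postcode": "EH1 1YZ", "diagnosis": "asthma"}))
```

The full postcode never reaches the research copy. Only the derived region does, so regional variation can still be studied.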

The NHS number can be replaced with a different primary key, with the conversion table between the two kept alongside the original data rather than made available as part of the care.data database. Again, depending on the actual implementation of the data structures, this should be straightforward.
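A rough sketch of that key replacement might look like the following. It assumes the replacement key is a random value with no mathematical relationship to the NHS number, and that the conversion table lives only with whoever holds the original data. `KeyVault`, `pseudonymise` and the field names are hypothetical, and the NHS numbers shown are obviously fake placeholders.

```python
import secrets


class KeyVault:
    """Conversion table from NHS number to research key.

    Assumption: this object stays with the original data holder
    and is never shipped with the research database.
    """

    def __init__(self) -> None:
        self._by_nhs_number: dict[str, str] = {}

    def research_id(self, nhs_number: str) -> str:
        # Issue a random key on first sight, then reuse it so one
        # patient's records can still be linked to each other.
        if nhs_number not in self._by_nhs_number:
            self._by_nhs_number[nhs_number] = secrets.token_hex(16)
        return self._by_nhs_number[nhs_number]


def pseudonymise(record: dict, vault: KeyVault) -> dict:
    """Copy a record, swapping the NHS number for the research key."""
    out = {k: v for k, v in record.items() if k != "nhs_number"}
    out["research_id"] = vault.research_id(record["nhs_number"])
    return out


vault = KeyVault()
first = pseudonymise({"nhs_number": "0000000001", "test": "A"}, vault)
second = pseudonymise({"nhs_number": "0000000001", "test": "B"}, vault)
print(first["research_id"] == second["research_id"])
```

Because the key is random rather than a hash of the NHS number, nobody holding only the research data can work backwards to the original number.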

This deals with the data privacy side of things and one of the big huge issues I have with the current idea.

After that, we need to be aware that more data doesn’t always mean better or more accurate detail. Large datasets can amplify statistical errors which, given we are talking about health data sets, matter a lot. They affect real people.

These are the type of errors where, for example, 1 in 100 cases might be misdiagnosed because a particular test isn’t 100% accurate.
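A bit of arithmetic shows why this matters at scale. The sketch below assumes a test that is wrong 1 time in 100 in both directions, applied to a million records for a condition that affects 1 person in 1,000. The prevalence and population figures are invented purely for illustration, and `expected_counts` is a hypothetical helper.

```python
def expected_counts(population, prevalence, sensitivity, specificity):
    """Return expected (true positives, false positives) for a test."""
    affected = population * prevalence
    unaffected = population - affected
    true_positives = affected * sensitivity
    false_positives = unaffected * (1 - specificity)
    return true_positives, false_positives


# Illustrative assumptions only: 1,000,000 records, 1 in 1,000
# affected, and a test that is right 99% of the time either way.
tp, fp = expected_counts(1_000_000, 0.001, 0.99, 0.99)
share_correct = tp / (tp + fp)
print(round(tp), round(fp), round(share_correct, 3))
```

Under those assumptions there are roughly 990 true positives against roughly 9,990 false positives, so only about 9% of the records flagged as positive are actually right. A 1% error rate sounds small, but in a very large dataset it produces a lot of wrongly labelled people.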

Ultimately, I’m strongly in favour of this project, or, more to the point, a project like it, provided it comes with data protection built in and is implemented to benefit health care rather than, for example, corporate health business interests. As matters stand, I’m inclined to feel that there are lacunae here.