Category Archives: datascience


I really have a lot of things to catch up on, but a couple of weeks ago, a piece on the Business Insider site caught my eye. It suggested that if you wanted to work for Google, you needed to know Matlab, attributing the comment to a guy called Jonathon Rosenberg.

This caused some discussion on twitter in the days afterwards. Mostly, people found it difficult to believe, particularly when Google uses a bunch of other tools, including my personal choice for a lot of data analysis, R.

I am not sure that Matlab is a mandatory requirement to work at Google; it doesn’t necessarily turn up in any of their job ads that I might be interested in, but in some respects, I can understand why A N Company might do something like this. It’s a little sorting mechanism. The point I found most interesting about the piece above was less that Google were looking for Matlab and more that the writers of the piece had never heard of Matlab.

I was once interviewed about modern web technology and how it might benefit the company concerned way back in the early days of the web becoming a consumer sales channel. My view of the discussion ultimately wasn’t that they wanted me to work on their web interfaces (not at that stage anyway), but they wanted to see what my ability to learn about new stuff was. It may well be that if you go to work for Google in some sort of research job, you’ll use Matlab. Or, more probably, you’ll learn a bunch of other things in the area that you are working.

Either way, comments like Rosenberg’s may or may not be official hiring policy, but it’s often worth considering that they are asking a broader question: less “Can you use Matlab?” and more “Can you prove to us that you can develop in whatever direction we throw you?”

And if you haven’t heard of Matlab, the chances are, you may not be able to.

Professor David Spiegelhalter at the RIA

Friday 7 November saw Professor David Spiegelhalter talking about risk at the Royal Irish Academy. If you’re not familiar with him, his site is here, he occasionally pops up on BBC Radio 4’s More or Less and other interesting places.

Risk is an interesting thing because humans are appallingly bad at assessing it. Ultimately, the core of Professor Spiegelhalter’s talk focused on calculating risk (yes, there is a micromort unit of measurement) and more specifically, communicating it in human friendly terms. This is not to suggest statisticians are not human; only that they have a language (we have a language) that isn’t always at one with general understanding.
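As an aside, the micromort mentioned above makes the arithmetic of risk communication trivial: one micromort is simply a one-in-a-million chance of death. A minimal sketch (the 1-in-10,000 risk below is an invented figure, purely for illustration):

```python
def to_micromorts(probability_of_death):
    """Convert a probability of dying into micromorts:
    1 micromort = a one-in-a-million chance of death."""
    return probability_of_death * 1_000_000

# An activity carrying a 1-in-10,000 chance of death (an invented
# figure for illustration) works out at roughly 100 micromorts:
print(to_micromorts(1 / 10_000))
```

The appeal of the unit is exactly the point of the talk: “100 micromorts” communicates far better than “a probability of 0.0001”.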

This isn’t the only problem either – humans also appear to be very good at not worrying about non-immediate risks. This presents a number of challenges in terms of people’s decision-making behaviour.

Talks like this can be massively entertaining if done well; less so if done badly. One of the overwhelming features of the evening was the absolute contrast between Professor Spiegelhalter’s talk and Patrick Honohan’s response, which focused on difficulties in risk assessment in the financial sector. I took a slightly dim view of the response on the basis that every single banking ad makes it clear that the value of your home (or assets) can go down as well as up, and did so for most of the 2000s in this country; it isn’t so much that we didn’t understand the risk – many people just did not want to accept it. In certain respects, it has a lot in common with people who find it hard to live healthily now for benefits sixty years down the line. If I had to choose who got their message across more effectively, by some distance it was Professor Spiegelhalter.

Talks of this nature interest me, particularly as they relate to numbers and numeracy, and in this case, risk. People are never particularly good with probability and chance despite all that Monopoly board training each Christmas. Ultimately, the impression I got from the talk is that the debate has moved on somewhat from “what is the risk of [X bad or good thing] happening” to “how do we effectively communicate this risk”. It’s interesting – in a tangential way – that we are swimming in methods of communicating things these days, between online streaming, social media feeds and many online publishing platforms, and still, with science and numbers, we are only finding the right narrative for engagement in a hit-or-miss manner. Professor Spiegelhalter delivers his talk in an excellent manner. It is a pity that more people will not get to hear it.

On a related note, if you’re interested in talks of a science and mathsy flavour, the RIA and the Meteorological Society are prone to organise such things on the odd occasion. Check their websites for further information.

SamaritansRadar is gone

The application was pulled on Friday 7 November. Here is the statement issued by the Samaritans on that occasion.

I am not sure how permanently gone it is, but this is worth noting:

We will use the time we have now to engage in further dialogue with a range of partners, including in the mental health sector and beyond in order to evaluate the feedback and get further input. We will also be testing a number of potential changes and adaptations to the app to make it as safe and effective as possible for both subscribers and their followers.

Feedback for the Radar application was overwhelmingly negative. There is nothing in this statement to suggest that the issue for the Samaritans is that there were problems with the app, only that some people were vocal about their dislike of it.

I really don’t know what to say at this stage. While I’m glad it has been withdrawn for now, I’m not really put at ease to know that the Samaritans have an interest in pushing it out there again. It was a fiasco in terms of app design and especially community interaction. There is nothing, absolutely nothing, to indicate that they saw the light about the technical issues with the application, the ethical issues with the app and the legal difficulties with asserting they weren’t data controllers for that app.

I hate this because a) it negatively affected a lot of people who might under other circumstances use Samaritans services and b) it makes the job of data scientists increasingly difficult. It is very hard to use a tool to do some good stuff when the tool has been used to do bad stuff.

Big data – is this really what we want?

If you think of the other places where Big is used to describe an industry, it’s not generally used by people who like the industry in question. Big Pharma. Big Agriculture. Big Other Things We’d Like To Scare You About.

But the data industry insists on pushing big data as the next big thing without considering that, equally, there are a lot of people pushing back against it. See, for example, How Big Data Can Tell A Lot About You Just from Your Zip Code.

This is not good for data analytics. Any term which can be used to engender fear and nerves is not so much an asset as a liability.

There’s an apocryphal story about Target apparently identifying when a teenager was pregnant from her shopping habits, writing to her, her father finding out and getting into a rage with the local branch of Target and having to apologise. A number of people in the data industry have described it although I can’t actually find a source for it. A lot has been written about how retailers can learn a lot about you from your habits, however, which has an impact on which special offers you get when they deign to send you vouchers.

Some people find this a little bit creepy. This, together with news stories about What Your ZipCode Says About You and “What Big Data Knows About You” just reinforces this.

So a couple of things. We need to stop talking about Big Data. Big Data will come back to bite the analytics industry as consumers push back against what they perceive as a bit of spying and general creepiness. And we need to focus on the benefits to consumers of data analytics. It is not just a question of buying them off with extra vouchers. Pinterest, for example, is getting much better with recommendations for new boards (although once they get hold of an idea such as Treasa Likes Fountain Pens it takes weeks for them to realise I’m now following enough fountain pen boards). On the other hand, Amazon is not getting so much better with book recommendations lately.

The other problem I see with the label big data is that it allows people to avoid thinking about what they are really trying to achieve. The question “What are we doing about big data?” never comes across as anything other than “I read this in HBR and everyone’s on about it on my LinkedIn groups so we need to hop on this bandwagon”.

If you take a step back, it’s better to think about this question: “What data do we have, and are we using it effectively to support both ourselves and our customers?” It may be big, it may be small. Some of it may be system related – getting pages to load faster, for example – some of it may be habit-recognition related – prefilling forms for transactions which happen regularly like, oh, flying to London every Monday morning.

Numbers have power

This morning, the front page of the Irish Examiner, which you can see here on Broadsheet (third one down) caught my attention for this headline:

46% back death penalty for child rape

The subheading is “Farmers take hardline on law and order”

The very first line of the piece underneath is as follows:

The death penalty should be introduced for the crime of raping a child, according to a national opinion poll.

There are several problems with this in my view. I like the Examiner a lot, and the journalist under whose byline this appears, Conall O Fatharta, has done quite a lot of interesting reporting in the last few months. But when you’re claiming that a national poll says that the death penalty should be introduced for the crime of raping a child (or, in fact, any crime), then two things are necessary:

  1. the proportion of people (nationally) who support that assertion should be greater than 50% (it’s not in this case, because already, the headline makes it clear that a majority do not); and
  2. the poll should be on the basis of a reasonable sample of the population at large. If you read the piece more closely, however, the poll was limited to farmers.

As such, the headline which the Examiner plonked on top of the story lacks some important detail, and given its position on the front page, that is deeply regrettable.

What do we know about the poll?

The poll was carried out on behalf of (or by) the Irish Examiner and the Irish Creamery Milk Suppliers Association. We do not know (from this report at least) how many farmers were surveyed, and this is important because the journalist has broken down the figures geographically.

Why does this matter?

It matters on several levels, but chief amongst them is that we cannot be certain that a subset of ICMSA farmers is representative of the nation as a whole, or, in fact, even of farmers in general. For example, a simple question we could ask is whether it was done on the basis of ICMSA membership; in that case, given that the ICMSA represents predominantly dairy farmers, is it safe to say the output from a survey of dairy farmers applies to all other farmers?

An additional factor is that the CSO carries out a census of agriculture from time to time and there are a couple of pieces of information which are worth noting. The most recent report which I can find on the CSO’s website is for 2010, so, four years old. The press release is here, on the CSO’s website and it summarises the findings nicely. The full report is here.

There are a couple of key pieces of information in the summary which matter here:

  • More than half of all farm holders were aged 55 years or more. The number of farmers aged under 35 fell by 53% since 2000
  • One in eight (12.4%) farms is owned by a female.

Additionally, it is possible that the vast majority of farmers are rural dwellers but a greater proportion of the population are now urban dwellers. I have not found straight figures for that.

These figures are not representative of the population as a whole. If you look at the CSO census figures, only a third of the population of the country is over the age of 45 which means the proportion of the population which is over the age of 55 is less again.

Additionally, in 2011, at the time of the last census, more than 50% of the population were female. You can find the CSO’s population statistics by age from 2011 here.

Both the headline and the first line of the story give the impression initially that the results are nationally representative but as the survey was of farmers, the participants are age and sex skewed away from the shape of the population as a whole.
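To make that skew concrete, here is a quick back-of-the-envelope comparison using only the figures quoted above (12.4% of farms held by a woman per the 2010 census of agriculture, versus just over 50% of the population being female per the 2011 census); the numbers are approximate:

```python
# Figures quoted above: 12.4% of farms are held by a woman (2010 census
# of agriculture) vs roughly 50% of the population being female (2011
# population census). Values are approximate.
female_share_farm_holders = 0.124
female_share_population = 0.50

# Women are underrepresented among farm holders by roughly a factor of
# four, one concrete way of seeing that the survey frame is sex-skewed:
skew = female_share_population / female_share_farm_holders
print(round(skew, 1))  # ≈ 4.0
```

A similar calculation on the age figures would show the same pattern in the other direction.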

So, the subheading mentioned farmers; what actually is the problem here?

Three things. Firstly, we get our news from various sources, which means that pieces of information might get cut, such as on a twitter feed which may not necessarily highlight that this is a Farming Spotlight piece. Not everyone might click through beyond the headline. This is particularly important as links get passed around. This, by the way, was the Examiner’s own tweet of its front page, and inline, you will only see the top half.

Secondly, for me, a story which effectively boils down to “a sample of the population skewed by age, gender and urban/rural divide have this opinion which may or may not be representative of the population as a whole” really doesn’t belong as the top front headline on a national newspaper. In short, while it is, in passing, interesting, it isn’t really a major story.

Thirdly: the Irish Examiner has not provided any useful information (that I can find) in terms of the number of respondents, how the survey was carried out and what the estimated margin of error was. If you check any political poll reporting, the number of people surveyed along with the margin of error is always provided, along with an indication of when the poll was carried out. This is news at the moment, possibly because of the National Ploughing Championships but again, the statistical basis for the survey is missing.
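For what it’s worth, the headline margin of error that poll reports quote can be approximated from the sample size alone. A minimal sketch in Python (the sample sizes below are illustrative, since the Examiner did not publish one):

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated
    from a simple random sample of n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

# A typical national political poll of ~1,000 respondents:
print(round(margin_of_error(1000) * 100, 1))  # ≈ 3.1 percentage points

# A smaller sample of ~400 respondents:
print(round(margin_of_error(400) * 100, 1))  # ≈ 4.9 percentage points
```

This is exactly the “plus or minus 3%” figure political poll reports routinely carry, and it is the kind of detail missing from this story.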


Be all and end alls: Natural Language Processing?

I have some doubts about the effectiveness of anything which depends heavily on natural language processing at the moment – I think there’s a lot of interest in the field but I don’t really think it has reached a point of dependability. One of the highest profile – I hesitate to use the word experiment – pieces of work this year, for example, included this comment:

Posts were determined to be positive or negative if they contained at least one positive or negative word, as defined by Linguistic Inquiry and Word Count software (LIWC2007) (9) word counting system, which correlates with self-reported and physiological measures of well-being, and has been used in prior research on emotional expression (7, 8, 10)

(Experimental evidence of massive-scale emotional contagion through social networks, otherwise known as the Facebook emotion study)

Anyway, the reason I am writing about this again today was that this piece from Forbes turned up in my twitter feed and the line which caught my eye was this:

Terms like “overhead bin” and “hate” in the same tweet, for example, might drive down an airline’s ranking in the luggage category while “wifi” and “love” might drive up the entertainment category.

Basically, the piece is a bit of a puff piece for a company called Luminoso, and it has as its source this piece from Re/Code. Both pieces are talking about some work Luminoso did to rate airlines according to the sentiment they evoke on twitter.

If you look at the quote from the Facebook study above, the first thing that should stand out immediately is that under their stated criteria, it is clearly possible for a piece of text to be both positive and negative at the same time. All it has to do is feature one word from each of the positive and negative word lists. Without seeing their data, it is hard to make a call on how frequently that happened, whether they checked for it, whether they controlled for it, or whether they excluded it. The Forbes quote above is likewise worryingly simplistic in terms of understanding what needs to be done.
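That both-positive-and-negative failure mode is easy to demonstrate. Here is a toy version of the one-word-is-enough criterion in Python (the word lists are invented for illustration; they are not LIWC’s actual dictionaries):

```python
# Toy version of the word-counting criterion quoted above: a post is
# "positive" if it contains at least one positive word, "negative" if
# it contains at least one negative word. These word lists are invented
# for illustration; LIWC's real dictionaries are far larger.
POSITIVE = {"love", "great", "happy"}
NEGATIVE = {"hate", "awful", "delayed"}

def classify(text):
    words = set(text.lower().split())
    return {
        "positive": bool(words & POSITIVE),
        "negative": bool(words & NEGATIVE),
    }

# A single tweet can trip both lists at once:
print(classify("love the wifi but hate the overhead bin"))
# {'positive': True, 'negative': True}
```

Under this scheme, the very tweet from the Forbes quote counts as both positive and negative, which is precisely the ambiguity the studies gloss over.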

This is Luminoso’s description of their methodology. It doesn’t give away very much but given that they claim abilities in a number of languages, I really would not mind seeing more about how they are doing this.

Facebook and that study

Just briefly, given the general response to the Facebook emotional contagion article on PNAS a while back (an hour is a long time on the internet, let’s face it), the question I would have to ask is this: is everyone in Facebook so attached to what they can do with their dataset that they no longer remember to ask whether they should be doing that stuff with their dataset?

A while back, I met a guy doing a PhD in data visualisation or something related and he spoke at length about how amazing it was, what could be done with health data and how the data had to be freed up because it would benefit society so much. I’ve never really bought that idea because the first thing you have to ask is this: do individuals get messed up if we release a whole pile of health data, and if so, to what extent are you willing to have people messed up?

What I’m leading to here is the question of group think and yesmenery. Ultimately, there comes a point where people are so convinced that they should do what they want, that they are unwilling to listen to dissent. The outcry over Facebook’s study has been rather loud and yet, it doesn’t appear to have occurred to anyone who had anything to do with the study that people might find it a bit creepy, to say the least. It’s not even a question of “oh, you know, our terms and conditions” or “oh, you know, we checked with Cornell’s review board”, it’s just straight up “is it creepy that we’re trying to manipulate people’s feelings here? Without telling them?”

I mean, I can’t ever imagine a case in which the answer to that question is anything other than Yes, yes it is creepy and eugh. And yet, it doesn’t seem to have occurred to anyone connected with it that it was kind of creepy and gross.

Once we get past that, what’s being focussed on is the datascience aspect, and I have a hard time swallowing that too. This was a psychological experiment, not a datascience one. I mean, if you did a similar study with 40 people, you wouldn’t call it a statistical experiment, would you? In many respects, the datascience aspect is pretty irrelevant; it’s a tool to analyse the data and not the core of the experiment in and of itself. A datascience experiment might involve identifying the differences in outcome between using a dataset with 10,000 records and a dataset with 10 million records, for example. Or identifying the scale of difference in processor speeds between running a data analysis on one machine versus another.

Anyway, the two main issues I want to take away from this is that a) it wasn’t really a datascience experiment and b) sometimes you need to find people who are willing to tell you that what you are doing is ick, and you need to listen to them.

Thing is – and this is where we run into fun – what have they done that they haven’t told us about?