December 2012 – Musings on Languages, IT and other stuff

Big Data. Many things to many different people

Late last night, I picked up a tweet from Hilary Mason, chief scientist with Bit.ly

I’m troubled by the increasing interpretation of “big data” to mean “data without the scientific method”. When did that happen?

This is an interesting question, made all the more difficult by my growing impression that the definition of Big Data keeps shifting. What is big data?

The truth is, I think a lot of people aren’t sure. Hilary herself provides an interesting definition:

I prefer the big = “too big to analyze on one computer” definition

I have some mixed feelings. I don’t like the phrase big data; I never have, because it comes across far too much like a marketing buzzword and too little like an underlying concept. For various reasons, people have lately had cause to ask me what I understand by “big data”, which suggests to me that the term has arrived at people without them recognising what lies beneath it.

For me, when I have to describe what I see it as, I say this: “We generate a lot of data from different activities within a given organisation. We allow some people to analyse certain specific areas of it because historically we didn’t have so much data, and it was all subject specific. But things are different now. Our activities generate a lot more data, and that data is sometimes very much interdependent. You may be a specialist in one particular area of an organisation and you may only care about that area. But your organisation is much bigger than your area, and it is often the case that marrying the data from your area with the data from other areas can have a huge impact on the way your organisation does business.” In other words, we have a lot of data now, some of it more voluminous than the rest, but the vast majority of organisations do not use their data to join the dots coherently.

When I read articles about big data and data science, I too am troubled. I’m troubled by the impression that big data is somewhere we should be without understanding why we should be there and what we can get from it. There is a lack of clarity about what a data scientist actually does, and business is not generally good with a lack of clarity. Matters are not helped when the media supply articles about how data science is the next big thing or how being a data scientist is the sexiest new job going.

It probably is, but this only serves to attract people who weren’t really interested in the first place.

I’m interested in the interpretation of numbers, what they mean, how we got to those numbers, where we can go with them. How they inform us. I equally got to where I am right now by recognising that a lot of people were very interested in drawing pretty pictures displaying numbers but not so interested in the validity of the numbers. I have seen bar charts comparing social media site usage which compared the number of Facebook page loads with the number of photo uploads on Flickr. You don’t need me to tell you this is not a valid comparison given that Flickr gets page loads and Facebook has photo uploads as well.

My big huge concern with big data is that people look at the big bit but not the data bit.

I can’t write this in 140 characters, by the way; that’s why we are here and not thrashing it out into the middle of the night on Twitter. Also, Hilary had a good go at it last night.

If you were an executive standing in front of me, I would ask whether you had ever measured your website response times against the demand on your website when, for example, you send out a marketing email; whether there were unexpected regional variations; or whether there are trends in the Google search terms bringing people to your website that indicate a questionable link somewhere or a lost business opportunity. You can call this big data if it makes you feel better, or data science. I tend to prefer data science because I suspect it is going to be around a lot longer than big data.

What matters to me is that you get the best possible information out of your data, regardless of how much or how little you have. One of the things that concerns me slightly about discussions on data science and big data is the lack of attention to basic skills in analytics. It is not just a case of running a SQL query and picking out the highest band of the ensuing bar chart. When we look at the skill sets required for data science, there tends to be a focus on computer programming (which is good, don’t get me wrong), but less importance attached to basic statistics.

When I got interested in this about 18 months ago, I had an understanding of what I wanted to do, and went back to college to get the maths and stats skills that are handy here. I’ve been programming for 12 years so I don’t worry too much about the necessary programming skills. What worries me a little is that we will wind up with a lot of people calling themselves data scientists on the basis of a few Python scripts and not a lot of understanding of the actual data.

When people focus on concerns about big data, they talk about the skills squeeze (see this New York Times piece, for example, and Jason Ward of EMC in Ireland on a similar subject) and not the actual underlying business of data science. A key issue is that data science is not just about dragging the data out into a spreadsheet, pressing a few buttons and going “hey presto”. Harvard Business Review had a useful piece on this which I recommend looking at.

Hilary Mason has an interesting piece on Getting Started With Data Science. Another place to start is probably to take a step back and try and describe what you’d want to do with a lot of data. What matters with data, no matter what scale it’s on, is how you interpret it. If you really want an example of why this is important, although it’s not data on a massive scale, Nate Silver’s work on the last US election is a good place to start, particularly given that he and a number of other analysts disagreed on the data interpretation. You need to recognise that data is to inform, not to be bent to your needs and that sometimes, that data will not tell you what you want to hear.

I’d agree with Hilary in that communication is a key skill which very often gets forgotten about, but this issue is not limited to data analytics.

So, this brings us back to what big data is, or is not, and whether it really means what we think it means. I really do think it means so many different things to so many people that as a label, it’s functionally useless. Hence, I’d prefer data science as a label, or data analytics. In this way, you can highlight that yes, methodology matters, and statistical skills will matter.

In essence, I would answer Hilary’s original question as follows: Big Data lost its underlying rigour when it became a product to be sold rather than a job to be done.

How obsessed is Ireland about property?

Brian Lucey flagged this on his Twitter feed this morning.

If you don’t want to click through, yesterday he posted the same post twice to his blog; the sole difference being that the two pieces had different titles, one property related, one more general. I’d almost say celebrity mag styled actually but I could be being unfair – the dentist rarely has the end of year edition of Hello or VIP given I have my annual check up in October.

Anyway. The money quote is this:

By a margin of almost 5-1 the property titled post got more hits

The title of Brian’s piece asked “Just how obsessed is Ireland about property?” but aside from the quote above, he doesn’t actually draw a conclusion. I imagine he leaves it as an exercise for the reader, but by implication he seems to be suggesting that Ireland is obsessed with property by a margin of 5 to 1 over blog posts on more general subjects.

I’m going to assume that Brian Lucey has his tongue stuck firmly in his cheek with this but I’m going to do a little spelling out here. You cannot draw any conclusions from the outcome of this experiment based on the information given by Brian in the relevant post.

Here’s why.

We do not know what the sample size was. It is possible (unlikely, but even so) that Brian got six hits in total on his blog yesterday. The population of Ireland is circa 4.5 million, so it is dangerous to extrapolate to the views of the population at large without knowing how large the sample was.
We do not know what the sources of the hits were: 1) links from other websites, 2) links from Brian’s own Twitter account, or 3) links from Facebook, Google Plus, any of the main discussion forums or his RSS feed. This is troublesome because it means we cannot allow for possible bias. If, for example, the bulk of Brian’s hits came from his Twitter followers, it is not safe to assume that they are a random selection of Irish people, because 1) people who follow Brian’s Twitter feed are more likely to be interested in economic matters, and potentially property matters, as he speaks about property in the media quite often and a lot of his pieces for the Examiner are property based, and 2) there is a slight bias in social media use against the older population.
In my view, people who follow Brian Lucey’s writings, either on Twitter or through the Irish Examiner, are more likely to be predisposed to an interest in Irish property than the population at large. Put simply, it is getting harder to get a random sample of the Irish population easily. The same goes for people who read NamaWinelake, by the way: it is a special interest site which draws people on account of that special interest. To get something closer to a random sample, it would almost be better to post the two links on a forum dedicated to, say, GAA supporters, as that would remove the confounding variable of an already existing interest.
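
To put a rough number on the sample size point, here is a minimal Python sketch. It assumes each hit is an independent reader choosing between the two posts (which real blog traffic almost certainly is not), and it uses the Wilson score interval, which is my choice of method, not anything from Brian's post. The hit counts are hypothetical; only the 5 to 1 split comes from the quote. It also says nothing about bias: a huge sample of property enthusiasts is still a sample of property enthusiasts.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion.

    successes: hits on the property-titled post (hypothetical counts)
    n: total hits on both posts
    z: normal quantile; 1.96 corresponds to roughly 95% coverage
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    )
    return centre - half_width, centre + half_width


# Same 5 to 1 split, very different certainty depending on total hits.
for total_hits in (6, 60, 600, 6000):
    property_hits = total_hits * 5 // 6
    low, high = wilson_interval(property_hits, total_hits)
    print(f"{total_hits:>5} hits: plausible property share {low:.2f} to {high:.2f}")
```

With only six hits the interval stretches below one half, so the data would not even rule out an even split among the blog's own readers; with hundreds of hits it tightens considerably.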

Here’s a useful primer on why this matters. One of the more famous wrong headlines in history is the Chicago Tribune’s headline announcing Dewey’s victory in the 1948 US Presidential election, which was underpinned in part by telephone polls. In 1948, access to a telephone was not uniform across the population; it favoured the better off. If you do not have a valid sample, your conclusions cannot be guaranteed to be valid. In fact, it’s getting harder to do this in Ireland. Someone I know once noted that 30 years ago, if you took a random sample of mass goers in Ireland, you were probably pretty close to a reasonable random sample of the wider population. But because the population of mass goers has changed vis-a-vis the wider population, this is no longer the case.
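
The telephone effect is easy to reproduce with a toy simulation. The sketch below is illustrative only: every number in it (40% phone ownership, 60% versus 40% support) is invented for the example and is not the real 1948 data. It builds a hypothetical population in which phone owners lean towards one candidate, then compares a random sample with a sample drawn only from phone owners.

```python
import random


def simulate_poll(population_size=50_000, sample_size=1_000, seed=1948):
    """Compare a random sample with a phone-owners-only sample.

    All rates below are made-up assumptions for illustration.
    Returns (true_share, random_sample_share, phone_sample_share),
    each being the share supporting hypothetical candidate A.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(population_size):
        has_phone = rng.random() < 0.4            # assumed phone ownership rate
        support_rate = 0.6 if has_phone else 0.4  # assumed lean by ownership
        supports_a = rng.random() < support_rate
        population.append((has_phone, supports_a))

    def share(people):
        return sum(1 for _, supports in people if supports) / len(people)

    phone_owners = [person for person in population if person[0]]
    random_sample = rng.sample(population, sample_size)
    phone_sample = rng.sample(phone_owners, sample_size)
    return share(population), share(random_sample), share(phone_sample)


true_share, random_share, phone_share = simulate_poll()
print(f"whole population: {true_share:.3f}")
print(f"random sample:    {random_share:.3f}")
print(f"phone owners:     {phone_share:.3f}")
```

Both samples are the same size, so the gap between them is not a sample size problem. Polling more phone owners would only make the wrong answer more precise.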

All I can conclude from Brian’s piece is this. Given a choice between two posts yesterday, almost five sixths of those reading his blog chose the one most likely to be about property. Given the lack of information about how the population reading his blog relates to the population at large, the size of the sample and the possibility of bias amongst people who read his blog, you cannot draw any conclusions about the wider population of Ireland.

I’m pretty sure Brian knows this by the way, but one of the things which tends to concern me about Ireland is the lack of attention to detail regarding figures, numbers and statistics and how they are interpreted. Statistics can be twisted because the vast majority of people are not aware of their limitations in this area.

Useful data sources

This is a temporary directory of possible sources of data for data visualisation and data analysis projects.

Mainly it’s here (at the moment) because I have not yet identified another home for it. It will probably move to a page of its own later, and maybe out to the projects site. Blink and you’ll miss it.

Amazon AWS Public Datasets

R datasets list compiled by Vincent Arel-Bundock

Shish list of open data sets.

Datamarket

Kaggle.