Big Data. Many things to many different people

Late last night, I picked up a tweet from Hilary Mason, chief scientist at Bit.ly:

I’m troubled by the increasing interpretation of “big data” to mean “data without the scientific method”. When did that happen?

This is an interesting question, made all the more difficult by my growing impression that the definition of Big Data keeps shifting. What is big data?

The truth is, I think a lot of people aren’t sure. Hilary herself provides an interesting definition:

I prefer the big = “too big to analyze on one computer” definition

I have some mixed feelings. I don’t like the phrase big data; I never have, because it comes across far too much like a marketing buzzword and not enough like an underlying concept. For various reasons, people have lately had cause to ask me what I understand by “big data”, which indicates to me that it’s something that has come at people without them recognising what lies beneath it.

When I have to describe what I see it as, I say this: “We generate a lot of data from different activities within a given organisation. We allow some people to analyse certain specific areas of it because, historically, we didn’t have so much data and it was all subject specific. But things are different now. Our activities generate a lot more data, and that data is sometimes very much interdependent. You may be a specialist in one particular area of an organisation, and you may only care about that area. But your organisation is much bigger than your area, and it could be, and often is, the case that marrying the data from your area with the data from other areas can have a huge impact on the way your organisation does business.” In other words, we have a lot of data now, some data sets more voluminous than others, but the vast majority of organisations do not use their data to join the dots coherently.
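As a rough sketch of what marrying data from two areas might look like, assume a support team and a sales team each keep their own records keyed by a customer ID. The data sets, field names and figures below are all invented for illustration; they are not from any real organisation.

```python
# Hypothetical records from two separate areas of one organisation.
# All names and numbers are invented for illustration.
support_tickets = {"c1": 5, "c2": 0, "c3": 7}  # tickets raised per customer
renewals = {"c1": False, "c2": True, "c3": False, "c4": True}  # renewed or not


def join_on_customer(tickets, renewed):
    """Inner join: keep only customers present in both data sets."""
    shared = sorted(tickets.keys() & renewed.keys())
    return [(cid, tickets[cid], renewed[cid]) for cid in shared]


def mean_tickets(rows, renewed_flag):
    """Average ticket count for customers with the given renewal outcome."""
    counts = [t for _, t, r in rows if r == renewed_flag]
    return sum(counts) / len(counts) if counts else None


rows = join_on_customer(support_tickets, renewals)
print(rows)
print("renewed:", mean_tickets(rows, True))
print("lapsed:", mean_tickets(rows, False))
```

Neither team would see the pattern from its own data alone; it only appears once the two are joined. Whether the pattern means anything is a separate question that still needs proper analysis.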

When I read articles about big data and data science, I too am troubled. I’m troubled by the impression I have that big data is somewhere we should be, without understanding why we should be there and what we can get from it. There is a lack of clarity about what a data scientist actually does, and business is not generally good with a lack of clarity. Matters are not helped when the media helpfully supply articles about how data science is the next big thing or how being a data scientist is the sexiest new job going.

It probably is, but this only serves to attract people who weren’t really interested in the first place.

I’m interested in the interpretation of numbers: what they mean, how we got to them, where we can go with them, and how they inform us. I also got to where I am now by recognising that a lot of people were very interested in drawing pretty pictures displaying numbers but not so interested in the validity of those numbers. I have seen bar charts comparing social media site usage which compared the number of Facebook page loads with the number of photo uploads on Flickr. You don’t need me to tell you this is not a valid comparison, given that Flickr gets page loads and Facebook has photo uploads as well.

My big huge concern with big data is that people look at the big bit but not the data bit.

I can’t write this in 140 characters, by the way; that’s why we are here and not thrashing it out into the middle of the night on Twitter. Also, Hilary had a good go at it last night.

If you were an executive standing in front of me, I would ask whether you had ever measured your website response times against the demand on your website following, for example, a marketing email; whether there were unexpected regional variations; or whether there are trends in the Google search terms bringing people to your website that indicate a questionable link somewhere or a lost business opportunity. You can call this big data if it makes you feel better, or data science. I tend to prefer data science because I suspect it is going to be around a lot longer than big data.
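To make the first of those questions concrete, here is a minimal sketch that compares response times in the hours after a marketing email with the rest of the day. The hourly figures, the send time and the two-hour window are all assumptions invented for illustration; real numbers would come from your own server logs and mailing records.

```python
from statistics import mean

# Hypothetical hourly log: (hour of day, requests served, mean response time in ms).
# The figures are invented; real ones would come from your web server logs.
hourly_log = [
    (8, 1200, 210), (9, 1300, 220), (10, 4100, 540),
    (11, 3900, 510), (12, 1500, 230), (13, 1400, 225),
]
email_sent_hour = 10  # assumed send time of the marketing email
window_hours = 2      # assumed length of the period we attribute to the email


def split_by_campaign(log, sent_hour, window):
    """Separate hours inside the post-send window from all other hours."""
    def in_window(row):
        return sent_hour <= row[0] < sent_hour + window

    during = [row for row in log if in_window(row)]
    outside = [row for row in log if not in_window(row)]
    return during, outside


during, outside = split_by_campaign(hourly_log, email_sent_hour, window_hours)
print("mean response during campaign window:", mean(r[2] for r in during))
print("mean response otherwise:", mean(r[2] for r in outside))
```

A gap between the two averages is only a prompt for further questions, not proof that the email caused the slowdown; you would still want to rule out other explanations for the same hours.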

What matters to me is that you get the best possible information out of your data, regardless of how much or how little you have. One of the things that concerns me slightly about discussions on data science and big data is the lack of attention to basic skills in analytics. It is not just a case of running a SQL query and picking out the highest band of the ensuing bar chart. When we look at the skill sets required for data science, there tends to be a focus on computer programming (which is good, don’t get me wrong), but less importance attached to basic statistics.
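As a small illustration of why the tallest bar is not the whole story, the sketch below compares two hypothetical regions using a crude interval of two standard errors around each mean. The sales figures are invented, and the two-standard-error rule is a rough rule of thumb chosen for brevity, not a formal significance test.

```python
from math import sqrt
from statistics import mean, stdev

# Invented daily sales figures for two hypothetical regions.
samples = {
    "A": [10, 12, 11, 13, 9],
    "B": [4, 20, 8, 18, 10],
}


def rough_interval(values, width=2.0):
    """Mean plus or minus `width` standard errors (a crude rule of thumb)."""
    centre = mean(values)
    std_err = stdev(values) / sqrt(len(values))
    return centre - width * std_err, centre + width * std_err


def intervals_overlap(first, second):
    """True when two (low, high) intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]


tallest_bar = max(samples, key=lambda name: mean(samples[name]))
interval_a = rough_interval(samples["A"])
interval_b = rough_interval(samples["B"])

print("tallest bar:", tallest_bar)
print("intervals overlap:", intervals_overlap(interval_a, interval_b))
```

In this made-up case region B has the taller bar, but its figures are so spread out that the two intervals overlap, so the chart alone would not justify declaring B the winner.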

When I got interested in this about 18 months ago, I had an understanding of what I wanted to do, and went back to college to get the maths and stats skills that are handy here. I’ve been programming for 12 years so I don’t worry too much about the necessary programming skills. What worries me a little is that we will wind up with a lot of people calling themselves data scientists on the basis of a few Python scripts and not a lot of understanding of the actual data.

When people focus on concerns about big data, they talk about the skills squeeze (see this New York Times piece, for example, and Jason Ward of EMC in Ireland on a similar subject) and not the actual underlying business of data science. A key issue is that data science is not just about dragging the data out into a spreadsheet, pressing a few buttons and going “hey presto”. Harvard Business Review had a useful piece on this which I recommend looking at.

Hilary Mason has an interesting piece on Getting Started With Data Science. Another place to start is probably to take a step back and try to describe what you’d want to do with a lot of data. What matters with data, no matter what scale it’s on, is how you interpret it. If you really want an example of why this is important, although it’s not data on a massive scale, Nate Silver’s work on the last US election is a good place to start, particularly given that he and a number of other analysts disagreed on the data interpretation. You need to recognise that data is there to inform, not to be bent to your needs, and that sometimes the data will not tell you what you want to hear.

I’d agree with Hilary that communication is a key skill which very often gets forgotten about, but this issue is not limited to data analytics.

So, this brings us back to what big data is, or is not, and whether it really means what we think it means. I really do think it means so many different things to so many people that as a label, it’s functionally useless. Hence, I’d prefer data science as a label, or data analytics. In this way, you can highlight that yes, methodology matters, and statistical skills will matter.

In essence, I would answer Hilary’s original question as follows: Big Data lost its underlying rigour when it became a product to be sold rather than a job to be done.