Category Archives: datascience

The death of big data

A couple of people have tweeted links to this article in my stream this morning and a couple of comments in it stood out for me, particularly bearing in mind I’ve already considered the concept of big data. Money quote from that piece:

I’m troubled by the impression I have that big data is somewhere we should be at without understanding why we should be at it and what we can get from it.

Moving back to the piece from VentureBeat, one of the standout sentences for me was this one:

The phrase “big data” is now beyond completely meaningless.

I’ve never much liked the term big data because, from my point of view, it never was meaningful. And yet there are still people having conversations that go “what are we doing about big data”.

This is the wrong question. The question is “how do we best exploit the data we have, how do we improve the quality of the data we have”. Scale has very little to do with this when you think about it.

Data is all about the questions you ask of it.


Etsy is an online marketplace for handcrafted goods and related specialist objects. They present some of their business data here in their monthly Weather Report which is – in my opinion – quite a nice idea. I’d like to see more companies, and not just in the internet start-up branch, do something similar rather than just waiting for filing time.

Etsy was the first company I started thinking about for this project for various reasons – first of all, they totally drove their market and are still the market leaders globally in that zone despite some local competition in smaller market areas. What they do, they do very, very well. But they are not necessarily a high profile company like, for example, the Netflixes and the Groupons. They do, however, have some interesting ideas in terms of organising their market and their staff. Their process of increasing their numbers of female engineers was a masterclass in not paying lip service to something they wanted to change.

According to Amazon, Etsy get their web log data processed on MapReduce, and actually, Etsy have blogged about that here and it is well worth a read if you are interested in data analytics and the requirements of companies in the new economy.

But that doesn’t answer the question as to what I would do if I had access to their data and let’s be honest, Etsy are pretty hot in terms of dealing with their data themselves so whatever I suggest, they may well have it covered.

The first thing I would do is look at data for Etsy outside America. I’m interested in international sales. Sales from America to France, from France to Germany, from Australia to Italy. If you pushed me to the wall and said “Guess”, I’d be willing to assume that a significant proportion of Etsy’s business sales are intra-United States. I’m interested in the breakdown in sales outside that piece of their business because to some extent, that may well be where much of their growth comes from. Etsy has done some very interesting localisation of their site – see here (yes, they blogged about that too) but I’d like to drill down into the numbers of pages they are serving in their locales (currently, in addition to English (US) and English (UK) they are providing localisation in German, French, Italian, Spanish and Dutch) and additionally what is getting hit by Google Translate, whether it is English pages or any of the other locales. From a currency point of view, they are providing pricing in significantly more currencies – I’m interested in seeing how the currencies line up with the language and locales. Right now, Etsy recognises that I speak English, that I live in Ireland and that I like my prices in Euro. But I could have them in Thai baht if I wanted.

I’m interested in how Etsy’s non-US market is playing out. Whether there’s a dependency on English for those languages which do not have language content localised – for example Japanese, or whether much of it gets streamed through Google Translate, how much trade not featuring US sellers or buyers is happening, and what networks are cropping up again and again in those sales; whether there is an obvious leaning for many people in Japan to buy handcrafts from, say, Australia. Whether Italian products are going down particularly well in Denmark.

I’m interested in changing life for people who might buy products through the site. Part of this is by making it easier to identify lower and higher delivery charges – for example, intra Europe is less expensive than US-Europe. So I’d like to find a way of setting up search/product offerings in Etsy that can be done on the basis of likely postal charge. Currently, I don’t think this is possible – the search is limited on the basis of whether a product will be dispatched to your location or not, and not sorted according to possible cost – but it could be done by setting up banding based on the delivery charges in the store fronts, potentially. I’d also like it if, underlying, the system which serves storefront pages to possible customers could learn when a particular product created in one part of the world seems to have a particular following in another part of the world. I’d be interested to see what Etsy are doing in terms of localising demand beyond the need to serve products which can be dispatched to your country of location or not and whether this can be used to drive market penetration outside the US.
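The banding idea could be sketched roughly as follows. Everything in it is hypothetical: the band thresholds, the `Listing` fields and the sample charges are made up for illustration and say nothing about how Etsy actually stores or searches delivery data.

```python
from dataclasses import dataclass

# Hypothetical band edges in euro (upper bounds are inclusive).
BANDS = [(5.0, "low"), (15.0, "medium")]


@dataclass
class Listing:
    title: str
    ships_from: str
    delivery_charge: float  # charge to the buyer's country, in euro


def delivery_band(charge):
    """Return the label of the first band whose upper bound covers the charge."""
    for upper, label in BANDS:
        if charge <= upper:
            return label
    return "high"


def sort_by_delivery(listings):
    """Order listings from cheapest to dearest delivery charge."""
    return sorted(listings, key=lambda listing: listing.delivery_charge)


# Invented sample listings, purely for illustration.
listings = [
    Listing("Knitted scarf", "US", 18.50),
    Listing("Ceramic mug", "FR", 6.00),
    Listing("Leather bookmark", "IE", 2.50),
]

for item in sort_by_delivery(listings):
    print(item.title, item.ships_from, delivery_band(item.delivery_charge))
```

A search filter could then offer the band labels as options, which is simpler for a buyer than comparing exact charges shop by shop.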

In summary then, I’m interested in Etsy’s non-US data. I’m interested in extra-US sales activities, I’m interested in measuring whether the localisation they have done so far is matching how their international markets are moving. I’m interested in using this data to tweak how products are served to potential customers, and I’m interested in enhancing the available information to a customer in terms of delivery issues, for example. I’m particularly interested to see how Etsy is doing in the UK compared to non-English language locales on a similar scale (say Germany, France, Italy). I’m very interested to see how Etsy is doing in Japan and India and what the trends there have been over the last 2-3 years for example. I want to see if particular locales are showing organic growth and I’m interested to see what the company is doing to drive growth outside the US heartland.

This is what I would do with some of Etsy’s data if I ever got my hands on it. Also, I’d implement a wishlist. Please can I have a wishlist.


ETA: Etsy’s localised newsletters are great and yes, they have some very decent localised search as well. I am completely impressed.




Passenger Air transport in Europe, 2004 to 2011

One of the things I wanted this year was a little experimentation with Tableau so I had been looking around for some data to play with. The above data comes to you courtesy of Eurostat and it relates to passenger transport by air in Europe. I’ve covered the period 2004-2011 and chosen these countries because they provided complete data for that period. There are a couple of other countries with incomplete data in the Eurostat tables as well which you can find here.

I did a little bit of work with the data because I wanted to identify two underlying stories. The first one – which you can see in this display – takes the absolute passenger figures and divides them by the population of the countries concerned so that we get a measure for passenger flights per head of population. This is interesting because it can highlight a couple of things – necessity (see Ireland and Iceland for example which are relatively small countries with no other connection options to other countries), economic strength (see Switzerland) and, rather more difficult to measure, the importance of travel.
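The per-head measure is a simple division, sketched below. The function name and the two sample countries are assumptions for illustration; the figures are round made-up numbers, not values from the Eurostat tables.

```python
def passengers_per_head(passengers, population):
    """Air passengers carried in a year divided by the country's population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return passengers / population


# Illustrative round numbers only, not Eurostat data:
# country -> (passengers in a year, population)
sample = {
    "Country A": (24_000_000, 4_000_000),
    "Country B": (160_000_000, 80_000_000),
}

per_head = {
    country: passengers_per_head(passengers, population)
    for country, (passengers, population) in sample.items()
}

# Rank countries by flights per head rather than by absolute passenger numbers.
for country, value in sorted(per_head.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country}: {value:.1f} passengers per head")
```

Country B carries far more passengers in absolute terms, but Country A comes out ahead per head, which is the kind of effect that puts small island countries near the top.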

The other story – which parts of this dashboard hint at – is how economic performance impacts on air transport. For this, I will look to get GDP figures into the underlying data and graph them against passengers per head of population. Already, however, if you look at the time lines for Ireland and Iceland, there is a hint that there can be a major impact in this respect.

This is the first project I have undertaken with Tableau and I am using Tableau Public. It has been a sharp learning experience. One of the things which has struck me is that software can be erratic in how it handles dates. The underlying tables for this project are in Excel, and Excel does not handle years as dates. Tableau attempted to interpret the years as days since sometime in 1899. Fixing that is messy and potentially a logistical nightmare in the future. When I went to look at date formatting, I was stunned to see Excel didn’t allow me to format a year as date. This is infuriating.
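The date problem can be reproduced outside Tableau with a short sketch. It assumes the serial-day epoch of 30 December 1899 that is commonly used when converting Excel serial dates; the helper names are made up for illustration, and this is not a description of Tableau’s internals.

```python
from datetime import date, timedelta

# Assumed epoch for serial-day dates (the usual convention for Excel serials).
SERIAL_EPOCH = date(1899, 12, 30)


def serial_to_date(serial):
    """Interpret a bare number as a count of days since the epoch."""
    return SERIAL_EPOCH + timedelta(days=serial)


def year_to_date(year):
    """Treat the number as a calendar year and pin it to 1 January."""
    return date(year, 1, 1)


for year in (2004, 2011):
    print(year, "read as a day count:", serial_to_date(year))
    print(year, "read as a year:     ", year_to_date(year))
```

Read as a day count, 2004 lands in the middle of 1905, which is why a column of bare years needs to be turned into real dates (or kept as plain numbers) before it goes anywhere near a date axis.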

However, I got something out of this process: a lot of information on how to get data working in Tableau.


Your objective is to inform, and not look pretty but useless.

Via Stats Chat in the last week or two.

If you’re not willing to click through, Stats Chat have posted a donut graphic which a New Zealand paper has printed to display some data. Really, you should have a look and then decide whether the graphic actually accurately depicts the data that the Australian paper’s figures appear to be giving.

One of the worst features – in my humble experience – of enhanced graphics capabilities of different software packages (I’m looking at you, Excel, you know I love you but…) is that people will insist on using them. Inappropriately, confusingly and just plain badly. It’s quite worrying in some respects.

An open letter to Twitter


Thanks for the promoted tweet from eToro. I seem to see them regularly.

I understand that you have a business. From my point of view, promoted tweets are little more than ads, or marketing junk. I’d like to be able to switch off promoted tweets from eToro. I’m just not interested.

I get the need to monetise your product. Google manages to ship me reasonably relevant advertising in my Gmail. You get a lot more information out of me, so… why do I get ads for Apple stock?

I read a piece Hilary Mason wrote the other day about interview questions for data science roles. She said she’d ask what, based on your knowledge of’s data, you would do that they are not doing.

Well I don’t know for to be honest. I don’t use the service quite enough to comment. However, where Twitter is concerned, I’d do a better job on contextualising the inline advertising. Take me. It’s clear from the accounts I follow, the links I follow, the posts I make, even my description that I have certain specialised interests….photography. Surf. Kitesurf. Computer related stuff. Travel.

Nowhere in my account is any evidence that I am interested in eToro’s services. But I wouldn’t object to more relevant tasting promoted tweets, so how about it? Are you working in that area at all?







Big Data. Many things to many different people

Late last night, I picked up a tweet from Hilary Mason, chief scientist with

I’m troubled by the increasing interpretation of “big data” to mean “data without the scientific method”. When did that happen?

This is an interesting question, made all the more difficult by the growing impression I have that the definition of Big Data is a very dynamic concept. What is big data?

The truth is, I think a lot of people aren’t sure. Hilary herself provides an interesting definition:

I prefer the big = “too big to analyze on one computer” definition

I have some mixed feelings. I don’t like the phrase big data; I never have because it comes over far too much like a marketing buzzword and less like some underlying concept. For various reasons, people have had cause to ask me what I understand by “big data” lately which indicates to me that it’s something that has come at people without them recognising what lies beneath it.

For me, when I have to describe what I see it as, I say this. I say “We generate a lot of data. From different activities within a given organisation. We allow some people to analyse certain specific areas of it because historically, we didn’t have so much data, and it was all subject specific. But things are different now. Our activities generate a lot more data and that data is very much interdependent sometimes. You may be a subject area specialist in one particular area of an organisation and you may only care about that particular area. But your organisation is much bigger than your area and it could be – often is the case for example – that marrying the data from your area with the data from other areas can have a huge impact on the way your organisation does business”. In other words, we have a lot of data now, some of it more voluminous than others, but the vast majority of organisations do not use their data to join the dots coherently.

When I read articles about big data, and data science, I too am troubled. I’m troubled by the impression I have that big data is somewhere we should be at without understanding why we should be at it and what we can get from it. There is a degree of unclarity about what a data scientist actually does. Business is not generally good with a lack of clarity. Matters are not helped when the media helpfully supply articles about how datascience is the next big thing or that being a datascientist is the sexiest new job going.

It probably is but this only serves to attract people who weren’t really interested in the first place.

I’m interested in the interpretation of numbers, what they mean, how we got to those numbers, where we can go with them. How they inform us. I equally got to where I am right now by recognising that a lot of people were very interested in drawing pretty pictures displaying numbers but not so interested in the validity of the numbers. I have seen bar charts comparing social media site usage which compared the number of Facebook page loads with the number of photo uploads on Flickr. You don’t need me to tell you this is not a valid comparison given that Flickr gets page loads and Facebook has photo uploads as well.

My big huge concern with big data is that people look at the big bit but not the data bit.

I can’t write this in 140 characters, by the way; that’s why we are here and not thrashing it out into the middle of the night on Twitter. Also, Hilary had a good go at it last night.

If you were an executive standing in front of me, I would ask if you ever measured your website response times against your website demands linked to – for example – your sending out a marketing email, or whether there were unexpected regional variations, or whether there are trends in the Google search terms bringing people to your website that indicate a questionable link somewhere or a business opportunity lost. You can call this big data if it makes you feel better, or data science. I tend to prefer data science because I suspect it is going to be around a lot longer than big data.

What matters to me is that you get the best possible information out of your data, regardless of how much or how little you have. One of the things that concerns me slightly about discussions on data science and big data is the lack of attention to basic skills in analytics. It is not just a case of running a SQL query and picking out the highest band of the ensuing bar chart. When we look at the skill sets required for data science, there tends to be a focus on computer programming (which is good, don’t get me wrong), but less importance attached to basic statistics.

When I got interested in this about 18 months ago, I had an understanding of what I wanted to do, and went back to college to get the maths and stats skills that are handy here. I’ve been programming for 12 years so I don’t worry too much about the necessary programming skills. What worries me a little is that we will wind up with a lot of people calling themselves data scientists on the basis of a few Python scripts and not a lot of understanding of the actual data.

When people focus on concerns about big data, they talk about the skills squeeze (see this New York Times piece, for example, and Jason Ward of EMC in Ireland on a similar subject) and not the actual underlying business of data science. A key issue is that data science is not just about dragging out the data into a spreadsheet, pressing a few buttons and going “hey presto”. Harvard Business Review had a useful piece on this which I recommend looking at.

Hilary Mason has an interesting piece on Getting Started With Data Science. Another place to start is probably to take a step back and try and describe what you’d want to do with a lot of data. What matters with data, no matter what scale it’s on, is how you interpret it. If you really want an example of why this is important, although it’s not data on a massive scale, Nate Silver’s work on the last US election is a good place to start, particularly given that he and a number of other analysts disagreed on the data interpretation. You need to recognise that data is to inform, not to be bent to your needs and that sometimes, that data will not tell you what you want to hear.

I’d agree with Hilary in that communication is a key skill which very often gets forgotten about, but this issue is not limited to data analytics.

So, this brings us back to what big data is, or is not, and whether it really means what we think it means. I really do think it means so many different things to so many people that as a label, it’s functionally useless. Hence, I’d prefer data science as a label, or data analytics. In this way, you can highlight that yes, methodology matters, and statistical skills will matter.

In essence, I would answer Hilary’s original question as follows: Big Data lost its underlying rigour when it became a product to be sold rather than a job to be done.

Useful data sources

This is a temporary directory of possible sources of data for data visualisation and data analysis projects.

Mainly it’s here (at the moment) because I haven’t identified another home for it. It will probably move to a page of its own later, and maybe out to the projects site. Blink and you’ll miss it.


Statcentral Ireland

Pew Internet

Amazon AWS Public Datasets

R datasets list compiled by Vincent Arel-Bundock

Shish list of open data sets.







Data science

The last few weeks have been pretty busy on the assignment front as there were three in total due in the last couple of weeks, two maths and one statistics, so I am really only catching up on things here.

I started studying mathematics and statistics for a couple of reasons: (i) I liked mathematics a lot as a kid, but when push came to shove aged 17, languages got higher up the priority list and (ii) the amount of data in the world is increasing; the number of people equipped to interpret it, however, doesn’t seem to be increasing. Also increasing is the number of people creating information graphics and data visualisations.

Some people are very good at this. The New York Times, for example, do sterling work in this area, as does the Office for National Statistics in the UK.

Some are not so good at interpreting underlying data. I’ve seen one absolutely beautifully drawn graphic that purported to display the strength of Facebook in the social media world which compared Facebook pageloads with Flickr image uploads. A fairer comparison would be pageloads for both sites. And this is a very simple criticism.

In other words, without a reasonable grounding in data analysis, there is no guarantee that good data graphics are going to appear.

Big Data is a buzzword which is turning up in my newsfeeds increasingly often. I’m not always sure what people understand by it but it is definitely flavour of the month and so we turn to this report from Silicon Republic on the subject of support for data science courses.

I am of the opinion that STEM (not sure I like that term for science, technology, engineering and maths courses but it has its uses) is definitely something worth investing in for the future. However, like a lot of things, important and all as it is, it isn’t often adequately rewarded economically. Here, there are debates about how much people working in universities get paid; typically in the UK, funding for research is falling, and a lot of privately funded research is moving out of the UK, or its validity is being criticised purely on the grounds of the commercial nature of its funding (see pharmaceutical research as an example here – it is difficult to make any conclusion without some accusation of bias). In certain respects, research into options for the future is between a rock and a hard place.

EMC are best known to me for data storage. It’s interesting to see one of their senior guys talking about the importance of data science and I’d be interested to know if it’s coming from their interest in providing storage for large, nay massive quantities of data, or whether they also have some interest in how that information is organised. Obviously the big name in terms of how information is organised is Google. I will be interested to see if UCC do actually put a data science course together.

In the meantime, I have another 3-4 years of my own maths/stats to go and no doubt, the industry will change a bit again in that time.