Category Archives: datascience

Undergraduate languages in the United Kingdom

I write, from time to time, on language related matters and one of the items on my list of backburner projects was to have a look at undergraduate language options in the United Kingdom. I had a look at Ireland as well but since we have 7 universities, there isn’t very much of interest to consider when it comes to language provision in Ireland. UCC is about your best option there. I’ll post the graph of that later.

The United Kingdom is interesting for a couple of reasons: firstly, tuition provision in languages has been falling off a cliff there and language departments have been closing hand over fist. One of my recollections relating to language tuition provision in the university sector was that there was a great breadth of provision in terms of languages offered when I was looking for somewhere to study back in 1990, and given changes to language related matters in the UK in the interim, I was interested to see how things looked. Data, however, is not that easily come by and in the end I wound up collecting it manually.

One of the things I wanted to do was see what the obvious clusters were and it occurred to me that using languages and higher education organisations as nodes might allow a network chart to be built. I actually did a proof of concept of that with the Irish provisions purely because there were neither too many languages nor too many universities (seven of the latter and not far off seven for the former). The network depicting software which I used was Gephi.
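As a rough illustration of the languages-and-institutions-as-nodes idea, here is a minimal sketch that turns a table of offerings into an edge list. The institutions are placeholders ("University A" and so on), the offerings are invented rather than taken from the collected data, and the `Source`/`Target`/`Type` headers are an assumption about what a spreadsheet-style graph import expects.

```python
import csv
import io

# Hypothetical provision data: placeholder institutions, invented offerings.
provision = {
    "University A": ["French", "German", "Spanish"],
    "University B": ["French", "Russian"],
    "University C": ["French", "Spanish", "Chinese"],
}


def build_edges(provision):
    """Return sorted (institution, language) pairs, one per distinct offering."""
    return sorted(
        (institution, language)
        for institution, languages in provision.items()
        for language in set(languages)
    )


def edges_to_csv(edges):
    """Serialise the edges as CSV text with assumed Source/Target/Type headers."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Type"])
    for source, target in edges:
        writer.writerow([source, target, "Undirected"])
    return buffer.getvalue()


edges = build_edges(provision)
csv_text = edges_to_csv(edges)
print(csv_text)
```

Every edge joins an institution to a language and never two institutions, so widely offered languages end up as hubs while rarely offered ones hang off a handful of institutions.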

According to the basic research which I did, 78 higher education organisations are offering primary degrees of which a language is a major component. I suspect, if I were to look more closely and root out things like “International Business With A Language” type degrees, the number of pure language related courses would be significantly lower. I have not decided how best to sort out data to get that information and I may not do it just yet.

Eventually, when I plotted things, there was an interesting imbalance on the graph. I noted this on the graph itself, which you can find here, but it is obvious enough below too.

UnitedKingdom

What this tells you is that if you want to learn anything other than, effectively, French, Spanish, English, Italian, German, Russian or Chinese, most of your options are limited to two universities in London or one in Edinburgh. The overwhelming number of universities which offer any language study at all draw primarily from the seven listed above. There are a few stragglers around but that’s more or less the way things are.

One of the things I would consider doing with this data at some stage is comparing language provision in the United Kingdom with language provision in the university sector in a bunch of other European countries, and also, looking at comparing provision of official European languages within the university sector across Europe. I really have no idea how I could quickly get this data together – I do not know if it’s even available anywhere. But it would be interesting to see where the holes exist in terms of provision of tuition at university level of official European languages.

National bus stops

Having done Dublin Bus, it occurred to me to see if the Bus Eireann network was available. It is.

BusEireann

This is actually a bit more interesting than the Dublin Bus one for various reasons, specifically the gaps. I’m fascinated by the big hole in the middle and I will probably look at doing some additional work in terms of other spatial data.
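One way to put numbers on gaps like these is to bin the stops into a coarse grid and list the cells with no stops at all. The following is only a sketch: it assumes the stop data arrives as CSV with `stop_lat` and `stop_lon` columns (the GTFS `stops.txt` convention), the sample rows are made up, and the half-degree cell size is an arbitrary choice.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the shape of a GTFS stops.txt file; coordinates are made up.
SAMPLE = """stop_id,stop_name,stop_lat,stop_lon
1,Stop one,53.10,-8.90
2,Stop two,53.15,-8.85
3,Stop three,53.90,-6.40
4,Stop four,51.90,-8.50
"""


def grid_counts(rows, cell_size=0.5):
    """Count stops per grid cell; each cell is cell_size degrees on a side."""
    counts = Counter()
    for row in rows:
        lat = float(row["stop_lat"])
        lon = float(row["stop_lon"])
        cell = (int(lat // cell_size), int(lon // cell_size))
        counts[cell] += 1
    return counts


def empty_cells(counts):
    """List cells inside the bounding box of occupied cells that hold no stops."""
    lat_cells = [cell[0] for cell in counts]
    lon_cells = [cell[1] for cell in counts]
    return [
        (i, j)
        for i in range(min(lat_cells), max(lat_cells) + 1)
        for j in range(min(lon_cells), max(lon_cells) + 1)
        if (i, j) not in counts
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
counts = grid_counts(rows)
gaps = empty_cells(counts)
print(len(rows), "stops in", len(counts), "occupied cells;", len(gaps), "empty cells")
```

The bounding box will include sea and anything outside the network's area, so empty cells are a prompt for a closer look rather than a finding in themselves.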

I’ve just been asked where I get the data and it’s remiss of me not to credit the data source. The data for both Bus Eireann and Dublin Bus, plus a number of other operators is available on Transport for Ireland’s website. The link is here.

I haven’t cleaned up this graph all that much and I have additional plans for this and the other transport data that I have been looking at.

The beauty in Dublin Bus Stops

I have an ongoing project in the area of public transport in Dublin, which has a) stagnated for a while and b) grown a bit since I have had to interact more with public transport in Dublin.

DublinBusStops

This is one of the items on the project list. This image is a scatter plot of Dublin Bus latitude/longitude values for Dublin bus stops. The file this is based on, which I pulled into Excel to do this (yes, I have other plans involving R at some point which may see this revisited) has more than 4700 datapoints. Looking at it like this, I’m going to see if I can find a similar dataset for street lights. I think it’s a rather beautiful looking spiders’ web.

Flight routes operated by Aer Lingus out of Ireland

Following yesterday’s project with the Ryanair route data, I did the same for Aer Lingus this morning. I included one extra chart, mainly because of the northwest Atlantic destinations which are served by Dublin and Shannon and I wanted a view on how many airports service two routes.

So here are the charts.

AerLingus_EX_IRL

Complete network

AerLingus_EX_IRL_2

Airports serving at least 2 routes

AerLingus_EX_IRL_3

Airports serving at least 3 routes.

So there are a couple of points to note about this. Both the Ryanair and Aer Lingus data were imported into Illustrator to build a web friendly file format (I exported the graphs as PDFs). I am primarily an expert in Photoshop rather than Illustrator so there were a few things I missed yesterday, which I could fix on today’s files, particularly with respect to label positioning. I did this specifically for the second and third charts.

The underlying data have some similarities. For Aer Lingus, Dublin has something in the region of 80 destinations which is not that dissimilar to Ryanair’s offering. Aer Lingus flies out of a couple of extra airports, namely Belfast and Donegal, but the routes concerned are limited – Belfast targets London at the moment, and Donegal targets Dublin. The other point to note about this data is that it includes destinations operated by either Aer Lingus, or Aer Lingus Regional (operated by Stobart) but not connecting destinations beyond their hubs in North America, for example. Direct flights only.

Just a brief comment on those airports with three or more routes outbound. They include the following:

  1. Cork
  2. Dublin
  3. Shannon
  4. London Gatwick
  5. London Heathrow
  6. Manchester
  7. Birmingham
  8. Bristol
  9. Edinburgh
  10. Lanzarote
  11. Malaga
  12. Faro

This is significantly UK focused compared to Ryanair which was highly holiday destination focused. I’m not saying you couldn’t go on your holidays in Manchester or Birmingham…but I suspect most people don’t.

I am not really finished with this project – I have a couple of other thoughts about it and I’d also like to look at combined connectivity out of Ireland across all airlines. That data is going to take a while to gather up and certain things I want to do I am not sure are possible using Gephi. I will also look at graphic decisions like the fonts and colours as well.

Flight routes from Ireland operated by Ryanair

Flights operated by Ryanair from Ireland

The image above is a network chart of flights out of Ireland as operated by Ryanair from the following airports:

  • Cork;
  • Dublin;
  • Kerry;
  • Knock Ireland West; and
  • Shannon.

The data is not organised geographically, but in terms of connections between nodes. The nodes on the chart are various airports, and the links, or edges, are operated routes. This chart shows all the connections between the five Irish airports above and any airport that Ryanair flies to from those airports.

However, Gephi allows you to fine tune what you want to see, and so there is this:

Ryanair_EX_IRL_3

This basically includes only those nodes which have at least three connections to other nodes, so only those airports which are linked to at least 3 other nodes in this network. If you like, it basically includes destinations which are served by Ryanair from at least three airports in Ireland. Without looking in too much detail, you could probably guess a few: London Stansted is an obvious candidate. Dublin still has the most connections; this is not surprising as it had 80 or so to begin with. But the target airports are interesting:

  • Alicante;
  • Faro;
  • London Stansted;
  • Malaga;
  • Tenerife South;
  • London Gatwick;
  • Palma;
  • Lanzarote;
  • Milan Bergamo;
  • Girona Barcelona; and
  • Liverpool.

Most of those airports, with the exceptions of London Gatwick and Stansted, and Liverpool, are holiday destinations. The outlier – as in the one I did not really expect to see – was Milan Bergamo.

The graph would probably be bigger if I stripped it down to airports with two connections, mainly because Dublin has destinations in common with most of the other airports. Fuerteventura is one of the few destinations which is served by Cork and Shannon but not by Dublin, for example.
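The degree-based filtering described here can be approximated outside Gephi in a few lines. This is a sketch, not Gephi's own implementation: it counts each node's connections once in the full network and keeps only edges whose two endpoints both meet the threshold. The routes and the "Dest A" style destination names are invented for illustration and are not the collected route data.

```python
from collections import defaultdict

# Made-up routes for illustration only; not the routes collected on 12 May 2015.
routes = [
    ("Dublin", "Dest A"), ("Cork", "Dest A"), ("Shannon", "Dest A"),
    ("Dublin", "Dest B"), ("Kerry", "Dest B"),
    ("Cork", "Dest C"), ("Shannon", "Dest C"),
    ("Dublin", "Dest D"),
]


def degrees(edges):
    """Map each node to the number of distinct nodes it is connected to."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {node: len(adjacent) for node, adjacent in neighbours.items()}


def filter_by_degree(edges, minimum):
    """Keep edges whose endpoints both have at least `minimum` connections."""
    degree = degrees(edges)
    keep = {node for node, count in degree.items() if count >= minimum}
    return [(a, b) for a, b in edges if a in keep and b in keep]


print(filter_by_degree(routes, 3))
print(len(filter_by_degree(routes, 2)))
```

In this toy network the threshold of two keeps far more edges than the threshold of three, and "Dest C" plays the part of a destination served by Cork and Shannon but not Dublin.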

Gephi is a really nice tool to use for stuff like this and I have other plans for it. This is actually the first project I have done using it and I have an interest in figuring out how much more I can customise to use more data. For example, there is no weighting on any of the edges in either of the graphs above, and that parameter could probably be used to demonstrate frequency or seasonality. I have other plans as well. I also have plans to do something like this with bus routes in Dublin and general public transport, for example.
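To sketch what an edge weight for frequency might look like, the snippet below collapses repeated departure records into one weighted edge per route. The records are hypothetical (one tuple per scheduled departure in some week), and treating the count as the weight is just one possible choice; seasonality would need a different encoding.

```python
from collections import Counter

# Hypothetical departure records (origin, destination), one tuple per
# scheduled flight in some week. The values are invented for illustration.
departures = [
    ("Dublin", "Dest A"), ("Dublin", "Dest A"), ("Dublin", "Dest A"),
    ("Cork", "Dest A"),
    ("Shannon", "Dest B"), ("Shannon", "Dest B"),
]


def weighted_edges(records):
    """Collapse repeated (origin, destination) records into one weighted edge each."""
    counts = Counter(records)
    return sorted((origin, dest, weight) for (origin, dest), weight in counts.items())


for origin, dest, weight in weighted_edges(departures):
    print(origin, dest, weight)
```

Written out as a third column next to the source and target, these counts could then drive edge thickness in the chart, assuming the import step picks the column up as a weight.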

Couple of notes about the data:

  1. Route data was collected on 12 May 2015. As such, it will go out of date as Ryanair update their routes out of Ireland.
  2. It does not take account of any seasonal differences: all of these flights may not operate year round. Personally, I am considering a flight to Grenoble myself as I did not know one existed until today.
  3. This is a proof of concept for other work I want to do later with transport routing.
  4. I will probably look at other airlines later if I can access the data easily.

A Magna Carta for Big Data

I need first to provide a disclaimer: I did my MSc in CompSci at University College Dublin which is one of the universities providing a home to the Insight Centre. And LinkedIn sent me the vacancy for Oliver Daniels’ job several times as a vacancy for which I was suitable. I know some of the Insight people and I have a particular amount of respect for the senior ones I know both in UCD and UCC.

With that out of the way, Oliver Daniels wrote a piece for the Huffington Post which I have some reservations about.

The data industry has to stop seeing itself as Big Data. The term is loaded. When people are talking about Big Pharma, they are talking about the pharmaceutical industry acting in its best interests (and not yours), and when they talk about Big Ag, they are talking about the agricultural-industrial complex acting in its best interests, not yours and not the environment’s. Big X is never a positive label for X. It implies a behemoth which really has no interest in your interests. I hate the term Big Data for this reason. It has never really meant serious data analytics, only a marketing tool for people who genuinely aren’t interested in data, but in buzzwords. Big Data is turning toxic.

If you read Oliver Daniels’ piece about a Magna Carta for Big Data, it is obvious that he is not looking for a Magna Carta for you or me, but for the right of large scale data analytics companies to have access to and use your data. There are a lot of benefits to large scale analytics but it is a stretch to call it a charter of rights when you have to give them access to your data, and they promise not to sell it to AN Other Company. The example in the Daniels piece relates to health data specifically, and the risk of sale of same to insurance companies.

Unlike Oliver Daniels, I have always known my mother’s age, and indeed, my father’s age and so I won’t be using either as an emotional hook on which to demand that people make their data available. What I would like Insight, and organisations attempting to be active on the health analytics side, to do is recognise that the vast majority of people, while not analytics experts, are not necessarily stupid. And I have issues with statements like this:

Healthcare has always been about data analytics, only now we have access to so much more data.

The thing is we don’t. We can certainly generate more data, but we don’t necessarily have the right to use it. When Oliver Daniels is talking about a Magna Carta for big data, he is looking for the right to use it, framed in a way that suggests my rights are protected. This might be viable if the data industry – and hardly any company is not a data company at this stage – had an even remotely sane record on not losing data.

There is no point in saying “and we promise your data won’t be released to AN Company you don’t approve of” when all over the world, vendors are getting hacked, losing data, losing laptops, spending a small fortune writing to customers suggesting they get their credit cards reissued, re-enacting U2 videos by beating their chests and being sorry. Really Sorry. Very, really sorry. We lost your data.

I have already written about the cost of messing up individuals in the quest of getting access to their health data in the past.

Oliver Daniels writes:

We need the public to feel trust when they hand over details about their health.

Even if we were to take the view that of course you can have everything you want, we trust you completely not to misuse the data, the simple truth is that we already know that large scale data sites have been hacked in highly public manners. I have correspondence from Adobe apologising for losing a lot of data. I have correspondence from any number of online data centric companies explaining that they have allowed their perimeters to be breached. The data industry has simply not earned the right to respect in terms of practically protecting data.

It would be an overarching, policy-led document that describes what we want, and don’t want, from Big Data. It is a document that would put citizens at the centre of the Big Data age, and ensure that the technology develops with democracy and human rights as guiding principles.

The Magna Carta was a document of rights, not a policy document. What Oliver Daniels wants is not so much a charter of rights for humanity but a bill of rights for Big Data (he uses the term; I think he should move away from it) to have access to humanity’s data. The regulatory framework at the moment, piecemeal as it might be, in Europe, in particular, errs on the side of the individual, not the gathering of large datasets.

You know this is what he is looking for with this:

A Magna Carta for Data would not be a list of protectionist rules about privacy triggered by court cases and data infractions.

A Magna Carta for Data is not a Magna Carta for owners of data.

You know this when he says this:

The Magna Carta would not enshrine privacy measures that risk bringing enlightened data research to a standstill.

The core objective of this measure is not to balance the rights of humans who generate data and companies and organisations which want to exploit that data. It is to make it easier to get access to that data. And it uses the argument that privacy concerns are already left behind by big data.

I have a couple of issues with this. At this stage, I’d like senior managers who genuinely believe in the benefits of large scale data analytics to stop calling it Big Data. It is a toxic term with strongly negative connotations.

I also take issue with describing this as a Magna Carta for Data. This is a marketing metaphor and nothing else. It is not even appropriate in the context of trying to get people to give up some existing privacy rights – rights which are not negated just because you claim they are.

I would like the data industry to understand that to date, they have already made demonstrable screw ups, both in the private sector (Target and Adobe as two examples) and the public sector (the NHS mess with attempting to sell care.data to the public).

I have a lot of time for data analytics and in particular, the machine learning side of things. I honestly believe there is a lot of insight to be gained from it. But equally, I believe that there is no god given right for access to this data, and I’d like practitioners of big data to pay more attention to the fact that a lot of what they are trying to do has been done by statisticians who recognise underlying problems with large scale analytics. The fact that you’ve 10 billion records does not automatically imply you have a wholly representative sample or, indeed, a viable model. Tim Harford has an illustrative piece here.
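The point about record counts and representativeness can be made concrete with a small simulation. This is a toy sketch with invented numbers: a population where roughly 30% have some trait, a very large sample in which people with the trait are three times as likely to be recorded, and a small sample drawn properly at random.

```python
import random


def simulate(seed=0):
    """Compare a large biased sample with a small random one on a toy population."""
    rng = random.Random(seed)

    # Hypothetical population: each person has the trait with probability 0.3.
    population = [1 if rng.random() < 0.3 else 0 for _ in range(200_000)]
    true_rate = sum(population) / len(population)

    # Large but biased sample: inclusion probability is 0.9 with the trait, 0.3 without.
    biased = [x for x in population if rng.random() < (0.9 if x else 0.3)]

    # Small sample drawn uniformly at random from the whole population.
    small = rng.sample(population, 1_000)

    return {
        "true_rate": true_rate,
        "biased_size": len(biased),
        "biased_estimate": sum(biased) / len(biased),
        "random_size": len(small),
        "random_estimate": sum(small) / len(small),
    }


result = simulate()
for key, value in result.items():
    print(key, value)
```

Under these assumptions the biased sample is many times larger yet lands much further from the true rate than the thousand-record random sample, because more records shrink the noise but do nothing about the bias.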

I’ve done some work with large datasets. I’m fully aware of the benefits of being able to get a picture of the behaviour of system components over time – such as buses running ahead of or behind schedule. But I’m also aware of the risk of assuming technology gives us more exact pictures of reality. The garbage in, garbage out principle will always exist, and I still remember a cartoon I saw more than twenty years ago which had the tagline “The beauty of computers is that you can screw up so much more precisely”.

More than anything, I want people in the industry to stop playing with marketing tags like Magna Carta for data and Big Data. Neither of these instil much confidence. I’d hate to see the benefits of health analytics killed by pretending these things can be simplified down to a Universal Declaration of Data Rights.

Landscape changes in Analytics

Microsoft has bought Revolution Analytics. It’s possible, if you are not in the R space or the analytics space, that the significance of this in terms of data manipulation and exploitation will have passed you by but it’s big, both in terms of an announcement of Microsoft’s intentions, and in terms of the tools they are targeting.

I use R as a primary statistical programming tool and prefer it over SPSS and Minitab for most things.

Recommendations at Etsy

Robert Hall, one of the data engineers at Etsy, a large online craft marketplace, has written a comprehensive overview of how they manage recommendations. It’s a very interesting piece in that it’s quite open, particularly as relates to something which could be considered commercially sensitive, and it touches on both the mathematics and infrastructure side of things.

Read it here. The Etsy Code as Craft blog is well worth a read in ongoing terms.