GIT and open source, the victory or not

During the week, Wired published a piece under the title Github’s Top Coding Languages Show Open Source Has Won.

This is basically – and I am being diplomatic here – not what Github’s Top Coding Languages shows.

Fundamentally, for Github to show this, every piece of operational code would have to be on Github. It isn’t. I’d be willing to bet less than half of it is, and probably less than a quarter, but that’s a finger in the air guess. Most companies don’t have their code on Github.

What Github’s list of top ten coding languages shows is that these are the ten most popular languages among code posted by people who use Github. Nothing more and nothing less.

I suspect Github know this. I really wonder why Wired does not.

 

National bus stops

Having done Dublin Bus, it occurred to me to see if the Bus Eireann network was available. It is.

BusEireann

This is actually a bit more interesting than the Dublin Bus one for various reasons, specifically the gaps. I’m fascinated by the big hole in the middle and I will probably look at doing some additional work in terms of other spatial data.

I’ve just been asked where I get the data and it’s remiss of me not to credit the data source. The data for both Bus Eireann and Dublin Bus, plus a number of other operators is available on Transport for Ireland’s website. The link is here.

I haven’t cleaned up this graph all that much and I have additional plans for this and the other transport data that I have been looking at.

The beauty in Dublin Bus Stops

I have an ongoing project in the area of public transport in Dublin, which has a) stagnated for a while and b) grown a bit since I have had to interact more with public transport in Dublin.

DublinBusStops

This is one of the items on the project list. This image is a scatter plot of the latitude/longitude values for Dublin Bus stops. The file this is based on, which I pulled into Excel to do this (yes, I have other plans involving R at some point which may see this revisited), has more than 4700 datapoints. Looking at it like this, I’m going to see if I can find a similar dataset for street lights. I think it’s a rather beautiful looking spider’s web.
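For anyone who wants to try the same kind of plot without Excel, here is a minimal sketch in Python. It is not how the image above was made. It assumes the stops file is a CSV with columns named `stop_lat` and `stop_lon` (those names are an assumption, so adjust them to whatever the file actually uses), and it draws a crude text version of the scatter plot by binning the points into a grid.

```python
import csv
import io


def load_points(csv_text, lat_col="stop_lat", lon_col="stop_lon"):
    """Read (lat, lon) pairs from CSV text; the column names are assumptions."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(float(row[lat_col]), float(row[lon_col])) for row in reader]


def text_scatter(points, rows=10, cols=20):
    """Bin points into a rows x cols grid and mark occupied cells with '*'."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    lat_min, lat_max = min(lats), max(lats)
    lon_min, lon_max = min(lons), max(lons)
    # Guard against a zero span when all points share a coordinate.
    lat_span = (lat_max - lat_min) or 1.0
    lon_span = (lon_max - lon_min) or 1.0
    grid = [[" "] * cols for _ in range(rows)]
    for lat, lon in points:
        # North goes at the top, so the row index is flipped.
        r = min(rows - 1, int((lat_max - lat) / lat_span * rows))
        c = min(cols - 1, int((lon - lon_min) / lon_span * cols))
        grid[r][c] = "*"
    return ["".join(line) for line in grid]


# Tiny made-up sample, for illustration only; the real file has 4700+ rows.
SAMPLE = """stop_id,stop_lat,stop_lon
a,53.35,-6.26
b,53.30,-6.20
c,53.40,-6.35
"""

for line in text_scatter(load_points(SAMPLE), rows=5, cols=10):
    print(line)
```

Empty cells in the grid are a rough way of spotting gaps in coverage, although a proper plotting library will obviously give a nicer picture.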

 

Flight routes operated by Aer Lingus out of Ireland

Following yesterday’s project with the Ryanair route data, I did the same for Aer Lingus this morning. I included one extra chart, mainly because of the northwest Atlantic destinations which are served by Dublin and Shannon and I wanted a view on how many airports service two routes.

So here are the charts.

AerLingus_EX_IRL

Complete network

AerLingus_EX_IRL_2

Airports serving at least 2 routes

AerLingus_EX_IRL_3

Airports serving at least 3 routes.

So there are a couple of points to note about this. Both the Ryanair and Aer Lingus data were imported into Illustrator to build a web friendly file format (I exported the graphs as PDFs). I am primarily an expert in Photoshop rather than Illustrator so there are a few things I missed yesterday, which I could fix on today’s files, particularly with respect to label positioning. I did this specifically for the second and third charts.

The underlying data have some similarities. For Aer Lingus, Dublin has something in the region of 80 destinations, which is not that dissimilar to Ryanair’s offering. Aer Lingus flies out of a couple of extra airports, namely Belfast and Donegal, but the routes concerned are limited – Belfast targets London at the moment, and Donegal targets Dublin. The other point to note about this data is that it includes destinations operated by either Aer Lingus or Aer Lingus Regional (operated by Stobart), but not connecting destinations beyond their hubs in North America, for example. Direct flights only.

Just a brief comment on those airports with three or more routes outbound. They include the following:

  1. Cork
  2. Dublin
  3. Shannon
  4. London Gatwick
  5. London Heathrow
  6. Manchester
  7. Birmingham
  8. Bristol
  9. Edinburgh
  10. Lanzarote
  11. Malaga
  12. Faro

This is significantly UK focused compared to Ryanair which was highly holiday destination focused. I’m not saying you couldn’t go on your holidays in Manchester or Birmingham…but I suspect most people don’t.

I am not really finished with this project – I have a couple of other thoughts about it and I’d also like to look at combined connectivity out of Ireland across all airlines. That data is going to take a while to gather up, and I am not sure that certain things I want to do are possible using Gephi. I will also look at graphic decisions like the fonts and colours.

Flight routes from Ireland operated by Ryanair

Flights operated by Ryanair from Ireland

The image above is a network chart of flights out of Ireland as operated by Ryanair from the following airports:

  • Cork;
  • Dublin;
  • Kerry;
  • Knock Ireland West; and
  • Shannon.

The data is not organised geographically, but in terms of connections between nodes. The nodes on the chart are various airports, and the links, or edges, are operated routes. This chart shows all the connections between the five Irish airports above and any airport that Ryanair flies to from those airports.

However, Gephi allows you to fine tune what you want to see, and so there is this:

Ryanair_EX_IRL_3

This basically includes only those nodes which have at least three connections to other nodes. If you like, it includes destinations which are served by Ryanair from at least three airports in Ireland. Without looking in too much detail, you could probably guess a few: London Stansted is an obvious candidate. Dublin still has the most connections; this is not surprising as it had 80 or so to begin with. But the target airports are interesting:

  • Alicante;
  • Faro;
  • London Stansted;
  • Malaga;
  • Tenerife South;
  • London Gatwick;
  • Palma;
  • Lanzarote;
  • Milan Bergamo;
  • Girona Barcelona; and
  • Liverpool.

Most of those airports, with the exceptions of London Gatwick and Stansted, and Liverpool, are holiday destinations. The outlier – as in the one I did not really expect to see – was Milan Bergamo.

The graph would probably be bigger if I stripped it down to airports with two connections, mainly because Dublin has destinations in common with most of the other airports. Fuerteventura is one of the few destinations which is served by Cork and Shannon but not by Dublin, for example.
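The filtering itself is easy enough to reproduce outside Gephi. The following is only a sketch in Python, not what Gephi does internally: it counts how many origin airports serve each destination and keeps those at or above a threshold. Unlike the Gephi filter, it only looks at destinations rather than every node. The route list is a small made-up sample for illustration, not the data collected for the charts.

```python
from collections import defaultdict

# Illustrative edge list (origin, destination); NOT the collected route data.
ROUTES = [
    ("Dublin", "London Stansted"),
    ("Cork", "London Stansted"),
    ("Shannon", "London Stansted"),
    ("Cork", "Fuerteventura"),
    ("Shannon", "Fuerteventura"),
    ("Dublin", "Milan Bergamo"),
]


def origins_by_destination(routes):
    """Map each destination to the set of origin airports that serve it."""
    served_from = defaultdict(set)
    for origin, destination in routes:
        served_from[destination].add(origin)
    return served_from


def destinations_with_min_origins(routes, minimum):
    """Keep destinations reachable from at least `minimum` origin airports."""
    served_from = origins_by_destination(routes)
    return sorted(
        dest for dest, origins in served_from.items() if len(origins) >= minimum
    )


print(destinations_with_min_origins(ROUTES, 3))
print(destinations_with_min_origins(ROUTES, 2))
```

Dropping the threshold from three to two lets more destinations through, which is the effect described above.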

Gephi is a really nice tool to use for stuff like this and I have other plans for it. This is actually the first project I have done using it and I have an interest in figuring out how much more I can customise to use more data. For example, there is no weighting on any of the edges in either of the graphs above, and that parameter could probably be used to demonstrate frequency or seasonality. I have other plans as well. I also have plans to do something like this with bus routes in Dublin and general public transport, for example.
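As a rough idea of what a weighted version could look like, here is a minimal sketch in Python that writes an edge list with a weight per route. The weekly flight counts are hypothetical figures invented for the example. The `Source`, `Target` and `Weight` column headings are what I understand Gephi's spreadsheet import to look for, but treat that as an assumption and check it against the Gephi documentation.

```python
import csv
import io

# Hypothetical weekly flight counts per route; the figures are made up.
WEEKLY_FLIGHTS = {
    ("Dublin", "London Stansted"): 70,
    ("Cork", "London Stansted"): 21,
    ("Shannon", "Fuerteventura"): 1,
}


def weighted_edge_csv(frequencies):
    """Return CSV text with one weighted edge per route, sorted by origin."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Column names are an assumption about what the import step expects.
    writer.writerow(["Source", "Target", "Weight"])
    for (origin, destination), flights in sorted(frequencies.items()):
        writer.writerow([origin, destination, flights])
    return buffer.getvalue()


print(weighted_edge_csv(WEEKLY_FLIGHTS))
```

Seasonality could be handled the same way, for example with one file per season or an extra column.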

Couple of notes about the data:

  1. Route data was collected on 12 May 2015. As such, it will go out of date as Ryanair update their routes out of Ireland.
  2. It does not take account of any seasonal differences: not all of these flights may operate year round. Personally, I am considering a flight to Grenoble myself as I did not know one existed until today.
  3. This is a proof of concept for other work I want to do later with transport routing.
  4. I will probably look at other airlines later if I can access the data easily.

Future work

Via Twitter yesterday, I was pointed to this piece on one of the WSJ’s blogs. Basically it looks at the likelihood that a given job type might or might not be replaced by some automated function. Interestingly, the WSJ suggested that the safest jobs might be in the interpreting/translation industry. I found that interesting for a number of reasons so I dug a little more. The paper that blogpost is based on is this one, from Nesta.

I had a few problems with it so I also looked back at this paper which is earlier work by two of the authors involved in the Nesta paper.  Two of the authors are based at the Oxford Martin institute; the third author of the Nesta paper is linked with the charity Nesta itself.

So much for the background. Now for my views on the subject.

I’m not especially impressed with the underlying work here: there’s a lot of subjectivity in terms of how the underlying data was generated and in terms of how the training set for classification was set up. I’m not totally surprised that you would come to the conclusion that the more creative work types are more likely to be immune to automation for the simple reason that there are gaps in terms of artificial intelligence on a lot of fronts. But I was surprised that the outcome focused on translation and interpreting.

I’m a trained interpreter and a trained translator. I also have postgraduate qualifications in the area of machine learning with some focus on unsupervised systems. You could argue I have a foot in both camps. Translation has been a target of automated systems for years and years. Whether we are there yet or not depends on how much you think you can rely on Google Translate. In some respects, there is some acknowledgement in the tech sector that you can’t (hence Wikipedia hasn’t been translated using it) and in other respects, that you can (half the world seems to think it is hilariously adequate; I think most of them are native English speakers). MS are having a go at interpreting now with Skype. As my Spanish isn’t really up to scratch I’m not absolutely sure that I’m qualified to evaluate how successful they are. But if it’s anything like machine translation of text, probably not adequately. Without monumental steps forward in natural language processing – in lots of languages – I do not think you can arrive at a situation where computers are better at translating texts than humans and in fact, even now, to learn, machine translation systems are desperately dependent on human translated texts.

The interesting point about the link above is that while I might agree with the conclusions of the paper, I remain unconvinced by some of the processes that led the authors to those conclusions. To some extent, you could argue that the processes that get automated are the ones that a) cost a lot of people a lot of money and b) are used often enough to be worth automating. It is arguable that for most of industry, translation and interpreting are less commonly required. Many organisations just get around the problem by having an in house working language, for example, and most organisations outsource any unusual requirements.

The other issue is that around translation, there has been significant naiveté – and I believe there continues to be – in terms of how easy it is to solve this problem automatically. Right now we have a data focus and use statistical translation methods to focus on what is more likely to be right. But the extent to which we can depend on that tends to be limited by the available data, and that varies in terms of quantity and quality with respect to language pairs. Without solving the translation problem, I am not sure we can really solve the interpreting problem either, given issues around accent and voice recognition. For me, there are core issues around how we enable language for computers and I’ve come to the conclusion that we underestimate the non-verbal features of language such that context and cultural background are lost for a computer which has not acquired language via interactive experience (btw, I have a script somewhere to see about identifying the blockages in terms of learning a language). Language is not just 100,000 words and a few grammar rules.

So, back to the question of future work. Technology has always driven changes in employment practices and it is fair to say that the automation of boring repetitive tasks might generally be seen as good as it frees people up for higher level tasks, when that’s what it does. The papers above have pointed out that this is not always the case; that automation occasionally generates more low level work (see for example mass manufacture versus craft working).

The thing is, there is a heavy, heavy focus on suggesting that jobs disappearing through automation of vaguely creative tasks (tasks that involve a certain amount more decision making for example) might be replaced with jobs that serve the automation processes. I do not know if this will happen. Certainly, there has been a significant increase in the number of technological jobs, but many of those jobs are basically irrelevant. The world would not come to a stop in the morning if Uber shut down, for example, and a lot of the higher profile tech start ups tend to be targeting making money or getting sold rather than solving problems. If you look at the tech sector as well, it’s very fluffy for want of a better description. Outside jobs like programming, and management, and architecture (to some extent), there are few recognisable dream jobs. I doubt any ten year old would answer “business analyst” to the question “What do you want to do when you grow up”.

Right now, we see an excessive interest in disruption. Technology disrupts. I just think it tends to do so in ignorance. Microsoft, for example, admit that it’s not necessary to speak more than one language to work on machine interpreting for Skype. And at one point, I came across an article regarding Duolingo where they had very few language/pedagogy staff particularly in comparison to the number of software engineers and programmers, but the target for their product was to a) distribute translation as a task to be done freely by people in return for free language lessons and b) provide said free language lessons. The content for the language lessons is generally driven by volunteers.

So the point I am driving at is that creative tasks which feature content creation, for example carrying out translation tasks or providing appropriate learning tools, are not valued by the technology industry. What point is there in training to be an interpreter or translator if technology distributes the tasks in such a way that people will do them for free? We can see the same thing happening with journalism. No one really wants to pay for it.

And at the end of the day, a job which doesn’t pay is a job you can’t live on.

Falling out of love with Amazon

I remember a time when I used to love Amazon. It was back around the time when there was a lot less stuff on the web and it was an amazing database of books. Books, Books, Books.

I can’t remember when it ended. I find the relationship with Amazon has deteriorated into one of convenience more than anything; I need it to get books, but it’s doing an awful job of selling me books at the moment too. Its promises have changed, my expectations have risen and fallen accordingly. Serendipity is failing. I don’t know if it is me, or if it is Amazon.

But something has gone wrong and I don’t know if Amazon is going to be able to fix it.

There are a couple of problems for me, which I suspect are linked to the quality of the data in Amazon’s databases. I can’t be sure of course – it could be linked to the decision making gates in its software. What I do know is it is something I really can’t fix.

Amazon’s search is awful. Beyond awful. Atrocious. A disaster. It’s not unique in that respect (I’ve already noted the shocking localisation failings for Google if you Are English Speaking But You Live In Ireland And Not The United States When Looking For Online Shops) but in terms of returning books which are relevant to the search you put in, it is increasingly a total failure. The more specific your search terms as well, the more likely you are to get what can only be described as a totally random best guess. So, for example, if I look for books regarding Early Irish History, then search results featuring books on Tudor England are so far removed from what I want that it’s laughable. On 1 May 2015 (ie, day of writing) fewer than a quarter of the first 32 search results refer to Ireland, and only 1 of them is even remotely appropriate.

Even if you are fortunate enough to give them an author, they regularly return searches of books not by that author.

I find this frustrating at the best of times because it wastes my time.

Browsing is frustrating. The match between the categories and the books in those categories can be random. The science category is full of new age nonsense and it often is very much best selling so the best sellers page becomes utterly useless. School books also completely litter the categories, particularly in science. I have no way of telling Amazon that I live in Ireland and have no real interest in UK school books, or, in fact, any school books when I am browsing geography.

Mainly I shouldn’t have to anyway. They KNOW I live in Ireland. They care very much about me living in Ireland when it comes to telling me they can deliver stuff. They just keep trying to sell me stuff that really, someone in Ireland probably isn’t going to want. Or possibly can’t buy (cf the whinge about Prime Streaming video to come in a few paragraphs). Amazon is not leveraging the information it has on me effectively AT ALL.

The long tail isn’t going to work if I can’t find things accidentally because I give up having scrolled through too many Key Stage Three books.

Foreign Languages: Amazon makes no distinction between text books and, for want of a better word, non-text books in its Books in Foreign Languages section. So again, once you’ve successfully drilled down to – for example – German – you are greeted with primarily Learn German books and Dictionaries, probably because of the algorithm which prioritises best sellers.

How can I fix this?

Basically, Amazon won’t allow me to fix things or customise things such that I’m likely to find stuff that interests me more. I don’t know whether they are trying to deal with these problems in the background – it’s hard to say because well, they don’t tend to tell you.

But.

  1. It would be nice to be able to reconfigure Treasa’s Amazon. Currently, its flagship item is Amazon Prime Streaming Video, which is not available in Ireland. Amazon knows I am in Ireland. It generally advises me how soon it can deliver stuff to Ireland if I’m even remotely tempted to buy some hardcopy actual book. Ideally they wouldn’t serve their promotions for Amazon Prime Streaming Video, but if they have to inflict ads for stuff they can’t sell me, the least they could do is let me re-order the containers in which each piece of information appears. So I could prioritise books and coffee, which I do buy, over streaming video and music downloads, which I either can’t or don’t usually buy from Amazon.
  2. It would be nice to be able to set up favourite subject streams in books or music or dvds. I’d prefer to prioritise non-fiction over beach fiction, for example.
  3. I’d like to be able to do (2) for two other languages as well. One of the most frustrating things with the technology sector is the assumption of monolinguality. I’d LIKE to be able to buy more books in German, in fact I’m actively TRYING to read more German for various reasons, and likewise for French.
  4. I don’t have the time to Fix This Recommendation. They take 2 clicks and feature a pop up. As user interaction, it sucks. I’d provide more information for fixing the recommendations if I could click some sort of Reject from the main page and have them magically vanish. Other sites manage this.

But there are core problems with Amazon’s underlying data, I think. Search is so awful and so prone to bringing back wrong results, it can only be because metadata for the books in question is wrong or incomplete. If they are using text analysis to classify books based on title and description, it’s not working. Not only that, their bucket classification is probably too broadbased. Their history section includes a metric tonne of historical fiction, ie, books which belong in fiction and not in history. If humans are categorising Amazon’s books, they are making a mess of it. If machine learning algorithms are, they are making a mess of it.

There is an odd quirk in the sales based recommender which means that I can buy 50 books on computer programming but as soon as I buy one book of prayers as a gift for a relative, my recommender becomes highly religious focused and prayer books outplay programming books. Seriously: 1 prayer book to 50 programming books means you could probably temper the prayer books. Maybe if I bought 2 or 3 prayer books you could stop assuming it was an anomaly. This use of anomalous purchases to pollute the recommendations is infuriating and could be avoided by Amazon not overly weighting rare purchases.
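To make the point concrete, here is a minimal sketch in Python of the kind of tempering I mean. I have no idea how Amazon's recommender actually works, so this is purely hypothetical: it weights each category by its share of past purchases and ignores any category bought fewer than `min_count` times, a threshold I have made up for the example.

```python
from collections import Counter


def category_weights(purchases, min_count=2):
    """Share of purchases per category, ignoring one-off categories.

    Categories bought fewer than `min_count` times are treated as anomalies
    (a gift, say) and get no weight at all.
    """
    counts = Counter(purchases)
    kept = {cat: n for cat, n in counts.items() if n >= min_count}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {cat: n / total for cat, n in kept.items()}


# 50 programming books and a single prayer book bought as a gift.
history = ["programming"] * 50 + ["prayer"]
print(category_weights(history))

# After a third prayer book the category starts to count, but only a little.
print(category_weights(history + ["prayer", "prayer"]))
```

With one prayer book the category is ignored entirely; with three it gets a weight of 3 in 53, which still leaves programming firmly in charge.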

I’m glad Amazon exists. But the service it has provided, particularly in terms of book buying, is nowhere near as useful as it used to be. Finding stuff I know I want is hard. Finding stuff I didn’t know I wanted but now I HAVE to have is downright impossible.

And this is a real pity because if the whole business of finding stuff I wanted to buy was easier on the book front, I’d be happy to spend money on it. After all, the delivery mechanisms, by way of Kindle etc, have become far, far easier.

this is about data and technology and where I interact with both