Boards of various state bodies

Another data cry for help. I’m trying to identify all the state bodies and the members of their boards to do some membership analysis.

I have a list as follows:

  • NTMA
  • NAMA
  • Coillte
  • Bord na gCon
  • Bord na Mona
  • Bord Gais
  • Electric Ireland

It’s not a very long list so I know it’s non-exhaustive.

I’m interested in the state bodies I have missed like the Arts Council and other similar organisations and I am interested in a comprehensive list of the members of the boards for each of these organisations.

If you can suggest organisations I have missed, that would be a great start.

Thanks a million.

 

Captcha usage

Hi folks,

This is the first of occasional small requests for help on the data front.

I have a tiny little poll going here on the subject of captchas. Three simple questions with yes/no answers.

I am keeping it open for another week or so and then I will publish some comments on the outcome.

Interim results suggest fewer than half my respondents are aware that reCaptchas are used to solve OCR problems. This is interesting.

 

Documentation quick tip

Update Word – if you use it – with a couple of extra styles:

  • a style for code in some sort of monospace font (I use Courier New and I colour it red)
  • a style for code commentary in some other font and colour (I use Century and I colour it blue).

When you are creating the style in Word 2013, you can tell the software to use this in all new documents as well and make it part of the Word normal template or the default. This is useful if you don’t want to build a separate template.

The other thing which may be useful is something highlighting action points and completed action points. I tend to use bold and, again, different colours, and in the case of completed action points, strike through.

People handle work and coding differently – I tend to like to have a commentary file of what I am doing, what I am trying to do, where I am stuck, how I’ve resolved problems, for each project and this is to ensure I don’t have to build a brand new document with new styles every time. Useful information on customising Word is here – I don’t recommend doing everything he suggests but there are ways of making it more helpful for you. If you’re not familiar with styles, they are useful to be able to work with.

Recommender systems for non-frequent purchases or what I’d do with a lot of airline/holiday data

Declaration of interest: I am doing a lot of learning in the area of machine learning, classification, recommender and personalisation systems at the moment (at least compared to 3 months ago). 

If you were to look at the recommendations which Amazon offer me in the area of books, you’d probably wonder a little about me. The two front runners, content wise, are ethnic recipe books, and machine learning related programming or algorithms.

I go through them every once in a while, usually late at night, and update them with useful information such as which of the recommendations I already own, and which I absolutely don’t want. And I might occasionally add something unexpected to my wishlist.

This has a fascinating impact on my recommendations. Last night, the addition of a single machine learning book to my wishlist had the net impact of dropping the number one recommendation, a cookery book called Jerusalem, down to number 6. A subsequent addition of an Edward Tufte datavisualisation book caused two new datavisualisation books to get into the top ten including one I had never heard of at number 3 (after Jerusalem got pushed down to number 6, Stephen Few wound up in number 1 with a book called Show me the Numbers). I haven’t decided yet whether I want Jerusalem or not either; I have over 100 cookbooks so theoretically, I can’t argue that I need it.

Deletions of books I wasn’t interested in usually resulted in the list just shuffling up a bit. Additions to the wishlist caused changes to the content of the list. From this I can conclude there’s a greater weight given to additions to the wishlist rather than deletion from the recommendation list. I would love to see the underlying datastructure and code for this. There’s this but it’s 10 years old and I have no doubt but that they’ve done a serious amount of work in the interim.
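To make that inference a bit more concrete, here is a toy sketch in Python. It is emphatically not Amazon's algorithm, which I haven't seen; the topic scores, the wishlist weight of 3.0 and the two filler titles are all made up for illustration. All it shows is how a wishlist addition that feeds a topic score can reorder the list, while a dismissal that only removes one item lets everything else shuffle up.

```python
# Toy model, not Amazon's method: every weight and score here is an assumption.
CATALOGUE = {
    "Jerusalem": "cookery",
    "Show Me the Numbers": "dataviz",
    "Another cookbook": "cookery",      # hypothetical title
    "A dataviz primer": "dataviz",      # hypothetical title
}


def rank(catalogue, topic_interest, dismissed=frozenset()):
    """Order undismissed titles by the interest score of their topic."""
    candidates = [t for t in catalogue if t not in dismissed]
    # Highest topic score first; ties are broken alphabetically by title.
    return sorted(candidates, key=lambda t: (-topic_interest.get(catalogue[t], 0.0), t))


def add_to_wishlist(topic_interest, topic, weight=3.0):
    """Return a copy of the scores with the wishlisted topic boosted."""
    boosted = dict(topic_interest)
    boosted[topic] = boosted.get(topic, 0.0) + weight
    return boosted


interest = {"cookery": 5.0, "dataviz": 4.0}   # assumed starting scores
before = rank(CATALOGUE, interest)
after_dismissal = rank(CATALOGUE, interest, dismissed={"Another cookbook"})
after_wishlist = rank(CATALOGUE, add_to_wishlist(interest, "dataviz"))

print(before)
print(after_dismissal)   # same order, minus the dismissed title
print(after_wishlist)    # the dataviz titles jump to the top
```

In this toy version a dismissal only touches one title, whereas a wishlist addition changes the score of a whole topic, which is one plausible reading of the behaviour I saw.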

What does all this mean for the supposed content of this blog post? Well I realise that the Amazon data set relating to me is large and gathered over around 10 years at this stage, but deep down a part of me would like to do a little more research into it.

However, during the week, I was also considering recommender systems for less frequently used services and in particular, airlines.

Recommender systems work best if you have a decent picture of your individual customer at the point of loading up the site. Amazon does this using accounts. If you have a look at the airlines, in general, they have a mixed experience on that front. The majority of them, although not all, offer you some form of registration; some of them allow you to connect your account to a frequent flier card, and some of them allow you to create an account.

However, I’m not sure how many of them compel you to create an account to book a flight directly with them. I’m pretty certain that the last few times I booked airline tickets, I did so without an account.

This is not necessarily an impediment to providing some personalisation services. While I do have a Hotels.com account, for example, they are well capable of remembering where I was last looking for hotels even if I haven’t signed in with my own account.

There is an issue, however, in that the airlines are already perceived to, perhaps, game that sort of idea by providing you higher charges the second time you look. This isn’t ideal from the point of view of endeavouring to provide any sort of personalisation and recommendation system.

The other key issue is this: how do you provide personalisation services to a cohort that doesn’t buy airline tickets every other day (or at five past midnight when they can’t sleep)? If you take any of the major airlines, they carry millions of passengers, and by definition, a lot of them have to be duplicates courtesy of return ticketing, business travelling, family visits. The airline business got into the loyalty business early with the frequent flier cards but again, the picture of airline travel has changed a lot for a lot of the market since those things were invented. There is not necessarily a lot in common between your Netflix recommendations and your frequent flier points.

I have no doubt work is going on in this area – check this out from Rick Seaney in USA Today – however, what follows are some of my own thoughts on the subject.

  1. Passengers need to be classified. You could have sixty million passengers travelling with you every year, but that’s unlikely to be sixty million different people so it is possibly not as huge a task to classify them; what matters is the feature selection side of things. Not just into leisure travel and business travel, or short trip travel and long trip travel (they don’t always overlap), or a few subsets of those, but into enough classes to allow you to provide a reasonable level of personalisation. Late last night, I figured on 20-25 groupings but I’d argue that figure is possibly dependent on what airline is doing the classification. A long haul operation like Etihad is likely to be very different to a short haul operation concentrated mainly within Europe (countless).
  2. Routes need to be classified. To be fair, the vast majority of airlines already have this one sewn up.
  3. Passenger booking behaviour needs to be classified. Again, not just in terms of how often they book, but how frequently they book, how frequently they buy carhire, hotels, travel insurance. Whether they turn up to fly. Whether they look for refunds when they don’t fly. Not just for the amount of free money you gather up from them, but to add to your picture of them.
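The three points above can be sketched in a few lines of Python. This is a minimal illustration only, assuming a made-up booking log and four hand-picked features; it uses a very small k-means clustering, which is just one of many ways of grouping passengers and not something any airline has told me they use. The record layout, the passenger ids and the choice of two groups are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical booking log: (passenger_id, trip_length_days, hired_car, days_booked_ahead)
BOOKINGS = [
    ("p1", 1, False, 3), ("p1", 1, False, 2), ("p1", 2, False, 4),
    ("p2", 14, True, 90), ("p2", 10, True, 120),
    ("p3", 1, False, 1), ("p3", 2, False, 5), ("p3", 1, False, 2),
    ("p4", 7, True, 60),
]


def passenger_features(bookings):
    """Collapse individual bookings into one feature vector per passenger."""
    grouped = defaultdict(list)
    for pid, length, car, ahead in bookings:
        grouped[pid].append((length, car, ahead))
    return {
        pid: (
            float(len(rows)),                           # number of bookings
            mean(r[0] for r in rows),                   # average trip length
            mean(1.0 if r[1] else 0.0 for r in rows),   # share of trips with car hire
            mean(r[2] for r in rows),                   # average days booked ahead
        )
        for pid, rows in grouped.items()
    }


def scale(features):
    """Min-max scale each feature so that no single one dominates the distance."""
    ids = sorted(features)
    cols = list(zip(*(features[i] for i in ids)))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return {i: tuple((v - lo) / s for v, lo, s in zip(features[i], lows, spans)) for i in ids}


def kmeans(points, k, rounds=10):
    """Tiny k-means: the first k passengers (sorted by id) seed the centroids."""
    ids = sorted(points)
    centroids = [points[i] for i in ids[:k]]
    labels = {}
    for _ in range(rounds):
        for i in ids:
            dists = [sum((a - b) ** 2 for a, b in zip(points[i], c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        for c in range(k):
            members = [points[i] for i in ids if labels[i] == c]
            if members:
                centroids[c] = tuple(mean(col) for col in zip(*members))
    return labels


labels = kmeans(scale(passenger_features(BOOKINGS)), k=2)
print(labels)
```

With this toy data the frequent short-trip passengers land in one group and the occasional long-trip, car-hiring passengers in the other. A real version would need far more features, a principled choice of the number of groups, and a lot more care about how the centroids are seeded.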

There are a couple of things which I could see coming out of this.

Here’s something that would certainly buy my interest immediately if, for example, I was travelling to Paris every Monday morning and coming back on a Tuesday evening for business. Provide me a login that generates a page that has two buttons: Paris and Other. The Paris button could be prefilled with the most likely routing/timing options if they are available. Or, Sorry Miss Lynch, your usual flight is fully booked. Allow me to create another personalised button based on possible plans. For example, I might want to fly to oh, Malaga to go kitesurfing in Tarifa maybe six times a year. Let me build one of those so that my landing screen is Paris, Malaga and Other. Include sports equipment as an option by default in the Malaga booking. Learn enough about me to know that, for example, I have annual travel insurance, and don’t try to sell me more. Know enough about me to know that if I am flying into Nice, I’ll hire a car, but not if I fly into London. Even if I am not booking, it might be worth letting me build dreams on your site like this for four reasons:

  1. It makes me happy.
  2. It tells you a lot more about potential customers.
  3. It can support families sorting out holiday plans.
  4. It can support groups organising trips away together.

You can make it clear you are not locking down a fare at that point, but you do get a picture of some of the possible bookings on that flight and this may have an impact on how you manage bookings on that route around those dates. While you’re at it, keep an eye on possible efforts to game your recommender system and identify it as a class of behaviour.

Based on the information I provide when I am booking, airlines can obtain enough data to do this, even without tying the behaviour to an account. However, right now, this is not the approach that they take.

But here’s something else you could do.

Suppose I click on my Malaga button and the flights for the dates I choose are full. Maybe there is some golf competition on there and you know this because you’re good at knowing when events are on but the average kitesurfer might not care about the European PGA. Or it’s the week before the school holidays. Or O’Reilly have decided to run a big technical conference down there. Any number of reasons, but the flight from, say, Dublin to Malaga is full. Or any flight to Malaga is full depending on where I am living.

If I, as an airline, know that a lot of kitesurfers take their kitesurfing gear to Tenerife, or, at least have built potential bookings, I could suggest Tenerife as an alternative – a targeted alternative (particularly if I am flying alone), with the practical date data already provided for Malaga filled into a new booking form. Or if Tenerife is your first choice, Lanzarote is a viable alternative. Or Faro. Or Madeira. Based on the time frame and the amount of money concerned, and whether you interline with anyone, you have endless opportunity here. Clearly someone going golfing in Portugal for four days is not going to want to fly 11 hours via London to somewhere in Italy – but someone going for 14 days might consider a non-direct option.

Of course to do this, you need to know that my sports gear is kitesurfing equipment. But this is not impossible. And of course, you’ll never ask if I want to bring kitesurfing equipment on my regular Monday morning trip to Paris because you know already I don’t. If I don’t have much of a direct history with you, the data you have on other people can be leveraged to build a feature set to classify me.
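As a rough sketch of how the alternative-destination idea might work, the Python below counts where else people who took the same sports gear to the full destination have travelled, and offers the most common of those that still have seats. The history records, passenger ids and the `suggest_alternatives` function are all hypothetical; this is a simple co-occurrence count, not a description of any airline's system.

```python
from collections import Counter

# Hypothetical travel history: (passenger_id, destination, sports_gear)
HISTORY = [
    ("a", "Malaga", "kitesurf"), ("a", "Tenerife", "kitesurf"),
    ("b", "Malaga", "kitesurf"), ("b", "Tenerife", "kitesurf"), ("b", "Faro", "kitesurf"),
    ("c", "Malaga", "golf"), ("c", "Faro", "golf"),
    ("d", "Tenerife", "kitesurf"), ("d", "Lanzarote", "kitesurf"),
]


def suggest_alternatives(history, full_destination, gear, available, top_n=2):
    """Suggest destinations that similar travellers also took this gear to."""
    # Passengers who have taken the same gear to the destination that is full.
    peers = {p for p, dest, g in history if dest == full_destination and g == gear}
    # Count the other places those passengers took that gear, if seats remain.
    counts = Counter(
        dest
        for p, dest, g in history
        if p in peers and g == gear and dest != full_destination and dest in available
    )
    return [dest for dest, _ in counts.most_common(top_n)]


suggestions = suggest_alternatives(
    HISTORY, "Malaga", "kitesurf", available={"Tenerife", "Faro", "Lanzarote"}
)
print(suggestions)
```

The golfer's trip to Faro is ignored here because the gear doesn't match, which is the point: the suggestion is targeted at what I was going to Malaga to do, not just at where other Malaga passengers go.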

The point I am trying to make here is that, publicly, there is a perception that airlines basically use whatever personalisation options they have to increase the fares by trapping you. Airline yield management is complex so with the best will in the world, it’s never likely to be quite that simple. But if airline personalisation tools made life easier for their customers, they might engender a lot more repeat business, particularly now. Obviously gaining that trust in a way which is not perceived to be creepy is going to be a challenge because it’s based on knowing a lot about your customers which is something a lot of people go out of their way to discourage. I mean, I know people who deliberately like to confuse Amazon about their taste in books and music – I’m not one but then you’re talking to someone who got pleasure out of checking out how her recommendations changed by updating her wishlist.

Another interesting thing which could be done with this sort of model of engaging with your customer, based on what you know about them, is telling them how many seats are available or grading the flight as commonly searched. Hotels.com does this with hotel rooms. Two rooms left at this price. This is useful because while it may not cause me to book at that point in time, it’s hardly going to come as a shock to me that the price of a room in the George V in Paris has increased in the last two or three hours since I managed to get my travelling companion on the phone. It provides some trust. If my flight is rated Red for popular, I’ll know I am competing with, for example, 5000 Munster fans for that last seat on a flight the day before a match.

All of this is only possible if my customers trust me to use this data effectively to support them and not, specifically, to abuse them. I mean, if I assume someone who books every Monday morning will always book every Monday morning and start applying stealthy price increases to them that I do not necessarily apply to non-regular passengers, I will wind up with some public relations issues. And the loss of regular streams of income.

In summary, I believe it is possible to personalise the booking experience to the benefit of both passenger and airline. I can see that hotel booking agencies are already working in this area but I think there’s even more potential there. Even after the booking experience is personalised down to the nth degree, this information could have a huge impact on targeting promotional emails (which is something, in my experience, the hotels aren’t quite getting right yet).

 

 

Datascience job twitter feed for Ireland

I’ve started a datascience twitter feed for Ireland. Mainly I’ve done it out of frustration that there isn’t one and that searching for them isn’t always straightforward and also because that’s my own area of interest.

I’m particularly interested in those jobs with datascientist as their job title; I will consider other titles on a case by case basis – in particular if you’re looking for a data analyst to run simple reports, that won’t get listed here. I will look at machine learning related options and I will consider PhDs in the area of analytics and machine learning if you’ve got one. Key requirement is that they be based in Ireland.

The twitter feed is here: http://www.twitter.com/datascience_ie and if you’ve got one, either DM a link to the description to me or send a link via email to datascience [at] treasalynch.com

AIRO – Two Tier property market

It’s not dated so I am not absolutely certain when AIRO posted this to their site. It’s a graph of the changes in two sections of the Irish property market since 2005, Dublin, and National ex-Dublin.

It’s very interesting for a couple of reasons. It demonstrates that both the increase and the decrease in market prices in Dublin were sharper than they were in National ex-Dublin. This doesn’t totally surprise me – anecdotally there has always been some evidence to suggest the prices were behaving at more extreme levels in Dublin. It’s interesting to see how the graphlines cross (do click through – it’s worth it).

The data is from the CSO and as far as I am aware, CSO data is limited to the mortgage market. This is interesting because there is some evidence to suggest that a lot of the market in Dublin, in particular, in recent months, has been cash driven. Without having the CSO data in detail, and a cleaned up extract from the Property Price register, it would be hard to say for certain what the split was.

The other chief regret I have about this data is that it only goes back as far as 2005. I’m mindful of sounding like an auld one but there is some evidence to suggest that the period from about 1997 might be educational as well. I guess a lot depends on what data you have available to you.

Anyway, this was done in Tableau and there is some scope for playing around in it. I am glad AIRO did it – it’s a useful exercise, and perhaps, there might be some scope for doing a county by county comparison. We have a lot more data now on the property market than we did even 3 years ago (yes, I have some programming under way for it myself) so information should be easier to come by, particularly if and as we get postcodes, the data will be cleaner up front.

Book Review: The Signal and the Noise by Nate Silver

Over the semester break I spent some time ploughing through books which were on my to read list. One of them was The Signal and the Noise by Nate Silver.

I kind of like Nate Silver’s writing, and I especially like his analysis but I had started the book, gotten half way through, got distracted and only picked it up again in January. So the review is more or less “I seem to remember this was fascinating” and “the content of this book should be fascinating but I’m not really sure I like it any more”.

I like numbers. I like playing with them. I like manipulating them. I’m not very good at them; I don’t have many regrets in life but a maths and languages course up front might have been a better choice when I was 17 rather than pure maths.

I like that there is an increasing recognition that there is meaning in numbers and that the meaning needs to be interpreted. In many respects, that’s not that different to languages anyway. There is meaning in words; it has to be extracted; interpreted.

So to Nate Silver. Yes, he got the polls right in the last few US elections, and yes, he’s doing the start up thing with Five Thirty Eight now.

The focus of the book, to some extent, was the art of prediction, and his dependency on Bayes. It featured some case studies – baseball and gambling are included (although I really do suggest that you have a look at Moneyball if you’ve any interest in the application of statistical inference and prediction to the baseball numbers as it’s a better read on that front). There was a section on meteorology which was fascinating. A key point which he raises is perception and what people want from a weather forecast. Is it a weather forecast, or some entertainment?

One of the stories in it which fascinated me related to Deep Blue and the chess match with Garry Kasparov. What particularly interested me there was the idea that the computer behaved in a specific way, based on a bug. But the way it behaved rattled Kasparov and caused some investigation as to what the long term outcome of that move could be.

I’m interested in machine learning so this is something which would catch my attention in a lateral way. We train computers to make decisions; sometimes it is not clear whether a given decision is based on a bug or some aspect of the training.

However, a couple of things annoyed me about the book. The Kindle edition has a frustrating number of typos. I can understand this in a scanned book; I just think it’s a bit unforgivable now. And there are a lot of elements of the book where Nate Silver assumes he is writing for a uniquely US based audience. I don’t think this was ever going to be a safe assumption for him.

A couple of sections of the book fascinated me in a way that led me back to subject specific books, one of which is on earthquake prediction – we just aren’t good at it at all at the moment. As it is, I have a more than passing interest in earthquakes, volcanoes and rogue waves so when I finished this, you can make an approximate guess what other books were on my reading list.

I’m inclined to say that The Signal and the Noise is a fascinating book and well worth reading. But it’s difficult to grade in terms of is this a five star read, is it four or is it just average. I’m inclined to classify it as a book you should read, but be aware that it’s not a perfect reading book; there are elements of it which might annoy you. And you could skip it if you were so inclined. It is the sort of book that should help your Trivial Pursuit score and will open your mind. Oh and you’ll probably be left with the impression that Nate Silver is brighter than you are which isn’t always the most edifying either.

So yeah, this is coming to you from the Raspberry Pi

About the most exciting thing I have to report at this point is that the wireless is now working on the Raspbian install which is an improvement over the last three times I’ve plugged in that particular SD card.

This is important because it means that I can finally start working in comfort at my desk rather than curled up on the living room floor.

I ran into two main issues:

  1. the wireless would not work in Raspbian
  2. two of the three keyboards I have at my disposal did not want to work effectively – I wound up with repeating letters which made getting a password entered impossible. The third keyboard is working and to facilitate that and the new monitor, major desk reorg required.

So okay, I’ve got a browser working on it; I can fire up Wolfram and Mathematica, Python is installed, what next?

Well.

I am very shortly going to go and get one of the Raspberry Pi books and look into building a can’t fail media centre and I will write instructions about that here when it’s done and running.

I also want to try and build a weather station. And a robot. And I want to build snake on it as well but I think I may have code for that.

My reading list, for anyone who is interested includes:

  • Raspberry Pi for Kids
  • Raspberry Pi User Guide
  • Raspberry Pi in Easy Steps and
  • Linux User Issue 134

There are also numerous websites. Raspberry Pi’s own website and Wolfram’s site, for example. I anticipate hours of endless fun.

Wedding Magazines and other thoughts.

Here’s some random information which might be worth looking at in some more detail.

On Saturday, I counted – sad person that I am – the number of wedding magazines on sale in Easons in Heuston Station. I did this basically because Irish Rail hadn’t told me what platform my train was going from, I didn’t feel like getting some food, and I was hanging around. There was a large display of them just inside the door. So easy to count and so attractive to do so when there seemed to be rather a lot of them.

So I can tell you the answer that I came up with was 13. I suppose if I had been really good I might have taken a photograph of the display. I can tell you that there were two subspecialisations, namely one on wedding flowers and one on wedding cakes. The rest were things like Bride, or Bridal Magazine. There was a surfeit of white. It was a bit overwhelming.

When I posted this to twitter, a couple of things happened. Someone knew there was a bridal show on at the RDS – news to me – and then this.

Damien Mulley told me there were approximately 21,000 weddings in the country each year.

Paul Savage told me that according to Facebook, 78,000 people were engaged.

Damien Mulley came back and noted that according to Facebook, 42,000 of those were female, aged 20 or older.

You can have a look at the conversation here.

The average circulation of the general Irish fashion mags like Image and Irish Tatler is around 25,000. I’m having serious problems getting any wider circulation figures and this distresses me – the JNRS is coming back at me with newspaper and newspaper related circulation figures. But no magazines.

I can pick up some of the advertising rate cards for the Ireland based magazines and I can tell you that for one of them, the bulk of their readership is in the 25-34 age bracket.

But actual circulation figures, the magazines in Ireland appear to be very coy about.

In one respect, it might be an interesting exercise to:

  1. Figure out what the picture of bridal magazines in Ireland has been for the last 15 years or so. Have we always sold 13 different magazines? What is the market entry and exit rate for them?
  2. Figure out how many of them are selling every month. The cover price is somewhere in the region of €5.
  3. Figure out some way of comparing their advertising rate cards, which are not uniform across the different charges.
  4. Figure out how they compare to the other women’s interest segment magazines.
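In the absence of real circulation figures, the numbers above at least allow a back-of-envelope ceiling. The Python below is only a sketch: the 21,000 weddings, 13 titles and roughly €5 cover price come from this post, but the number of copies bought per wedding is a pure guess on my part, and the `market_ceiling` function is just a name I've made up for the arithmetic.

```python
def market_ceiling(weddings_per_year, copies_per_wedding, cover_price, titles):
    """Rough upper bound on cover-price sales if every wedding buys some copies."""
    copies = weddings_per_year * copies_per_wedding
    revenue = copies * cover_price
    return {
        "copies": copies,
        "revenue": revenue,
        "revenue_per_title": revenue / titles,
        "copies_per_title_per_month": copies / titles / 12,
    }


# copies_per_wedding=3 is an assumption, not a measured figure.
estimate = market_ceiling(21_000, copies_per_wedding=3, cover_price=5.0, titles=13)
for name, value in estimate.items():
    print(f"{name}: {value:,.2f}")
```

Changing the guessed copies per wedding moves everything proportionally, which is exactly why actual circulation data would be so much more useful than this kind of estimate.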

Why am I interested in this? Well deep down I am wondering whether Ireland can sustain that many bridal magazines when it’s already having trouble sustaining its broadsheet newspapers. I’m also interested in seeing whether weddingsonline.ie has had an impact on the market in any indirect or direct way.

And of course, part of me is wondering about market segmentation in the glossy magazine market. Ireland has a population of around 4.5 million. It’s not, by any stretch of the imagination a huge market. This is not just limited to the whole bridal magazine thing – we also produce a couple of other specialist interest magazines, the sales of which are also augmented by imports from the UK and in some cases, the US.

Finally – the comments from Paul and Damien when I discussed this on twitter the other day were interesting because it shows that some ballpark information regarding the possible target cohort of this particular market segment could be obtained from other, social, sources.

So basically, if anyone has any idea how I might get granular circulation data to play with for all magazines on sale in the Irish market at the moment, I might be interested in setting aside some time to have a flute around it.

Care.data

If you’re in the UK at all, you may have heard of some discussion around something called care.data. The general idea about it is that all healthcare data is centralised and that this repository of data would be made available to researchers. Such a repository of data would be massively useful for healthcare researchers.

So far so good. As someone with a great deal of interest in data, and how it can be best used to advance human society, you’d think I’d be wild about this idea. I’m not wild about the implementation and this is a pity.

The data, we are told, will be pseudoanonymised. This is the number one problem I have with it – it’s not actually properly anonymised. It comes with postcode data and NHS number. In the UK, postcode data can in a lot of cases be personally identifiable. This is wrong.

This is before you start asking questions about who gets to use the data. Plus, given the changes to the NHS organisation in the UK courtesy of the current government, you’d have to ask whether the data is even going to be as useful as it might have been 10 years ago under a centralised system.

So okay, I can knock it and be concerned. But I do believe something akin to it would be useful. Not necessarily directly profitable, but useful. So how could we implement it?

Well, there’s no reason why we can’t. Straight out, why is postcode relevant? It provides regional variation information. So one of the things we need to do is provide geographically classified data. Using postcodes to create a geographic classification which does not include the postcode itself is, or should be, straightforward enough. Ergo, the postcode issue can be dealt with.

The NHS number can be replaced with a different primary key number which is made available as part of the care.data database in place of the NHS number, and for which a conversion table back to the original data exists but is not itself made available. Again, depending on the actual implementation of the data structures, this should be straightforward.
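The two steps just described can be sketched in Python. This is a minimal illustration under my own assumptions, not how care.data is or would be implemented: the record layout, the field names and the `pseudonymise` function are hypothetical, the NHS numbers are obviously fake, and I've used the postcode area (the leading letters of a UK postcode) as a stand-in for a proper geographic classification.

```python
import secrets


def postcode_area(postcode):
    """Keep only the leading letters of a UK postcode (the postcode area)."""
    letters = []
    for ch in postcode.strip().upper():
        if not ch.isalpha():
            break
        letters.append(ch)
    return "".join(letters)


def pseudonymise(records):
    """Split records into a releasable dataset and a private conversion table."""
    conversion = {}   # nhs_number -> surrogate key; held separately, never released
    released = []
    for rec in records:
        nhs = rec["nhs_number"]
        if nhs not in conversion:
            # Random key, so it cannot be reversed without the conversion table.
            conversion[nhs] = secrets.token_hex(8)
        released.append({
            "patient_key": conversion[nhs],
            "area": postcode_area(rec["postcode"]),
            "diagnosis": rec["diagnosis"],
        })
    return released, conversion


# Hypothetical records with made-up identifiers.
RECORDS = [
    {"nhs_number": "0000000001", "postcode": "SW1A 1AA", "diagnosis": "asthma"},
    {"nhs_number": "0000000002", "postcode": "M1 1AE", "diagnosis": "diabetes"},
    {"nhs_number": "0000000001", "postcode": "SW1A 1AA", "diagnosis": "hay fever"},
]

released, conversion = pseudonymise(RECORDS)
for row in released:
    print(row)
```

The same patient keeps the same key across rows, so researchers can still follow a history, but neither the NHS number nor the full postcode appears in what is released. Whether a postcode area is coarse enough for rare conditions in sparsely populated places is a separate question that a real scheme would have to answer.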

This deals with the data privacy side of things and one of the big huge issues I have with the current idea.

After that, we need to be aware that more data doesn’t always cater for better/more accurate detailing. Large datasets can amplify statistical errors which, given we are talking about health data sets, matter a lot. They affect real people.

These errors are the type of errors where, for example, 1 in 100 cases might be misdiagnosed because a particular test isn’t 100% accurate.
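A quick bit of arithmetic in Python shows why this matters at scale. The 1 in 100 error rate is the example above; the 1% prevalence of the condition, the population sizes, and the idea that the test is equally accurate for sick and healthy people are all my own assumptions for the sake of illustration.

```python
def screening_outcomes(population, prevalence, accuracy):
    """Count wrong results for a test that is right `accuracy` of the time."""
    sick = population * prevalence
    healthy = population - sick
    true_positives = sick * accuracy
    false_negatives = sick - true_positives
    false_positives = healthy * (1 - accuracy)
    return {
        "wrong_results": false_negatives + false_positives,
        "share_of_positives_that_are_real": true_positives / (true_positives + false_positives),
    }


# Assumed figures: a condition affecting 1% of people, a test that is 99% accurate.
small = screening_outcomes(10_000, prevalence=0.01, accuracy=0.99)
large = screening_outcomes(10_000_000, prevalence=0.01, accuracy=0.99)

print(f"wrong results in 10,000 records:     {small['wrong_results']:,.0f}")
print(f"wrong results in 10,000,000 records: {large['wrong_results']:,.0f}")
print(f"share of positive results that are real: {large['share_of_positives_that_are_real']:.0%}")
```

The error rate doesn't change as the dataset grows, but the number of real people sitting behind a wrong record does, and under these assumptions only about half of the positive results belong to someone who actually has the condition.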

Ultimately, I’m strongly in favour of this project, or, more to the point a project like it, provided it comes with built in data protection concerns and is implemented to benefit health care rather than, for example, corporate health business interests. As matters stand, I’m inclined to feel that there are lacunae here at the moment.