if i had access to your data – Musings on Languages, IT and other stuff

Falling out of love with Amazon

I remember a time when I used to love Amazon. It was back around the time when there was a lot less stuff on the web and it was an amazing database of books. Books, Books, Books.

I can’t remember when it ended. I find the relationship with Amazon has deteriorated into one of convenience more than anything; I need it to get books, but it’s doing an awful job of selling me books at the moment too. Its promises have changed, my expectations have risen and fallen accordingly. Serendipity is failing. I don’t know if it is me, or if it is Amazon.

But something has gone wrong and I don’t know if Amazon is going to be able to fix it.

There are a couple of problems for me, which I suspect are linked to the quality of the data in Amazon’s databases. I can’t be sure of course – it could be linked to the decision making gates in its software. What I do know is it is something I really can’t fix.

Amazon’s search is awful. Beyond awful. Atrocious. A disaster. It’s not unique in that respect (I’ve already noted the shocking localisation failings for Google if you Are English Speaking But You Live In Ireland And Not The United States When Looking For Online Shops) but in terms of returning books which are relevant to the search you put in, it is increasingly a total failure. The more specific your search terms as well, the more likely you are to get what can only be described as a totally random best guess. So, for example, if I look for books regarding Early Irish History, then search results returning books on Tudor England are so far removed from what I want that it’s laughable. On 1 May 2015 (ie, day of writing) fewer than a quarter of the first 32 search results refer to Ireland, and only 1 of them is even remotely appropriate.

Even if you are fortunate enough to give them an author, they regularly return searches of books not by that author.

I find this frustrating at the best of times because it wastes my time.

Browsing is frustrating. The match between the categories and the books in those categories can be random. The science category is full of new age nonsense and it often is very much best selling so the best sellers page becomes utterly useless. School books also completely litter the categories, particularly in science. I have no way of telling Amazon that I live in Ireland and have no real interest in UK school books, or, in fact, any school books when I am browsing geography.

Mainly I shouldn’t have to anyway. They KNOW I live in Ireland. They care very much about me living in Ireland when it comes to telling me they can deliver stuff. They just keep trying to sell me stuff that really, someone in Ireland probably isn’t going to want. Or possibly can’t buy (cf the whinge about Prime Streaming video to come in a few paragraphs). Amazon is not leveraging the information it has on me effectively AT ALL.

The long tail isn’t going to work if I can’t find things accidentally because I give up having scrolled through too many Key Stage Three books.

Foreign Languages: Amazon makes no distinction between text books and, for want of a better word, non-text books in its Books in Foreign Languages section. So again, once you’ve successfully drilled down to – for example – German – you are greeted with primarily Learn German books and Dictionaries, probably because of the algorithm which prioritises best sellers.

How can I fix this?

Basically, Amazon won’t allow me to fix things or customise things such that I’m likely to find stuff that interests me more. I don’t know whether they are trying to deal with these problems in the background – it’s hard to say because well, they don’t tend to tell you.

But.

It would be nice to be able to reconfigure Treasa’s Amazon. Currently, its flagship item is Amazon Prime Streaming Video, which is not available in Ireland. Amazon knows I am in Ireland. It generally advises me how soon it can deliver stuff to Ireland if I’m even remotely tempted to buy some hardcopy actual book. Ideally they wouldn’t serve their promotions for Amazon Prime Streaming Video, but if they have to inflict ads for stuff they can’t sell me, the least they could do is let me re-order the containers in which each piece of information appears. So I could prioritise books and coffee which I do buy, over streaming video and music downloads which I either can’t or don’t buy from Amazon usually.
It would be nice to be able to set up favourite subject streams in books or music or dvds. I’d prefer to prioritise non-fiction over beach fiction, for example.
I’d like to be able to do (2) for two other languages as well. One of the most frustrating things with the technology sector is the assumption of monolinguality. I’d LIKE to be able to buy more books in German, in fact I’m actively TRYING to read more German for various reasons, and likewise for French.
I don’t have the time to Fix This Recommendation. They take 2 clicks and feature a pop up. As user interaction, it sucks. I’d provide more information for fixing the recommendations if I could click some sort of Reject from the main page and have them magically vanish. Other sites manage this.

But there are core problems with Amazon’s underlying data I think. Search is so awful and so prone to bringing back wrong results, it can only be because metadata for the books in question is wrong or incomplete. If they are using text analysis to classify books based on title and description, it’s not working. Not only that, their bucket classification is probably too broad-based. Their history section includes a metric tonne of historical fiction, ie, books which belong in fiction and not in history. If humans are categorising Amazon’s books, they are making a mess of it. If machine learning algorithms are, they are making a mess of it.

There is an odd quirk in the sales based recommender which means that I can buy 50 books on computer programming but as soon as I buy one oh book of prayers as a gift for a relative, my recommender becomes highly religious focused and prayer books outplay programming books. Seriously: 1 prayer book to 50 programming books means you could probably temper the prayer books. Maybe if I bought 2 or 3 prayer books you could stop assuming it was an anomaly. This use of anomalous purchases to pollute the recommendations is infuriating and could be avoided by Amazon not overly weighting rare purchases.
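To make the point about not overly weighting rare purchases concrete, here is a minimal Python sketch. It is my own illustration and not how Amazon’s recommender works: the function name `category_weights` and the `min_count` threshold are assumptions. It simply ignores any category I have bought fewer than `min_count` times when working out what share of my recommendations each category should get.

```python
from collections import Counter


def category_weights(purchases, min_count=2):
    """Return each category's share of purchases, ignoring one-off categories.

    `purchases` is a list of category labels, one per item bought.
    `min_count` is a hypothetical threshold: categories bought fewer times
    than this are treated as anomalies (for example, gifts) and dropped.
    """
    counts = Counter(purchases)
    kept = {cat: n for cat, n in counts.items() if n >= min_count}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {cat: n / total for cat, n in kept.items()}


# 50 programming books and a single prayer book bought as a gift
history = ["programming"] * 50 + ["prayer"]
print(category_weights(history))
```

With one prayer book, only programming survives. Buy two or three and the prayer category comes back in, but only with a share in proportion to how often it was actually bought.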

I’m glad Amazon exists. But the service it has provided, particularly in terms of book buying, is nowhere near as useful as it used to be. Finding stuff I know I want is hard. Finding stuff I didn’t know I wanted but now I HAVE to have is downright impossible.

And this is a real pity because if the whole finding stuff I wanted to buy was easier on the book front, I’d be happy to spend money on it. After all, the delivery mechanisms, by way of Kindle etc., have become far, far easier.

Facebook and that study

Just briefly, given the general response to the Facebook emotional contagion article in PNAS a while back (an hour is a long time on the internet, let’s face it), the question I would have to ask is this: is everyone in Facebook so attached to what they can do with their dataset that they no longer remember to ask whether they should be doing that stuff with their dataset?

A while back, I met a guy doing a PhD in data visualisation or something related and he spoke at length about how amazing it was, what could be done with health data and how the data had to be freed up because it would benefit society so much. I’ve never really bought that idea because the first thing you have to ask is this: do individuals get messed up if we release a whole pile of health data, and if so, to what extent are you willing to have people messed up?

What I’m leading to here is the question of group think and yesmenery. Ultimately, there comes a point where people are so convinced that they should do what they want, that they are unwilling to listen to dissent. The outcry over Facebook’s study has been rather loud and yet, it doesn’t appear to have occurred to anyone who had anything to do with the study that people might find it a bit creepy, to say the least. It’s not even a question of “oh, you know, our terms and conditions” or “oh, you know, we checked with Cornell’s review board”, it’s just straight up “is it creepy that we’re trying to manipulate people’s feelings here? Without telling them?”

I mean, I can’t ever imagine a case in which the answer to that question is anything other than Yes, yes it is creepy and eugh. And yet, it doesn’t seem to have occurred to anyone connected with it that it was kind of creepy and gross.

Once we get past that, what’s being focussed on is the datascience aspect and I have a hard time swallowing that too. This was a psychological experiment, not a datascience one. I mean, if you did a similar study with 40 people, you wouldn’t call it a statistical experiment, would you? In many respects, the datascience aspect is pretty irrelevant; it’s a tool to analyse the data and not the core of the experiment in and of itself. A datascience experiment might involve identifying the differences in outcome between using a dataset with 10,000 records and a dataset with 10 million records for example. Or identifying the scale of difference in processor speeds between running a data analysis on one machine versus another.

Anyway, the two main points I want to take away from this are that a) it wasn’t really a datascience experiment and b) sometimes you need to find people who are willing to tell you that what you are doing is ick, and you need to listen to them.

Thing is – and this is where we run into fun – what have they done that they haven’t told us about?

If I had your data: Job Bridge

Yesterday I read a piece that suggested that the Department of Social and Family Affairs weren’t about to release the names of companies which used the Job Bridge service. So I decided to have a closer look at it with a view to doing some data analysis on the subject.

There are a couple of things I’d like to have a look at in terms of the current vacancies (around 2500 per the Job Bridge site) but I haven’t yet figured out how I am going to pull down all that data (webscraping isn’t something I am prone to do too often). Having looked at a couple of pages of internships I am struck by a few things.

lot of hairdressing and beauty spa
lot of “administration”
lot of comment about “formal/informal” training
lot of boilerplate text.

And that’s just the start of it.

Job Bridge itself provides “Job Bridge Data”. This is not data. This is a summary of aggregated data. I have seen a claim that 60% of Job Bridge participants go on to obtain work after the program but there is no information on that in the “Job Bridge Data”, for example. What would be interesting out of that data is where those participants are getting jobs. Are they getting jobs in their host companies and is it pretty much the case that the state is effectively paying for 6 to 9 months of a probationary period? Having done internships myself in a previous life, my experience regarding full time internships is that if companies tended to need them, they paid for them. I know that a few of the technical companies here still do.

I have a couple of data projects on the go at the moment, plus some work for my own college course right now and unfortunately, I don’t see a quick and obvious way of getting the current Job Bridge vacancies down to me, even in an unstructured manner. This is regrettable. I know there is a lot of ongoing controversy about the Job Bridge program and to some extent, understandably so – I saw at least one teaching position “must commit for the nine months” and one farm labourer position. These, in my book, are jobs rather than internships. I’m not really that interested in picking out the odd job here and there, however, to make that point. I’m interested to see what sectors are using the program, whether there’s any way that internships can be classified as jobs rather than internships. Some sort of structured data that I can pull into R would be nice. I’d also like to do some spatial analysis on where these are and again, the structure of the site is not lending itself to that because of the odd things like Tipp North, Tipp South, all the Dublin vacancies dumped into one bucket, but the city and county listings being separate for Cork, Galway, Limerick and Waterford. What would be nice is a structured dump of the data.
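If a structured dump ever did turn up, the first step for any spatial analysis would be to collapse the site’s odd region labels into counties. A rough sketch in Python follows (the same idea would carry over to R). The exact label strings in `REGION_TO_COUNTY`, the `region` field name and the sample vacancies are all my assumptions for illustration, since I have not pulled the real data.

```python
from collections import Counter

# Hypothetical mapping from site region labels to counties. The exact
# label strings used on the Job Bridge site are assumptions here.
REGION_TO_COUNTY = {
    "Tipp North": "Tipperary",
    "Tipp South": "Tipperary",
    "Cork City": "Cork",
    "Cork County": "Cork",
    "Galway City": "Galway",
    "Galway County": "Galway",
}


def county_for(region):
    """Map a region label to its county; unknown labels pass through unchanged."""
    return REGION_TO_COUNTY.get(region, region)


def vacancies_by_county(vacancies):
    """Count vacancies per county from dicts that carry a 'region' key."""
    return Counter(county_for(v["region"]) for v in vacancies)


# Made-up records, purely to show the shape of the input
sample = [
    {"title": "Administration intern", "region": "Tipp North"},
    {"title": "Salon intern", "region": "Tipp South"},
    {"title": "Administration intern", "region": "Cork City"},
    {"title": "Teaching intern", "region": "Dublin"},
]
print(vacancies_by_county(sample))
```

This does nothing for the single Dublin bucket, which would need address-level data to split up.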

One can wish, I suppose.

Putting a value on desired skills

I have an eye on the jobs market on an ongoing basis and this morning, a temporary vacancy dropped into my inbox for a data analyst role, requiring fluent French.

I tick these boxes. I speak fluent French; lived in France for one year, Belgium for 2. Added to that I have very good German as well. I’ve never felt, however, that language skills have been particularly valued. They are nice to haves but the jobs they are considered for are often low paid jobs. In 1999 – which is a long time ago – I laughed at a recruitment agent who told me that I was on to a good thing with two fluent languages and recent experience living in countries with both languages, that oooh, I could be earning up to £14,000 as I would get two language premia.

That was ten thousand pounds a year less than I had been earning as a secretary in Belgium before I came back to Ireland. It was also less than I was earning as a contract secretary in jobs where all they cared about was my ability to answer the phone and type at more than 65wpm.

So, I rocked up in a job in IT that didn’t involve much of a need to speak languages. I’m now interested in data analytics anyway – more possibly interested in numeracy as well – and am following a university course which features analytics as a core skill.

This ad had an hourly rate attached. It also talked about a possibility of earning up to a particular level for very hard work.

The level was not very high. Being frank, there are a lot of secretarial roles out there which have higher salaries.

This suggests to me that language skills are not particularly valued in Ireland, and nor are data analytical roles; or at least a lot of people looking for data analysts don’t value skills enough.

I don’t have a lot of free time at the moment, but I’m inclined to see if I can possibly figure out a way of identifying the economic premium paid for perceived desirable skills. I’m inclined to wonder if we hear skills are desired simply because people don’t want to pay for them sometimes.
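As a first, crude stab at what identifying a premium might look like: take a pile of job ads, split them by whether they ask for a given skill, and compare median advertised salaries. The Python below is only a sketch under that assumption. The ad records are invented to show the shape of the data, and the field names `salary` and `skills` are hypothetical. A real analysis would also need to control for sector, seniority and location.

```python
from statistics import median


def skill_premium(ads, skill):
    """Median salary of ads asking for `skill` minus the median of ads that don't.

    Returns None when either group is empty, since no comparison is possible.
    """
    with_skill = [ad["salary"] for ad in ads if skill in ad["skills"]]
    without = [ad["salary"] for ad in ads if skill not in ad["skills"]]
    if not with_skill or not without:
        return None
    return median(with_skill) - median(without)


# Invented ads, purely illustrative
ads = [
    {"salary": 30000, "skills": {"french"}},
    {"salary": 32000, "skills": {"french", "sql"}},
    {"salary": 28000, "skills": {"sql"}},
    {"salary": 35000, "skills": set()},
]
print(skill_premium(ads, "french"))
```

A negative number here would mean the ads asking for the skill pay less than those that don’t, which is rather the suspicion I started with.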

Recommender systems for non-frequent purchases or what I’d do with a lot of airline/holiday data

Declaration of interest: I am doing a lot of learning in the area of machine learning, classification, recommender and personalisation systems at the moment (at least compared to 3 months ago).

If you were to look at the recommendations which Amazon offer me in the area of books, you’d probably wonder a little about me. The two front runners, content wise, are ethnic recipe books, and machine learning related programming or algorithms.

I go through them every once in a while, usually late at night, and update them with useful information such as which of the recommendations I already own, and which I absolutely don’t want. And I might occasionally add something unexpected to my wishlist.

This has a fascinating impact on my recommendations. Last night, the addition of a single machine learning book to my wishlist had the net impact of dropping the number one recommendation, a cookery book called Jerusalem, down to number 6. A subsequent addition of an Edward Tufte datavisualisation book caused two new datavisualisation books to get into the top ten including one I had never heard of at number 3 (after Jerusalem got pushed down to number 6, Stephen Few wound up in number 1 with a book called Show me the Numbers). I haven’t decided yet whether I want Jerusalem or not either; I have over 100 cookbooks so theoretically, I can’t argue that I need it.

Deletions of books I wasn’t interested in usually resulted in the list just shuffling up a bit. Additions to the wishlist caused changes to the content of the list. From this I can conclude there’s a greater weight given to additions to the wishlist rather than deletion from the recommendation list. I would love to see the underlying datastructure and code for this. There’s this but it’s 10 years old and I have no doubt but that they’ve done a serious amount of work in the interim.

What does all this mean for the supposed content of this blog post? Well I realise that the Amazon data set relating to me is large and gathered over around 10 years at this stage, but deep down a part of me would like to do a little more research into it.

However, during the week, I was also considering recommender systems for less frequently used services and in particular, airlines.

Recommender systems work best if you have a decent picture of your individual customer at the point of loading up the site. Amazon does this using accounts. If you have a look at the airlines, in general, they have a mixed record on that front. The majority of them offer you some form of registering, although not all, some of them allow you to connect your account to a frequent flier card, and some of them allow you to create an account.

However, I’m not sure how many of them compel you to create an account to book a flight directly with them. I’m pretty certain that the last few times I booked airline tickets, I did so without an account.

This is not necessarily an impediment to providing some personalisation services. While I do have a Hotels.com account, for example, they are well capable of remembering where I was last looking for hotels even if I haven’t signed in with my own account.

There is an issue, however, in that the airlines are already perceived to, perhaps, game that sort of idea by providing you higher charges the second time you look. This isn’t ideal from the point of view of endeavouring to provide any sort of personalisation and recommendation system.

The other key issue is this: how do you provide personalisation services to a cohort that doesn’t buy airline tickets every other day (or at five past midnight when they can’t sleep)? If you take any of the major airlines, they carry millions of passengers, and by definition, a lot of them have to be duplicates courtesy of return ticketing, business travelling, family visits. The airline business got into the loyalty business early with the frequent flier cards but again, the picture of airline travel has changed a lot for a lot of the market since those things were invented. There is not necessarily a lot in common between your Netflix recommendations and your frequent flier points.

I have no doubt work is going on in this area – check this out from Rick Seaney in USA Today – however, what follows are some of my own thoughts on the subject.

Passengers need to be classified. You could have sixty million passengers travelling with you every year, but that’s unlikely to be sixty million different people so it is possibly not as huge a task to classify them; what matters is the feature selection side of things. They should be classified not just into leisure travel and business travel, or short trip travel and long trip travel (they don’t always overlap), or a few subsets of those, but into enough classes to allow you to provide a reasonable level of personalisation. Late last night, I figured on 20-25 groupings but I’d argue that figure is possibly dependent on what airline is doing the classification. A long haul operation like Etihad is likely to be very different to a short haul operation concentrated mainly within Europe (countless).
Routes need to be classified. To be fair, the vast majority of airlines already have this one sewn up.
Passenger booking behaviour needs to be classified. Again, not just in terms of how often they book, but how frequently they buy car hire, hotels, travel insurance. Whether they turn up to fly. Whether they look for refunds when they don’t fly. Not just for the amount of free money you gather up from them, but to add to your picture of them.
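The feature selection side of the three points above can be sketched in a few lines of Python. This is only an illustration of the idea: the booking record fields (`passenger`, `nights`, `car_hire`, `flew`) are hypothetical, the sample rows are invented, and a real airline would have far more to work with. The output is one small feature row per passenger, which is the sort of thing you would then feed into whatever classifier produces the 20-25 groupings.

```python
from collections import defaultdict


def passenger_features(bookings):
    """Aggregate raw booking records into one feature dict per passenger."""
    grouped = defaultdict(list)
    for booking in bookings:
        grouped[booking["passenger"]].append(booking)

    features = {}
    for passenger, rows in grouped.items():
        n = len(rows)
        features[passenger] = {
            "bookings": n,
            # average trip length in nights
            "avg_nights": sum(r["nights"] for r in rows) / n,
            # share of bookings that included car hire
            "car_hire_rate": sum(r["car_hire"] for r in rows) / n,
            # share of bookings where the passenger did not turn up
            "no_show_rate": sum(not r["flew"] for r in rows) / n,
        }
    return features


# Invented records: A looks like a weekly business traveller, B a holidaymaker
bookings = [
    {"passenger": "A", "nights": 1, "car_hire": False, "flew": True},
    {"passenger": "A", "nights": 1, "car_hire": False, "flew": True},
    {"passenger": "B", "nights": 7, "car_hire": True, "flew": True},
    {"passenger": "B", "nights": 14, "car_hire": True, "flew": False},
]
print(passenger_features(bookings))
```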

There are a couple of things which I could see coming out of this.

Here’s something that would certainly buy my interest immediately if, for example, I was travelling to Paris every Monday morning and coming back on a Tuesday evening for business. Provide me a login that generates a page that has two buttons: Paris and Other. The Paris button could be prefilled with the most likely routing/timing options if they are available. Or, Sorry Miss Lynch, your usual flight is fully booked. Allow me to create another personalised button based on possible plans. For example, I might want to fly to oh, Malaga to go kitesurfing in Tarifa maybe six times a year. Let me build one of those so that my landing screen is Paris, Malaga and Other. Include sports equipment as an option by default in the Malaga booking. Learn enough about me to know that, for example, I have annual travel insurance, and don’t try to sell me more. Know enough about me to know that if I am flying into Nice, I’ll hire a car, but not if I fly into London. Even if I am not booking, it might be worth letting me build dreams on your site like this for four reasons:

it makes me happy
it tells you a lot more about potential customers.
It can support families sorting out holiday plans
It can support groups organising trips away together

You can make it clear you are not locking down a fare at that point, but you do get a picture of some of the possible bookings on that flight and this may have an impact on how you manage bookings on that route around those dates. While you’re at it, keep an eye on possible efforts to game your recommender system and identify it as a class of behaviour.

Based on the information I provide when I am booking, airlines can obtain enough data to do this, even without tying the behaviour to an account. However, right now, this is not the approach that they take.
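The Paris/Malaga/Other landing page could start from something as simple as counting where a passenger has flown before. A minimal Python sketch, with the thresholds `min_trips` and `max_buttons` picked arbitrarily by me and the booking history invented:

```python
from collections import Counter


def landing_buttons(destinations, min_trips=3, max_buttons=2):
    """Build landing-page shortcuts from a passenger's past destinations.

    Destinations flown at least `min_trips` times become buttons, most
    frequent first, capped at `max_buttons`. 'Other' is always last.
    """
    counts = Counter(destinations)
    frequent = [
        dest for dest, n in counts.most_common(max_buttons) if n >= min_trips
    ]
    return frequent + ["Other"]


# Weekly Paris trips, six kitesurfing trips, one stray weekend in London
history = ["Paris"] * 40 + ["Malaga"] * 6 + ["London"]
print(landing_buttons(history))  # ['Paris', 'Malaga', 'Other']
```

A button built by the passenger herself for a trip she has not yet taken would sit alongside these rather than be derived from them.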

But here’s something else you could do.

Suppose I click on my Malaga button and the flights for the dates I choose are full. Maybe there is some golf competition on there and you know this because you’re good at knowing when events are on but the average kitesurfer might not care about the European PGA. Or it’s the week before the school holidays. Or O’Reilly have decided to run a big technical conference down there. Any number of reasons, but the flight from, say, Dublin to Malaga is full. Or any flight to Malaga is full depending on where I am living.

If I, as an airline, know that a lot of kitesurfers take their kitesurfing gear to Tenerife, or, at least have built potential bookings, I could suggest Tenerife as an alternative – a targeted alternative (particularly if I am flying alone), with the practical date data already provided for Malaga filled into a new booking form. Or if Tenerife is your first choice, Lanzarote is a viable alternative. Or Faro. Or Madeira. Based on the time frame and the amount of money concerned, and whether you interline with anyone, you have endless opportunity here. Clearly someone going golfing in Portugal for four days is not going to want to fly 11 hours via London to somewhere in Italy – but someone going for 14 days might consider a non-direct option.

Of course to do this, you need to know that my sports gear is kitesurfing equipment. But this is not impossible. And of course, you’ll never ask if I want to bring kitesurfing equipment on my regular Monday morning trip to Paris because you know already I don’t. If I don’t have much of a direct history with you, the data you have on other people can be leveraged to build a feature set to classify me.
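The Tenerife suggestion is, at its simplest, a count of where other people with the same sports gear have gone. Here is a hedged sketch in Python. The `equipment` and `destination` fields and the sample bookings are made up, and a real system would also weigh dates, fares and routings as described above.

```python
from collections import Counter


def suggest_alternatives(bookings, equipment, full_destination, limit=3):
    """Rank other destinations by how often passengers brought `equipment` there."""
    counts = Counter(
        b["destination"]
        for b in bookings
        if b["equipment"] == equipment and b["destination"] != full_destination
    )
    return [dest for dest, _ in counts.most_common(limit)]


# Invented bookings from other passengers
others = (
    [{"destination": "Tenerife", "equipment": "kitesurfing"}] * 3
    + [{"destination": "Lanzarote", "equipment": "kitesurfing"}] * 2
    + [{"destination": "Faro", "equipment": "kitesurfing"}]
    + [{"destination": "Malaga", "equipment": "kitesurfing"}] * 5
    + [{"destination": "Faro", "equipment": "golf"}] * 4
)
print(suggest_alternatives(others, "kitesurfing", "Malaga"))
# ['Tenerife', 'Lanzarote', 'Faro']
```

The golfers going to Faro do not push it up the list for a kitesurfer, which is the whole point of knowing what the sports gear actually is.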

The point I am trying to make here is that, publicly, there is a perception that airlines basically use whatever personalisation options they have to increase the fares by trapping you. Airline yield management is complex so with the best will in the world, it’s never likely to be quite that simple. But if airline personalisation tools made life easier for their customers, they might engender a lot more repeat business, particularly now. Obviously gaining that trust in a way which is not perceived to be creepy is going to be a challenge because it’s based on knowing a lot about your customers which is something a lot of people go out of their way to discourage. I mean, I know people who deliberately like to confuse Amazon about their taste in books and music – I’m not one but then you’re talking to someone who got pleasure out of checking out how her recommendations changed by updating her wishlist.

Another interesting thing which could be done with this sort of model of engaging with your customer, based on what you know about them, is telling them how many seats are available or grading the flight as commonly searched. Hotels.com does this with hotel rooms. Two rooms left at this price. This is useful because while it may not cause me to book at that point in time, it’s hardly going to come as a shock to me that the price of a room in the George V in Paris has increased in the last two or three hours since I managed to get my travelling companion on the phone. It provides some trust. If my flight is rated Red for popular, I’ll know I am competing with, for example, 5000 Munster fans for that last seat on a flight the day before a match.

All of this is only possible if my customers trust me to use this data effectively to support them and not, specifically, to abuse them. I mean, if I assume someone who books every Monday morning will always book every Monday morning and start applying stealthy price increases to them that I do not necessarily apply to non-regular passengers, I will wind up with some public relations issues. And the loss of regular streams of income.

In summary, I believe it is possible to personalise the booking experience to the benefit of both passenger and airline. I can see that hotel booking agencies are already working in this area but I think there’s even more potential there. Even after the booking experience is personalised down to the nth degree, this information could have a huge impact on targeting promotional emails (which is something, in my experience, the hotels aren’t quite getting right yet).

Changing times…

Via Damien Mulley’s fluffy links the other day I found myself perusing the Irish Motor Directory and Motor Annual 1911-1912 late last night. The directory itself can be found here, hosted by Lurgan Ancestry and while we’re at it, a shout out to My Kerry Ancestors who are talking about this link too. Okay, that’s the commercials out of the way.

I decided to see from it who was the first person in the “I come from a small, small” town where I grew up to register a car, and glanced down through the list looking at the addresses.

The first owner registered in the town where I grew up was my great grandfather, and it looks like he registered a motorbike. My mother is stunned, but was pretty certain that it was him, so I went to the 1911 census to check who of the relevant surname was living on the street concerned at the time, and by process of very simple elimination confirmed that yes, the named owner in question was her grandfather. In 1911, he was 27.

So I could write a bit about the family background but this is a data/tech blog and actually I’m going to write about changes in society.

If you have a look at the Lurgan website above, it’s actually interesting in the questions it leaves unanswered.

The register classifies vehicles by type – car, charabanc, bicycle, tricar, steam car, steam lorry, dogcart, steam plough. It would be enthralling to know who manufactured these things.
The register provides the registration numbers and some address information.
The addresses are interestingly diverse – for example, because I grew up in Cork, I was looking at the IF register – but a number of the addresses are in Dublin and the UK, for example.
in 1911-1912, there are 239 cars and 146 bikes registered in Cork, but the highest registration number is IF 434 as far as I can see. So I’m interested to see what the gaps are.
There are county and borough register authorities – I don’t know enough about local government organisation in Ireland in the early 20th century (but then, who does?)
this document was a reference handbook for motorists. So it was openly available.
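On the gaps in the Cork numbers: 239 cars plus 146 bikes is 385 vehicles, so if the IF series simply ran from 1 to 434 with one number per vehicle (an assumption on my part), 49 numbers are unaccounted for. Finding which ones is a small exercise once the register is transcribed. A Python sketch, with `missing_numbers` being my own helper name and the input list invented:

```python
def missing_numbers(registered, highest=None):
    """Return the registration numbers absent from `registered`.

    Assumes the series starts at 1. `highest` defaults to the largest
    number seen in the register.
    """
    seen = set(registered)
    if highest is None:
        if not seen:
            return []
        highest = max(seen)
    return [n for n in range(1, highest + 1) if n not in seen]


# Arithmetic from the register: vehicles listed versus highest number
print(434 - (239 + 146))  # 49

# A tiny invented register to show the gap search
print(missing_numbers([1, 2, 4, 7]))  # [3, 5, 6]
```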

That last bit is the bit that interests me. Any motorist in Ireland could have had a list of all the car owners in Ireland, known their names and where they lived, sorted by registration number. This doesn’t happen today and I don’t know if it could. I just googled my own car reg and Motorcheck came up with a background check for the car – but it will not give any personal details about the owner of the car or the address at which they live.

The Reference book for 1911-1912 suggests that there are 9169 vehicles listed in it, split slightly in favour of cars. Registrations would have started in 1903 when the registration system was implemented first (citation – Wikipedia but I don’t think there’s much arguing here). The series for Cork, IF, started being used in 1903 and eventually ran out in 1935. The number/index letters were reversed and used again later between 1975 and 1976. So the only conclusion that I can draw about my great grandfather’s bike is that it was registered at some stage between 1903 and 1911, and, I suspect, more likely closer to 1903 than 1911 based on the numbers.

For comparison, 86,932 new cars were registered in Ireland in 2012. (Summary of Statistical Yearbook of Ireland, 2012). The 1911-1912 Reference Book was compiled by Henry G. Tempest and given the available communications options, it’s fair to say that to compile and print that information for over 9,000 vehicles was an achievement but I couldn’t see him doing it for nearly 90,000 new cars, never mind all the cars still on the road from prior to 2012.

And times have changed. We are more concerned about personal data. For years, people have been applying their right not to be listed in the phone directory, and I’m not sure anyone would want their address details along with details about their car in any easily accessible database for various reasons including, no doubt, not wanting to have their movements identified too easily, or not being easy prey for thieves.

Of course, I wouldn’t be me if I wasn’t thinking of ways I could analyse this data in more detail and wondering what other extraneous data sources could be used to enhance it (and not just, for example, the 1911 census).

Property Price register in Ireland…

I’ve started looking at the possibility of doing something with the data released by the Irish government on the subject of property prices in Ireland.

This is something which really only started happening in the last few years, and in fact, it started happening well after the property market in Ireland had started to collapse. The first year for which we have data is 2010.

In general terms, I think it is a good thing that we have this data available but there are ways in which it could be enhanced, I think, which would make it more useful.

As far as I am aware, data for the Irish Property Price register comes from stamp duty returns.

Currently, the data headers are as follows:

date of sale
Address of property
postal code
County
Price in euros
Not Full Market value (yes or no – in this case YES means it was not full market value)
VAT Exclusive (yes or no)
Description (New Dwelling House/Apartment or Secondhand dwelling house/apartment)
Property size description (greater than 125 sqm, greater than or equal 38 sqm and less than 125 sqm, less than 38 sqm)

As things stand, there is very little useful information about the properties in the register that allow us to do anything particularly interesting.

Data can be downloaded at county level. The county column is otherwise not useful to an end user
The postal code field is currently inapplicable for most of the country and for Dublin, it is not always filled in because the postal code has been integrated with the address
date of sale is useful
price of property is useful
full market value or not is useful
VAT exclusive is useful
property description only informs us whether the property is new or second hand. It does not tell us the type of property
property size would be more useful if the bin ranges were more granular.

Nevertheless I have plans for this data but only within the confines of the possible.

However, one of the things I would consider is how could we make this better for the future?

Postcodes are coming for the entire country. I have yet to look at the implementation (soon) but this could be very useful in terms of segmenting the market, provided they are entered in the correct field.
No estate agent describes a property as a new or second hand dwelling house/apartment. DAFT, for example, drills down house, apartment, duplex, bungalow. Arguably to that we could add detached, semi detached, terrace (or town house or whatever you’re having yourself for houses bounded on both sides). Put simply, for type of property, there isn’t enough information.
Additional column for new or secondhand dwelling
Accurate surface area measurements. At this point I need to note that from experience, in other countries, surface area is more important in ads and classifieds than the number of bedrooms and bathrooms. I would like it if a) it was mandatory to provide surface area measurements in property sale/rent ads and b) this information was included in the property price register.

There are a few benefits here. Price per square metre is a useful indicator of value across different areas (which might be more definable with valid postcodes). We can also get a picture of which areas have bigger houses (I have plans to look into what I can find on the subject of surface area measurements against time at some stage too).
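
To make the price per square metre idea concrete, here is a minimal sketch. It assumes each sale record carries an area label and a surface area in square metres, neither of which the register provides today; the records and figures are invented for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: the register has no surface-area field at present,
# and these area labels and figures are made up.
sales = [
    {"area": "Dublin 9", "price": 250000.0, "area_sqm": 100.0},
    {"area": "Dublin 9", "price": 330000.0, "area_sqm": 110.0},
    {"area": "Dublin 4", "price": 600000.0, "area_sqm": 120.0},
]


def median_price_per_sqm(records):
    """Group sales by area and return the median price per square metre."""
    by_area = defaultdict(list)
    for record in records:
        if record["area_sqm"] > 0:  # skip rows with no usable measurement
            by_area[record["area"]].append(record["price"] / record["area_sqm"])
    return {area: median(values) for area, values in by_area.items()}


result = median_price_per_sqm(sales)
```

I have used the median rather than the mean here because a handful of very expensive sales would otherwise drag the figure for an area upwards.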

Our property market now is very different to what it was in 2007, but also compared to what it was in 2000, and in 1993. We built a lot of apartments in later years which makes comparing averages very difficult and fraught with danger.

According to the data I have available to me right now, up to 7 October or so, there have been 5943 sales in the Dublin area. In comparison, for the whole of 2012, there were 8808 sales based on a superficial glance at the data.

I will be having a look at this in more detail and will post the outcome in the future and I will also put the code up on github.

In the meantime, it would be nice if we could consider getting past “we have a register” and on to the question “how can we make it even more useful”.

Dublin Bus data – can I have some please?

If you use the RTPI signs, or have either the Dublin or Ireland transport info applications on your phone, here are the questions that get answered:

When is the next bus due?
What is the next bus?
When is the next train going to leave from here?
Where is that train going?

I use Dublin Bus’s RTPI application all the time. I use it more than the all-Ireland one because I have my favourite bus stops set up and somehow I haven’t the time to set them up in a second app. I will probably use the all-Ireland one for Irish Rail.

These are useful questions to answer. They take the guesswork out of the timetable, traffic and delays. Only on one day have they really let me down and to be fair, Dublin Bus had extenuating circumstances as O’Connell Bridge was closed.

But there is another question I’d prefer an answer to and it is this.

How packed is my onwards bus connection?

Every morning, I get two buses, one from where I live on Dublin’s northside to the city centre, and one from the city centre to UCD’s Belfield campus. In total, the journey normally takes me about an hour, end to end. It’s not bad, but on occasion, things go wrong and I wind up delayed, and occasionally a bit soaked.

Normally I change buses on D’Olier Street. Most of the cross city buses wind up there when they are running southbound, and both buses I can get into the city centre drop at stops there, and a lot of the buses to UCD pick up there. Sometimes, however, those buses are full, full enough for drivers to decide they are not taking on any more passengers. So I get left behind.

So, some mornings, instead of getting off in D’Olier Street, I stay on my first bus until I get to Kildare Street, and do the change there. This is because I gamble that the UCD bound buses will lose a significant number of passengers on Nassau Street, outside Trinity College.

Because of the way Dublin City Centre is laid out, and how the bus routes cross each other, there are often multiple common stops between bus routes. So the question that I’d like an answer to some mornings, at 8am, is this – can I get out at D’Olier Street (there’s a Spar there, and a little more shelter if it’s raining) or should I wait until Kildare Street before doing my bus route change?

Dublin Bus collects a lot of data that I know about, and probably a whole lot more that I know nothing about. Typically, they will know how many people are getting on buses because they have two ticket readers and a driver. And because of the stage system, for a lot of those passengers, they can make an educated guess where people will be alighting.

It would be nice if the RTPI could indicate the likely busyness of a given bus. When you looked at the 41 due to go into town, for instance, it could highlight whether the bus was likely to be full or only half full at a given point on its route. That way, when I look at RTPI for the 46A to go to Belfield, I can check how full the bus is as well.
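
Here is a rough sketch of the kind of estimate I have in mind. It assumes Dublin Bus can supply, for each stop, a count of boardings and an educated guess at alightings; the function name, stop list, figures and capacity are all hypothetical and not anything Dublin Bus actually publishes.

```python
def free_places_on_arrival(stops, capacity):
    """Estimate how many places are free when the bus pulls in at each stop.

    `stops` is an ordered list of (name, boarding, alighting) tuples.
    Passengers are assumed to alight before anyone boards.
    """
    on_board = 0
    free = {}
    for name, boarding, alighting in stops:
        on_board = max(0, on_board - alighting)
        free[name] = capacity - on_board  # what a waiting passenger sees
        on_board = min(capacity, on_board + boarding)  # a full bus leaves people behind
    return free


# Invented numbers for a southbound bus heading for Belfield.
route = [
    ("Earlier stops", 70, 0),
    ("D'Olier Street", 15, 2),
    ("Nassau Street", 5, 30),
    ("Kildare Street", 10, 5),
]

free = free_places_on_arrival(route, capacity=80)
```

With these made-up figures, the bus has far fewer free places at D'Olier Street than at Kildare Street, which is exactly the gamble I am making some mornings without any data to back it up.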

Incidentally, this is worth a read – via ITS International.

What Twitter should know….and better exploit

Currently I follow around 2000 people on Twitter. It varies down slightly as I do the occasional clear out to make space for new people. But Twitter has applied a top limit of around 2000 so once I hit that I run into trouble.

Which is fine – to some extent (as in it’s not, really) – it’s their site, they have to sort out the scalability and all that. But some time ago, after a raft of unsolicited promoted tweets from just one company, I blocked the relevant account. I recognise that Twitter has to find a way of monetising the service and I’m not, in principle, against the whole sponsored tweets thing provided it doesn’t become an onerous load on my feed. If one in 10 tweets was a sponsored tweet, I’d get annoyed. But I got annoyed with this one company because they are selling a service I just do not want, am not interested in and don’t care for.

They appeared in my timeline again during the week. I was not happy.

You can, to some extent, tune the advertising which Google serves you on certain products – ads can be muted and you won’t see them again. And Google’s context sensitive advertising in Gmail is generally actually very context sensitive. It’s typically appropriate. I just can’t tune the sponsored tweets that come my way, not overtly anyway.

Twitter could, because Twitter knows an awful lot about my interests. I have 32000+ tweets on Twitter and I follow almost 2000 people. And I have a bio. And based on this, Twitter could offer advertisers/users of sponsored tweets a lot more granularity in targeting their sponsored tweets.

If you read my bio, here are two key things you get that I am interested in 1) photography 2) crochet.

If you then perform an analysis of the people I follow, you will find a couple of more interesting things 1) surfing 2) kitesurfing 3) science 4) computers 5) data analysis and statistics 6) mathematics 7) certain newsmedia sites.

If you then perform an analysis of the things I tweet and things I retweet you can get a feel for even more of the things I have a slightly more than passing interest in but which haven’t turned up under my bio or my followee accounts.
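
The three layers of analysis above could be combined along these lines. This is only a sketch under my own assumptions: the weights are arbitrary, the topics are assumed to have already been extracted from the bio, the followed accounts and the tweets, and none of it reflects how Twitter actually does targeting.

```python
from collections import Counter

# Assumed weights: a stated interest in a bio counts for more than a follow,
# and a follow counts for more than a single tweet.
WEIGHTS = {"bio": 3.0, "following": 2.0, "tweets": 1.0}


def interest_profile(signals, weights=WEIGHTS):
    """Add up a weighted score per topic across the three signal sources."""
    scores = Counter()
    for source, topics in signals.items():
        for topic in topics:
            scores[topic] += weights[source]
    return scores


def is_relevant(ad_topics, profile, threshold=2.0):
    """An ad is worth showing if any of its topics scores at or above the threshold."""
    return any(profile[topic] >= threshold for topic in ad_topics)


# Hypothetical, pre-extracted topics for one account.
signals = {
    "bio": ["photography", "crochet"],
    "following": ["surfing", "kitesurfing", "science", "photography"],
    "tweets": ["data analysis", "surfing", "surfing"],
}

profile = interest_profile(signals)
top_two = [topic for topic, _ in profile.most_common(2)]
```

An advertiser selling camera lenses would pass the `is_relevant` check for this profile, while one selling a service that appears nowhere in it would not.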

With that information, you can target advertising to me a lot more effectively. The right accounts reach me and I get something of value back. We both, to a certain extent, win and Twitter gets to offer an enhanced service to their sponsoring accounts. And I don’t get annoyed by tweets turning up in my timeline from accounts I don’t follow (in this case absolutely don’t want to follow) while simultaneously not being allowed by Twitter to add to the accounts I do want to follow because of the 2000 limit I regularly brush up against.

I can’t believe that Twitter don’t know this – it’s online advertising 101 – but if they are applying it, I don’t see it yet. I imagine it is being worked on.

IIHA2YD – Pinterest

Pinterest is another one of those services whose data I suspect would be very interesting to look at. For the most part, it’s starting to be picked up commercially as a gallery option for a lot of companies. It gets very heavy usage from the handcrafts sector which is how I encountered it first, and then also via the surf magazine world (where I saw it starting to get a lot of use).

Most of the people I know who actively use Pinterest for themselves are women. I’d be interested to see how much of that is based in reality or whether that’s limited to my particular circle of boards. For the limited number of boards I follow, there is a heavy increase in usage between about 6 in the evening and 11 at night. There are a lot of specialist boards (I follow one which handles antique ink bottles only).

Quite a few stores are starting to implement Pinterest links and this is interesting because one of the uses I have gotten out of it is as a visual shopping list. It also gets used as a bookmark service (see all the recipes that get bookmarked on it) and as a gallery (see how companies are using it to showcase their products).

So I’d like to map out how it gets used by different people and how it gets exploited commercially. There is some evidence to suggest that it drives serious sales for companies using their pinboards as store fronts and that it may be more effective than other social media in this respect. I’d be interested to see if it’s possible to get a global picture of this, and whether some sectors do better than other sectors in this respect. I’d like to have a look at international traffic flows and whether different cultures target different uses of the pinboards.

I’m also interested in how the pins are categorised. Unlike Flickr, for example, Pinterest doesn’t really use tags; it uses categories. I’m not certain how its search works but my guess is it’s part category based and part text content based, and this is interesting because usage patterns suggest that repinners very often don’t change the text accompanying a particular pin.

I’m really going to see if I can figure out a way of structuring how I would do this if I had a chance. It would create one giant infographic I suspect.