Grouping parliamentary constituencies by their demographic characteristics
What am I trying to achieve?
After working with demographic data for a while I decided to try to create sensible groupings of constituencies – technically known as clusters – based on that information. In England alone there are more than 500 individual constituencies, so it makes sense to bundle them together into smaller, more usable groups.
This activity was triggered by the current fragmenting of the traditional party structure. We had (briefly) Change UK. We have the new Brexit Party. We have the Greens and Lib Dems polling at levels not seen before. The AI models I’ve previously built that predict vote share by party based on demographics aren’t really valid any more – or at least we won’t know if they are until the next election comes around.
This kind of Supervised Machine Learning (using AI models to predict an outcome) assumes that you can predict the future from past behaviour. For that to hold, you need a certain level of continuity. It can handle slow, evolutionary change, but it can't deal with sudden, fundamental paradigm shifts.
So I moved to using Unsupervised Machine Learning. This is basically throwing your data at an AI model and asking it what it can see, rather than asking it to predict something in particular. In this instance, what it sees is how to carve those constituencies up into groups with similar characteristics.
What do these clusters look like?
The first thing to say is that there isn't a single, perfect clustering. There are different ways to approach the problem, all of which will yield slightly different results. How many clusters do you want to group into? Changing the number of clusters will clearly change the outcome. Even using the same approach with the same number of clusters can produce different results, because an element of randomness is inherent in how most clustering algorithms are initialised.
Any clustering is better thought of as a guide; it’s just one way of seeing the data rather than a definitive and static answer. It’s really meant to start a conversation and trigger debate. It could form the underpinnings of a strategy for a political party but it isn’t a route map.
As with my previous constituency analysis, this is only looking at English constituencies. This is because the data flows from the analysis on party vote share. Northern Ireland, Scotland and Wales are complicated by the presence of the Nationalist parties.
After an initial analysis, I selected seven clusters as the optimum number – basically trying to get the most information value with the fewest number of groups. The clusters were created by picking an example constituency that best represented the whole (technically known as a medoid) and then establishing which constituencies were closest in character to that exemplar.
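For the technically minded, the medoid approach can be sketched in a few lines of Python. This is an illustrative toy (a simplified PAM-style loop on made-up points), not the actual pipeline – the real analysis used many demographic variables per constituency.

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    """Toy PAM-style k-medoids: each cluster is anchored to a real data point."""
    rng = np.random.default_rng(seed)
    # full pairwise Euclidean distance matrix
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(d[:, medoids], axis=1)   # nearest-medoid assignment
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            # the member minimising total distance to its cluster becomes the medoid
            new_medoids[c] = members[d[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new_medoids, medoids):    # converged
            break
        medoids = new_medoids
    return medoids, labels

# two obvious blobs standing in for 'constituencies'
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
medoids, labels = k_medoids(X, k=2, seed=1)
```

The key point is that `medoids` is a list of indices into the real data – each cluster's exemplar is an actual row, not a synthetic average.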
This table shows the seven clusters, the size and relative size of each, and what the example (medoid) constituency is. I've then been through all the variables to pick out which demographic criteria appear to have the most impact on the cluster selection.
I say ‘appear’ because all Machine Learning and AI is, by its nature, opaque. You are asking the model to make decisions, but this means it’s not necessarily easy to see why a particular decision has been made.
Show me on a map
If you want to see these clusters and some of the high level demographic measures, this shows the constituencies mapped out. I’ve added in the vote share of the two main parties and the Leave vote proportion.
Also, there is actually a constituency called Southampton Test. It’s not a typo. Feel free to Google it if you don’t believe me.
There are no specific conclusions to be drawn – other than that there are clear correlations between demographics, party lean and Brexit lean, and as a consequence any clustering based on demographic data would have value to a political party.
Why 7 clusters?
The selection of 7 clusters was based on the standard ‘elbow’ method – i.e. plotting the number of clusters against the within-cluster sum of squares. I did look at other methods (dendrogram and reducing the variables to 2 dimensions) but the data did not lend itself to clear answers. The elbow method saw a nice drop when the 7th cluster was added, followed by a flatter tail. So seven it was.
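The elbow method itself is only a few lines of code. Here's a sketch using scikit-learn on synthetic data – the library choice and the data are my illustration, not necessarily what was used for the original analysis:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic stand-in: 300 points drawn from 4 well-separated groups
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

ks = range(1, 9)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in ks]

# the 'elbow' is where the drop in WCSS flattens out –
# with this data, the big drops stop once k reaches the true number of groups
drops = [wcss[i] - wcss[i + 1] for i in range(len(wcss) - 1)]
```

Plotting `wcss` against `ks` gives the familiar elbow curve; with the constituency data, the bend appeared at seven.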
To an extent, it didn’t really matter how many clusters I picked. If this was to be used in anger (by a political party or lobbying group) I’d see this as a starting point not the end outcome. There may be local knowledge that would provide natural further splits. Also, this is only based on publicly available data; I would assume anyone doing this professionally would have access to other data sources…
Why medoid not centroid clusters?
You will see from the non-technical description that I’ve chosen to use medoid clustering rather than centroid. The reason for this is non-technical – I felt that a medoid cluster is easier for a reader to understand than a centroid. Working with a tangible real-life example is far more natural than equating back to a more nebulous point in multi-dimensional Euclidean space. I appreciate that I may lose some technical accuracy, but since the intent is really to trigger discussion I felt this was worth the trade-off.
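To make the distinction concrete, here's a tiny numpy illustration on entirely made-up points: the centroid is an average that usually corresponds to no real constituency, while the medoid is an actual member of the group.

```python
import numpy as np

X = np.array([[1., 1.], [2., 1.], [9., 9.]])   # three made-up 'constituencies'

centroid = X.mean(axis=0)                       # an abstract average point
d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
medoid = X[d.sum(axis=1).argmin()]              # the real point closest to the rest
```

Here the centroid lands at (4, 3.67) – a place none of the three points occupies – whereas the medoid is the genuine point (2, 1). "Cluster 3 looks like Southampton Test" is a far easier sentence than "cluster 3 sits near this vector of averages".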
Reverse engineering characteristics
The issue with clustering and all forms of Machine Learning is that they are black boxes. The natural question is why a given constituency is in a given cluster. Given the relatively large number of variables and the opaque calculation methodology, I had to spend a fair amount of time working with the data to see what characteristics each cluster had so it could be converted back into a written description. I deliberately used slightly vague language – as they are patterns, rather than rules. A cluster may tend to have a younger median age, but there will be outliers and exceptions in that group – where other variables override age as a criterion.
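One simple way to reverse-engineer those written descriptions is to compare each cluster's average on every variable against the overall average. A pandas sketch with invented numbers – the real data has far more variables and constituencies:

```python
import pandas as pd

df = pd.DataFrame({
    "cluster":    [0, 0, 0, 1, 1, 1],
    "median_age": [32, 34, 31, 48, 51, 47],
    "pct_degree": [45, 50, 47, 22, 25, 20],
})

overall = df.drop(columns="cluster").mean()
# signed deviation of each cluster from the overall average:
# a strongly negative median_age reads as 'tends to be younger'
profile = df.groupby("cluster").mean() - overall
```

The deviations only describe tendencies, which is exactly why the written descriptions have to stay hedged: an individual constituency can sit in a "younger" cluster while itself being older than average.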
Spot the deliberate mistake
Finally, if you are particularly eagle-eyed, you’ll notice that one constituency on the map is N/A and has no cluster. This is pure user error. Every decade or so the constituency boundaries are re-written. I’d attempted to create a dataset for longer-term trending for another piece of analysis. I essentially picked which current constituency was closest to the old version. And cocked up one constituency for the current version. If I corrected this error and reran, I suspect I’d get subtly different results, but quite frankly I can’t be bothered. I’ve got a day job after all…