Classifying players’ positions using public data

This post is a little wonkier than usual. It sets out a method for assigning players to positions.

It’s difficult to assess players’ performance without knowing what role they’re playing. Averaging three scoring shots per game might be a solid performance for an inside midfielder, but deeply underwhelming for a key forward. But to take players’ positions into account when assessing their performance, first we need to know what position they play.

Classifying players’ positions is more difficult than it might seem, because positions are so fluid in footy. It’s easy to identify a ruckman or a key forward, whether you’re watching a game or poring over a spreadsheet. Other positions are not as easy. I set out to answer the question: is there a way to assign players to positions, just using publicly available data?

Doesn’t the AFL just list players’ positions?

The AFL’s public data doesn’t specify players’ positions, so we’re out of luck there.

Champion Data does classify players’ positions, assigning each player to one of seven categories. They base their classifications on a combination of their secret-sauce data, such as the location of disposals, and “feedback from AFL club staff”. If we were only interested in current players’ positions, we would just use Champion Data’s classifications, as we wouldn’t be able to improve on them using only the limited data the AFL chooses to make available. But Champion Data doesn’t release its positional classifications for previous years, so they’re no use to us if we want to compare players’ performance over several seasons. To look further back than the most recent season, we’ll need to construct our own classifications, using data in the public domain.

Players are also assigned positions for the purposes of fantasy football, but those classifications only go back a few years.

Step 1: Choose your unit of analysis

Players’ positions can change from game to game, or even within a game. Ideally, we would classify players for each game they’ve played, but we can’t do that reliably with public data. There’s too much noise in the game-level data.

Instead, I’ve chosen to assign players to a position for each season of their (active) career. So our unit of analysis is a player-year.

Step 2: Choose your stats

Our player classifications are based on data for every AFL game from the 2003 season to the 2016 season, inclusive. I’ve used this period because 2003 is the first season that contains goal assist data and data on the percentage of the game that each player played. All data are from AFL Tables. We have 119 235 rows of player data for the 2003-2016 period, with each row corresponding to a particular player in a particular game. Once we summarise these as player-years, we’re left with 8 512 rows.
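To give a sense of the wrangling involved, here’s a minimal sketch of collapsing game-level rows into player-years using pandas. The column names, and the choice to average each stat across a player’s games in a season, are my assumptions about one reasonable way to do it, not a description of the exact pipeline.

```python
import pandas as pd

def to_player_years(games: pd.DataFrame) -> pd.DataFrame:
    """Collapse one-row-per-player-per-game data into one row per player-year.

    Column names ('player_id', 'season', 'height', ...) are illustrative.
    """
    counting_stats = ["tackles", "clearances"]   # ...plus the other stats listed below
    agg = {"height": "first"}                    # height doesn't change within a season
    agg.update({stat: "mean" for stat in counting_stats})
    return games.groupby(["player_id", "season"], as_index=False).agg(agg)
```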

I’ve classified players’ positions using the following stats:

  • Height;
  • Scoring shots;
  • Marks inside 50;
  • Marks outside 50;
  • Rebound 50s;
  • Hit outs;
  • Tackles;
  • Inside 50s;
  • Clearances;
  • Non-clearance contested possessions;
  • Other disposals;
  • One percenters; and
  • Goal assists.

Scoring shots are goals plus behinds. The public data don’t tell us about shots that fell short or went out of bounds on the full, so these aren’t included.

I’ve defined “non-clearance contested possessions” as contested possessions minus clearances. This is a bit of a kludge, as it’s possible to have a clearance that isn’t a contested possession, like if a ruckman smashes a ball away from a contest. This is only a small problem, as I’m reliably informed that only around 4% of clearances aren’t contested possessions.

‘Other disposals’ is a much bigger kludge – it’s disposals minus scoring shots, minus goal assists, minus contested possessions, minus inside 50s. This is not ideal at all. We’re ‘double subtracting’ quite a few disposals here. But I want to isolate uncontested possessions that don’t result in a scoring shot, I50 or goal assist, and until the AFL chooses to start releasing more granular data, this is about the best we can do.
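In code, both derived stats are simple column arithmetic. Here’s a sketch in pandas; the column names are assumptions about how the raw AFL Tables data has been wrangled, not actual field names.

```python
import pandas as pd

def add_derived_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Add scoring shots plus the two 'kludge' stats described above."""
    df = df.copy()
    df["scoring_shots"] = df["goals"] + df["behinds"]
    # Kludge 1: contested possessions that aren't clearances
    df["non_clearance_cp"] = df["contested_possessions"] - df["clearances"]
    # Kludge 2: whatever is left after stripping out the disposal types we've
    # already counted (this 'double subtracts' some disposals)
    df["other_disposals"] = (df["disposals"] - df["scoring_shots"]
                             - df["goal_assists"] - df["contested_possessions"]
                             - df["inside_50s"])
    return df
```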

Step 3: Adjust stats to a per-full-game basis

If a player came on as the substitute, played 20 per cent of the game, and got 5 hit-outs, we’re on pretty safe ground assuming he’s a ruckman. If he was out on the ground for 80 per cent of the game and got 5 hit-outs, we might conclude he was a key position player pinch-hitting in the ruck or going third man up at a few contests.

To get greater clarity, we’ve converted players’ stats to a per-full-game basis. If a player was on the ground for 75% of the game and recorded 30 kicks, that’s equivalent to 40 kicks on a per-full-game basis. The only stat that doesn’t get adjusted for playing time is height, for obvious reasons.
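As a sketch, the adjustment is just a division by the fraction of the game each player spent on the ground. The 'time_on_ground_pct' column and the stat names below are illustrative.

```python
import pandas as pd

# Stats that get scaled up to a per-full-game basis (height is excluded).
COUNTING_STATS = [
    "scoring_shots", "marks_inside_50", "marks_outside_50", "rebound_50s",
    "hit_outs", "tackles", "inside_50s", "clearances", "non_clearance_cp",
    "other_disposals", "one_percenters", "goal_assists",
]

def per_full_game(df: pd.DataFrame) -> pd.DataFrame:
    """Scale each counting stat by time on ground (e.g. 30 kicks at 75% -> 40)."""
    df = df.copy()
    fraction_played = df["time_on_ground_pct"] / 100
    df[COUNTING_STATS] = df[COUNTING_STATS].div(fraction_played, axis=0)
    return df
```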

Step 4: Standardise the data

The game has changed. In 2003, players recorded an average of 1.94 tackles a game. In 2016, the average was up to 3.16. We want to adjust for changes like this, so we standardise all stats, including height, within each season.

If a player averaged 3 tackles a game in 2003, that was above the average for that season. To be precise, they averaged 1.06 tackles above the league-wide average for 2003, which is equal to 0.6 standard deviations above average. The player’s standardised tackle score is therefore 0.6. If a player averaged 3 tackles a game in 2016, they would have a standardised tackle score of -0.06, as they were 0.06 standard deviations below the season average.

All stats are placed on a standard scale in this way and we use these standardised stats in the position classifications.
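A minimal sketch of the standardisation, assuming the per-full-game stats sit in a pandas DataFrame with a 'season' column (names illustrative): each stat is converted to a z-score within its season.

```python
import pandas as pd

def standardise_within_season(df: pd.DataFrame, stat_cols: list[str]) -> pd.DataFrame:
    """Replace each stat with (value - season mean) / season standard deviation."""
    df = df.copy()
    zscore = lambda s: (s - s.mean()) / s.std()
    df[stat_cols] = df.groupby("season")[stat_cols].transform(zscore)
    return df
```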

Step 5: Decide on a clustering methodology

I’ve used a clustering algorithm called ‘Partitioning Around Medoids’ (PAM). There are other options that would also work, like the popular k-means algorithm, but after some trial and error I found that PAM gives results that make more sense to me.
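If you wanted to reproduce this step in Python, the scikit-learn-extra package provides a PAM implementation via its KMedoids class. A minimal sketch, where X is the matrix of standardised player-year stats (one row per player-year):

```python
from sklearn_extra.cluster import KMedoids  # pip install scikit-learn-extra

def cluster_positions(X, n_positions=8, seed=42):
    """Assign each row of X (a player-year) to one of n_positions clusters via PAM."""
    pam = KMedoids(n_clusters=n_positions, metric="euclidean",
                   method="pam", random_state=seed)
    return pam.fit_predict(X)
```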

Step 6: Determine the number of positions that can be identified in the data

It’s not obvious how many positions we should use. There are four ‘elemental’ positions, namely forward, midfield, defence and ruck. But any observer of football knows that there are important distinctions within those elemental categories, like the distinction between small and key forwards, or inside and outside midfielders.

I’ve tried to let the data tell us how many positions can be reliably identified. The ‘gap statistic’ is often used for this purpose. Using this measure, I find that there are eight positions that can be identified in the data.

[Figure: gap statistic by number of clusters]
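For the curious, the gap statistic compares the within-cluster dispersion of the real data to the dispersion you’d expect from featureless reference data, for each candidate number of clusters. R’s cluster::clusGap is a ready-made implementation; below is a rough Python sketch that uses k-means for the internal clustering rather than PAM, purely to keep the code short.

```python
import numpy as np
from sklearn.cluster import KMeans

def _log_dispersion(X, k, seed=0):
    """Log of the pooled within-cluster sum of squares for k clusters."""
    return np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)

def gap_statistic(X, k_max=12, n_refs=20, seed=0):
    """Gap statistic (Tibshirani et al., 2001) with uniform reference data."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        ref = [_log_dispersion(rng.uniform(lo, hi, size=X.shape), k, seed)
               for _ in range(n_refs)]
        gaps.append(np.mean(ref) - _log_dispersion(X, k, seed))
    return gaps  # larger gaps indicate better-supported cluster counts
```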

Step 7: Results!

The next step is to run the PAM algorithm over the full dataset, assigning each player in each season to one of eight positions. I’ve named the positions, but these names are subjective.

Here’s the statistical profile of each position, as indicated by the average standardised stats of the players in each position.

[Figure: statistical profile of each position]

Each dot in the chart below is a 2016 player, with the colour corresponding to his assigned 2016 position. The horizontal axis is the first principal component of the standardised data, while the vertical axis is the second principal component.

[Figure: 2016 players plotted on the first two principal components, coloured by assigned position]
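The chart itself is straightforward to recreate: project the standardised stats onto their first two principal components and colour each point by its cluster label. A sketch with scikit-learn and matplotlib, where X, labels and mask are placeholders for the standardised data, the cluster assignments, and a boolean selector for 2016 player-years:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_positions_2d(X, labels, mask):
    """Scatter the selected player-years on their first two principal components."""
    pcs = PCA(n_components=2).fit_transform(X)
    plt.scatter(pcs[mask, 0], pcs[mask, 1], c=labels[mask], cmap="tab10", s=12)
    plt.xlabel("First principal component")
    plt.ylabel("Second principal component")
    plt.show()
```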

The slideshow below shows each 2016 player, at each club, colour coded by their assigned position.

[Slideshow: each club’s 2016 list, colour coded by assigned position]

Here’s how our estimated positions line up with those used in fantasy footy, using 2015 data. For example, out of 36 (active) players classified as a ruck for fantasy purposes, I classify 32 of them as a ruck.

The Arc’s position classifications (columns) by AFL Fantasy position (rows):

          Def. mid  Gen. def  Gen. fwd  Inside mid  Key def  Key fwd  Outside mid  Ruck
DEF             25        72         1           1       68        0            3     0
FWD             24         2        41           1        3       51            9     1
FWD/DEF          6         1         4           0       11        4            1     0
FWD/RUC          1         0         0           0        0        2            0    13
MID             31        10        12          64        0        0           22     0
MID/DEF         11        14         0           2        0        1            9     0
MID/FWD         12         6        20          17        0        1           25     0
MID/RUC          0         0         0           0        0        0            0     1
RUC              3         0         0           0        0        1            0    32

Note: Table includes all 2015 players who could be matched across datasets.
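If you have the two datasets matched up, a table like the one above is a one-liner with pandas (column names illustrative):

```python
import pandas as pd

def fantasy_vs_arc(matched: pd.DataFrame) -> pd.DataFrame:
    """Cross-tabulate AFL Fantasy positions against the clustered positions.

    'matched' has one row per matched 2015 player, with illustrative columns
    'fantasy_position' and 'arc_position'.
    """
    return pd.crosstab(matched["fantasy_position"], matched["arc_position"])
```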

Help me improve on these classifications

Here’s a spreadsheet containing the assigned position for every player in every season from 2003 onwards.

What do you think? Are there glaring problems here? I’m sure there are particular players who have been assigned to positions that don’t look right; I’m more interested in types of players who are systematically misclassified using this method. Do these results look mostly right, most of the time?

I don’t doubt that these classifications could be improved upon by an expert who sat down and watched every game by every player since 2003. They could. The question is whether there is some alternative, systematic means of classifying players’ positions that performs ‘better’, using public data. It could do so by using different data, or a different statistical method, or a different number of positions, or some combination of the above.

I’d be very interested in feedback on either the process I used to classify players, or the results. You can contact me via email from here, or via Twitter.
