The 2021–22 Norwegian Football Cup

The 2021–22 Norwegian Football Cup had \(128\) participating teams. In the first round, \(64\) matches were played and the winners proceeded to the second round. In the second round, \(32\) matches were played and the winners proceeded to the third round. Continuing this pattern gives \(7\) rounds in total, the seventh being the final match.
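The halving pattern can be checked with a few lines of Python. This is only an illustrative sketch; the function name `matches_per_round` is made up for this example.

```python
def matches_per_round(n_teams):
    """Return the number of matches in each round of a single-elimination cup."""
    rounds = []
    while n_teams > 1:
        n_teams //= 2  # every match eliminates one of its two teams
        rounds.append(n_teams)
    return rounds


schedule = matches_per_round(128)
print(schedule)       # [64, 32, 16, 8, 4, 2, 1]
print(len(schedule))  # 7 rounds
print(sum(schedule))  # 127 matches in total
```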

Almost two years ago I wrote about modeling the outcome of football matches. In this article we run a similar model on a fresh data set from the football cup.

The model

We assume that goals \(y\) are Poisson distributed, i.e., \(p \left( y | \theta \right) = \theta^y \exp \left( -\theta \right) / y!\) with rate parameter \(\theta\). Every team has a latent attack and defence strength influencing \(\theta\). We also add an intercept (bias) to the model. In the previous article we modeled home team advantage, but in the cup the weakest teams played on their home turf—so adding home advantage would spoil the inference of team strength in this model.

To ease notation, we do not explicitly index each match in the model description below. We use the variable \(y^\text{h}\) to denote the goals scored by the home team, which are drawn from a Poisson distribution with rate parameter \(\theta^\text{h}\); \(y^\text{a}\) and \(\theta^\text{a}\) play the same role for the away team. The meaning of \(\text{attack}_{i[\text{h}]}\) is “attack strength of team \(i\), which is the home team in the match.”

The generalized linear model uses a logarithmic link function with a Poisson distribution:

\begin{align} y^\text{h} &\sim \text{Poisson} \left( \theta^\text{h} \right) \\ y^\text{a} &\sim \text{Poisson} \left( \theta^\text{a} \right) \\ \log \left( \theta^\text{h} \right)  &= \text{attack}_{i[\text{h}]} - \text{defence}_{i[\text{a}]} + \text{intercept} \\ \log \left( \theta^\text{a} \right)  &=   \text{attack}_{i[\text{a}]} - \text{defence}_{i[\text{h}]} + \text{intercept} \end{align}
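To make the link function concrete, here is a minimal sketch that turns latent strengths into Poisson rates and goal probabilities. The strengths, the intercept value and the helper names (`poisson_pmf`, `match_rates`) are made up for illustration; they are not estimates or code from the fitted model.

```python
import math


def poisson_pmf(y, theta):
    """Probability of y goals under a Poisson distribution with rate theta."""
    return theta**y * math.exp(-theta) / math.factorial(y)


def match_rates(home, away, intercept):
    """Apply the log link and return (theta_h, theta_a) for one match.

    `home` and `away` are dicts with the hypothetical keys "attack" and "defence".
    """
    theta_h = math.exp(home["attack"] - away["defence"] + intercept)
    theta_a = math.exp(away["attack"] - home["defence"] + intercept)
    return theta_h, theta_a


# Made-up strengths, chosen only to show the mechanics.
strong = {"attack": 0.5, "defence": 0.3}
typical = {"attack": 0.0, "defence": 0.0}

theta_h, theta_a = match_rates(strong, typical, intercept=1.0)
print(theta_h, theta_a)         # exp(1.5) and exp(0.7)
print(poisson_pmf(2, theta_h))  # probability that the home team scores exactly 2
```

A positive defence value lowers the opponent's rate, which is why defence enters the linear predictor with a minus sign.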

The attack and defence parameters are random variables with prior distributions. The priors were established by prior predictive simulations, and are given by:

\begin{align} \text{intercept} &\sim \operatorname{Normal} \left( 1, 0.4 \right) \\ \text{attack}_i &\sim \operatorname{Normal} \left( 0, \text{attack_std} \right) \\ \text{defence}_i &\sim \operatorname{Normal} \left( 0, \text{defence_std} \right) \\ \text{attack_std} &\sim \operatorname{Uniform} \left( 0, 1 \right) \\ \text{defence_std} &\sim \operatorname{Uniform} \left( 0, 1 \right) \end{align}
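A prior predictive simulation draws parameters from the priors, generates goals from them, and checks whether the simulated scores look plausible. The sketch below uses only the Python standard library and assumes that the second argument of \(\operatorname{Normal}\) is a standard deviation. The helper names are hypothetical, and this is not necessarily how the priors above were originally tuned.

```python
import math
import random
import statistics


def sample_poisson(theta, rng):
    """Draw one Poisson variate with Knuth's method (adequate for small rates)."""
    limit = math.exp(-theta)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def prior_predictive_goals(n_draws, rng):
    """Simulate goals for one team in one match, n_draws times, from the priors."""
    goals = []
    for _ in range(n_draws):
        intercept = rng.gauss(1.0, 0.4)       # assumed: 0.4 is a standard deviation
        attack_std = rng.uniform(0.0, 1.0)
        defence_std = rng.uniform(0.0, 1.0)
        attack = rng.gauss(0.0, attack_std)    # attack of the scoring team
        defence = rng.gauss(0.0, defence_std)  # defence of the opposing team
        theta = math.exp(attack - defence + intercept)
        goals.append(sample_poisson(theta, rng))
    return goals


rng = random.Random(0)  # fixed seed so the run is reproducible
goals = prior_predictive_goals(2000, rng)
print("median goals:", statistics.median(goals))
print("share of draws with more than 10 goals:", sum(g > 10 for g in goals) / len(goals))
```

If a large share of simulated matches ended with implausibly many goals, that would suggest the priors are too wide.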

At this point the attack and defence parameters are not identifiable. In other words, they can all be shifted by the same arbitrary amount without changing the rates, since \(\text{attack}_{i} - \text{defence}_{j} = (\text{attack}_{i} + \alpha)- (\text{defence}_{j} + \alpha)\) for any teams \(i\) and \(j\) and all values of \(\alpha\). To enforce identifiability we add soft constraints to the parameters:

\begin{align} \sum_i \text{attack}_i \sim \operatorname{Normal} \left( 0, \epsilon \right) \qquad \sum_i \text{defence}_i \sim \operatorname{Normal} \left( 0, \epsilon \right) \end{align}
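The following sketch illustrates both points numerically: shifting every attack and defence value by the same \(\alpha\) leaves all log-rates unchanged, while the soft sum-to-zero constraint strongly penalizes the shifted version. The team names, strengths and the value `epsilon = 0.01` are made up for this example; the article does not state which \(\epsilon\) was used.

```python
import math

intercept = 1.0
# Made-up strengths for three hypothetical teams; both sets sum to roughly zero.
attack = {"A": 0.3, "B": -0.1, "C": -0.2}
defence = {"A": 0.2, "B": 0.1, "C": -0.3}


def log_rate(attack, defence, scorer, opponent):
    """Log of the Poisson rate for `scorer` playing against `opponent`."""
    return attack[scorer] - defence[opponent] + intercept


def normal_logpdf(x, mu, sigma):
    """Log-density of a normal distribution with mean mu and std sigma."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


alpha = 5.0
attack_shifted = {team: value + alpha for team, value in attack.items()}
defence_shifted = {team: value + alpha for team, value in defence.items()}

# The likelihood only sees the log-rates, and those are unchanged by the shift.
rates_unchanged = all(
    math.isclose(
        log_rate(attack, defence, s, o),
        log_rate(attack_shifted, defence_shifted, s, o),
        abs_tol=1e-12,
    )
    for s in attack
    for o in attack
    if s != o
)
print(rates_unchanged)  # True

# The soft constraint distinguishes the two parameterizations.
epsilon = 0.01  # hypothetical value
penalty_original = normal_logpdf(sum(attack.values()), 0.0, epsilon)
penalty_shifted = normal_logpdf(sum(attack_shifted.values()), 0.0, epsilon)
print(penalty_original > penalty_shifted)  # True
```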

The results

For each team \(i\), the figure below shows the median of \(\text{attack}_i + \text{defence}_i\), along with intervals for \(50\%\) and \(90\%\) of the probability density. The teams are sorted by how far they made it in the tournament. For instance, “Molde” won, “Bodø/Glimt” made it to the final match but lost, “Viking” and “Strømsgodset” made it to the semifinals but lost, and so forth.

Notice that the model is more certain of the strength of teams that made it further in the tournament. This makes sense, since there is more data on teams that made it far.

Let’s examine two teams that are outliers, in the sense that the model disagrees with the tournament results.

What happened to “Staal”? Even though “Staal” lost its first match, the model thinks it’s a better team than many who won the first match. How can this be? Let’s examine the data:

home_name   away_name     home_goals   away_goals
Staal       Viking                  2            3

“Staal” lost to “Viking”, but only barely, and “Viking” went all the way to the semifinals. From this information the model infers that “Staal” is actually quite a strong team. They were unlucky to meet “Viking”; had they met a more average team, they would likely have made it further in the tournament.

What happened to “Harstad”?

The model ranks “Harstad” as a poor team, even though the team made it to round \(3\). Again, we look at the match outcomes for insights:

home_name   away_name     home_goals   away_goals
Harstad     Melbo                  2            0
Harstad     Senja                  2            2
Harstad     Brann                  0            5

“Harstad” won against Melbo, but not by very much. Then they tied with Senja and advanced only by winning the penalty shoot-out. Finally, they were crushed by Brann, who in turn only made it one round further. All of this is evidence that “Harstad” is not a particularly strong team.

Interpreting the latent variables  \(\text{attack}_i\) and \(\text{defence}_i\) can be a bit challenging. The figure below shows the result of simulating matches against a “typical” team, i.e. a team with \(\text{attack}_i = \text{defence}_i = 0\). The green bar indicates the probability of winning, the yellow bar the probability of a tie, and the red bar the probability of losing.
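For a single fixed pair of strengths, the same three probabilities can also be computed without simulation by summing over scorelines of two independent Poisson variables. The sketch below does that with a truncated sum. It uses made-up strengths and an intercept of \(1\), and it ignores posterior uncertainty, so it illustrates the idea rather than reproducing the figure.

```python
import math


def poisson_pmf(y, theta):
    """Probability of y goals under a Poisson distribution with rate theta."""
    return theta**y * math.exp(-theta) / math.factorial(y)


def outcome_probabilities(attack, defence, intercept, max_goals=40):
    """Return (win, tie, loss) probabilities against a team with attack = defence = 0."""
    theta_team = math.exp(attack + intercept)        # typical opponent has defence 0
    theta_opponent = math.exp(-defence + intercept)  # typical opponent has attack 0
    win = tie = loss = 0.0
    for goals_team in range(max_goals + 1):
        p_team = poisson_pmf(goals_team, theta_team)
        for goals_opponent in range(max_goals + 1):
            p = p_team * poisson_pmf(goals_opponent, theta_opponent)
            if goals_team > goals_opponent:
                win += p
            elif goals_team == goals_opponent:
                tie += p
            else:
                loss += p
    return win, tie, loss


# Made-up strengths: a typical team and a clearly stronger one.
typical_result = outcome_probabilities(attack=0.0, defence=0.0, intercept=1.0)
strong_result = outcome_probabilities(attack=0.5, defence=0.5, intercept=1.0)
print(typical_result)  # win and loss probabilities are equal by symmetry
print(strong_result)   # win probability exceeds loss probability
```

Truncating at `max_goals = 40` drops only a negligible amount of probability mass for rates of this size.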

Finally, here’s a plot showing the attack strength and defence strength of each team, with colors indicating how far the team made it in the tournament.

Summary

This article applies some new data to an old model. We removed home field advantage from the original model, since the weaker team gets to play on their home turf. Home field advantage would therefore simply indicate which team is considered the weakest.

There’s a discrepancy between the strength inferred by the model and the strength implied by the tournament results alone (how far a team got). I believe the model more accurately reflects the strength of teams, since it considers by how much a team lost or won.

For references, see my original article on modeling football matches.