The 2022 Norwegian Football Elite Series
- 10. November 2022
- #statistics
I have upgraded the football model that I created two years ago. In this article I train the model on 2022 Elite Series (\(16\) teams playing \(240\) matches). We will predict results for the upcoming matches, visualize the results, examine the structure of the model, dicuss computation and showcase the code.
There are still \(8\) remaining matches, to be played on the 13th of November. Here are the predictions for those matches:
home_team | away_team | home_win | draw | away_win |
---|---|---|---|---|
Kristiansund | Jerv | 46.8 | 23 | 30.2 |
Lillestrøm | HamKam | 49.7 | 24.3 | 26.1 |
Strømsgodset | Bodø/Glimt | 25.6 | 19.9 | 54.6 |
Rosenborg | Sarpsborg 08 | 60.1 | 18.6 | 21.3 |
Sandefjord Fotball | Haugesund | 28.4 | 22.4 | 49.1 |
Tromsø | Aalesund | 46.8 | 24.1 | 29.1 |
Viking | Odd | 39.8 | 23.2 | 37 |
Vålerenga | Molde | 28.9 | 21.8 | 49.3 |
Results
Molde is the strongest team. The figure below shows the estimated attack and defence skill of each team, with an ellipsoid indicating the correlations within each team. See the model specification below for details on how attack and defence and other variables are used in the statistical model.
We can simulate matches against Molde to get a feeling for how strong the team is, relative to other teams. For instance, the chance that Jerv would win against Molde is approximately \(10 \, \%\).
It’s also interesting to look at \([\text{attack}_t, \text{defence}_t, \text{home}_t ]^{T}\) in one figure. Across all teams there is positive correlation between each attribute. A team that has high attack likely has high defence and high home advantage too.
But within each team the uncertainty plays out differently. We see that \(\text{attack}_t\) and \(\text{defence}_t\) are positively correlated (ellipsoids point to the northeast). The outcomes can be explained rather well by adding to both the attack and the defence.
Both \(\text{attack}_t\) and \(\text{defence}_t\) and negatively correlated with \(\text{home}_t\), since for instance in the equation
we can add to \(\text{defence}_{\text{a}[i]}\) if we subtract from \(\text{home}_{\text{h}[i]}\) without perturbing the model too much.
Model
The generalized linear model uses a logarithmic link function with a Poisson distribution.
In the model above, \(i\) is the match and \(h[i]\) gives the team that played home in match \(i\). Similarly, \(h[i]\) gives the team that played away in match \(i\).
For every team \(t\), the attack, defence and home advantage are drawn from a multivariate normal distribution. The correlation matrix \(C\) is given a Lewandowski-Kurowicka-Joe (LKJ) prior, and the standard deviation \(\sigma\) is given a normal prior. This specifies a multilevel model. The hyperpriors were set using prior predictive checks.
The full code is included at the bottom of this article.
Computations
All chains run with no problems:
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
No divergent transitions found.
Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.
Effective sample size satisfactory.
Split R-hat values satisfactory all parameters.
Processing complete, no problems detected.
The trace plots for all \(8\) chains all look good. The model runs quickly, running with
iter_warmup=10_000
iter_sampling=10_000
only took 43.9 seconds.
Stan code
The code was run on CmdStan version 2.30.1. I make use of the multivariate version of the non-centered parameterization to make the sampling more efficient. Read more about it in the section on multivariate priors for hierarchical models in the Stan docs.
// Bayesian football model by @tommyod, november 2022
// https://tommyodland.com/articles/2022/the-2022-norwegian-football-elite-series
data {
int<lower=0> num_matches;
int<lower=0> num_teams;
array[num_matches] int<lower=0> home_goals;
array[num_matches] int<lower=0> away_goals;
// Mappings from match number to team number
array[num_matches] int<lower=1> home_team;
array[num_matches] int<lower=1> away_team;
}
parameters {
real intercept;
cholesky_factor_corr[3] cholesky_factor_C;
vector<lower=0>[3] sigma;
array[num_teams] vector[3] alpha;
}
transformed parameters{
array[num_matches] real<lower=0> theta_home;
array[num_matches] real<lower=0> theta_away;
array[num_teams] vector[3] att_def_home;
array[num_teams] real attack;
array[num_teams] real defence;
array[num_teams] real home;
for (i in 1:num_teams){
// Equivalent to:
// att_def_home[i] ~ multi_normal_cholesky([0, 0, 0]',
// diag_pre_multiply(sigma, cholesky_factor_C));
att_def_home[i] = diag_pre_multiply(sigma, cholesky_factor_C) * alpha[i];
// For notational convenience in theta definitions below
attack[i] = att_def_home[i][1];
defence[i] = att_def_home[i][2];
home[i] = att_def_home[i][3];
}
for (i in 1:num_matches){
// Goals scored by the home team
theta_home[i] = exp( attack[home_team[i]]
- defence[away_team[i]]
+ intercept
+ home[home_team[i]] );
// Goals scored by the away team
theta_away[i] = exp( attack[away_team[i]]
- defence[home_team[i]]
+ intercept
- home[home_team[i]] );
}
}
model {
// Priors were chosen using prior predictive checks
intercept ~ normal(0.4, 0.1);
sigma ~ normal(0.15, 0.05);
// equal to: cholesky_factor_C * cholesky_factor_C' ~ lkj_corr(1);
cholesky_factor_C ~ lkj_corr_cholesky(1);
for (i in 1:num_teams){
alpha[i] ~ std_normal();
}
for (i in 1:num_matches){
home_goals[i] ~ poisson(theta_home[i]);
away_goals[i] ~ poisson(theta_away[i]);
}
}