The 2022 Norwegian Football Elite Series

I have upgraded the football model that I created two years ago. In this article I train the model on 2022 Elite Series (\(16\) teams playing \(240\) matches). We will predict results for the upcoming matches, visualize the results, examine the structure of the model, dicuss computation and showcase the code.

There are still \(8\) remaining matches, to be played on the 13th of November.     Here are the predictions for those matches:

home_team           away_team       home_win   draw   away_win
Kristiansund       Jerv               46.8   23         30.2
Lillestrøm         HamKam             49.7   24.3       26.1
Strømsgodset       Bodø/Glimt         25.6   19.9       54.6
Rosenborg           Sarpsborg 08       60.1   18.6       21.3
Sandefjord Fotball Haugesund           28.4   22.4       49.1
Tromsø             Aalesund           46.8   24.1       29.1
Viking             Odd                 39.8   23.2       37  
Vålerenga           Molde               28.9   21.8       49.3

Results

Molde is the strongest team. The figure below shows the estimated attack and defence skill of each team, with an ellipsoid indicating the correlations within each team. See the model specification below for details on how attack and defence and other variables are used in the statistical model.

We can simulate matches against Molde to get a feeling for how strong the team is, relative to other teams. For instance, the chance that Jerv would win against Molde is approximately \(10 \, \%\).

It’s also interesting to look at \([\text{attack}_t, \text{defence}_t, \text{home}_t ]^{T}\) in one figure. Across all teams there is positive correlation between each attribute. A team that has high attack likely has high defence and high home advantage too.

But within each team the uncertainty plays out differently. We see that \(\text{attack}_t\) and \(\text{defence}_t\) are positively correlated (ellipsoids point to the northeast). The outcomes can be explained rather well by adding to both the attack and the defence.

Both \(\text{attack}_t\) and \(\text{defence}_t\) and negatively correlated with \(\text{home}_t\), since for instance in the equation

\begin{align} \theta^\text{h} = \exp \left( \text{intercept} + \text{attack}_{\text{h}[i]} - \text{defence}_{\text{a}[i]} + \text{home}_{\text{h}[i]} \right) \end{align}

we can add to \(\text{defence}_{\text{a}[i]}\) if we subtract from \(\text{home}_{\text{h}[i]}\) without perturbing the model too much.

Model

The generalized linear model uses a logarithmic link function with a Poisson distribution.

\begin{align} \text{home_goals}_i &\sim \text{poisson} \left( \theta^\text{h}_i \right) \\ \text{away_goals}_i &\sim \text{poisson} \left( \theta^\text{a}_i \right) \\ \theta^\text{h}   &= \exp \left( \text{intercept} + \text{attack}_{\text{h}[i]} - \text{defence}_{\text{a}[i]} + \text{home}_{\text{h}[i]} \right) \\ \theta^\text{a}  &= \exp \left( \text{intercept} +  \text{attack}_{\text{a}[i]} - \text{defence}_{\text{h}[i]} - \text{home}_{\text{h}[i]} \right) \end{align}

In the model above, \(i\) is the match and \(h[i]\) gives the team that played home in match \(i\). Similarly, \(h[i]\) gives the team that played away in match \(i\).

For every team \(t\), the attack, defence and home advantage are drawn from a multivariate normal distribution. The correlation matrix \(C\) is given a Lewandowski-Kurowicka-Joe (LKJ) prior, and the standard deviation \(\sigma\) is given a normal prior. This specifies a multilevel model. The hyperpriors were set using prior predictive checks.

\begin{align} [\text{attack}_t, \text{defence}_t, \text{home}_t ]^{T} &\sim \operatorname{normal} \left( \mathbf{0}, \Sigma \right) \\ \Sigma &=  \operatorname{diag}(\sigma) \, C \, \operatorname{diag}(\sigma) \\ \sigma &\sim \operatorname{normal}(0.15, 0.05) \\ C &\sim \operatorname{lkj\_corr}(1) \\ \text{intercept} &\sim \operatorname{normal}(0.4, 0.1) \end{align}

The full code is included at the bottom of this article.

Computations

All chains run with no problems:

Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Effective sample size satisfactory.

Split R-hat values satisfactory all parameters.

Processing complete, no problems detected.

The trace plots for all \(8\) chains all look good. The model runs quickly, running with

  • iter_warmup=10_000
  • iter_sampling=10_000

only took 43.9 seconds.

Stan code

The code was run on CmdStan version 2.30.1. I make use of the multivariate version of the non-centered parameterization to make the sampling more efficient. Read more about it in the section on multivariate priors for hierarchical models in the Stan docs.

// Bayesian football model by @tommyod, november 2022
// https://tommyodland.com/articles/2022/the-2022-norwegian-football-elite-series

data {        
    int<lower=0> num_matches;
    int<lower=0> num_teams;

    array[num_matches] int<lower=0> home_goals;
    array[num_matches] int<lower=0> away_goals;

    // Mappings from match number to team number
    array[num_matches] int<lower=1> home_team;
    array[num_matches] int<lower=1> away_team;
}

parameters {
    real intercept;
    cholesky_factor_corr[3] cholesky_factor_C;  
    vector<lower=0>[3] sigma;
    array[num_teams] vector[3] alpha;
}

transformed parameters{
    array[num_matches] real<lower=0> theta_home;
    array[num_matches] real<lower=0> theta_away;

    array[num_teams] vector[3] att_def_home;

    array[num_teams] real attack;
    array[num_teams] real defence;
    array[num_teams] real home;
    
    for (i in 1:num_teams){
        // Equivalent to: 
        // att_def_home[i] ~ multi_normal_cholesky([0, 0, 0]', 
        //                   diag_pre_multiply(sigma, cholesky_factor_C));
        att_def_home[i] = diag_pre_multiply(sigma, cholesky_factor_C) * alpha[i];
        
        // For notational convenience in theta definitions below
        attack[i]  = att_def_home[i][1];
        defence[i] = att_def_home[i][2];
        home[i]    = att_def_home[i][3];
    }
  
    for (i in 1:num_matches){
        // Goals scored by the home team
        theta_home[i] = exp( attack[home_team[i]] 
                           - defence[away_team[i]] 
                           + intercept 
                           + home[home_team[i]] );

        // Goals scored by the away team
        theta_away[i] = exp( attack[away_team[i]] 
                           - defence[home_team[i]] 
                           + intercept 
                           - home[home_team[i]] );
    }
}

model {
    // Priors were chosen using prior predictive checks 
    intercept ~ normal(0.4, 0.1);
    sigma ~ normal(0.15, 0.05);
    // equal to: cholesky_factor_C * cholesky_factor_C' ~ lkj_corr(1); 
    cholesky_factor_C ~ lkj_corr_cholesky(1);

    for (i in 1:num_teams){
      alpha[i] ~ std_normal();
    }

    for (i in 1:num_matches){
        home_goals[i] ~ poisson(theta_home[i]);
        away_goals[i] ~ poisson(theta_away[i]);
    }
}