Betting on football with the Kelly criterion
- 3. December 2023
- #optimization
In this article we’ll use mathematical optimization to place bets on football matches.
In 2022 I created a Bayesian model that predicts the outcomes of football (soccer) matches. The natural next step is to try to place bets on matches.
Let’s be concrete. Suppose we have a matrix \(\boldsymbol{p}\) of probabilities from a statistical model and a matrix \(\boldsymbol{b}\) of odds. Here is the input data and the solution that we’ll work our way up to in this article.
Each row represents a match.
The matrix \(\boldsymbol{p}\) contains the probabilities for each outcome (obtained by simulating from the Bayesian model), the matrix \(\boldsymbol{b}\) represents the odds (European style decimal) and \(\boldsymbol{w}^{\star}\) the solution.
The first column of each matrix corresponds to “home” probabilities and odds, the second column to “draw” and the third column to “away”.
The solution is the fraction of total wealth to bet on each outcome. Here we bet \(\sum_{ij} w_{ij} = 92.4 \%\) of our total wealth. That’s a lot of our money, but remember that it’s spread out on \(m=8\) matches.
If we place the bets above, the growth rate that will multiply our money has the following distribution (assuming the model probabilities are correct!):
The rest of the article explains how we reached the solution above. First we’ll discuss the Kelly criterion, then we’ll generalize it to football, investigate the properties of the optimization problem, solve it using a general-purpose solver in CVXPY, and finally solve it using gradient sampling and the Adam optimization algorithm.
Standard Kelly
The Kelly criterion maximizes long-term wealth growth over an infinite number of identical rounds of betting. In this section we introduce notation and explain the simplest possible case. Consider a bet with European style odds \(b\). You place \(w\) units of money on the bet.
- If you win (with probability \(p\)), your net gain is \((b-1)w\) units of money.
- If you lose (with probability \(1-p\)), your net gain is \(-w\).
Assume that you have \(1\) unit of money, and that \(w\) is a real number between zero and one. In other words, \(w\) represents the fraction of your wealth that you put into the bet. What is the optimal \(w\)? It depends on whether you get to bet once or many times.
A single bet. If you only get to bet once, then maximizing the expected net gain
\[\mathbb{E}[\text{gain}] = p (b-1) w - (1-p) w\]
is a natural course of action. For instance, if \(b=100\) and \(p=0.6\), you would probably bet a large fraction \(w\).
Several bets. If you get to bet several times, the answer should change. Consider again the situation with \(b=100\) and \(p=0.6\). If you bet \(w=1\) you have a \(40 \%\) chance of losing everything on the first bet! If that happens you would not be able to take advantage of the coming opportunities to bet again.
Infinitely many bets. Consider the limit where you get to bet infinitely many times. What should \(w\) be then? This is what Kelly answered in his 1956 paper1. To derive the answer, we must:
- Determine the factors that multiply our wealth after each turn
- Take the geometric average of these factors, then take logarithms
- Take the limit as the number of bets goes to infinity, to get the expected value
- Maximize the function (the expected value)
Factors. After a single bet, our initial wealth is multiplied by \((1 + (b-1)w)\) if we win, and by \((1 - w)\) if we lose. Assume we are given \(N\) bets and we win \(W\) times and lose \(L\) times. The multiplicative factors that act on our initial wealth are
\[(1 + (b-1)w)^W (1 - w)^L.\]
Geometric average. Recall that the geometric average of factors \(r_1, r_2, \ldots, r_N\) is given by \((r_1 \times r_2 \times \cdots \times r_N)^{1/N}\).
The geometric average of our factors is \(R(w) = \left [ (1 + (b-1)w)^W (1 - w)^L \right]^{1/N}\). We maximize \(R(w)\) to maximize the growth rate. It’s easier to work with logarithms, so we aim to maximize
\[\ln R(w) = \frac{W}{N} \ln (1 + (b-1)w) + \frac{L}{N} \ln (1 - w).\]
Taking the limit. Taking limits, we observe that \(W/N \to p\) and \(L/N \to (1-p)\). We have
\[\mathbb{E} \left[ \ln R(w) \right] = p \ln (1 + (b-1)w) + (1-p) \ln (1 - w).\]
Maximizing the function. Differentiating \(\mathbb{E} \left[ \ln R(w) \right]\) with respect to \(w\), setting it to zero and solving, we obtain the analytical solution
\[w^{\star} = \frac{pb - 1}{b - 1}.\]
For instance, with a sixty percent chance of doubling our money on every bet, we have \(p = 0.6\), \(b=2\) and \(w^{\star} = 0.2\). We should bet \(20 \, \%\) of our wealth at each bet. Simulations of this case are shown in the figure below.
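To make this concrete, here is a minimal simulation sketch of repeated betting at the Kelly fraction. This is my illustration, not the article's original code; the number of gamblers and rounds are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
p, b = 0.6, 2.0
w = (p * b - 1) / (b - 1)  # Kelly fraction, here 0.2

wealth = np.ones(1000)  # 1000 independent gamblers with 1 unit of money each
for _ in range(500):    # 500 rounds of betting
    wins = rng.random(1000) < p
    wealth *= np.where(wins, 1 + (b - 1) * w, 1 - w)

# Expected log-growth per round: 0.6*ln(1.2) + 0.4*ln(0.8) ~ 0.0201
print(np.median(wealth))
```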
A generalization for football
We now generalize Kelly’s idea to football, where each match has multiple mutually exclusive outcomes (home, draw, away) and we place simultaneous bets on multiple matches.
Data. When betting on football there are \(m\) matches, indexed by \(i=1, \ldots, m\), and \(3\) outcomes per match, indexed by \(j=0, 1, 2\). The outcomes are a home win, a draw and a win for the away team. Construct an \(m \times 3\) table with probabilities \(p_{ij}\) (from a football model or from a panel of experts) and odds \(b_{ij}\). There are \(3^m\) outcomes in total, instead of \(2\) as in standard Kelly.
Decision variables. We create a table of weights \(w_{ij}\) that denote the fraction that we bet on match \(i\) and outcome \(j\). We impose constraints \(w_{ij} \geq 0\) and \(\sum_i \sum_j w_{ij} \leq 1\).
The multiplicative factor for an outcome. Let \(\boldsymbol{o} \in \mathbb{Z}_3^m\) denote an outcome. For instance, the outcome \((0, 2)\) means that there are \(m=2\) matches, since the length of the outcome tuple is two, and the first match resulted in a home win (denoted by the \(0\)) and the second match resulted in an away win (denoted by \(2\)).
If \(\boldsymbol{o}_i\) equals \(j\), then match \(i\) results in outcome \(j\). The factor that multiplies the original wealth becomes
\[r_{\boldsymbol{o}}(\boldsymbol{w}) = 1 - \sum_{i=1}^{m} \sum_{j=0}^{2} w_{ij} + \sum_{i=1}^{m} \sum_{j=0}^{2} [\boldsymbol{o}_i = j] \, b_{ij} w_{ij},\]
where \([ \text{argument} ]\) is the Iverson bracket, evaluating to \(1\) if the condition is true and \(0\) otherwise.
The probability of an outcome. The probability of an outcome is \(\prod_{i=1}^m p_{i, \boldsymbol{o}_i}\). For instance, the probability of the outcome \((0, 2)\) is \(p_{1, \boldsymbol{o}_1} \times p_{2, \boldsymbol{o}_2} = p_{1, 0} \times p_{2, 2}\). The notation is dense, so it might be helpful for the reader to work out a few examples by hand.
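As a worked example, the snippet below computes the wealth factor and the probability of the outcome \((0, 2)\) for \(m=2\) matches. The numbers are made up for illustration.

```python
import numpy as np

p = np.array([[0.5, 0.3, 0.2],    # match 1: home / draw / away probabilities
              [0.4, 0.3, 0.3]])   # match 2
b = np.array([[2.2, 3.4, 4.1],    # decimal odds, made up for illustration
              [2.9, 3.3, 3.0]])
w = np.array([[0.10, 0.0, 0.00],  # bet 10% on home in match 1
              [0.00, 0.0, 0.05]]) # bet 5% on away in match 2

o = (0, 2)  # match 1 -> home win (j=0), match 2 -> away win (j=2)
prob = p[0, o[0]] * p[1, o[1]]  # P(o) = p_{1,0} * p_{2,2} = 0.5 * 0.3 = 0.15
factor = 1 - w.sum() + b[0, o[0]] * w[0, o[0]] + b[1, o[1]] * w[1, o[1]]
print(prob, factor)  # 0.15 and 1 - 0.15 + 0.22 + 0.15 = 1.22
```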
Geometric average. Combining the multiplicative factors and the outcome probabilities, and assuming we bet \(N\) times, the geometric average of the factors is
\[R(\boldsymbol{w}) = \left[ \prod_{\boldsymbol{o} \in \mathcal{O}} r_{\boldsymbol{o}}(\boldsymbol{w})^{T_{\boldsymbol{o}}} \right]^{1/N},\]
where \(T_{\boldsymbol{o}}\) is the number of times outcome \(\boldsymbol{o}\) happens over the \(N\) bets.
Taking the logarithm and the limit. As \(N \to \infty\), we have \(T_{\boldsymbol{o}} / N \to \prod_{i=1}^m p_{i, \boldsymbol{o}_i}\). Taking the logarithm and then the limit, we obtain
\[\mathbb{E} \left[ \ln R(\boldsymbol{w}) \right] = \sum_{\boldsymbol{o} \in \mathcal{O}} \left( \prod_{i=1}^m p_{i, \boldsymbol{o}_i} \right) \ln r_{\boldsymbol{o}}(\boldsymbol{w}).\]
This is the objective function we maximize, subject to the constraints (1) \(w_{ij} \geq 0\) and (2) \(\sum_i \sum_j w_{ij} \leq 1\).
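For reference, a brute-force sketch of the objective in NumPy might look as follows. The function name is mine, and the \(3^m\) enumeration is only feasible for small \(m\).

```python
import itertools
import numpy as np

def expected_log_growth(w, p, b):
    """E[ln R(w)] by enumerating all 3^m outcomes; p, b, w are (m, 3)."""
    m, k = p.shape
    rows = np.arange(m)
    total = 0.0
    for o in itertools.product(range(k), repeat=m):
        o = np.array(o)
        prob = np.prod(p[rows, o])                           # P(outcome)
        r = 1.0 - w.sum() + (b[rows, o] * w[rows, o]).sum()  # wealth factor
        total += prob * np.log(r)
    return total
```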
Properties
We will now state some properties of the function \(\mathbb{E} \left[ \ln R(\boldsymbol{w}) \right]\).
Convexity. The objective \(\mathbb{E} \left[ \ln R(\boldsymbol{w}) \right]\) is concave. Inside each logarithm there is a sum of linear terms, i.e., an affine function. The logarithm of an affine function is concave, and the weighted sum of these logarithms (with nonnegative probabilities as weights) is a sum of concave functions, which is concave.
The constraints (1) \(w_{ij} \geq 0\) and (2) \(\sum_i \sum_j w_{ij} \leq 1\) also form a convex set. In summary, we’re maximizing a concave function over a convex set, so it’s a convex optimization problem. This has favorable implications: every local maximum is a global maximum, the Hessian of the objective is negative semidefinite, and we can use specialized software designed for convex optimization.
Leverage. Individual positive expected value is not a prerequisite for betting. Consider the problem instance below:
Treating each outcome as independent (they are not!), we obtain the expected values
Notice that the third outcome has negative expected value if treated as independent, yet we still bet on it in the solution \(w^{\star}\).
The independent solutions are to bet zero on outcomes two and three, and bet \(0.212\) on outcome one. This would give an expected growth rate of \(1.0437\) on the combined problem. But the optimal solution that takes dependence into account yields an expected growth rate of \(1.0554\).
This is leveraging. Kelly would use outcome three as leverage against outcome one, enabling him to increase the expected geometric growth rate.
More matches means more money bet. Consider the coin toss problem below, with \(m=2\) rows. The solution, obtained using CVXPY for convex optimization (more on that later), is also shown.
Could it be that bets can be decomposed? In other words, is the solution to a problem with two bets just the solution of each individual bet placed side by side? It turns out that the answer is no: the single coin-toss problem has solution \(w^{\star} = 0.5\), while the joint problem bets \(0.384\) on each coin, and \(0.5 \times 2 = 1 \neq 0.384 \times 2\).
Dependence between bets. Could each row be solved individually and then scaled? It turns out that the answer is no. Consider the two individual problems
and compare them to the joint problem
In the two individual problems, twice as much is bet on the second coin as on the first (\(0.4\) vs \(0.2\)). But in the joint problem we bet \(0.391\) on the second coin, which is more than twice the \(0.155\) bet on the first. The optimal ratio between the bets changes, so individual solutions cannot simply be scaled.
Having examined the properties of the problem a little bit, we now proceed to solve it.
Solution by CVXPY
Since the problem is convex, we can hand it to CVXPY. While problems with \(m < 8\) matches are solved quickly, we run into trouble as \(m\) increases. A problem with \(m\) matches has \(3^m\) terms in the objective function, so function evaluation and solution times both grow exponentially. This is shown in the figure below.
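Concretely, a minimal CVXPY sketch of the formulation might look like the following. This is my reconstruction from the stated objective and constraints, not the article's original code; names are illustrative.

```python
import itertools
import numpy as np
import cvxpy as cp

def solve_kelly_cvxpy(p, b):
    """Maximize E[ln R(w)] subject to w >= 0 and sum(w) <= 1."""
    m, k = p.shape
    w = cp.Variable((m, k), nonneg=True)  # constraint (1): w_ij >= 0
    terms = []
    for o in itertools.product(range(k), repeat=m):  # all 3^m outcomes
        prob = np.prod([p[i, o[i]] for i in range(m)])
        # Wealth factor: keep 1 - sum(w), and winning bets pay b_ij * w_ij
        payoff = 1 - cp.sum(w) + sum(b[i, o[i]] * w[i, o[i]] for i in range(m))
        terms.append(prob * cp.log(payoff))  # concave in w
    problem = cp.Problem(cp.Maximize(sum(terms)), [cp.sum(w) <= 1])
    problem.solve()
    return w.value
```

Each term is a nonnegative constant times the logarithm of an affine expression, so the problem is DCP-compliant and any conic solver that handles the exponential cone can solve it.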
Can we solve the problem faster?
Gradient sampling
Recall the expression for the expected value of the logarithm of the random variable \(R(\boldsymbol{w})\):
\[\mathbb{E} \left[ \ln R(\boldsymbol{w}) \right] = \sum_{\boldsymbol{o} \in \mathcal{O}} \left( \prod_{i=1}^m p_{i, \boldsymbol{o}_i} \right) \ln r_{\boldsymbol{o}}(\boldsymbol{w}).\]
Just like the expected value, the gradient is a linear operator, so we can exchange the order of the operations. In mathematical notation, we have
\[\nabla_{\boldsymbol{w}} \, \mathbb{E} \left[ \ln R(\boldsymbol{w}) \right] = \mathbb{E} \left[ \nabla_{\boldsymbol{w}} \ln R(\boldsymbol{w}) \right].\]
The gradient \(\nabla_{\boldsymbol{w}} \ln R(\boldsymbol{w})\) is an \(m \times 3\) matrix. For a given outcome \(\boldsymbol{o}\) it has entries
\[\left( \nabla_{\boldsymbol{w}} \ln r_{\boldsymbol{o}}(\boldsymbol{w}) \right)_{ij} = \frac{[\boldsymbol{o}_i = j] \, b_{ij} - 1}{r_{\boldsymbol{o}}(\boldsymbol{w})}.\]
We can approximate the expected value of the gradient by sampling \(S\) outcomes \(\boldsymbol{o}_s\) from the outcome space \(\mathcal{O}\) and taking the mean value:
\[\mathbb{E} \left[ \nabla_{\boldsymbol{w}} \ln R(\boldsymbol{w}) \right] \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_{\boldsymbol{w}} \ln r_{\boldsymbol{o}_s}(\boldsymbol{w}).\]
Of course we have to sample outcomes using their respective probabilities, not uniformly. See the paper by Busseti et al.2 for more information.
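A NumPy sketch of the sampled-gradient estimator is shown below. The function name and signature are mine; outcomes are drawn match by match from the model probabilities, as required.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_gradient(w, p, b, S=4096):
    """Monte Carlo estimate of E[grad ln R(w)]; p, b, w are (m, 3)."""
    m, k = p.shape
    total = np.zeros_like(w)
    for _ in range(S):
        o = [rng.choice(k, p=p[i]) for i in range(m)]  # sample an outcome
        mask = np.zeros((m, k))
        mask[np.arange(m), o] = 1.0               # Iverson bracket [o_i = j]
        r = 1.0 - w.sum() + (mask * b * w).sum()  # wealth factor r_o(w)
        total += (mask * b - 1.0) / r             # gradient of ln r_o(w)
    return total / S
```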
We implement the Adam algorithm4 for stochastic gradient descent. The solution could move out of the constraint set, so we project it back into the feasible set if that happens.
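The projection onto the feasible set \(\{\boldsymbol{w} : w_{ij} \geq 0, \sum_{ij} w_{ij} \leq 1\}\) can be computed with the standard sort-based simplex projection of Duchi et al.; the sketch below is my implementation of that idea, not the article's code.

```python
import numpy as np

def project(w):
    """Euclidean projection onto {w : w >= 0, sum(w) <= 1}."""
    v = np.maximum(w, 0.0)
    if v.sum() <= 1.0:
        return v  # already feasible after clipping negatives
    # Otherwise the budget constraint is active: project onto the
    # simplex {w >= 0, sum(w) = 1} with the sort-based algorithm.
    u = np.sort(w.ravel())[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, u.size + 1)
    rho = np.nonzero(u - (css - 1.0) / idx > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(w - theta, 0.0)
```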
Running Adam and aborting if no progress has been made in \(10\) iterations, we obtain the following results. The main parameters to tune are:
- the number of gradient samples per iteration
- the step size
- the beta-parameter (momentum)
We set the number of gradient samples to \(4096\) and try out a couple of parameter choices.
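Putting the pieces together, a projected-Adam loop might look like the sketch below, reusing `sampled_gradient` and `project` from the sketches above. The parameter defaults are illustrative, not the article's exact settings.

```python
import numpy as np

def kelly_adam(p, b, steps=500, lr=0.01, beta1=0.9, beta2=0.999,
               eps=1e-8, S=4096):
    """Projected Adam ascent on the sampled gradient of E[ln R(w)]."""
    w = np.zeros_like(p)    # initial solution w = 0 (bet nothing)
    m_t = np.zeros_like(w)  # first moment estimate
    v_t = np.zeros_like(w)  # second moment estimate
    for t in range(1, steps + 1):
        g = sampled_gradient(w, p, b, S)  # stochastic ascent direction
        m_t = beta1 * m_t + (1 - beta1) * g
        v_t = beta2 * v_t + (1 - beta2) * g**2
        m_hat = m_t / (1 - beta1**t)      # bias-corrected moments
        v_hat = v_t / (1 - beta2**t)
        w = project(w + lr * m_hat / (np.sqrt(v_hat) + eps))  # ascent step
    return w
```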
Notice the oscillations that appear when the step size is too large. The problem has narrow high-dimensional valleys (and this gets more severe the more decision variables we have). Exchanging iterations on the horizontal axis with time, we get the following figure.
In a few seconds we obtain solutions with a relative error of less than \(0.01 \%\) compared to CVXPY. By contrast, CVXPY takes around \(7\) minutes.
Starting point. A natural initial solution (used in the plots above) is \(\boldsymbol{w} = \boldsymbol{0}\). In many cases, however, we can obtain a fantastic initial solution by setting
where \(\Pi\) is the projection onto the simplex defined by the constraints. On this particular problem, this initial solution gives an expected geometric growth rate of \(1.2588\) (compared to the growth rate of \(\boldsymbol{w}^{\star}\), which is \(1.2668\)).
Large problems. The gradient sampling algorithm can also sample objective function values. This enables us to solve problems with large \(m\), where even objective function evaluation is too expensive due to the \(3^m\) terms that must be evaluated.
Monte Carlo sampling is all you need to get high quality solutions. We do not pursue this approach further here, but it’s easy to implement. When \(m\) grows the oscillations tend to increase, so perhaps a solver other than Adam might be more appropriate.
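The corresponding Monte Carlo estimate of the objective itself is equally simple. A sketch, with an illustrative function name:

```python
import numpy as np

rng = np.random.default_rng(2)

def estimate_log_growth(w, p, b, S=100_000):
    """Unbiased Monte Carlo estimate of E[ln R(w)], usable for large m."""
    m, k = p.shape
    rows = np.arange(m)
    total = 0.0
    for _ in range(S):
        o = [rng.choice(k, p=p[i]) for i in range(m)]
        total += np.log(1.0 - w.sum() + (b[rows, o] * w[rows, o]).sum())
    return total / S
```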
An explicit solution in the case of \(m=1\)
The Smoczynski paper3 presents a solution to the Karush–Kuhn–Tucker equations in the case of a single match, \(m=1\). To adapt the method to our notation, we observe that their expression \(1 - \sum_{m \in S} w_m\), given in equation (6.8) in the paper, is
\[1 - \sum_{i \in S} w_i = \frac{1 - \sum_{i \in S} p_i}{1 - \sum_{i \in S} 1/b_i}\]
in our notation. Similarly, equation (6.5) in the paper is
\[w_j = p_j - \frac{1}{b_j} \left( 1 - \sum_{i \in S} w_i \right), \quad j \in S\]
in our notation. See the appendix of this article for a derivation. Unfortunately there does not seem to be any way to extend these results to \(m > 1\).
Summary
The Kelly criterion maximizes long-term wealth growth over an infinite number of identical rounds of betting. We generalized the method to football, where we have multiple outcomes per match as well as multiple matches.
The result is a well-behaved convex optimization problem, which we solve either with CVXPY or with gradient sampling and the Adam algorithm. CVXPY is more general and requires no parameter tuning, but is slower than Adam. With Adam we can solve practical problems in seconds, and we can solve problems with a large number of matches (though one should beware of oscillations). In the case of a single match, an explicit algorithm exists.
Notes and references
- The Wikipedia entry on the Kelly criterion is a good starting point.
- There’s a book titled “The Kelly capital growth investment criterion: theory and practice.” It’s a collection of papers related to the Kelly criterion.
- Two papers co-authored by Stephen Boyd, the author of the excellent book “Convex Optimization”, are “Risk-Constrained Kelly Gambling” and “Distributional Robust Kelly Gambling: Optimal Strategy under Uncertainty in the Long-Run.” The first one is very readable.
Appendix: the Smoczynski paper
Here we walk through the results found in the paper “An explicit solution to the problem of optimizing the allocations of a bettor’s wealth when wagering on horse races”3 in our notation. Consider the case when there is a single match, so \(m=1\).
The Lagrangian and the KKT system
The Lagrangian is
\[\mathcal{L}(\boldsymbol{w}, \boldsymbol{\lambda}, \mu) = \sum_{j} p_j \ln \left( 1 - \sum_{i} w_i + b_j w_j \right) + \sum_{j} \lambda_j w_j + \mu \left( 1 - \sum_{j} w_j \right).\]
The Karush–Kuhn–Tucker system is
\[\frac{p_j b_j}{1 - \sum_i w_i + b_j w_j} - \sum_{k} \frac{p_k}{1 - \sum_i w_i + b_k w_k} + \lambda_j - \mu = 0 \quad \text{for every } j,\]
together with primal feasibility, \(\lambda_j, \mu \geq 0\) and complementary slackness \(\lambda_j w_j = 0\), \(\mu \left( 1 - \sum_j w_j \right) = 0\).
The goal is to solve the equations for \(w_j\). We split the sums over whether the index variable is in the active set \(S = \{ i : w_i > 0 \}\), or not. We have that \(\sum_{i} w_i = \sum_{i \in S} w_i\), since \(w_i\) is zero if \(i\) is not in the active set. For \(j \in S\), complementary slackness gives \(\lambda_j = 0\), and assuming the budget constraint is inactive we also have \(\mu = 0\), so the stationarity condition becomes
\[\frac{p_j b_j}{1 - \sum_{i \in S} w_i + b_j w_j} = \sum_{k \in S} \frac{p_k}{1 - \sum_{i \in S} w_i + b_k w_k} + \frac{1 - \sum_{k \in S} p_k}{1 - \sum_{i \in S} w_i}.\]
Isolating \(w_j\) and \(\sum_{i \in S} w_i\)
Sum both sides of the equation above over every \(j \in S\) to obtain
Solve this equation for the repeated term (whether we use \(j\) or \(k\) as the index is not important) and obtain
Substitute the equation above back into the split equation and get
Now multiply both sides by the denominator on the left hand side, then pull out the \(1 - \sum_{i \in S} w_i\) term to obtain
Take a deep breath. We’re very close.
An expression for \(1 - \sum_{i \in S} w_i\)
We want to get rid of the term \(1 - \sum_{i \in S} w_i\). To do so, we sum the equation above over every \(j \in S\) and solve. At one point we have to use the fact that \(\sum_{j \in S} p_j + \sum_{j \notin S} p_j = 1\). We end up with
\[1 - \sum_{i \in S} w_i = \frac{1 - \sum_{j \in S} p_j}{1 - \sum_{j \in S} 1/b_j}.\]
Put this back into the previous equation and solve for \(w_j\) to obtain
\[w_j = p_j - \frac{1}{b_j} \left( \frac{1 - \sum_{i \in S} p_i}{1 - \sum_{i \in S} 1/b_i} \right), \quad j \in S.\]
1. Kelly, J. L. (1956). “A New Interpretation of Information Rate”
2. Busseti et al. (2016). “Risk-Constrained Kelly Gambling”
3. Smoczynski et al. (2010). “An explicit solution to the problem of optimizing the allocations of a bettor’s wealth when wagering on horse races”
4. Kingma et al. (2014). “Adam: A Method for Stochastic Optimization”