Ridge regression with a link function
- 21. January 2024
- #optimization, #datascience
A high-level overview of this material in the context of Generalized Additive Models is presented in my YouTube video Generalized Additive Models - A journey from linear regression to GAMs.
Suppose you want to create a regression model, but the target variable \(\boldsymbol{y}\) is bounded. This happens all the time, for instance when \(\boldsymbol{y}\) consists of strictly positive prices (lower bound) or a percentage (lower and upper bound). There are three options when, for example, \(\boldsymbol{y}\) must be positive:
- Ignore it, or perhaps add \(\boldsymbol{y} := \max(\boldsymbol{0}, \boldsymbol{y})\) as a post-processing step to ensure non-negative outputs.
- Regress on a transformation, such as \(\log \boldsymbol{y}\). But this means you’re no longer minimizing the squared error, since you’re working in a transformed space. This can have drastic consequences: if \(\boldsymbol{y} = (1, 10, 100)\) then the mean is \(37\), while raising \(10\) to the mean of \(\log_{10} \boldsymbol{y}\) gives \(10\) (the geometric mean). A quick numerical check follows this list.
- Introduce an inverse link function \(g\), such that \(\hat{\boldsymbol{y}} = g \left( X \hat{\boldsymbol{\beta}} \right)\). The name “inverse link function” is from the literature on generalized additive models, and it takes the linear prediction \(X \hat{\boldsymbol{\beta}}\) and maps it to the (typically bounded) target space.
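To make the second point concrete, here is a quick numerical check (a minimal sketch; base-10 logarithms as in the example above):

import numpy as np

y = np.array([1.0, 10.0, 100.0])
print(np.mean(y))                  # 37.0, the arithmetic mean
print(10 ** np.mean(np.log10(y)))  # 10.0, the geometric mean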
In this article we pursue the last option. Here is a dataset with a bounded target variable. The target is between zero and one, but the Ridge model will happily predict values outside of this range:
We will first solve linear regression, then solve the Ridge problem, and finally introduce a link function to the Ridge objective.
Linear regression
Suppose we have a data set \(X\) and a vector of observations \(\boldsymbol{y}\). The goal of linear regression is to choose coefficients \(\boldsymbol{\beta}\) such that \(X \boldsymbol{\beta}\) is close to \(\boldsymbol{y}\). Closeness is most commonly measured by the sum of squares, and our loss function \(\mathcal{L}(\boldsymbol{\beta})\) becomes
\[\mathcal{L}(\boldsymbol{\beta}) = \lVert X \boldsymbol{\beta} - \boldsymbol{y} \rVert_2^2 .\]
The goal is to choose a vector of coefficients \(\boldsymbol{\beta}\) to minimize the loss function. In mathematical notation, we wish to solve the optimization problem
\[\operatorname*{minimize}_{\boldsymbol{\beta}} \ \lVert X \boldsymbol{\beta} - \boldsymbol{y} \rVert_2^2 .\]
Solving linear regression
One approach to solving the problem is to differentiate the expression
\[\mathcal{L}(\boldsymbol{\beta}) = \left( X \boldsymbol{\beta} - \boldsymbol{y} \right)^T \left( X \boldsymbol{\beta} - \boldsymbol{y} \right)\]
and set the result equal to zero. Doing so, we obtain the normal equations \(X^T X \boldsymbol{\beta} = X^T \boldsymbol{y}\). The most straightforward solution approach is to form the normal equations explicitly and solve for \(\boldsymbol{\beta}\); both this and the QR approach are sketched in code after the list below.
Another approach, which avoids explicitly forming \(X^T X\), is to
- Form the economical QR factorization \(QR = X\).
- The equations \(X^T X \boldsymbol{\beta} = X^T \boldsymbol{y}\) become \(R^T R \boldsymbol{\beta} = X^T \boldsymbol{y}\) since \(Q^T Q = I\). Define \(\boldsymbol{\alpha} = R \boldsymbol{\beta}\), then solve \(R^T \boldsymbol{\alpha} = X^T \boldsymbol{y}\) for \(\boldsymbol{\alpha}\) using a triangular solver.
- Solve \(R \boldsymbol{\beta} = \boldsymbol{\alpha}\) for \(\boldsymbol{\beta}\) using a triangular solver.
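Here is a minimal sketch of both approaches on illustrative data:

import numpy as np
import scipy as sp

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)

# Approach 1: form and solve the normal equations X.T @ X @ beta = X.T @ y
beta_normal = sp.linalg.solve(X.T @ X, X.T @ y, assume_a="pos")

# Approach 2: economical QR factorization and two triangular solves
Q, R = np.linalg.qr(X)  # mode="reduced" is the default
alpha = sp.linalg.solve_triangular(R.T, X.T @ y, lower=True)  # R.T @ alpha = X.T @ y
beta_qr = sp.linalg.solve_triangular(R, alpha, lower=False)   # R @ beta = alpha

assert np.allclose(beta_normal, beta_qr)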
Either way, the problem is solved by differentiating the loss function and setting it equal to zero, analogous to solving a simple quadratic function \(f(\beta) = (\beta x - y)^2\) for the optimal \(\beta\).
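To spell out the 1D analogue: differentiating \(f(\beta) = (\beta x - y)^2\) gives \(f'(\beta) = 2 x (\beta x - y)\), and setting this to zero yields \(\beta = y / x\) whenever \(x \neq 0\).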
Ridge regression
Next let’s look at Ridge regression. In Ridge regression we add a penalty term \(\alpha \lVert \boldsymbol{\beta} \rVert^2\) to the objective, where \(\alpha\) is a positive regularization strength. This is also a good time to introduce the option of unequal weighting of data points, facilitated by a diagonal matrix \(W\) with non-negative weights. The loss function becomes
\[\mathcal{L}(\boldsymbol{\beta}) = \lVert X \boldsymbol{\beta} - \boldsymbol{y} \rVert^2_W + \alpha \lVert \boldsymbol{\beta} \rVert^2 .\]
Recall that the weighted norm \(\lVert \boldsymbol{\beta} \rVert^2_W\) is notational convenience for \(\boldsymbol{\beta}^T W \boldsymbol{\beta}\). We can form normal equations for this problem too. To do so, we multiply the expression out, differentiate it with respect to \(\boldsymbol{\beta}\), then set it equal to zero. Multiplying out, we obtain
\[\mathcal{L}(\boldsymbol{\beta}) = \boldsymbol{\beta}^T X^T W X \boldsymbol{\beta} - 2 \boldsymbol{y}^T W X \boldsymbol{\beta} + \boldsymbol{y}^T W \boldsymbol{y} + \alpha \boldsymbol{\beta}^T \boldsymbol{\beta} .\]
Computing the gradient with respect to \(\boldsymbol{\beta}\), we then obtain
\[\nabla \mathcal{L}(\boldsymbol{\beta}) = 2 X^T W X \boldsymbol{\beta} - 2 X^T W \boldsymbol{y} + 2 \alpha \boldsymbol{\beta} .\]
Setting this equal to zero and collecting terms with \(\boldsymbol{\beta}\), we obtain the normal equations
\[\left( X^T W X + \alpha I \right) \boldsymbol{\beta} = X^T W \boldsymbol{y} .\]
Solving Ridge regression
The most straightforward way to solve the Ridge problem is to form and solve the normal equations. Instead of explicitly forming the diagonal matrix \(W = \operatorname{diag}(\boldsymbol{w})\) when computing \(X^T W X + \alpha I\), we can compute the product as (X * w[:, None]).T @ X in Python. Avoiding the explicit diagonal matrix is 2-3 orders of magnitude faster on a typical problem size.
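Here is a minimal sketch of this weighted Ridge solve (the data and names are illustrative):

import numpy as np
import scipy as sp

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
w = np.exp(rng.normal(size=100))  # positive sample weights
alpha = 1.0  # regularization strength

lhs = (X * w[:, None]).T @ X  # X.T @ diag(w) @ X, without forming diag(w)
np.fill_diagonal(lhs, lhs.diagonal() + alpha)  # add alpha * I in place
beta = sp.linalg.solve(lhs, X.T @ (w * y), assume_a="pos")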
Another approach is to use the SVD, as described in Section 3.4.1 of The Elements of Statistical Learning.
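For the unweighted case, here is a minimal sketch of the SVD route (weights can be absorbed by scaling the rows of \(X\) and \(\boldsymbol{y}\) by \(\sqrt{w_i}\)):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
alpha = 1.0

# Thin SVD: X = U @ diag(d) @ Vt
U, d, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ (d / (d**2 + alpha) * (U.T @ y))

# Agrees with the solution from the normal equations
beta_ne = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T @ y)
assert np.allclose(beta_svd, beta_ne)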
Ridge with a link function
We will now introduce a link function to the optimization problem.
Assume our prediction of \(\boldsymbol{y}\) is \(\boldsymbol{\mu}\). The link function \(g\) is defined as \(g \left( \boldsymbol{\mu} \right) = X \boldsymbol{\beta}\). The inverse link function \(h\) is defined as \(\boldsymbol{\mu} = h \left( X \boldsymbol{\beta} \right)\). We will use the notation \(\boldsymbol{\eta} = X \boldsymbol{\beta}\) for predictions in the linear, non-transformed space. In the case above we used the log-link, so \(g(\boldsymbol{\mu}) = \log(\boldsymbol{\mu})\) and \(h(\boldsymbol{\eta}) = \exp(\boldsymbol{\eta})\).
The optimization problem becomes
\[\operatorname*{minimize}_{\boldsymbol{\beta}} \ \lVert h \left( X \boldsymbol{\beta} \right) - \boldsymbol{y} \rVert^2_W + \alpha \lVert \boldsymbol{\beta} \rVert^2 .\]
Immediately there are two obvious ways to solve this problem. They both involve Taylor expansions of the objective:
- Taylor expansion of the inverse link function \(h\left( X \boldsymbol{\beta} \right)\). The link function is approximated by a hyperplane, and the objective squares this hyperplane. The result is a function with a positive definite Hessian. This is essentially the Gauss–Newton method.
- Taylor expansion of the entire objective \(\mathcal{L}(\boldsymbol{\beta})\). Successively solving this is Newton’s method.
Examining a 1D example with \(h(\eta) = \operatorname{softplus}(\eta) = \log (1 + \exp(\eta))\) is instructive:
On the particular problem shown in the figure, we observe that:
- Expansion of \(h\left( X \boldsymbol{\beta} \right)\) can lead to a large step size that overshoots the minimum value.
- Expansion of \(\mathcal{L}(\boldsymbol{\beta})\) need not be positive definite. Forcing it to be positive definite (by flooring the eigenvalues of the Hessian at a small positive value) can circumvent this issue.
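Here is a minimal sketch of the eigenvalue-flooring fix, mirroring the Newton solver in the code listing further below (the function name solve_pd is illustrative):

import numpy as np
import scipy as sp

def solve_pd(H, g, eps=1e-2):
    """Solve H @ step = g after flooring the eigenvalues of symmetric H at eps."""
    s, U = sp.linalg.eigh(H)  # H = U @ diag(s) @ U.T
    s = np.maximum(eps, s)    # floor the eigenvalues
    return np.linalg.multi_dot([U / s, U.T, g])  # U @ diag(1/s) @ U.T @ g

H = np.array([[1.0, 0.0], [0.0, -0.5]])  # an indefinite Hessian
print(solve_pd(H, np.ones(2)))  # the step is still well defined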
Let’s work out the math for both approaches. We’ll get back to numerical examples later on.
Solution approach 1: Taylor expansion of the inverse link function
Expanding \(h\left( X \boldsymbol{\beta} \right)\) about \(\boldsymbol{\beta}_\ell\), we obtain
\[h \left( X \boldsymbol{\beta} \right) \approx h \left( X \boldsymbol{\beta}_\ell \right) + \operatorname{diag} \left[ h' \left( X \boldsymbol{\beta}_\ell \right) \right] X \left( \boldsymbol{\beta} - \boldsymbol{\beta}_\ell \right),\]
where the \(\operatorname{diag} \left[ \cdot \right]\) operator acts like numpy.diag.
Now we substitute the Taylor expansion into the objective, ignoring the regularization term for the moment:
\[\lVert h \left( X \boldsymbol{\beta} \right) - \boldsymbol{y} \rVert^2_W \approx \lVert X' \boldsymbol{\beta} - \boldsymbol{y}' \rVert^2_W, \quad X' = \operatorname{diag} \left[ h' \left( \boldsymbol{\eta}_\ell \right) \right] X, \quad \boldsymbol{y}' = \boldsymbol{y} - h \left( \boldsymbol{\eta}_\ell \right) + X' \boldsymbol{\beta}_\ell,\]
where \(\boldsymbol{\eta}_\ell = X \boldsymbol{\beta}_\ell\). Notice that this is simply the Ridge problem all over again, with new data \(X'\) and \(\boldsymbol{y}'\). Since we already know how to solve this, we can construct an iterative algorithm:
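Here is a minimal sketch of the resulting iteration for the softplus link (weights and step-halving omitted; the full implementation appears in the code listing below):

import numpy as np
import scipy as sp

def h(eta):  # softplus inverse link, numerically stable
    return np.log1p(np.exp(-np.abs(eta))) + np.maximum(eta, 0)

def h_prime(eta):  # logistic function
    return sp.special.expit(eta)

def solve_ridge(X, y, alpha):
    """One standard Ridge solve via the normal equations."""
    lhs = X.T @ X + alpha * np.eye(X.shape[1])
    return sp.linalg.solve(lhs, X.T @ y, assume_a="pos")

def iterative_least_squares(X, y, alpha, beta, n_iter=25):
    for _ in range(n_iter):
        eta = X @ beta
        X_new = X * h_prime(eta)[:, None]        # X' = diag(h'(eta)) @ X
        y_new = y - h(eta) + X_new @ beta        # y', the pseudodata
        beta = solve_ridge(X_new, y_new, alpha)  # the Ridge problem again
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = h(X @ np.array([1.0, 2.0, 3.0]))
beta = iterative_least_squares(X, y, alpha=1.0, beta=np.zeros(3))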
Solution approach 2: Taylor expansion of the loss function
Our second approach will be to use Newton’s method
\[\boldsymbol{\beta}_{\ell + 1} = \boldsymbol{\beta}_\ell - \left[ \nabla^2 \mathcal{L} \left( \boldsymbol{\beta}_\ell \right) \right]^{-1} \nabla \mathcal{L} \left( \boldsymbol{\beta}_\ell \right),\]
which approximates \(\mathcal{L}(\boldsymbol{\beta})\) using a second order Taylor expansion and iteratively solves the quadratic approximation for the minimizing \(\boldsymbol{\beta}\).
In order to obtain an expression for the Hessian and gradient, it helps to write out the loss as a sum
\[\mathcal{L}(\boldsymbol{\beta}) = \sum_{i=1}^m w_i \left( h \left( \boldsymbol{x}_i \boldsymbol{\beta} \right) - y_i \right)^2 + \alpha \sum_{j=1}^n \beta_j^2,\]
where \(\boldsymbol{x}_i\) is the \(i\)th row vector in the matrix \(X \in \mathbb{R}^{m \times n}\).
The gradient
Differentiating with respect to \(\boldsymbol{\beta}\), we obtain the gradient as
\[\nabla \mathcal{L}(\boldsymbol{\beta}) = 2 \sum_{i=1}^m w_i \left( h \left( \boldsymbol{x}_i \boldsymbol{\beta} \right) - y_i \right) h' \left( \boldsymbol{x}_i \boldsymbol{\beta} \right) \boldsymbol{x}_i^T + 2 \alpha \boldsymbol{\beta} .\]
In the more useful matrix notation, this becomes
\[\nabla \mathcal{L}(\boldsymbol{\beta}) = 2 X^T \left[ \boldsymbol{w} \odot \left( h \left( X \boldsymbol{\beta} \right) - \boldsymbol{y} \right) \odot h' \left( X \boldsymbol{\beta} \right) \right] + 2 \alpha \boldsymbol{\beta},\]
where we also could’ve used the notation \(\boldsymbol{\eta} = X \boldsymbol{\beta}\). The product \(\boldsymbol{a} \odot \boldsymbol{b}\) is the elementwise product of two vectors \(\boldsymbol{a}\) and \(\boldsymbol{b}\).
The Hessian
Differentiating once more, we obtain the Hessian as
\[\nabla^2 \mathcal{L}(\boldsymbol{\beta}) = 2 \sum_{i=1}^m w_i \left[ h' \left( \boldsymbol{x}_i \boldsymbol{\beta} \right)^2 + \left( h \left( \boldsymbol{x}_i \boldsymbol{\beta} \right) - y_i \right) h'' \left( \boldsymbol{x}_i \boldsymbol{\beta} \right) \right] \boldsymbol{x}_i^T \boldsymbol{x}_i + 2 \alpha I .\]
In matrix notation, this becomes
\[\nabla^2 \mathcal{L}(\boldsymbol{\beta}) = 2 X^T \operatorname{diag} \left[ \boldsymbol{w} \odot \left( h' \left( \boldsymbol{\eta} \right) \odot h' \left( \boldsymbol{\eta} \right) + \left( h \left( \boldsymbol{\eta} \right) - \boldsymbol{y} \right) \odot h'' \left( \boldsymbol{\eta} \right) \right) \right] X + 2 \alpha I .\]
Now that we have expressions for the gradient \(\nabla \mathcal{L}(\boldsymbol{\beta})\) and the Hessian \(\nabla^2 \mathcal{L}(\boldsymbol{\beta})\), we are in a position to use Newton’s method.
At this point you might be wondering if the math checks out. There’s only one way to find out—let’s code it up and test it on a problem!
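One cheap test before running the full solver is to compare the derived gradient against finite differences. A minimal, self-contained sketch (softplus link, no weights, illustrative data):

import numpy as np
import scipy as sp

def h(eta):  # softplus
    return np.log1p(np.exp(-np.abs(eta))) + np.maximum(eta, 0)

def h_prime(eta):  # logistic
    return sp.special.expit(eta)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = h(X @ np.array([1.0, 2.0, 3.0]))
alpha = 1.0

def objective(beta):
    return np.sum((h(X @ beta) - y) ** 2) + alpha * beta @ beta

def gradient(beta):
    eta = X @ beta
    return 2 * X.T @ ((h(eta) - y) * h_prime(eta)) + 2 * alpha * beta

# check_grad compares the analytical gradient to finite differences
assert sp.optimize.check_grad(objective, gradient, rng.normal(size=3)) < 1e-3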
Experimental results
Assume that we want to avoid negative predictions. We use the softplus function \(\log \left( 1 + \exp \left( \eta \right) \right)\) to enforce positive predictions. The softplus is a smooth approximation to \(\max(0, \eta)\). The inverse link function and its derivatives become
\[h(\eta) = \log \left( 1 + \exp(\eta) \right), \qquad h'(\eta) = \frac{1}{1 + \exp(-\eta)}, \qquad h''(\eta) = h'(\eta) \left( 1 - h'(\eta) \right).\]
The code is included below.
The optimization routine produces exactly the same result as scipy.optimize.minimize on this problem, but is 2-3 orders of magnitude faster since we use information about the gradient and Hessian.
from sklearn.model_selection import cross_val_score

class LinkedRidge(LinkedRidgeBase):
    def h(self, x):
        """The softplus function log(1 + exp(x)), numerically stable."""
        return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0)

    def h_prime(self, x):
        """The derivative of softplus, the logistic 1/(1 + exp(-x))."""
        return sp.special.expit(x)

    def h_prime_prime(self, x):
        """The derivative of the logistic function."""
        return self.h_prime(x) * (1 - self.h_prime(x))

# Create a data set
rng = np.random.default_rng(42)
num_samples = 1000
num_features = 25
X = rng.normal(size=(num_samples, num_features))
beta_true = np.arange(num_features) + 1

# Create positive target values and weights
y = LinkedRidge().h(X @ beta_true + rng.normal(size=num_samples))
sample_weight = np.exp(rng.normal(size=num_samples))

# Solve using the model directly
model = LinkedRidge(alpha=1, verbose=False)
model.fit(X, y, sample_weight=sample_weight)

r2_scores = cross_val_score(model, X, y,
                            fit_params={"sample_weight": sample_weight})
assert r2_scores.mean() > 0.99
Here are results from running the model on a problem with \(1000\) observations and a varying number of features.
In order to check the results, I compared with scipy.optimize.minimize without providing any information about the gradient or the Hessian. The iterations are not directly comparable, so for scipy I only include the final answer instead of progress per iteration.
The newton method uses expansion of the full loss function, the ils method (Iterated Least Squares) expands the inverse link function, and auto cycles between the two approaches (one iteration is actually two iterations, one of each). I also implemented step-halving; see the code below.
Python code
Here’s a simple Python implementation of the optimization routines outlined above. We subclass the Ridge model from scikit-learn so our implementation inherits the scoring method and becomes compatible with other functions in the scikit-learn ecosystem.
Scaling the input features and the target is a good idea. For instance, down-scale a percentage in the range \([0, 100]\) to \([0, 1]\) instead of scaling up the sigmoid function.
Please note that this is not a production-grade implementation. If you need that, check out my Python package generalized-additive-models.
import functools  # Python version: 3.11.6

import numpy as np  # Version: 1.26.1
import scipy as sp  # Version: 1.11.3
from sklearn.linear_model import Ridge  # Version: 1.3.2


class LinkedRidgeBase(Ridge):
    """The user must inherit this class and override
    the functions h(eta), h'(eta) and h''(eta)."""

    def h(self, x):
        raise NotImplementedError

    def h_prime(self, x):
        raise NotImplementedError

    def h_prime_prime(self, x):
        raise NotImplementedError

    def __init__(self, alpha=1, tol=0.0001, solver="auto", max_iter=99, verbose=False):
        assert solver in ("newton", "ils", "auto", "scipy")
        self.alpha = alpha  # Regularization strength
        self.tol = tol  # Convergence tolerance
        self.verbose = verbose  # Whether or not to print
        self.solver = solver
        self.max_iter = max_iter  # Maximum number of iterations

    def fit(self, X, y, sample_weight=None):
        # We store alpha as a vector, but it should correspond to sklearn alpha
        self.alpha_ = np.ones(X.shape[1]) * self.alpha**0.5
        w = np.ones(X.shape[0]) if sample_weight is None else sample_weight
        beta_init = self._initial_guess(X, w=w, y=y)
        if self.verbose:
            print(f"Iteration 0. Obj: {self.objective(beta_init, X, w, y):.2f}")
        self.coef_ = self.solve(beta_init, X, w, y, self.solver)
        return self

    def _initial_guess(self, X, w, y):
        # Compute Taylor expansion of h^-1(mu) = eta = X @ beta, then solve
        # |h(X @ beta) - y| ~ |X @ beta - h^-1(y)| for initial beta.
        y0 = np.mean(y)
        y_trans = y0 + (y - self.h(y0)) / (self.h_prime(y0))
        return self.solve_lstsq(X, w=w, y=y_trans)  # Solve transformed Ridge

    def solve_lstsq(self, X, w, y):
        """Solve |X @ beta - y|^2_W + |diag(alpha) @ beta|^2 for beta."""
        X_T_W = (X * w[:, None]).T
        lhs = X_T_W @ X  # Form X.T @ W @ X
        np.fill_diagonal(lhs, lhs.diagonal() + self.alpha_**2)  # Add alpha inplace
        return sp.linalg.solve(lhs, X_T_W @ y, assume_a="pos")

    def predict(self, X):
        """Predict on a new data set using h(X @ beta)."""
        return self.h(X @ self.coef_)

    def objective(self, beta, X, w, y):
        """Objective function, evaluated at beta."""
        eta = X @ beta
        diff = self.h(eta) - y
        penalty = np.dot(self.alpha_ * beta, self.alpha_ * beta)
        return np.dot(diff, w * diff) + penalty

    def gradient(self, beta, X, w, y):
        """Gradient of the objective function, evaluated at beta.
        The constant factor 2 from the math above is omitted; it
        cancels in the Newton step and leaves the minimizer unchanged."""
        eta = X @ beta
        iter_weights = w * ((self.h(eta) - y) * self.h_prime(eta))
        return X.T @ iter_weights + self.alpha_**2 * beta

    def hessian(self, beta, X, w, y):
        """Hessian of the objective function, evaluated at beta.
        The constant factor 2 is omitted, matching the gradient."""
        eta = X @ beta
        iter_weights = w * self.h_prime(eta) ** 2
        iter_weights += w * (self.h(eta) - y) * self.h_prime_prime(eta)
        hess = (X * iter_weights[:, None]).T @ X
        np.fill_diagonal(hess, hess.diagonal() + self.alpha_**2)  # Add alpha inplace
        return hess

    def halving_search(self, beta0, beta1, X, w, y):
        """Perform step halving search between beta0 and beta1."""
        # Starting objective value
        objective0 = self.objective(beta0, X, w, y)
        for step_size in map(lambda i: 0.5**i, range(20)):  # 1, 1/2, 1/4, ...
            beta = (1 - step_size) * beta0 + step_size * beta1
            objective = self.objective(beta, X, w, y)
            if objective < objective0:  # Found better solution
                return beta
        return beta0  # No better solution found

    def solve_scipy(self, beta, X, w, y):
        """Solve with scipy.optimize.minimize."""

        class Callback:
            # The first argument is named `cb` on purpose, so that `self`
            # refers to the enclosing model instance.
            iteration = 1
            prev_beta = np.copy(beta)

            def __call__(cb, intermediate_result):
                step = intermediate_result.x - cb.prev_beta
                msg = f"Iteration {cb.iteration}. "
                msg += f"Log10 delta beta: {np.log10(sp.linalg.norm(step)):.2f}. "
                msg += f"Obj: {self.objective(intermediate_result.x, X, w, y):.2f}"
                cb.iteration += 1
                cb.prev_beta = intermediate_result.x
                print(msg)

        result = sp.optimize.minimize(
            fun=functools.partial(self.objective, X=X, w=w, y=y),
            x0=beta,
            method="BFGS",
            jac=functools.partial(self.gradient, X=X, w=w, y=y),
            tol=self.tol,
            callback=Callback() if self.verbose else None,
        )
        return result.x

    def solve(self, beta, X, w, y, method):
        """Driver method for calling solution algorithms."""
        if method == "scipy":
            return self.solve_scipy(beta, X, w, y)

        def auto(beta, X, w, y):
            """One step of each :)"""
            beta1 = self.iterative_least_squares(beta, X, w, y)
            return self.newton(beta1, X, w, y)

        optimizers = {
            "newton": self.newton,
            "ils": self.iterative_least_squares,
            "auto": auto,
        }

        for iteration in range(1, self.max_iter + 1):
            beta_new = optimizers[method](beta, X, w, y)
            step = beta_new - beta
            beta = beta_new
            if self.verbose:
                msg = f"Iteration {iteration}. "
                msg += f"Log10 delta beta: {np.log10(sp.linalg.norm(step)):.2f}. "
                msg += f"Obj: {self.objective(beta, X, w, y):.2f}"
                print(msg)
            if self._should_stop(beta, step):  # Tolerance criterion
                return beta
        return beta

    def newton(self, beta, X, w, y):
        """Use Newton's algorithm to minimize the objective."""
        g = self.gradient(beta, X, w, y)
        H = self.hessian(beta, X, w, y)
        # Solve H @ step = g, forcing H = U @ diag(s) @ U.T to be
        # positive definite by flooring its eigenvalues
        s, U = sp.linalg.eigh(H)
        s = np.maximum(1e-2, s)
        step = np.linalg.multi_dot([U / s, U.T, g])
        # step = sp.linalg.solve(H, g, assume_a="pos")
        return self.halving_search(beta, beta - step, X, w, y)

    def iterative_least_squares(self, beta, X, w, y):
        """Use iterative least squares to minimize the objective."""
        eta = X @ beta
        diag_h_prime_eta_X = (X.T * self.h_prime(eta)).T  # diag(h'(eta)) @ X
        z = y - self.h(eta) + diag_h_prime_eta_X @ beta  # Pseudodata
        # Solve |(diag(h'(eta)) @ X) @ beta - z|^2_W + |diag(alpha) @ beta|^2
        beta1 = self.solve_lstsq(diag_h_prime_eta_X, w, z)
        return self.halving_search(beta, beta1, X, w, y)

    def _should_stop(self, beta, step):
        """Stopping criterion."""
        return np.max(np.abs(step)) < self.tol * np.max(np.abs(beta))
A generalization to arbitrary loss functions
Instead of squaring the residuals \(\epsilon_i = h\left( \boldsymbol{x}_i \boldsymbol{\beta} \right) - y_i\) we can use a general function \(g\) (reusing the symbol; here \(g\) denotes a pointwise loss, not the link function from earlier). The function \(g: \mathbb{R} \to \mathbb{R}\) could be \(\epsilon \mapsto \epsilon^2\), \(\epsilon \mapsto \lvert \epsilon \rvert\), \(\epsilon \mapsto \log(1 + \lvert \epsilon \rvert)\), the Huber loss function, or something else. Beware that not all of these functions are easy to optimize, even with gradient and Hessian information.
Loss function. Substituting the squared loss for an arbitrary pointwise function \(g\), the loss becomes
\[\mathcal{L}(\boldsymbol{\beta}) = \sum_{i=1}^m w_i \, g \left( h \left( \boldsymbol{x}_i \boldsymbol{\beta} \right) - y_i \right) + \alpha \lVert \boldsymbol{\beta} \rVert^2 .\]
Gradient. Taking the gradient with respect to \(\boldsymbol{\beta}\) we obtain
\[\nabla \mathcal{L}(\boldsymbol{\beta}) = X^T \left[ \boldsymbol{w} \odot g' \left( \boldsymbol{\epsilon} \right) \odot h' \left( \boldsymbol{\eta} \right) \right] + 2 \alpha \boldsymbol{\beta},\]
where \(\boldsymbol{\eta} = X \boldsymbol{\beta}\) and \(\boldsymbol{\epsilon} = h\left( X \boldsymbol{\beta} \right) - \boldsymbol{y}\).
Hessian. And finally, the Hessian becomes
\[\nabla^2 \mathcal{L}(\boldsymbol{\beta}) = X^T \operatorname{diag} \left[ \boldsymbol{w} \odot \left( g'' \left( \boldsymbol{\epsilon} \right) \odot h' \left( \boldsymbol{\eta} \right) \odot h' \left( \boldsymbol{\eta} \right) + g' \left( \boldsymbol{\epsilon} \right) \odot h'' \left( \boldsymbol{\eta} \right) \right) \right] X + 2 \alpha I .\]
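As a concrete instance, here is a minimal sketch of \(g\), \(g'\) and \(g''\) for the Huber loss with parameter \(\delta\) (illustrative; these could be plugged into the gradient and Hessian expressions above):

import numpy as np

def g(eps, delta=1.0):
    """Huber loss: quadratic for |eps| <= delta, linear in the tails."""
    quadratic = 0.5 * eps**2
    linear = delta * (np.abs(eps) - 0.5 * delta)
    return np.where(np.abs(eps) <= delta, quadratic, linear)

def g_prime(eps, delta=1.0):
    """First derivative: eps clipped to [-delta, delta]."""
    return np.clip(eps, -delta, delta)

def g_prime_prime(eps, delta=1.0):
    """Second derivative: one in the quadratic region, zero in the tails."""
    return (np.abs(eps) <= delta).astype(float)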