Ridge regression with a link function

A high-level overview of this material in the context of Generalized Additive Models is presented in my YouTube video Generalized Additive Models - A journey from linear regression to GAMs.

Suppose you want to create a regression model, but the target variable \(\boldsymbol{y}\) is bounded. This happens all the time, for instance when \(\boldsymbol{y}\) consists of strictly positive prices (lower bound) or a percentage (lower and upper bound). There are three options when e.g., \(\boldsymbol{y}\) must be positive:

  • Ignore it, or perhaps add \(\boldsymbol{y} := \max(\boldsymbol{0}, \boldsymbol{y})\) as a post-processing step to ensure non-negative outputs.
  • Regress on a transformation, such as \(\log \boldsymbol{y}\)—but this means you’re no longer minimizing the squared error, since you’re working in a transformed space. This can have drastic consequences: if \(\boldsymbol{y} = (1, 10, 100)\) then the mean is \(37\), while the exponential of the mean of \(\log_{10} \boldsymbol{y}\) is \(10\).
  • Introduce an inverse link function \(g\), such that \(\hat{\boldsymbol{y}} = g \left( X \hat{\boldsymbol{\beta}} \right)\). The name “inverse link function” is from the literature on generalized additive models, and it takes the linear prediction \(X \hat{\boldsymbol{\beta}}\) and maps it to the (typically bounded) target space.

In this article we pursue the last option. Here is a dataset] with a bounded target variable. The target is between zero and one, but the Ridge model will happily predict values outside of this range:

We will first solve linear regression, then solve the Ridge problem, and finally introduce a link function to the ridge objective.

Linear regression

Suppose we have a data set \(X\) and a vector of observations \(\boldsymbol{y}\). The goal of linear regression is to choose coefficients \(\boldsymbol{\beta}\) such that \(X \boldsymbol{\beta}\) is close to \(\boldsymbol{y}\). Closeness is most commonly measured by the sum of squares, and our loss function \(\mathcal{L}(\boldsymbol{\beta})\) becomes

\begin{equation} \mathcal{L}(\boldsymbol{\beta}) = \lVert X \boldsymbol{\beta} - \boldsymbol{y} \rVert^2. \end{equation}

The goal is to choose a vector of coefficients \(\boldsymbol{\beta}\) to minimize the loss function. In mathematical notation, we wish to solve the optimization problem

\begin{equation} \underset{\boldsymbol{\beta}}{\operatorname{minimize}} \quad \mathcal{L}(\boldsymbol{\beta}) = \lVert X \boldsymbol{\beta} - \boldsymbol{y} \rVert^2. \end{equation}

Solving linear regression

One approach to solving the problem is to differentiate the expression

\begin{equation} \lVert X \boldsymbol{\beta} - \boldsymbol{y} \rVert^2 = \left(X \boldsymbol{\beta} - \boldsymbol{y} \right)^T \left(X \boldsymbol{\beta} - \boldsymbol{y} \right) \end{equation}

and set it equal to zero. Doing so, we obtain the normal equations \(X^T X \boldsymbol{\beta} = X^T \boldsymbol{y}\). The most straightforward solution approach is to form the normal equations explicitly and solve for \(\boldsymbol{\beta}\).

Another approach that avoids explicitly solving the normal equation is to

  1. Form the economical QR factorization \(QR = X\).
  2. The equations \(X^T X \boldsymbol{\beta} = X^T \boldsymbol{y}\) become \(R^T R \boldsymbol{\beta} = X^T \boldsymbol{y}\) since \(Q^T Q = I\). Define \(\boldsymbol{\alpha} = R \boldsymbol{\beta}\), then solve \(R^T \boldsymbol{\alpha} = X^T \boldsymbol{y}\) for \(\boldsymbol{\alpha}\) using a triangular solver.
  3. Solve \(R^T \boldsymbol{\beta} = \boldsymbol{\alpha}\) for \(\boldsymbol{\beta}\) using a triangular solver.

Either way, the problem is solved by differentiating the loss function and setting it equal to zero, analogous to solving a simple quadratic function \(f(\beta) = (\beta x - y)^2\) for the optimal \(\beta\).

Ridge regression

Next let’s look at Ridge regression. In Ridge regression we add a penalty term \(\alpha \lVert \boldsymbol{\beta}  \rVert^2\) to the objective, where \(\alpha\) is a positive regularization strength. This is also a good time to introduce the option of unequal weighting of data points, facilitated by a diagonal matrix \(W\) with non-negative weights.

\begin{equation} \underset{\boldsymbol{\beta}}{\operatorname{minimize}} \quad \mathcal{L}(\boldsymbol{\beta}) =  \lVert X \boldsymbol{\beta} - \boldsymbol{y} \rVert^2_W + \alpha \lVert \boldsymbol{\beta}  \rVert^2, \end{equation}

Recall that the weighted norm \(\lVert \boldsymbol{\beta} \rVert^2_W\) is notational convenience for \(\boldsymbol{\beta}^T W \boldsymbol{\beta}\). We can form normal equations for this problem too. To do so, we multiply the expression out, differentiate it with respect to \(\boldsymbol{\beta}\), then set it equal to zero. Multiplying out, we obtain

\begin{equation} \mathcal{L}(\boldsymbol{\beta}) = \boldsymbol{\beta}^T X^T W X \boldsymbol{\beta} - 2 \boldsymbol{\beta}^T X^T W \boldsymbol{y} + y^T W \boldsymbol{y} + \alpha \boldsymbol{\beta}^T \boldsymbol{\beta}. \end{equation}

Computing the gradient with respect to \(\boldsymbol{\beta}\), we then obtain

\begin{equation} \nabla \mathcal{L}(\boldsymbol{\beta}) = 2 X^T W X \boldsymbol{\beta} - 2 X^T W \boldsymbol{y} + 2 \alpha \boldsymbol{\beta} = 2 X^T W (X \boldsymbol{\beta} - \boldsymbol{y} ) + 2\alpha \boldsymbol{\beta}. \end{equation}

Setting this equal to zero and collecting terms with \(\boldsymbol{\beta}\), we obtain the normal equations

\begin{equation} \left( X^T W X + \alpha I \right) \boldsymbol{\beta} = X^T W \boldsymbol{y}. \end{equation}

Solving Ridge regression

The most straightforward way to solve the Ridge problem is to compute the normal equations. Instead of explicitly forming the diagonal matrix \(W = \operatorname{diag}(\boldsymbol{\beta})\) when computing \(X^T W X + \alpha I\), we can avoid forming the diagonal matrix explicitly and compute it as (X * w[:, None]).T @ X in Python. This is 2-3 orders of magnitude faster on a typical problem size.

Another approach is to use the SVD, as described in Section 3.4.1 of Elements.

We will now introduce a link function to the optimization problem.

Assume our prediction of \(\boldsymbol{y}\) is \(\boldsymbol{\mu}\). The link function \(g\) is defined as \(g \left( \boldsymbol{\mu} \right) = X \boldsymbol{\beta}\). The inverse link function \(h\) is defined as \(\boldsymbol{\mu}  = h \left( X \boldsymbol{\beta} \right)\). We will use the notation \(\boldsymbol{\eta} = X \boldsymbol{\beta}\) for predictions in the linear, non-transformed space. In the case above we used the log-link, so \(g(\boldsymbol{\mu}) = \log(\boldsymbol{\mu})\) and \(h(\boldsymbol{\eta}) = \exp(\boldsymbol{\eta})\).

The optimization problem becomes

\begin{equation} \underset{\boldsymbol{\beta}}{\operatorname{minimize}} \quad \mathcal{L}(\boldsymbol{\beta}) =  \lVert h\left( X \boldsymbol{\beta} \right) - \boldsymbol{y} \rVert^2_W + \alpha \lVert \boldsymbol{\beta}  \rVert^2. \end{equation}

Immediately there are two obvious ways to solve this problem. They both involve Taylor expansions of the objective:

  • Taylor expansion of the inverse link function \(h\left( X \boldsymbol{\beta} \right)\). The link function is approximated by a hyperplane, and the objective squares this hyperplane. The result is a function with a positive definite Hessian. This is essentially the Gauss–Newton method.
  • Taylor expansion of the entire objective \(\mathcal{L}(\boldsymbol{\beta})\). Successively solving this is Newton’s method.

Examining a 1D example with \(h(\eta) = \operatorname{softplus}(\eta) = \log (1 + \exp(\eta))\) is instructive:

On the particular problem shown in the figure, we observe that:

  • Expansion of \(h\left( X \boldsymbol{\beta} \right)\) can lead to a large step size that overshoots the minimum value.
  • Expansion of \(\mathcal{L}(\boldsymbol{\beta})\) need not be positive definite. Forcing it to be positive definite (by setting the eigenvalues of the Hessian to positive values) can circumvent this issue.

Let’s work out the math for both approaches. We’ll get back to numerical examples later on.

Expanding \(h\left( X \boldsymbol{\beta} \right)\) about \(\boldsymbol{\beta}_\ell\), we obtain

\begin{align*} h\left( X \boldsymbol{\beta} \right) &= h\left( X \boldsymbol{\beta}_\ell \right) + \nabla h \left( X \boldsymbol{\beta}_\ell \right) \left( \boldsymbol{\beta} - \boldsymbol{\beta}_\ell \right) \\ &= h\left( X \boldsymbol{\beta}_\ell \right) + \operatorname{diag} \left[ h' \left( X \boldsymbol{\beta}_\ell \right) \right] X \left( \boldsymbol{\beta} - \boldsymbol{\beta}_\ell \right), \end{align*}

where the \(\operatorname{diag} \left[ \cdot \right]\) operator acts like numpy.diag.

Now we substitute the Taylor expansion into the objective, ignoring the regularization term for the moment:

\begin{align*} \lVert h\left( X \boldsymbol{\beta} \right) - \boldsymbol{y} \rVert^2_W &\approx \lVert \left( h\left( X \boldsymbol{\beta}_\ell \right) + \operatorname{diag} \left[ h' \left( X \boldsymbol{\beta}_\ell \right) \right] X \left( \boldsymbol{\beta} - \boldsymbol{\beta}_\ell \right) \right)  - \boldsymbol{y} \rVert^2_W  \\  &= \lVert  \underbrace{\operatorname{diag} \left[ h' \left( X \boldsymbol{\beta}_\ell \right) \right]X}_{X'} \boldsymbol{\beta} - \underbrace{\left( \boldsymbol{y} + \operatorname{diag} \left[ h' \left( X \boldsymbol{\beta}_\ell \right) \right]X \boldsymbol{\beta}_\ell - h\left( X \boldsymbol{\beta}_\ell \right) \right)}_{\boldsymbol{y}'} \rVert^2_W \end{align*}

Notice that this is simply the Ridge problem all over again, with new data \(X'\) and \(\boldsymbol{y}'\). We already know how to solve this, we can construct an iterative algorithm:

\begin{equation} \boldsymbol{\beta}_{\ell+1} = \operatorname{argmin}_\boldsymbol{\beta}  \, \lVert X' \boldsymbol{\beta} - \boldsymbol{y}' \rVert^2_W + \alpha \lVert \boldsymbol{\beta}  \rVert^2. \end{equation}

Solution approach 2: Taylor expansion of the loss function

Our second approach will be to use Newton’s method

\begin{equation} \boldsymbol{\beta}_{\ell+1} = \boldsymbol{\beta}_{\ell} - \left[ \nabla^2 \mathcal{L}(\boldsymbol{\beta}_{\ell}) \right]^{-1} \nabla \mathcal{L}(\boldsymbol{\beta}_{\ell}), \end{equation}

which approximates \(\mathcal{L}(\boldsymbol{\beta})\) using a second order Taylor expansion and iteratively solves the quadratic approximation for the minimum \(\boldsymbol{\beta}\).

In order to obtain an expression for the Hessian and gradient, it helps to write out the loss as a sum

\begin{align*} \mathcal{L}(\boldsymbol{\beta}) &= \left( h\left( X \boldsymbol{\beta} \right) - \boldsymbol{y} \right)^T W \left( h\left( X \boldsymbol{\beta} \right) - \boldsymbol{y} \right) + \alpha \boldsymbol{\beta}^T \boldsymbol{\beta} \\ &= \sum_i^m w_i \left( h\left( \boldsymbol{x}_i \boldsymbol{\beta} \right) -y_i \right)^2 + \alpha \sum_j^n \beta_j^2, \end{align*}

where \(\boldsymbol{x}_i\) is the \(i\)th row vector in the matrix \(X \in \mathbb{R}^{m \times n}\).

The gradient

Differentiating with respect to \(\boldsymbol{\beta}\), we obtain the gradient as

\begin{align*} \nabla \mathcal{L}(\boldsymbol{\beta}) &= 2 \sum_i^m w_i \left( h\left( \boldsymbol{x}_i \boldsymbol{\beta} \right) -y_i \right) \nabla h\left( \boldsymbol{x}_i \boldsymbol{\beta} \right) + 2 \alpha \boldsymbol{\beta} \\ &= 2 \sum_i^m w_i \left( h\left( \boldsymbol{x}_i \boldsymbol{\beta} \right) -y_i \right) h'\left( \boldsymbol{x}_i \boldsymbol{\beta} \right) \boldsymbol{x}_i^T + 2 \alpha \boldsymbol{\beta}. \end{align*}

In the more useful matrix notation, this becomes

\begin{equation} \nabla \mathcal{L}(\boldsymbol{\beta}) = 2 X^T W  \left[\left( h\left( X \boldsymbol{\beta} \right)  - \boldsymbol{y} \right) \odot h'\left( X \boldsymbol{\beta} \right) \right] + 2 \alpha \boldsymbol{\beta}, \end{equation}

where we also could’ve used the notation \(\boldsymbol{\eta} = X \boldsymbol{\beta}\). The product \(\boldsymbol{a} \odot \boldsymbol{b}\) is the elementwise product of two vectors \(\boldsymbol{a}\) and \(\boldsymbol{b}\).

The Hessian

Differentiating once more, we obtain the Hessian as

\begin{align*} \nabla^2 \mathcal{L}(\boldsymbol{\beta}) &= 2 \sum_i^m w_i \boldsymbol{x}_i^T \left( h' \left( \boldsymbol{x}_i \boldsymbol{\beta} \right)\right)^2 \boldsymbol{x}_i + w_i \left( h\left( \boldsymbol{x}_i \boldsymbol{\beta} \right) -y_i \right) \boldsymbol{x}_i^T h'' \left( \boldsymbol{x}_i \boldsymbol{\beta} \right) \boldsymbol{x}_i + 2\alpha I. \end{align*}

In matrix notation, this becomes

\begin{align*} \nabla^2 \mathcal{L}(\boldsymbol{\beta}) &= 2 X^T W \operatorname{diag} \left[ h'\left( X \boldsymbol{\beta} \right)^2 + \left( h\left( X \boldsymbol{\beta} \right) - \boldsymbol{y} \right) \odot h''\left( X \boldsymbol{\beta} \right) \right] X + 2 \alpha I \end{align*}

Now that we have expressions for the gradient \(\nabla \mathcal{L}(\boldsymbol{\beta})\) and the Hessian \(\nabla^2 \mathcal{L}(\boldsymbol{\beta})\), we are in a position to use Newtons method.

At this point you might be wondering if the math checks out. There’s only one way to find out—let’s code it up and test it on a problem!

Experimental results

Assume that we want to avoid negative predictions. We use the softmax function \(\log \left( 1 + \exp \left( \eta \right) \right)\) to enforce positive predictions. The softmax is a smooth approximation to \(\max(0, \eta)\).

The inverse link function and its derivatives become:

\begin{align*} h(\boldsymbol{\eta}) &= \log \left( 1 + \exp \left( \boldsymbol{\eta} \right) \right) \\ h'(\boldsymbol{\eta}) &= \frac{1}{1 + \exp \left(-\boldsymbol{\eta} \right)} \\ h''(\boldsymbol{\eta}) &= h'(\boldsymbol{\eta}) \left(1 - h'(\boldsymbol{\eta}) \right) \end{align*}

The code is included below. The optimization routine produces exactly the same result as scipy.optimize.minimize on this problem, but is 2-3 orders of magnitude faster since we use information about the gradient and Hessian.

from sklearn.model_selection import cross_val_score

class LinkedRidge(LinkedRidgeBase):
    def h(self, x):
        """The softplus function log(1 + exp(x)), numerically stable."""
        return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0)

    def h_prime(self, x):
        """The derivative of softplus, the logistic 1/(1 + exp(-x))."""
        return sp.special.expit(x)

    def h_prime_prime(self, x):
        """The derivative of the logistic function."""
        return self.h_prime(x) * (1 - self.h_prime(x))

# Create a data set
rng = np.random.default_rng(42)
num_samples = 1000
num_features = 25
X = rng.normal(size=(num_samples, num_features))
beta_true = np.arange(num_features) + 1

# Create positive targets values and weights
y = LinkedRidge().h(X @ beta_true + rng.normal(size=num_samples))
sample_weight = np.exp(rng.normal(size=num_samples))

# Solve using model directly
model = LinkedRidge(alpha=1, verbose=False)
model.fit(X, y, sample_weight=sample_weight)
r2_scores = cross_val_score(model, X, y, 
                            fit_params={"sample_weight": sample_weight})
assert r2_scores.mean() > 0.99

Here are results from running the model on a problem with \(1000\) observations and a varying number of features. In order to check the results, I compared with scipy.optimize.minimize without providing any information about the gradient nor the Hessian. The iterations are not directly comparable, so for scipy I only include the final answer instead of progress per iteration.

The newton method uses expansion of the full loss function, the ils method (Iterated Least Squares) expands the inverse link function, and auto cycles between the two approaches (one iteration is actually two iterations—one of each). I also implemented step-halving, see the code below.

Python code

Here’s a simple Python implementation of the optimization routines outlined above. We subclass the Ridge model from scikit-learn so our implementation inherits the scoring method and becomes compatible with other functions in the scikit-learn ecosystem.

Scaling the input features and the target is a good idea. For instance, down-scale a percentage in the range \([0, 100]\) to \([0, 1]\) instead of scaling up the sigmoid function.

Please note that this is not a production grade-implementation. If you need that, check out my Python package generalized-additive-models.

import functools  # Python version: 3.11.6
import numpy as np  # Version: 1.26.1
import scipy as sp  # Version: 1.11.3
from sklearn.linear_model import Ridge  # Version: 1.3.2

class LinkedRidgeBase(Ridge):
    """The user must inherit this class and override
    the functions h(eta), h'(eta) and h''(eta)."""

    def h(self, x):
        raise NotImplementedError

    def h_prime(self, x):
        raise NotImplementedError

    def h_prime_prime(self, x):
        raise NotImplementedError

    def __init__(self, alpha=1, tol=0.0001, solver="auto", max_iter=99, verbose=False):
        assert solver in ("newton", "ils", "auto", "scipy")
        self.alpha = alpha  # Regularization strength
        self.tol = tol  # Convergence tolerance
        self.verbose = verbose  # Whether or not to print
        self.solver = solver
        self.max_iter = max_iter  # Maximum number of iterations

    def fit(self, X, y, sample_weight=None):
        # We store alpha as a vector, but it should correspond to sklearn alpha
        self.alpha_ = np.ones(X.shape[1]) * self.alpha**0.5
        w = np.ones(X.shape[0]) if sample_weight is None else sample_weight
        beta_init = self._initial_guess(X, w=w, y=y)
        if self.verbose:
            print(f"Iteration 0. Obj: {self.objective(beta_init, X, w, y):.2f}")
        self.coef_ = self.solve(beta_init, X, w, y, self.solver)
        return self

    def _initial_guess(self, X, w, y):
        # Compute Taylor expansion of h^-1(mu) = eta = X @ beta, then solve
        # |h(X @ beta) - y| ~ |X @ beta - h^-1(y)| for initial beta.
        y0 = np.mean(y)
        y_trans = y0 + (y - self.h(y0)) / (self.h_prime(y0))
        return self.solve_lstsq(X, w=w, y=y_trans)  # Solve transformed Ridge

    def solve_lstsq(self, X, w, y):
        """Solve |X @ beta - y|^2 + |diag(alpha) * beta|^2. for beta."""
        X_T_W = (X * w[:, None]).T
        lhs = X_T_W @ X  # Form X @ W @ X.T
        np.fill_diagonal(lhs, lhs.diagonal() + self.alpha_**2)  # Add alpha inplace
        return sp.linalg.solve(lhs, X_T_W @ y, assume_a="pos")

    def predict(self, X):
        """Predict on a new data set using h(X @ beta)."""
        return self.h(X @ self.coef_)

    def objective(self, beta, X, w, y):
        """Objective function, evaluated at beta."""
        eta = X @ beta
        diff = self.h(eta) - y
        penalty = np.dot(self.alpha_ * beta, self.alpha_ * beta)
        return np.dot(diff, w * diff) + penalty

    def gradient(self, beta, X, w, y):
        """Gradient of the objective function, evaluated at beta."""
        eta = X @ beta
        iter_weights = w * ((self.h(eta) - y) * self.h_prime(eta))
        return X.T @ iter_weights + self.alpha_**2 * beta

    def hessian(self, beta, X, w, y):
        """Hessian of the objective function, evaluated at beta."""
        eta = X @ beta
        iter_weights = w * self.h_prime(eta) ** 2
        iter_weights += w * (self.h(eta) - y) * self.h_prime_prime(eta)
        hess = (X * iter_weights[:, None]).T @ X
        np.fill_diagonal(hess, hess.diagonal() + self.alpha_**2)  # Add alpha inplace
        return hess

    def halving_search(self, beta0, beta1, X, w, y):
        """Perform step halving search between beta0 and beta1."""
        # Starting objective value
        objective0 = self.objective(beta0, X, w, y)
        for step_size in map(lambda i: 0.5**i, range(20)):  # 1, 1/2, 1/4, ...
            beta = (1 - step_size) * beta0 + step_size * beta1
            objective = self.objective(beta, X, w, y)
            if objective < objective0:  # Found better solution
                return beta
        return beta0  # No better solution found

    def solve_scipy(self, beta, X, w, y):
        """Solve with scipy.optimize.minimize."""

        class Callback:
            iteration = 1
            prev_beta = np.copy(beta)

            def __call__(cb, intermediate_result):
                step = intermediate_result.x - cb.prev_beta
                msg = f"Iteration {cb.iteration}. "
                msg += f"Log10 delta beta: {np.log10(sp.linalg.norm(step)):.2f}. "
                msg += f"Obj: {self.objective(intermediate_result.x, X, w, y):.2f}"
                cb.iteration += 1
                cb.prev_beta = intermediate_result.x

        result = sp.optimize.minimize(
            fun=functools.partial(self.objective, X=X, w=w, y=y),
            jac=functools.partial(self.gradient, X=X, w=w, y=y),
            callback=Callback() if self.verbose else None,
        return result.x

    def solve(self, beta, X, w, y, method):
        """Driver method for calling solution algorithms."""
        if method == "scipy":
            return self.solve_scipy(beta, X, w, y)

        def auto(beta, X, w, y):
            """One step of each :)"""
            beta1 = self.iterative_least_squares(beta, X, w, y)
            return self.newton(beta1, X, w, y)

        optimizers = {
            "newton": self.newton,
            "ils": self.iterative_least_squares,
            "auto": auto,

        for iteration in range(1, self.max_iter + 1):
            beta_new = optimizers[method](beta, X, w, y)
            step = beta_new - beta
            beta = beta_new
            if self.verbose:
                msg = f"Iteration {iteration}. "
                msg += f"Log10 delta beta: {np.log10(sp.linalg.norm(step)):.2f}. "
                msg += f"Obj: {self.objective(beta, X, w, y):.2f}"

            if self._should_stop(beta, step):  # Tolerance criterion
                return beta
        return beta

    def newton(self, beta, X, w, y):
        """Use Newton's algorithm to minimize the objective."""
        g = self.gradient(beta, X, w, y)
        H = self.hessian(beta, X, w, y)

        # Solve H @ beta = g, assuming H = U @ s @ U.T is positive definite
        s, U = sp.linalg.eig(H)
        U = np.real(U)
        s = np.maximum(1e-2, np.real(s))
        step = np.linalg.multi_dot([U / s, U.T, g])
        # step = sp.linalg.solve(H, g, assume_a="pos")
        return self.halving_search(beta, beta - step, X, w, y)

    def iterative_least_squares(self, beta, X, w, y):
        """User iterative least squares to minimize the objective."""
        eta = X @ beta
        diag_h_prime_eta_X = (X.T * self.h_prime(eta)).T  # diag(h'(eta)) @ X
        z = y - self.h(eta) + diag_h_prime_eta_X @ beta  # Pesudodata
        # Solve |(diag(h'(eta)) @ X) @ beta - z|^2 + |alpha|^2
        beta1 = self.solve_lstsq(diag_h_prime_eta_X, w, z)
        return self.halving_search(beta, beta1, X, w, y)

    def _should_stop(self, beta, step):
        """Stopping criterion."""
        return np.max(np.abs(step)) < self.tol * np.max(np.abs(beta))

A generalization to arbitrary loss functions

Instead of squaring the residuals \(\epsilon_i = h\left( \boldsymbol{x}_i \boldsymbol{\beta} \right) - y_i\) we can use a general function \(g\). The function \(g: \mathbb{R} \to \mathbb{R}\) could be \(\epsilon \mapsto \epsilon^2\), \(\epsilon \mapsto \lvert \epsilon \rvert\), \(\epsilon \mapsto \log(1 + \lvert \epsilon \rvert)\), the Huber loss function, or something else. Beware that not all of these functions are easy to optimize, even with gradient and Hessian information.

Loss function. Substituting the squared loss for an arbitrary pointwise function \(g\), the loss becomes

\begin{align*} \mathcal{L}(\boldsymbol{\beta}) &= W g \left( h\left( X \boldsymbol{\beta} \right) - \boldsymbol{y} \right) + \alpha \boldsymbol{\beta}^T \boldsymbol{\beta} \\ &= \sum_i^m w_i g \left( h\left( \boldsymbol{x}_i \boldsymbol{\beta} \right) -y_i \right) + \alpha \sum_j^n \beta_j^2. \end{align*}

Gradient. Taking the gradient with respect to \(\boldsymbol{\beta}\) we obtain

\begin{align*} \nabla \mathcal{L}(\boldsymbol{\beta}) &= X^T W \left[ g'\left( \boldsymbol{\epsilon} \right) \odot h'( \boldsymbol{\eta} ) \right] + 2 \alpha \boldsymbol{\beta} \\ &= \sum_i^m w_i g'\left( \epsilon_i \right) h'(\eta_i) \boldsymbol{x}_i^T + 2 \alpha \boldsymbol{\beta}, \end{align*}

where \(\boldsymbol{\eta} = X \boldsymbol{\beta}\) and \(\boldsymbol{\epsilon} = h\left( X \boldsymbol{\beta} \right) - \boldsymbol{y}\).

Hessian. And finally, the Hessian becomes

\begin{align*} \nabla^2 \mathcal{L}(\boldsymbol{\beta}) &= X^T W \operatorname{diag} \left[ g''\left( \boldsymbol{\epsilon} \right) \odot h'\left(\boldsymbol{\eta} \right)^2 + g'\left( \boldsymbol{\epsilon} \right) \odot h''\left(\boldsymbol{\eta} \right) \right] X + 2 \alpha I \\ &= \sum_i^m w_i \boldsymbol{x}_i^T \left[ g''\left( \epsilon_i \right) h'(\eta_i)^2 + g'\left( \epsilon_i \right) h''(\eta_i) \right] \boldsymbol{x}_i + 2 \alpha I. \end{align*}