Ranking doctors

This article discusses two separate issues: the first two sections contain my thoughts on a recent study, while the remaining sections describe and solve an interesting ranking problem.

If you’re not interested in the study, skip ahead to the section “The ranking problem.”

The study

In Norway there’s a website called Legelisten.no. Legelisten roughly means “List of doctors”, and contains reviews of doctors (also called general practitioners, or GPs). Norwegians can use the website to choose which GPs to visit, based on ratings by others.

In the study Web-Based Public Ratings of General Practitioners in Norway: Validation Study by Bjertnæs et al., a group of researchers compare (1) ratings on Legelisten.no with (2) ratings from a survey of GPs conducted by the Norwegian Institute of Public Health. They found a Spearman correlation of \(0.41\) between how the same GPs were rated in (1) and (2), but their sample only contained \(46\) GPs.

The study was picked up by dagensmedisin.no under the headline “Forskere har undersøkt nytten av legelisten.no”, and then by forskning.no under the headline “Stor sannsynlighet for at Legelisten.no ikke fungerer for å finne en god fastlege”. The translation of the second headline is “High probability that Legelisten.no does not work for finding a good doctor,” which I think is unsubstantiated—but we’ll get to that.

Comments on the study

The authors of the study selected \(50\) GPs from the study by the Norwegian Institute of Public Health. Out of those, only \(46\) were on Legelisten.no. Out of those, only \(24\) had \(10\) or more ratings. The GPs were rated in \(13\) categories (e.g. “trust in advice”, “listening skills”, “opening hours”, etc). Based on this, the authors report, among other things, a table of \(13 \times 13 = 169\) correlations. Here are my immediate thoughts:

  • People who write reviews on Legelisten.no are typically either very happy or unhappy. A randomized survey like the one conducted by the Norwegian Institute of Public Health is likely a better indicator of patient satisfaction.
  • The sample size is low. I’m not sure what value all those correlations give, especially when they are based on so few GPs. Legelisten.no has information about tens of thousands of GPs, and the survey by the Norwegian Institute of Public Health contains information about \(2000\) GPs. I wish the researchers had compared more GPs.
  • Legelisten.no is still the best source of information people have. The overall correlation was \(0.41\), but even if the correlation was \(0.01\), any information is better than no information. Until the Norwegian Institute of Public Health publishes their data in a searchable format, Legelisten.no remains the best place to go for information. It’s nice to conduct randomized surveys and write research papers, but it doesn’t help people who are choosing a GP.
  • It’s reported that the correlation was \(0.2\) for GPs with fewer than \(10\) reviews on Legelisten.no, but \(0.6\) for GPs with \(10\) or more reviews. I don’t think a correlation of \(0.6\) is bad at all, especially if it’s the only piece of information you have (though there is a lot of noise in that \(0.6\) number). People who use Legelisten.no will correct for the number of reviews: they know that a GP with ratings \([5, 5, 5, 4, 5, 5]\) is a better bet than one with \([5]\), even though \(\operatorname{mean}([5]) > \operatorname{mean}([5, 5, 5, 4, 5, 5])\). This is the kind of thing that’s conveniently ignored in the mathematical formulation of a research paper, but it does not reflect reality: users likely read the reviews (there are numbers and text!) and consider the number of reviews.

In summary, I think it’s an interesting data set to look at. I wish the authors had used more data when computing correlations, though they seem aware of the shortcomings. The headline chosen by the journalist or editor is ridiculous: there is no reason to believe that there is “a high probability that Legelisten.no does not work for finding a good doctor.”

The ranking problem

In the article “Stor sannsynlighet for at Legelisten.no ikke fungerer for å finne en god fastlege”, the CEO of Legelisten.no disagrees with the conclusion of the study. He believes the overall ranking of GPs on the website reflects the underlying quality of the GPs, since their ranking algorithm takes the number of ratings into account.

It’s time to depart from the research study and instead consider an interesting question: how can we design a simple and efficient ranking algorithm for Legelisten.no?

Let’s grab some data. I manually fetched some ratings from Legelisten.no. Here they are:

data = [[5, 5, 5, 5, 5, 5, 5, 3, 5, 5, 5], #  <- This GP was ranked highest by legelisten
        [5, 5, 5, 5, 5, 5, 5, 5, 5, 2, 5, 5, 5, 5, 5, 5, 1, 5, 5, 4],
        [5, 5, 5, 5, 5, 5],
        [5, 5, 5, 5],
        [5, 5, 5, 1, 1, 5, 5, 5, 5, 3, 5, 5, 5, 5, 5, 5, 5],
        [1, 5, 5, 1, 5, 5, 5, 5, 5, 5, 1, 5, 1, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
        [2, 5, 5, 5, 5, 5, 5, 1, 5, 5, 5, 5, 5, 5, 5, 5, 1, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 1, 1, 5, 1, 1, 5, 5, 1, 2, 1, 4, 5, 5, 5, 4, 5],
        [5, 5, 2, 2, 5, 5, 2, 5, 5, 5, 5, 1, 5, 5],
        [5, 5, 5, 5, 1],
        [4, 5, 5, 2, 4, 4, 4],
        [5, 1, 5, 2, 5, 2, 5, 1, 4, 5, 1, 5, 3, 5, 1, 5, 2],
        [5],
        [5, 5, 5, 4, 5, 1, 1, 2, 1],
        [1, 1, 5, 1, 5, 4, 1],
        [3, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 5],
        [1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 4, 1, 1],
        []]  #  <- This GP was ranked lowest by legelisten

Each row represents a GP. Within each row, the first rating is the most recent one. The ordering of the rows is the ranking given to the GPs by Legelisten.no. My sampling of GPs was semi-random, but not entirely, so don’t read too much into this specific data set.

So, how should we rank these GPs? What would you do?

A regularized ranking algorithm

To think clearly about the problem, it helps to find some properties that our ranking should obey.

We want higher average ratings to lead to a higher ranking, if the number of ratings is identical:

\begin{align*} \operatorname{ranking}([4, 4]) > \operatorname{ranking}([3, 3]) \end{align*}

We want more ratings to lead to a higher ranking, if the ratings are already high (the opposite should hold for bad ratings):

\begin{align*} \operatorname{ranking}([5, 5, 5]) > \operatorname{ranking}([5]) \end{align*}

Also, I would probably argue that:

\begin{align*} \operatorname{ranking}([]) > \operatorname{ranking}([1, 1, 1, 1]) \end{align*}

Complications arise when the cases above are in conflict: how should we deal with this one?

\begin{align*} \operatorname{ranking}([5]) \quad \operatorname{vs.} \quad \operatorname{ranking}([5, 5, 4]) \end{align*}

To ensure \(\operatorname{ranking}([5, 5]) > \operatorname{ranking}([5])\), I propose using a pseudocount to regularize the estimate of the mean value towards the grand mean, and define

\begin{align*} \operatorname{ranking}(\mathbf{x}) = \frac{\alpha \bar{X} + \sum_{i=1}^n x_i}{\alpha + n}, \end{align*}

where \(\bar{X}\) is a “typical” rating; we compute it as the grand mean of all ratings in the data set. For our data set above, we have \(\bar{X} \approx 3.8\).
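
To see how the pseudocount resolves the conflict above, here is a minimal sketch, assuming \(\alpha = 2\) purely for illustration (how to choose \(\alpha\) properly is the topic of the next section):

alpha, grand_mean = 2, 3.8  # Assumed values, for illustration only

def ranking(x):
    return (alpha * grand_mean + sum(x)) / (alpha + len(x))

print(ranking([5]))        # (7.6 + 5) / 3  = 4.2
print(ranking([5, 5, 4]))  # (7.6 + 14) / 5 = 4.32

The single \(5\) is pulled harder towards the grand mean than the three high ratings are, so \([5, 5, 4]\) ends up with the higher score.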

A question remains: how do we choose the regularization strength \(\alpha\)?

  • If \(\alpha = 0\), then our ranking is simply the mean value per GP.
  • As \(\alpha \to \infty\), all rankings are regularized heavily towards the grand mean.

The loss function

To determine the regularization strength \(\alpha\), we need a loss function. Here’s the idea: what we’re really trying to do is estimate the rating we would give a GP if we went to visit them. In other words, the ranking score is the expected next rating a GP will receive.

Consider a GP with ratings \([4, 2, 3, 1]\). Recall that the last rating in the list is the oldest one.

To create a loss function, we consider every historical visit in turn:

  • First we compare our estimate \(\operatorname{ranking}([])\) with \(1\).
  • Then we compare \(\operatorname{ranking}([1])\) with \(3\).
  • Then we compare \(\operatorname{ranking}([3, 1])\) with \(2\).
  • Then we compare \(\operatorname{ranking}([2, 3, 1])\) with \(4\).

Each comparison gives a residual; we square the residuals, average them, and take the square root to compute the RMSE. We do this per GP and per historical visit. If the number of GPs and ratings is large and we want speed, we could use random sampling.

Here is an implementation of the estimator \(\operatorname{ranking}(\cdot)\):

def estimator(ratings, alpha, grand_mean):
    """Regularized mean: pull the mean of `ratings` towards `grand_mean`."""
    return (sum(ratings) + alpha * grand_mean) / (len(ratings) + alpha)
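
For instance, with the grand mean \(\bar{X} \approx 3.8\) computed above, and borrowing the optimal \(\alpha \approx 1.94\) found later in the article:

print(estimator([], 1.94, 3.8))   # -> 3.8: with no ratings, we fall back to the grand mean
print(estimator([5], 1.94, 3.8))  # -> (5 + 1.94 * 3.8) / (1 + 1.94), approximately 4.2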

And here is the loss evaluated over every possible combination of GP and visit:

import numpy as np

def loss(alpha, data):
    
    # Compute the grand mean to regularize toward
    grand_mean = np.mean([rating for row in data for rating in row])
    
    residuals = []
    
    # For each GP in the dataset
    for ratings in data:
        
        # For each historical visit
        for i, target in enumerate(ratings):
        
            # Predict the rating for this visit, using historical data
            prediction = estimator(ratings[i+1:], alpha, grand_mean)
            residuals.append(target - prediction)

    # Use the RMSE metric as loss
    return np.sqrt(np.mean(np.array(residuals)**2))

Determining the regularization strength \(\alpha\) is now a matter of optimization. We choose \(\alpha\) that best helps us estimate the next rating.
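
Here is a minimal sketch of that optimization, using scipy.optimize.minimize_scalar on the loss function above (the search interval of \((0, 100)\) is an assumption on my part):

from scipy.optimize import minimize_scalar

# Minimize the loss over alpha alone
result = minimize_scalar(loss, bounds=(0, 100), args=(data,), method="bounded")
alpha_star = result.x  # Approximately 1.94 on the data set above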

We can plot the regularization strength \(\alpha\) against the loss: it’s a nice, smooth function, and the minimum is attained at \(\alpha^{\star} = 1.94\).

The figure below compares (1) the ranking from Legelisten.no (shown on the horizontal axis), (2) the ranking by the mean (blue dots), and (3) the ranking by the regularized mean using \(\operatorname{ranking}(\cdot)\) with the optimal \(\alpha\). The black dots are the individual ratings for each GP.

Some comments on the figure above:

  • The ranking from Legelisten.no mostly agrees with our ranking.
  • A big difference is the last GP (number \(17\) in the figure). He has no ratings. With no information, our ranking places him in the middle of the pack. Legelisten.no places him last.
  • The regularized ranking does not believe that GP number \(12\), with ratings \([5]\), deserves a score of \(5\). The score of this GP is \(4.2\), regularized heavily towards the grand mean since the GP has only one rating. Even so, our proposed ranking would move him up compared to Legelisten.no.

A temporal regularized ranking algorithm

We can extend the ranking by adding some forgetfulness. Is it really fair to weigh the least recent rating as heavily as the most recent rating? I think not—the most recent rating is probably a better indicator of the next rating.

Let’s experiment with

\begin{align*} \operatorname{ranking}(\mathbf{x}) = \frac{\alpha \bar{X} + \sum_{i=1}^n w_i x_i}{\alpha + \sum_{i=1}^n w_i}, \end{align*}

where we set \(w_i = \exp(-\ln(2) (i - 1) / \gamma)\) and then normalize the weights so that they sum to \(n\). The parameter \(\gamma\) becomes a half-life, or memory parameter. For instance, if \(\gamma = 2\), then the weights associated with ratings \(\mathbf{x}\) of length \(6\) would be \([1, 0.707, 0.5, 0.354, 0.25, 0.177]\) before normalization and \([2.008, 1.420, 1.004, 0.710, 0.502, 0.355]\) after normalization.
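
A short sketch reproduces these weights (nothing here is specific to our data set):

import numpy as np

gamma, n = 2, 6
weights = 2 ** (-np.arange(n) / gamma)   # [1, 0.707, 0.5, 0.354, 0.25, 0.177]
weights = weights / np.sum(weights) * n  # Normalize so the weights sum to n
print(weights.round(3))                  # [2.008 1.42  1.004 0.71  0.502 0.355]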

  • As \(\gamma \to 0\), the ranking depends only on the most recent rating.
  • As \(\gamma \to \infty\), the ranking depends equally on all previous ratings (reducing this ranking score to the previous one without the temporal element).

We also optimize over \(\bar{X}\) instead of pre-computing it, allowing it, for instance, to be a bit lower than the actual grand mean if GPs tend to start out with lower ratings. The optimal parameters are \(\alpha^{\star} = 1.95\), \(\gamma^{\star} = 7.44\) and \(\bar{X}^{\star} = 3.8\). The loss function is smooth with respect to the parameters. In the figure below, two of the parameters are fixed at their optimal values in turn, while the third is allowed to vary, and the loss is plotted as a function of it:

This ranking looks pretty similar to the previous one, but GPs \(6\) and \(7\) have switched places. This is because GP \(6\) has gotten a few bad ratings recently, while GP \(7\) got a few bad ratings early on. The previous ranking, which only regularized towards the mean, ignored this; the temporal ranking picks it up.

Summary

This is an interesting problem because there is no clear answer. It helps to appeal to a well-defined loss function instead of arbitrary heuristics, and it helps to study simple cases.

Here are the loss scores for each of the ranking algorithms:

Algorithm                    Loss
Mean                         1.601
Regularized mean             1.552
Temporal regularized mean    1.548
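
These numbers can be reproduced (approximately) with the loss function from the Code section below, treating the plain mean as \(\alpha = 0\) and using a huge \(\gamma\) to switch off the forgetting. Note one assumption on my part: when a GP has no rating history, all three variants fall back to predicting the grand mean.

# Sketch: assumes `loss` and the converted `data` from the Code section below
print(loss([0.00, 1e9, 3.8], data))   # Mean                      -> approx. 1.60
print(loss([1.94, 1e9, 3.8], data))   # Regularized mean          -> approx. 1.55
print(loss([1.95, 7.44, 3.8], data))  # Temporal regularized mean -> approx. 1.55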

We could argue for the final ranking algorithm based on loss alone. But it also has two pleasing properties: (1) it treats new GPs with no ratings better (they don’t deserve to be last!), and (2) it “forgives” old ratings via the memory parameter \(\gamma\). Giving new GPs a fair chance and forgiving past errors both seem like nice properties, and they are backed up by the loss.

Many extensions are possible: different loss functions, adding more “features” into the ranking scores, and so forth. If I were tasked with solving this problem in real life, I would not complicate it much more than this. The ranking above can easily be implemented in SQL once the optimal parameters are known, and the parameters need only be updated annually or so (maybe even more rarely).

Code

Here is a draft of what the code might look like. The data variable is defined above in the article.

Python implementation: CPython
Python version       : 3.9.4
IPython version      : 7.22.0

numpy: 1.23.5
scipy: 1.9.3

import numpy as np
from scipy.optimize import minimize
import functools


def predict(ratings: np.ndarray, alpha: float, gamma: float, grand_mean: float):
    """Predict the expected rating, given previous ratings."""

    # If no previous ratings exist, return the grand mean
    n = len(ratings)
    if n == 0:
        return grand_mean

    # Gamma = half-life parameter, high gamma -> long memory
    weights = 2 ** (-np.arange(n, dtype=float) / gamma)  # Exponential weights
    weights = weights / np.sum(weights) * n  # Normalize

    # Alpha = regularization parameter, high alpha -> pull towards grand mean
    weighted_sum = np.sum(weights * ratings)
    return (weighted_sum + alpha * grand_mean) / (n + alpha)


def loss(args: np.ndarray, data: list):
    """RMSE over all possible predictions in the data set."""
    alpha, gamma, grand_mean = args  # Format for scipy.optimize.minimize

    # For each GP in the data set
    residuals = []
    for ratings in data:

        # For each historical visit
        for i, target in enumerate(ratings):

            # Predict the rating for this visit, using historical ratings
            prediction = predict(ratings[i + 1 :], alpha, gamma, grand_mean)
            residuals.append(target - prediction)

    # Use the RMSE metric as loss
    return np.sqrt(np.mean(np.array(residuals) ** 2))


# Convert data from list of lists to list of ndarrays
data = [np.array(d) for d in data]

# Minimize the loss function to find optimal parameters
result = minimize(
    loss,
    bounds=[(0, np.inf), (0.01, np.inf), (1, 5)],
    x0=[1, 1, 3],
    args=(data,),
)
assert result.success

# Get optimal parameters and use `predict` as the ranking function
alpha_star, gamma_star, grand_mean_star = result.x

ranking = functools.partial(
    predict, alpha=alpha_star, gamma=gamma_star, grand_mean=grand_mean_star
)

# Rank all GPs and print them
for i, ratings in enumerate(sorted(data, key=ranking, reverse=True), 1):
    print(f"GP ranked {i} has ratings {ratings}")