The average value

Presentation slides and video of a talk I gave about notions of averages (one-parameter models), loss functions and simple machine learning problems.

Below is a transcript with slides, if you prefer reading over listening.


The average value

Hello, my name is Tommy and this is a recording of a talk I gave at an IT conference. The title is “The average value” and the intended audience is intermediate-level data scientists. In this recording, I’ll go through the presentation informally. I’ll assume you have some mathematical background and are interested in data science. If you want to look at these slides or check the references, you can download them from this URL.

The average value—this is the equation for it, and if that was all there was to say, the talk would end here. This slide introduces some notation, but obviously to make a presentation out of this, we need to generalize away from this equation.

We’ll do that by looking at the optimization problem that this equation—the arithmetic average—solves. That’s the least squares optimization problem. In general, in machine learning, when we do regression tasks, it’s very common that we have squared loss and some kind of function \(f\) where we want to compute the optimal parameters for \(f\). Here, \(f\) could be a tree-based model, linear regression, or neural network, and somehow we find the best parameters \(\theta\).
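In symbols, and roughly as I read the slide, the regression problem is

\[
\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} \left( y_i - f(x_i, \theta) \right)^2,
\]

and with the one-parameter model \(f(x, \theta) = \theta\) the minimizer is exactly the arithmetic average \(\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} y_i\).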

However, there is typically no closed-form solution. For the arithmetic average there is a closed-form equation, but for many more complicated models, you use something like gradient descent or Newton’s method. You end up with a result, but there’s no equation you can use to just compute the results—it’s an iterative procedure. In many problems, there’s no guarantee that you have a global minimum or that you always end up with the same one.

Basically, this talk isn’t really about the average value—it’s about setting the model \(f\) to be a single parameter model. We set \(f(x,\theta) = \theta\) and ask ourselves “what else can we do?”. If we fix the model to be the simplest possible model we can think of, what else can we do? We can tweak the loss function!

Table of contents

Here’s the table of contents for the presentation:

  1. First, we’ll look at the Pythagorean averages—the arithmetic average, geometric average, and harmonic average—and see that there are loss functions that induce these averages in a very natural way
  2. Then we’ll look at the \(p\)-norm, which you might know about—it’s a way to measure distance, and we can use that to get the notion of the median and the midpoint
  3. We’ll examine Huber, mellowmax, and quantiles, which are one-parameter models that go further away from the Pythagorean averages and the \(p\)-norm
  4. Finally, we’ll look at some real-world examples aimed at developers: (1) ETA for file downloads and (2) Ranking products by ratings on websites

The Pythagorean averages

Let’s start with the Pythagorean averages.

The arithmetic average is the most common average, and it minimizes squared loss. There’s an assumption that if you’re twice as far off, it’s four times worse because you’re squaring the residuals. It has really nice mathematical properties—basically, if you have squared loss, you have a quadratic, so when you differentiate it, you get something linear that you can solve immediately. Newton’s method can solve this in one iteration—you can one-shot these squared problems!

This figure is important to understand—it shows for each data point (1, 3, and 6) the loss, and then shows the sum of these three losses over the three data points. That’s the solid blue line, and this is the loss function. If you minimize it, you get the arithmetic average, shown in the dashed line.
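To check this numerically rather than with the closed form, here is a minimal sketch using the same data points 1, 3 and 6 (the code is mine, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1.0, 3.0, 6.0])  # the data points from the figure

# Minimize the summed squared loss numerically and compare to the closed form.
result = minimize_scalar(lambda theta: np.sum((y - theta) ** 2))
print(result.x, y.mean())  # both are 10/3, roughly 3.333
```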

One final comment in the footnote: how do we get this squared loss in the first place? One way is to look at the normal distribution, which is essentially an exponential to the power of squares. But you also end up with averages in Poisson models and binomial models, so there are many types of assumptions about data that lead to arithmetic averages.

The geometric average minimizes multiplicative loss. If you take the logarithm of the data and the logarithm of the model and compare those, you get the multiplicative loss. Here, being one order of magnitude below is equally bad as being one order of magnitude above. The geometric average is like the arithmetic average but with multiplication—it’s the number that you have to multiply with itself \(n\) times to get the product of the data.
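In symbols (my reconstruction of the multiplicative loss described here):

\[
\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} \left( \ln y_i - \ln \theta \right)^2 = \left( \prod_{i=1}^{n} y_i \right)^{1/n},
\]

which is the usual geometric average.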

The harmonic average minimizes relative loss, where you measure errors relative to the true value. This is equivalent to doing weighted least squares where the weights are given by \(1/y_i\). This particular loss—the relative loss—is not the only loss function that induces this notion of average. The loss in the footnotes will also lead to the same harmonic average, but the relative loss one is more natural and commonly used.
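Again in symbols, taking the relative loss to be the weighted least squares with weights \(1/y_i\) mentioned above:

\[
\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} \frac{(y_i - \theta)^2}{y_i} = \frac{n}{\sum_{i=1}^{n} 1/y_i},
\]

which is the harmonic average.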

Let’s look at these three loss functions in a different problem setting. Suppose you have resources to distribute—maybe you have bakeries and predictions for tomorrow’s bread sales. You have a central bakery where you bake bread and want to distribute it to cover every store.

Unfortunately, you don’t have enough bread to cover every store, so you need to distribute according to some rule. You would like your distribution \(\theta\) to be as close as possible to the demand of each store \(y\), so you use a loss function. Depending on which loss function you use, you get three different distributions.

I’m not sure what the correct answer would be—it would probably depend on the bakeries’ goals. Would you use squared loss, multiplicative loss, or relative loss in the real world? I’ll leave that up to you to think about.

We’ve seen several loss functions, and it might be interesting to discuss how you would choose between them. Every loss function has the property that if your prediction is very close to the target, it’s good, and if it’s far off, it’s bad.

How do you choose between loss functions? One thing I like to do is consider the null space—by that, I mean considering which types of errors are equally bad and whether that makes sense in your application. For instance, if the true values \(y\) are 2 and 8:

  • Squared errors would imply that going two up in the positive direction on either prediction is equally bad
  • Multiplicative errors mean that predicting 4 (8/2) or 16 (8*2) is equally bad
  • Relative errors mean that subtracting 2 from the first prediction or adding 8 to the second one is equally bad, because those represent equal units of true values

Here are some practical examples:

  • If you’re trying to predict medicine dosages, you probably want to use relative or multiplicative errors, since different drugs work on different dosage scales
  • If you’re predicting economic outcomes or investments, you probably don’t want to use multiplicative or relative errors because money isn’t relative—one unit of money is the same whether it comes from a small project or a huge project
  • For financial predictions, you might want to use absolute errors rather than squared errors, because being twice as far off means losing twice as much, not four times as much (unless you want to consider the utility of money, but let’s skip that for now)
  • When people look at interest rates or summarize computer performance benchmarks (CPUs, etc.), they almost always use the geometric average

The \(p\)-norm

Having finished the first part about Pythagorean averages, let’s talk about the \(p\)-norm.

We’re going to generalize this equation in a different direction by solving this optimization problem for different values of \(p\) instead of \(p=2\) (which gives us the arithmetic average).
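Concretely, the family of problems being varied is (in the notation used so far)

\[
\hat{\theta}_p = \arg\min_{\theta} \sum_{i=1}^{n} \left| y_i - \theta \right|^p,
\]

with \(p=2\) recovering the arithmetic average.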

If we set \(p\) equal to one, we get the absolute error. This is an absolute value function placed on each data point, and when you sum it up, it’s piecewise linear. Minimizing it gives the median. If there’s an odd number of data points, there’s a unique minimum; if there’s an even number of data points, it’s flat between the two middle points—there’s no unique minimum.

The median has an interesting property: if I were to change the data point 6 and move it up to 8 or down to 4, as long as it doesn’t cross 3, the median stays the same. The median as a function completely ignores the location of every data point except for the middle one. Obviously, if 6 crosses 3, then something happens with the function, but it’s essentially blind to many of the data points.

Choosing \(p=1.5\) is one of many ways to create a function between a median and an average. It’s also the first optimization problem we’ve seen where there is no closed-form solution—here you would use an iterative optimization routine to find the minimum.

Choosing \(p=2\) is the squared error, which we’ve seen before.

As you increase \(p\) further, the functions become more and more sharp.

You get to an interesting property in the limit as \(p\) becomes very large: minimizing the \(p\)-norm becomes equivalent to minimizing the maximum. If you want to minimize the maximum distance, you want to make the worst case as good as possible, and you do that by placing yourself at the midpoint. This is one of the measures of centrality you might have learned in high school—the midpoint.
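Here is a small numerical sketch of that whole sweep, again using the data points 1, 3 and 6 (the code and parameter choices are mine):

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1.0, 3.0, 6.0])

def p_loss(theta, p):
    # Sum of |residual|**p over the data points.
    return np.sum(np.abs(y - theta) ** p)

for p in (1.0, 1.5, 2.0, 8.0, 64.0):
    res = minimize_scalar(p_loss, bounds=(y.min(), y.max()),
                          args=(p,), method="bounded")
    print(p, round(res.x, 3))
# p=1 gives the median (3), p=2 the average (about 3.333),
# and large p moves toward the midpoint (1 + 6) / 2 = 3.5.
```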

How can we combine these Pythagorean losses and \(p\)-norms? Let’s put them in a table:

  • Mean absolute error (\(p=1\), plain residuals)
  • Mean squared error (\(p=2\), plain residuals)
  • Maximum error (\(p\to\infty\), plain residuals; not used much in machine learning but used in engineering where you want to optimize the worst case, like in airplane design)
  • Mean squared logarithmic error (\(p=2\), multiplicative residuals)
  • Mean absolute percentage error (\(p=1\), relative residuals)
  • Weighted mean squared error (\(p=2\), weighted residuals)
Some combinations aren’t listed but could be imagined. For instance, with \(p=1\) and multiplicative error, you’d get something like the mean absolute logarithmic error. The point is that you can construct and combine these functions.

How would you do that? You can first look at an individual data point and consider what kind of loss function you’re interested in, then use a \(p\)-norm to combine them. The \(p\)-norm is a way to combine the individual losses.
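As a sketch of that recipe (the function names and structure are mine, not from the slides), you can build a total loss by choosing a per-point error and then aggregating with a \(p\)-norm-style sum:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def make_loss(point_error, p):
    """Combine a per-point error with a p-norm-style aggregation."""
    def total(theta, y):
        return np.sum(np.abs(point_error(y, theta)) ** p)
    return total

def log_error(y, theta):
    return np.log(y) - np.log(theta)  # multiplicative error

y = np.array([1.0, 3.0, 6.0])
loss = make_loss(log_error, p=2)  # "mean squared logarithmic error"

res = minimize_scalar(loss, bounds=(y.min(), y.max()), args=(y,), method="bounded")
print(res.x, np.exp(np.log(y).mean()))  # both are roughly the geometric average, 2.62
```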

Think about what properties you want in your application:

  • With absolute error, residuals of \((4,0)\) are as bad as \((3,1)\) and \((2,2)\). With money, this makes sense—if you’re off by four, you probably don’t care how it’s distributed if it’s money lost
  • It also means that \((9,0)\) is better than \((5,5)\), even though errors in \((5,5)\) are more balanced
  • Maximum error just looks at the largest value, so \((9,0)\) would be as bad as \((9,5)\) and \((9,9)\), while \((8,8)\) would be better than \((9,0)\)

Huber, mellowmax and quantiles

Now let’s look at even more esoteric notions of averages, maxima, minima, and everything in between.

One issue with squared errors is that they can be overly sensitive to outliers. One remedy for this is to use the smooth Huber loss, which acts like a square when you’re close to zero but like the absolute value when you zoom out. This makes it less sensitive to outliers while still maintaining a unique minimum even with an even number of data points.

There’s an opposite problem that might occur in some applications: squared errors are very flat around zero, so if you have a small residual and you square it, it becomes even smaller. If you’re regularizing coefficients (like the LASSO does) and want to drive them all the way to zero, it’s very nice to have something that acts like the absolute value close to zero. The reverse Huber has this effect—it’s like the absolute value when you’re close to zero and acts like a square if you’re far away.
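Here are the textbook piecewise definitions of both, as I understand them (the talk’s “smooth Huber” may refer to the pseudo-Huber variant, but the idea is the same):

```python
import numpy as np

def huber(r, delta=1.0):
    """Quadratic near zero, linear in the tails: less sensitive to outliers."""
    r = np.abs(r)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

def reverse_huber(r, delta=1.0):
    """Linear near zero, quadratic in the tails (often called berHu)."""
    r = np.abs(r)
    return np.where(r <= delta, r, (r**2 + delta**2) / (2 * delta))

print(huber(np.array([0.1, 5.0])))          # roughly [0.005, 4.5]
print(reverse_huber(np.array([0.1, 5.0])))  # roughly [0.1, 13.0]
```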

Mellowmax and mellowmin are smooth generalizations of maximum and minimum. They are functions that:

  • As \(\alpha\) goes to infinity, the function approaches the maximum
  • As \(\alpha\) goes to negative infinity, the function approaches the minimum
  • As \(\alpha\) goes to zero, the function approaches the arithmetic average

If you want something that acts in between the arithmetic average and the maximum, mellowmax is a nice way to do it. The maximum only depends on the maximum value in the dataset, but sometimes when you want to minimize the maximum, you want to bring all values down. One way to do that is to drive down the maximum value, but as a function, the maximum is blind to every other value apart from the largest one. If you use mellowmax instead, it’s actually a bit sensitive to other values, which can better inform the optimization routine.
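For concreteness, here is the standard mellowmax formula, which has exactly the three limiting properties listed above (I am assuming the talk uses this definition or something equivalent):

```python
import numpy as np
from scipy.special import logsumexp

def mellowmax(x, alpha):
    """log(mean(exp(alpha * x))) / alpha, computed stably with logsumexp."""
    x = np.asarray(x, dtype=float)
    if alpha == 0:  # the limit as alpha -> 0 is the arithmetic average
        return x.mean()
    return (logsumexp(alpha * x) - np.log(len(x))) / alpha

x = [1.0, 3.0, 6.0]
for alpha in (-50, -1, 0, 1, 50):
    print(alpha, round(mellowmax(x, alpha), 3))
# Large negative alpha gives roughly min(x), large positive roughly max(x),
# and alpha near zero gives the arithmetic average.
```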

Sometimes it’s very nice to relax these functions that are piecewise linear or have derivatives that are zero for most inputs. The median only depends on one or two values for most inputs, and if you want something more smooth, this is a nice option.

Let’s quickly cover quantiles—I assume you know about them. They’re a way to be somewhere between the median and the minimum/maximum. They put an absolute value function on every data point but scale it up and down on the left and right.

A small generalization: there’s no reason we have to use the absolute value—we could use squared residuals too. If \(p=2\), you would call it an expectile, but in general, you don’t have to use \(p=1\) or \(p=2\). You could in principle use any value for \(p\).
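A minimal sketch of the asymmetric loss (the helper name and example values are mine): scale \(|y_i - \theta|^p\) by \(q\) when the data point is above \(\theta\) and by \(1 - q\) when it is below.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def asymmetric_loss(theta, y, q, p):
    """Pinball-style loss: weight q above theta, 1 - q below, residuals to the power p."""
    r = y - theta
    weights = np.where(r >= 0, q, 1.0 - q)
    return np.sum(weights * np.abs(r) ** p)

y = np.array([1.0, 3.0, 6.0])
for p in (1.0, 2.0):  # p=1: quantile, p=2: expectile
    res = minimize_scalar(asymmetric_loss, bounds=(y.min(), y.max()),
                          args=(y, 0.9, p), method="bounded")
    print(p, round(res.x, 3))  # roughly 6.0 (0.9-quantile) and 5.27 (0.9-expectile)
```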

ETA for file downloads

Let’s look at our first practical example: ETA for file downloads.

Suppose you’re a developer writing a routine that will download files, copy files, or install software. The computer is doing chunks of work, and you have the time that each chunk takes. You know the amount of work done and how much work will be done in total (like the total megabytes for all files to download). You want to estimate how long this will take.

To solve this, we need to handle:

  1. Non-equidistant arrivals (times or pieces of work)
  2. We’ll assume a linear model is reasonable (parameters can change—if someone else starts downloading, your speed might drop, but after it drops we assume it’s still linear with a different rate)
  3. Very importantly, when a new data point arrives, we must be able to update our model in constant time—we can’t iterate over all previous data points, as there could be millions

Here’s one approach: we can compute the average rate of work. With two pieces of work (two files downloaded), we can compute the slope for each piece (work done over time elapsed). By minimizing this function, we get \(\theta\), which is the average amount of work done per time.

I weight this by \(\Delta t\) (time) because when I solve that optimization problem, it leads to an equation with nice symmetry in \(\theta\). It should make intuitive sense that downloading a file in 1 second contains some information, but downloading the next file in 30 seconds contains much more information—so large \(\Delta t\) should be weighted more heavily.
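Written out under my reading of this, with \(\Delta w_i\) for the work done in chunk \(i\) and \(\Delta t_i\) for the time it took:

\[
\hat{\theta} = \arg\min_{\theta} \sum_{i} \Delta t_i \left( \frac{\Delta w_i}{\Delta t_i} - \theta \right)^2 = \frac{\sum_i \Delta w_i}{\sum_i \Delta t_i},
\]

that is, total work over total time.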

We use \(w_i\) to weigh the most recent data more heavily. We can split it up like this: the most recent data point is attached at the end, so we have the previous estimate and some new data. We can quickly update by keeping track of our previous computation and adding to it.

We then take a linear model constrained to go through the last data point seen. We have one degree of freedom left, which we use on the slope with this estimate. We set the weights proportional to some number to the power of \(i\) over \(\gamma\), where \(\gamma\) is the half-life. This parameterizes the weights so the most recent data is weighted more heavily.
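Putting these pieces together, here is a minimal sketch of such an estimator (the class, the per-chunk decay \(0.5^{1/\gamma}\) and the example numbers are my assumptions, not taken from the talk):

```python
import math

class ETAEstimator:
    """Exponentially weighted rate estimator with O(1) updates per chunk."""

    def __init__(self, total_work, half_life=10.0):
        self.total_work = total_work
        self.decay = 0.5 ** (1.0 / half_life)  # weights halve every `half_life` chunks
        self.work_done = 0.0
        self.num = 0.0  # running sum of w_i * delta_work_i
        self.den = 0.0  # running sum of w_i * delta_t_i

    def update(self, delta_work, delta_t):
        # Down-weight everything seen so far, then append the new chunk.
        self.num = self.decay * self.num + delta_work
        self.den = self.decay * self.den + delta_t
        self.work_done += delta_work

    def rate(self):
        # Weighted average rate: sum(w * dw) / sum(w * dt).
        return self.num / self.den if self.den > 0 else math.nan

    def eta(self):
        # Line through the last data point, with slope rate().
        return (self.total_work - self.work_done) / self.rate()

est = ETAEstimator(total_work=100.0, half_life=10.0)
for dw, dt in [(10, 1.0), (10, 1.2), (10, 0.9), (10, 2.0)]:
    est.update(dw, dt)
print(round(est.rate(), 2), round(est.eta(), 2))  # roughly 7.7 work/s and 7.8 s left
```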

Here’s how it performs:

  • If the speed changes when we see more data, obviously it’s not a very good model
  • If it’s linear all the way through, it’s quite accurate
  • With some deviation from linearity, it picks up an average rate somewhere in between
  • With noise, it goes through the last data point and picks up an average rate that’s quite close

As more data arrives, it will actually improve because it’ll pick up on new regimes with different rates of change. This is a simple algorithm—very fast and accurate in many real-world cases. It depends on the application, but it’s probably more advanced than many file ETA estimates I’ve seen, which tend to be jittery and jump around too much.

Ranking products by ratings

Our final example is ranking products by ratings.

This is a fascinating real-world example of how you can use averages effectively. I looked at Norwegian e-commerce websites where users can give ratings. All these websites use a naive average rating, which produces some strange results:

  • The average of a single 5 is 5, but the average of many 5s and a few 4s would be slightly less than 5
  • This means a product with one 5-star rating would rank higher than a product with hundreds of 5-star ratings and a few 4-star ratings
  • I would trust the product with more ratings to be better, as one rating doesn’t give much certainty about true quality
  • Many websites also place products with no ratings at the very bottom, which I think is wrong—a new product with no ratings is likely average, not the worst product

I didn’t find any Norwegian e-commerce websites that do this correctly (I didn’t look at international ones). One of the few sites that gets this right is the IMDb Top 250 list—they account for the number of ratings and use an equation similar to what we’re going to deduce.

We want our average rating to have two properties:

  1. It should take into account the number of ratings
  2. A product without any ratings is likely typical, not really bad

Here’s a crucial idea: what are we really doing when we’re trying to rank products? We’re going to frame this as a prediction problem—we’re predicting the next rating. If you were to buy this product and rate it, what would your rating be? That would be the next rating for this product.

So we’re casting it as a prediction problem and saying that ranking products is equivalent to predicting the next rating and sorting by that prediction. We’ll do this by:

  1. Introducing a prior average \(\bar{y}\) and a pseudo count \(\alpha\)
  2. Solving this optimization problem
  3. The solution shows the meaning of the pseudo count—it’s added to \(n\) as dummy ratings, and the value of these dummy ratings is \(\bar{y}\) (the prior average rating)
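Written out (my reconstruction from this description), with \(y_1, \dots, y_n\) the observed ratings of a product:

\[
\hat{\theta} = \arg\min_{\theta} \left[ \alpha \left( \theta - \bar{y} \right)^2 + \sum_{i=1}^{n} \left( y_i - \theta \right)^2 \right] = \frac{\alpha \bar{y} + \sum_{i=1}^{n} y_i}{\alpha + n}.
\]

With no ratings the estimate is simply the prior \(\bar{y}\), and as \(n\) grows the \(\alpha\) dummy ratings matter less and less.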

To determine the hyperparameters \(\bar{y}\) and \(\alpha\), we need to do cross-validation. I assume you know about leave-one-out and k-fold cross-validation. We’re going to do “leave-next-out” cross-validation.

It’s very important that your cross-validation procedure be as similar as possible to how you want to use the model. Since we defined our notion of ranking as predicting the next value, we’ll simulate that exactly:

  1. Start with no ratings, try to predict the first one
  2. Get to know the first rating, look at the squared residual
  3. Use that first data point to predict the second one
  4. Get to know the second data point and compare
  5. Continue this process

This is essentially a time series split where you learn one more data point at a time. Both of these can be computed in linear time for these simple models, so we can do this very quickly.
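Here is a minimal sketch of leave-next-out cross-validation for the regularized average (it ignores the temporal weighting used below, and the rating histories and grid values are made up for illustration):

```python
def leave_next_out_loss(ratings, alpha, prior):
    """Predict each rating from the ones before it; return the mean squared error."""
    loss, total, n = 0.0, 0.0, 0
    for y in ratings:  # ratings must be in chronological order
        prediction = (alpha * prior + total) / (alpha + n)
        loss += (y - prediction) ** 2
        total += y
        n += 1
    return loss / len(ratings)

# Toy grid search over the two hyperparameters.
products = [[5, 4, 5, 5], [3, 2, 4], [5]]  # made-up rating histories
best = min(
    ((alpha, prior, sum(leave_next_out_loss(r, alpha, prior) for r in products))
     for alpha in (0.5, 1, 2, 4) for prior in (3.0, 3.5, 4.0)),
    key=lambda t: t[2],
)
print(best)  # (alpha, prior, total loss) for the best grid point
```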

Let’s apply it to a real dataset. There’s a Norwegian website called Legelisten that ranks doctors.

I found 17 doctors with 212 total ratings. I used our average with some weights, parameterizing those weights by the half-life \(\gamma\). Using leave-next-out cross-validation, I found:

  • Optimal number of pseudo-ratings is around \(\alpha=2\)
  • Optimal half-life is around \(\gamma =7\) years
  • Prior rating is \(\bar{y}=3.8\) (on a scale from 1 to 5)

The loss when using the arithmetic average as our predictor was 1.77. With this regularized temporal average, it’s 1.55—a very nice reduction in loss that leads to better predictions of the next rating.

You might ask why not make this into a big machine learning project using all kinds of covariates, but I don’t think people would like that. If you rank doctors using information about their gender and age, it’s probably not good. This solution is something people will understand and won’t feel discriminates unfairly based on age, gender, or other factors. It’s still super simple, just more nuanced and created from thinking through the problem more carefully.

Looking at the results (blue is the old website rating, green is my proposed rating):

  1. The third doctor: One person rated this doctor with a 5. The old ranking puts them first; we say no—one rating isn’t much, so we pull them down. The doctor with many 5s deserves to be ranked higher.

  2. The tenth doctor is interesting because they’re pulled away from the grand average. Typically without temporal weighting, you pull towards the grand average, but this doctor has gotten more positive reviews lately—lots of 5s and 4s, 44 ratings in total. This doctor has improved over the years, so the half-life is forgiving the older low ratings and pulling up towards the more recent 5s.

  3. The seventeenth doctor hasn’t been rated once. The old ranking would place them last; we say no—with no ratings, we believe you’re in the middle of the pack. We place them there, certainly better than someone with just many 1-star ratings.

Summary

Let’s summarize the key points.

  1. Only simple problems have closed-form solutions. Don’t worry so much about how to solve things when you start modeling—look at what properties you want to model instead. Solving is left to an optimization algorithm.
  2. There’s a good reason why people like squared errors—they have nice mathematical properties. Not everything I showed today can be solved efficiently for bigger models.
  3. In machine learning, the solution depends on both the model and the loss function. Many data scientists focus heavily on the model, discussing which is better in detail. While the model is important, choosing the wrong loss function can be just as problematic. For instance, if errors are truly relative in your application but you optimize RMSE, or you have an economic application where one unit of money is one unit of money but you use a relative loss, you’re probably modeling the whole thing incorrectly.
  4. Adapt your approach to the problem. Combine existing ideas but don’t reinvent—there are many good books and sources. People have thought about amazing things, so read up on that. If you find yourself inventing something you believe is totally new to solve a practical problem, you’re probably reinventing something—do a literature search.
  5. Always start simple. Never use models you don’t understand—you’ll have a hard time explaining them to others, and they might behave in ways you don’t fully understand. There’s still plenty of room to have fun with simple models. Even today, some e-commerce sites are billion-dollar businesses and don’t implement averages correctly.

The final comment I want to make is that if you go back to the first slide and remove the restriction of the one-parameter model, everything we talked about still holds. All the loss functions and modeling approaches are still relevant, but by choosing one-parameter models, I was able to create nice plots and talk about things simply. Many of these concepts are still relevant for other models, even though computationally it’s not always easy to optimize.

I really hope you enjoyed my presentation. Thanks a lot for watching, and again, you can download the slides and look at the references if you want to. Thank you.