**TLDR:** We show how Mean-Squared-Error regression models can be generalized such that the model outputs a normal distribution instead of a single prediction. Check out this notebook for the code.

Let’s say we’re building a regression model to predict tomorrow’s temperature. The model predicts 21.3 degrees Celsius. But what does this mean? Can we be sure the temperature won’t be below 20 degrees?

## Teaching Regression Models to Output Probability Distributions

In this blog post, we’ll explain how to train a regression model such that instead of outputting a single prediction, it outputs a *probability distribution*.
We choose a normal distribution, so the model will output a *mean* (e.g. 21.3 degrees) **and** a *standard deviation* (e.g. 2.7 degrees).
As a result, we can judge what *range* of possible temperature values the model thinks are likely.
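
To make this concrete, here is a minimal sketch that answers the question from the introduction using the example values above (mean 21.3, standard deviation 2.7). It relies only on `statistics.NormalDist` from the Python standard library; the 90% interval is just one illustrative choice of range.

```python
from statistics import NormalDist

# Values taken from the running example in the text.
predicted = NormalDist(mu=21.3, sigma=2.7)

# Probability that tomorrow's temperature is below 20 degrees.
p_below_20 = predicted.cdf(20.0)

# Central interval that, according to the model, contains 90% of the outcomes.
low, high = predicted.inv_cdf(0.05), predicted.inv_cdf(0.95)

print(f"P(T < 20) = {p_below_20:.2f}")
print(f"90% interval: [{low:.1f}, {high:.1f}]")
```

With these numbers, the model assigns roughly a one-in-three chance to a temperature below 20 degrees, so a point prediction of 21.3 alone would be rather misleading.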

Note that this is something we already have for classification models, where models typically assign a probability to each of the possible classes. With infinitely many possible values in the case of regression models, we have to resort to parametric models, such as the normal distribution. These models can be described using a fixed set of parameters.

The advantages of having a probability distribution instead of a single prediction include:

- We get a quantification of the certainty of the model, allowing us to better judge which predictions we can trust.
- We can use it to run simulations. For example, let’s say we’re actually interested in predicting energy costs. We could sample from our temperature probability distribution and feed it into an energy cost simulation.
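
The simulation idea from the second point could look roughly like the sketch below. Note that `energy_cost` is a purely hypothetical cost model we made up for illustration (heating is needed below 18 degrees, at an invented price); only the sampling step reflects the approach described here.

```python
import random
from statistics import fmean

def energy_cost(temperature_c: float) -> float:
    """Hypothetical cost model: heating is needed below 18 degrees Celsius."""
    heating_degrees = max(0.0, 18.0 - temperature_c)
    return 1.5 * heating_degrees  # made-up price per heating degree

rng = random.Random(0)  # fixed seed so the run is repeatable
mean, std = 21.3, 2.7   # the model's predicted distribution

# Sample possible temperatures and push each one through the cost model.
samples = [rng.gauss(mean, std) for _ in range(10_000)]
costs = [energy_cost(t) for t in samples]

expected_cost = fmean(costs)
cost_at_mean = energy_cost(mean)
print(f"expected cost: {expected_cost:.2f}, cost at predicted mean: {cost_at_mean:.2f}")
```

Feeding only the point prediction into this cost model yields zero cost, whereas averaging over samples gives a small positive expected cost. That difference is exactly the information a single prediction throws away.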

## Example: Temperature Prediction on Historical Data

As an example, we’ll use a dataset of daily minimum temperatures in Melbourne. A first look at the data suggests that there is a strong yearly pattern:

For the purpose of this tutorial, we will use a super simple model architecture: A neural network with no hidden layer, essentially a linear regression. However, in addition to the temperature prediction output unit, we’ll add another one for the standard deviation:
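
In code, such a two-output linear model could be sketched as follows. This is plain Python rather than the notebook's implementation, and the parameters are invented. One assumption worth flagging: we pass the second output through `exp` so that the standard deviation is always positive. The post does not say how positivity is enforced, and other choices (such as a softplus) would work as well.

```python
import math

def predict(features, weights, biases):
    """Linear layer with two output units: a mean and a standard deviation.

    `weights` is a pair of weight vectors (one per output unit),
    `biases` is a pair of scalars.
    """
    mean = sum(w * x for w, x in zip(weights[0], features)) + biases[0]
    raw = sum(w * x for w, x in zip(weights[1], features)) + biases[1]
    std = math.exp(raw)  # assumption: exp keeps the standard deviation positive
    return mean, std

# Two illustrative input features and made-up parameters.
mean, std = predict([1.0, 0.5], weights=([20.0, 2.0], [0.8, 0.4]), biases=(0.3, 0.0))
print(f"mean = {mean:.1f}, std = {std:.2f}")
```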

In the following, we will provide a high-level description of how one could train such a model. If you want to see the code, you can follow along in this self-contained Jupyter Notebook.

## Defining a Custom Loss Function

Normally, the temperature prediction unit would be trained using the Mean Squared Error loss function, which ships with Deep Learning libraries such as TensorFlow. However, this wouldn’t provide any training signal to the standard deviation output unit.

Therefore, we derive our own loss function, based on the negative log-likelihood of the training data, where the likelihood is computed from the probability density function of the normal distribution:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Here, *x* is the temperature of the training sample, *μ* is the model’s temperature prediction, and *σ* is the standard deviation.
If *f(x)* is the likelihood, the negative log-likelihood is:

$$-\log f(x) = \frac{(x-\mu)^2}{2\sigma^2} + \log\sigma + \frac{1}{2}\log(2\pi)$$

Looking at this equation, you might notice its similarity to the Mean Squared Error: if we assume a constant standard deviation, the negative log-likelihood is just the squared error multiplied by a positive constant, plus another constant. Minimizing it is therefore equivalent to minimizing the Mean Squared Error. Hence, the temperature prediction unit is trained more or less just as it would be with the regular regression training objective, but in addition, we also train the standard deviation output unit.
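
As a rough sketch, the negative log-likelihood of the normal distribution can be written as a small plain-Python function. The sample values are made up, and a real training setup would express the same formula with the tensor operations of the deep learning library in use. The second half checks the relationship to the Mean Squared Error numerically for a constant standard deviation.

```python
import math

def gaussian_nll(x, mean, std):
    """Negative log-likelihood of observation x under a normal distribution."""
    return ((x - mean) ** 2) / (2 * std ** 2) + math.log(std) + 0.5 * math.log(2 * math.pi)

def mean_nll(targets, means, stds):
    """Average loss over a batch of training samples."""
    losses = [gaussian_nll(x, m, s) for x, m, s in zip(targets, means, stds)]
    return sum(losses) / len(losses)

# Made-up targets and predictions for illustration.
targets = [19.0, 22.5, 24.0]
means = [21.3, 21.3, 21.3]

# With a constant standard deviation, the loss equals a * MSE + b.
std = 2.7
mse = sum((x - m) ** 2 for x, m in zip(targets, means)) / len(targets)
a = 1 / (2 * std ** 2)
b = math.log(std) + 0.5 * math.log(2 * math.pi)
loss = mean_nll(targets, means, [std] * len(targets))

print(f"loss = {loss:.4f}, a * MSE + b = {a * mse + b:.4f}")
```

Once the standard deviation is allowed to vary per sample, the `log(std)` term starts to matter: it penalizes the model for claiming a large uncertainty everywhere, while the first term penalizes it for being overconfident about predictions that turn out wrong.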

## Playing Around with the Model

As a toy example, let’s feed the network with the one-hot encoding of the current month as its only input feature. All the model can really do is “learn” the monthly averages and standard deviations.
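
The sketch below shows what the one-hot encoding looks like and what the model should converge to in this setting. The observations are invented stand-ins for the real dataset. With the month as the only input, the likelihood is maximized by each month's sample mean and (population) standard deviation, so we can compute the target values directly instead of training.

```python
from collections import defaultdict
from statistics import fmean, pstdev

def one_hot_month(month: int) -> list[int]:
    """Encode a month (1-12) as a 12-dimensional one-hot vector."""
    return [1 if m == month else 0 for m in range(1, 13)]

# Made-up (month, temperature) pairs standing in for the real dataset.
observations = [(1, 14.0), (1, 16.0), (1, 15.0), (7, 6.0), (7, 8.0), (7, 4.0)]

by_month = defaultdict(list)
for month, temperature in observations:
    by_month[month].append(temperature)

# Per-month mean and population standard deviation: the values a model
# trained on the negative log-likelihood should approach in this toy setup.
monthly_params = {m: (fmean(ts), pstdev(ts)) for m, ts in by_month.items()}

print(one_hot_month(7))
print(monthly_params)
```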

Here is a plot of 1990’s temperatures, the predictions, and confidence intervals:

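The gray area corresponds to the prediction plus/minus one standard deviation. Under the normal distribution, this range contains roughly 68% of the probability mass, so it can be read as a 68% interval (strictly speaking, a prediction interval rather than a confidence interval). As expected, predictions are constant within one month and most of the actual temperatures fall into the interval.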
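
A simple sanity check is to compare the fraction of actual temperatures inside the band with the 68% we would expect from a normal distribution. The following sketch uses made-up numbers, not values from the dataset, and `coverage` is a helper name we chose for illustration.

```python
from statistics import NormalDist

def coverage(actuals, means, stds, k=1.0):
    """Fraction of actual values inside mean +/- k standard deviations."""
    inside = [abs(x - m) <= k * s for x, m, s in zip(actuals, means, stds)]
    return sum(inside) / len(inside)

# Probability mass within one standard deviation of a normal distribution.
standard_normal = NormalDist()
expected = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)

# Made-up values for illustration.
actuals = [14.2, 15.9, 13.1, 17.8, 15.0]
means = [15.0] * 5
stds = [1.5] * 5
observed = coverage(actuals, means, stds)

print(f"expected: {expected:.3f}, observed: {observed:.3f}")
```

If the observed coverage on held-out data is far below the expected value, the model is overconfident; if it is far above, the predicted standard deviations are too large.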

Note that these predictions are computed on a subset of the training data, so the plot does not tell us how well the model generalizes. In general, it is best practice to split the data into training and development subsets in order to measure and combat overfitting.

## Going to Complex Features and Models

Now, we don’t need deep learning to compute averages and standard deviations. The beauty of the approach is that you can plug in your fancy Recurrent Neural Network and add tons of features and everything works just the same.

For example, if we append the average and standard deviation of the temperature within the last seven days to the model’s feature vector, we can see that the model already starts to more accurately predict the actual temperature:
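
The rolling features mentioned above could be computed along these lines. This is a sketch with invented temperatures, and it makes one assumption the text leaves open: the window covers the seven days *before* the day being predicted, so the target itself never leaks into its own features.

```python
from statistics import fmean, pstdev

def trailing_features(temperatures, window=7):
    """For each day from index `window` on, return the mean and standard
    deviation of the previous `window` days (excluding the day itself)."""
    features = []
    for i in range(window, len(temperatures)):
        history = temperatures[i - window:i]
        features.append((fmean(history), pstdev(history)))
    return features

# Made-up daily temperatures for illustration.
temps = [12.0, 13.5, 11.0, 14.0, 15.5, 13.0, 12.5, 16.0, 14.5]
feats = trailing_features(temps)

for (avg, std), target in zip(feats, temps[7:]):
    print(f"last-7-day mean = {avg:.2f}, std = {std:.2f}, target = {target}")
```

Each resulting pair would simply be appended to the one-hot month vector before it is fed into the model; the loss function stays unchanged.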

## Conclusion

We showed how a relatively simple modification to the standard regression training objective can lead to models which are able to output parameterized probability distributions instead of isolated estimates. The approach is widely applicable and not restricted to a particular neural network architecture.

Once again, have a look at the code if you’re curious about some of the implementation details. If you try out the approach in your project, drop us a mail to let us know how it went :)