Regression is a machine learning task that predicts continuous values (real numbers), whereas classification predicts categorical (discrete) values. To learn more about the basics of regression, you can follow this link.
When you hear the word, ‘Bayesian’, you might think of Naive Bayes. However, Bayesian principles can also be used to perform regression. In this article, we will discuss and implement Bayesian Ridge Regression, which is not the same as regular Ridge Regression. To understand more about regular Ridge Regression, you can follow this link.
First of all, you must understand that Bayesian is just an approach to defining and estimating statistical models. Bayesian Regression can be very useful when we have insufficient data in the dataset or the data is poorly distributed. The output of a Bayesian Regression model is obtained from a probability distribution, whereas regular regression techniques obtain the output from a single estimated value of each model parameter. The output, ‘y’, is generated from a normal distribution (characterized by a mean and a variance). The aim of Bayesian Linear Regression is not to find a single best value of the model parameters, but rather to find the ‘posterior’ distribution for the model parameters. Not just the output y, but the model parameters are also assumed to come from a distribution. The expression for the posterior is:
$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Normalization}}$$
- Posterior: The probability of an event H occurring given that another event E has already occurred, i.e., P(H | E).
- Prior: The probability of event H before the evidence E is taken into account, i.e., P(H).
- Likelihood: The probability of observing E given that H holds, i.e., P(E | H).
- Normalization: The probability of E with H marginalized out, i.e., P(E). It ensures that the posterior is a valid probability.
This is equivalent to Bayes’ Theorem, which says:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
where A and B are events, P(A) is the probability of occurrence of A, and P(A|B) is the probability of A occurring given that event B has already occurred. P(B), the probability of event B occurring, cannot be 0 since B has already occurred. If you want to learn more about regular Naive Bayes and Bayes’ Theorem, you can follow this link.
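As a small worked example of Bayes’ Theorem, the sketch below computes P(A | B) from P(A), P(B | A) and P(B | not A). The helper name `bayes_posterior` and the numbers are hypothetical and chosen only for illustration; exact fractions are used so the arithmetic is easy to check by hand.

```python
from fractions import Fraction


def bayes_posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A | B) given P(A), P(B | A) and P(B | not A)."""
    # P(B) via the law of total probability; it must be non-zero.
    evidence_b = (likelihood_b_given_a * prior_a
                  + likelihood_b_given_not_a * (1 - prior_a))
    if evidence_b == 0:
        raise ValueError("P(B) is zero, so P(A | B) is undefined")
    return likelihood_b_given_a * prior_a / evidence_b


# Hypothetical inputs: P(A) = 1/100, P(B | A) = 9/10, P(B | not A) = 1/20.
posterior = bayes_posterior(Fraction(1, 100), Fraction(9, 10), Fraction(1, 20))
print(posterior, float(posterior))  # 2/13, roughly 0.154
```

Even with a high P(B | A), the posterior stays modest here because the prior P(A) is small.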
Looking at the formula above, we can see that, in contrast to Ordinary Least Squares (OLS), we have a posterior distribution for the model parameters which is proportional to the likelihood of the data multiplied by the prior probability of the parameters. As the number of data points increases, the likelihood carries more and more weight and eventually dominates the prior. In the case of an infinite number of data points, the values for the parameters converge to the values obtained from OLS. So, we begin our regression process with an initial estimate (the prior). As we start to cover more data points, our model becomes less wrong. As with any regression technique, more training data makes a Bayesian Ridge model more accurate; when data is scarce, the prior has a stronger influence on the result.
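The convergence towards OLS can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the scikit-learn algorithm: a single feature with no intercept, a zero-mean Gaussian prior on the weight with precision `lam`, and a fixed noise precision `alpha`. The function names and the synthetic data are hypothetical.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope for y = w * x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / sxx


def posterior_mean_slope(xs, ys, alpha=1.0, lam=10.0):
    """Posterior mean of w under the prior w ~ N(0, 1/lam), noise precision alpha."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return alpha * sxy / (lam + alpha * sxx)


rng = random.Random(0)
true_w = 2.0
ratios = {}
for n in (5, 50, 5000):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_w * x + rng.gauss(0.0, 1.0) for x in xs]
    ols = ols_slope(xs, ys)
    bayes = posterior_mean_slope(xs, ys)
    # A ratio near 1 means the prior no longer pulls the estimate away from OLS.
    ratios[n] = bayes / ols
    print(f"n={n:5d}  OLS={ols:.3f}  posterior mean={bayes:.3f}  ratio={ratios[n]:.3f}")
```

With few points, the prior shrinks the estimate strongly towards zero; with many points, the ratio approaches 1 and the two estimates nearly coincide.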
Now, let us have a brief overview of the mathematical side of things. In a linear model, if ‘y’ is the predicted value, then
$$y(w, x) = w_0 + w_1 x_1 + \dots + w_p x_p$$
where ‘w’ is the vector of weights (w0, w1, …, wp) and ‘x’ is the vector of input features (x1, …, xp).
So, to obtain a fully probabilistic model, the output ‘y’ is assumed to be Gaussian distributed around Xw, as shown below:
$$p(y \mid X, w, \alpha) = \mathcal{N}(y \mid Xw, \alpha^{-1})$$
where alpha is the precision (inverse variance) of the noise. It is treated as a random variable estimated from the data and is given a Gamma distribution prior. The mathematical expression on which Bayesian Ridge Regression works is the Gaussian prior over the weights:
$$p(w \mid \lambda) = \mathcal{N}(w \mid 0, \lambda^{-1} I_p)$$
where lambda is the precision of the weights, which is also estimated from the data under a Gamma distribution prior. Each of the two Gamma priors has its own shape parameter and inverse scale parameter.
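To make the model more concrete, here is a minimal NumPy sketch of the posterior over the weights. It assumes that alpha is the noise precision and lambda the weight precision, and that both are held fixed, whereas a Bayesian Ridge regressor estimates them from the data. Under those assumptions the posterior is Gaussian with covariance (lambda·I + alpha·XᵀX)⁻¹ and mean alpha·covariance·Xᵀy. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def weight_posterior(X, y, alpha, lam):
    """Gaussian posterior (mean, covariance) over w for fixed alpha and lam."""
    n_features = X.shape[1]
    # Posterior precision = prior precision + precision contributed by the data.
    precision = lam * np.eye(n_features) + alpha * X.T @ X
    cov = np.linalg.inv(precision)
    mean = alpha * cov @ X.T @ y
    return mean, cov


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -0.5])
alpha_true = 4.0  # assumed noise precision, i.e. noise std = 0.5
y = X @ true_w + rng.normal(scale=alpha_true ** -0.5, size=200)

mean, cov = weight_posterior(X, y, alpha=alpha_true, lam=1.0)
print("posterior mean:", mean)
print("posterior std :", np.sqrt(np.diag(cov)))
```

The posterior standard deviations quantify how uncertain each weight still is, which a single OLS point estimate does not provide.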
This is only a brief introduction to the mathematics behind a Bayesian Ridge Regressor. The goal of this article is to give you a high-level overview of Bayesian regression: when to use it, its advantages and disadvantages, and how to implement it. We will not go into the depth of how the mathematics works.
Advantages of Bayesian Regression:
- Very effective when the size of the dataset is small.
- Particularly well-suited for online learning (data is received in real time), as compared to batch learning, where we have the entire dataset on our hands before we start training the model. This is because Bayesian Regression doesn’t need to store data: the posterior obtained from the data seen so far serves as the prior for the next data points.
- The Bayesian approach is a tried and tested approach and is very robust, mathematically. So, one can use this without having any extra prior knowledge about the dataset.
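The online-learning advantage listed above can be illustrated with a small sketch. It assumes a one-feature model without intercept, a Gaussian prior on the weight, and a fixed noise precision `alpha`; the `update` helper and the data are hypothetical. Feeding the points one at a time, and keeping only the current posterior, gives the same result as processing the whole batch at once.

```python
def update(prior_mean, prior_precision, xs, ys, alpha=1.0):
    """Combine a Gaussian prior on w with new (x, y) pairs; return the posterior."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    post_precision = prior_precision + alpha * sxx
    post_mean = (prior_precision * prior_mean + alpha * sxy) / post_precision
    return post_mean, post_precision


xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

# Batch learning: all data at once, starting from the prior N(0, 1).
batch_mean, batch_precision = update(0.0, 1.0, xs, ys)

# Online learning: one point at a time; yesterday's posterior is today's prior.
seq_mean, seq_precision = 0.0, 1.0
for x, y in zip(xs, ys):
    seq_mean, seq_precision = update(seq_mean, seq_precision, [x], [y])

print(batch_mean, seq_mean)
```

Only two numbers (mean and precision) are carried between updates, so the earlier data points never need to be stored.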
Disadvantages of Bayesian Regression:
- The inference of the model can be time-consuming.
- If a large amount of data is available, the Bayesian approach is not worth it, and the regular frequentist approach does a more efficient job.
Implementation of Bayesian Regression Using Python:
In this example, we will perform Bayesian Ridge Regression. However, the Bayesian approach can be used with any regression technique like Linear Regression, Lasso Regression, etc. We will use the scikit-learn library to implement Bayesian Ridge Regression. We will use the Boston Housing dataset, which has information about the median value of a house in an area in Boston. You can learn more about this dataset here. For evaluation, we will use the r2 score. The best possible value of the r2 score is 1.0. If the model always predicts the mean of the target regardless of the attributes, the value of the r2 score is 0. The r2 score may also be negative for even worse models. To learn more about r2 scores, you can follow the link here.
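The three cases of the r2 score described above can be checked with a hand-rolled helper. The function `r2` below is a hypothetical illustration of the usual definition, 1 minus the ratio of the residual sum of squares to the total sum of squares; in practice scikit-learn's `r2_score` would be used instead.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2(y_true, y_true)                  # exact predictions
mean_only = r2(y_true, [6.0, 6.0, 6.0, 6.0])  # always predicts the mean of y_true
worse = r2(y_true, [9.0, 7.0, 5.0, 3.0])      # predictions in reversed order

print(perfect, mean_only, worse)  # 1.0 0.0 -3.0
```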
But before we get on to the code, you must understand the important parameters of a Bayesian Ridge Regressor:
- n_iter: Maximum number of iterations. Default value = 300. (In scikit-learn 1.3 and later, this parameter is named max_iter.)
- tol: Tolerance for the stopping criterion; the algorithm stops once the weights have converged to within this value. Default value = 1e-3.
- alpha_1: Shape parameter of the Gamma distribution prior over the alpha parameter (the precision of the noise). Default value = 1e-6.
- alpha_2: Inverse scale parameter for the Gamma distribution over the alpha parameter. Default value = 1e-6.
- lambda_1: Shape parameter of the Gamma distribution over the lambda parameter. Default value = 1e-6.
- lambda_2: Inverse scale parameter of the Gamma distribution over the lambda parameter. Default value = 1e-6.
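As an illustration of the parameters listed above, the sketch below passes the tolerance and the four Gamma hyper-parameters explicitly, using the default values from the list. The iteration-count argument is deliberately left at its default because its name differs between scikit-learn versions. The tiny dataset is made up for this example and roughly follows y = 2·x1 − x2.

```python
from sklearn.linear_model import BayesianRidge

# Hyper-parameters spelled out with the default values from the list above.
model = BayesianRidge(
    tol=1e-3,
    alpha_1=1e-6,
    alpha_2=1e-6,
    lambda_1=1e-6,
    lambda_2=1e-6,
)

# Made-up data, roughly y = 2 * x1 - x2 plus a little noise.
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0], [2.0, 3.0], [3.0, 2.0]]
y = [-1.1, 2.0, 0.9, 3.1, 1.0, 4.1]
model.fit(X, y)

print("weights       :", model.coef_)
print("fitted alpha_ :", model.alpha_)   # estimated noise precision
print("fitted lambda_:", model.lambda_)  # estimated weight precision
```

After fitting, `alpha_` and `lambda_` hold the values estimated from the data, while `alpha_1`, `alpha_2`, `lambda_1` and `lambda_2` only shape the priors placed on them.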
NOTE: This code may not work on an online IDE. Run it on Google Colab or on your local machine.
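A minimal sketch of the workflow described in this section follows. It is an assumption-laden stand-in rather than the article's original listing: the Boston Housing loader (`load_boston`) has been removed from recent scikit-learn releases, so this sketch substitutes the bundled diabetes dataset, and the test size and random seed are arbitrary choices. Its printed score will therefore not match the Boston figure quoted in this article.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Assumption: the diabetes dataset stands in for the Boston Housing data.
X, y = load_diabetes(return_X_y=True)

# Hold out part of the data for evaluation (split settings are arbitrary).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)

# Bayesian Ridge regressor with all default parameters.
model = BayesianRidge()
model.fit(X_train, y_train)

# return_std=True also returns the standard deviation of each prediction.
y_pred, y_std = model.predict(X_test, return_std=True)

score = r2_score(y_test, y_pred)
print(f"r2 Score Of Test Set : {score}")
```

The per-prediction standard deviations in `y_std` are what distinguish this model from a plain ridge regressor: each prediction comes with an estimate of its uncertainty.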
r2 Score Of Test Set : 0.7943355984883815
We get an r2 score of approximately 0.7943 on the test set using Bayesian Ridge Regressor with all default parameters. This is an acceptable score. However, you may alter the alpha and lambda parameters discussed above to obtain better results for your dataset.
So now that you know how Bayesian regressors work and when to use them, you should try using one the next time you want to perform a regression task, especially if the dataset is small.