We all know the first model we encounter when learning Machine Learning: Linear Regression. It is simple, intuitive, and stimulates us to go deeper down the Machine Learning rabbit hole.
Linear Regression can be interpreted intuitively from several points of view, e.g. geometry and statistics (the frequentist kind!). Where there is a frequentist point of view, there is usually a Bayesian counterpart. Hence, in this post, we will address the Bayesian point of view of Linear Regression.
Linear Regression: A Refresher
Recall that in Linear Regression, we want to map our inputs to real numbers, i.e. learn a function

$$f \colon \mathbb{R}^d \to \mathbb{R}, \qquad f(x) = W^T x$$
There are several types of Linear Regression, depending on the cost function and the regularizer. In this post, we will focus on Linear Regression with an $L_2$ regularizer.
Formally, the objective is as follows:

$$L = \sum_{i=1}^N (y_i - \hat{y}_i)^2 + \lambda \lVert W \rVert_2^2$$

where

$$\hat{y}_i = W^T x_i$$

which is a linear combination of the feature vector and the weights. The additional term $\lambda \lVert W \rVert_2^2$ is the $L_2$ regularizer, with $\lambda$ controlling its strength.

The idea is then to minimize this objective function with respect to $W$.
Of course, we could ignore the regularization term. What we end up with then is vanilla Linear Regression:

$$L = \sum_{i=1}^N (y_i - \hat{y}_i)^2$$

Minimizing this objective is the definition of the Linear Least Squares problem.
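As a concrete illustration (a sketch of mine, not part of the original derivation), here is a minimal NumPy snippet that solves both objectives in closed form; the toy data and the regularization strength `lam` are made up for the example.

```python
import numpy as np

# Toy data: N = 50 points, d = 3 features (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Vanilla Linear Regression (least squares): W = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# L2-regularized Linear Regression: W = (X^T X + lambda * I)^{-1} X^T y
lam = 1.0
w_reg = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(w_ols, w_reg)
```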
Frequentist view of Linear Regression
We could write the regression target of the above model as the predicted value plus some error:

$$y = W^T x + \epsilon$$

or equivalently, we could say that the error is:

$$\epsilon = y - W^T x$$

Now, let's say we model the regression target as a Gaussian random variable, i.e.

$$y \sim \mathcal{N}(W^T x, \sigma^2)$$

with mean $\mu = W^T x$ and variance $\sigma^2$.
Then, to find the optimal $W$, we do Maximum Likelihood Estimation (MLE) on this Gaussian likelihood.

The PDF of a Gaussian is given by:

$$\mathcal{N}(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{(y - \mu)^2}{2 \sigma^2} \right)$$

As we are doing maximization, we can ignore the normalizing constant of the likelihood. Hence:

$$P(y \mid x, W) \propto \exp \left( -\frac{(y - W^T x)^2}{2 \sigma^2} \right)$$

As always, it is easier to optimize the log likelihood:

$$\log P(y \mid x, W) = -\frac{1}{2 \sigma^2} (y - W^T x)^2 + \mathrm{const}$$
For simplicity, let's say $2 \sigma^2 = 1$. Then maximizing this log likelihood is the same as minimizing the squared error $(y - W^T x)^2$, which is exactly the (unregularized) Linear Regression objective above.
So we see that doing MLE with a Gaussian likelihood is equivalent to Linear Regression!
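To make the equivalence concrete, here is a small sketch (my own addition, with made-up data) that maximizes the Gaussian log likelihood numerically and compares the result to the closed-form least squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Negative Gaussian log likelihood over the dataset (constants dropped, 2*sigma^2 = 1).
def neg_log_lik(w):
    return np.sum((y - X @ w) ** 2)

w_mle = minimize(neg_log_lik, x0=np.zeros(3)).x    # maximize the likelihood
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]     # solve least squares directly

print(np.allclose(w_mle, w_lstsq, atol=1e-4))      # True: MLE coincides with least squares
```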
Bayesian view of Linear Regression
But what if we want to go Bayesian, i.e. introduce a prior and work with the posterior instead? Well, then we are doing MAP estimation! The posterior is proportional to the likelihood times the prior:

$$P(W \mid y, x) \propto P(y \mid x, W) \, P(W)$$
Since we already know the likelihood, we now ask: what should the prior be? If we set it to be uniform, we are back to MLE. So, for a non-trivial example, let's use a Gaussian prior for the weight $W$:

$$W \sim \mathcal{N}(0, \sigma_0^2)$$
Expanding the PDF, again ignoring the normalizing constant, and keeping in mind that the prior mean is zero:

$$P(W) \propto \exp \left( -\frac{\lVert W \rVert_2^2}{2 \sigma_0^2} \right)$$
Let's derive the posterior (recall that we set $2 \sigma^2 = 1$ in the likelihood):

$$P(W \mid y, x) \propto \exp \left( -(y - W^T x)^2 \right) \exp \left( -\frac{\lVert W \rVert_2^2}{2 \sigma_0^2} \right)$$

And the log posterior is then:

$$\log P(W \mid y, x) = -(y - W^T x)^2 - \frac{1}{2 \sigma_0^2} \lVert W \rVert_2^2 + \mathrm{const}$$
Seems familiar, right? Now, if we assume that $\lambda = \frac{1}{2 \sigma_0^2}$, the log posterior becomes

$$\log P(W \mid y, x) = -\left( (y - W^T x)^2 + \lambda \lVert W \rVert_2^2 \right) + \mathrm{const}$$

so maximizing it is the same as minimizing $(y - W^T x)^2 + \lambda \lVert W \rVert_2^2$.
That is, maximizing the log posterior of a Gaussian likelihood with a Gaussian prior is the same as minimizing the Ridge Regression objective! Hence, a Gaussian prior on the weights is equivalent to $L_2$ regularization!
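Again, purely as an illustration (my own sketch with made-up data): maximizing this log posterior numerically gives the same weights as the closed-form Ridge solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
lam = 1.0  # corresponds to a Gaussian prior with 2 * sigma_0^2 = 1 / lam

# Negative log posterior: squared error plus L2 penalty (constants dropped).
def neg_log_post(w):
    return np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)

w_map = minimize(neg_log_post, x0=np.zeros(3)).x
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(np.allclose(w_map, w_ridge, atol=1e-4))  # True: MAP coincides with Ridge
```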
Full Bayesian Approach
Of course, the above is not fully Bayesian, as we are doing point estimation in the form of MAP. This is just a “shortcut”, as we do not need to compute the full posterior distribution. For a full Bayesian approach, we report the full posterior distribution, and at test time we use that posterior to weight the predictions for new data, i.e. we compute the posterior predictive distribution by marginalizing over $W$:

$$P(y' \mid x', X, y) = \int P(y' \mid x', W) \, P(W \mid X, y) \, dW$$
that is, given the likelihood of our new data point $x'$ under a particular $W$, we weight it by the posterior probability of that $W$ and integrate over all possible values of $W$.

Intuitively, given all possible values for $W$, we make a prediction with each of them and weight each prediction by how probable that particular $W$ is under the posterior. In general, this marginalization is expensive and often intractable.
And of course, that is the reason why we use a shortcut in the form of MAP. For illustration, if each component of $W$ could take, say, 1000 distinct values and $W$ has $d$ components, the marginalization above would need to sum over $1000^d$ combinations of weights.
Of course, we could use approximate methods like Variational Bayes or MCMC, but they are still more costly than MAP. As MAP, like MLE, is guaranteed to find one of the modes (local maxima) of the posterior, it is often good enough.
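To get a feel for what this marginalization looks like, here is a toy sketch of mine (not from the original derivation): a brute-force grid approximation of the posterior predictive mean for a single weight. The data, noise variance, and prior are made up, and the last comment is the reason the same trick explodes in higher dimensions; see Murphy (2012) in the references for exact and approximate treatments.

```python
import numpy as np

# 1-D toy model: y = w * x + noise, so a grid over the single weight is feasible.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)

sigma2 = 0.25                      # assumed noise variance
w_grid = np.linspace(-5, 5, 1000)  # 1000 candidate values for the weight

# Unnormalized log posterior on the grid: Gaussian log likelihood + N(0, 1) log prior.
log_post = np.array([
    -0.5 * np.sum((y - w * x) ** 2) / sigma2 - 0.5 * w ** 2
    for w in w_grid
])
post = np.exp(log_post - log_post.max())
post /= post.sum()                 # normalize the grid weights

# Posterior predictive mean for a new input: weight each prediction by the posterior.
x_new = 1.5
pred_mean = np.sum(post * (w_grid * x_new))
print(pred_mean)                   # close to 2.0 * 1.5 = 3.0

# With d weights, this grid would have 1000**d points -- hence the MAP shortcut.
```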
Conclusion
In this post, we looked at Linear Regression from several different points of view.
First, we saw the definition of Linear Regression from a plain Machine Learning point of view, then from frequentist statistics, and finally from Bayesian statistics.
We noted that the Bayesian version of Linear Regression using MAP estimation is not a full Bayesian approach, since MAP is just a shortcut.
We then noted why the full Bayesian approach is difficult and often intractable, even for this simple regression model.
References
- Murphy, Kevin P. *Machine Learning: A Probabilistic Perspective*. MIT Press, 2012.