The cost function for Linear Regression is the Mean Squared Error:

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

$x^{(i)}$ is data point $i$ in the training dataset, and $h_\theta$ is a linear function of the weights and the data input:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$
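As a quick sketch of the hypothesis function (the code and variable names here are my own, not from any course), $h_\theta$ is just a dot product once we prepend a 1 to the input to carry the bias term $\theta_0$:

```python
import numpy as np

def h(theta, x):
    """Linear hypothesis: theta_0 + theta_1*x_1 + ... + theta_n*x_n.

    theta has n+1 entries; x has n features, so we prepend a 1 for the bias.
    """
    return np.dot(theta, np.concatenate(([1.0], x)))

theta = np.array([1.0, 2.0, 3.0])   # theta_0, theta_1, theta_2
x = np.array([4.0, 5.0])            # two features
print(h(theta, x))                  # 1 + 2*4 + 3*5 = 24.0
```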
To find the best weights that minimize the error, we use Gradient Descent to update the weights. If you have been following machine learning courses, e.g. the Machine Learning course on Coursera by Andrew Ng, you will have learned that to update the weights, you repeat the process below until it converges:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

for $j = 0 \dots n$ ($n$ features)

In Andrew Ng's course, it is also expanded to:

$$\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$
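To make the update rule concrete, here is a minimal batch gradient-descent loop in NumPy. This is a sketch under my own assumptions: the data, learning rate, and iteration count are made up for illustration, and all weights are updated simultaneously as the rule requires.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for linear regression with the MSE cost.

    X: (m, n) feature matrix; a column of ones is prepended for theta_0.
    """
    m = len(y)
    Xb = np.hstack([np.ones((m, 1)), X])      # bias column for theta_0
    theta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        error = Xb @ theta - y                # h_theta(x^(i)) - y^(i)
        theta -= alpha * (Xb.T @ error) / m   # simultaneous update for all j
    return theta

# Toy data generated from y = 1 + 2*x, so theta should approach [1, 2].
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(gradient_descent(X, y))
```

Note that `Xb.T @ error / m` computes all the partial derivatives $\frac{1}{m}\sum_i (h_\theta(x^{(i)}) - y^{(i)})\,x_j^{(i)}$ at once, which is exactly the expanded rule above in vectorized form.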
…

However, when I first studied the course a couple of years ago, I got stuck for a little while trying to figure out where exactly that came from. I wish someone had given me a more concrete expansion so I could have figured it out faster. Let me do that here so you can examine the detailed breakdown and get through this stage quickly. That's the whole objective of this blog.

Let's make this less abstract by putting down exact data points with small sample and feature sizes. Say we have a dataset with 4 features, and for simplicity we select only 3 samples from it: samples $x^{(1)}, x^{(2)}, x^{(3)}$, each with features $x_1^{(i)}, x_2^{(i)}, x_3^{(i)}, x_4^{(i)}$ and target $y^{(i)}$.
The prediction for each sample $i$ is:

$$h_\theta(x^{(i)}) = \theta_0 + \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)} + \theta_3 x_3^{(i)} + \theta_4 x_4^{(i)}$$
In this case the cost function would be:

$$J(\theta) = \frac{1}{2 \cdot 3}\sum_{i=1}^{3}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
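As a sanity check, this cost can be evaluated directly. The sketch below uses made-up values for the 3 samples, since the original example numbers aren't shown here:

```python
import numpy as np

def cost(theta, X, y):
    """MSE cost J(theta) = 1/(2m) * sum((h - y)^2), with a bias column added."""
    m = len(y)
    Xb = np.hstack([np.ones((m, 1)), X])
    residuals = Xb @ theta - y
    return np.sum(residuals ** 2) / (2 * m)

# 3 samples, 4 features each (illustrative values only).
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 3.0, 4.0, 5.0],
              [3.0, 4.0, 5.0, 6.0]])
y = np.array([10.0, 14.0, 18.0])
theta = np.zeros(5)                 # theta_0 .. theta_4, all zero
print(cost(theta, X, y))            # 1/(2*3) * (100 + 196 + 324) = 103.333...
```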
To minimize the cost, we need to find the partial derivative with respect to each $\theta_j$. The method to update the weights is

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
Let's find the derivative with respect to $\theta_0$ here, step by step. From the cost function above, applying the Chain Rule (and noting that $\frac{\partial}{\partial \theta_0} h_\theta(x^{(i)}) = 1$), we have:

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{2 \cdot 3}\sum_{i=1}^{3} 2\left(h_\theta(x^{(i)}) - y^{(i)}\right) \cdot 1 = \frac{1}{3}\sum_{i=1}^{3}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$
Do the same for the rest of the $\theta_j$, where $\frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) = x_j^{(i)}$:

$\theta_1$:

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{3}\sum_{i=1}^{3}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_1^{(i)}$$

$\theta_2$:

$$\frac{\partial J}{\partial \theta_2} = \frac{1}{3}\sum_{i=1}^{3}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_2^{(i)}$$

$\theta_3$:

$$\frac{\partial J}{\partial \theta_3} = \frac{1}{3}\sum_{i=1}^{3}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_3^{(i)}$$

$\theta_4$:

$$\frac{\partial J}{\partial \theta_4} = \frac{1}{3}\sum_{i=1}^{3}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_4^{(i)}$$
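One way to convince yourself the derivation is right is to compare the analytic gradient with a finite-difference approximation of the cost. This sketch uses random made-up data (3 samples, 4 features plus bias, matching the example sizes above):

```python
import numpy as np

def cost(theta, Xb, y):
    """J(theta) = 1/(2m) * sum((h - y)^2); Xb already includes the bias column."""
    m = len(y)
    return np.sum((Xb @ theta - y) ** 2) / (2 * m)

def analytic_grad(theta, Xb, y):
    """dJ/dtheta_j = 1/m * sum((h - y) * x_j), as derived above."""
    m = len(y)
    return Xb.T @ (Xb @ theta - y) / m

def numeric_grad(theta, Xb, y, eps=1e-6):
    """Central finite differences, nudging one theta_j at a time."""
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        g[j] = (cost(theta + step, Xb, y) - cost(theta - step, Xb, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
Xb = np.hstack([np.ones((3, 1)), rng.normal(size=(3, 4))])  # bias + 4 features
y = rng.normal(size=3)
theta = rng.normal(size=5)
print(np.max(np.abs(analytic_grad(theta, Xb, y) - numeric_grad(theta, Xb, y))))
```

The two gradients should agree to within the finite-difference error, which confirms the expansion term by term.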

If the functions and notation in the courses or other documentation were unclear to you at first, I hope the expansion above helps you figure them out.
