This particular cost function

is a convex quadratic function, and it has only a single global minimum,

as the animation coming up will show you.
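To make that concrete, here's a minimal sketch of the squared-error cost for linear regression (the function name and the ones-column convention in `X` are my own choices, not anything specific from the lecture):

```python
import numpy as np

def cost(theta, X, y):
    """Squared-error cost J(theta) for linear regression.

    X is an (m, n+1) design matrix whose first column is all ones,
    y is the (m,) vector of targets. This J is a convex quadratic
    in theta, so it has exactly one global minimum.
    """
    m = len(y)
    residuals = X @ theta - y          # h_theta(x) - y for every sample
    return (residuals @ residuals) / (2 * m)
```

Because it's a sum of squares of terms linear in theta, there's no local minimum to get trapped in.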

If you pick some other function,

it may not have a single global minimum, and you need to check for this manually.

This one here has a single one. Here's how Andrew described it.

He said, imagine you're up on the side of a hill,

the side of a mountain, and you start

up here with your initial guess. As the algorithm iterates,

you work your way down, and then hopefully

you reach the final lowest value of your cost function,

which is the global minimum.

But if you pick other functions, they may have

some nonlinearity in there, and if you start out here,

you can get stuck in there and never get

out of it, so the algorithm will converge here instead of converging down here.

I had to write my own convergence code.

Andrew's notes didn't tell us how to do that.

So, I had to figure that out on my own, and I'll share it with you here.
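Here's the kind of convergence test I mean, as a minimal sketch; the threshold `tol` and the idea of comparing successive cost values are my own choices, one reasonable way to do it rather than the only way:

```python
def has_converged(J_prev, J_curr, tol=1e-6):
    """User-defined convergence test: stop when the cost barely moves.

    Declares convergence when the absolute change in the cost function
    between two successive iterations drops below `tol`. The threshold
    is a judgment call: tighten it for more precision, loosen it for speed.
    """
    return abs(J_prev - J_curr) < tol
```

You call this after every iteration with the previous and current values of J, and stop looping once it returns True.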

So, here's our cost function, and here we're

up on the side of the hill. As this algorithm iterates,

we start up here, and because of its convex shape,

it has just a single global minimum and no other local minima.

So, as the algorithm runs,

you see that your error function eventually gets down to

this very small amount, and then you stop, and you know you're done.

So, as I said, the test for convergence is

a user-defined function, and the downside to batch gradient descent is that

every theta update needs to run through all m training samples.

So, if your dataset is enormous,

it's computationally expensive, because on every pass through,

you might have 10 million training samples.

So, you've got to cruise through 10 million training samples to update every single theta.

So, it's computationally expensive.
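A minimal batch gradient descent sketch makes that cost visible (the names, the fixed learning rate `alpha`, and the stopping test are my own assumptions); notice that every single theta update touches all m rows:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, tol=1e-12, max_iters=20000):
    """Batch gradient descent for linear regression.

    Every theta update uses ALL m training samples, which is what
    makes this expensive on huge datasets.
    """
    m, n = X.shape
    theta = np.zeros(n)
    J_prev = np.inf
    for _ in range(max_iters):
        residuals = X @ theta - y            # full pass over all m samples
        theta -= alpha * (X.T @ residuals) / m
        J = (residuals @ residuals) / (2 * m)
        if abs(J_prev - J) < tol:            # user-defined convergence test
            break
        J_prev = J
    return theta
```

With m in the millions, that `X @ theta` on every iteration is the whole problem.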

Is there a better way to do it?

Yes. Second method.

It's called stochastic gradient descent.

For each training example,

so we just go through the training data one time and update our theta values as we go.

It's faster for large training sets but potentially less

accurate and we'll see that that is in fact the case.

So, we go through all the math that I'm not going to do here

for you, we end up with this.

For i equals 1 to m,

the new theta_j is equal to the current theta_j plus the learning

rate times the quantity target value minus the hypothesis,

times x_j, for j equals 0 to n. Don't worry about the exact iteration bounds;

it could be i equals 0 to m minus 1 or i equals 1 to m, either way.

Point is we're sampling each example from our training data

once and making a calculation computing our theta values.
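Here's a minimal stochastic version under the same assumptions (the names are mine; I also allow multiple epochs, a common variant, even though a single pass is what's described above). Each theta update uses just one sample:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=50, seed=0):
    """Stochastic gradient descent: one sample per theta update.

    For i = 1..m:  theta_j := theta_j + alpha * (y_i - h(x_i)) * x_ij
    for every j = 0..n. Much cheaper per update on large m, but the
    path to the minimum is noisier, hence potentially less accurate.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):         # visit each sample once per epoch
            error = y[i] - X[i] @ theta      # target value minus hypothesis
            theta += alpha * error * X[i]    # update every theta_j at once
    return theta
```

The trade-off is exactly as stated above: each update is O(n) instead of O(mn), at the price of a noisier trajectory.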

The third method: if you had told me it was possible to do this,

I would have said no way.

Shows you what I know; I'm not a mathematician, I'm a chip designer.

The third method is pretty stunning.

Is it possible to directly calculate the theta_j's?

The answer, it turns out,

is yes: a closed-form equation with no iterations at all. In this method,

we minimize J of theta by explicitly taking

the derivatives with respect to all of the individual thetas and setting them to zero.

Then Andrew did this wild, crazy linear algebra on his chalkboard at

Stanford, a couple of whiteboards full

of these manipulations, and I was thinking, these are just arrays of numbers,

how can you take a derivative of those? Isn't the derivative of a constant zero?

How can this possibly work? Sure enough, it does.
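For completeness, here's that closed-form solve as a sketch. Setting the derivatives to zero yields the normal equation, X^T X theta = X^T y; I use numpy's least-squares solver, which solves that system without forming an explicit inverse (a standard, numerically safer choice, and my own implementation decision rather than the whiteboard derivation itself):

```python
import numpy as np

def normal_equation(X, y):
    """Directly calculate theta: no learning rate, no iterations.

    Solves the least-squares problem min_theta ||X theta - y||^2,
    whose solution satisfies X^T X theta = X^T y. lstsq is more
    robust than explicitly inverting X^T X.
    """
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta
```

One function call replaces the entire iterative loop, which is exactly why the result is so stunning.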