In this lecture, we are going to learn how to compute an estimate of the Gaussian model parameters from observed data. Remember that a Gaussian model has two parameters, the mean and the variance.

We are going to use the term likelihood often throughout the course, so let's talk about its definition. The likelihood is the probability of an observation given the model parameters. The subscript i indicates one particular observation xi among multiple observations of x. The important thing I want to point out is that we have the data; what is to be determined are the parameters. In our case, we are using a Gaussian model, so the parameters are mu and sigma. We are interested in obtaining the parameters of our model that maximize the likelihood of a given set of observations. If we express what I just said mathematically, we can write it like this, using the hat accent to indicate an estimate of mu and sigma.

The likelihood function we are going to maximize is the joint probability of all the data, which can be intractable if each instance of x depends on the other observations. However, if we assume that each observation xi is independent of the others, the joint likelihood can be expressed simply as a product of the individual likelihoods. If the concept of independence does not sound familiar, please review the supplementary material on basic probability, which you can download.

With this notation, we now look to compute the maximum likelihood estimates of mu and sigma. Fortunately, there is an analytic solution for Gaussians, which is another reason to choose the Gaussian distribution. The full derivation of the solution can be found in the supplementary material, but here are a few key points. For the estimate, we are going to apply properties of the logarithmic function. Let me visualize a log function for you: it is a monotonically increasing function, so if some value x-star maximizes a function, then it also maximizes the log of that function. Using this property, instead of maximizing the likelihood, we can find the parameters that maximize the log likelihood. The objective functions are different, but the arguments that maximize them are the same, and it turns out that maximizing the log likelihood is simpler in many cases. Another property of the log function we are going to use is that the log of a product equals the sum of the logs, so we can write the problem as finding the mu and sigma that maximize the sum of the log likelihoods of the individual measurements.

The next thing I want to remind you of is that we are dealing with a Gaussian, which has this specific form. Using the properties of the logarithmic function, we can write the log likelihood exactly like this. With the expanded notation of the log likelihood, we can rewrite the problem like this. We are going to simplify it further. We can ignore the constant term, the log of the square root of 2 pi, because it does not vary with the parameters and therefore does not affect the solution. We can also change the formula into a minimization problem by changing max to min and negating all the terms. Why switch to minimization? The two problems are equivalent, but writing an optimization problem as a minimization is the standard form, and we follow that to be consistent in notation. Let's call this whole expression J; this is a common symbol for a cost function that we want to minimize.
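For reference, here is one way to write out the steps just described. The notation is my own (N for the number of observations); the slides may use different symbols, but this is the standard Gaussian log-likelihood derivation the lecture is walking through.

```latex
% Maximum likelihood estimation for a 1-D Gaussian, assuming N i.i.d. observations x_1, ..., x_N.
\hat{\mu}, \hat{\sigma}
  = \arg\max_{\mu,\sigma} \, p(x_1,\dots,x_N \mid \mu,\sigma)
  = \arg\max_{\mu,\sigma} \prod_{i=1}^{N} p(x_i \mid \mu,\sigma)
  = \arg\max_{\mu,\sigma} \sum_{i=1}^{N} \log p(x_i \mid \mu,\sigma)

% Gaussian density and its logarithm:
p(x_i \mid \mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right),
\qquad
\log p(x_i \mid \mu,\sigma) = -\log\sqrt{2\pi} - \log\sigma - \frac{(x_i-\mu)^2}{2\sigma^2}

% Dropping the constant -\log\sqrt{2\pi} and negating gives the cost function J to minimize:
J(\mu,\sigma) = \sum_{i=1}^{N} \left( \log\sigma + \frac{(x_i-\mu)^2}{2\sigma^2} \right),
\qquad
\hat{\mu},\hat{\sigma} = \arg\min_{\mu,\sigma} J(\mu,\sigma)
```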
If we apply the optimality condition for convex optimization, the first-order derivative of J with respect to mu should be zero. From this, we can compute the maximum likelihood estimate of mu, which we write as mu-hat. We then apply the same optimality condition to compute the estimate of sigma, and for this we can use the value of mu-hat in place of mu. The final solution is relatively simple: mu-hat is exactly the sample mean, the average of the data, which is a natural estimate, and sigma-hat squared is just the sample variance. You will find the derivation of the solution in the supplementary materials; try to understand the principles behind what we have obtained here. Now that we have seen how to compute the maximum likelihood estimates of the parameters, let's get back to our ball color distribution. Using these results, the maximum likelihood estimate of the ball color model is computed like this. Based on what we have learned so far, we will start talking about using Gaussian models for more than one feature in our next lecture.
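To make the closed-form result at the end of this lecture concrete, here is a minimal sketch in Python. The data values are made up for illustration (for example, a single "redness" feature measured for each ball); the point is only that mu-hat is the average of the data and sigma-hat squared is the average squared deviation from mu-hat.

```python
import numpy as np

# Hypothetical 1-D observations, e.g. a "redness" value measured for each ball.
x = np.array([0.61, 0.58, 0.65, 0.70, 0.55, 0.63])
N = len(x)

mu_hat = x.sum() / N                          # MLE of mu: the sample mean
sigma2_hat = ((x - mu_hat) ** 2).sum() / N    # MLE of sigma^2: average squared deviation (divide by N)

print(f"mu_hat = {mu_hat:.4f}, sigma_hat = {np.sqrt(sigma2_hat):.4f}")

# Note: the MLE divides by N, not N - 1, so it matches np.var(x) with the default
# ddof=0, whereas np.var(x, ddof=1) would give the unbiased sample variance instead.
```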