In this lecture, we are going to talk about the EM algorithm. Compared to the previous lecture, the target audience here is advanced learners with some background in probability and convex optimization. We will view the EM algorithm as the maximization of a lower bound on an objective function. I will review some background and then describe the iterative steps of the algorithm.

Let's first learn an important inequality that is fundamental to the algorithm: the main idea of the EM algorithm can be explained using Jensen's inequality. Consider a convex function, such as the one in this 2D graph, and take two points, x1 and x2, in the domain. The midpoint is the mean of the two points. Now we are going to compare two values: first, the function value at the mean of the two points, which lies on the function curve; second, the mean of the two function values, which lies on the straight line connecting f(x1) and f(x2). You may notice that the straight line lies above the function curve, so the mean of f(x1) and f(x2) is always at least as large as the function value at the mean. We can generalize this idea: the inequality holds for any weighted mean of the two points. In fact, it applies to any positive weights and any number of points, even in higher dimensions, as long as the function is convex. If the function is concave, we only need to flip the inequality. The type of function we are going to handle is the log function, which is concave, so we can use Jensen's inequality to obtain a lower bound.

Now we'll introduce a latent variable, which we can never observe directly. The interpretation of the latent variable becomes clear in specific applications, such as the estimation of GMM parameters from the previous lecture. What we are going to do with this latent variable is take the marginal probability over it. As before, we consider the log of the probability. Using Jensen's inequality, we can provide a lower bound on the log-likelihood as shown.
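The lower bound on the slide can be sketched as follows. This is the standard derivation rather than a reproduction of the lecture's own notation: here x is the observed data, z the latent variable, theta the parameters, and q(z) any distribution over z (all of these symbol choices are assumptions):

```latex
\log p(x \mid \theta)
  = \log \sum_{z} p(x, z \mid \theta)
  = \log \sum_{z} q(z)\,\frac{p(x, z \mid \theta)}{q(z)}
  \;\ge\; \sum_{z} q(z) \log \frac{p(x, z \mid \theta)}{q(z)}
```

The middle expression is the log of an expectation under q, so Jensen's inequality for the concave log gives the bound. Equality holds when q(z) = p(z | x, theta), which is exactly the distribution computed in the E-step.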
As we have seen, we cannot directly maximize the log of a sum of probabilities, so instead we will maximize a lower bound of the function, which we can compute. Finding a lower bound amounts to finding a distribution q over z, the latent variable. The lower bound turns out to be a function of the parameter given some old parameter value.

Now, let's see how this applies in practice. We have an initial guess for the parameter, theta nought, and we can obtain a lower bound g by finding some q using that guess. As I mentioned, the log-likelihood can be seen as a function of the parameters. Suppose f and g are visualized as functions of theta on this graph. g does not move exactly like f, but they are locally similar in the neighborhood of theta nought. Remember, we want to find a maximum likelihood estimate of the parameter theta. Given the current lower bound g, we can find a better parameter, denoted theta star. This is still not a maximum of f, but it is improved over the previous estimate. Since we now have a better guess of the parameter, we can compute an improved lower bound g, which is shown in the plot. Again, the new g is not quite the same as f, but it locally agrees with f around theta one, and we can compute an even better theta based on the better g. If we repeat this process, the parameter eventually converges to a local maximum.

To summarize: in the E-step, we compute a lower bound of the log-likelihood using the given set of parameters. Again, this means that we obtain the distribution of the latent variable based on the given parameter theta hat. In the M-step, we obtain the parameters that maximize the current lower bound. This process is repeated until the update becomes smaller than a threshold.

We have learned the EM algorithm for finding maximum likelihood solutions to complex problems. The estimation of GMM parameters from the previous lecture was one application of this algorithm. You will see another application of EM when we talk about localization in week four.
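The E-step/M-step loop above can be sketched for the GMM example from the previous lecture. This is a minimal illustrative sketch, not the lecture's own code: the two-component 1D mixture, the initialization, and the names `em_gmm_1d` and `log_likelihood` are all assumptions made for this demonstration.

```python
import numpy as np

def log_likelihood(x, pi, mu, var):
    # log p(x | theta) = sum_i log sum_k pi_k * N(x_i | mu_k, var_k)
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.sum(np.log(dens.sum(axis=1)))

def em_gmm_1d(x, n_iter=30):
    # Initial guess theta_0 (an assumption): equal weights, means at the
    # data extremes, pooled variance for both components.
    pi = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([np.var(x), np.var(x)])
    history = []
    for _ in range(n_iter):
        # E-step: responsibilities q(z_i = k) = p(z_i = k | x_i, theta_hat).
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: maximize the lower bound; for a GMM this has
        # closed-form responsibility-weighted updates.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        # Record the log-likelihood; EM guarantees it never decreases.
        history.append(log_likelihood(x, pi, mu, var))
    return pi, mu, var, history
```

Running this on data drawn from two well-separated Gaussians, the recorded log-likelihood values form a non-decreasing sequence, which is the convergence behavior described above.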