Hi, this video is about the relationship between directed acyclic graphs, or DAGs, and probability distributions. We have two main objectives. One is to understand how DAGs encode information about probability distributions. The other is that we would like to be able to decompose a joint distribution from a DAG.

So to begin, we'll think about DAGs and probability distributions. In general, DAGs encode assumptions about variables, or nodes, and the relationships among them. So a DAG will tell us which variables are independent from each other. It will also tell us which variables are conditionally independent from each other. And it will also tell us ways that we can factor and simplify a joint distribution.

So here is an example. Here we have four nodes, or four sets of variables: D, A, B, and C. We see it's a proper directed acyclic graph, since it has only directed edges and no cycles. What we're going to take a look at now is how this particular DAG encodes information about the joint distribution of A, B, C, and D. You can think of A, B, C, and D as random variables, or sets of random variables, and so they must have a joint distribution. This DAG is encoding assumptions about that joint distribution.

So what does this DAG actually tell us? First, and I'm just going to go through some examples, this DAG implies that the conditional distribution of C given everything else is just equal to the probability of C. In other words, C is independent of all of the other variables. The reason we know that is because C is just sitting there off by itself. So C is a random variable that has some distribution, but nothing is affecting it and it's not affecting anything else. It's just by itself; it's independent. Therefore, the conditional distribution of C given A, B, and D is just equal to the probability of C. Conditioning on A, B, and D doesn't give us any information about the probability of C; it's totally off by itself and independent. So you'll see that, just from this DAG, we were able to identify that C is independent of the other variables.

Next we can look at B, and in fact we'll look at the conditional distribution of B given the other variables. B is affected directly by A, and it's affected indirectly by D, because D affects A, which then affects B. So in this case, if we condition on everything else, C, D, and A, all that really matters is A, because A is directly affecting B. As long as I know A, I know everything there is to know about the distribution of B. D is not telling us anything additional, because D affects A, but I'm conditioning on A, so I already know A. And of course, C is off by itself; C doesn't affect B in any way. So we can simplify this conditional distribution: the probability of B given A, C, and D is just the probability of B given A. Another way to write that is using this notation here: B is independent of D and C, conditional on A. So we're able to learn that just by looking at this graph.

Here's another example, where suppose you were interested in the distribution of B given D. What I write here is that the probability of B given D is not equal to the probability of B. Another way you could think of that is that B and D are marginally dependent. And this is because D affects A, which affects B. So D, therefore, has to be related to B. They're not independent; they're not marginally independent. D and B are going to have to be associated with each other. So this is another thing we can learn from the graph.
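One way to make those claims concrete is to simulate from a model that is consistent with this DAG and check the implied independencies in the simulated data. Here is a minimal sketch, not from the video; the specific probabilities below are arbitrary hypothetical choices, and any structural model whose arrows match D → A → B, with C on its own, would do.

```python
import numpy as np

# Simulate from a (hypothetical) model consistent with the DAG
#   D -> A -> B,  with C on its own.
rng = np.random.default_rng(0)
n = 500_000

D = rng.binomial(1, 0.5, n)
A = rng.binomial(1, np.where(D == 1, 0.8, 0.3))  # A depends only on D
B = rng.binomial(1, np.where(A == 1, 0.7, 0.2))  # B depends only on A
C = rng.binomial(1, 0.4, n)                      # C depends on nothing

# B is independent of D given A:  P(B=1 | A=1, D=0) ~= P(B=1 | A=1, D=1)
for d in (0, 1):
    print(f"P(B=1 | A=1, D={d}) ~ {B[(A == 1) & (D == d)].mean():.3f}")

# B and D are marginally dependent:  P(B=1 | D=0) != P(B=1 | D=1)
for d in (0, 1):
    print(f"P(B=1 | D={d})      ~ {B[D == d].mean():.3f}")

# C is independent of everything:  P(C=1 | D=d) ~= P(C=1)
print(f"P(C=1)              ~ {C.mean():.3f}")
for d in (0, 1):
    print(f"P(C=1 | D={d})      ~ {C[D == d].mean():.3f}")
```

Up to simulation noise, the first pair of numbers should agree, the second pair should differ, and the last three should all be about the same, matching what the DAG told us before we simulated anything.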
Now here's another example, where we're looking at the distribution of D conditional on everything else. We can simplify that as just the probability of D given A, for the same kind of reason. B is also related to D, but only through A: D affects A, which affects B. But as long as we condition on A, that's all we'll actually need to know about the distribution of D. And again, C is sitting off there by itself. So these are a few examples of ways that you can learn about probability distributions from a DAG.

I imagine that some of these ideas are still a little fuzzy, so we'll look at another example. So here's another example of a DAG. What does this DAG imply? One example here is that if you look at the probability of A given everything else, we can simplify that as the probability of A given D. And the way you can basically tell that is because the only thing that's affecting A is D. So as long as we condition on D, we'll basically know everything there is to know about A.

Next we can think about the probability of D given the other variables. In this case we can drop C from the conditioning, but we still need to condition on A and B. What that implies is that D is independent of C given B. But why is that the case? Well, first of all, why can we drop C from this conditional probability? Once we condition on B, there's no additional information in C about D. So C could potentially tell us something indirectly about D if we didn't condition on B. But as long as we're conditioning on B, C doesn't give us any useful information, because C is basically sitting off by itself, except that it's affected by B. Well, we've already conditioned on B, so C is not going to tell us anything new. But we do need to condition on both A and B; in other words, we can't drop A or B, because D affects both A and B. So both A and B should independently be telling us something about D.

Next we'll consider another example. This should look just like the previous DAG, except we've added an arrow from A to C. So what does this imply? Now, if we look at the probability of A given everything else, we can drop B from that conditioning, but nothing else. So A is independent of B, conditional on C and D. If you look at A, D is directly affecting it, so we're certainly going to have to keep conditioning on that. A is directly affecting C, so we're going to have to keep conditioning on that as well, because A and C are clearly related to each other. But B is only related to A indirectly, through other variables that we're already conditioning on. So we're learning enough through C and D that we can drop the conditioning on B. We have the same kind of idea if we look at D, where really all that matters is conditioning on A and B. C is going to give us no new information about the distribution of D, because the only paths from D to C are through either A or B, and we're already controlling for, or conditioning on, those. So D is conditionally independent of C given A and B.

Next, we will look at the decomposition of a joint distribution. Given a DAG, we can actually use it to decompose a joint distribution by sequentially conditioning only on sets of parents. To do that, we'll start with roots, that is, nodes with no parents. Then we'll proceed from the roots down the descendant lines, always conditioning on the parents. Next, we'll look at an example to make that clear.
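That rule is mechanical enough to write down as a short routine. Here is a minimal sketch, not from the video: the dictionary encoding of a DAG as node-to-parents and the helper factorization below are my own hypothetical constructions, just to show the idea of conditioning each node on its parents in a root-first order.

```python
def factorization(parents):
    """Given a DAG as {node: [parents]}, return the factorization implied by
    conditioning each node on its parents, visiting parents before children."""
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        for p in parents[node]:   # make sure every parent appears first
            visit(p)
        seen.add(node)
        order.append(node)

    for node in parents:
        visit(node)

    factors = []
    for node in order:
        pa = parents[node]
        factors.append(f"P({node}|{','.join(pa)})" if pa else f"P({node})")
    return " ".join(factors)

# The DAG from the second set of examples above (reading the structure off
# the narration): D -> A, D -> B, B -> C.
print(factorization({"A": ["D"], "B": ["D"], "C": ["B"], "D": []}))
# prints: P(D) P(A|D) P(B|D) P(C|B)
```

The printed product is the same decomposition that gets worked out by hand for that graph a little later.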
So in this example, there are two roots, C and D. Remember, a root is a node that has no parents. C and D have no parents, so we'll begin with those. Any time you want to decompose a joint distribution, and the joint distribution in this case is the probability of A, B, C, and D, in the way described on the previous slide, you start with the roots. The roots are just independently varying variables, so that's where we'll start the decomposition: we just have the probability of C times the probability of D. Next, what we'll do is look for any children of these roots. C doesn't have any children, so we're actually done with C. But D does have a child, which is A. So then we'll multiply by the probability of A given D. Next we'll look for any children of A; there's only one, which is B. So we multiply by the probability of B given A. So what this DAG is telling us is that we can write the joint distribution of A, B, C, and D as this particular product: the probability of C, times the probability of D, times the probability of A given D, times the probability of B given A. This decomposition of the joint distribution is something that is implied by the DAG. Of course, this decomposition does not hold in general, but it does hold if your DAG is correct. So if this DAG is correct, this is a proper decomposition of the joint distribution.

To make it more clear, we'll work through another example. You'll notice here that there's only one root, only one variable that has no parents, and that's D. So we'll begin there, with the probability of D, which is just its own probability distribution, off by itself. Next we'll look for its children. It happens to have two children, A and B. So then we'll multiply by the probability of A given D and the probability of B given D. Next we'll look for children of A and B. A does not have any children, so we're actually done with that node, or that variable. But B does have one child, and that's C. So we multiply by the probability of C given B. So again, based on this DAG, we can decompose the joint distribution of A, B, C, and D as the probability of D, times the probability of A given D, times the probability of B given D, times the probability of C given B. So we're able to learn that from the diagram itself.

Finally, we'll look at the other decomposition, and this is the one where we've now added an arrow from A to C. Again, there's only one root, and that's D, so we have the probability of D. Again, D has two children, A and B, so those two terms are the same as on the previous slide. What's different now is that A and B are both parents of C. So now, when we look at the probability of C, we need to condition on both of its parents, A and B. That's how adding that arrow affected our decomposition of the joint distribution. On the previous slide, there was no arrow from A to C, so there we didn't have to condition on A. But now C has two parents, A and B, and we need to condition on both of them. So we can see that the DAG is directly telling us how we can decompose a joint distribution.

So there is this kind of compatibility between DAGs and distributions. If we return to this particular DAG, we see that it admits a particular factorization. So we could think of this particular probability decomposition here as being compatible with this particular DAG. If I started with this particular decomposition of the joint distribution, and then you showed me the DAG here, I would say that those are compatible: that decomposition of the joint distribution is compatible with that particular graph.
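To make the "holds if the DAG is correct, but not in general" point concrete, here is a small exact sketch, again not from the video. It builds two joint distributions over four binary variables: one generated from hypothetical conditional probability tables arranged according to the DAG D → A, D → B, B → C, and one arbitrary joint. It then checks the decomposition P(D) P(A|D) P(B|D) P(C|B) against each, with every factor computed by marginalizing the joint itself.

```python
from itertools import product
import numpy as np

def marginal(joint, keep):
    """Sum the joint down to the variables at the index positions in `keep`."""
    out = {}
    for vals, p in joint.items():
        key = tuple(vals[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def decomposition_error(joint):
    """Largest |P(a,b,c,d) - P(d) P(a|d) P(b|d) P(c|b)| over all 16 cells.
    Keys of `joint` are tuples ordered (a, b, c, d)."""
    pD, pB = marginal(joint, [3]), marginal(joint, [1])
    pAD, pBD, pCB = marginal(joint, [0, 3]), marginal(joint, [1, 3]), marginal(joint, [2, 1])
    return max(abs(p - pD[(d,)]
                     * (pAD[(a, d)] / pD[(d,)])    # P(a | d)
                     * (pBD[(b, d)] / pD[(d,)])    # P(b | d)
                     * (pCB[(c, b)] / pB[(b,)]))   # P(c | b)
               for (a, b, c, d), p in joint.items())

def bern(p1, x):
    """P(X = x) for a binary X with P(X = 1) = p1."""
    return p1 if x == 1 else 1.0 - p1

# Hypothetical conditional probability tables for the DAG D -> A, D -> B, B -> C.
pD1 = 0.4                        # P(D = 1)
pA1_given_D = {0: 0.2, 1: 0.7}   # P(A = 1 | D = d)
pB1_given_D = {0: 0.5, 1: 0.9}   # P(B = 1 | D = d)
pC1_given_B = {0: 0.3, 1: 0.8}   # P(C = 1 | B = b)

dag_joint = {}
for a, b, c, d in product([0, 1], repeat=4):
    dag_joint[(a, b, c, d)] = (bern(pD1, d) * bern(pA1_given_D[d], a)
                               * bern(pB1_given_D[d], b) * bern(pC1_given_B[b], c))

# An arbitrary joint distribution that need not respect any DAG.
rng = np.random.default_rng(1)
raw = rng.random(16)
arbitrary_joint = {cell: w / raw.sum()
                   for cell, w in zip(product([0, 1], repeat=4), raw)}

print("decomposition error, DAG-generated joint:", decomposition_error(dag_joint))       # ~ 0
print("decomposition error, arbitrary joint:    ", decomposition_error(arbitrary_joint)) # clearly > 0
```

The mismatch is essentially zero for the DAG-generated joint and clearly nonzero for the arbitrary one, which is the sense in which the decomposition is an assumption carried by the DAG rather than a general identity.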
So next we're going to flip things around, and we'll see that a particular probability function does not necessarily imply a unique DAG. Here's a simple example with two DAGs. The first one involves two nodes, A and B, with an arrow from A to B. So A is affecting B. In DAG 2, there are the same two nodes, A and B, but B is affecting A; the arrow in DAG 2 goes from B to A. So these convey different information. The first conveys that A is affecting B; the second conveys that B is affecting A. They are also both telling us that A and B are dependent. However, imagine that instead of starting with a DAG, I started with some kind of probability statement. For example, suppose I told you that the probability of A and B does not equal the probability of A times the probability of B. Another way of saying that is that A and B are dependent, or A and B are not independent. If you started with just this statement, you could write down either DAG 1 or DAG 2. You wouldn't know which one is correct; both of them would be compatible with the probability statement you've written down. So the probability statement would not imply a unique DAG, whereas, if you started with a particular DAG, that would imply something about a probability statement.
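Here is a tiny numerical sketch of that last point, with made-up numbers: take any joint distribution over two binary variables in which A and B are dependent. It can be factored both ways, as P(A) P(B|A), which matches DAG 1, and as P(B) P(A|B), which matches DAG 2, so the dependence statement alone cannot distinguish the two graphs.

```python
# Hypothetical joint distribution over two binary variables, chosen so that
# A and B are dependent.  Keys are (a, b).
joint = {(0, 0): 0.40, (0, 1): 0.10,
         (1, 0): 0.15, (1, 1): 0.35}

pA = {a: joint[(a, 0)] + joint[(a, 1)] for a in (0, 1)}   # marginal of A
pB = {b: joint[(0, b)] + joint[(1, b)] for b in (0, 1)}   # marginal of B

# A and B are dependent: P(A=1, B=1) != P(A=1) * P(B=1)
print(joint[(1, 1)], "vs", pA[1] * pB[1])

# Both DAG-style factorizations rebuild exactly the same joint,
# so the dependence alone cannot tell DAG 1 from DAG 2.
for (a, b), p in joint.items():
    dag1 = pA[a] * (joint[(a, b)] / pA[a])   # P(a) * P(b | a)
    dag2 = pB[b] * (joint[(a, b)] / pB[b])   # P(b) * P(a | b)
    print((a, b), round(p, 4), round(dag1, 4), round(dag2, 4))
```

Every row prints the same three numbers, which is just the chain rule working in both directions; nothing in the joint distribution alone tells you which way the arrow points.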