Today we're going to talk about another way of measuring centrality in a network. This one is called betweenness centrality. And it makes the assumption that important nodes are those who connect other nodes. Recall that the distance between two nodes is the length of the shortest path between them. So for example, in this network, nodes 34 and 2 have a distance 2 because there are multiple ways of getting from node 34 to 2 in two steps. For example, we can take the path 34, 31, and 2. We can take the path 34, 14, 2. Or we can take the path 34, 20, and 2. Notice that nodes 31, 14, and 20 are in the shortest paths between node 34 and 2. And so, this is the kind of question we're looking at with betweeness centrality. What are the nodes that show up in the shortest paths between two different nodes? The way we are going to measure centrality, the betweenness centrality of a node v is by taking nodes s, t and finding all the shortest paths between nodes s and t. That, we're going to call sigma s, t. Sigma s, t is going to be the number of shortest paths between nodes s, t. And then, we're going to look at how many of those shortest paths actually contain node v in the middle? That's this value here. So, sigma s, t is the number of shortest paths between nodes s and t. And sigma s, t(v) is going to be the number of shortest paths between s and t that contain node v. And the betweenness centrality of node v is going to be the sum of these ratios overall possible s and t's. Actually, we're going to find that there are different ways in which we can pick the specific s and t's that we use to compute the centrality node v. But we'll talk about that next. Basic idea here is that a node v has high betweenness centrality if it shows up in many of the shortest paths of nodes s and t. One of the questions, one of the options that we have when we do this is whether or not we actually include node v as a possible node s or t. Let's say for example that we exclude node v from being a possible node s or t. So then we'll have the following. If we measure the betweenness centrality of node B in this simple network, we'll find that is the sum of three different things. First, we're going to have the number of shortest paths between nodes A and D. And there is only one shortest path between node A and D, which is the path A, B, and D. And so because B is involved in this path, then that is going to have value 1 over 1. Next, we look at the two nodes A, C. And we find that the shortest path between notes A and C is the path A, B, C. And of course, that involves node B, so that contributes the value of 1 over 1. Finally, if we look at the pair of nodes C and D, we find that its shortest path is just simply going from node C to D directly, since they're connected. And that does not involve node B, so that contributes 0. And so betweenness centrality of node B when we exclude B from playing the role of s and t, we find that it has betweenness centrality of 2. Now if we actually include node v as one of the endpoints here, then we find that there are many more options to look at, right. So first, we have to look at the pair of nodes A and B, which of course involves node B, so that contributes 1. We look at A, C, which has the shortest path A, B, C. And that involves B, therefore that contributes 1. A, D also passes through node B, so that contributes 1. B, C, well, this one involves the node B itself, so of course node B is involved in the shortest path from node B to C, and so that contributes 1. B, D same story, it's one of the endpoints, so it contributes 1. And then finally, C, D. C, D, again, they're connected to each other. So to get from one to the other, they don't have to pass through node B and so that contributes 0. And so when we include node B in the computation, we find that it has a higher betweenness centrality, in this case 5. We'll find that when we use network x to compute this, we'll have the option of either including or excluding the node as one of the endpoints in the pair of nodes. Now, here we have another concern, which is that sometimes some pairs of nodes are actually not connected to each other, they cannot reach each other. So what happens to the computation when we have this case? Again, this tends to happen more often in graphs that are directed. And so let's look at this example. Note here that node D has no nodes that are actually pointing to it. So actually, no node can actually reach node D. And so how do we compute this given that no node can actually reach D? If we were simply to apply the definition as we stated it and we actually include pairs for example A, D, then sigma A, D would be 0 because they're no shortest paths between A and D. And we have that sigma A, D in the definition a=is in the denominator. And so this would make this undefined. And so, we have to fix that in some way. And what we do is that we simply only consider the nodes that actually have at least one shortest path between them when we're considering nodes s and t. So, in this case, what is there betweenness centrality of node B? Let's say we're not including node B in the computation as one of the endpoints. Well, then we have to see which ones are the nodes that actually have at least one shortest path between them. And we'll use those in the computation. So A, C has a shortest path, thus A,B, C. And B involved in that, so that it contributes 1. C, A, C can reach A in just one step by connecting directly to it. That does not involve node B, so that contributes 0 to the computation. D, C, again, they're connected without including B, so that's contributes 0 to the computation. And then D to A, that has the shortest path D, C, A, which does not involve node B. And therefore, it contributes 0 to the computation. Notice that you can also go from D to A passing through B. You can go D, B, C, A. But that is a longer path, so it's not the shortest path. And so B is not involved in the shortest path between D and A. And so in this case, node B has a centrality of 1. Let's look at the same question for node C. And so again, we have to look at all the pairs of nodes that actually have a shortest path between them. And we're not including C as one of the endpoints. And so the first one is A, B, which there is a direct connection between them, and does not involve C, so that contributes 0. B to A, the shortest path from B to A is B, C, A. And that involves node C so it contributes 1 to the centrality of C. There's D, B, they're directly connected, so that contributes 0. And then there is D, A. And again the path from D to A, the shortest path passes through node C, so that contributes 1 to the computation. And overall we've find that node C has a betweenness centrality of 2. So, so far we haven't talked about normalizing the betweenness centrality in any way. And the problem with this is that nodes that are in graphs that have a larger number of nodes will tend to have higher centrality than nodes of graphs that are smaller in terms of the number of nodes. That's simply because in large graphs, there are more nodes, s and t, to choose from to compute the centrality of the nodes. And so for example, if we look at these friendship network in the 34 person karate club, the nodes there are going to have lower centrality than the nodes in this larger network of 2200 people. And so, sometimes if we want to compare betweenness centrality across networks, it's useful to normalize. And the way we normalize is simply by dividing the betweenness centrality of our node v by the number of possible pairs of nodes in the network, not including node v. So for undirected graphs, you would divide them betweenness centrality of v by (N-1)(N-2) over 2. That's the number of pairs that you could have in an undirected graph excluding the node that you're currently looking at. And in directed graphs, you have twice the number of pairs because for any pair s, t, you could have a path from s to t, but also a potentially different path from t to s. So you would divide the betweenness centrality of node v by (N-1)(N-2). And in network x, you can use the function betweenness centrality to find the centrality of every node in the network. And you have the various options that we've discussed here. So for example, you can choose to normalize or not. And you can also choose the question of the endpoints, whether you use the node that you're computing the centrality of as one of the endpoints in the computation of its centrality. So you can choose to do this in any way you want here. So for example, if we look at the karate club and look at betweenness centrality, compute the betweenness centrality of all the nodes and then find the five largest, the five nodes with the largest betweenness centrality, we find that these are the nodes 1, 34, 33, 3, and 32. Now one of the issues with betweenness centrality is that it can be very computationally expensive. Depending on the specific algorithm you're using, this computation can take up to order number of nodes cubed time. And just for us to get some idea about this, look at the network of friendship, marital tie, and family tie among these 2,200 people. This is a relatively small network, yet when you look at the number pairs of nodes that it can have, you have a new order of 4.8 million pairs of nodes. And so even small networks have lots and lots and lots of pairs of nodes. And so computing the betweenness centrality of these networks becomes pretty expensive. So one of the things that you can do is rather than the computing betweenness centrality based on all the nodes s and t, all the possible nodes s, t in the network, you can approximate it by just looking at a sample of nodes, instead of looking at all the nodes. And in network x, you can do this by using the parameter k that says how many nodes you should use to compute the betweenness centrality. And so here, I'm computing the betweenness centrality of the nodes in the karate club network using only 10 nodes rather than 34 nodes. And so this gives you an approximation for what the betweenness centrality of the nodes actually is. And if I look at the five nodes with the largest approximated betweenness centrality, we find that these are nodes 1, 34, 32, 3, and 2. So we get almost exactly the same list as we did when we didn't approximate, when we find the actual betweenness centrality, except that we now get 2 as one of the top five and we lose 33 as one of the top five. So it gives us something that is close to the actual. But of course, there can be some differences since now you're only using 10 rather than 34 nodes to compute the centrality The other thing that sometimes is useful is that sometimes you rather compute the betweenness centrality based on two subgroups in the network, not necessarily looking at all potential pairs of nodes. But you maybe really care about two groups communicating with each other. So you want to find what are the most important nodes in this network that tend to show up in the shortest paths between a group of source nodes and a group of target nodes? And so to do this in network x, you can use the function betweenness centrality subset, in which you pass the graph and then you pass the set of source nodes and the set of target nodes. And you can choose to normalize or not. In this case, I'm normalizing. And I'm just sort of here selecting two groups of nodes, pretty arbitrarily, just to kind of give you an example here. So we're going to see, based on these source nodes and target nodes, what are the most important nodes? And again, here what the meaning of these source nodes and target nodes is that when we select the nodes s, t to compute the centrality of all the nodes, we're always going to choose s from the set of source nodes, and t from the set of target nodes, rather than selecting all possible pairs. And so when we find the top nodes here with the highest betweenness centrality in this setup, with these source nodes and these target nodes, we find that nodes 1, 34, 3, 33, and 9 are the most important nodes. Now notice that these tend to be the nodes that also have highest centrality when you don't restrict to source and subset of source and target nodes. But there are some changes. So for example, we had not seen that node number 9 was important before. Now we see that it's important in connecting this particular sets of nodes. The other thing you can do is you can define the betweenness centrality of an edge, rather than the betweenness centrality of a node, in much the same way that you defined betweenness centrality for a node. So if you're defining the betweenness centrality of an edge, you're going to again look at pairs of nodes as t. And you're going to take the ratio of the number of shortest paths in going from s to t that involve the edge e divided by all shortest paths between nodes s and t. So it is the exact same definition. But now rather than asking is this particular node showing up in the shortest path between s and t, we are asking is this particular edge showing up in the shortest path? And in network x, you can use the function edge betweenness centrality to find the betweenness centrality of all the edges in the network. And so here, if we find the top five edges with the highest betweenness centrality, we find that these are it. So they all tend to be edges that are connected to node number 1, which if you remember, node number 1 here is the instructor of the karate club. In the same way that you could define a specific set of source nodes and a specific set of target nodes, you can do the same thing when you compute the edge betweenness centrality rather than node betweenness centrality. And for this, you can use the function edge betweenness centrality subset. And you pass again the graph and the source nodes and the target nodes. And if we find here the top five edges with the highest betweenness centrality for this particular choice of source and target nodes, we find that these are the the most important ones. And notice that most of them tend to be edges that go from inside the target or inside the source set to the outside. And that make sense because these are the ones that actually end up showing up in the shortest paths between the source and the targets. And they also tend to be connected to very important nodes in the network, namely node number 1, which is the instructor of the karate club and node number 34, which is the instructor of the new karate club after these club splits in two. So in summary, we say that betweenness centrality makes the assumption that important nodes tend to connect the other nodes. And this is the formula that we use to compute it. In general, it's the sum of the fraction of the number of shortest paths that involve a particular node v divided by all the possible shortest paths between the nodes s and t. We talked about normalizing this, especially if we're comparing betweenness centrality among different networks of different sizes. So we divide by the number of pair of nodes. We also talked about approximating this because sometimes we're unable compute it exactly because it can be computationally expensive. So we can approximate it by selecting a subset of nodes rather than all the nodes. And we showed you how to do this in network x. We also talked about choosing specific sets of target nodes and specific sets of source nodes rather than using all possible pairs. That's if you have a particular sets of nodes that you care about and that you want to know, who are the important nodes that are connecting nodes in this two specific sets. And then we talked about how we can generalize this a bit more and talked about the betweenness centrality of not only the nodes but also the edges. Much in the same way that we define it for nodes, we can also define for edge. This is all for this video. Thank you for watching and I see you next time.