This week, you have been learning a variety of spatial data analysis method. In this lecture, the topic of your study is Network Analysis. You'll start with the definition of network analysis, and a few method will be introduced, including Geocoding, map-matching, and shortest path finding. Network is defined as global interconnected things or objects. And there are many examples of spatial data represented by network. The best example is transportation network, such as road network and railroad network, and so on. Network data can be stored in network data model, which is basically a graph data structure, which you studied in the third week. Network analysis can be defined as data processing and analysis of network data. It includes data pre-processing, such as Geocoding and map-matching and data analysis, such as routing, service area delineation, resource allocation, and so on. In fact, you already used network analysis in proximity and accessibility analysis, in which floating catchment analysis defines service area by travel time or travel distance, which are based on transportation network analysis. Geocoding is the first example of network analysis. Precisely speaking, one method of Geocoding is based on network analysis, but others are not. The definition of Geocoding is the process of exchanging between coordinates and description of place. The example is that, the exchange between coordinates and addresses, also known as address matching, which is based on network analysis. Geocoding includes not only the address matching but also other methods, such as c-squares and GeoHash. Address matching is a critical issue in spatial data analytics where address-based data and coordinate-based data are integrated. C-squares and GeoHash are not exactly related to network analysis but they are very important particularly in spatial big data management and processing. So, I, on purpose, added them here for more explanation. The conversion to address with respect to a given coordinate, is to interpolate the position of the address based on its numeric distance, between the starting and the ending addresses on the block of road network. In urban area, each road segment corresponds to a block. And it has attributes of the left from to and the right from to addresses. In the example, the left from and to is 100 and 200. On the other hand, the right from and to is 101 and 201. So, if the coordinate is located on the left side, and the proportion of the location on the road network is 60 percent, then the address is expected to be 160 main street by interpolation. This is a simple example of address matching from coordinate to address. It looks simple and straightforward. However, in reality, it will not work out very well in rural area. At the same time, it should be noted that, the example is for road-based address matching in the US. The opposite way around, conversion to coordinate with respect to given address should take a series of text processing steps. The process includes parsing the input address into address components such as street names, street types, city, zip code, standardizing abbreviated values, assigning each address element to category known as match key, searching the reference data, assigning a score to each potential candidate, and delivering the best match and each coordinate as the outcome of the whole process. This is not domain-specific process. In other words, different countries have different Geocoding solutions. Surprisingly, the accuracy of the address matching is not that high. In most countries, 90 percent, more or less, is the typical range. C-squares stands for Concise Spatial Query and Representation System, which provides the basis for simple spatial indexing of geographic features. C-squares divides the surface of the earth into rectangular cells at a different level of scales, and assigns a unique c-squares code to each cell. The code is basically encoding of latitude and longitude. Here's an example of c-squares code for one degree cell with origin at 37 degree north and 127 degree east. The first number represents the global sector, one of northeast, southeast, southwest, and northwest. The second digit stands for tens of degrees in the latitude. And third and fourth digit, collectively represent tens of degrees in the longitude. After colon, the first digit stand for five degree quadrant. The second and the third digit denote additional degrees in latitude and longitude, respectively. In such way, all the rectangular cell can have its unique c-squares code. Now you are looking at one degree by one degree cell around Seoul in Korea. Where 1312:477 is highlighted. GeoHash is another Geocoding system, and similar to c-squares, it divides the surface of the earth into rectangular cells, and assign a unique ID which can be automatically generated by transforming latitude and longitude to symbol, composed of numbers and alphabets. It is hierarchical spatial data structure based on Z-order curve which is one of spaced-filling curves. The symbol is based on Base32 which is composed of 32 numbers and characters. It is easy to program and it is a popular Geocoding method. The table shows the matching between base 32 and decimal number 0 to 31. For example, wydm8 in base 32 is 28, 30, 12, 19, and 8 in decimal number, which can be converted to a binary code. With respect to the given binary code, digits on odd position represents longitude, and digits on even position does latitude. The slide explains the decoding methods of GeoHash. Each binary code is used to determine a series of dividends of a given interval. The first digit represents the interval of minus 90 to plus 90, and it is divided by two. producing 2 intervals, minus 90 to zero and zero to plus 90. Since the first digit is one, the higher interval zero to plus 90 is chosen. The procedure is repeated for all the bits in the code. Longitudes are processed in the same way. Keeping in mind that the initial interval is minus 180 to plus 180. The table summarizes the decoding of a given code for latitude 1 0 1 1 0 1 0 1 0 1 1 0. The first digit is 1. So the interval zero to plus 90 degree is chosen in yellow color, and zero to plus 90 degree interval is divided by 2 intervals, and the second digit is zero, so the lower intervals, which is 0 to plus 45 is chosen. The third digit this 1, so the higher interval 22.5 to 45 is chosen. The procedure is applied for the other digits, and finally, the interval of the code is between 37.529297 and 37.573243. That figure illustrates the areas of which GeoHash code wydm8 represents, region around Seoul Korea. The given longitude and latitude are the center coordinates of the given cell of wydm8. The value of GeoHash as well as C-squares is that they can provide one-dimensional indexing for two-dimensional spatial data. Why is this important? Because spatial big data process GeoHash or C-squares can be used for the key value pairs in the Hadoop MapReduce. Particularly GeoHash is popularly used for the propose, because GeoHash code contains spatial semantics, locations, and it is rather easy to calculate. There is a value of GeoHash in spatial big data processing. Map matching is the process of mapping 2-D coordinates to locate on the road network. It is a basic process for transportation related analysis, such as a vehicle trajectory analysis, shortest path algorithm, and travelling time analysis on network because it can locate any arbitrary points to a network, so that network analysis can be started. The figure show an example of map matching with the GPS coordinates. And map links. A series of GPS coordinates in red dots should be appropriately located road network for further analysis. For the given example, GPS points 1, 2, 3, 5, and 6 are easy to locate on the map links only with simple distance computation. However, GPS point 4 would cause some difficulty with the method in choosing either link 1330020 or 1329196. If you have information with road connectivity and vehicle trajectory, then the solution can be simply determined to link 1330020. So, there are many map matching algorithms, and they are mainly categorized into 3 groups, which are geometric algorithms, topological algorithms, and probabilistic algorithms. The simpler methods are based on geometric map matching algorithms, which consider only geometric distance between features. The most commonly used geometric map matching algorithm is a simple search algorithm. In this approach, arbitrary point such as GPS coordinates are matched to the closest vertex of a road segment or the closest load segment itself and corresponding points, as the figure illustrates. They are easy and simple to implement and generally very fast. However, it would have accuracy issues in intersection areas, or in very dense metropolitan road network. An alternative solution is to utilize topological information of road network. Topology, which you learned in the first week and the third week, refers to the relationship among special entities, such as points, lines, and polygons. Topological algorithm considers not only geometric information between GPS coordinates and road network, but also topological information of the connectivity and contiguity of road links for map matching. The previous example with topological information could make map matching algorithm more accurate than simple geometric approach. The GPS point 4 should be located only on link 1330020 because previous GPS point 3 is located on link 1330020, and the link 1329196 is not connected to the previous link. The next topic is shortest path finding, which is the problem to find the path between two vertices and network, such that the sum of predefined cost of edges is minimized. That cost could be distance, time, fuel consumption, and so on. The problem can be categorized based on the number of origins and destinations. If it has a single origin and single destination, it is one-to-one. Likewise, one-to-all, all-to-one, all-to-all cases. Among them, you will learn an algorithm named, Dijkstra's algorithm for one-to-all case. The given pseudo code gives you the detail of Dijkstra's algorithm. It requires only one origin O and a network data which is composed of node and edges, also known as paths, and corresponding cost of each path. The algorithm maintains two list variables, S and D, where S is set of nodes whose minimum cost is known. D is a set of minimum distance from the origin O to the other nodes. The main body of the program is to keep updating list variable S and D. Initially, a list S contains the origin O and the list D contains the cost between the origin and directly connected nodes. Then iterations starts. Such that, first, adding new node W to the list S which has the minimal cost in the list D. For all the nodes v, update D[v] with comparison between the current cost and the cost of the path from the node W to node V. This operation is repeated until the list S is filled up with all the nodes in the network. Difficult? Yes it is. So it would be better to understand this algorithm with an example. The figure illustrates one to all shortest path problem of airline network of nine major cities in the US. You'll find the shortest distance from the origin, Baltimore, BWI, to the rest of the cities. The first step is to initialize the list S and the list D. For S, the origin Baltimore is added, and for D, the cost. Here the distance is between Baltimore and connected city. Here New York, JFK, Chicago, ORD and Miami, MIA are updated. The rest of cities have distance of infinity because they are not connected. Now, get ready to start iteration. First, pick the node with the minimum cost from the list D which is JFK. And add the node to the list S, then update the list D with respect to all the nodes. With comparing the current cost and the cost of the path through JFK to each node. The third column of the table shows the cost comparison. Here the cost to Boston, BOS, is changed from infinity to 371, so that the list D is updated. The next iteration starts. First, pick the node with a minimum cost from the list D which is Boston, BOS, and add the node Boston to the list S. Then in order to update the list D with respect to all the nodes, compare the current cost and the cost of the path through Boston to each node. Again, the third column of the table shows the cost comparison. Here, the cost to San Francisco, SFO, is changed from infinity to 3075. So that the list D is updated. Again, next iterations starts. First, pick the node with the minimum cost from the list D which is Chicago, ORD, and add the node ORD to the list S. Then update the list D, compare the current cost and the cost of the path through Chicago to each node. Again, the third column of the table shows the cost comparison. Here, the cost to Dallas Fort Worth, DFW, Denver, DEN, and again, San Francisco, SFO, are changed. So that the list D is updated with respect to the three nodes. Again, next iteration starts. First, pick the node with minimum cost from the list D which is Miami, MIA, this time and add Miami to the list S, then compare the current cost and the cost of the path through Miami to each node. Again, the third column of the table shows the cost comparison. Here, the cost to Los Angeles, LAX, is changed, so that the list D is updated with respect to LAX. Next iteration start. Dallas Fort Worth, DFW, is selected which has the minimum distance this time. So add the DFW list S, then cost comparisons is conducted the same way. This time LAX is updated from 3288 to 2658. In the next iteration, Denver is selected and DEN is added to list S. The cost comparison is conducted. This time the cost to LAX is changed from 2658 to 2368. The list D is updated with the value for LAX. In the next iteration, LAX is selected as the node with the minimum cost and LAX is added to the list S. The cost comparison conducted again this time no update is made. In other words, current cost distance is maintained. With respect to the same D, the node with the minimum cost is San Francisco, SFO. So add SFO to the list S. Now, all the nodes are in the list S, which is the stop criteria of the iteration. The list D represents the minimum distances from Baltimore to all other airports in the network. You can also keep track of the routes of minimum distances to each airport. This is Dijkstra's algorithm which can present minimum cost paths from the origin to all the node in the network. In this lecture, you have studied network analysis including the concept, Geocoding, map matching, minimum path finding on network. In fact, what you have studied is just small fraction of network analysis. There are numerous analysis method applicable to not only spatial data, but also big data such as social network data. You just finished the fifth week of this course about spatial data analytics. You have only one more week to go. For the sixth week, we will study how to make use of what you have learned so far for real world applications.