[MUSIC] Modern day science builds on the prior accomplishments of several generations of scientists. We often compile data from various sources and only when they are combined do they reveal something exciting, and new. In this lecture, we're going to explore how to setup your own database using data from astronomical catalogs. Virtually, every scientist stores some of their data in flat files, such as plain text files or CVS files and this format has lots of advantages. Flat files are quite efficient in their space usage and they're easily readable by humans, and machines alike. However, once you start working on some larger and more complex projects, there are also some disadvantages. Firstly, it's hard to manage data redundancy and inconsistency. If the brightness of a star ends up being duplicated in ten different files. But later, we want to update that brightness, because of improved measurements. We're likely to have a problem. If just one of the duplicated values is accidentally missed, the data now contains inconsistencies. It's also very time consuming to work in this manner. Secondly, depending on the format, the data can only be accessed with custom software. In other words, all queries and filters have to be pre-implemented. It's generally not possible for another scientist to do an arbitrary query on the data without writing their own code. Thirdly, in large projects, you may need to restrict access to parts of the day to some of the researchers within the collaboration which then leads to security problems. It isn't easy to manage access with permissions with flat files. There are also a number of disadvantages of flat files that are maybe less important for science, but critical in other domains. For example, in banking databases, it's critical that if a system fails during a transaction say, a transfer of money from one account to another, then the previous state can be restored. Database systems insure atomicity, either the transaction occurred or it didn't. Database management systems provide a solution to all of these issues and are increasingly popular for sharing large survey data. They are a massive concurrent access for large numbers of users. For example, the Skyserver database of the Sloan Digital Sky Survey was accessed over 240 million times in 2016. For databases to do all this, they have to be able to retrieve data efficiently. This requires complex data structures, but the whole point is that the database is doing this for you. You don't have to know what's going on under the bonnet, you can just focus on the science. So when we think about databases, we're thinking about abstraction. There's a physical layer, which is how the data is actually stored and indexed on the computer. As a scientist, we don't generally want to worry about this layer, then there's the logical layer. This is where we think about what data is stored in the database and how those data are organised, and indexed. This is important when you're setting up your own database. But when you're using someone else's, you can find out this information through the published schema. Finally, there's the view layer which is just a view of the database for a particular purpose. For example, collaboration members may need a view which shows which of the data have been publicly released. So if you think a database might be useful for the projects you're working on, what do we actually have to do to set one up? Firstly, we have to collect the data. In this activity, we'll use data from existing projects to demonstrate, but you can follow the same process from any dataset that you have. We'll start with data from the hypothos catalog. Hypothos was a scientific satellite that measured the power lapses of stars for the purposes of knowing how far away they were. It also measured their brightnesses and their colors. We'll combine hypothos data with cluster membership list from bazeer, which gave the probability of each star being physically associated with that cluster. The second thing you have to do is decide on the database management system or DBMS that you want to use. Some of the common platforms are SQL Server, MySQL and PostgreSQL or Postgres. For this course, we'll use Postgres, because it's a widely used, free open source DBMS with support for spatial queries. Next and most importantly, you have to design the schema that represents your data. The schema sets out the attributes of each table in your database and how the tables relate to each other. It lists the type of and constraints on the attributes in each table, and identifies the primary key for each table which ensures that each row has a unique key. It also describes the relationship between the tables. For example, whether there's a one to one or a many to one correspondence. For a simple dataset like this, it's pretty easy to design your own schema. For complex datasets, it can be more challenging. However, going through this process can also help you think more deeply about how you organize your scientific data. So now we've designed the schema, the final steps of how to setup the database tables and then load the data which is what we're going to do in the next activity. After that, we'll be ready to do some science. [MUSIC]