Welcome to course five on combining and analyzing complex data. This one will be taught by me, Richard Valliant, and Frauke Kreuter. We've got quite a bit of material to cover in this course, and I'd like to give you a brief overview of what's going on in the first module. In module 1, basic estimation, we'll talk about things like estimating means, totals, and quantiles, and how you do that in complex samples. In the second module, we'll talk a bit about model fitting: how do you estimate things like the parameters of linear regression models and nonlinear models like logistic regression? And we'll look at software for both of these. In module 3, we're going to look at some basic methods of record linkage, which is becoming a more important topic these days. Associated with that, in module 4, are some ethical issues in linking data, and these may vary from one country to another, so it's worthwhile that you know something about those. Now, in module 1 on basic estimation, we'll start out in this video by talking about totals and means. In a complex design, a key thing is that you need to account for the weights. If your sample were a little miniature of the population, then everything would have the same weight, and a lot of the analysis would be simplified. But because of things like varying selection probabilities in the sample design, nonresponse adjustments, or calibration to external control counts, we typically have different weights for different units. You shouldn't ignore that, because the weights mean something. Another thing you need to account for in complex sample analysis is that weights, strata, and multiple stages of selection affect the standard errors that should be estimated, so we need to properly account for those. Fortunately for us, there's software out there available for analyzing complex samples, and we'll give you a number of examples of how to use it. Now, totals are the easiest thing to talk about.
So let's think about those. If your weights are scaled in such a way that they take the sample up to the population, that is, they project the small set that you've got up to the big set of the population, then you estimate a total in the following way: you sum over the sample units, which is what i ∈ s means (i indexes the units and s is the set of sample units), and for each unit you take its weight times its data value. That gives an estimated total for whatever the y variable is, income for example. If y is zero or one, it could be the number of people who have diabetes or the number of people whose water supply is somehow contaminated; it can be all sorts of things. Now, for the mean, all we do is take that estimated total and divide by the sum of the weights. Again, if the weights are scaled to estimate population totals, the sum of the weights is going to be an estimate of the number of units in the population. Also, if we were to sum the weights over just a subset, like the males in your sample if you're sampling people, that would be an estimate of the number of males in the population. It will be an estimate of the count of units in whatever subgroup you sum over. So that's a very handy thing about the standard way of constructing complex survey weights. Now, model parameter estimates typically depend on estimated totals, so if you can figure out how to estimate totals, you can typically figure out how to estimate model parameters, and there are routines in the software that will do that for you. Quantiles are a little bit different, and the software choices are more limited, but here's how the algorithm goes. First we identify the variable we want a quantile of. This would be a quantitative variable like income or years of education. We sort the file from low to high based on that y variable, and associated with each unit we've got a weight.
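Before getting into the quantile algorithm, here is a minimal sketch of the weighted total and mean just described, written in Python with NumPy (the course's own software isn't specified in this video, and the weights and y values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical survey weights (scaled to project the sample up to the
# population) and an analysis variable y, e.g. income.
weights = np.array([120.0, 80.0, 150.0, 100.0, 50.0])
y = np.array([35000.0, 52000.0, 41000.0, 28000.0, 61000.0])

# Estimated population total: sum over sample units of weight * value.
total_hat = np.sum(weights * y)

# The sum of the weights estimates N, the population count.
n_hat = np.sum(weights)

# Estimated mean: estimated total divided by the sum of the weights.
mean_hat = total_hat / n_hat
```

Summing `weights` over only a subgroup (say, the males in the sample) would in the same way estimate the count of that subgroup in the population.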
So we cumulate the weights until we reach a certain point. In the case of estimating the median, we cumulate the weights until 50% of the sum of all the weights is reached. Then you look for the y value of the first unit that's got a cumulative weight of 50% or more of the total weight, and that will be your median value. Sometimes that requires some rounding off because of the discreteness of the sample, but that sort of thing is built into the software also. You can do other things, like the first and third quartiles: just cumulate the weights until you get to the appropriate point, 25% or 75% of the cumulated total weight, and look at the y value that goes with that unit; that will be your estimate of the first or third quartile. So in that way, it's fairly straightforward. The hard part with quantiles is estimating a measure of precision, which we'll talk about also. In the next video in this module, we'll get into software that's available for some of these analyses.
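The sort-and-cumulate algorithm above can be sketched as follows (a Python/NumPy illustration of the basic idea, without the rounding or interpolation refinements that survey software builds in; the data in the example call are invented):

```python
import numpy as np

def weighted_quantile(y, weights, p):
    """Estimate the p-th quantile of y under survey weights:
    sort the file by y, cumulate the weights, and return the y value
    of the first unit whose cumulative weight reaches p * (total weight)."""
    order = np.argsort(y)
    y_sorted = np.asarray(y)[order]
    cum_w = np.cumsum(np.asarray(weights, dtype=float)[order])
    threshold = p * cum_w[-1]
    # First index whose cumulative weight is >= the threshold.
    idx = np.searchsorted(cum_w, threshold, side="left")
    return y_sorted[idx]

# Median: cumulate to the 50% point of the total weight.
med = weighted_quantile([3, 1, 4, 1, 5], [2, 1, 3, 1, 2], 0.5)  # -> 4
# First quartile: cumulate to the 25% point.
q1 = weighted_quantile([3, 1, 4, 1, 5], [2, 1, 3, 1, 2], 0.25)  # -> 3
```

Production routines refine this with interpolation between adjacent y values, which matters when the discreteness of the sample puts the threshold between units.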