Welcome to the class Statistics for Genomic Data Science.

I'm your instructor Jeff Leek and

I just wanted to make this video to introduce you a little bit to myself and

to the course that we're going to be covering over the next four weeks.

So the first thing I thought I'd do is tell you a little bit about myself.

I love statistics a lot.

You might not like statistics very much. I hope you like it at least a little bit, and

that's why you're taking this class. But I like it so

much that I do lots of different things related to statistics.

The first thing that I do is I write a blog called simplystats. It's also on

Twitter @simplystats, and we talk a lot about statistical issues related to

genomics, related to personalized health, related to a lot of other issues as well.

I also teach in another large set of MOOCs,

the Johns Hopkins Data Science Specialization,

which is one of the largest MOOC sequences that's ever existed.

In that sequence of classes, we talk a lot about statistics and data science but

from kind of a more general perspective.

And then finally, I also do research and teach here at Johns Hopkins.

I work mostly in statistical genomics, the topic of this class.

Most of the research in my group focuses on developing methodologies and

applying them to data that comes from RNA sequencing experiments, so

you'll probably see a lot of examples from that area popping up again and again in

the class because that's the area where I have most of my actual research expertise.

So what are we going to be covering in this class?

We are going to be covering sort of the whole spectrum of statistical genomics and

genetics.

The idea here is we're going to talk about how you go from raw data, data that

hasn't been processed, all the way to the point where you are reporting results and

communicating those results.

These are the four key areas

that we will be talking about a lot in this class.

One is exploratory data analysis.

How to make plots, how to visualize data,

how to summarize data in a way that you can communicate it with people, and

how to be skeptical when you observe patterns that may or may not be real.

Then we're going to talk about normalization and preprocessing.

So genomic data is very high throughput,

it's very high dimensional, and often it's subtly different between samples, or

across studies, or across labs.

And we're going to talk a lot about how you process the data in a way so

that it's all comparable.

Then we'll talk about statistical modeling, and that will involve talking about

things like linear models, for building actual statistical models for

the way that genomic data relates to metadata,

things like phenotypes you care about, like cancers versus controls.

We'll also be talking about statistical significance, p-values, and

things like multiple testing.

Finally, we'll be talking a lot about statistical summarization:

how you take the results that come out of a large high-throughput study,

like a statistical genomics study, and summarize them, communicate them, and

try to make sense out of the biology behind them.

So that's what we're going to be covering in the class.

It's going to be a bit of a whirlwind since we only have four weeks, but

I'm really excited about it and I hope you are too.

I'm looking forward to seeing you in class.