In this session, we start our journey into the causal relations between variables.

A well-known method to estimate causal effects from

experiments is called the difference-in-difference method.

The best way to illustrate it is to consider

a simple example from clinical practice where it is most commonly adopted.

We typically want to measure the effect of a therapeutic treatment,

for example a drug, on a set of patients.

You first need to set a performance measure, that is, the variable you want the treatment to affect.

This will be your dependent variable, the variable on which you want to measure the effect of the treatment.

Suppose that this variable is the body temperature of our patients.

The experiment works in this way.

You have n patients and, in an initial period,

which we call time 1,

you measure their temperature.

In a second period,

which we call time 2,

you pick a group of patients randomly,

typically half of them, and provide them with a treatment.

That is, with the drug.

The others take a placebo.

You then measure the temperature of all patients again in the second period.

The simplest form of

the difference-in-difference estimator is the difference between two differences.

One: the difference in the average temperatures

of the treated patients after and before the treatment,

that is, in time 2 versus time 1. Two:

the same difference for the non-treated patients.
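In symbols, the estimator just described can be written as below. The notation is introduced here only for illustration: T-bar is an average temperature, the first subscript is the group, and the second subscript is the period.

```latex
\[
\widehat{\delta}_{\text{DiD}}
= \left(\bar{T}_{\text{treated},2} - \bar{T}_{\text{treated},1}\right)
- \left(\bar{T}_{\text{control},2} - \bar{T}_{\text{control},1}\right)
\]
```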

The rationale of the approach is that by administering the treatment randomly

you expect no systematic difference

between patients who received the treatment and patients who did not.

Thus, if we observe a difference,

we attribute it to the only thing that changed systematically.

That is, the treatment.

Take the simple dataset in the Excel file, Temperature.xlsx.

In this dataset, we have 20 patients

identified by a number from 1 to 20 and two periods,

time 1 and time 2 identified by the variable time.

The dataset also lists three dummy variables:

treated individual, which takes the value

1 for the patients that get the drug in the second period;

treatment time, which takes the value 1 in time 2 and 0 in time 1;

treatment treated, which takes the value 1 for

the treated individuals in time 2, that is, in the treatment period.

Treatment treated is the product of treated individuals and treatment time.
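As a minimal sketch of how the three dummies relate to each other, the Python snippet below builds them for two made-up patients. The rows and the underscore variable names are assumptions for illustration; they are not taken from Temperature.xlsx.

```python
# Hypothetical rows: (patient, time, treated_individual).
# These are NOT the real Temperature.xlsx data.
rows = [
    (1, 1, 1), (1, 2, 1),   # patient 1 gets the drug in time 2
    (2, 1, 0), (2, 2, 0),   # patient 2 gets the placebo
]

def add_dummies(rows):
    """Return one dict per row with the two derived dummies added."""
    out = []
    for patient, time, treated_individual in rows:
        treatment_time = 1 if time == 2 else 0
        out.append({
            "patient": patient,
            "time": time,
            "treated_individual": treated_individual,
            "treatment_time": treatment_time,
            # Interaction: 1 only for treated patients in the treatment period.
            "treatment_treated": treated_individual * treatment_time,
        })
    return out

panel = add_dummies(rows)
for row in panel:
    print(row)
```

Only the second row, the treated patient in time 2, ends up with treatment_treated equal to 1.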

You then run a simple OLS regression with temperature

as your dependent variable and two independent variables,

treatment treated and time.

The reason why we need time is that it is correlated with treatment treated,

which also occurs only in the second period even if only for some of the patients.

By adding time, you ensure that

the variable treatment treated does not pick the effect of time.

After you input the Excel file in Stata,

you first tell Stata that the data are panel by running the command xtset patient time.

Then, you run xtreg Temperature treatment treated time.

The command xtreg works much like the reg command,

but takes into account the panel nature of the data; without further options it fits a random-effects model.

You obtain that the difference-in-difference estimator is -1.01,

which is also precisely estimated, since the p-value is basically 0.

To see that this is the difference-in-difference estimator,

make the four predictions for a treated or

control patient after and before the treatment.

Since the constant term will be the same in all four cases,

the after minus before difference of the dependent variable for

the treated patients will be -0.16 - 1.01 = -1.17,

while for the control patients

it will be -0.16.

The difference-in-difference is then -1.01.
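The four predictions can be reproduced with a few lines of Python. This is only a sketch: the two coefficients are the ones reported above, while the constant is a made-up placeholder, since it cancels out of every difference.

```python
# Coefficients reported in the text. The constant is a hypothetical
# placeholder (not taken from the Stata output); it cancels out below.
CONSTANT = 38.0
B_TREATMENT_TREATED = -1.01
B_TIME = -0.16

def predict(treatment_treated, time):
    """Predicted temperature from the model with two regressors."""
    return CONSTANT + B_TREATMENT_TREATED * treatment_treated + B_TIME * time

# time is coded 1 (before) and 2 (after), as in the dataset description.
treated_before = predict(0, 1)
treated_after = predict(1, 2)
control_before = predict(0, 1)
control_after = predict(0, 2)

treated_change = treated_after - treated_before
control_change = control_after - control_before
did = treated_change - control_change

print(round(treated_change, 2), round(control_change, 2), round(did, 2))
```

Changing CONSTANT to any other value leaves the three printed numbers the same.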

According to this result,

the treatment works, in that it reduces the temperature of

the treated patients relative to the non-treated patients.

You can also check that if you run the xtreg regression without time,

the effect is overestimated,

which ought to convince you that you need to

control for time in difference-in-difference regressions.
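To build intuition for why dropping time inflates the estimate, recall that a pooled OLS regression on a single dummy returns the mean of the observations with dummy equal to 1 minus the mean of all the others. The sketch below uses hypothetical cell means, chosen to be consistent with the coefficients reported above, and assumes two groups of equal size that start from the same average temperature in time 1. It is not a rerun of the Stata regression.

```python
# Hypothetical average temperatures for four equally sized cells.
cell_means = {
    ("treated", 1): 38.00,
    ("control", 1): 38.00,
    ("treated", 2): 36.83,
    ("control", 2): 37.84,
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Controlling for time: the difference-in-difference.
did = (cell_means[("treated", 2)] - cell_means[("treated", 1)]) - (
    cell_means[("control", 2)] - cell_means[("control", 1)]
)

# Without time: the treated-in-time-2 cell is compared with all the other
# cells pooled together (equal cell sizes, so a plain average of three cells).
others = [v for k, v in cell_means.items() if k != ("treated", 2)]
without_time = cell_means[("treated", 2)] - mean(others)

print(round(did, 2), round(without_time, 2))
```

The estimate without time is larger in absolute value, because it also absorbs the general fall in temperature between the two periods.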

An important challenge to randomization is sample size.

The smaller the size of the sample,

the less likely it is that a random allocation of

our observations to the treatment and control groups produces two truly comparable groups.

This also explains why we need

a difference-in-difference approach rather than a difference.

We could measure whether the patients who were offered

the drug performed better than those who received the placebo.

However, because of poor randomization, there could be initial differences.

For example, the control patients may show a lower average temperature to start with.

The simple difference would then underestimate the effect,

while the difference-in-difference corrects this problem.

Suppose that in period 1,

the average temperature of the patients in the treatment group is 39 degrees Celsius,

while the average for the control group is 38.

And in period 2, the treatment group's average falls to 37,

while the control group's average falls to 36.5.

If we only look at the difference between the two groups in period 2,

we find a positive difference of 0.5 degrees.

This would suggest that the medicine does not work, because the treated patients end up with a higher temperature than the control patients.

However, if we look at the difference-in-difference,

we obtain that the drug reduces the temperature by 0.5 degrees.

The difference-in-difference takes into account that

the initial conditions of the two groups are not the same.

The treated group's temperature falls by

a larger amount than the control and this is what matters.
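The arithmetic of this example can be checked with a short Python sketch. The four averages are the ones given above; the variable names are chosen here for illustration.

```python
# Average temperatures from the example in the text (degrees Celsius).
treated_t1, control_t1 = 39.0, 38.0
treated_t2, control_t2 = 37.0, 36.5

# Simple difference: compare the two groups in period 2 only.
simple_difference = treated_t2 - control_t2

# Difference-in-difference: compare the changes of the two groups.
treated_change = treated_t2 - treated_t1    # -2.0
control_change = control_t2 - control_t1    # -1.5
diff_in_diff = treated_change - control_change

print(simple_difference, diff_in_diff)  # prints 0.5 -0.5
```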

Another way to deal with this problem is to

introduce the so-called individual fixed effects.

Individual fixed effects are simply dummy variables, one per individual, that take the value

1 for that individual's observations and 0 for all the others.

Individual fixed effects control for

any factor that varies across individuals

but not over time,

for example, whether a patient did sports in the past or where they come from.

In Stata, individual fixed effects can be introduced in a convenient way.

If you declare the panel data with the command xtset patient time,

Stata understands that the first variable is the cross-sectional dimension of the panel.

You can then introduce fixed effects by adding comma

and fe at the end of the xtreg regression command.

In our example, we would then have: xtreg Temperature treatment treated time, fe.

If you run this regression with our patient data,

you can see that the results do not change.

This suggests that the patients were already randomized properly, so the regression with no fixed effects was not distorted by initial differences between the two groups.
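One way to see what the fixed effects do: with only two periods, a fixed-effects regression that includes the time dummy gives the same estimate as comparing each patient's own change in temperature across the two groups, so anything constant within a patient drops out. The sketch below illustrates this with four hypothetical patients. The numbers are made up and are not the Temperature.xlsx data.

```python
# Hypothetical data: patient -> (treated_individual, temp in time 1, temp in time 2).
patients = {
    1: (1, 39.5, 37.4),
    2: (1, 38.5, 36.6),
    3: (0, 38.2, 36.8),
    4: (0, 37.8, 36.2),
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def within_patient_did(patients):
    """Mean own-change of treated patients minus mean own-change of controls."""
    treated = [t2 - t1 for flag, t1, t2 in patients.values() if flag == 1]
    control = [t2 - t1 for flag, t1, t2 in patients.values() if flag == 0]
    return mean(treated) - mean(control)

estimate = within_patient_did(patients)

# Shift patient 1 by one degree in BOTH periods, mimicking a trait that
# differs across individuals but not over time.
shifted = dict(patients)
shifted[1] = (1, 39.5 + 1.0, 37.4 + 1.0)
shifted_estimate = within_patient_did(shifted)

print(round(estimate, 2), round(shifted_estimate, 2))
```

Both printed values coincide: the permanent one-degree shift for patient 1 does not affect the estimate.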

Most often in these analyses,

you have more than two time periods.

For example, in the dataset Temperature long panel data example.xlsx,

you have the same variables as in the Temperature.xlsx dataset,

but now you have six periods.

In the first three periods,

you have no treatment.

And in the second three periods,

a random set of patients obtain the treatment.

If you import the longer panel dataset in Stata,

you first declare the panel by running xtset patient time.

You can then run the same xtreg regression we ran before,

including the individual fixed effects.

The only difference is that now you do not have

a single time dummy equal to 0 in the first period and 1 in the second period.

You want one dummy for each period except the baseline period.

In Stata, you do so by writing the variable i.time in the list of xtreg variables.

That is, the command is xtreg Temperature treatment treated i.time, fe.

You can try this regression by yourself and check the results.
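To make the period dummies concrete, the Python sketch below lists what they look like for six periods. It assumes the periods are coded 1 to 6 and that period 1 is the baseline (Stata's i. operator omits the lowest value by default); the dummy names are hypothetical.

```python
PERIODS = [1, 2, 3, 4, 5, 6]   # assumed coding of the six periods
BASELINE = 1                   # omitted period
FIRST_TREATMENT_PERIOD = 4     # treatment runs over the last three periods

def time_dummies(time):
    """One 0/1 dummy per period, except the baseline period."""
    return {f"time_{p}": int(time == p) for p in PERIODS if p != BASELINE}

def treatment_time(time):
    """1 in the treatment periods, 0 in the periods before."""
    return int(time >= FIRST_TREATMENT_PERIOD)

for t in PERIODS:
    print(t, treatment_time(t), time_dummies(t))
```

Six periods give five dummies, and in the baseline period all of them are 0.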

From a managerial point of view,

the bottom line of this session is that we learned how to identify causal effects.

The difference with respect to mere correlations in

regression analysis is that now the data inform us not only about predictions,

but also about actions.

In the example of this session,

we know that by administering the drug to patients similar to those in the sample,

we expect to observe a similar effect.

Thus, we can take the action in order to obtain the effect.

In the next session,

we discuss more specific managerial examples.