# Computing a difference-in-differences estimate between two years

Context: I am focusing on the effect of mergers on delivery costs.
Using data from 2007 (the pre-treatment period) and 2008 (the first year of the post-treatment period), I would like to compute the difference-in-differences estimate of the effect of "municipal mergers" on public service delivery costs.

Does anyone know the relevant formula to compute this?

Hi @Hash and welcome to the RStudio Community

Quick tip: whenever you ask a coding question on here, it helps the people reading it (your potential helpers) if you also include your data (or any data that has the features of your original data, in case you can't share it). I highly suggest you take a look at this awesome article: FAQ: How to do a minimal reproducible example (reprex) for beginners

Having said that, difference-in-differences (DID) is actually fairly simple to implement in R. It requires just a bit of manipulation of your data and the standard `lm()` function. I could provide more specific help if you shared a sample dataset.

Hi Gueyenono, thank you for the reply.

As the dataset is quite large and stored as a CSV file, I am not able to upload it. Is there any other way I could share it?

Kind regards.

Yes, you can share just a subset of the data. Let's assume that you import the data into a variable called `mydata`. Run the code `dput(mydata[1:50, ])` and paste the result here. This will be only the first 50 rows of your dataset.

``````
dput(sample[1:50, ])
structure(list(year = c(2005L, 2006L, 2007L, 2008L, 2009L, 2010L,
2011L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2005L,
2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2005L, 2006L, 2007L,
2008L, 2009L, 2010L, 2011L, 2005L, 2006L, 2007L, 2008L, 2009L,
2010L, 2011L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L,
2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2005L), Y = c(4173.217045,
4277.996451, 4319.290767, 4549.83007, 4435.450694, 4280.088368,
4020.781806, 5877.274684, 5976.119041, 6014.478399, 6216.265093,
6236.914349, 6114.028861, 5889.205762, 6079.081436, 6205.04293,
6146.156668, 6404.103999, 6459.669141, 6366.698522, 6068.932465,
6046.077215, 6147.887524, 6109.346106, 6361.920033, 6372.34715,
6235.21232, 6079.461274, 5685.307721, 5647.604075, 5694.551862,
5985.826031, 6017.33036, 5964.862862, 5760.725342, 5704.078622,
5431.702292, 5582.624809, 5883.925628, 5832.585978, 5891.687208,
5638.702178, 5869.447414, 5945.162792, 5954.481229, 6159.511579,
6189.853063, 6019.20501, 5841.450154, 6081.946856), municipality = c("mu_1",
"mu_1", "mu_1", "mu_1", "mu_1", "mu_1", "mu_1", "mu_2", "mu_2",
"mu_2", "mu_2", "mu_2", "mu_2", "mu_2", "mu_3", "mu_3", "mu_3",
"mu_3", "mu_3", "mu_3", "mu_3", "mu_4", "mu_4", "mu_4", "mu_4",
"mu_4", "mu_4", "mu_4", "mu_5", "mu_5", "mu_5", "mu_5", "mu_5",
"mu_5", "mu_5", "mu_6", "mu_6", "mu_6", "mu_6", "mu_6", "mu_6",
"mu_6", "mu_7", "mu_7", "mu_7", "mu_7", "mu_7", "mu_7", "mu_7",
"mu_8"), region = c("re_re_1", "re_re_1", "re_re_1", "re_re_1",
"re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1",
"re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1",
"re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1",
"re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1",
"re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1",
"re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1",
"re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1", "re_re_1",
"re_re_1", "re_re_1", "re_re_1", "re_re_1"), treatment = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L)), row.names = c(NA, 50L), class = "data.frame")
``````

Here are links to a few resources on Difference-in-Differences in R:

https://ds4ps.org/PROG-EVAL-III/DiffInDiff.html

https://www.econometrics-with-r.org/13-4-quasi-experiments.html

@Hash It turns out that the chunk of the dataset that I asked you to share is not very representative of the whole dataset. For example, the `region` variable has a single value: `re_re_1`. The `treatment` variable also has a single value: `0`.

Would you please answer these questions for me to be exactly sure of what you are trying to achieve:

• Do you want the pre-treatment period to be the year 2007 only or all the years up to 2007 (i.e. 2005, 2006 and 2007)? In the same way, should the post-treatment period be the year 2008 only or all the years from 2008 onward (i.e. 2008, 2009, 2010 and 2011)?

• What are the control and treatment groups? I suspect that they are in the `region` column. Do you have a second region (`re_re_2` maybe?) which corresponds to the `treatment` column being equal to 1?
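If it helps, one way to share a sample that covers both groups is to take a few rows from each value of `treatment` instead of the first 50 rows. Below is a minimal sketch; the `mydata` data frame in it is a made-up stand-in (its values, and the assumption that `treatment` takes the values 0 and 1, are mine), so replace it with your imported data.

```r
# Hypothetical stand-in for the full dataset; replace with your own data frame.
mydata <- data.frame(
  year         = rep(2005:2011, times = 4),
  municipality = rep(paste0("mu_", 1:4), each = 7),
  treatment    = rep(c(0L, 1L), each = 14),
  Y            = seq(5000, by = 10, length.out = 28)
)

# Keep only 2007 and 2008, then take the first few rows of each treatment
# group so that both treatment == 0 and treatment == 1 appear in the sample.
mydata0708 <- mydata[mydata$year %in% c(2007, 2008), ]
by_group   <- split(mydata0708, mydata0708$treatment)
shared     <- do.call(rbind, lapply(by_group, head, n = 4))
rownames(shared) <- NULL

# Paste the output of this call into your post
dput(shared)
```

Increasing `n` in `head()` gives a larger sample per group while still keeping the pasted output short.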

Hi @Hash,

After you provided me with the data privately, I was able to look at it and write code that will help you. Here, I assume that the `treatment` column identifies the control group (when `treatment` is 0) and the treatment group (when `treatment` is 1). I added many comments to the code in order to guide you through the process.

``````
# Download the full data (import step not shown; the data frame is called `cost` below)

# Subset the data to keep 2007 and 2008 data only
cost0708 <- cost[cost$year %in% c(2007, 2008), ]

# Create a dummy variable for the time (2007: 0, 2008: 1)
cost0708$time <- ifelse(cost0708$year == 2007, 0, 1)

# Create a variable for the interaction between treatment and time
cost0708$interaction <- cost0708$treatment * cost0708$time

# Run the difference-in-differences estimator (explicit method)
mod_did <- lm(Y ~ treatment + time + interaction, data = cost0708)
summary(mod_did)

Call:
lm(formula = Y ~ treatment + time + interaction, data = cost0708)

Residuals:
     Min       1Q   Median       3Q      Max
-1590.84  -108.27     3.91   129.22   497.07

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5886.83      48.06 122.501  < 2e-16 ***
treatment    -450.51      59.37  -7.588 2.04e-12 ***
time          253.84      67.96   3.735 0.000256 ***
interaction  -322.76      83.96  -3.844 0.000171 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 263.2 on 170 degrees of freedom
Multiple R-squared:  0.5732,	Adjusted R-squared:  0.5657
F-statistic: 76.12 on 3 and 170 DF,  p-value: < 2.2e-16
``````

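Your coefficient of interest in this regression output is `interaction`, which is the difference-in-differences estimate. It is -322.76 with a p-value of 0.000171, so it is statistically significant at the 1% level. In other words, between 2007 and 2008, `Y` changed by about 323 units less in the treatment group than in the control group, and there is strong evidence in the data that the treatment (the municipal mergers mentioned in your question) is associated with a change in the outcome variable `Y`. Keep in mind that reading this as a causal effect relies on the usual DID assumption that both groups would have followed parallel trends without the treatment.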
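If you want to convince yourself of what this coefficient measures, you can compute the estimate by hand from the four group-by-period means and compare it with the regression. The sketch below uses simulated data (not the municipal data from this thread); the data frame `sim` and the effect sizes in it are made up for illustration.

```r
# Simulated stand-in data (NOT the municipal data from this thread)
set.seed(42)
n_per_cell <- 50
sim <- expand.grid(id = seq_len(n_per_cell), treatment = c(0, 1), time = c(0, 1))

# Assumed values: baseline 100, group gap -5, time trend 3, treatment effect -2
sim$Y <- 100 - 5 * sim$treatment + 3 * sim$time -
  2 * sim$treatment * sim$time + rnorm(nrow(sim))

# Mean outcome in each of the four treatment-by-time cells
cell_means <- tapply(sim$Y, list(sim$treatment, sim$time), mean)

# (treated: post - pre) minus (control: post - pre)
did_by_hand <- (cell_means["1", "1"] - cell_means["1", "0"]) -
  (cell_means["0", "1"] - cell_means["0", "0"])

# The same quantity taken from the regression
did_by_lm <- coef(lm(Y ~ treatment * time, data = sim))[["treatment:time"]]

# The two should agree up to floating-point error
all.equal(did_by_hand, did_by_lm)
```

The regression is still the more convenient route, because it also gives you the standard error and p-value for the estimate.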

Just for the sake of completeness, there is another way you can run this regression. You do not really need to calculate the `interaction` variable before running the DID estimator. You can just use the interaction operator `*` in the `lm()` function, since `treatment * time` expands to `treatment + time + treatment:time`:

``````
# Run the difference-in-differences estimator (implicit method)
mod_did2 <- lm(Y ~ treatment * time, data = cost0708)
summary(mod_did2)

Call:
lm(formula = Y ~ treatment * time, data = cost0708)

Residuals:
     Min       1Q   Median       3Q      Max
-1590.84  -108.27     3.91   129.22   497.07

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     5886.83      48.06 122.501  < 2e-16 ***
treatment       -450.51      59.37  -7.588 2.04e-12 ***
time             253.84      67.96   3.735 0.000256 ***
treatment:time  -322.76      83.96  -3.844 0.000171 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 263.2 on 170 degrees of freedom
Multiple R-squared:  0.5732,	Adjusted R-squared:  0.5657
``````
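Since the original question asked for the formula, here is how the regression above can be written out, together with a check using the reported coefficients. The subscripts $T$ (treatment group) and $C$ (control group) and the bar notation for group means are my own notation, not something taken from the data.

```latex
% Model estimated by both lm() calls above
Y_{it} = \beta_0 + \beta_1\,\text{treatment}_i + \beta_2\,\text{time}_t
       + \beta_3\,(\text{treatment}_i \times \text{time}_t) + \varepsilon_{it}

% The interaction coefficient is the difference-in-differences of group means
\hat{\beta}_3 = \left(\bar{Y}_{T,2008} - \bar{Y}_{T,2007}\right)
              - \left(\bar{Y}_{C,2008} - \bar{Y}_{C,2007}\right)

% Group means implied by the reported estimates
\bar{Y}_{C,2007} = 5886.83, \qquad
\bar{Y}_{C,2008} = 5886.83 + 253.84 = 6140.67

\bar{Y}_{T,2007} = 5886.83 - 450.51 = 5436.32, \qquad
\bar{Y}_{T,2008} = 5436.32 + 253.84 - 322.76 = 5367.40

% Plugging in recovers the interaction estimate
\hat{\beta}_3 = (5367.40 - 5436.32) - (6140.67 - 5886.83)
              = -68.92 - 253.84 = -322.76
```

So in this sample the control group's mean of `Y` rose by 253.84 between 2007 and 2008, while the treatment group's mean fell by 68.92, and the gap between those two changes is the DID estimate.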

I hope this helps you.
