Create a subset of a panel data set

MLent · November 15, 2018, 10:05am

I created a panel dataset. The final goal is to run a panel regression on a subset of the data, but creating this subset is the issue.

Data example:

ID Time Variable ManyOtherVariables
1 1 123
1 2 1001
1 3 90
2 1 1111
2 2 222
2 3 2222

etc.

The subset I want is all observations for every ID whose Variable > 1000 at Time = 2 (here that would be rows 1, 2, and 3).

I ran:

reg <- plm(y~x, data=subset(df, ID[Variable>1000]), model="within")

and variations such as:

reg <- plm(y~x, data=subset(df, Variable>1000 & Time==2), model="within")

However, this way I lose the observations of the selected IDs in time periods other than Time = 2.

I would have loved to send a reproducible example, but the issue is that I am working with data on a secure computer without internet access (except for RStudio itself).

I hope my question is clear. If more detail is needed, please let me know.

mfherman · November 15, 2018, 2:40pm

Hi, @MLent! Thanks for including some of your data. There are a couple of things you can do to make it easier for folks here to help with your question. The first is formatting your code as code so it's easier to read and copy and paste into an R console. Basically, you just enclose your code between three backticks like this:

``` r
reg <- plm(y~x, data=subset(df, ID[Variable>1000]), model="within")
```

Also, to make it easier for folks here to read and work with, it's better to create an R object with your sample data and post it here. This post has some good tips for how to include sample data:

Best Practices: how to prepare your own data for use in a `reprex` if you can’t, or don’t know how to reproduce a problem with a built-in dataset?

So, with your example, I would do something like the following:

``` r
# create sample data
my_data <- tibble::tribble(
  ~ID, ~Time, ~Variable,
  1, 1, 123,
  1, 2, 1001,
  1, 3, 90,
  2, 1, 1111,
  2, 2, 222,
  2, 3, 2222,
  3, 1, 200,
  3, 2, 2000,
  3, 3, 4000
)
```

(I added more fake data to make the example a bit more clear.)
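If retyping the data by hand is impractical, base R's `dput()` is another way to turn an existing object into code that can be pasted into a post. The sketch below is only an illustration: the data frame `df` is a stand-in built from the values in the question, not the real data.

``` r
# stand-in for the real data, using the values from the question
df <- data.frame(
  ID       = c(1, 1, 1, 2, 2, 2),
  Time     = c(1, 2, 3, 1, 2, 3),
  Variable = c(123, 1001, 90, 1111, 222, 2222)
)

# dput() prints R code that recreates the object;
# head() keeps the printed output small enough to paste into a post
dput(head(df, 4))

# sanity check: evaluating the printed code should rebuild an equivalent object
recreated <- eval(parse(text = capture.output(dput(df))))
```

Anyone reading the post can then assign the pasted `structure(...)` output to a name and work with the same data.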

To manipulate data, I like to use the dplyr package, which is part of the tidyverse. It can sometimes be a little more verbose than other ways of coding in R, but I think it makes the code easier to understand!

So here is how I would create a subset of the data you describe. First I find which IDs meet the conditions you define, and then I use those IDs to subset the full dataset.

``` r
library(dplyr)

# create vector of IDs meeting condition
my_ids <- my_data %>%
  filter(Time == 2 & Variable > 1000) %>%
  pull(ID)
my_ids
#> [1] 1 3

# subset data using that vector
my_subset <- my_data %>%
  filter(ID %in% my_ids)
my_subset
#> # A tibble: 6 x 3
#>      ID  Time Variable
#>   <dbl> <dbl>    <dbl>
#> 1     1     1      123
#> 2     1     2     1001
#> 3     1     3       90
#> 4     3     1      200
#> 5     3     2     2000
#> 6     3     3     4000
```

Created on 2018-11-15 by the reprex package (v0.2.1)
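For what it's worth, the same subset can probably be built in a couple of other ways. The sketch below shows a one-step grouped `filter()` and a base R version that needs no extra packages, which might matter on a locked-down machine. The names `subset_grouped`, `subset_base`, and `keep_ids` are made up for illustration, and the sample data is rebuilt as a plain data frame so the block runs on its own.

``` r
library(dplyr)

# same sample values as the example data, rebuilt so this block is self-contained
my_data <- data.frame(
  ID       = rep(1:3, each = 3),
  Time     = rep(1:3, times = 3),
  Variable = c(123, 1001, 90, 1111, 222, 2222, 200, 2000, 4000)
)

# dplyr, one step: keep every row of an ID if any of its rows meets the condition
subset_grouped <- my_data %>%
  group_by(ID) %>%
  filter(any(Time == 2 & Variable > 1000)) %>%
  ungroup()

# base R: collect the qualifying IDs first, then keep all rows for those IDs
keep_ids <- my_data$ID[my_data$Time == 2 & my_data$Variable > 1000]
subset_base <- my_data[my_data$ID %in% keep_ids, ]
```

Either result can then be supplied as the `data` argument of the `plm()` call from the question in place of the `subset(...)` expression.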

MLent · November 16, 2018, 7:11am

Thank you very much for your help, this fully solved the problem!
