Panel data cleaning

Cgreen042389 · February 25, 2022, 3:13am

I have a set of panel data in which each observation is assigned a city, and the years run 2006-2020. I would like to remove cities that do not have every year present (for example, a city might have no row at all for 2011; this is not an NA situation). Can someone enlighten me on the easiest way to accomplish this?

FJCC · February 25, 2022, 3:58am

It is hard to be sure without seeing your data. If you can rely on the number of rows to detect whether all years are present, i.e. a city never has a duplicated year, you can do it with something like the code further down. If that does not meet your needs, please post some of your data. You can post the output of the dput() function. If you want to show 30 rows of your data, post the output of

dput(head(DF,30))

where DF is the data frame storing your data. Put a row with three back ticks just before and after the posted data.

Here is the code if the number of rows can be used.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
DF <- data.frame(City=c(rep("Paris",15),rep("Berlin",14)),
                 Year=c(2006:2020,2006:2010,2012:2020))
table(DF$City)
#> 
#> Berlin  Paris 
#>     14     15
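# Keep only cities with exactly 15 rows, i.e. one row per year for 2006-2020.
# This assumes no city has a duplicated year.
DF <- DF |> group_by(City) |> filter(n() == 15) |> ungroup()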
table(DF$City)
#> 
#> Paris 
#>    15

Created on 2022-02-24 by the reprex package (v2.0.1)
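If the row count cannot be trusted, for instance because a city might have a duplicated year that hides a missing one, a variation is to test for the required years directly. The following is only a sketch under assumptions: the column names City and Year follow the example above, the "Rome" rows and the names required_years and complete_DF are made up for illustration, and the native pipe needs R 4.1 or later.

```r
library(dplyr)

# Hypothetical data: Berlin has no 2011 row; Rome has 15 rows,
# but 2010 appears twice and 2011 is absent.
DF <- data.frame(
  City = c(rep("Paris", 15), rep("Berlin", 14), rep("Rome", 15)),
  Year = c(2006:2020,
           2006:2010, 2012:2020,
           2006:2010, 2010, 2012:2020)
)

# The years every city must have (assumed from the question: 2006-2020)
required_years <- 2006:2020

# Keep a city only if every required year occurs at least once in its rows
complete_DF <- DF |>
  group_by(City) |>
  filter(all(required_years %in% Year)) |>
  ungroup()

table(complete_DF$City)
```

With this made-up data a row-count filter would keep Rome, since it has 15 rows, whereas the membership check drops it because 2011 is absent. Note that this check does not remove duplicated years from cities that pass; if that matters, something like distinct(City, Year, .keep_all = TRUE) could be applied first.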
