how to count months observations based on every person?

chiaoyi · March 13, 2020, 5:58pm

I know my question is unclear. So let me state my question clearly. My data set is a panel. I have a lot of person (id). And I also have a month variable (month).

Some people have the complete 1-12 month. However, some people only have part of months. My problem is that I only keep those who have complete 1-12 months.

Tony_Duan · March 13, 2020, 6:26pm


library(dplyr)

your_data %>% group_by(id) %>% summarise(month_count=n_distinct(month)) %>% filter(month_count==12)

andresrcs · March 13, 2020, 6:50pm

Please be aware that a REPRoducible EXample (reprex) is much more useful than a textual description of your problem and it is also a polite thing to do when asking people to help you with coding questions.

chiaoyi · March 13, 2020, 7:01pm

okay, is this the way?

data.frame(
          id = c(1202L,1202L,1202L,1202L,1202L,1202L,
                 1202L,1202L,1202L,1202L,1202L,1202L,1202L,1202L,1202L,
                 1202L,1202L),
   bill.date = as.factor(c("2011-01-01",
                           "2011-02-01","2011-03-04","2011-04-03","2011-05-02",
                           "2011-06-01","2011-07-01","2011-07-31","2011-08-30",
                           "2011-09-30","2011-10-29","2011-11-29","2013-02-01",
                           "2013-03-03","2013-04-01","2013-04-30",
                           "2013-06-02"))
)

chiaoyi · March 13, 2020, 7:03pm

Thanks. I upload my data. And I need to keep who (id) have complete 12 months in 2011. In my data set, each id has observations both in 2011 and in 2013. There's no restriction for year 2013, only strict restrictions for year 2011.

chiaoyi · March 13, 2020, 7:53pm

I did this to create 2 new variables: month and year. And the next thing I need to do is to keep only those id who have 12 months observations in year 2011. In other words, if id only has 11 months in year 2013, that's okay and should be kept as long as this id has 12 months in year 2011.

newdata <- mutate(newdata, month=month(bill.date))
newdata <- mutate(newdata, year=year(bill.date))

nirgrahamuk · March 13, 2020, 11:52pm

newdata <- data.frame(
  id = c(
    1202L, 1202L, 1202L, 1202L, 1202L, 1202L,
    1202L, 1202L, 1202L, 1202L, 1202L, 1202L, 1202L, 1202L, 1202L,
    1202L, 1202L
  ),
  bill.date = as.factor(c(
    "2011-01-01",
    "2011-02-01", "2011-03-04", "2011-04-03", "2011-05-02",
    "2011-06-01", "2011-07-01", "2011-07-31", "2011-08-30",
    "2011-09-30", "2011-10-29", "2011-11-29", "2013-02-01",
    "2013-03-03", "2013-04-01", "2013-04-30",
    "2013-06-02"
  ))
)
library(tidyverse)
library(lubridate)
newdata <- mutate(newdata, month = month(bill.date))
newdata <- mutate(newdata, year = year(bill.date))

ids_with_12_in_2011 <- filter(
  newdata,
  year == 2011
) %>%
  group_by(id) %>%
  count() %>%
  filter(n == 12) %>%
  select(id)

ids_with_12_in_2011_with_data <- left_join(
  ids_with_12_in_2011,
  newdata
)

chiaoyi · March 13, 2020, 11:59pm

Your codes looks wrong. For example, id 1202 doesn't have 2011-12. Thus, 1202 should be dropped.

dromano · March 14, 2020, 12:00am

I think there can be multiple entries per month, which is making it trickier.

nirgrahamuk · March 14, 2020, 12:01am

ok, thanks for pointing it out, I assumed too much originally.
its a small tweak :>


ids_with_12_in_2011 <- filter(
  newdata,
  year == 2011
) %>%
  select(-bill.date) %>%
  distinct %>%
  group_by(id) %>%
  count() %>%
  filter(n == 12) %>%
  select(id)

ids_with_12_in_2011_with_data <- left_join(
  ids_with_12_in_2011,
  newdata
)

chiaoyi · March 14, 2020, 12:02am

You make a good point. But let's assume only one entry per month-year.

dromano · March 14, 2020, 12:03am

In that case, @nirgrahamuk's first solution works already.

chiaoyi · March 14, 2020, 12:05am

why do I still see 1202? id 1202 just have 11 months in year 2011.

nirgrahamuk · March 14, 2020, 12:10am

what do you mean you still see it ?
the content of ids_with_12_in_2011 is now empty as you require it to be given the example data provided...

dromano · March 14, 2020, 12:12am

Your posted data has July 2011 appear twice.

chiaoyi · March 14, 2020, 12:13am

Aha, you're right. I miss this point.

chiaoyi · March 14, 2020, 12:20am

Can you explain your code? I can understand first we filter the data which year == 2011, then what?

nirgrahamuk · March 14, 2020, 12:23am

Here is split out the multipipe statement into single steps for you to analyse:

library(tidyverse)
library(lubridate)
newdata <- mutate(newdata, month = month(bill.date))
newdata <- mutate(newdata, year = year(bill.date))

a <- filter(
  newdata,
  year == 2011
)
a
b<- a%>%
  select(-bill.date)
b
c <- b %>%
  distinct 
c
d <- c %>%
  group_by(id) 
d
e<- d %>%
  count() 
e
f <- e %>%
  filter(n == 12) 
f
g <-f %>%
  select(id)
g

ids_with_12_in_2011_with_data <- left_join(
  g,
  newdata
)

system · March 21, 2020, 12:23am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.