Hi all,
I’m an R newbie working on secondary data and I need help, please. I’m trying to merge 6 subsets, each containing numerous variables, that all cover the same group of individuals. I downloaded the data from Nesstar (a longitudinal study of individuals within families). While merging, I keep only the variables I have chosen, to create a final data frame of around 18k rows and 6 columns, i.e. 6 variables per individual.
Thing is, every subset but one can be merged easily by variable 1 ("ID number"), which identifies each individual I need. The last subset instead covers the whole family unit living in the individual’s household, and every person living in that household is given the ID of the individual, so the individual ID becomes a household ID in this one subset only. As a result, merging by ID greatly increased the number of rows in the data frame (from 18k to 76k).
In the 5 regular subsets it looks like this (n=18k):
Var1 Var2 Var3 Var4
ID1 y 1999 11
ID2 n 1988 3
ID3 y 1977 9
ID4 y 1989 17
Etc etc etc
In the last subset it looks like this (n=76k):
Var1 Var2 Var3 Var4
ID1 y 1999 1
ID1 n 1990 8
ID1 n 1997 11
ID2 n 1989 3
ID2 n 1999 6
ID3 y 1910 9
ID3 y 1954 17
ID3 y 1999 7
ID3 y 1977 17
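For reference, the row increase can be reproduced on toy data shaped like the two tables in this post. This is only a minimal sketch: the data frame names `individuals` and `household` are made up for illustration, and the values are placeholders, not the real Nesstar data.

```r
# Toy data (hypothetical names and values), one row per individual.
individuals <- data.frame(
  Var1 = c("ID1", "ID2", "ID3", "ID4"),
  Var2 = c("y", "n", "y", "y"),
  Var3 = c(1999, 1988, 1977, 1989),
  stringsAsFactors = FALSE
)

# Toy household subset: several rows share the same ID.
household <- data.frame(
  Var1 = c("ID1", "ID1", "ID1", "ID2", "ID2", "ID3", "ID3", "ID3", "ID3"),
  Var4 = c(1, 8, 11, 3, 6, 9, 17, 7, 17),
  stringsAsFactors = FALSE
)

# merge() returns one row per matching pair of rows, so a key that is
# repeated in one table multiplies the rows of the other table.
merged <- merge(individuals, household, by = "Var1")

nrow(individuals)  # 4
nrow(merged)       # 9 (ID4 has no household rows, so the default inner join drops it)

# Count how many household rows share each ID.
table(household$Var1)
```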
The last subset is key for me, because it contains a variable (Var4) that I need in order to select my cases and controls from the final dataset of 18k. If the Var4 value is 17, I need to keep the record as a case; if it is anything else, it is a potential control, as I am doing propensity score matching (PSM). The other individuals in the household are not given an individual ID of their own and are only entered into the data frame via their relation to the main individual. I can’t establish my case sample size without using Var4.
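One possible direction, sketched on toy data, is to collapse the household subset to one row per ID before merging. It rests on an assumption that may not match the study design: that an individual counts as a case when any row for their ID has Var4 equal to 17. The names `individuals`, `household`, `flags`, and `is_case` are hypothetical.

```r
# Toy data (hypothetical names and values).
individuals <- data.frame(
  Var1 = c("ID1", "ID2", "ID3", "ID4"),
  Var2 = c("y", "n", "y", "y"),
  stringsAsFactors = FALSE
)
household <- data.frame(
  Var1 = c("ID1", "ID1", "ID1", "ID2", "ID2", "ID3", "ID3", "ID3", "ID3"),
  Var4 = c(1, 8, 11, 3, 6, 9, 17, 7, 17),
  stringsAsFactors = FALSE
)

# Collapse to one row per ID.
# ASSUMPTION: a case is any ID with at least one household row where Var4 == 17.
flags <- aggregate(Var4 ~ Var1, data = household,
                   FUN = function(x) any(x == 17))
names(flags)[names(flags) == "Var4"] <- "is_case"

# Left join: every individual is kept exactly once;
# IDs with no household rows get NA in is_case.
final <- merge(individuals, flags, by = "Var1", all.x = TRUE)

nrow(final)           # same as nrow(individuals)
table(final$is_case, useNA = "ifany")
```

Because `flags` has exactly one row per ID, the merge can no longer multiply rows. If the case definition should depend on a specific household member rather than on any member, the function passed to `aggregate()` would need to change accordingly.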
I’m at a loss as to how to merge all 6 subsets into my final data frame without losing the data I need for case-control PSM.
Would anyone please advise?