Hi everyone
I have a dataset with a huge amount of columns and would like to replace those with variables that only contain the more frequent values across the original columns in each row.
My example dataset:
df <- data.frame(Name = c("Sophie", "Max", "Sarah"),
GroupID_1 = c(2, 3, 2),
GroupName_1 = c("Lex", "Radon", "Lex"),
GroupTopic_1 = c(1, 2, 1),
GroupID_2 = c(4, 1, 3),
GroupName_2 = c("Lakemore", "Istar", "Radon"),
GroupTopic_2 = c(1, 3, 2),
GroupID_3 = c(5, 2, 6),
GroupName_3 = c("Truke", "Lex", "Suave"),
GroupTopic_3 = c(2, 1, 1),
GroupID_4 = c(7, 8, NA),
GroupName_4 = c("Luva", "Yellow", NA),
GroupTopic_4 = c(2, 3, NA))
Each Group has a Topic attributed to it, coded as a numeric variable. For each person, i only want to keep (in new columns) the values of the Topics that are attributed to more than 30% of the groups a person is part of. For instance, 50% of Max's 4 Groups have Topic "3", so this is the main Topic and should be kept in the new dataset, while the other two groups with their respective Topics should be dropped.
The final dataset should look as follows:
df_new <- data.frame(Name = c("Sophie", "Max", "Sarah"),
MainTopic_1 = c(1, 3, 1),
MainTopic_2 = c(2, NA, 2))
Name MainTopic_1 MainTopic_2
1 Sophie 1 2
2 Max 3 NA
3 Sarah 1 2
As you can see, Sarah has a second main Topic "2" despite only being in one group with this Topic, because she is only part of 3 groups total, so it is over 30%.
I hope you understand my issue and I am thankful in advance to anyone who might be able to help