Hi, thought I would ask for some pointers. I have a question, I have a grouped data set with 7 columns, and in one column of data there is a variable that returns differing values, and I need to find the 5 most common product numbers in that column. I need to split of all seven columns just with those 5 most common product numbers, but the other columns would be in there, just the rows containing non-most common would not be in there.
I then need to get the data organized so that just observations from that set are used.
Thanks in advance.
To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:
In case you prefer to learn how to do this sort of thing by yourself, here is a valuable resource.
From the replies I got I was able to improve my code for top 1 MostCommon. Should be possible for you to change to top N
MostCommon <- function(x) {
ux <- unique(x)
uxnotna <- ux[which(!is.na(ux))]
if(length(uxnotna) > 0) {
tab <- tabulate(match(x, uxnotna))
candidates = uxnotna[tab == max(tab)]
if (class(x)[1] == "logical") {
any(candidates) # return TRUE if any true. max returns an integer
} else {
max(candidates) # return highest (ie max) value
}
} else {
ux # this returns the NA with the right class. ie that of x
}
}
library(tidyverse)
#make up data
set.seed(42)
(exdf <- tibble(
prods=c(sample(letters[1:26],size=100,replace = TRUE),
sample(letters[c(4,5,12,25,26)],size=100,replace = TRUE)),
s1 = sample.int(100,size=200,replace=TRUE),
s2 = sample.int(100,size=200,replace=TRUE),
s3 = sample.int(100,size=200,replace=TRUE),
s4 = sample.int(100,size=200,replace=TRUE)
))
# detect the 5 most common
(top_5_prods_df <- count(exdf,prods) %>% arrange(desc(n)) %>% head(n=5))
# go back to original data and only keep the rows for these 5
(result_df <- filter(exdf,
prods %in% top_5_prods_df$prods))
#check
table(result_df$prods)