Finding the most common occurring character vector (Finding the mode for a non numerical variable)

SMUCH97 · January 11, 2021, 5:35pm

Can anyone help me with finding a way how to calculate the most common occurring value from this example data frame i have created with dput().

structure(list(X = c("Camden", "Bromley", "Croydon", "Bromley",
"Croydon", "Camden"), Y = c("Theft", "Theft", "Theft", "Burglary",
"Theft", "Robbery")), class = "data.frame", row.names = c(NA,
-6L))

In this data frame X represents The location, and Y represents the type of offence corresponding to that location.

My aim is to find out the most common offence for a particular type of location and then present the answer of that in a summary table. How can i go about doing this when I have thousands of data points?

For this example, of course the most common offence in croydon is theft. But how can I use code to find this answer.

p.s. I am a new user to R and have only had experience with tidyverse, dply and ggplot2.

meitei · January 11, 2021, 8:44pm

I used df as name of the dataset.

df=structure(list(X = c("Camden", "Bromley", "Croydon", "Bromley",
                     "Croydon", "Camden"), Y = c("Theft", "Theft", "Theft", "Burglary",
                                                 "Theft", "Robbery")), class = "data.frame", row.names = c(NA,
                                                                                                           -6L))
df_table=table(df)

df1=cbind(rownames(df_table),colnames(df_table)[apply(df_table,1,which.max)])
colnames(df1)=c("x","y")

Here, df1 is the final datasaet.

kaijabean · January 11, 2021, 10:24pm

Here is a way to do it using the tidyverse. No guarantees that this is the best/most efficient way, but it works!

As above, I'm calling your sample data frame df (thanks for providing reproducible sample data, by the way!

# For clarity, rename X and Y to more descriptive names
df <- structure(list(X = c("Camden", "Bromley", "Croydon", "Bromley",
"Croydon", "Camden"), Y = c("Theft", "Theft", "Theft", "Burglary",
"Theft", "Robbery")), class = "data.frame", row.names = c(NA,
-6L)) %>% rename("location" = X, "offense" = Y)

# Calculate the number of times each offense occurred in each location
tallied <- df %>% group_by(location, offense) %>% 
    summarize(frequency = n()) %>% 
    ungroup()

# Filter down to the most common offense for each location
summ <- tallied %>% 
    group_by(location) %>% 
    filter(frequency == max(frequency)) %>%
    ungroup()

You could definitely combine the steps above into a single pipeline if you prefer; I just left them separate so it's a little easier to read.

Note also that my solution above preserves both answers if there are ties. If you want to select just a single offense in the case of a tie, you might try using slice() to take e.g. only the first or only the last row after grouping by location.

I hope that helps!

SMUCH97 · January 16, 2021, 9:37pm

Thank you this was helpful!

system · January 23, 2021, 9:37pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.