I'm very new to the R community and could really use some help. I have a column that contains these unique values:
corn
good corn
bad corn
corn fine
Now I want to find out how many rows contain %corn%, including the ones with a space.
I tried many options but to no avail:
nrow(subset(df, col_name == '\\bcorn\\b'))
and
nrow(subset(df, col_name == '\\<corn\\>'))
They all return zero.
This code returns only 1, which is the first row:
nrow(subset(df, col_name == 'corn'))
How can I make it return the number of all rows that contain 'corn', including the ones with a space? Please let me know if I can provide more information. Thanks
FJCC
June 9, 2022, 5:57pm
Here are two methods. One uses the grepl function from base R and the other uses a function from the stringr package.
DF <- data.frame(Things = c("corn", "barn", "good corn",
"yellow", "corn bad"))
DF
Things
1 corn
2 barn
3 good corn
4 yellow
5 corn bad
#method 1
sum(grepl("corn", DF$Things))
[1] 3
#method 2
library(stringr)
sum(str_detect(DF$Things, "corn"))
[1] 3
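For context on why the original attempts returned zero: `==` tests exact string equality, so a pattern such as '\\bcorn\\b' is compared literally and is never interpreted as a regular expression. Below is a minimal sketch, not part of the original answer, that reuses the sample values from above. The value "popcorn" is a hypothetical addition, included only to show what word boundaries change.

```r
# Sample values from the post above, plus "popcorn" (a hypothetical
# value added here only to illustrate word boundaries).
Things <- c("corn", "barn", "good corn", "yellow", "corn bad", "popcorn")

# == compares whole strings literally; the regex is never interpreted.
exact_count <- sum(Things == "\\bcorn\\b")

# grepl() treats the pattern as a regular expression (substring match).
substring_count <- sum(grepl("corn", Things))

# \\b restricts the match to "corn" as a whole word.
# perl = TRUE selects the PCRE engine for the pattern.
whole_word_count <- sum(grepl("\\bcorn\\b", Things, perl = TRUE))

exact_count
substring_count
whole_word_count
```

With these values the three counts should be 0, 4 and 3: the plain substring match also picks up "popcorn", while the word-boundary pattern does not.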
Thank you so much. You're a lifesaver. I haven't been able to sleep for 2 days.
However, I would like to add that the second method doesn't work for me (I've installed and loaded stringr). Is there any limitation on how to use it?
And also, can we make a table for future use out of the output?
example:
sum(grepl("corn", DF$Things))
sum(grepl("barn", DF$Things))
will return:
corn 3
barn 1
Thank you.
FJCC
June 9, 2022, 6:49pm
You can make a named vector of results like this.
Words <- c("corn", "barn")
Results <- sapply(Words, function(x) sum(grepl(x, DF$Things)))
Results
corn barn
3 1
I can't say why the stringr version of my code is not working for you. Can you make a small example of it not working, similar to what I put in my first post?
Is there any way I can make it into a new table for future reference and plotting? I want to be able to define the x and y axes from the table.
The second method only shows NA as a result. I did exactly what you wrote there:
sum(str_detect(DF$Things, "corn"))
[1] <NA>
I honestly have no idea why
FJCC
June 10, 2022, 5:23am
The str_detect version is returning NA because one of the values in the Things column is NA. str_detect returns NA for a missing value, and sum() of a vector containing NA is NA, whereas grepl returns FALSE for NA, so its sum is unaffected. If you set the na.rm argument of sum() to TRUE, you will get the desired result.
DF <- data.frame(Things = c("corn", "barn", "good corn",
NA, "corn bad"))
sum(grepl("corn", DF$Things))
[1] 3
sum(stringr::str_detect(DF$Things, "corn"))
[1] NA
sum(stringr::str_detect(DF$Things, "corn"), na.rm = TRUE)
[1] 3
To make a data frame of the counts, you could use the data.frame function, though there is no reason you cannot store the vector that the original code produced.
Words <- c("corn", "barn")
Results <- sapply(Words, function(x) sum(grepl(x, DF$Things)))
Results <- data.frame(Words, Results)
Results
Words Results
corn corn 3
barn barn 1
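A variant of the step above, as a sketch rather than a replacement: giving the data frame and its count column different names avoids any ambiguity when the columns are later referenced in a plot. The names `words`, `counts`, `word_counts`, `word` and `n` are hypothetical choices for illustration, and vapply() is swapped in for sapply() only because it checks the type of each result.

```r
# Same sample data as in the post above (including the NA).
DF <- data.frame(Things = c("corn", "barn", "good corn", NA, "corn bad"))

words <- c("corn", "barn")

# vapply() works like sapply() but verifies that every result
# is a single integer.
counts <- vapply(words, function(w) sum(grepl(w, DF$Things)), integer(1))

# Build the data frame from the named vector, dropping the names from
# the count column so the rows get plain 1, 2, ... row names.
word_counts <- data.frame(word = names(counts), n = unname(counts))
word_counts
```

The resulting `word_counts` has one row per search word, with the word in `word` and its count in `n`, which map directly onto the x and y aesthetics of a plot.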
Wow, thanks!!! I do have NA values in my dataset, sorry I didn't mention it in the first place.
I have made a data frame of the counts but unfortunately, when plotting, R doesn't recognize it as a data frame.
Here's how I plot:
ggplot(data = Results) %>%
geom_bar(mapping = aes(x = Words, y = Results))
It returns:
Error in `fortify()`:
! `data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class gg/ggplot.
I'm sure it's a data frame by now, but I wonder why R doesn't recognize it:
[1] "data.frame"
I tried inspecting the structure and everything else, and I'm positive it's a data frame.
FJCC
June 10, 2022, 6:42am
Please post the output of
dput(Results)
If Results is large enough to make that unwieldy, you can post the output of
dput(head(Results, 20))
I would also change the name of the y column so it doesn't match the data frame name. That shouldn't be a problem but it strikes me as dangerous, though I did it myself.
It's late here, so I will not be able to respond for several hours. Someone else will, I hope.
This is what it returns (the data are symptom counts from a disease dataset):
structure(list(name = c("fever", "headache", "muscle pain", "backache",
"lymph nodes", "fatigue", "lesion", "pustule", "blister", "cough",
"rash", "ulcer"), number = c(38L, 5L, 4L, 0L, 2L, 2L, 66L, 3L,
2L, 1L, 15L, 73L)), class = "data.frame", row.names = c(NA, -12L
))
I've changed column names into: name and number.
It's alright. Thank you for the help, you've been so kind
I'm sorry for the confusion, but apparently the cause of the problem above was my plot code.
It should've been:
ggplot(data = Results) +
geom_col(mapping = aes(x = Words, y = Results))
I was not careful.
Thanks for the help. All good now.
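For anyone landing here with the same error, a short sketch of what went wrong, assuming ggplot2 is installed. `%>%` passes its left-hand side as an argument to the call on the right, so the ggplot object was handed to geom_bar() and, because `mapping` was already named, ended up as its `data` argument. That is why fortify() complained about an object of class gg/ggplot. ggplot2 layers are combined with `+` instead. The data frame name `symptoms` is a hypothetical choice, and only a few rows of the dput() output above are used.

```r
library(ggplot2)

# A few rows taken from the dput() output posted above.
symptoms <- data.frame(
  name   = c("fever", "lesion", "rash", "ulcer"),
  number = c(38L, 66L, 15L, 73L)
)

# Layers are added with +, not piped with %>%.
# geom_col() uses the y values as bar heights, whereas geom_bar()
# counts rows by default and does not expect a y aesthetic.
p <- ggplot(data = symptoms) +
  geom_col(mapping = aes(x = name, y = number))

p
```

The aesthetics refer to the column names of the data frame passed as `data`, so they need to be updated whenever the columns are renamed.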