Is there a way to use summarise function in the following way:
df %>% groupby(ColumnA) %>% summarise(count_colB = n()) # count only if ColumnA == "str1"
Is there a way to use summarise function in the following way:
df %>% groupby(ColumnA) %>% summarise(count_colB = n()) # count only if ColumnA == "str1"
What exactly are you trying to do?
From your description above you can filter df
before/after grouping to only include rows with str1
in them and it will produce the result you want. But I assume there is more to your question, so can you please clarify.
Yes, exactly, you have interpreted the situation behind my question right, given what I had explained. Now, in more detail:
df
ColumnA -------- ColumnB
str1 --------------- Employed
str1 --------------- Employed
str1 --------------- Not Employed
str2 --------------- Employed
str2 --------------- Not Employed
str2 --------------- Not Employed
... (40,000
more rows)
Actually what I want is, is this:
after filtering df and grouping by columnA (which has cases like 'str1', 'str2', 'str3', etc), I want R to calculate counts and percentages of "Employed" in ColumnB for every str1, str2, str3, etc.
I don't want to slice data by 'str1', 'str2', 'str3', etc. since it would be very tedious, and I want all the results in same file.
I am new to R so I don't know how to go about it. Would appreciate if you could help.
Thanks!
I am not perfectly sure what you want either, but does this solve your problem?
suppressPackageStartupMessages(library(tidyverse))
# Generate random data to use in this example
set.seed(100)
df <- data.frame(ColumnA = paste0("str", sample(1:3, 15, replace = TRUE)),
ColumnB = sample(c("Employed", "Not Employed"), 15, replace = TRUE),
stringsAsFactors = FALSE)
df # print
#> ColumnA ColumnB
#> 1 str1 Not Employed
#> 2 str1 Employed
#> 3 str2 Employed
#> 4 str1 Employed
#> 5 str2 Not Employed
#> 6 str2 Not Employed
#> 7 str3 Not Employed
#> 8 str2 Not Employed
#> 9 str2 Not Employed
#> 10 str1 Employed
#> 11 str2 Employed
#> 12 str3 Not Employed
#> 13 str1 Not Employed
#> 14 str2 Not Employed
#> 15 str3 Employed
# Using dplyr, if you need to keep the data for further calculations this will be preferred, also calculates the precentage
df %>%
group_by(ColumnA) %>%
summarise(count_employed = sum(ColumnB == "Employed"),
count_not_emplyed = sum(ColumnB == "Not Employed"),
employed_percentage = count_employed / n() * 100)
#> # A tibble: 3 x 4
#> ColumnA count_employed count_not_emplyed employed_percentage
#> <chr> <int> <int> <dbl>
#> 1 str1 3 2 60
#> 2 str2 2 5 28.6
#> 3 str3 1 2 33.3
# Very simple approach using base that does the same if you only need the results dislayed and do not need percentages
table(df)
#> ColumnB
#> ColumnA Employed Not Employed
#> str1 3 2
#> str2 2 5
#> str3 1 2
Thanks for replying.
Actually, ColumnA has names of centers, which I just mentioned as 'str1'...'str10'... Those centers already have a numeric code.. from 1 to 150. Anyways, the question was if I can tabulate the Employment percentage per center through summarize function.
I just realized, it can be done by simple table() and prop() table function also.
Now, the next problem is to save the data in user-friendly way in pdf.
Thanks,
There is this overview by the author of huxtable
package that might be helpful for you.
Most common packages to use are kable
and xtable
. huxtable
can be a bit of overkill for your needs, but it is a nice package that makes reporting fairly easy.