I am trying to create a pipeline using the gss_sm dataset. My code is below:

Load the libraries we will be using


Load the dataset:


Create a pipeline:

rel_by_region <- gss_sm %>%
  group_by(bigregion, religion) %>%
  summarize(N = n()) %>%
  mutate(freq = N / sum(N),
         pct = round((freq*100), 0))

It gave me an error warning:
Warning message:
Factor "religion" contains implicit NA, consider using "forcats::fct_explicit_na"

Do you have any idea how to fix it?

Thanks a lot,

Welcome @Sophialai!

First of all -- thanks for making a reprex -- I was able to copy your example well!

Second of all -- it's worthwhile to note that the message you got was a "warning", and not an "error". The important difference is that your code did "work"! That means that rel_by_region is a data frame that you can use without a problem, so, if you want, you can ignore that warning message.

The message itself discusses the function fct_explicit_na (from the forcats package).

Specifically, the religion variable in your data has 5 levels (Protestant, Catholic, Jewish, None and Other). However, there are 18 rows that have none of those levels -- they are just NA. In certain modeling/plotting functions, this could mean that those rows would be silently dropped or ignored, which may not be what you want.
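To check this kind of thing yourself, you can compare a factor's declared levels with the number of rows that have no level at all. Here is a minimal sketch in base R; the `religion` vector below is a made-up stand-in (an assumption for illustration), not the real gss_sm column:

```r
# Toy stand-in for gss_sm$religion (made-up values, not the real data)
religion <- factor(
  c("Protestant", "Catholic", NA, "None", "Jewish", NA, "Other"),
  levels = c("Protestant", "Catholic", "Jewish", "None", "Other")
)

# The declared levels; NA is a missing value, not one of the levels
religion_levels <- levels(religion)
religion_levels

# Number of rows that carry no level at all
n_missing <- sum(is.na(religion))
n_missing
```

On the real data, the same two calls on gss_sm$religion should show the five levels and the count of NA rows.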

The referenced function turns all those missing values into a new factor level, called "(Missing)" -- so they don't get silently dropped.

library(dplyr, warn.conflicts = FALSE)

gss_sm %>%
  group_by(bigregion, religion) %>%
  summarize(N = n()) %>%
  mutate(freq = N / sum(N),
         pct = round((freq*100), 0))
#> Warning: Factor `religion` contains implicit NA, consider using
#> `forcats::fct_explicit_na`
#> # A tibble: 24 x 5
#> # Groups:   bigregion [4]
#>    bigregion religion       N    freq   pct
#>    <fct>     <fct>      <int>   <dbl> <dbl>
#>  1 Northeast Protestant   158 0.324      32
#>  2 Northeast Catholic     162 0.332      33
#>  3 Northeast Jewish        27 0.0553      6
#>  4 Northeast None         112 0.230      23
#>  5 Northeast Other         28 0.0574      6
#>  6 Northeast <NA>           1 0.00205     0
#>  7 Midwest   Protestant   325 0.468      47
#>  8 Midwest   Catholic     172 0.247      25
#>  9 Midwest   Jewish         3 0.00432     0
#> 10 Midwest   None         157 0.226      23
#> # … with 14 more rows

gss_sm %>% 
  mutate(religion = forcats::fct_explicit_na(religion)) %>%
  group_by(bigregion, religion) %>%
  summarize(N = n()) %>%
  mutate(freq = N / sum(N),
         pct = round((freq*100), 0))
#> # A tibble: 24 x 5
#> # Groups:   bigregion [4]
#>    bigregion religion       N    freq   pct
#>    <fct>     <fct>      <int>   <dbl> <dbl>
#>  1 Northeast Protestant   158 0.324      32
#>  2 Northeast Catholic     162 0.332      33
#>  3 Northeast Jewish        27 0.0553      6
#>  4 Northeast None         112 0.230      23
#>  5 Northeast Other         28 0.0574      6
#>  6 Northeast (Missing)      1 0.00205     0
#>  7 Midwest   Protestant   325 0.468      47
#>  8 Midwest   Catholic     172 0.247      25
#>  9 Midwest   Jewish         3 0.00432     0
#> 10 Midwest   None         157 0.226      23
#> # … with 14 more rows
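One caveat if you are on a recent forcats: as of forcats 1.0.0, fct_explicit_na() is deprecated in favor of fct_na_value_to_level(). Below is a sketch of the same pipeline using the newer function. The `toy` data frame is hypothetical (a tiny stand-in for gss_sm), and the `.groups = "drop_last"` argument only spells out what summarize() already does by default in dplyr 1.0.0 and later:

```r
library(dplyr, warn.conflicts = FALSE)
library(forcats)

# Hypothetical toy data standing in for gss_sm
toy <- tibble(
  bigregion = factor(c("Northeast", "Northeast", "Midwest", "Midwest", "Midwest")),
  religion  = factor(c("Catholic", NA, "Protestant", "Protestant", NA))
)

toy_counts <- toy %>%
  # forcats >= 1.0.0: turn NA values into an explicit "(Missing)" level
  mutate(religion = fct_na_value_to_level(religion, level = "(Missing)")) %>%
  group_by(bigregion, religion) %>%
  # keep the result grouped by bigregion so freq is computed within region
  summarize(N = n(), .groups = "drop_last") %>%
  mutate(freq = N / sum(N),
         pct = round(freq * 100, 0))

toy_counts
```

The missing rows then show up under "(Missing)" in each region instead of as <NA>.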

For a more explicit example of how the two are treated differently:


#> Protestant   Catholic     Jewish       None      Other 
#>       1371        649         51        619        159

#> Protestant   Catholic     Jewish       None      Other  (Missing) 
#>       1371        649         51        619        159         18
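The same contrast can be reproduced on a small toy factor. This is only a sketch, assuming a table()-style count is the kind of call behind the numbers above; the `religion` vector is made up, and it uses base R's `useNA` argument rather than forcats to reveal the missing values:

```r
# Toy factor with two missing values (made-up data, not gss_sm)
religion <- factor(c("Catholic", NA, "Protestant", "Protestant", NA))

# By default, table() leaves NA values out of the counts
implicit <- table(religion)

# Asking for NA explicitly adds a separate <NA> column to the counts
explicit <- table(religion, useNA = "ifany")

implicit
explicit

# The difference between the two totals is the number of dropped rows
sum(explicit) - sum(implicit)
```

Turning NA into a real level such as "(Missing)" has the same effect as `useNA` here, but it carries over to every later function without needing an extra argument each time.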

Thank you so much for the quick response and for walking me through the confusion.
No wonder when I type:
It showed the data frame out of it.

Thanks again and have a great day,

