Merging two sequences of functions into one (%>%)

aabbccwyt · March 1, 2020, 11:27pm

Hi. I was asked to display the five most popular female names and the five most popular male names using top_n() and the R package "babynames". Here is what I have:

babynames %>% 
  filter(sex == "M") %>% 
  select(name, year, n) %>% 
  arrange(desc(n)) %>%
  top_n(5)

babynames %>% 
  filter(sex == "F") %>% 
  select(name, year, n) %>% 
  arrange(desc(n)) %>%
  top_n(5)

I got the outputs that I expected and (hopefully) they are right. However, I was wondering if there was any easier or smarter way to do it since I used two functions that look almost the same. Thank you in advance!

technocrat · March 1, 2020, 11:50pm

Always a good thought! This function is too hardwired to be broadly useful (the data set, the columns, and the number of rows are all fixed), but it shows the way:

suppressPackageStartupMessages(library(dplyr))
library(babynames)
kids <- function(x) {
  babynames %>%
    filter(sex == x) %>%
    select(name, year, n) %>%
    arrange(desc(n)) %>%
    top_n(5)
}
kids("F")
#> Selecting by n
#> # A tibble: 5 x 3
#>   name   year     n
#>   <chr> <dbl> <int>
#> 1 Linda  1947 99686
#> 2 Linda  1948 96209
#> 3 Linda  1949 91016
#> 4 Linda  1950 80432
#> 5 Mary   1921 73982
kids("M")
#> Selecting by n
#> # A tibble: 5 x 3
#>   name     year     n
#>   <chr>   <dbl> <int>
#> 1 James    1947 94756
#> 2 Michael  1957 92695
#> 3 Robert   1947 91642
#> 4 Michael  1956 90620
#> 5 Michael  1958 90520

Created on 2020-03-01 by the reprex package (v0.3.0)
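
One way to loosen the hardwiring, offered as a sketch rather than a definitive version: pass the data, the sex value, and the number of rows as arguments. The name top_names() and its parameters data, sex_value, and k are made up for illustration; they are not part of dplyr or babynames.

```r
library(dplyr, warn.conflicts = FALSE)
library(babynames)

# Hypothetical helper (not from dplyr or babynames): return the k rows with
# the largest n for one sex. Assumes `data` has columns sex, name, year, n
# and no column called sex_value. Ties in n can yield more than k rows.
top_names <- function(data, sex_value, k = 5) {
  data %>%
    filter(sex == sex_value) %>%
    select(sex, name, year, n) %>%
    top_n(k, n) %>%
    arrange(desc(n))
}

# Run the helper once per sex and stack the two results into one tibble
both <- bind_rows(lapply(c("F", "M"), function(s) top_names(babynames, s)))
both
```

Passing n explicitly as the second argument of top_n() also avoids the "Selecting by n" message shown above.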

mfherman · March 2, 2020, 1:23am

You can definitely make it into a function as @technocrat shows, but this seems like a good time to use group_by(). You can run the exact same code you have, but before you take the top 5, first group by sex so you get the top 5 female names and the top 5 male names.

library(dplyr, warn.conflicts = FALSE)
library(babynames)

babynames %>% 
  group_by(sex) %>% 
  top_n(5, n) %>% 
  ungroup() %>% 
  select(sex, name, year, n) %>% 
  arrange(sex, desc(n))
#> # A tibble: 10 x 4
#>    sex   name     year     n
#>    <chr> <chr>   <dbl> <int>
#>  1 F     Linda    1947 99686
#>  2 F     Linda    1948 96209
#>  3 F     Linda    1949 91016
#>  4 F     Linda    1950 80432
#>  5 F     Mary     1921 73982
#>  6 M     James    1947 94756
#>  7 M     Michael  1957 92695
#>  8 M     Robert   1947 91642
#>  9 M     Michael  1956 90620
#> 10 M     Michael  1958 90520
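
For readers on dplyr 1.0.0 or later: top_n() is superseded there by slice_max(). A sketch of the same grouped query with it, assuming that newer dplyr version is installed (the object name top5_by_sex is just illustrative):

```r
library(dplyr, warn.conflicts = FALSE)
library(babynames)

top5_by_sex <- babynames %>%
  group_by(sex) %>%
  # first argument is the column to order by; n = 5 keeps 5 rows per group
  # (ties are kept by default, so a group could return more than 5 rows)
  slice_max(n, n = 5) %>%
  ungroup() %>%
  select(sex, name, year, n) %>%
  arrange(sex, desc(n))

top5_by_sex
```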

One note: the babynames data set has one row per name, sex, and year, with n giving the number of children who received that name that year. So top_n() here returns the name-and-year combinations with the highest counts, not the most popular names overall.

Perhaps you are interested in the 5 most popular names across all the years in the babynames data. To get that, you would do something like this:

babynames %>% 
  group_by(sex, name) %>% 
  summarize(n = sum(n)) %>% 
  top_n(5, n) %>% 
  ungroup() %>% 
  select(sex, name, n) %>% 
  arrange(sex, desc(n))
#> # A tibble: 10 x 3
#>    sex   name            n
#>    <chr> <chr>       <int>
#>  1 F     Mary      4123200
#>  2 F     Elizabeth 1629679
#>  3 F     Patricia  1571692
#>  4 F     Jennifer  1466281
#>  5 F     Linda     1452249
#>  6 M     James     5150472
#>  7 M     John      5115466
#>  8 M     Robert    4814815
#>  9 M     Michael   4350824
#> 10 M     William   4102604
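
A note on why the pipeline above works per sex: after group_by(sex, name), summarize() drops the last grouping level (name), so the result is still grouped by sex when top_n() runs. A sketch that makes that step explicit, using count() with wt = n to total the yearly counts (the object names totals and top5_overall are just illustrative):

```r
library(dplyr, warn.conflicts = FALSE)
library(babynames)

# Total births per sex and name across all years; with wt = n, count() sums
# the n column instead of counting rows. The result is an ungrouped tibble.
totals <- babynames %>%
  count(sex, name, wt = n)

# Group by sex explicitly, then keep the 5 largest totals in each group
top5_overall <- totals %>%
  group_by(sex) %>%
  top_n(5, n) %>%
  ungroup() %>%
  arrange(sex, desc(n))

top5_overall
```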

And maybe you're interested in the top 5 names by sex by decade. For that, you can add a decade term into your group_by().

babynames %>% 
  group_by(sex, decade = year %/% 10 * 10, name) %>% 
  summarize(n = sum(n)) %>% 
  top_n(5, n) %>% 
  ungroup() %>% 
  select(sex, decade, name, n) %>% 
  arrange(sex, decade, desc(n))
#> # A tibble: 140 x 4
#>    sex   decade name           n
#>    <chr>  <dbl> <chr>      <int>
#>  1 F       1880 Mary       91668
#>  2 F       1880 Anna       38159
#>  3 F       1880 Emma       25404
#>  4 F       1880 Elizabeth  25006
#>  5 F       1880 Margaret   21799
#>  6 F       1890 Mary      131136
#>  7 F       1890 Anna       55261
#>  8 F       1890 Margaret   37938
#>  9 F       1890 Helen      37802
#> 10 F       1890 Elizabeth  33879
#> # … with 130 more rows

Created on 2020-03-01 by the reprex package (v0.3.0)
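
In case the decade term looks cryptic: year %/% 10 * 10 uses integer division to drop the last digit of the year and then multiplies by 10 to restore the scale. A small base R sketch with a few hand-picked years (the vector names are illustrative):

```r
# %/% is integer division and binds tighter than *, so this is (years %/% 10) * 10
years <- c(1880, 1889, 1947, 2017)
decades <- years %/% 10 * 10
decades
#> [1] 1880 1880 1940 2010
```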

aabbccwyt · March 2, 2020, 2:13am

Getting the year with the largest number of kids with the same name is actually what is required. The group_by perfectly solved my question and I learned a lot more from what you added at the end! Thank you!

mfherman · March 3, 2020, 2:30am

Hi, just a quick note that I split the ggplot questions into a new topic. We can continue the conversation there!

https://forum.posit.co/t/plotting-babynames-with-geom-col/55193/4