Filtering nested tibbles

CALUM_POLWART · July 22, 2022, 6:40pm

I'm stuck.

I'm pretty good with normal rectangular data. But I think I want to restructure my data to be nested.

My experiments so far are proving useless. Keeping things fairly simple: if I wanted to filter for something inside the nest, is there a tidy way to do it?

Simple dataset:

starwars |>
  group_by(homeworld, species) |>
  nest()

If I want to search inside the "data" column this creates, find any tibble with gender == "masculine", and return it with its homeworld and species, can I?

And what if I want to search inside films for "The Force Awakens"? (So a list within a tibble within a tibble...?)

jrkrideau · July 22, 2022, 7:04pm

I think there is something on this in the tutorial Learn to purrr

I will note that my cat still does not purr.

CALUM_POLWART · July 23, 2022, 12:31am

Yeah, my cat sort of squeals! Using that tutorial I have:

require(tidyverse)
require(gapminder)
gapminder %>%
  group_by(continent) %>%
  nest() %>% # Nest the data by continent
  # Next line is from the tutorial and calculates life expectancy by continent
  mutate(avg_lifeExp = map_dbl(data, ~{mean(.x$lifeExp)})) %>%
  # This is my dreadfully crude filtering method!
  # Test country == "Ireland" in each nest and sum the TRUEs, which gives a
  # one-element list per continent. Unlisting that gives the number of rows
  # where the country is Ireland in each continent.
  mutate(filt = unlist(map(data, ~{sum(.x$country == "Ireland")}))) %>%
  # I can then filter for anything > 0
  filter(filt > 0) %>%
  # And then drop the filter column
  select(-filt)

Results in:

> # A tibble: 1 x 3
> # Groups:   continent [1]
>   continent data               avg_lifeExp
>   <fct>     <list>                   <dbl>
> 1 Europe    <tibble [360 x 5]>        71.9

BUT - it is utterly dreadful code...! It certainly doesn't feel like tidyverse readability. I feel there should be a command like

gapminder %>%
  group_by(continent) %>%
  nest() %>%
  mutate(avg_lifeExp = map_dbl(data, ~{mean(.x$lifeExp)})) %>%
  filter_nest(data, country == "Ireland")

There might be two use cases for filter - one to find a nest that contains the item of interest (like I have done crudely) and one that filters within the nest and returns only Ireland in the nest.

There is a keep function. Maybe it can do this, but I haven't got it to work at all!
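
For what it's worth, both use cases can be sketched with plain dplyr and purrr. This is only a sketch under assumptions: the names `nested`, `has_ireland`, `only_ireland` and `ireland_nests` are made up for illustration, and the nest is ungrouped first to keep things simple.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(gapminder)

# Nest by continent, then drop the grouping so later verbs act on whole columns.
nested <- gapminder %>%
  group_by(continent) %>%
  nest() %>%
  ungroup()

# Use case 1: keep only the outer rows whose nest mentions Ireland.
# map_lgl() returns one TRUE/FALSE per nested tibble, so filter() can use it directly.
has_ireland <- nested %>%
  filter(map_lgl(data, ~ any(.x$country == "Ireland")))

# Use case 2: keep every outer row, but trim each nest to its Ireland rows.
only_ireland <- nested %>%
  mutate(data = map(data, ~ filter(.x, country == "Ireland")))

# keep() works on the list of tibbles itself, not on the outer tibble,
# so it returns the matching nests but loses the continent column.
ireland_nests <- keep(nested$data, ~ any(.x$country == "Ireland"))
```

Compared with the `unlist(map(...))` version, `map_lgl()` removes the need for a temporary count column. `keep()` probably feels awkward here because it filters list elements, while the goal is to filter rows of the outer tibble.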

EconProf · July 23, 2022, 2:33am

Hadley recently retweeted the announcement of the nplyr package for dplyr-like manipulations of nested data frames. Might be worth checking.

CALUM_POLWART · July 23, 2022, 10:13am

@EconProf thank you. I either missed the tweet, or thought "nested data is for fools!" and then 2 weeks later thought "let's nest my data!"

So it seems Hadley is on the ball as usual.

require(tidyverse)
require(gapminder)
gapminder %>% 
  group_by(continent) %>% 
  nest() %>%
  nplyr::nest_filter(data, country == "Ireland")

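(Even the syntax is similar to what I suggested!) This code selects JUST the Ireland data within each nest (i.e. 12 rows in Europe, with the other continents left holding empty tibbles), whereas my horrible code selects the whole nest (i.e. 360 rows in Europe) that contains Ireland. We can keep the whole Europe nest with the following, although as far as I can tell the other continents still stay in the result with empty nests: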

gapminder %>% 
  group_by(continent) %>% 
  nest() %>%
  nplyr::nest_filter(data, any(country == "Ireland"))

Thank you for the pointer to look at nplyr, and thanks to @hadley for tweeting and @markjrieke for making it.
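
Returning to the original starwars question, here is a minimal sketch that avoids any extra package. It assumes the `starwars` data shipped with dplyr; the names `nested_sw`, `masculine` and `in_tfa` are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Nest by homeworld and species, then drop the grouping.
nested_sw <- starwars %>%
  group_by(homeworld, species) %>%
  nest() %>%
  ungroup()

# Nests with at least one character whose gender is "masculine".
# gender contains NA values, hence na.rm = TRUE.
masculine <- nested_sw %>%
  filter(map_lgl(data, ~ any(.x$gender == "masculine", na.rm = TRUE)))

# films is a list-column inside each nest: one character vector per person.
# The inner map_lgl() checks each person, the outer any() checks each nest.
in_tfa <- nested_sw %>%
  filter(map_lgl(data, function(d) {
    any(map_lgl(d$films, function(f) "The Force Awakens" %in% f))
  }))
```

Both results keep the homeworld and species columns next to the matching nests. Named `function()` arguments are used for the films check because nested `~` lambdas would both refer to `.x`.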
