Removing every other row from certain groups in a list of data frames

Goal

I am trying to take my data frame, which has nested data in the sac_EV column, and check the sampling rate variable to determine whether I need to remove every other row from the unnested data. If sampling_rate == 100 I want to leave the group alone. If sampling_rate is 200 I want to drop every other row, essentially making it equivalent in size to the 100-sampling-rate sets. I know this should be super simple, but for the life of me I can't think of a simple function that grabs every other row and that I can cleanly drop into my piping. Any and all help is greatly appreciated.
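To illustrate the core operation on a plain, un-nested data frame (a hypothetical df, outside of any pipe), this is all I'm really after, but I can't see how to apply it conditionally, per group, inside my piping:

# keep rows 1, 3, 5, ... to halve a 200-sampling-rate trace
df_halved <- df[seq(1, nrow(df), by = 2), ]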

Data

I think this URL will automatically download the RDS file from my repository when you put it in your browser:

https://gitlab.com/Bryanrt-geophys/sac2eqtransformr/-/raw/master/sample_data/pre_hdf5.rds

If not, here is a small portion of the data, unnested:

pre_hdf5 <- read_csv(file = "https://gitlab.com/Bryanrt-geophys/sac2eqtransformr/-/raw/master/sample_data/pre_hdf5_resampling_reprex.csv", col_names = T)

Starting Point/REPREX

pre_hdf5_unnested <- pre_hdf5 %>%
  unnest(cols = c(sac_EV))

pre_hdf5_unnested %>%
  group_by(event, reciever_code) %>%
  group_split() %>%
  map(.f = function(x){
    x %>% mutate(
      id = row_number()
    ) %>%
      # this is where I get stuck: an if() block can't be dropped into a pipe
      # like this, and rows_delete() isn't meant for removing every other row
      if(x$sampling_rate == 200){
        if((x$id %% 2) == 0) {
          rows_delete()
        }
      }
  })

Something like this?

df %>%
 filter(row_number() %% 5 == 1)

@williaml I think the issue is that he only wants to select every nth row from the groups that have a sampling rate of 200.
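Something like this might do it in one step (an untested sketch building on that filter() idea, using the column names from the sample data and assuming sampling_rate is numeric):

pre_hdf5_unnested %>%
  group_by(event, reciever_code) %>%
  # keep every row for 100-rate groups, only the odd rows for 200-rate groups
  filter(first(sampling_rate) != 200 | row_number() %% 2 == 1) %>%
  ungroup()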

require(tidyverse)
pre_hdf5 <- read_csv(file = "https://gitlab.com/Bryanrt-geophys/sac2eqtransformr/-/raw/master/sample_data/pre_hdf5_resampling_reprex.csv", col_names = T)

pre_hdf5_unnested <- pre_hdf5 %>%
    unnest(cols = c(sac_EV))

omitEven <- function(x) {
    if (x$sampling_rate[1] == 200) {
        x %>%
            mutate(id = row_number()) %>%
            filter(is.even(id))
    } else {
        x %>%
            mutate(id = row_number())
    }
}

pre_hdf5_unnested %>%
    group_by(event, reciever_code) %>%
    group_map(~ omitEven(.x))

I think this does what he wants. You might need to add %>% bind_rows() at the end to bring the data back together into one tibble, if that is the aim?


This looks very much like what I am shooting for; however, when I run it with this small edit:

pre_hdf5_unnested %>%
  group_by(event, reciever_code) %>%
  group_map(~ omitEven(.x)) %>%
  map(.f = function(x){
    sprintf("%s, %s", nrow(x),
            first(x$sampling_rate))
  })

I expected to see a list of elements with 6001 rows each. However, some are still 12001 despite having a sampling rate of 200. Your function looks like it should have caught these cases though.

The goal is to do a bind_rows() at the end to bring this back into a singular data frame and re-nest the sac_EV column, so good call on that. I think for that part I would use the dplyr::nest_by() function?
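A rough sketch of that re-nesting step might look like this (untested, and assuming sac_EV should hold everything except the two grouping columns; tidyr::nest() is used here instead of dplyr::nest_by()):

# `filtered_combined` stands in for whatever the filtered, bind_rows()-ed tibble ends up being called
pre_hdf5_renested <- filtered_combined %>%
  ungroup() %>%
  # pack everything except the identifying columns back into sac_EV
  nest(sac_EV = -c(event, reciever_code))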

OH! is.even() is not a base R function!

is.even <- function(x) x %% 2 == 0

Will add it!

But x$sampling_rate[1] == 200 is never evaluating... Let me play

If I understand your question, I think the following code will do the filtering:

pre_hdf5_unnested_filtered = pre_hdf5_unnested %>% 
  group_by(event, reciever_code) %>% 
  slice(if(as.integer(sampling_rate[1])==200) seq(1, n(), 2) else 1:n())
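To sanity-check the result, a quick per-group row count (untested sketch, assuming sampling_rate is constant within each group) should show 6001 rows everywhere:

pre_hdf5_unnested_filtered %>%
  summarise(n_rows = n(), sampling_rate = first(sampling_rate))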

@joels' method seems rather neater than mine!

My version has become:

require(tidyverse)
pre_hdf5 <- read_csv(file = "https://gitlab.com/Bryanrt-geophys/sac2eqtransformr/-/raw/master/sample_data/pre_hdf5_resampling_reprex.csv", col_names = T)

pre_hdf5_unnested <- pre_hdf5 %>%
    unnest(cols = c(sac_EV)) %>%
    select(1:10)

is.even <- function(x) x %% 2 == 0

omitEven <- function(x) {
    x %>%
        mutate(
            # id is 0 for every row when sampling_rate is 100 (row_number() %% 1),
            # and 0 only for the even rows when sampling_rate is 200 (row_number() %% 2)
            id = row_number() %% as.integer(first(x$sampling_rate) / 100)
        ) %>%
        filter(id == 0)
}

pre_hdf5_unnested %>%
    group_by(event, reciever_code) %>%
    group_map(~ omitEven(.x)) %>%
    bind_rows()

Thank you kindly @joels, @CALUM_POLWART, and @williaml. I've learned a bunch from each of your inputs!
