summarise/groupby

SimonG · November 17, 2022, 10:14pm

Hi,
I have a df like this
df<-data.frame(SYMBOL= c(rep("RET",4),rep("ROS",5),rep("ALK",3)),
region = c("Promoter1","Promoter2",'intronic',"intronic","NCR","intronic","Promoter1","intronic","NCR","intronic","Promoter1","Promoter2"),
value = sample(x=1:15,size=12))

I want a summarised data frame that gives, for each SYMBOL, the mean value of the Promoter regions (Promoter1 or Promoter2) divided by the mean value of the non-promoter regions.

Something like:

SYMBOL Value
RET X
ROS Y
ALK Z

Thank you

Simon

rene_at_coco · November 18, 2022, 12:11am

Is this what you are asking?

library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.2.2

df<-data.frame(SYMBOL= c(rep("RET",4),rep("ROS",5),rep("ALK",3)),
               region = c("Promoter1","Promoter2",'intronic',"intronic","NCR","intronic","Promoter1","intronic","NCR","intronic","Promoter1","Promoter2"),
               value = sample(x=1:15,size=12))

df %>%
  group_by(
    SYMBOL,
    region
  ) %>%
  summarise(
    mean(value)
  )
#> `summarise()` has grouped output by 'SYMBOL'. You can override using the
#> `.groups` argument.
#> # A tibble: 9 × 3
#> # Groups:   SYMBOL [3]
#>   SYMBOL region    `mean(value)`
#>   <chr>  <chr>             <dbl>
#> 1 ALK    intronic            8  
#> 2 ALK    Promoter1           9  
#> 3 ALK    Promoter2          12  
#> 4 RET    intronic           10  
#> 5 RET    Promoter1           5  
#> 6 RET    Promoter2           3  
#> 7 ROS    intronic            5.5
#> 8 ROS    NCR                 6.5
#> 9 ROS    Promoter1          15

^{Created on 2022-11-17 with reprex v2.0.2}
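The message printed above is informational only: after `summarise()` the result is still grouped by `SYMBOL`. Here is a small sketch of the `.groups` argument that the message mentions, which returns an ungrouped tibble. The column name `mean_value` and the fixed seed are my own additions for illustration and are not in the original reprex.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the sampled values are repeatable
df <- data.frame(
  SYMBOL = c(rep("RET", 4), rep("ROS", 5), rep("ALK", 3)),
  region = c("Promoter1", "Promoter2", "intronic", "intronic",
             "NCR", "intronic", "Promoter1", "intronic", "NCR",
             "intronic", "Promoter1", "Promoter2"),
  value  = sample(x = 1:15, size = 12)
)

# One row per SYMBOL/region pair; .groups = "drop" removes the grouping
# and silences the "has grouped output" message.
by_region <- df %>%
  group_by(SYMBOL, region) %>%
  summarise(mean_value = mean(value), .groups = "drop")

by_region
```

Naming the summary column also avoids the backtick-quoted `mean(value)` header seen in the output above.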

SimonG · November 18, 2022, 4:57pm

Hi,
this is the first step, but the final step that I can't manage is to get, for each SYMBOL, the mean of all Promoter values (Promoter1 and Promoter2) divided by the mean of all non-promoter values (intronic and NCR). The final result would be

SYMBOL RESULTS
ALK mean(Promoter1,Promoter2)/mean(intronic,NCR)

etc...

In reality my df has many more region values, so I need something like grep("Promoter") and -grep("Promoter") to identify the two classes...
Thanks

rene_at_coco · November 18, 2022, 5:37pm

In that case, I think this strategy will work:

library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.2.2

df<-data.frame(SYMBOL= c(rep("RET",4),rep("ROS",5),rep("ALK",3)),
               region = c("Promoter1","Promoter2",'intronic',"intronic","NCR","intronic","Promoter1","intronic","NCR","intronic","Promoter1","Promoter2"),
               value = sample(x=1:15,size=12))

df %>%
  mutate(
    group = case_when(
      str_detect(string = region, pattern = "Promoter") ~ "promoter",
      # str_detect(string = region, pattern = "intronic") ~ "intronic",
      # str_detect(string = region, pattern = "NCR") ~ "ncr",
      TRUE ~ "other"
    )
  ) %>%
  group_by(
    SYMBOL,
    group
  ) %>%
  summarise(
    average = mean(value)
  ) %>%
  ungroup() %>%
  pivot_wider(
    names_from = group,
    values_from = average
  ) %>%
  mutate(
    symbol_mean = promoter / other
  )
#> `summarise()` has grouped output by 'SYMBOL'. You can override using the
#> `.groups` argument.
#> # A tibble: 3 × 4
#>   SYMBOL other promoter symbol_mean
#>   <chr>  <dbl>    <dbl>       <dbl>
#> 1 ALK     6         7.5       1.25 
#> 2 RET    14         2.5       0.179
#> 3 ROS     7.75      1         0.129

^{Created on 2022-11-18 with reprex v2.0.2}
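The same ratio can also be computed in a single `summarise()` call by indexing `value` with a logical vector, which skips the reshape step. This is only a minimal sketch and not part of the reprex above: it uses base `grepl()` in the spirit of the `grep("Promoter")` idea from the question, the names `is_promoter` and `ratio` are made up for illustration, and `set.seed(1)` is there only to make the sampled values repeatable.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the sampled values are repeatable
df <- data.frame(
  SYMBOL = c(rep("RET", 4), rep("ROS", 5), rep("ALK", 3)),
  region = c("Promoter1", "Promoter2", "intronic", "intronic",
             "NCR", "intronic", "Promoter1", "intronic", "NCR",
             "intronic", "Promoter1", "Promoter2"),
  value  = sample(x = 1:15, size = 12)
)

ratios <- df %>%
  # TRUE for any region whose name contains "Promoter", FALSE otherwise
  mutate(is_promoter = grepl("Promoter", region)) %>%
  group_by(SYMBOL) %>%
  summarise(
    # mean of promoter values divided by mean of all other values
    ratio = mean(value[is_promoter]) / mean(value[!is_promoter]),
    .groups = "drop"
  )

ratios
```

One caveat with this version: if a SYMBOL has no promoter rows (or only promoter rows), one of the means is taken over an empty vector and the ratio comes out as `NaN`, whereas the `pivot_wider()` approach would give `NA` for the missing class.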

SimonG · November 18, 2022, 6:42pm

That's it, thanks a lot!
