Select Top 5 values and names per row

LNV1NL · February 3, 2021, 2:34pm

Hi,

I have a huge dataframe with values of several substances at different sampling sites (row = site, col = substances) and I want to create a new dataframe with all sites as rows but just the TOP 5 substances per row (5 highest value per row) and the respective column name (= substance name).
At the moment I have no idea how to start. Could someone give me a tip?

I used max() and function(x) names(x)[which.max(x)])) before to get the highest value, but is there an equivalent to get the TOP 1- 5?

This is how my df basically looks like:

set.seed(123456)
example <- dplyr::data_frame(
    Site = 1:10,
    Substance_A = sample(0:50, 10, rep = TRUE),
    Substance_B = sample(0:50, 10, rep = TRUE),
    Substance_C = sample(0:50, 10, rep = TRUE)
)

And I should look like this...

set.seed(123456)
TOP <- dplyr::data_frame(
    Site = 1:10,
    TOP1 = sample(0:50, 10, rep = TRUE),
    TOP1_Name = sample(c("a", "b","c"), 10, rep = TRUE),
    TOP2 = sample(0:50, 10, rep = TRUE), 
    TOP2_Name = ""
)

I would very much appreciate any tip or comment!

Thanks
Nele

lars · February 3, 2021, 3:46pm

My first approach would be to use the tidyverse package and convert the input "wide" dataset into an intermediate "long" output dataset. The slice_head() can easily be adjusted to the requested top 5 values.

``` r
 library(tidyverse)
  
  set.seed(123456)
  example <- dplyr::data_frame(
    Site = 1:10,
    Substance_A = sample(0:50, 10, rep = TRUE),
    Substance_B = sample(0:50, 10, rep = TRUE),
    Substance_C = sample(0:50, 10, rep = TRUE)
    )
#> Warning: `data_frame()` is deprecated as of tibble 1.1.0.
#> Please use `tibble()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
    
  output_long <- 
    example %>%
    pivot_longer(cols = 2:NCOL(.),
                 names_to = "substance",
                 values_to = "value"
                 ) %>% 
    arrange(Site, desc(value)) %>% 
    group_by(Site) %>% 
    slice_head(n = 2)
  
  output_long
#> # A tibble: 20 x 3
#> # Groups:   Site [10]
#>     Site substance   value
#>    <int> <chr>       <int>
#>  1     1 Substance_B    44
#>  2     1 Substance_A    41
#>  3     2 Substance_A    41
#>  4     2 Substance_B    29
#>  5     3 Substance_A    48
#>  6     3 Substance_B    46
#>  7     4 Substance_C    45
#>  8     4 Substance_B    22
#>  9     5 Substance_C    46
#> 10     5 Substance_A    44
#> 11     6 Substance_A    35
#> 12     6 Substance_B    27
#> 13     7 Substance_A    37
#> 14     7 Substance_B    37
#> 15     8 Substance_C    34
#> 16     8 Substance_A     9
#> 17     9 Substance_B    15
#> 18     9 Substance_C    15
#> 19    10 Substance_C    22
#> 20    10 Substance_B    20

^{Created on 2021-02-03 by the reprex package (v1.0.0)}

The above output dataset would be easier to work with.

However, to give it a try to put it back into a wider format, I've added the top_n and top_substance variables and then needed to pivot_wider() and pivot_longer() a couple of times to get the result you're looking for.

  output_long %>% 
    mutate(top_n = str_c("Top_", row_number()),
           top_substance = str_c(top_n, "_Name")
           )  %>%
    pivot_wider(names_from = top_n,
                values_from = value
                ) %>% 
    pivot_wider(names_from = top_substance,
                values_from = substance
                ) %>% 
    select(Site, sort(names(.))) %>% 
    ## pivot_longer of multiple columns require that all values are of same class()
    mutate(across(.cols = everything(), .fns = as.character)) %>% 
    ## get rid of NA values
    pivot_longer(cols = 2:NCOL(.),
                 names_to = "col_name",
                 values_to = "col_value",
                 values_drop_na = TRUE
                 ) %>% 
    pivot_wider(names_from = col_name,
                values_from = col_value
                )
#> # A tibble: 10 x 5
#> # Groups:   Site [10]
#>     Site Top_1 Top_1_Name  Top_2 Top_2_Name 
#>    <int> <chr> <chr>       <chr> <chr>      
#>  1     1 44    Substance_B 41    Substance_A
#>  2     2 41    Substance_A 29    Substance_B
#>  3     3 48    Substance_A 46    Substance_B
#>  4     4 45    Substance_C 22    Substance_B
#>  5     5 46    Substance_C 44    Substance_A
#>  6     6 35    Substance_A 27    Substance_B
#>  7     7 37    Substance_A 37    Substance_B
#>  8     8 34    Substance_C 9     Substance_A
#>  9     9 15    Substance_B 15    Substance_C
#> 10    10 22    Substance_C 20    Substance_B

If there's an easier / quicker way to achieve the requested results then I'm curious what it might look like

LNV1NL · February 3, 2021, 7:31pm

Thanks! That works perfectly!

system · February 10, 2021, 7:31pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.