Extracting string after last underscore

Dallak · March 6, 2022, 6:44am

Hi all,
I have this dataset and would like to create a new column, label, containing the word that comes after the last underscore and before .wav in the stim column.


  stim                           kurt       kurtval
1 abdul_mohd_1281_3-_su3uud.wav  kurtosis01 131.60382
2 abdul_mohd_1299_3-_su3uud2.wav kurtosis01 151.46565
3 abdul_mohd_1409_f_faatiH.wav   kurtosis01 235.92852
4 abdul_mohd_1435_f_faatiH.wav   kurtosis01 337.57584
5 abdul_mohd_1462_T_t-aamir.wav  kurtosis01  77.71517
6 abdul_mohd_1487_T_t-aamir.wav  kurtosis01 214.47318
7 abdul_mohd_1514_D_d-aabil.wav  kurtosis01  82.94311
8 abdul_mohd_1542_D_d-aabil.wav  kurtosis01 145.74446

This is the data.

data <- structure(list(
  stim = c("abdul_mohd_1281_3-_su3uud.wav", "abdul_mohd_1299_3-_su3uud2.wav",
           "abdul_mohd_1409_f_faatiH.wav", "abdul_mohd_1435_f_faatiH.wav",
           "abdul_mohd_1462_T_t-aamir.wav", "abdul_mohd_1487_T_t-aamir.wav",
           "abdul_mohd_1514_D_d-aabil.wav", "abdul_mohd_1542_D_d-aabil.wav"),
  kurt = c("kurtosis01", "kurtosis01", "kurtosis01", "kurtosis01",
           "kurtosis01", "kurtosis01", "kurtosis01", "kurtosis01"),
  kurtval = c(131.603817955143, 151.465653115077, 235.928519783803,
              337.575842059023, 77.7151703927855, 214.473178497778,
              82.9431075503998, 145.744458586239)
), row.names = c(NA, 8L), class = "data.frame")

The output I am looking for is shown below. Note that the extracted word should be removed from the stim column and moved to the new label column.


  stim                   kurt       kurtval   label
1 abdul_mohd_1281_3-.wav kurtosis01 131.60382 su3uud
2 abdul_mohd_1299_3-.wav kurtosis01 151.46565 su3uud2
3 abdul_mohd_1409_f.wav  kurtosis01 235.92852 faatiH
4 abdul_mohd_1435_f.wav  kurtosis01 337.57584 faatiH
5 abdul_mohd_1462_T.wav  kurtosis01  77.71517 t-aamir
6 abdul_mohd_1487_T.wav  kurtosis01 214.47318 t-aamir
7 abdul_mohd_1514_D.wav  kurtosis01  82.94311 d-aabil
8 abdul_mohd_1542_D.wav  kurtosis01 145.74446 d-aabil

I tried the following, but without success.

data %>% mutate(label = gsub('_[^_]*$', '', stim))

Thank you in advance!

pieterjanvc · March 6, 2022, 1:49pm

Hi,

Here is a way to do that using the extract function and some regex:

data <- structure(list(
  stim = c("abdul_mohd_1281_3-_su3uud.wav", "abdul_mohd_1299_3-_su3uud2.wav",
           "abdul_mohd_1409_f_faatiH.wav", "abdul_mohd_1435_f_faatiH.wav",
           "abdul_mohd_1462_T_t-aamir.wav", "abdul_mohd_1487_T_t-aamir.wav",
           "abdul_mohd_1514_D_d-aabil.wav", "abdul_mohd_1542_D_d-aabil.wav"),
  kurt = c("kurtosis01", "kurtosis01", "kurtosis01", "kurtosis01",
           "kurtosis01", "kurtosis01", "kurtosis01", "kurtosis01"),
  kurtval = c(131.603817955143, 151.465653115077, 235.928519783803,
              337.575842059023, 77.7151703927855, 214.473178497778,
              82.9431075503998, 145.744458586239)
), row.names = c(NA, 8L), class = "data.frame")
library(tidyverse)

data %>% extract(stim, c("stim", "label"), "(.*)_([^_]+).wav")
#>                 stim   label       kurt   kurtval
#> 1 abdul_mohd_1281_3-  su3uud kurtosis01 131.60382
#> 2 abdul_mohd_1299_3- su3uud2 kurtosis01 151.46565
#> 3  abdul_mohd_1409_f  faatiH kurtosis01 235.92852
#> 4  abdul_mohd_1435_f  faatiH kurtosis01 337.57584
#> 5  abdul_mohd_1462_T t-aamir kurtosis01  77.71517
#> 6  abdul_mohd_1487_T t-aamir kurtosis01 214.47318
#> 7  abdul_mohd_1514_D d-aabil kurtosis01  82.94311
#> 8  abdul_mohd_1542_D d-aabil kurtosis01 145.74446

Created on 2022-03-06 by the reprex package (v2.0.1)
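Note that the call above also drops ".wav" from stim, whereas the requested output keeps it, and that an unescaped "." in a regex matches any character. The following is only a sketch of one possible variant in base R, assuming every file name ends in a literal ".wav"; the two-element stim vector and the names pattern, label and stim_new are illustrative and not part of the original code.

```r
# Illustrative subset of the stim values from the question
stim <- c("abdul_mohd_1281_3-_su3uud.wav", "abdul_mohd_1462_T_t-aamir.wav")

# Escaped dot and end anchor, so only a literal ".wav" suffix is matched
pattern <- "^(.*)_([^_]+)\\.wav$"

# Second capture group: the text after the last underscore
label <- sub(pattern, "\\2", stim)

# First capture group with the extension put back on
stim_new <- sub(pattern, "\\1.wav", stim)
```

Strings that do not match the pattern are returned unchanged by sub(), so it may be worth checking for those before trusting the label values.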

Hope this helps,
PJ

andresrcs · March 6, 2022, 1:50pm

This gets you the label column; I don't see the point in modifying the original column, though.

library(tidyverse)

data <- structure(list(
  stim = c("abdul_mohd_1281_3-_su3uud.wav", "abdul_mohd_1299_3-_su3uud2.wav",
           "abdul_mohd_1409_f_faatiH.wav", "abdul_mohd_1435_f_faatiH.wav",
           "abdul_mohd_1462_T_t-aamir.wav", "abdul_mohd_1487_T_t-aamir.wav",
           "abdul_mohd_1514_D_d-aabil.wav", "abdul_mohd_1542_D_d-aabil.wav"),
  kurt = c("kurtosis01", "kurtosis01", "kurtosis01", "kurtosis01",
           "kurtosis01", "kurtosis01", "kurtosis01", "kurtosis01"),
  kurtval = c(131.603817955143, 151.465653115077, 235.928519783803,
              337.575842059023, 77.7151703927855, 214.473178497778,
              82.9431075503998, 145.744458586239)
), row.names = c(NA, 8L), class = "data.frame")

data %>% 
    mutate(label = str_extract(stim, "(?<=_)[^_]+(?=\\.wav$)"))
#>                             stim       kurt   kurtval   label
#> 1  abdul_mohd_1281_3-_su3uud.wav kurtosis01 131.60382  su3uud
#> 2 abdul_mohd_1299_3-_su3uud2.wav kurtosis01 151.46565 su3uud2
#> 3   abdul_mohd_1409_f_faatiH.wav kurtosis01 235.92852  faatiH
#> 4   abdul_mohd_1435_f_faatiH.wav kurtosis01 337.57584  faatiH
#> 5  abdul_mohd_1462_T_t-aamir.wav kurtosis01  77.71517 t-aamir
#> 6  abdul_mohd_1487_T_t-aamir.wav kurtosis01 214.47318 t-aamir
#> 7  abdul_mohd_1514_D_d-aabil.wav kurtosis01  82.94311 d-aabil
#> 8  abdul_mohd_1542_D_d-aabil.wav kurtosis01 145.74446 d-aabil

Created on 2022-03-06 by the reprex package (v2.0.1)
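For readers who want to try the same lookaround pattern without stringr, here is a rough base R sketch. It assumes perl = TRUE is set, since the lookbehind and lookahead are not supported by the default regex engine; the stim vector is an illustrative subset of the data.

```r
# Illustrative subset of the stim values from the question
stim <- c("abdul_mohd_1281_3-_su3uud.wav", "abdul_mohd_1462_T_t-aamir.wav")

# Same pattern as in the str_extract() call: preceded by "_",
# contains no "_", and is followed by ".wav" at the end of the string
pattern <- "(?<=_)[^_]+(?=\\.wav$)"

# regexpr() locates the first match in each string; regmatches() pulls it out
label <- regmatches(stim, regexpr(pattern, stim, perl = TRUE))
```

Unlike str_extract(), which returns NA for a string with no match, regmatches() silently drops non-matching elements, so the result can be shorter than the input.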

Dallak · March 6, 2022, 4:27pm

Thank you both @pieterjanvc and @andresrcs for your help. Both ways work nicely. The reason why I am also modifying the original column is that I want to get rid of the redundancy.
Thank you again!

andresrcs · March 6, 2022, 9:02pm

But that is a file name; if you chop parts out of it, it will lose context and meaning. Since the file name seems to encode data, it might be better to extract all the useful data and then drop the column completely.
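As a rough sketch of that idea in base R, assuming every file name has exactly five underscore-separated fields before ".wav": the column names below (name1, name2, id, code, label) are hypothetical placeholders, since the thread does not say what each field means.

```r
# Illustrative subset of the stim values from the question
stim <- c("abdul_mohd_1281_3-_su3uud.wav", "abdul_mohd_1409_f_faatiH.wav")

# Drop the extension, then split each name on every underscore
parts <- strsplit(sub("\\.wav$", "", stim), "_", fixed = TRUE)

# One row per file name, one column per field
fields <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)

# Hypothetical column names; rename to whatever the fields actually encode
names(fields) <- c("name1", "name2", "id", "code", "label")
```

The fields data frame could then be bound to the remaining columns with cbind() and the original stim column dropped.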

Dallak · March 6, 2022, 9:21pm

I totally agree with you and will follow your suggestion, @andresrcs.
Many thanks!
