match, regular expressions

SimonG · September 15, 2022, 7:55pm

Hi,
I have a list of samples of the form TCGA-CC-A9FW
and another of the form TCGA-CC-A9FW-01.

I would like to match samples that share the same leading identifier, ignoring everything from the third - onward.
I can't work out the regular expression to do it.
Best

Simon

scottyd22 · September 15, 2022, 9:03pm

Assuming the samples all have the same form, is it possible to use substr()? The example below takes a list of samples and creates a sample_group column that is the first 12 characters. Thus, each sample is now assigned to a like group.

library(tidyverse)

df = tibble(
   sample = c('TCGA-CC-A9FW', 
              'TCGA-CC-A9FW-01', 
              'TCGA-CC-A9FW-02',
              'TCGA-CC-A4FW',
              'TCGA-CC-A9FW-03',
              'TCGA-CC-A4FW-01'
              )
   )

df %>%
   mutate(sample_group = substr(sample, 1, 12)) %>%
   arrange(sample_group)
#> # A tibble: 6 × 2
#>   sample          sample_group
#>   <chr>           <chr>       
#> 1 TCGA-CC-A4FW    TCGA-CC-A4FW
#> 2 TCGA-CC-A4FW-01 TCGA-CC-A4FW
#> 3 TCGA-CC-A9FW    TCGA-CC-A9FW
#> 4 TCGA-CC-A9FW-01 TCGA-CC-A9FW
#> 5 TCGA-CC-A9FW-02 TCGA-CC-A9FW
#> 6 TCGA-CC-A9FW-03 TCGA-CC-A9FW

Created on 2022-09-15 with reprex v2.0.2.9000
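
Since substr() relies on every ID having the same fixed-width layout, it may be worth checking that assumption before truncating. The following is only a minimal sketch: the `samples` vector is made up for illustration, and the 4-2-4 character layout is inferred from the example IDs in this thread rather than from any official specification.

```r
# Hypothetical sample IDs, for illustration only
samples <- c("TCGA-CC-A9FW", "TCGA-CC-A9FW-01", "TCGA-CC-A4FW-01")

# Assumed layout: 4 characters, "-", 2 characters, "-", 4 characters,
# optionally followed by "-" and a suffix
fixed_form <- grepl("^[[:alnum:]]{4}-[[:alnum:]]{2}-[[:alnum:]]{4}(-.*)?$", samples)

# Stop early if any ID does not follow the assumed layout
stopifnot(all(fixed_form))

# Safe to take the first 12 characters once the layout is confirmed
sample_group <- substr(samples, 1, 12)
sample_group
```

If the check fails, a regular expression that splits on the hyphens is probably the safer route.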

zivan · September 16, 2022, 4:01pm

This one uses a regular expression and discards everything from the 3rd "-" onward, so it does not depend on the fields having a fixed width.

df <- data.frame(
    sample = c(
        'TCGA-CC-A9FW',
        'TCGA-CC-A9FW-01',
        'TCGA-CC-A9FW-02',
        'TCGA-CC-A4FW',
        'TCGA-CC-A9FW-03',
        'TCGA-CC-A4FW-01'
    )
)

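# Group 1 captures the first three alphanumeric fields; any further
# "-field" parts are matched and dropped. Non-matching strings are left unchanged.
df$sample_group <- gsub("^(([[:alnum:]]+-){2}[[:alnum:]]+)(-[[:alnum:]]+)*$", "\\1", df$sample)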
df
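
To go from a grouping column to actually matching the two lists from the original question, one option is to derive the key from the longer IDs and look it up in the shorter list with match(). This is a sketch under assumptions: `short_ids` and `long_ids` are hypothetical vectors standing in for the two lists, and the pattern simply keeps the first three hyphen-separated fields.

```r
# Hypothetical vectors standing in for the two sample lists
short_ids <- c("TCGA-CC-A9FW", "TCGA-CC-A4FW", "TCGA-DD-A1EB")
long_ids  <- c("TCGA-CC-A9FW-01", "TCGA-CC-A4FW-01",
               "TCGA-CC-A9FW-02", "TCGA-ZZ-0000-01")

# Keep the first three hyphen-separated fields and drop the rest
key <- sub("^([^-]+-[^-]+-[^-]+).*$", "\\1", long_ids)

# Position of each key in short_ids (NA when there is no match)
idx <- match(key, short_ids)

# One row per long ID, with the matching short ID alongside it
matched <- data.frame(long_id = long_ids, key = key, short_id = short_ids[idx])
matched
```

Rows where `short_id` is NA are long IDs with no counterpart in the short list; `key %in% short_ids` gives the same information as a logical vector.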
