SimonG
September 15, 2022, 7:55pm
1
Hi,
I have one list of samples of the form TCGA-CC-A9FW
and another of the form TCGA-CC-A9FW-01.
I would like to match all samples that share the same characters, ignoring everything after the third "-".
I can't work out the regular expression to do it.
Best
Simon
Assuming the samples all have the same form, could you use substr()? The example below takes a list of samples and creates a sample_group column holding the first 12 characters, so each sample is assigned to the group of samples that share that prefix.
library(tidyverse)

df = tibble(
  sample = c('TCGA-CC-A9FW',
             'TCGA-CC-A9FW-01',
             'TCGA-CC-A9FW-02',
             'TCGA-CC-A4FW',
             'TCGA-CC-A9FW-03',
             'TCGA-CC-A4FW-01'
  )
)

df %>%
  mutate(sample_group = substr(sample, 1, 12)) %>%
  arrange(sample_group)
#> # A tibble: 6 × 2
#>   sample          sample_group
#>   <chr>           <chr>
#> 1 TCGA-CC-A4FW    TCGA-CC-A4FW
#> 2 TCGA-CC-A4FW-01 TCGA-CC-A4FW
#> 3 TCGA-CC-A9FW    TCGA-CC-A9FW
#> 4 TCGA-CC-A9FW-01 TCGA-CC-A9FW
#> 5 TCGA-CC-A9FW-02 TCGA-CC-A9FW
#> 6 TCGA-CC-A9FW-03 TCGA-CC-A9FW
Created on 2022-09-15 with reprex v2.0.2.9000
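Since the original question involves two separate lists, the same 12-character key could also be used to look up each long ID in the short list. This is only a minimal base R sketch: the vectors short_ids and long_ids are hypothetical stand-ins for the two lists, and it assumes the first three fields always occupy exactly 12 characters.

```r
# Hypothetical example vectors standing in for the two sample lists.
short_ids <- c("TCGA-CC-A9FW", "TCGA-CC-A4FW")
long_ids  <- c("TCGA-CC-A9FW-01", "TCGA-CC-A4FW-01", "TCGA-DD-A1EB-01")

# Truncate each long ID to its first 12 characters (assumes fixed-width fields).
key <- substr(long_ids, 1, 12)

# match() gives the position of each key in short_ids, or NA when there is none.
idx <- match(key, short_ids)

# Pair every long ID with its short counterpart; unmatched rows get NA.
matched <- data.frame(
  long_id  = long_ids,
  short_id = short_ids[idx]
)
matched
```

Rows where short_id is NA are long IDs with no counterpart in the short list, so they are easy to filter out or inspect.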
zivan
September 16, 2022, 4:01pm
3
This one uses a regular expression and discards everything after the third "-".
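df <- data.frame(
  sample = c(
    'TCGA-CC-A9FW',
    'TCGA-CC-A9FW-01',
    'TCGA-CC-A9FW-02',
    'TCGA-CC-A4FW',
    'TCGA-CC-A9FW-03',
    'TCGA-CC-A4FW-01'
  )
)

# Group 1 captures the first three "-"-separated alphanumeric fields;
# any further "-field" parts are matched and dropped by the replacement.
df$sample_group <- gsub("^(([[:alnum:]]+-){2}[[:alnum:]]+)(-[[:alnum:]]+)*$", "\\1", df$sample)
df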
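If the fields may contain characters other than letters and digits, a looser pattern might be easier to read. The sketch below is an alternative, not the pattern above: it treats a field as "anything that is not a hyphen" and uses sub(), since only one replacement per string is needed. The vector name samples is illustrative.

```r
# Illustrative vector; any character vector of sample IDs would do.
samples <- c("TCGA-CC-A9FW", "TCGA-CC-A9FW-01", "TCGA-CC-A4FW-01")

# Capture the first three "-"-separated fields, then drop whatever follows.
# [^-]+ means "one or more characters that are not a hyphen".
prefix <- sub("^([^-]+-[^-]+-[^-]+).*$", "\\1", samples)
prefix

# Samples that share a prefix with the first one.
samples[prefix == prefix[1]]
```

Strings with fewer than three fields do not match the pattern and are returned unchanged, which may or may not be what you want.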