How can I count the number of times a string appears in column X, based on column Y ?

cwright1 · August 5, 2020, 8:02pm

I have the dataframe df which is contains 3 columns, like so :

cluster_testset and clusters_trueset are both character columns.

My question : How many times does one of the entries within clusters_testset match one of the entries within clusters_trueset (by row) ?

The answer for 'blood' should be 4, because there are 4 matches. (bone =1, central_nervous_system=1, skin=1, soft_tissue=2)

FJCC · August 5, 2020, 8:32pm

Here is one method.

DF <- data.frame(lineage = c("blood", "bone", "central", "skin", "soft"),
                 clusters_testset = c("0 0 6 65 73 41", "42", "90 53", "1", "64 68"),
                 clusters_trueset = c("43 35 6 65 73 41", "42", "53 7 60", "1", "64 68"))
library(stringr)
testset <- str_split(DF$clusters_testset, pattern = " ")
trueset <- str_split(DF$clusters_trueset, pattern = " ")

library(purrr)
map2_dbl(testset, trueset, function(x, y) sum(x %in% y))
#> [1] 4 1 1 1 2

^{Created on 2020-08-05 by the reprex package (v0.2.1)}

technocrat · August 5, 2020, 8:44pm

suppressPackageStartupMessages(library(dplyr))
col1 <- ("0 0 6 65 73 41")
col2 <- ("43 35 6 65 73 41")
stringr::str_split(col1," ") %>% unlist() -> vec1
stringr::str_split(col2," ") %>% unlist() -> vec2
sum(vec2 %in% vec1)
#> [1] 4

^{Created on 2020-08-05 by the reprex package (v0.3.0)}

cwright1 · August 5, 2020, 9:05pm

This is so simple and elegant thank you! How can I modify it to say "is in x but not in y? I tried this :

map2_dbl(testset, trueset(x, y) sum(x !%in% y)

but got the unexpted '!' error

technocrat · August 5, 2020, 9:13pm

suppressPackageStartupMessages(library(dplyr))
col1 <- ("0 0 6 65 73 41")
col2 <- ("43 35 6 65 73 41")
stringr::str_split(col1," ") %>% unlist() -> vec1
stringr::str_split(col2," ") %>% unlist() -> vec2
sum(vec2 %in% vec1)
#> [1] 4
sum(!vec2 %in% vec1)
#> [1] 2

^{Created on 2020-08-05 by the reprex package (v0.3.0)}

cwright1 · August 5, 2020, 9:26pm

This method doesn't seem to be able to allow adding a column to the original dataframe, right?

For example if df contains col1 and col2:
df$col1 <- ("0 0 6 65 73 41")
df$col2 <- ("43 35 6 65 73 41")

how can I add a column that does tsum(vec2 %in% vec1) for each row? When I try I just get a single number for all entries, trying something like:

df$result <- sum(vec2 %in% vec1)

FJCC · August 5, 2020, 10:01pm

The code I originally posted could be modified to end with

DF$result <- map2_dbl(testset, trueset, function(x, y) sum(x %in% y))

technocrat · August 5, 2020, 10:08pm

No, code was just to illustrate the underlying process. In practice, you'd be doing this with a function and mutate most easily.

suppressPackageStartupMessages(library(dplyr))
col1 <- ("0 0 6 65 73 41")
col2 <- ("43 35 6 65 73 41")
dat <- as.data.frame(cbind(col1,col2))
matcher <- function(x,y) {
  stringr::str_split(x," ") %>% unlist() %>% as.numeric() -> a
  stringr::str_split(y," ") %>% unlist() %>% as.numeric() -> b
  sum(b %in% a)
}
dat %>% mutate(matches = matcher(col1,col2))
#>             col1             col2 matches
#> 1 0 0 6 65 73 41 43 35 6 65 73 41       4

^{Created on 2020-08-05 by the reprex package (v0.3.0)}

cwright1 · August 6, 2020, 3:01pm

Here is where I'm at - starting with progress from your example:

#Dataframe of lineage, testset, and truest
DF <- data.frame(lineage = c("blood", "bone", "central", "skin", "soft"),
                 clusters_testset = c("0 0 6 65 73 41", "42", "90 53", "1", "65 68"),
                 clusters_trueset = c("43 35 6 65 73 41", "42", "53 7 60", "73", "60 68"))

#str_split on both columns
library(stringr)
testset <- str_split(DF$clusters_testset, pattern = " ")
trueset <- str_split(DF$clusters_trueset, pattern = " ")

Ultimate goal: Make and attach to DF the following 4 columns:
(Using blood lineage as an example, but I need to do this for each row)

a_result = # of blood clusters in testset that DO match with blood clusters in trueset
b_result = # of blood clusters in testset that DO NOT match with blood clusters in trueset
c_result = # of any NON-blood clusters of testset that DO match with blood clusters in trueset
d_result = # any NON-blood clusters that DO NOT match with blood clusters in trueset (which will be the biggest number)

First manually find the correct answers to later check against:

DF
  lineage clusters_testset clusters_trueset
1   blood   0 0 6 65 73 41 43 35 6 65 73 41
2    bone               42               42
3 central            90 53          53 7 60
4    skin                1               73
5    soft            65 68            60 68

Again, using just blood as the example:
a_result = 4 (4 blood in testset match with trueset)
b_result = 2 (2 testset blood clusters do no match anywhere with blood trueset)
c_result = 1 (Only 1 non-blood cluster in the testset column matches with one of the blood trueset clusters [it is skin, cluster 65])
d_result = 5 (There are 5 non-blood clusters in testset that do not match with blood clusters of trueset)

Approach so far:
a_result:

DF$a_result <- map2_dbl(testset, trueset, function(x, y) sum(x %in% y))

b_result:
According to this R-bloggers post you can define a %notin% operator to negate %in%:

`%notin%` <- Negate(`%in%`)
DF$b_result <- map2_dbl(testset, trueset, function(x, y) sum(x %notin% y))

So far, this works for a_result and b_result:

DF
  lineage clusters_testset clusters_trueset a_result b_result
1   blood   0 0 6 65 73 41 43 35 6 65 73 41        4        2
2    bone               42               42        1        0
3 central            90 53          53 7 60        1        1
4    skin                1               73        0        1
5    soft            65 68            60 68        1        1

I can't seem to make anything work for c_result, d_result : any ideas?
Thanks for your help!

siddharthprabhu · August 6, 2020, 3:48pm

@cwright1 I'm sorry to barge in on this thread but I think you should see my response in the first thread you posted. That I believe is a much simpler way to do what you are looking for.

technocrat · August 6, 2020, 6:09pm

Which are the blood and non-blood clusters? Are they the complete universe you are dealing with?

cwright1 · August 6, 2020, 6:54pm

the rows in DF$lineage that say "blood" or any rows which do not say "blood", respectively.

cwright1 · August 6, 2020, 6:58pm

you're not barging in - i'm very thankful for your help! I'll reply back over there with how I tried applying your example to my data.

technocrat · August 6, 2020, 8:13pm

Not following. Please enumerate.

cwright1 · August 6, 2020, 8:19pm

@technocrat This is the example I'm trying to complete. I'm trying to find some method to get a/b/c/d_result columns, as defined above.

Here the lineage column has 5 unique entries. This is what I was referring to by "blood" and "non-blood", where "non-blood" in this case is bone, central, skin, and soft.

Example : The non-blood clusters of testset = 42, 90, 53, 1, 65, 68

If you could think of any way to complete the remaining two I'm having difficulty with (c_result and d_result) I'd be super grateful! Thank you! got part of the way there from response by you and @FJCC

technocrat · August 6, 2020, 9:58pm

suppressPackageStartupMessages({
  library(dplyr)
  library(purrr)
  library(stringr)
})

`%notin%` <- Negate(`%in%`)

# non-blood
nb <- c(42,90,53,1,65,68)

#Dataframe of lineage, testset, and truest
DF <- data.frame(lineage = c("blood", "bone", "central", "skin", "soft"),
 clusters_testset = c("0 0 6 65 73 41", "42", "90 53", "1", "65 68"),
 clusters_trueset = c("43 35 6 65 73 41", "42", "53 7 60", "73", "60 68"))

#str_split on both columns
testset <- str_split(DF$clusters_testset, pattern = " ")
trueset <- str_split(DF$clusters_trueset, pattern = " ")

DF$a_result <- map2_dbl(testset, trueset, function(x, y) sum(x %in% y))
DF$b_result <- map2_dbl(testset, trueset, function(x, y) sum(x %notin% y))
DF$c_result <- map2_dbl(testset, trueset, function(x,y) sum(x %in% nb))
DF$d_result <- map2_dbl(testset, trueset, function(x,y) sum(x %notin% nb))
DF
#>   lineage clusters_testset clusters_trueset a_result b_result c_result d_result
#> 1   blood   0 0 6 65 73 41 43 35 6 65 73 41        4        2        1        5
#> 2    bone               42               42        1        0        1        0
#> 3 central            90 53          53 7 60        1        1        2        0
#> 4    skin                1               73        0        1        1        0
#> 5    soft            65 68            60 68        1        1        2        0

^{Created on 2020-08-06 by the reprex package (v0.3.0)}

cwright1 · August 6, 2020, 10:17pm

This is exactly what I'm trying to accomplish just without having explicitly pre-define the nb numeric. This example works beautifully for DF$lineage=="blood", but I want it to work for every row.

so a_result and b_result are perfect, they match the definitions of what I want them to be for each row. In this specific case, c_result and d_result are correct for "blood" only, but I want them to be correct for every row, just like a_result and b_result.

The c_result for bone should be 0. "How many non-bone clusters of testset (aka all testet clusters of blood, central, skin, and soft) match with the bone cluster of trueset (which is only the cluster 42). None of them do, answer = 0.

I wonder if there's some way to automate this for each row.....none-the-less, you've given me a different way of thinking about it. I'm still trying to find the solution - you are helping greatly!

technocrat · August 6, 2020, 10:27pm

The problem statement is being dribbled out. Always more helpful to be sure of what before moving on to how.

For this more general case, you should prepare vectors like nb for each lineage, then use a case_when in the mutate to match the lineage column with the appropriate %in% or %notin% test.

dhanushx012 · August 7, 2020, 3:36pm

the rows in DF$lineage that say "blood" or any rows which do not say "blood", respectively.

cwright1 · August 7, 2020, 7:19pm

Hi @technocrat

I've been reading over examples and trying to accomplish this. I want to use your advice...finding some way to define the non-x lineages but for each row:

For blood : what are all the trueset clusters of any lineage that is not blood
For bone : what are all the trueset clusters of any lineage that is is not bone
etc...

Do you have one example of how this can be executed? I was trying to use a for() loop but it will only define the first row (aka the non-blood clusters of trueset)

many many thanks