how to do iterative match over dataframe rows in R?

genetist · August 4, 2021, 11:02am

Hi to all
i have data like this

df <- read.delim(sep="|", text=c("
M1 M2
A/A G/G
T/T NA
NA C/C
A/A G/G
A/A A/A
"))

now I would like to match row1 of M1 column with itself and with other rows of the same column (iterative matching) if it matches then 0 otherwise 1 but any row of M1 column has NA I do not want include that row in matches and I do not match this row with other rows of M1 column and I would like to repeat this with M2 column too and finally I will sum column at bottom (6 for M1 and 10 for M2). any help in this regard is highly appreciated
finally my results looks like this

M1 M2
0 0
1 0
0 1
0 0
0 1
1 0
0 0
0 0
1 0
1 0
0 1
0 0
0 0
0 1
0 1
0 0
1 0
0 1
0 0
0 1
0 1
1 0
0 1
0 1
0 0
6 10
Thanks in advance
Genetist

pieterjanvc · August 5, 2021, 11:55am

Hi,

Welcome to the RStudio community!

Your data and explanation are a bit confusing. You show a very small example of input and output, but then you start talking about B3 - B5 which is mentioned nowhere in the data.

I suggest you create a reprex so we can try and help you out. A reprex consists of the minimal code and data needed to recreate the issue/question you're having. You can find instructions how to build and share one here:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

Good luck,
PJ

DavoWW · August 6, 2021, 8:42am

Hi @genetist,
I suggest you first split your columns into "first_letter" and "second_letter", and then do the comparisons to see if they are the same:

# Add to your data to show various extra data combinations
df <- read.delim(sep="|", text=c("
|IND|M1|M2|
|1|A/A|G/G|
|2|T/T|NA|
|3|NA|C/C|
|4|T/C|A/T|
|5|G/T|A/A|
"))

df
#>    X IND   M1   M2 X.1
#> 1 NA   1  A/A  G/G  NA
#> 2 NA   2  T/T <NA>  NA
#> 3 NA   3 <NA>  C/C  NA
#> 4 NA   4  T/C  A/T  NA
#> 5 NA   5  G/T  A/A  NA

suppressPackageStartupMessages({
  library(tidyverse)
})

df %>% 
  select(!c(1,5)) %>% 
  separate(col=M1, sep="/", into=c("M1_1", "M1_2")) %>% 
  separate(col=M2, sep="/", into=c("M2_1", "M2_2")) %>%
  mutate(m1 = ifelse(M1_1 == M1_2, 1, 0),
         m2 = ifelse(M2_1 == M2_2, 1, 0)) -> out.df
out.df
#>   IND M1_1 M1_2 M2_1 M2_2 m1 m2
#> 1   1    A    A    G    G  1  1
#> 2   2    T    T <NA> <NA>  1 NA
#> 3   3 <NA> <NA>    C    C NA  1
#> 4   4    T    C    A    T  0  0
#> 5   5    G    T    A    A  0  1

out.df %>% 
  select(!c(2,3,4,5))
#>   IND m1 m2
#> 1   1  1  1
#> 2   2  1 NA
#> 3   3 NA  1
#> 4   4  0  0
#> 5   5  0  1

^{Created on 2021-08-06 by the reprex package (v2.0.0)}

I hope this is what you wanted since your description was unclear as @pieterjanvc rightly pointed out.

genetist · August 6, 2021, 10:28am

Dear PJ
Good Afternoon
Sorry my bad post and thanks for your valuable time, Now I edited it I hope It will be better now. I would like to upload my out file being a new member i do not have permission to do that.
Thanking you very much

genetist · August 6, 2021, 10:30am

Dear DavoWW
Good Afternoon
Thanks for your kind reply and for your valuable time to help me. Now i edited my post hope it will be better now. sorry for my bad post

nirgrahamuk · August 6, 2021, 12:15pm

I don't believe you can omit the NA's on a per column basis and still end up with a rectangular data.frame necessarily. So i've had to keep them in.

df <- read.delim(sep=" ", text=c("
M1 M2
A/A G/G
T/T NA
NA C/C
A/A G/G
A/A A/A
"))

names(df)

library(tidyverse)

process_a_column <- function(x){
  expand.grid(x,x) %>% 
    mutate(check = as.integer(Var1 != Var2))  %>%
    pull(check) 
}

base_df <- map_dfc(names(df),~process_a_column(df[[.]])) %>% set_names(names(df))
(df_with_sum <- bind_rows(base_df,summarise_all(base_df,sum,na.rm=TRUE)) %>% as.data.frame)

genetist · August 9, 2021, 4:59am

Dear nirgrahamuk
Good Morning
Thank you very much for reply. No problem even if we keep NA in final data, i am much interested in sum row i.e. 26th Row in base_df output and your code working like i expected
thanking you very much for your valuable time spent to help on my problem
Regards

system · August 25, 2021, 11:04am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.