replace values in a dataframe based on a match in another dataframe

MN_AGRICO · April 10, 2023, 8:36pm

I have two dataframes, df_A and df_B. Below are two smaller examples of the two. The amount of 'Am', 'Pi' and 'Wr' columns is variable per analysis-set. (These are called the 'Markers').

I want the Marker values from df_A replaced by zero if: the plate_Number of column df_A$Plate_Number is present in the respective Marker column of df_B.
If the plate_Number in df_A for a given AM, Pi or Wr column is NA in df_B, the value of df_A should be unchanged.

The columns with the (variable) Am, Pi and Wr values can be referred to in the script using a list.
Markers <- c(Am01, Am02d, Am03c, Am04b, Am05, Mc01, Pi02, Pi03b, Pi04, Pi08c, Pi10b, Pi12, Pi14, Wr02)

Any help would be very appreciated!

>df_A

    row plate Am01 Am02d Am03c Am04b Am05 Mc01 Pi02 Pi03b Pi04 Pi08c Pi10b Pi12 Pi14 Wr02 plate_Number
1     A     1   12     0     0     0    0    0    0     0    0    12     0    0    0    0       Plate1
102   B     1   12     0     0     0    0    0    0     0    0    12     0    0    0    0       Plate1
203   C     1   12     0     0     0    0    0    0     0    0    12     0    0    0    0       Plate1
304   D     1   12     0     0     0    0    0    0     0    0     1     0    0    0    0       Plate1
405   E     1   12     0     0     0    0    0    0     0    0     0     0    0    0    0       Plate1
506   F     1   12     0     0     0    0    0    0     0    0     0     0    0    0    0       Plate1
607   G     1   12     0     0     0    0    0    0     3    0     0     0    0    0    0       Plate1
708   H     1   12     0     0     0    0    0    0    12    0     0     0    0    0    0       Plate1
14    A     2   12     0     0     0    0    0    0    12    0     0     0    0    0    0       Plate2
115   B     2   12     0     0     0    0    0    0    12    0     0     0    0    0    0       Plate2
216   C     2   12     0     0     0    0    0    0    12    0     0     0    0    0    0       Plate2
317   D     2   12     0     0     0    0    0    0    12    0     0     0    0    0    0       Plate2
418   E     2   12     0     0     0    0    0    0    12    0     0     6    0    0    0       Plate2
519   F     2   12    10     0     0    0    0    0    12    0     0     2    0    0    0       Plate2
620   G     2   12     5     0     0    0    0    0    12    0     0     0    0    0    0       Plate2
721   H     2   12     0     0     0    0    4    0     8    0     0     0    0    0    0       Plate2
25    A     3   12     0     0     0    0   12    0     0    0     0     0    0    0    0       Plate3
126   B     3    5     7     0     0    0    5    0     0    0     0     0    0    0    0       Plate3
227   C     3    0    12     0     0    0    0    0     0    0     0     0    0    0    0       Plate3
328   D     3    6     6     0     0    0    0    0     0    0     0     0    0    0    0       Plate3

>df_B

   plateNumber    Am01   Am02d Am03c   Am04b Am05 Mc01 Pi02  Pi03b Pi04 Pi08c  Pi10b   Pi12   Pi14 Wr02
1       Plate1  Plate1    <NA>  <NA>    <NA> <NA> <NA> <NA>   <NA> <NA>  <NA>   <NA>   <NA>   <NA> <NA>
2       Plate2  Plate2    <NA>  <NA>    <NA> <NA> <NA> <NA> Plate2 <NA>  <NA>   <NA>   <NA>   <NA> <NA>
3       Plate3    <NA>    <NA>  <NA>    <NA> <NA> <NA> <NA>   <NA> <NA>  <NA>   <NA>   <NA>   <NA> <NA>
4       Plate4  Plate4    <NA>  <NA>    <NA> <NA> <NA> <NA>   <NA> <NA>  <NA>   <NA>   <NA>   <NA> <NA>
5       Plate5    <NA>    <NA>  <NA>    <NA> <NA> <NA> <NA>   <NA> <NA>  <NA>   <NA>   <NA>   <NA> <NA>
6       Plate6    <NA>    <NA>  <NA>    <NA> <NA> <NA> <NA>   <NA> <NA>  <NA>   <NA> Plate6   <NA> <NA>
7       Plate7    <NA>    <NA>  <NA>    <NA> <NA> <NA> <NA>   <NA> <NA>  <NA> Plate7   <NA>   <NA> <NA>
8       Plate8    <NA>    <NA>  <NA>    <NA> <NA> <NA> <NA>   <NA> <NA>  <NA> Plate8   <NA>   <NA> <NA>
9       Plate9    <NA>    <NA>  <NA>    <NA> <NA> <NA> <NA>   <NA> <NA>  <NA>   <NA>   <NA> Plate9 <NA>
10     Plate10    <NA>    <NA>  <NA>    <NA> <NA> <NA> <NA>   <NA> <NA>  <NA>   <NA>   <NA>   <NA> <NA>
11     Plate11 Plate11 Plate11  <NA>    <NA> <NA> <NA> <NA>   <NA> <NA>  <NA>   <NA>   <NA>   <NA> <NA>
12     Plate12 Plate12 Plate12  <NA>    <NA> <NA> <NA> <NA>   <NA> <NA>  <NA>   <NA>   <NA>   <NA> <NA>
13     Plate13 Plate13 Plate13  <NA>    <NA> <NA> <NA> <NA>   <NA> <NA>  <NA>   <NA>   <NA>   <NA> <NA>
14     Plate14 Plate14    <NA>  <NA>    <NA> <NA> <NA> <NA>   <NA> <NA>  <NA>   <NA>   <NA>   <NA> <NA>
15     Plate15 Plate15    <NA>  <NA> Plate15 <NA> <NA> <NA>   <NA> <NA>  <NA>   <NA>   <NA>   <NA> <NA>
16     Plate16    <NA>    <NA>  <NA>    <NA> <NA> <NA> <NA>   <NA> <NA>  <NA>   <NA>   <NA>   <NA> <NA>
17     Plate17 Plate17 Plate17  <NA>    <NA> <NA> <NA> <NA>   <NA> <NA>  <NA>   <NA>   <NA>   <NA> <NA>
18     Plate18    <NA> Plate18  <NA>    <NA> <NA> <NA> <NA>   <NA> <NA>  <NA>   <NA>   <NA>   <NA> <NA>
19     Plate19    <NA> Plate19  <NA>    <NA> <NA> <NA> <NA>   <NA> <NA>  <NA>   <NA>   <NA>   <NA> <NA>
20     Plate20    <NA> Plate20  <NA>    <NA> <NA> <NA> <NA>   <NA> <NA>  <NA>   <NA>   <NA>   <NA> <NA>

FJCC · April 11, 2023, 6:08am

I think I would do this by pivoting both data frames to a long form (pivot_longer() from tidyr) with one column showing the marker and another the value. I would then join the two by the plate number and use ifelse() to change the value from df_A if the value from df_B is not . You can then pivot wider to the original shape if keep that is important.
Please post the output of dput(df_A) and dput(df_B) if you need more help with the code.

MN_AGRICO · April 11, 2023, 5:26pm

Dear FJCC,

Thanx a lot for your suggestions. My knowledge in R does not reach the level to be able to write the code from your suggestions.

I created the dput files of df_A and df_B, but they can not be uploaded in this topic...

However, the example dataframes I uploaded above, include all the columns of the actual df's (but the actual df's are much longer).

(note that the amount that the amount of marker_columns may vary per experiment, but that there is always a marker list available if needed for the code (see markerlist in the initial topic above).

I also included a scheme of what to match and what to return for the both ifelse situations.

If for row one of df_A, the plateNumber value "Plate1" is present in a given marker in df_B (in this example marker Am01), it should return "0".

If for example in row 25 in df_A the "Plate1" value is NOT present for Am01 in df_B, it should be kept "12" in df_A.

Please note that Plate1, Plate2 etc are unique values in df_B, but are NOT unique values in df_A. Indeed, it is important to keep the original shape of df_A.

Thanks a lot!!! If you need more info just let me know.

FJCC · April 11, 2023, 6:07pm

This is what I was thinking of. I put in several head() steps only to illustrate the process.

library(tidyr)
library(dplyr)

df_A <- read.csv("~/R/Play/df_A.csv")
df_B <- read.csv("~/R/Play/df_B.csv")
A_long <- pivot_longer(data = df_A, cols = -c("row", "plate", "plate_Number"), 
                       names_to = "marker", values_to = "Value")
head(A_long)
#> # A tibble: 6 × 5
#>   row   plate plate_Number marker Value
#>   <chr> <int> <chr>        <chr>  <int>
#> 1 A         1 Plate1       Am01      12
#> 2 A         1 Plate1       Am02d      0
#> 3 A         1 Plate1       Am03c      0
#> 4 A         1 Plate1       Am04b      0
#> 5 A         1 Plate1       Am05       0
#> 6 A         1 Plate1       Mc01       0
B_long <- pivot_longer(data = df_B, cols = -plateNumber, 
                       names_to = "marker", values_to = "Value_B")
head(B_long)
#> # A tibble: 6 × 3
#>   plateNumber marker Value_B
#>   <chr>       <chr>  <chr>  
#> 1 Plate1      Am01   Plate1 
#> 2 Plate1      Am02d  <NA>   
#> 3 Plate1      Am03c  <NA>   
#> 4 Plate1      Am04b  <NA>   
#> 5 Plate1      Am05   <NA>   
#> 6 Plate1      Mc01   <NA>

Joined <- left_join(A_long, B_long, by = c("plate_Number" = "plateNumber","marker"="marker"))
head(Joined)
#> # A tibble: 6 × 6
#>   row   plate plate_Number marker Value Value_B
#>   <chr> <int> <chr>        <chr>  <int> <chr>  
#> 1 A         1 Plate1       Am01      12 Plate1 
#> 2 A         1 Plate1       Am02d      0 <NA>   
#> 3 A         1 Plate1       Am03c      0 <NA>   
#> 4 A         1 Plate1       Am04b      0 <NA>   
#> 5 A         1 Plate1       Am05       0 <NA>   
#> 6 A         1 Plate1       Mc01       0 <NA>
Joined <- mutate(Joined, Value = ifelse(Value_B != "<NA>", 0, Value))
head(Joined)
#> # A tibble: 6 × 6
#>   row   plate plate_Number marker Value Value_B
#>   <chr> <int> <chr>        <chr>  <dbl> <chr>  
#> 1 A         1 Plate1       Am01       0 Plate1 
#> 2 A         1 Plate1       Am02d      0 <NA>   
#> 3 A         1 Plate1       Am03c      0 <NA>   
#> 4 A         1 Plate1       Am04b      0 <NA>   
#> 5 A         1 Plate1       Am05       0 <NA>   
#> 6 A         1 Plate1       Mc01       0 <NA>

Wide <- Joined |> select(-Value_B) |> 
  pivot_wider(names_from = marker, values_from = Value)
Wide
#> # A tibble: 20 × 17
#>    row   plate plate_Num…¹  Am01 Am02d Am03c Am04b  Am05  Mc01  Pi02 Pi03b  Pi04
#>    <chr> <int> <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 A         1 Plate1          0     0     0     0     0     0     0     0     0
#>  2 B         1 Plate1          0     0     0     0     0     0     0     0     0
#>  3 C         1 Plate1          0     0     0     0     0     0     0     0     0
#>  4 D         1 Plate1          0     0     0     0     0     0     0     0     0
#>  5 E         1 Plate1          0     0     0     0     0     0     0     0     0
#>  6 F         1 Plate1          0     0     0     0     0     0     0     0     0
#>  7 G         1 Plate1          0     0     0     0     0     0     0     3     0
#>  8 H         1 Plate1          0     0     0     0     0     0     0    12     0
#>  9 A         2 Plate2          0     0     0     0     0     0     0     0     0
#> 10 B         2 Plate2          0     0     0     0     0     0     0     0     0
#> 11 C         2 Plate2          0     0     0     0     0     0     0     0     0
#> 12 D         2 Plate2          0     0     0     0     0     0     0     0     0
#> 13 E         2 Plate2          0     0     0     0     0     0     0     0     0
#> 14 F         2 Plate2          0    10     0     0     0     0     0     0     0
#> 15 G         2 Plate2          0     5     0     0     0     0     0     0     0
#> 16 H         2 Plate2          0     0     0     0     0     4     0     0     0
#> 17 A         3 Plate3         12     0     0     0     0    12     0     0     0
#> 18 B         3 Plate3          5     7     0     0     0     5     0     0     0
#> 19 C         3 Plate3          0    12     0     0     0     0     0     0     0
#> 20 D         3 Plate3          6     6     0     0     0     0     0     0     0
#> # … with 5 more variables: Pi08c <dbl>, Pi10b <dbl>, Pi12 <dbl>, Pi14 <dbl>,
#> #   Wr02 <dbl>, and abbreviated variable name ¹plate_Number

^{Created on 2023-04-11 with reprex v2.0.2}

MN_AGRICO · April 12, 2023, 8:38am

Dear FJCC,

Thank you so much for this code, it does exactly what I would like it to do in the example you've returned.

In my data-set however (which is identical to the example I provided, but longer) it does it a little bit different. See below

The zero's are indeed the values that needed to be replaced by zero. However, the values that should have been kept are now NA (for example the value "12" in Pi08c in row 1).

Also, everything else brings back NA as well, while in your output everything else is zero.

Any idea what went wrong?

See below how I used your code and made small changes. For instance I altered the source of df_A and df_B into the full dataframes (the exact dataframes from which I made the smaller examples).

And I added the list of the marker since that was not included in your code. I suggested the list of marker in my first post. But maybe I did it wrong here?

library(tidyr)
library(dplyr)

marker <- c("Am01", "Am02d", "Am03c", "Am04b", "Am05", "Mc01", "Pi02", "Pi03b", "Pi04", "Pi08c", "Pi10b", "Pi12", "Pi14", "Wr02")

df_A <- reactions_perPlate_perRow_perMarker_sorted
df_B <- Volle_Plaat_Overzicht

A_long <- pivot_longer(data = df_A, cols = -c("row", "plate", "plate_Number"), 
                       names_to = "marker", values_to = "Value")


B_long <- pivot_longer(data = df_B, cols = -plateNumber, 
                       names_to = "marker", values_to = "Value_B")

Joined <- left_join(A_long, B_long, by = c("plate_Number" = "plateNumber","marker"="marker"))

Joined <- mutate(Joined, Value = ifelse(Value_B != "<NA>", 0, Value))

Wide <- Joined |> select(-Value_B) |> 
  pivot_wider(names_from = marker, values_from = Value)

> print(Wide, n = 20)
# A tibble: 808 × 17
   row   plate plate_Number  Am01 Am02d Am03c Am04b  Am05  Mc01  Pi02 Pi03b  Pi04 Pi08c Pi10b  Pi12  Pi14  Wr02
   <chr> <dbl> <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 A         1 Plate1           0    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 2 B         1 Plate1           0    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 3 C         1 Plate1           0    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 4 D         1 Plate1           0    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 5 E         1 Plate1           0    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 6 F         1 Plate1           0    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 7 G         1 Plate1           0    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 8 H         1 Plate1           0    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 9 A         2 Plate2           0    NA    NA    NA    NA    NA    NA     0    NA    NA    NA    NA    NA    NA
10 B         2 Plate2           0    NA    NA    NA    NA    NA    NA     0    NA    NA    NA    NA    NA    NA
11 C         2 Plate2           0    NA    NA    NA    NA    NA    NA     0    NA    NA    NA    NA    NA    NA
12 D         2 Plate2           0    NA    NA    NA    NA    NA    NA     0    NA    NA    NA    NA    NA    NA
13 E         2 Plate2           0    NA    NA    NA    NA    NA    NA     0    NA    NA    NA    NA    NA    NA
14 F         2 Plate2           0    NA    NA    NA    NA    NA    NA     0    NA    NA    NA    NA    NA    NA
15 G         2 Plate2           0    NA    NA    NA    NA    NA    NA     0    NA    NA    NA    NA    NA    NA
16 H         2 Plate2           0    NA    NA    NA    NA    NA    NA     0    NA    NA    NA    NA    NA    NA
17 A         3 Plate3          NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
18 B         3 Plate3          NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
19 C         3 Plate3          NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
20 D         3 Plate3          NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# … with 788 more rows
# ℹ Use `print(n = ...)` to see more rows

MN_AGRICO · April 12, 2023, 9:22am

perhaps this helps: I created separate output for the Joined and a Joined_mutate to see what is happening. For some reason the Value is not returned if it is NA in Value_B....

Joined <- left_join(A_long, B_long, by = c("plate_Number" = "plateNumber","Markers"="Markers"))
> print(Joined, n = 20)
# A tibble: 11,312 × 6
   row   plate plate_Number Markers Value Value_B
   <chr> <dbl> <chr>        <chr>   <dbl> <chr>  
 1 A         1 Plate1       Am01       12 Plate1 
 2 A         1 Plate1       Am02d       0 NA     
 3 A         1 Plate1       Am03c       0 NA     
 4 A         1 Plate1       Am04b       0 NA     
 5 A         1 Plate1       Am05        0 NA     
 6 A         1 Plate1       Mc01        0 NA     
 7 A         1 Plate1       Pi02        0 NA     
 8 A         1 Plate1       Pi03b       0 NA     
 9 A         1 Plate1       Pi04        0 NA     
10 A         1 Plate1       Pi08c      12 NA     
11 A         1 Plate1       Pi10b       0 NA     
12 A         1 Plate1       Pi12        0 NA     
13 A         1 Plate1       Pi14        0 NA     
14 A         1 Plate1       Wr02        0 NA     
15 B         1 Plate1       Am01       12 Plate1 
16 B         1 Plate1       Am02d       0 NA     
17 B         1 Plate1       Am03c       0 NA     
18 B         1 Plate1       Am04b       0 NA     
19 B         1 Plate1       Am05        0 NA     
20 B         1 Plate1       Mc01        0 NA     
# … with 11,292 more rows
# ℹ Use `print(n = ...)` to see more rows
> Joined_mutate <- mutate(Joined, Value = ifelse(Value_B != "<NA>", 0, Value))
> print(Joined_mutate, n = 20)
# A tibble: 11,312 × 6
   row   plate plate_Number Markers Value Value_B
   <chr> <dbl> <chr>        <chr>   <dbl> <chr>  
 1 A         1 Plate1       Am01        0 Plate1 
 2 A         1 Plate1       Am02d      NA NA     
 3 A         1 Plate1       Am03c      NA NA     
 4 A         1 Plate1       Am04b      NA NA     
 5 A         1 Plate1       Am05       NA NA     
 6 A         1 Plate1       Mc01       NA NA     
 7 A         1 Plate1       Pi02       NA NA     
 8 A         1 Plate1       Pi03b      NA NA     
 9 A         1 Plate1       Pi04       NA NA     
10 A         1 Plate1       Pi08c      NA NA     
11 A         1 Plate1       Pi10b      NA NA     
12 A         1 Plate1       Pi12       NA NA     
13 A         1 Plate1       Pi14       NA NA     
14 A         1 Plate1       Wr02       NA NA     
15 B         1 Plate1       Am01        0 Plate1 
16 B         1 Plate1       Am02d      NA NA     
17 B         1 Plate1       Am03c      NA NA     
18 B         1 Plate1       Am04b      NA NA     
19 B         1 Plate1       Am05       NA NA     
20 B         1 Plate1       Mc01       NA NA     
# … with 11,292 more rows
# ℹ Use `print(n = ...)` to see more rows

nirgrahamuk · April 12, 2023, 9:28am

I think there is NA confusion. Your data likely always had true NA character values (the screenshots suggest this) ; however when you printed out a text representation they showed up like <NA> leading them to appear like fixed strings rather than NA values. I think the assumptions based around the <NA> idea would need to be walked back.
I think this is a fine example where a dput() is superior than the text table print approach you chose; as it hid the underlying representation. If you can print it, I'm sure you can dput it ...

MN_AGRICO · April 12, 2023, 9:36am

Dear nirgrahamuk and FJCC,

I altered the ifelse test on the joint_mutated a little bit using is.na into

Joined_mutate <- mutate(Joined, Value = ifelse(is.na(Value_B), Value, 0))

It does the trick now!

I am especially grateful to FJCC for helping me! Since I am a beginner, every step in my data-processing comes with new challenges to me. Many times I am not aware of the possibilities of Rstudio. I studied the pivoting proces using youtube and it makes a lot of sense. Taking it step by step

Thnx!

system · April 19, 2023, 9:36am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.