Are Rows Identical?

michaelgloven · January 12, 2024, 1:45am

I am trying to compare subsetted rows of a dataframe from beginning to end (by order of the presented dataframe), and return a true or false if the rows are the same, adding the result of true or false as a new column called compare.

In this reprex I want to return false if the previous row is different than the current row but only for subset columns y3 and y4. The column result "compare" should be False, False, True, True, False, True, False, False, False, False

y1 <- c("T", "T", "T", "T", "T", "T", "F", "F", "F", "N")
y2 <- c("A", "B", "A", "B", "A", "B", "A", "B", "C", "D")
y3 <- c("T", "F", "F", "F", "T", "T", "F", "F", "F", "N")
y4 <- c("B", "B", "B", "B", "B", "B", "A", "B", "C", "B")
d <- data.frame(y1, y2, y3, y4)

startz · January 12, 2024, 2:28am

Use distinct() from tidyverse.

michaelgloven · January 12, 2024, 2:32am

yes, I've tried identical, distinct and lead\lag all within looping and apply structures and my results returned in my real case are jumbled up so I thought I'd post this simple example to get some ideas. thanks

FJCC · January 12, 2024, 3:17am

I would do it like this. The first row isn't FALSE but that can be fixed by adding another condition or two to deal with NA values in either the lagged columns or the original columns.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
y1 <- c("T", "T", "T", "T", "T", "T", "F", "F", "F", "N")
y2 <- c("A", "B", "A", "B", "A", "B", "A", "B", "C", "D")
y3 <- c("T", "F", "F", "F", "T", "T", "F", "F", "F", "N")
y4 <- c("B", "B", "B", "B", "B", "B", "A", "B", "C", "B")
d <- data.frame(y1, y2, y3, y4)
d |> 
  mutate(y3_lag = lag(y3),
         y4_lag = lag(y4)) |> 
  mutate(compare = y3 == y3_lag & y4 == y4_lag)
#>    y1 y2 y3 y4 y3_lag y4_lag compare
#> 1   T  A  T  B   <NA>   <NA>      NA
#> 2   T  B  F  B      T      B   FALSE
#> 3   T  A  F  B      F      B    TRUE
#> 4   T  B  F  B      F      B    TRUE
#> 5   T  A  T  B      F      B   FALSE
#> 6   T  B  T  B      T      B    TRUE
#> 7   F  A  F  A      T      B   FALSE
#> 8   F  B  F  B      F      A   FALSE
#> 9   F  C  F  C      F      B   FALSE
#> 10  N  D  N  B      F      C   FALSE

^{Created on 2024-01-11 with reprex v2.0.2}

michaelgloven · January 12, 2024, 3:22am

do you have any ideas on how to make this dynamic, that is, specifying different columns by using a vector input of columns?

startz · January 12, 2024, 3:23am

You're right. I gave a poor answer. Try

dSmaller <- d |> select(y3, y4)
duprows <- duplicated(dSmaller)

and then add duprows into your dataframe if you like.

michaelgloven · January 12, 2024, 3:36am

hand checking and the expected results don't align with your code, I think there's promise in the other response, but probably needs to go into a function with apply so the columns can be more dynamic. Interesting we still need smart people to figure this out. I spent hours with ChatGPT giving me non-sense anwers.

d |> 
  mutate(y3_lag = lag(y3),
         y4_lag = lag(y4)) |> 
  mutate(compare = y3 == y3_lag & y4 == y4_lag)

FJCC · January 12, 2024, 3:43am

Here is a modified method that takes a vector of column names. Notice I reordered the columns of d so y3 and y4 are not next to each other.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
y1 <- c("T", "T", "T", "T", "T", "T", "F", "F", "F", "N")
y2 <- c("A", "B", "A", "B", "A", "B", "A", "B", "C", "D")
y3 <- c("T", "F", "F", "F", "T", "T", "F", "F", "F", "N")
y4 <- c("B", "B", "B", "B", "B", "B", "A", "B", "C", "B")
d <- data.frame(y1, y3, y2, y4)

chosenRows <- c("y3", "y4")
chosenRows.x <- paste0(chosenRows, ".x")
chosenRows.y <- paste0(chosenRows, ".y")
d_lag <- d |> mutate(ROW = row_number() +1) |>  
  slice(-nrow(d))  |> 
  select(ROW, all_of(chosenRows))
  
d |> mutate(ROW = row_number()) |> 
  left_join(d_lag, by = "ROW") |> 
  rowwise() |> 
  mutate(compare = all(c_across(all_of(chosenRows.x)) == 
                         c_across(all_of(chosenRows.y))))
#> # A tibble: 10 × 8
#> # Rowwise: 
#>    y1    y3.x  y2    y4.x    ROW y3.y  y4.y  compare
#>    <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <lgl>  
#>  1 T     T     A     B         1 <NA>  <NA>  NA     
#>  2 T     F     B     B         2 T     B     FALSE  
#>  3 T     F     A     B         3 F     B     TRUE   
#>  4 T     F     B     B         4 F     B     TRUE   
#>  5 T     T     A     B         5 F     B     FALSE  
#>  6 T     T     B     B         6 T     B     TRUE   
#>  7 F     F     A     A         7 T     B     FALSE  
#>  8 F     F     B     B         8 F     A     FALSE  
#>  9 F     F     C     C         9 F     B     FALSE  
#> 10 N     N     D     B        10 F     C     FALSE

^{Created on 2024-01-11 with reprex v2.0.2}

startz · January 12, 2024, 3:45am

You're right. I forgot that duplicated compares to any previous column, not just the last one. @FJCC's approach is better.

nirgrahamuk · January 12, 2024, 11:12am

You can go fairly succinct using slider package, as it makes row based iteration easier, and supports windowing, in this case I use the .after control to indicate the size 1 look ahead needed to duplicate the results of the accepted answer.

y1 <- c("T", "T", "T", "T", "T", "T", "F", "F", "F", "N")
y2 <- c("A", "B", "A", "B", "A", "B", "A", "B", "C", "D")
y3 <- c("T", "F", "F", "F", "T", "T", "F", "F", "F", "N")
y4 <- c("B", "B", "B", "B", "B", "B", "A", "B", "C", "B")
d <- data.frame(y1, y2, y3, y4)

library(tidyverse)
library(slider)
cols_to_do <- c("y3", "y4")

d$dup_column <- slide_lgl(
  .x = d,
  .f = \(r){
    any(duplicated(select(
      r,
      all_of(cols_to_do)
    )))
  },
  .after = 1L,
  .complete = FALSE
)

d

system · January 19, 2024, 11:12am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.