Are Rows Identical?

I am trying to compare subsetted rows of a dataframe from beginning to end (by order of the presented dataframe), and return a true or false if the rows are the same, adding the result of true or false as a new column called compare.

In this reprex I want to return false if the previous row is different than the current row but only for subset columns y3 and y4. The column result "compare" should be False, False, True, True, False, True, False, False, False, False

y1 <- c("T", "T", "T", "T", "T", "T", "F", "F", "F", "N")
y2 <- c("A", "B", "A", "B", "A", "B", "A", "B", "C", "D")
y3 <- c("T", "F", "F", "F", "T", "T", "F", "F", "F", "N")
y4 <- c("B", "B", "B", "B", "B", "B", "A", "B", "C", "B")
d <- data.frame(y1, y2, y3, y4)

Use distinct() from tidyverse.

yes, I've tried identical, distinct and lead\lag all within looping and apply structures and my results returned in my real case are jumbled up so I thought I'd post this simple example to get some ideas. thanks

I would do it like this. The first row isn't FALSE but that can be fixed by adding another condition or two to deal with NA values in either the lagged columns or the original columns.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
y1 <- c("T", "T", "T", "T", "T", "T", "F", "F", "F", "N")
y2 <- c("A", "B", "A", "B", "A", "B", "A", "B", "C", "D")
y3 <- c("T", "F", "F", "F", "T", "T", "F", "F", "F", "N")
y4 <- c("B", "B", "B", "B", "B", "B", "A", "B", "C", "B")
d <- data.frame(y1, y2, y3, y4)
d |> 
  mutate(y3_lag = lag(y3),
         y4_lag = lag(y4)) |> 
  mutate(compare = y3 == y3_lag & y4 == y4_lag)
#>    y1 y2 y3 y4 y3_lag y4_lag compare
#> 1   T  A  T  B   <NA>   <NA>      NA
#> 2   T  B  F  B      T      B   FALSE
#> 3   T  A  F  B      F      B    TRUE
#> 4   T  B  F  B      F      B    TRUE
#> 5   T  A  T  B      F      B   FALSE
#> 6   T  B  T  B      T      B    TRUE
#> 7   F  A  F  A      T      B   FALSE
#> 8   F  B  F  B      F      A   FALSE
#> 9   F  C  F  C      F      B   FALSE
#> 10  N  D  N  B      F      C   FALSE

Created on 2024-01-11 with reprex v2.0.2

do you have any ideas on how to make this dynamic, that is, specifying different columns by using a vector input of columns?

You're right. I gave a poor answer. Try

dSmaller <- d |> select(y3, y4)
duprows <- duplicated(dSmaller)

and then add duprows into your dataframe if you like.

hand checking and the expected results don't align with your code, I think there's promise in the other response, but probably needs to go into a function with apply so the columns can be more dynamic. Interesting we still need smart people to figure this out. I spent hours with ChatGPT giving me non-sense anwers.

d |> 
  mutate(y3_lag = lag(y3),
         y4_lag = lag(y4)) |> 
  mutate(compare = y3 == y3_lag & y4 == y4_lag)

Here is a modified method that takes a vector of column names. Notice I reordered the columns of d so y3 and y4 are not next to each other.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
y1 <- c("T", "T", "T", "T", "T", "T", "F", "F", "F", "N")
y2 <- c("A", "B", "A", "B", "A", "B", "A", "B", "C", "D")
y3 <- c("T", "F", "F", "F", "T", "T", "F", "F", "F", "N")
y4 <- c("B", "B", "B", "B", "B", "B", "A", "B", "C", "B")
d <- data.frame(y1, y3, y2, y4)

chosenRows <- c("y3", "y4")
chosenRows.x <- paste0(chosenRows, ".x")
chosenRows.y <- paste0(chosenRows, ".y")
d_lag <- d |> mutate(ROW = row_number() +1) |>  
  slice(-nrow(d))  |> 
  select(ROW, all_of(chosenRows))
  
d |> mutate(ROW = row_number()) |> 
  left_join(d_lag, by = "ROW") |> 
  rowwise() |> 
  mutate(compare = all(c_across(all_of(chosenRows.x)) == 
                         c_across(all_of(chosenRows.y))))
#> # A tibble: 10 × 8
#> # Rowwise: 
#>    y1    y3.x  y2    y4.x    ROW y3.y  y4.y  compare
#>    <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <lgl>  
#>  1 T     T     A     B         1 <NA>  <NA>  NA     
#>  2 T     F     B     B         2 T     B     FALSE  
#>  3 T     F     A     B         3 F     B     TRUE   
#>  4 T     F     B     B         4 F     B     TRUE   
#>  5 T     T     A     B         5 F     B     FALSE  
#>  6 T     T     B     B         6 T     B     TRUE   
#>  7 F     F     A     A         7 T     B     FALSE  
#>  8 F     F     B     B         8 F     A     FALSE  
#>  9 F     F     C     C         9 F     B     FALSE  
#> 10 N     N     D     B        10 F     C     FALSE

Created on 2024-01-11 with reprex v2.0.2

1 Like

You're right. I forgot that duplicated compares to any previous column, not just the last one. @FJCC's approach is better.

You can go fairly succinct using slider package, as it makes row based iteration easier, and supports windowing, in this case I use the .after control to indicate the size 1 look ahead needed to duplicate the results of the accepted answer.

y1 <- c("T", "T", "T", "T", "T", "T", "F", "F", "F", "N")
y2 <- c("A", "B", "A", "B", "A", "B", "A", "B", "C", "D")
y3 <- c("T", "F", "F", "F", "T", "T", "F", "F", "F", "N")
y4 <- c("B", "B", "B", "B", "B", "B", "A", "B", "C", "B")
d <- data.frame(y1, y2, y3, y4)

library(tidyverse)
library(slider)
cols_to_do <- c("y3", "y4")

d$dup_column <- slide_lgl(
  .x = d,
  .f = \(r){
    any(duplicated(select(
      r,
      all_of(cols_to_do)
    )))
  },
  .after = 1L,
  .complete = FALSE
)

d

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.