Identifying similarities and differences in a Data Frame help

Hi R Community, I am tasked with a project to scrape two excel CSV files with 3 tabs each in the file. For both CSV files, I was told that the rows and columns for tab 1, tab 2, and tab 3 match up but I need to make sure that this is the case. So, I need to detect if there are any differences in each cell for all rows and columns per tab. I think a binary true false outcome is fine. Can anyone point me in the direction for how I can achieve this? Are there any packages that might streamline this task? Thank you in advance.

1 Like

You can compare data frames as shown in the examples below. Neither method will tell you where the differences are, only that the data frames are not identical. You could use the result of DF == DF2 to locate where the result is FALSE.
Keep in mind that any difference will be detected. A trailing space in the text or a one digit difference in the tenth decimal place will count as failures.
I will mention that CSV files do not have tabs. These are probably regular xlsx Excel files.

DF <- data.frame(A = 1:4, B = c("A","B","C","D"))
DF2 <- data.frame(A = 1:4, B = c("A","B","E","D"))
DF3 <- data.frame(A = 1:4, B = c("A","B","C","D"))

all(DF == DF2)
#> [1] FALSE
all(DF == DF3)
#> [1] TRUE

identical(DF, DF2)
#> [1] FALSE
identical(DF, DF3)
#> [1] TRUE

Created on 2022-10-11 with reprex v2.0.2

Thanks! @FJCC

It is an xlsx file. Below is the code that I used, but can't get it to run correctly with the two excel files I have to detect different values.

The task: I have two excel files with multiple sheets. The sheet names and their corresponding column names are same of both the files. Only the values in the sheets are different. I want to compare using R which values are different and want to mark those cells.

# I got this to work

library(tidyverse)
library(readxl)

#Read Articles Coded Final .xlsx file
ACode <- read_excel("Article Coding (Final)_AD.xlsx")

sheet_names <- excel_sheets("Article Coding (Final)_AD.xlsx")
sheet_names

list_all <- lapply(sheet_names, function(x) {
    as.data.frame(read_excel("Article Coding (Final)_AD.xlsx", sheet = 
                               x)) } )
names(list_all) <- sheet_names
# I didn't get this to work and am not sure how to customize the code to capture the properties of my excel files. I couldn't figure out how to upload the files, but am happy to share them over email if needed. Thanks a lot.

wbsCreate <- function(v) {
  wb <- createWorkbook()
  sheet <- createSheet(wb, "Sheet1")
  rows  <- createRow(sheet, rowIndex=1:5)
  cells <- createCell(rows, colIndex=1:5) 
  for (r in 1:5)
    for (c in 1:5)
      setCellValue(cells[[r, c]], value = v[(r-1)*5+c])
  saveWorkbook(wb, tf <- tempfile(fileext = ".xlsx"))
  return(tf)
}

wbsMarkDiff <- function(fn1, fn2) {
  fns <- c(fn1, fn2)
  wb <- lapply(fns, loadWorkbook)  
  cs <- lapply(wb, function(x) CellStyle(x) + 
                 Fill(backgroundColor="red", 
                      foregroundColor="red", 
                      pattern="SOLID_FOREGROUND"))
  sheets <- lapply(wb, getSheets)
  sheetnames <- do.call(intersect, lapply(sheets, names))
  for (sheetname in sheetnames) {
    sheet <- lapply(sheets, "[[", sheetname)
    rows <- lapply(sheet, getRows)
    cells <- lapply(rows, getCells)
    values <- lapply(cells, function(cell) lapply(cell, getCellValue))
    idx <- names(which(!mapply(identical, values[[1]], values[[2]])))
    for (s in 1:2) 
      for (i in idx) 
        setCellStyle(cells[[s]][[i]], cs[[s]])
    for (s in 1:2)
      saveWorkbook(wb[[s]], fns[s])
  }
}

library(xlsx)

# create to excel workbooks (same dimensions per sheet)    
v <- LETTERS[1:25]
tf1 <- wbsCreate("ACode")
v[c(3,6,9)] <- letters[c(3,6,9)]
tf2 <- wbsCreate("ACode")

# mark differences     
wbsMarkDiff(tf1, tf2)
shell.exec(tf1) # open file1 on windows
shell.exec(tf2) # open file2 on windows

Where do the functions createSheet(), createRow(), createCell() come from? I do not see those in readxl.

@FJCC thanks for responding.

It is a function in the package xlsx to manipulate excel docs and sheets. Below is a URL that describes the arguments. Workbook function - RDocumentation

Also, I found the code on stackoverflow but I am not sure how to tweak it to specifically match what I have. Here is the stackoverflow URL: comparison - comparing excel sheets in R - Stack Overflow

I really appreciate your help with this.

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.