Generating different output table when I use inner_join and data_table

JojoSouza · April 18, 2022, 10:50pm

I would like to know why I can't generate the same output table with Code 2.

This is Code 1, which gives the result I want:

Code 1

library(dplyr)
library(tidyr)
library(lubridate)
library(data.table)

df1 <- structure(
  list(date1= c("2021-06-28","2021-06-28","2021-06-28","2021-06-28","2021-06-28",
                "2021-06-28","2021-06-28","2021-06-28"),
       date2 = c("2021-06-25","2021-06-25","2021-06-27","2021-07-07","2021-07-07","2021-07-09","2021-07-09","2021-07-09"),
       Code = c("FDE","ABC","ABC","ABC","CDE","FGE","ABC","CDE"),
       Week= c("Wednesday","Wednesday","Friday","Wednesday","Wednesday","Friday","Friday","Friday"),
       DR1 = c(4,1,4,3,3,4,3,5),
       DR01 = c(4,1,4,3,3,4,3,6), DR02= c(4,2,6,7,3,2,7,4),DR03= c(9,5,4,3,3,2,1,5),
       DR04 = c(5,4,3,3,6,2,1,9),DR05 = c(5,4,5,3,6,2,1,9),
       DR06 = c(2,4,3,3,5,6,7,8),DR07 = c(2,5,4,4,9,4,7,8),
       DR08 = c(3,2,0,1,2,4,2,2),DR09 = c(0,0,0,0,0,0,0,0),DR010 = c(0,0,0,0,0,0,0,0),DR011 = c(4,0,0,0,0,0,0,0), 
       DR012 = c(3,2,0,3,5,3,4,5),DR013 = c(0,0,1,0,0,0,2,0),DR014 = c(0,0,0,0,0,2,0,0)),
  class = "data.frame", row.names = c(NA, -8L))

selection = startsWith(names(df1), "DR0")

df1[selection][is.na(df1[selection])] = 0

dt1 <- as.data.table(df1)

cols <- grep("^DR0", colnames(dt1), value = TRUE)

medi_ana <- 
  dt1[, (paste0(cols, "_PV")) := DR1 - .SD, .SDcols = cols
  ][, lapply(.SD, median), by = .(Code, Week), .SDcols = paste0(cols, "_PV") ]


# Join the group medians back onto df1 and add each median to its DR0 column
SPV <- df1 %>%
  inner_join(medi_ana, by = c('Code', 'Week')) %>%
  mutate(across(matches("^DR0\\d+$"), ~.x +
                  get(paste0(cur_column(), '_PV')),
                .names = '{col}_{col}_PV')) %>%
  select(date1:Week, DR01_DR01_PV:last_col()) %>%
  data.frame()

dmda <- "2021-07-07"
CodeChosse <- "CDE"

# Names of the trailing DR0 columns that are all zero for the chosen row
mat1 <- df1 %>%
  filter(date2 == dmda, Code == CodeChosse) %>%
  select(starts_with("DR0")) %>%
  pivot_longer(cols = everything()) %>%
  arrange(desc(row_number())) %>%
  mutate(cs = cumsum(value)) %>%
  filter(cs == 0) %>%
  pull(name)
(dropnames <- paste0(mat1,"_",mat1, "_PV"))
[1] "DR014_DR014_PV" "DR013_DR013_PV"

First <- SPV %>%
  filter(date2 == dmda, Code == CodeChosse) %>%
  select(-any_of(dropnames))
 
> First
       date1      date2 Code      Week DR01_DR01_PV DR02_DR02_PV DR03_DR03_PV DR04_DR04_PV DR05_DR05_PV DR06_DR06_PV
1 2021-06-28 2021-07-07  CDE Wednesday            3            3            3            3            3            3
  DR07_DR07_PV DR08_DR08_PV DR09_DR09_PV DR010_DR010_PV DR011_DR011_PV DR012_DR012_PV
1            3            3            3              3              3              3

Notice that in this first code the columns "DR013_DR013_PV" and "DR014_DR014_PV" are removed from First. This code generates the result I want.
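For reference, the mat1 step amounts to finding the trailing DR0 columns that are zero for the chosen row: reverse the column order, take a running sum, and keep the names for which the sum is still zero. Here is a minimal base R sketch of that idea, with the values copied by hand from the row of df1 where date2 is "2021-07-07" and Code is "CDE". The names vals, rev_vals and trailing_zero are illustrative only and do not appear in the original code.

```r
# DR01..DR014 for the row with date2 == "2021-07-07" and Code == "CDE"
# (copied by hand from df1; illustrative stand-in for the filtered row)
vals <- c(DR01 = 3, DR02 = 3, DR03 = 3, DR04 = 6, DR05 = 6, DR06 = 5, DR07 = 9,
          DR08 = 2, DR09 = 0, DR010 = 0, DR011 = 0, DR012 = 5, DR013 = 0, DR014 = 0)

# Walk from the last column backwards; with non-negative values the
# running sum stays 0 only across the trailing block of zeros.
rev_vals <- rev(vals)
trailing_zero <- names(rev_vals)[cumsum(rev_vals) == 0]
trailing_zero
```

This only works as intended when the selected columns are exactly the original DR0 columns, in their original order.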

To improve execution speed, I decided to build SPV with a data.table join instead of inner_join. However, when I then run the rest of the code, I can't get the desired result: the columns "DR013_PV" and "DR014_PV" are not removed as they were in the first code. See Code 2. What could be wrong?

Code 2

f1 <- function(nm, pat) grep(pat, nm, value = TRUE)
nm1 <- f1(names(df1), "^DR0\\d+$")
nm2 <- f1(names(medi_ana), "_PV")
nm3 <- paste0("i.", nm2)
setDT(df1)[medi_ana,  (nm2) := Map(`+`, mget(nm1), mget(nm3)), on = .(Code, Week)]
SPV <- df1[, c('date1', 'date2', 'Code', 'Week', nm2), with = FALSE] %>% data.frame()


dmda<-"2021-07-07"
CodeChosse<-"CDE"

mat1 <- df1 %>%
  filter(date2 == dmda, Code == CodeChosse) %>%
  select(starts_with("DR0")) %>%
  pivot_longer(cols = everything()) %>%
  arrange(desc(row_number())) %>%
  mutate(cs = cumsum(value)) %>%
  filter(cs == 0) %>%
  pull(name)
(dropnames <- paste0(mat1, "_PV"))

Second<-SPV %>%
  filter(date2 == dmda, Code == CodeChosse) %>%
  select(-any_of(dropnames))

> Second
       date1      date2 Code      Week DR01_PV DR02_PV DR03_PV DR04_PV DR05_PV DR06_PV DR07_PV DR08_PV DR09_PV DR010_PV DR011_PV
1 2021-06-28 2021-07-07  CDE Wednesday       3       3       3       3       3       3       3       3       3        3        3
  DR012_PV DR013_PV DR014_PV
1        3        3        3

DR013_PV and DR014_PV should not be in Second.

williaml · April 19, 2022, 12:44am

Is mat1 different in the second one? Otherwise:

> First == Second
     date1 date2 Code Week DR01_DR01_PV DR02_DR02_PV DR03_DR03_PV DR04_DR04_PV DR05_DR05_PV DR06_DR06_PV DR07_DR07_PV DR08_DR08_PV DR09_DR09_PV
[1,]  TRUE  TRUE TRUE TRUE         TRUE         TRUE         TRUE         TRUE         TRUE         TRUE         TRUE         TRUE         TRUE
     DR010_DR010_PV DR011_DR011_PV DR012_DR012_PV
[1,]           TRUE           TRUE           TRUE

Actually, the line right after mat1 is different:

(dropnames <- paste0(mat1,"_",mat1, "_PV"))  # Code 1
(dropnames <- paste0(mat1, "_PV"))           # Code 2
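A quick way to see what each of these lines produces is to run them on a known mat1. The sketch below assumes mat1 holds the two names printed in Code 1. The variable names mat1_code1, names_code1, names_code2 and empty_case are made up for illustration. The last call shows what paste0() returns if mat1 happens to be empty, which may be worth ruling out by printing mat1 in Code 2.

```r
# mat1 as printed in Code 1 (assumed here for illustration)
mat1_code1 <- c("DR014", "DR013")

# Code 1 style: matches the '{col}_{col}_PV' names created by across()
names_code1 <- paste0(mat1_code1, "_", mat1_code1, "_PV")

# Code 2 style: matches the plain '<col>_PV' names created by the data.table join
names_code2 <- paste0(mat1_code1, "_PV")

# With an empty mat1, paste0() still returns a single string, "_PV",
# which matches no column, so any_of() silently drops nothing.
empty_case <- paste0(character(0), "_PV")

names_code1
names_code2
empty_case
```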

JojoSouza · April 19, 2022, 2:07am

Thanks for the reply, @williaml.

I didn't quite understand what you meant. First from Code 1 has 16 variables, whereas Second from Code 2 has 18. Both should give the same result.

williaml · April 19, 2022, 2:59am

I meant that this is different:

mat1 <- df1 %>%
  filter(date2 == dmda, Code == CodeChosse) %>%
  select(starts_with("DR0")) %>%
  pivot_longer(cols = everything()) %>%
  arrange(desc(row_number())) %>%
  mutate(cs = cumsum(value)) %>%
  filter(cs == 0) %>%
  pull(name)
(dropnames <- paste0(mat1, "_PV"))

I edited the previous post slightly as well.
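Another thing worth checking is whether mat1 itself changes in Code 2. setDT() and := modify df1 by reference, so after the join df1 also holds the new DR0*_PV columns, and select(starts_with("DR0")) would then pick those up as well. Here is a minimal sketch of that effect with hypothetical toy data; toy and med are illustrative stand-ins, not the original tables.

```r
library(data.table)

# Toy stand-ins for df1 and medi_ana (hypothetical data)
toy <- data.frame(Code = "CDE", DR01 = 3, DR02 = 0)
med <- data.table(Code = "CDE", DR01_PV = 0, DR02_PV = 3)

# Update join in the style of Code 2: setDT() converts toy in place
# and := adds the new columns to that same object.
setDT(toy)[med, `:=`(DR01_PV = DR01 + i.DR01_PV,
                     DR02_PV = DR02 + i.DR02_PV), on = .(Code)]

names(toy)                       # toy now also contains the _PV columns

# A prefix test matches the new columns too ...
prefix_hits <- names(toy)[startsWith(names(toy), "DR0")]

# ... while an anchored pattern keeps only the original DR0 columns.
exact_hits <- grep("^DR0\\d+$", names(toy), value = TRUE)

prefix_hits
exact_hits
```

If this is what happens with df1, the last value seen by cumsum() is a non-zero _PV column, so mat1 would come back empty. Restricting the selection in mat1 to the original columns, for example with matches("^DR0\\d+$") as Code 1 already does inside across(), or building mat1 before the join, should then make Code 2 drop the same columns.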
