mutate new dataframe id duplicate with old dataframe

shoaibali · August 27, 2020, 5:07am

I have two data frame, this is just a sample , database have approx 1 million of records.

data1<-data.frame(
  'External ID' = c(86364,"ARV_2612","AGH_2212","IND_2622","CHG_2622"),
  sector = c(3,3,1,2,5),
  col1=c(1,1,0,0,0),
  'Enternal code'=c(1,1,1,1,3),
  col3=c(1,1,0,0,0),
  col4=c(1,0,0,0,0),
  col5=c(1,0,1,1,1)
)

data2<-data.frame(
  'External ID' = c(53265,"ARV_7362",76354,"IND_2622","CHG_9762"),
  sector = c(3,3,1,2,5),
  col1=c(1,1,0,0,0),
  'Enternal code'=c(1,1,1,1,3),
  col3=c(1,1,0,0,0),
  col4=c(1,0,0,0,0),
  col5=c(1,0,1,1,1)
)

new to R,now i am looking for a approach to mutate my one data frame (data2) like. the code while will automatically find the column "External ID", then add a new column in data2 like duplicate and this column will show Y or N if the external_id in data2 present in data1(external ID) any simple solution....?? the output should be like

External.ID	sector	col1	Enternal.code	col3	col4	col5	duplicate
53265	3	1	1	1	1	1	N
ARV_7362	3	1	1	1	0	0	N
76354	1	0	1	0	0	1	N
IND_2622	2	0	1	0	0	1	Y
CHG_9762	5	0	3	0	0	1	N

nirgrahamuk · August 27, 2020, 8:32am

data2 %>% mutate(
  duplicated = factor(if_else(External.ID %in% 
                                pull(data1, External.ID),
                              "Yes", 
                              "No")))

shoaibali · August 27, 2020, 8:45am

can we pull column External.ID like select(matches("^External|ID") because i want to make a function to this , so that it can dynamically select that External_ID if matches column in data1

nirgrahamuk · August 27, 2020, 8:47am

you are saying that there can be deviation in how external ID column is named, so you want regex matching between columns of dataframes where external and ID appear as tokens in a column.
I'd suggest its more elegant to do a function that is limited to identifying the externaID columns, and then use the result from calling that function to then do the duplicated check using the approach I suggested.

shoaibali · August 27, 2020, 9:06am

yes ....
I'd suggest its more elegant to do a function that is limited to identifying the externaID columns .....??
how ....??

nirgrahamuk · August 27, 2020, 9:11am

data1<-data.frame(
  'ExternalID' = c(86364,"ARV_2612","AGH_2212","IND_2622","CHG_2622"),
  sector = c(3,3,1,2,5)
)

data2<-data.frame(
  'External ID' = c(53265,"ARV_7362",76354,"IND_2622","CHG_9762"),
  sector = c(3,3,1,2,5)
)


getexidnames<- function(data1,data2){
(d1name <- names(data1) %>% stringr::str_subset("^External|ID"))
(d2name <- names(data2) %>% stringr::str_subset("^External|ID"))
list(d1name=d1name,
     d2name=d2name)
}

getexidnames(data1,data2)

shoaibali · August 27, 2020, 11:02am

thanks for help, it worked for me

system · September 3, 2020, 11:03am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.