Does Merge work different within a created Function?

carmendw · September 8, 2018, 11:23am

Hi all,

I have a few lines of code that include a column drop, a column rename, and a merge of two df. I need to repeat it 18 times. I have been trying to create a function to do it, but merge is not happy with the renamed column. it does not see it for some reason. I have actually written out the code 18 times for individual trials and everything works, but in the function i create it does not.

Any help is appreciated!

works:

enrollment2001 <- read.csv("C:\\Box Sync\\Name\\Data\\Schools\\enrollment\\cde_enrollment\\Batch579\\enrollment_2000-2001.csv", colClasses = c("CDS_CODE" = "character"))
frpm2001 <- read.csv("C:\\Box Sync\\Name\\Data\\Schools\\frpm\\frpm_2000-2018_clean1\\frpm2001_clean1.csv", colClasses = c("CDSCode" = "character", "year"= "NULL"))

#rename key column to be a shared name CDS_CODE
names(frpm2001)[names(frpm2001) == "CDSCode"] <- "CDS_CODE"
#merge enrollment with frpm
Merged01 <- merge(enrollment2001, frpm2001, by= "CDS_CODE", all.x= TRUE)

create a function to combine enrollment and frpm data

  enrol_frpm<-function(year){
  
  #read in the data
  inputPath1 <- "C:\\Box Sync\\Name\\Data\\Schools\\enrollment\\cde_enrollment\\Batch579\\"
  inputPath2 <- "C:\\Box Sync\\Name\\Data\\Schools\\frpm\\frpm_2000-2018_clean1\\"
  inputPath3 <- "C:\\Box Sync\\Name\\Data\\Schools\\enrol_frpm\\"
  
  enrol <- read.table(paste(inputPath1,"enrollment_",year-1,"-",year,".csv",sep=''),sep = '', fill = TRUE,header = TRUE, quote = "", colClasses = c("CDS_CODE" = "character"))
  frpm <- read.csv(paste(inputPath2,"frpm",year, "_clean1", ".csv",sep = ''),sep = '', fill = TRUE,header = TRUE, quote = "", colClasses = c("CDSCode" = "character"))
  
  #drop year column in frpm
  frpm$year= NULL 
  #rename key column to be a shared name CDS_CODE
  names(frpm) <- c("CDS_CODE"-)
  #merge enrollment with frpm
  Merged<- merge(enrol, frpm, by= `CDS_CODE`, all.x= TRUE)
  
  #write out the data
  write.csv(Merged, paste0(inputPath3,"enrol_frpm",year,".csv",sep=''), row.names = FALSE)

}

#define the years of interest
years <- c(2001:2018)

#run the previously defined function for each year
for (year in years) {
enrol_frpm(year)
}

Error in merge(enrol, frpm, by = CDS_CODE, all.x = TRUE) :
object 'CDS_CODE' not found
In addition: Warning messages:
1: In read.table(paste(inputPath1, "enrollment_", year - 1, "-", year, :
not all columns named in 'colClasses' exist
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
not all columns named in 'colClasses' exist

cderv · September 8, 2018, 1:15pm

This line is different in your function than the one you wrote in the working part. by = "CDS_CODE" is different than

by = `CDS_CODE`

Did you try inside the function with "CDS_CODE" ?

carmendw · September 8, 2018, 2:40pm

@cderv I had to change it. I kept getting a fix.by error. When I googled it people said use the back-ticks instead. I changed it and that error disappeared but the object not found error comes up.

cderv · September 8, 2018, 2:56pm

Could you ask this with a minimal REPRoducible EXample (reprex)? A reprex makes it much easier for others to understand your issue and figure out how to help.

If you can use a subset of your data, it will be easier to test. merge is used with " in the help page, and it is as I used it. You can also post your previous error to see if it is related. Also, you have in your code

#rename key column to be a shared name CDS_CODE
names(frpm) <- c("CDS_CODE"-)

why the - ? is it working ?

carmendw · September 10, 2018, 7:52pm

@cderv This is the code I hope to make into a function to loop through for every year of data. I have 18 years (2001-2018) in individual files. one folder for frpm and one folder for enrollment, each with their respective 18 files. The data share a column but it have different names. So one line of the code renames that column so they are the same. The next line joins the two dataframes. The third line, which has not yet been added, is to write to csv.

##merge enrollment with frpm for each year (year column droppped from frpm df- year available in enrollment df)

enrollment2001 <- read.csv("C:\\Box Sync\\Karina Fastovsky\\Data\\Schools\\enrollment\\cde_enrollment\\Batch579\\enrollment_2000-2001.csv", colClasses = c("CDS_CODE" = "character"))
frpm2001 <- read.csv("C:\\Box Sync\\Karina Fastovsky\\Data\\Schools\\frpm\\frpm_2000-2018_clean1\\frpm2001_clean1.csv", colClasses = c("CDSCode" = "character", "year"= "NULL"))

  #rename key column to be a shared name CDS_CODE
  names(frpm2001)[names(frpm2001) == "CDSCode"] <- "CDS_CODE"
  #merge enrollment with frpm
  Merged01 <- merge(enrollment2001, frpm2001, by= "CDS_CODE", all.x= TRUE)

the - has no effect on the error. it was added while looking at other examples of code where people mark --- for columns they are skipping over as in c(1--4-6). Don't know if it works but decided to try it. The error happens with or without it.

Previous error: when using quotes on "CDS_CODE"

####create a function to combine enrollment and frpm data
  enrol_frpm<-function(year){
+   
+   #read in the data
+   inputPath1 <- "C:\\Box Sync\\Karina Fastovsky\\Data\\Schools\\enrollment\\cde_enrollment\\Batch579\\"
+   inputPath2 <- "C:\\Box Sync\\Karina Fastovsky\\Data\\Schools\\frpm\\frpm_2000-2018_clean1\\"
+   inputPath3 <- "C:\\Box Sync\\Karina Fastovsky\\Data\\Schools\\enrol_frpm\\"
+   
+   enrol <- read.table(paste(inputPath1,"enrollment_",year-1,"-",year,".csv",sep=''),sep = '', fill = TRUE,header = TRUE, quote = "", colClasses = c("CDS_CODE" = "character"))
+   frpm <- read.csv(paste(inputPath2,"frpm",year, "_clean1", ".csv",sep = ''),sep = '', fill = TRUE,header = TRUE, quote = "", colClasses = c("CDSCode" = "character"))
+   
+   #drop year column in frpm
+   frpm$year= NULL 
+   #rename key column to be a shared name CDS_CODE
+   names(frpm) <- c("CDS_CODE")
+   #merge enrollment with frpm
+   Merged<- merge(enrol, frpm, by= "CDS_CODE", all.x= TRUE)
+   
+   #write out the data
+   write.csv(Merged, paste0(inputPath3,"enrol_frpm",year,".csv",sep=''), row.names = FALSE)
+ 
+ }

#define the years of interest
years <- c(2001:2018)

#run the previously defined function for each year
for (year in years) {
+ enrol_frpm(year)
+ }
 Show Traceback
 
 Rerun with Debug
 Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
#> Error: <text>:22:3: unexpected '}'
#> 21: + 
#> 22: + }
#>       ^

Thank you for your time and help.

cderv · September 10, 2018, 9:06pm

Just to be sure as your example is not a reprex (if you could make one it would be awesome): how many columns are there in frpm ? You are renaming all to "CDS_CODE" here I think.
In your previous code you rename just one column

Is there something off here or do I misunderstood?

Please a reprex is needed here. It will help us help you and I think it will help you while trying to build the reprex. Thanks.

carmendw · September 10, 2018, 10:55pm

@cderv ... This was my attempt at a reprex > In R > library(reprex)> copy section > reprex > highlight what was in the side panel. That is what the tutorial said to do to create a reprex. Sorry, but No idea what else you are asking me to do. frpm has 3 columns, CDSCode, frpm, year ( CDSCode was the first column, I read somewhere online that it renames in order, since it is first anyway, it should have worked correctly). The code had to have worked as the merges worked 18 times correctly.

adding data attempt:
Enrollment df: actually has 23 columns but here are a few

# A tibble: 3 x 3
  CDS_CODE       ENR_TOTAL GR_579
  <chr>          <chr>     <chr> 
1 01100170130401 485       74    
2 01100170130419 48        0     
3 01100170130427 81        0

frpm df: Th e 3rd column "year" is dropped at read csv.

# A tibble: 2 x 2
  CDSCode        frpm 
  <chr>          <chr>
1 01100170130427 0    
2 01611190130229 15 ```