Code read with backticks even though I replaced it with apostrophes so I get a "column does not exist error"

Zcythor · August 3, 2022, 1:19pm

I'm trying to transform an excel file and was "handed down" an Rscript to transform specific files. I am very new to R so apologies in my lack of understanding.
this is my code:

# libraries
library(dplyr)
library(tidyr)
library(readxl)
library(openxlsx) # Load on each new session

# Use this for leavers survey

# change the file path "<path>/NEW - leavers data 2019 2020.xlsx"
inputFilepath <- "C:/Users/bangathj/OneDrive - Centrica/Documents/Projects/Automation/Excel R Transformation Project/Leavers Survey/SCH0132 - Exit Interview Responses 2022-07-15 04_30 BST.xlsx"
outputFilepath <- "C:/Users/bangathj/OneDrive - Centrica/Documents/Projects/Automation/Excel R Transformation Project/Leavers Survey/SCH0132 - Exit Interview Responses 2022-07-15 04_30 BST Reshaped.xlsx"

# C:/Users/bangathj/OneDrive - Centrica/Documents/Projects/Automation/Excel R Transformation Project/Leavers Survey/SCH0132 - Exit Interview Responses 2022-07-15 04_30 BST.xlsx
df = read_excel(inputFilepath, sheet="Sheet1")

# Step 1: Remove Duplicates
dfDuplicatesRemoved = df %>%
  group_by("Employee ID", "Question Body") %>%
  filter(row_number() == n()) %>% # keep last row
  ungroup()

# Step 2: Pivot Data
dfTransformed = dfDuplicatesRemoved %>%
  select("Employee ID", "Question Body", "Questionnaire Answer") %>%
  pivot_wider(id_cols="Employee ID", names_from="Question Body", values_from="Questionnaire Answer") 

# Step 3: Obtain an Employee Profile by fetching all data except for question related data.
dfEmployeeProfile = df %>%
  dplyr::select(-c("Question Body", "Questionnaire Answer", "Question", "Questionnaire")) %>% 
  group_by("Employee ID") %>%
  filter(row_number() == n()) %>% # keep last row
  ungroup()

# Step 4: Join transformed data with Employee Profile to have a complete view
dfTranformedWithEmployeeProfile = dfTransformed %>% right_join(dfEmployeeProfile)

write_xlsx(dfTranformedWithEmployeeProfile, outputFilepath)
rm(list = ls())

when I ran it initially, I ran into an error due to backticks, where, in "Step 1" columns "Employee ID" and "Question Body" would not be detected. I fixed this by replacing the backticks with apostrophes, as, in the earlier code, the line was written as:

 group_by(`Employee ID`, `Question Body`) %>%

Then I ran into this error:

> # Step 2: Pivot Data
> > dfTransformed = dfDuplicatesRemoved %>%
> +   select("Employee ID", "Question Body", "Questionnaire Answer") %>%
> +   pivot_wide .... [TRUNCATED] 
> Error in `select()`:
> ! Can't subset columns that don't exist.
> x Column `Employee ID` doesn't exist.

But this time I replaced all backticks with apostrophes. so what could be causing this error now?

nirgrahamuk · August 3, 2022, 1:30pm

This was a mistake and sent you down the wrong track.
You could use feature of library(rlang) to interpret the strings as symbols for group_by, or use group_by_at that expects to get strings.

#normal
iris %>% group_by(Species) %>% summarise(m_ = mean(Petal.Length))
#fine also
iris %>% group_by(`Species`) %>% summarise(m_ = mean(Petal.Length))
# mis-step , not as expected
iris %>% group_by("Species") %>% summarise(m_ = mean(Petal.Length))

# option 1
iris %>% group_by(!!sym("Species")) %>% summarise(m_ = mean(Petal.Length))
# option 2
iris %>% group_by_at("Species") %>% summarise(m_ = mean(Petal.Length))

Zcythor · August 3, 2022, 1:40pm

I have included rlang and used group_by_at and replaced apostrophes with backticks and I'm back with the same problem but with Step 1:

The code is now:

# Step 1: Remove Duplicates
dfDuplicatesRemoved = df %>%
  group_by_at(`Employee ID`, `Question Body`) %>%
  filter(row_number() == n()) %>% # keep last row
  ungroup()

and the output error is:

 # Step 1: Remove Duplicates
> dfDuplicatesRemoved = df %>%
+   group_by_at(`Employee ID`, `Question Body`) %>%
+   filter(row_number() == n()) %>% # .... [TRUNCATED] 
Error in tbl_at_vars(tbl, vars, .include_group_vars = .include_group_vars,  : 
  object 'Employee ID' not found

nirgrahamuk · August 3, 2022, 2:05pm

You've chosen to try something that I did not demonstrate to you....
Was there a reason you didn't choose to try one of the options that I showed as working ?

Zcythor · August 3, 2022, 2:40pm

I misunderstood your comment.
I have re-attempted your optional methods, neither seem to work.

# Step 1: Remove Duplicates
dfDuplicatesRemoved = df %>%
  group_by_at("Employee ID", "Question Body") %>%
  filter(row_number() == n()) %>% # keep last row
  ungroup()

which gives me this error:

> # Step 1: Remove Duplicates
> dfDuplicatesRemoved = df %>%
+   group_by_at("Employee ID", "Question Body") %>%
+   filter(row_number() == n()) %>% # .... [TRUNCATED] 
Error in `group_by_at()`:
! Can't subset columns that don't exist.
x Column `Employee ID` doesn't exist.
Run `rlang::last_error()` to see where the error occurred.

and I also tried the other option:

# Step 1: Remove Duplicates
dfDuplicatesRemoved = df %>%
  group_by(!!sym("Employee ID", "Question Body")) %>%
  filter(row_number() == n()) %>% # keep last row
  ungroup()

which gives me this error:

> # Step 1: Remove Duplicates
> dfDuplicatesRemoved = df %>%
+   group_by(!!sym("Employee ID", "Question Body")) %>%
+   filter(row_number() == n()) % .... [TRUNCATED] 
Error in sym("Employee ID", "Question Body") : 
  unused argument ("Question Body")

nirgrahamuk · August 3, 2022, 2:45pm

This is an aside but when using !!sym() you process one at a time so it would be
group_by(!!sym("Employee ID"), !!sym("Question Body"))

however, I expect this would give the same error as your other code with group_by_at.
The interpretation from the error message is clear. Despite your belief that there is a variable "Employee ID" in the df, there is no such variable.
I recommend you use R to report to you the columnames of 'df' so you can base your manipulations of it from that , sorting them into alphabetical order, might help here.

sort(names(df))

is Employee ID an entry in that output ?

Zcythor · August 3, 2022, 2:53pm

Strange, I don't. I get this:

> sort(names(df))
 [1] "...10"                    "...11"                    "...12"                    "...13"                   
 [5] "...14"                    "...15"                    "...16"                    "...17"                   
 [9] "...18"                    "...19"                    "...2"                     "...20"                   
[13] "...21"                    "...22"                    "...23"                    "...3"                    
[17] "...4"                     "...5"                     "...6"                     "...7"                    
[21] "...8"                     "...9"                     "Exit Interview Responses"

if it helps, this is the excel file I am working with which I use this script to transform, I was told all it needed was changing the filepathname.

Exit Interview Responses
Questionnaire Target Contexts
Questionnaire Targets
Supervisory Organization
Include Subordinate Organizations	Yes
Created On or After	01/06/2022
Created On or Before	30/06/2022
																				Questionnaire Answers
Effective Date	Business Process Reason	Employee ID	Voluntary or Involuntary	Manager	Supervisory Organization	Management Level	Location	Country	FTE %	Service in Years	Worker Type	Supervisory Org - 1 From The Top	Supervisory Org - 2 From The Top	Supervisory Org - 3 From The Top	Supervisory Org - 4 From The Top	Supervisory Org - 5 From The Top	Supervisory Org - 6 From The Top	Questionnaire	Respondent	Question	Question Body	Questionnaire Answer

the columns below would contain information about the surveyees which I did not include

nirgrahamuk · August 3, 2022, 2:57pm

Its clear your excel has preliminary rows that arent intented to be the table to read in and make df out of.
you will have to count the records to skip and use the skip parameter of read_excel to make a table having skipped the initial rows.
Read xls and xlsx files — read_excel • readxl (tidyverse.org)

Zcythor · August 3, 2022, 3:17pm

It worked! thank you, it was an issue with the excel file itself. All that's left is to replace the old code which writes the transformed data frame into excel, I'm guessing write_xlsx is not a used function anymore.

system · August 24, 2022, 3:18pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.