Read a delimited file with escape char in R

josephmathewp · October 23, 2023, 6:53am

I need to read a delimited file with | delimiter, whole characters quoted using "" and escape character as BACKSLASH.
The file data has 5 columns with Col1 .. Col5 where Col3 has values which are again delimited multiple values.

"Col1"|"Col2"|"Col3"|"Col4"|"Col5"
"Row1"|"aaaa"|"No Comment"|"1111"|"2018-07-02 00:00:01"
"Row2"|"bbbb"|"No Comment"|"2222"|"2019-08-02 00:00:01"
"Row3"|"cccc"|"No Comment"|"3333"|"2020-09-02 00:00:01"
"Row4"|"dddd"|"\"a1|b2|c3|d4|e5|f6|g7|h8|\""|"4444"|"2021-10-02 00:00:01"
"Row5"|"eeee"|"No Comment"|"5555"|"2022-11-02 00:00:01"
"Row6"|"ffff"|"\"a1|b2|c3|d4|e5|f6|g7|h8|\""|"6666"|"2023-10-02 00:00:01"

Row 4 has the value for Col 3 as - "a1|b2|c3|d4|e5|f6|g7|h8|"
and
Row 6 has the value for Col 3 as - "a1|b2|c3|d4|e5|f6|g7|h8|"

In Python, it can be read as

pd.read_csv('./test-file.txt', delimiter='|', doublequote=False, escapechar='\\', quoting=csv.QUOTE_ALL)

Please help me read it in R.

The following command in R

read.table("/test-file.txt", sep='|', quote = "\\", header=TRUE, allowEscapes = TRUE)

gives error

Error in read.table("/test-file.txt", :
more columns than column names

nirgrahamuk · October 23, 2023, 9:23am

This issue seems very related to the particular details of the text, therefore its best to format the text as code, to preserve it accurately as the forum interpreter may alter or obfuscate the issue.

jrkrideau · October 23, 2023, 8:29pm

It may not be a fancy solution but if the pattern is consistent why not just open the .txt file in a text editor and a couple of global find and rplace operations?

"|" --> "," &
\" to nothing gave this

structure(list(Col1 = c("Row1", "Row2", "Row3", "Row4", "Row5", 
"Row6"), Col2 = c("aaaa", "bbbb", "cccc", "dddd", "eeee", "ffff"
), Col3 = c("No Comment", "No Comment", "No Comment", "a1|b2|c3|d4|e5|f6|g7|h8|", 
"No Comment", "a1|b2|c3|d4|e5|f6|g7|h8|"), Col4 = c(1111L, 2222L, 
3333L, 4444L, 5555L, 6666L), Col5 = structure(c(1530489601, 1564704001, 
1599004801, 1633132801, 1667347201, 1696204801), class = c("POSIXct", 
"POSIXt"), tzone = "UTC")), row.names = c(NA, -6L), class = "data.frame")

josephmathewp · October 25, 2023, 4:23am

Solved it using a function.

Read_Data = function(filepath) {
  con = file(filepath, "r")
  dat_df = data.frame()
  cnt = 0
  while ( TRUE ) {
    cnt = cnt + 1
    line = readLines(con, n = 1)
    if ( length(line) == 0 ) {
      break
    }
    line_remove_first_quote <- substr(line,start = 2,stop = nchar(line) - 1)
    line_delimiter_replaced = gsub('"|"', "##", line_remove_first_quote, fixed=TRUE)
    line_splitted <- unlist(strsplit(line_delimiter_replaced, '##'))
    
    if(cnt == 1) {
      columns = c(line_splitted)
      dat_df = data.frame(matrix(nrow = 0, ncol = length(columns)))
      colnames(dat_df) = columns
      next
    }
    dat_df[nrow(dat_df) + 1,] <- c(line_splitted)
  }
  close(con)
  return(dat_df)
}

josephmathewp · November 8, 2023, 6:09am

fread function from data.table can read this type of delimited files correctly

library(data.table)
fread(file_path, sep = "|", na.strings = c("", "NA", "n/a", "na", "N/A", "?", "999999"))

jrkrideau · November 8, 2023, 6:42am

Thanks, I have been trying to learn something about data.table but had not not realized fread could do that.

technocrat · November 8, 2023, 7:24am

data.table is full of the right way to do stuff

jrkrideau · November 8, 2023, 1:53pm

Yes but it taking me a long time to learn it. Too many people use tidyverse

nirgrahamuk · November 8, 2023, 3:36pm

I occasionally use dtplyr to benefit from data.table speed but leverage my existing knowledge of how tidyverse syntax goes (and my relative lack of knowledge of the data.table syntax; dtplyr might be useful to you similarly.

jrkrideau · November 8, 2023, 4:03pm

Ah, thanks but I know less about {dplyr} than I do about {data.table}.

I ran into data.table years ago, pre-tidyverse, when I was just starting to learn R and was a bit scared by the data.table syntax. I was having enough trouble with Base R. It was just last year or so that I realized how handy it is.

I am not a great fan of dplyr as I find it is way too verbose. {data.table} has a conciseness that appeals. I am reminded of going from SPSS to SAS/Systat.

My background is in psychology and for practical reasons, back in the stone age, SPSS was preferred over SAS in general. It was a revelation to discover SAS.

To my amazement, I was talking to a shiny brand-new graduate student in Psych the other day who was bemoaning the fact that her new department wanted her to learn R. After all the SPSS she had learned working with her undergrad advisor.

It's like a zombie, it cannot be killed.

martin.R · November 8, 2023, 4:42pm

In many ways the data.table syntax is much simpler in my opinion, at least for standard dataframe manipulations. I view it as a turbo-charged dataframe enhancement to R with some fantastic additions like fread() & fwrite() (whereas the tidyverse packages have pretty much evolved to form a new language in their own right) .

I had a similar experience recently of having to combine many csv files. I started with read_csv() until I found errors because the files had a variable number of header rows to be skipped. fread() dealt with this without any issues.

As for dtplyr, I would only use it as a help to learn or check data.table syntax.

jrkrideau · November 9, 2023, 12:04am

Yes! I find myself getting rapidly frustrated if I have to deal with a data.frame or tibble.

On the other hand, one really needs to wrench one's thinking around after years of base R.

technocrat · November 9, 2023, 1:59am

My experience was going from {tidyverse} to {base} made it much easier to finally crack {data.table}. Looking back over the approximately 6 month experience, I realized that {data.table} and {base} are far less syntax intensive over all.

system · November 30, 2023, 1:59am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.