How can I load in a file with misplaced linebreaks?

RGIS · December 11, 2019, 7:43pm

I have a file that looks like this where someone in the original database added a key incorrectly with a return symbol. How can I read file in R that has a regular repeating LF or other line break in the wrong place? For the sake clarity, suppose I have the following reproduceable example:

example.txt =

ID|Event Date|Code Meaning|Code|Price \n
10020|2017-01-17|scotch tape|01|1.00 \n
10021|2017-02-17|rubber \n duck|02|3.99
10022|2018-03-17|rusty, nail 9000|03|5.99 \n
10023|2016-04-17|rubber \n duck|02|1.99 \n
10024|2015-02-17|scotch tape|01|2.00 \n

where every \n or character code "0x0A" is a line break. I would like to get a file that is like this....
ID| Date|Name|Code|Price \n

10020|2017-01-17|scotch tape|01|1.00 \n
10021|2017-01-17|rubber duck|02|3.99
10022|2017-01-17|rusty, nail 9000|03|5.99 \n
10023|2017-02-17|rubber duck|02| 1.99 \n
10024|2017-02-17|scotch tape|01|2.00 \n

so I can load in the data using something like tidyverse or readr so looks like this...

ID =c(10020,10021,10022,10023,10024)
Date=c("2017-01-17","2017-01-17","2017-01-17","2018-02-17","2017-01-17")
Name=c("scotch tape","rubber duck",rusty, nail 9000","rubber duck","scotch tape")
Price=c(1.00,3.99,5.99,1.99,2.00)
mydata <- data.frame(ID,Date,Name,Price)
View(mydata)

How can I do this in R?

FJCC · December 11, 2019, 9:04pm

There is probably some cool function out there that does this but I wrote a little code to do it. It might be painfully slow for a large data set.

library(stringr)
DF <- readLines("/home/fjcc/R/Play/Dummy.csv")
j <- 1
i <- 1
OutVec <- vector("character", length = length(DF))
tmp <- ""
while (DF[i] != "") {
  TestStr <- paste0(tmp, DF[i])
  if (str_count(TestStr, "\\|") == 4) {
    OutVec[j] <- TestStr
    tmp <- ""
    j <- j + 1
  } else {
    tmp <- TestStr
  }
  i <- i + 1
}

OutVec <- OutVec[OutVec != ""]
writeLines(OutVec, "/home/fjcc/R/Play/FixedFile.txt")
Fixed <- read.table("/home/fjcc/R/Play/FixedFile.txt", sep ="|", header = TRUE)
Fixed
#>      ID Event.Date     Code.Meaning Code Price
#> 1 10020 2017-01-17      scotch tape    1  1.00
#> 2 10021 2017-02-17     rubber  duck    2  3.99
#> 3 10022 2018-03-17 rusty, nail 9000    3  5.99
#> 4 10023 2016-04-17     rubber  duck    2  1.99
#> 5 10024 2015-02-17      scotch tape    1  2.00

^{Created on 2019-12-11 by the reprex package (v0.2.1)}

andresrcs · December 11, 2019, 9:52pm

Another approach could be to read the file as text and remove the unwanted characters with a regular expression, see this example (the regex might need fine-tuning for your actual application)

library(tidyverse)

raw_text <- read_file("example.txt")
raw_text <- str_remove_all(raw_text, "\\s\\\\n(?=\\s)")
df <- read_delim(raw_text, delim = "|")
df
#> # A tibble: 5 x 5
#>      ID `Event Date` `Code Meaning`   Code  Price
#>   <dbl> <date>       <chr>            <chr> <dbl>
#> 1 10020 2017-01-17   scotch tape      01     1   
#> 2 10021 2017-02-17   rubber duck      02     3.99
#> 3 10022 2018-03-17   rusty, nail 9000 03     5.99
#> 4 10023 2016-04-17   rubber duck      02     1.99
#> 5 10024 2015-02-17   scotch tape      01     2

system · January 1, 2020, 9:52pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.