kbzsl
May 7, 2019, 5:20am
1
Hi,
I am trying to import UTF-16LE encoded files (and later convert and process them). After some troubleshooting I found that read_lines_raw() does not handle CRLF line endings in UTF-16LE files correctly.
> iconv("ab\r\n12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]]
[1] 61 00 62 00 0d 00 0a 00 31 00 32 00
> readr::read_lines_raw(iconv("ab\r\n12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]])
[[1]]
[1] 61 00 62 00
[[2]]
[1] 00
[[3]]
[1] 00 31 00 32 00
Unfortunately, a separator argument cannot be used with read_lines_raw().
Is this by design or a bug?
Do you have any idea for a workaround?
Thank you.
mara
May 7, 2019, 10:32am
2
What is your expected output?
From the readr read_lines_raw() docs:
read_lines_raw() produces a list of raw vectors, and is useful for handling data with unknown encoding.
Also, if you wouldn't mind running examples through reprex in the future, it makes it a bit easier for others to work with your code (since you can just copy and paste it directly)!
iconv("ab\r\n12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]]
#> [1] 61 00 62 00 0d 00 0a 00 31 00 32 00
readr::read_lines_raw(iconv("ab\r\n12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]])
#> [[1]]
#> [1] 61 00 62 00
#>
#> [[2]]
#> [1] 00
#>
#> [[3]]
#> [1] 00 31 00 32 00
Created on 2019-05-07 by the reprex package (v0.2.1.9000)
kbzsl
May 7, 2019, 10:57am
3
Sorry, I am not familiar with reprex. It's on my backlog to learn it. I tried to compile an easily reproducible example instead.
The example contains two lines, "ab" (61 00 62 00) and "12" (31 00 32 00), separated by a CRLF (0d 00 0a 00), not just a CR, in UTF-16LE encoding (shown in raw format).
[1] 61 00 62 00 0d 00 0a 00 31 00 32 00
The expected result is a list of 2 raw vectors (one per line), not 3.
[[1]]
[1] 61 00 62 00
[[2]]
[1] 31 00 32 00
I used read_lines_raw() to avoid any encoding issues with UTF-16LE (and I read through the documentation before posting this topic).
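If decoded text lines are all that is needed, one possible route is to let a base R connection do the re-encoding, which is the same mechanism behind the `fileEncoding` argument of `read.delim()`. The sketch below is only an illustration under assumptions: it writes the example bytes to a temporary file (the file and the variable names `bytes`, `path`, `con`, and `lines` are made up for the demo) and reads them back through `file(..., encoding = "UTF-16LE")`. It does not return raw vectors, so it only helps when text output is acceptable.

```r
# Example bytes from the post: "ab", CRLF, "12" in UTF-16LE
bytes <- as.raw(c(0x61, 0x00, 0x62, 0x00,
                  0x0d, 0x00, 0x0a, 0x00,
                  0x31, 0x00, 0x32, 0x00))

# Hypothetical temporary file standing in for the real input file
path <- tempfile(fileext = ".txt")
writeBin(bytes, path)

# Let the connection convert from UTF-16LE while reading
con <- file(path, open = "r", encoding = "UTF-16LE")
lines <- readLines(con, warn = FALSE)  # warn = FALSE: no final newline in the demo data
close(con)
unlink(path)

print(lines)
```

Whether this behaves the same on every platform depends on the iconv implementation R was built with, so it is worth checking on the real files.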
mara
May 7, 2019, 11:37am
4
Possibly related issues (though none seems to match exactly, so you might consider filing a new one):
opened 10:45AM - 03 Nov 15 UTC
closed 04:40PM - 20 May 21 UTC
feature
multibyte
Hi, I am trying to read in a file with UTF-16LE encoding
which can be done with base package codes
``` R
df <- read.delim(file1, stringsAsFactors = FALSE, fileEncoding = 'UTF-16LE')
```
but when I try to use readr to do the same
``` R
df <- read_tsv(file1, locale = locale(encoding = 'UTF-16LE'))
```
I got the error **Error: Incomplete multibyte sequence**
Can you please help fix it? Thanks for your advice!
opened 07:45PM - 28 Feb 19 UTC
closed 03:09PM - 06 May 21 UTC
bug
Windows newlines `\r\n` are treated as two new lines when `skip_empty_rows = FALSE`.
``` r
library(readr)
# This is the output I would expect.
read_csv("foo\n\nbar", col_names = FALSE, skip_empty_rows = FALSE)
#> # A tibble: 3 x 1
#> X1
#> <chr>
#> 1 foo
#> 2 <NA>
#> 3 bar
# I would expect the output to be the same as above.
read_csv("foo\r\n\r\nbar", col_names = FALSE, skip_empty_rows = FALSE)
#> # A tibble: 4 x 1
#> X1
#> <chr>
#> 1 foo
#> 2 <NA>
#> 3 <NA>
#> 4 bar
```
Created on 2019-02-28 by the [reprex package](https://reprex.tidyverse.org) (v0.2.0.9000).
<details>
<summary>Session info</summary>
``` r
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#> setting value
#> version R version 3.5.2 (2018-12-20)
#> os Arch Linux
#> system x86_64, linux-gnu
#> ui X11
#> language
#> collate en_NZ.UTF-8
#> ctype en_GB.UTF-8
#> tz Europe/London
#> date 2019-02-28
#>
#> ─ Packages ──────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.0)
#> backports 1.1.3 2018-12-14 [1] CRAN (R 3.5.2)
#> callr 3.1.1 2018-12-21 [1] CRAN (R 3.5.2)
#> cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.1)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.0)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.0)
#> devtools 2.0.1.9000 2019-01-28 [1] Github (r-lib/devtools@e4e57aa)
#> digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.1)
#> evaluate 0.12 2018-10-09 [1] CRAN (R 3.5.1)
#> fansi 0.4.0 2018-11-09 [1] Github (brodieG/fansi@ab11e9c)
#> fs 1.2.6 2018-08-23 [1] CRAN (R 3.5.2)
#> glue 1.3.0.9000 2019-01-28 [1] Github (tidyverse/glue@8188cea)
#> highr 0.7 2018-06-09 [1] CRAN (R 3.5.1)
#> hms 0.4.2.9001 2019-02-28 [1] Github (tidyverse/hms@16ff76e)
#> htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.0)
#> knitr 1.21 2018-12-10 [1] CRAN (R 3.5.1)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.0)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.0)
#> nvimcom * 0.9-75 2019-01-03 [1] local
#> pillar 1.3.1.9000 2019-01-23 [1] Github (r-lib/pillar@3a54b8d)
#> pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.5.1)
#> pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.1)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.1)
#> prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.0)
#> processx 3.2.1 2018-12-05 [1] CRAN (R 3.5.1)
#> ps 1.3.0 2018-12-21 [1] CRAN (R 3.5.2)
#> R6 2.4.0 2019-02-14 [1] CRAN (R 3.5.2)
#> Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.2)
#> readr * 1.3.1.9000 2019-02-28 [1] Github (tidyverse/readr@b7e0b99)
#> remotes 2.0.2 2018-10-30 [1] CRAN (R 3.5.2)
#> rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.2)
#> rmarkdown 1.11 2018-12-08 [1] CRAN (R 3.5.1)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.1)
#> stringi 1.3.1 2019-02-13 [1] CRAN (R 3.5.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.5.2)
#> testthat 2.0.1 2018-10-13 [1] CRAN (R 3.5.2)
#> tibble 2.0.1.9001 2019-02-28 [1] Github (tidyverse/tibble@92f5604)
#> usethis 1.4.0 2018-08-14 [1] CRAN (R 3.5.1)
#> utf8 1.1.4 2018-05-24 [1] CRAN (R 3.5.0)
#> withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.0)
#> xfun 0.4 2018-10-23 [1] CRAN (R 3.5.1)
#> yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.1)
#>
#> [1] /home/nacnudus/R/x86_64-pc-linux-gnu-library/3.5
#> [2] /usr/lib/R/library
```
</details>
kbzsl
May 7, 2019, 12:30pm
5
Thank you for your answer.
I had already checked that ticket (and some others). Initially I was not sure they were related, because I was reading in raw (hex) format and did not expect the encoding to be parsed during reading (read_lines_raw vs read_lines).
But testing the different end-of-line separators (CRLF, CR, and LF) clearly shows that they are not parsed as 2-byte units: in each case the last raw vector starts with 00.
> raw_crlf = iconv("ab\r\n12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]]
> raw_cr = iconv("ab\r12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]]
> raw_lf = iconv("ab\n12",from="UTF-8",to="UTF-16LE", toRaw = TRUE)[[1]]
>
> readr::read_lines_raw(raw_crlf)
[[1]]
[1] 61 00 62 00
[[2]]
[1] 00
[[3]]
[1] 00 31 00 32 00
> readr::read_lines_raw(raw_cr)
[[1]]
[1] 61 00 62 00
[[2]]
[1] 00 31 00 32 00
> readr::read_lines_raw(raw_lf)
[[1]]
[1] 61 00 62 00
[[2]]
[1] 00 31 00 32 00
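The misalignment can also be seen directly on the bytes. The following is only a sketch of a byte-wise view, under the assumption that the terminators are matched one byte at a time; it is not readr's actual implementation, and the variable names are made up for illustration.

```r
# "ab", CRLF, "12" in UTF-16LE, written out byte by byte
raw_crlf <- as.raw(c(0x61, 0x00, 0x62, 0x00,
                     0x0d, 0x00, 0x0a, 0x00,
                     0x31, 0x00, 0x32, 0x00))

# Positions of the single bytes 0d and 0a
cr_pos <- which(raw_crlf == as.raw(0x0d))  # 5
lf_pos <- which(raw_crlf == as.raw(0x0a))  # 7

# Both matches sit on the low byte of a 2-byte code unit, so the
# 00 high bytes are left behind when only one byte is consumed.
between <- raw_crlf[(cr_pos + 1):(lf_pos - 1)]      # 00
after <- raw_crlf[(lf_pos + 1):length(raw_crlf)]    # 00 31 00 32 00

print(between)
print(after)
```

The leftover pieces match the second and third raw vectors in the CRLF output above.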
I assume this is connected to the multibyte issue.
In the meantime (until multibyte support is implemented), do you have any idea for a workaround?
Thank you.
mara
May 7, 2019, 2:33pm
6
I don't, but hopefully someone else will! You might take a look at the iotools package, though I'm not sure if readAsRaw() will fit your use case.
https://CRAN.R-project.org/package=iotools
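Another possible workaround, staying in base R, is to split the raw vector on 2-byte code units instead of single bytes. The sketch below is an assumption-based illustration, not part of readr or iotools: `split_utf16le_lines()` is a hypothetical helper name, it expects UTF-16LE input with an even number of bytes, it treats CRLF, a lone CR, and a lone LF as line terminators, and it does not strip a byte order mark.

```r
# Hypothetical helper: split a UTF-16LE raw vector into lines,
# looking at 16-bit code units rather than single bytes.
split_utf16le_lines <- function(x) {
  stopifnot(is.raw(x), length(x) %% 2 == 0)
  n <- length(x) / 2
  if (n == 0) return(list())

  # Little-endian: low byte first, then high byte
  lo <- as.integer(x[c(TRUE, FALSE)])
  hi <- as.integer(x[c(FALSE, TRUE)])
  units <- hi * 256L + lo

  is_cr <- units == 13L
  is_lf <- units == 10L

  # A line ends at an LF, or at a CR that is not followed by an LF
  next_is_lf <- c(is_lf[-1], FALSE)
  term_end <- is_lf | (is_cr & !next_is_lf)

  # Line number of each code unit, and the total number of lines
  line_id <- 1L + cumsum(c(FALSE, term_end[-n]))
  n_lines <- sum(term_end) + as.integer(!term_end[n])

  # Keep only content units, grouped by line (empty lines are preserved)
  keep <- !(is_cr | is_lf)
  idx <- split(which(keep), factor(line_id[keep], levels = seq_len(n_lines)))

  # Map code-unit indices back to byte indices
  lapply(unname(idx), function(i) x[sort(c(2L * i - 1L, 2L * i))])
}

raw_crlf <- as.raw(c(0x61, 0x00, 0x62, 0x00,
                     0x0d, 0x00, 0x0a, 0x00,
                     0x31, 0x00, 0x32, 0x00))
res <- split_utf16le_lines(raw_crlf)
print(res)

# The raw lines can then be decoded, since iconv() accepts a list of raw vectors
txt <- iconv(res, from = "UTF-16LE", to = "UTF-8")
```

For the example bytes this gives a list of 2 raw vectors, which is the expected result described earlier in the thread. For large files a vectorised approach like this holds several integer vectors the size of the input in memory, so it may need chunking.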