Character Encoding within the tidyverse

pgensler · January 26, 2018, 2:35am

As I have been using R, I seem to bump into encoding issues more and more often, and I would like to help fix some of these within the tidyverse. However, I am not sure what the de facto standard is for handling these issues, or what the proper way to fix them is, and they can be very troublesome. Is rlang meant to help alleviate these issues from within R, or should one use stronger tools, like stringi, in a PR?

mara · January 29, 2018, 1:14pm

Good question, and I'm not sure of the answer. @jennybryan, do you know if there's a simple rule for this?

@pgensler, all of RStudio is about to be in one place at the same time for a week and change, and the lower/higher-language-fix questions are definitely on my list of things to investigate.

dpprdan · January 29, 2018, 2:49pm

As a Windows R user (I guess you are too?) I also run into encoding problems on a regular basis, and I have submitted the odd bug report regarding character encoding within the tidyverse, e.g. here and here. In my experience, most issues can be alleviated by either converting strings to UTF-8 with enc2utf8() or explicitly declaring a string as UTF-8 with Encoding(string) <- "UTF-8" (this covers most of the first and the second example above). Others require working around or fixing issues in base R, e.g.

printing of CP1252 characters in the 80-9F code point range (bug report now fixed in R-devel),
Characters garbled from sink() on Windows, or
the format bug

In general I'd say that the proper way to fix it depends on the issue. If you are unsure, open an issue report and discuss with the maintainer before making a pull request. Do you have any examples?
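To make the two fixes mentioned above concrete, here is a minimal base-R sketch. The strings are made up for illustration and are built from raw bytes so that the example does not depend on the encoding of the script file or on the session locale.

```r
# Case 1: the bytes are latin1 (CP1252-compatible) and we want UTF-8.
latin1_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))  # "caf" followed by 0xE9
x <- rawToChar(latin1_bytes)
Encoding(x) <- "latin1"        # declare what the bytes actually are
x_utf8 <- enc2utf8(x)          # re-encode: 0xE9 becomes 0xC3 0xA9
Encoding(x_utf8)               # "UTF-8"
charToRaw(x_utf8)              # 63 61 66 c3 a9

# Case 2: the bytes are already UTF-8 but the string is not marked as such.
utf8_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9))
y <- rawToChar(utf8_bytes)
Encoding(y) <- "UTF-8"         # only sets the mark, the bytes are unchanged
charToRaw(y)                   # 63 61 66 c3 a9
```

Note the difference: enc2utf8() changes the bytes, while Encoding<- only changes the declared encoding, so declaring the wrong encoding produces garbled output instead of fixing it.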

pgensler · January 29, 2018, 7:53pm

Yeah, issues like this are some of the pains I've dealt with:

github.com/r-dbi/odbc

odd encoding errors with IBM Informix

opened 02:14AM - 26 Jan 18 UTC

closed 03:27PM - 24 Apr 23 UTC

pgensler

bug reprex informix

### Issue Description and Expected Result

Example: `dbGetQuery()` returns a bad… error msg/truncated? I think this may be due to the length of one of the columns in the query, which truncates the error msg. I used the connection string from Access, so maybe this is a bad example?

### Database

IBM Informix 3.70

### Reproducible Example

```r
library(DBI)
library(odbc)
library(tidyverse)
con <- dbConnect(odbc::odbc(), .connection_string = "Dsn=my_prod_db;DRIVER={IBM INFORMIX ODBC DRIVER}; UID=my_user; DLOC=en_US.819; CLOC=en_US.CP1252; PRO=onsoctcp; SERV=my_server_name; SRVR=my_srvr; HOST=my_ip_addr; DATABASE=my_prod_db;")
#Encoding Errors using Infomix
mod <- dbSendQuery(conn = con, statement = "SELECT LIMIT 100 ean CONVERT(CHAR, ean, 20) FROM mod00")
Error in new_result(connection@ptr, statement) :
  nanodbc/nanodbc.cpp:1344: 42000: [Informix][Informix ODBC Driver][I ? ?? ??
mod <- dbSendQuery(conn = con, statement = "SELECT CAST(EAN AS VARCHAR 20) LIMIT 100 * FROM mod00")
Error in new_result(connection@ptr, statement) :
  nanodbc/nanodbc.cpp:1344: 42000: [Informix][Informix ODBC Driver][I? ?????????????????????????????????
o
```

<details>
<summary>Session Info</summary>

```r
> devtools::session_info()
Session info ----------------------------------------------------------------------------------------------------------------------
 setting  value
 version  R version 3.4.3 (2017-11-30)
 system   x86_64, mingw32
 ui       RStudio (1.1.383)
 language (EN)
 collate  English_United States.1252
 tz       America/Chicago
 date     2018-01-25

Packages --------------------------------------------------------------------------------------------------------------------------
 package    * version date       source
 assertthat   0.2.0   2017-04-11 CRAN (R 3.4.3)
 backports    1.1.2   2017-12-13 CRAN (R 3.4.3)
 base       * 3.4.3   2017-12-06 local
 bindr        0.1     2016-11-13 CRAN (R 3.4.3)
 bindrcpp     0.2     2017-06-17 CRAN (R 3.4.3)
 bit          1.1-12  2014-04-09 CRAN (R 3.4.1)
 bit64        0.9-7   2017-05-08 CRAN (R 3.4.1)
 blob         1.1.0   2017-06-17 CRAN (R 3.4.3)
 broom        0.4.3   2017-11-20 CRAN (R 3.4.3)
 callr        1.0.0   2016-06-18 CRAN (R 3.4.3)
 cellranger   1.1.0   2016-07-27 CRAN (R 3.4.3)
 cli          1.0.0   2017-11-05 CRAN (R 3.4.3)
 clipr        0.4.0   2017-11-03 CRAN (R 3.4.3)
 colorspace   1.3-2   2016-12-14 CRAN (R 3.4.3)
 compiler     3.4.3   2017-12-06 local
 crayon       1.3.4   2017-09-16 CRAN (R 3.4.3)
 datasets   * 3.4.3   2017-12-06 local
 DBI        * 0.7     2017-06-18 CRAN (R 3.4.3)
 devtools     1.13.4  2017-11-09 CRAN (R 3.4.3)
 digest       0.6.14  2018-01-14 CRAN (R 3.4.3)
 dplyr      * 0.7.4   2017-09-28 CRAN (R 3.4.3)
 evaluate     0.10.1  2017-06-24 CRAN (R 3.4.3)
 forcats    * 0.2.0   2017-01-23 CRAN (R 3.4.3)
 foreign      0.8-69  2017-06-22 CRAN (R 3.4.3)
 ggplot2    * 2.2.1   2016-12-30 CRAN (R 3.4.3)
 glue         1.2.0   2017-10-29 CRAN (R 3.4.3)
 graphics   * 3.4.3   2017-12-06 local
 grDevices  * 3.4.3   2017-12-06 local
 grid         3.4.3   2017-12-06 local
 gtable       0.2.0   2016-02-26 CRAN (R 3.4.3)
 haven        1.1.1   2018-01-18 CRAN (R 3.4.3)
 hms          0.4.0   2017-11-23 CRAN (R 3.4.3)
 htmltools    0.3.6   2017-04-28 CRAN (R 3.4.3)
 httr         1.3.1   2017-08-20 CRAN (R 3.4.3)
 jsonlite     1.5     2017-06-01 CRAN (R 3.4.3)
 knitr        1.18    2017-12-27 CRAN (R 3.4.3)
 lattice      0.20-35 2017-03-25 CRAN (R 3.4.3)
 lazyeval     0.2.1   2017-10-29 CRAN (R 3.4.3)
 lubridate    1.7.1   2017-11-03 CRAN (R 3.4.3)
 magrittr     1.5     2014-11-22 CRAN (R 3.4.3)
 memoise      1.1.0   2017-04-21 CRAN (R 3.4.3)
 methods    * 3.4.3   2017-12-06 local
 mnormt       1.5-5   2016-10-15 CRAN (R 3.4.1)
 modelr       0.1.1   2017-07-24 CRAN (R 3.4.3)
 munsell      0.4.3   2016-02-13 CRAN (R 3.4.3)
 nlme         3.1-131 2017-02-06 CRAN (R 3.4.3)
 odbc       * 1.1.4   2018-01-10 CRAN (R 3.4.3)
 parallel     3.4.3   2017-12-06 local
 pillar       1.1.0   2018-01-14 CRAN (R 3.4.3)
 pkgconfig    2.0.1   2017-03-21 CRAN (R 3.4.3)
 plyr         1.8.4   2016-06-08 CRAN (R 3.4.3)
 psych        1.7.8   2017-09-09 CRAN (R 3.4.3)
 purrr      * 0.2.4   2017-10-18 CRAN (R 3.4.3)
 R6           2.2.2   2017-06-17 CRAN (R 3.4.3)
 Rcpp         0.12.15 2018-01-20 CRAN (R 3.4.3)
 readr      * 1.1.1   2017-05-16 CRAN (R 3.4.3)
 readxl       1.0.0   2017-04-18 CRAN (R 3.4.3)
 reprex     * 0.1.1   2017-01-13 CRAN (R 3.4.3)
 reshape2     1.4.3   2017-12-11 CRAN (R 3.4.3)
 rlang        0.1.6   2017-12-21 CRAN (R 3.4.3)
 rmarkdown    1.8     2017-11-17 CRAN (R 3.4.3)
 rprojroot    1.3-2   2018-01-03 CRAN (R 3.4.3)
 rstudioapi   0.7     2017-09-07 CRAN (R 3.4.3)
 rvest        0.3.2   2016-06-17 CRAN (R 3.4.3)
 scales       0.5.0   2017-08-24 CRAN (R 3.4.3)
 stats      * 3.4.3   2017-12-06 local
 stringi      1.1.6   2017-11-17 CRAN (R 3.4.2)
 stringr    * 1.2.0   2017-02-18 CRAN (R 3.4.3)
 tibble     * 1.4.1   2017-12-25 CRAN (R 3.4.3)
 tidyr      * 0.7.2   2017-10-16 CRAN (R 3.4.3)
 tidyverse  * 1.2.1   2017-11-14 CRAN (R 3.4.3)
 tools        3.4.3   2017-12-06 local
 utils      * 3.4.3   2017-12-06 local
 whisker      0.3-2   2013-04-28 CRAN (R 3.4.3)
 withr        2.1.1   2017-12-19 CRAN (R 3.4.3)
 xml2         1.1.1   2017-01-24 CRAN (R 3.4.3)
 yaml         2.1.16  2017-12-12 CRAN (R 3.4.3)
```

</details>

and

https://forum.posit.co/t/split-uneven-length-vectors-to-columns-with-tidyr/
which are good examples. The second illustrates just how bad it gets, as you really need to share the byte sequence to see the issue, not just the string itself.
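As a small illustration of why the byte sequence matters more than the printed string, here is a sketch with two made-up strings (not taken from either linked issue) that typically render identically but are different underneath:

```r
# Two ways to write "é": one precomposed code point vs. "e" plus a
# combining acute accent. Both are valid UTF-8 and usually print the same.
a <- "\u00e9"
b <- "e\u0301"

a == b          # FALSE, even though they look alike on screen

# The raw bytes are what make the difference visible in a bug report.
charToRaw(a)    # c3 a9
charToRaw(b)    # 65 cc 81
```

Pasting the output of charToRaw() (or the code points from utf8ToInt()) into an issue lets a maintainer see the actual input, which the printed string alone cannot guarantee.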

I'm just curious if we should be using something like TERR, or if RStudio plans to implement its own flavor of R to solve these issues.

jennybryan · January 29, 2018, 8:13pm

With apologies for vagueness, I think the rule is "UTF-8 All The Things".

Now that doesn't tell a user or developer exactly what to do, but the overall current of development is to push everything towards UTF-8. I would love for us to develop and share guides at some point about how to implement this principle in, say, your own package. I need this guide myself! But that's just a goal at this point.

mara · January 29, 2018, 8:32pm

Re. the aforequoted:

That was in response to your listing:

#want to unnest list to chr vector
options are:
  -flatten()
  -unnest()
  -unlist()
  -squash()
  -anything in purrr ?

@mara is it possible to get some clarity around when we should be using the above functions for what, and when?

I then disambiguated those functions, after conferring with @hadley and @jennybryan…

I understand that said disambiguation wasn't the totality of your problem there, but I'm unsure as to why you're quoting that one line from me in re. this issue…

pgensler · January 30, 2018, 3:49am

@dpprdan Yes, for work, but I've run into issues where I end up having to use iconv before I can even import the file into R.
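For anyone reading along, the same kind of conversion can sometimes be done inside R with the base function iconv() once the lines have been read as raw text. This is only a sketch under assumptions: the file below is a throwaway temp file written with latin1 bytes, and a real file may need a different `from` encoding.

```r
# Write a tiny file containing latin1 bytes ("caf" + 0xE9 + newline).
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a)), path)

# Read the line as-is, then convert from the assumed source encoding.
raw_line <- readLines(path, warn = FALSE)
fixed <- iconv(raw_line, from = "latin1", to = "UTF-8")

charToRaw(fixed)   # 63 61 66 c3 a9
unlink(path)       # clean up the temp file
```

If the source encoding is guessed wrong, iconv() will still return something, so checking the bytes afterwards is a cheap sanity test.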

@mara I wasn't trying to quote you, just link to the thread. Should we be using specific packages to help in dealing with these? This package seems tremendously helpful (https://github.com/patperry/r-utf8), but adding that burden onto every other tidyverse package is no small undertaking, so what is the preferred solution? Build a new version of R and host it on GitHub, or build it into rlang? I would imagine there are enough complaints from different users that it may warrant making RStudio's own flavor of R, but I could be wrong.

cole · January 30, 2018, 1:57pm

It generally seems to me that a fork of R would be antithetical to RStudio's vision/mission of being a part of / supporting the R community. Forking can contribute to division (i.e. now there are two "masters") and can isolate development (especially when the change is as core as encoding). I expect that there are other, more collaborative ways that the goal will be realized. Then again, I mostly avoid Windows for this and many other reasons.

This comic makes the notion clear in a comedic way

More reading on the general problems with forking (although I must admit I have not read the articles completely):

https://mako.cc/writing/to_fork_or_not_to_fork.html