RStudio changes umlaut characters ä to Ã¤, ö to Ã¶ etc. in old and new program files

faltinl · August 18, 2018, 7:23am

In the comments of my R programs I want to use everyday language. In German this requires umlaut characters such as ä, ö, ü etc. For about a month I have been observing the problem described in the topic line: old (>1 month) files exhibit this effect upon opening; in newly written ones I can type the correct character, and a few minutes later I notice that it has changed. Strangely enough, a few of the umlaut characters persist without being transformed into other character combinations.

What can I do about that?

I am using RStudio 1.1.383 together with R 3.4.3 on a notebook under Windows 10 (Build 1803).

jcblum · August 18, 2018, 6:03pm

I don’t know what’s causing this problem, but as a first pass: are you able to update to the latest version of RStudio (1.1.456), and if so, does the problem persist? I'd also be curious to know whether you see the same thing using the current Preview release, since it's using a new rendering engine.

Beyond that, I think it will be easier for others to troubleshoot with some more info:

1. The output of running sessionInfo()
2. Is there anything unusual about your installation of R (e.g., did you build it yourself, or install it in a customized fashion)?
3. Can you provide a small sample script where this problem has occurred? Ideally, it would include both characters that changed and ones that did not change.

faltinl · August 19, 2018, 8:59am

Hi,-
and thanks a lot for your extremely useful suggestions.

ad 1. > sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200) [omg - it has changed again!]

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ggplot2_2.2.1 keras_2.1.6.9002 tensorflow_1.8.0.9000
[4] reticulate_1.8.0.9000 RevoUtils_10.0.7 RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 whisker_0.3-2 magrittr_1.5 munsell_0.4.3
[5] colorspace_1.3-2 lattice_0.20-35 R6_2.2.2 rlang_0.1.6
[9] plyr_1.8.4 tools_3.4.3 grid_3.4.3 gtable_0.2.0
[13] tfruns_1.1 yaml_2.1.16 lazyeval_0.2.1 tibble_1.4.1
[17] Matrix_1.2-12 base64enc_0.1-3 zeallot_0.0.6 labeling_0.3
[21] compiler_3.4.3 pillar_1.0.1 scales_0.5.0 jsonlite_1.5

ad 2. Installation: I don't think there is much unusual about my installation. The only 'irregularity' I should mention is that the packages tensorflow, keras & reticulate, through some inattention, ended up installed in the Documents folder of my hard drive instead of one of the usual Program Files locations. They are working seamlessly, however.

ad 3. Have a look at the following two text examples:

# es genügt dzt. nicht, nur den u.U. sehr kurzen Prognosebereich für Pseq zu
# definieren: wenn nicht mindestens 1 Fall jeder Klasse auftritt, hat proga
# nicht mehr 3 Klassen und das Programm st?rzt [should read: stürzt] ab.

# bzw. ältere (max 20 Rdn) Gewichte bzw.
# beim 1. Mal die Gewichte vom Zeitpunkt der ersten Initialisierung als
# Initialisierung der nÃ¤chsten [should read: nächsten] Runde...

In the meantime I have experimented with these examples in the following way: I copied the text from the original program, shown above, into two separate R files. I saved the first one (A) with encoding ISO-8859-1 (presently indicated as my system default) and the second one (B) with UTF-8.

Upon reopening these two files under the unchanged system default setting, (A) showed no change, while in (B) the initially correct characters ü and ä had turned into the same strange character combinations that appeared earlier in the text.
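The behaviour of file (B) can be reproduced in a few lines of R. This is only a minimal sketch of the presumed mechanism (UTF-8 bytes being interpreted as Windows-1252, the code page shown in the locale output above); it does not prove that this is what RStudio did here. The variable names are made up for the illustration, and the words are built from code points so the script itself contains no umlauts.

```r
# "nächsten" built from Unicode code points (228 is a-umlaut),
# so this script stays pure ASCII.
original <- intToUtf8(c(110, 228, 99, 104, 115, 116, 101, 110))

# The UTF-8 bytes of the word: the a-umlaut is stored as two bytes, c3 a4.
utf8_bytes <- charToRaw(original)

# Misread those bytes as Windows-1252: every byte becomes its own character,
# so c3 a4 turns into two separate characters.
misread <- iconv(rawToChar(utf8_bytes), from = "CP1252", to = "UTF-8")
print(misread)

# The opposite mix-up: "stürzt" stored as ISO-8859-1 holds the u-umlaut as the
# single byte fc, which is not a valid UTF-8 sequence on its own.
latin1_bytes <- as.raw(c(0x73, 0x74, 0xfc, 0x72, 0x7a, 0x74))
is_valid <- validUTF8(rawToChar(latin1_bytes))
print(is_valid)
```

On a console that can display UTF-8, the first result should look like the "nÃ¤chsten" in the sample above. The second case is one plausible explanation for "st?rzt": a byte that cannot be decoded gets replaced by a placeholder.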

Conclusions:
Apparently, at some point (see remarks below), the default encoding on my system temporarily changed from its previous state to some other state (perhaps UTF-8, but I don't know for sure). When I worked with previously generated program files, the umlaut characters were converted into other character combinations; I corrected only some of them and saved the files again; this was repeated several times. Thus, a mixture of erroneous and correct characters was produced. This applies even to "new" files, which were invariably created from parts of copies of old ones and just contain different versions of a program under development. Thus, several files got infected with the character mixture.
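For files that already contain such a mixture, a line-by-line repair might look like the sketch below. It assumes the damage came from UTF-8 text being read as Windows-1252 and saved again; fix_mojibake is a hypothetical helper name, not a function from R or RStudio. A line is changed only if reversing that misreading yields valid UTF-8, so lines with correct umlauts are left alone.

```r
# Hypothetical helper: undo "UTF-8 read as Windows-1252" damage, line by line.
fix_mojibake <- function(x) {
  x <- enc2utf8(x)
  # Map each character back to the single byte it has in Windows-1252.
  # iconv() returns NA where a character has no such byte.
  as_bytes <- iconv(x, from = "UTF-8", to = "CP1252", mark = FALSE)
  # Accept the result only if those bytes form valid UTF-8. A correct umlaut
  # becomes a lone high byte, which is invalid, so that line stays untouched.
  ok <- !is.na(as_bytes) & validUTF8(as_bytes)
  fixed <- as_bytes[ok]
  Encoding(fixed) <- "UTF-8"
  x[ok] <- fixed
  x
}

# Test data built from code points so this script stays pure ASCII.
damaged <- intToUtf8(c(110, 195, 164, 99, 104, 115, 116, 101, 110))  # damaged "nächsten"
correct <- intToUtf8(c(228, 108, 116, 101, 114, 101))                # correct "ältere"

repaired <- fix_mojibake(c(damaged, correct, "Gewichte"))
print(repaired)
```

Characters that were already replaced by "?" (as in "st?rzt") cannot be recovered this way, because the original byte is gone; those still need correcting by hand.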

The root cause of the problem is still unclear. However, it might be that a complete system crash of my computer following a forced Windows 10 update on July 25th produced this, as it did with numerous other parameter settings. The system had to be set up completely anew, as none of my backups could be restored; only data files were recovered. This would explain the one-month period over which I have been observing the reported problems.

I haven't updated RStudio yet but will do that soon - thanks for the reminder anyway!

Regards
Leo

Note added a few minutes after posting the above:
Reading the last reply (7 May) to the post "Unicode replacement character (�) issue in RStudio only within R markdown files", I should perhaps wait with the proposed RStudio update, shouldn't I?-)

jcblum · August 19, 2018, 10:53am

I’m not sure whether that person was talking about the Preview release or a Daily build, all of which are newer now than when that was posted a few months ago. So to me it’s still unresolved whether these are the same problem or not.

Your system reinstall does seem like it could be part of the problem (text encoding on Windows is challenging already). I’m not sure what to do about it going forward, though! Hoping someone else with deeper understanding of the relevant system settings that could be in conflict here will drop by.

faltinl · August 19, 2018, 12:35pm

Hi,-

yes, true, a reply from May 7th is already a long time ago!-) So, perhaps I should risk an RStudio update nevertheless...

Be that as it may, in the meantime I have corrected all wrong umlaut characters in the program I am currently working on, saved it "with encoding ISO-8859-1" and opened it again: no wrong characters appear, neither old nor new ones!
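The same round trip can be checked from R itself, independent of the editor. This is a rough sketch using a temporary file and made-up variable names: the text is written as ISO-8859-1 bytes and read back with the matching encoding declared, which mirrors saving with an explicit encoding and reopening with the same one.

```r
# A comment line containing an a-umlaut (code point 228), built in pure ASCII.
line <- paste0("# n", intToUtf8(228), "chsten Runde")

# Write the line as ISO-8859-1 bytes; useBytes = TRUE stops R from
# re-encoding the string on the way out.
path <- tempfile(fileext = ".R")
latin1_line <- iconv(line, from = "UTF-8", to = "latin1", mark = FALSE)
writeLines(latin1_line, path, useBytes = TRUE)

# Read it back, declaring the encoding the file was written in.
read_back <- readLines(path, encoding = "latin1")
round_trip_ok <- identical(enc2utf8(read_back), line)
print(round_trip_ok)

# Declaring the wrong encoding leaves bytes that are not valid UTF-8.
wrong <- readLines(path, encoding = "UTF-8")
print(validUTF8(wrong))
```

The point of the sketch is that the file is fine as long as the encoding used for reading matches the one used for writing; which of the two encodings is chosen matters less than using it consistently.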

At the moment, that's all I want - at least regarding operational aspects of my programming work. Now, back to the more interesting topics...

Thanks a lot for your support, showing clearly how useful the right questions at the right time may be!