with the error "Caused by error in codes(): nope, isn't working forčšž" (sorry, there's a missing space there). So it is failing on the third row, but I don't know how to debug this any further.
NB: I have to use the Unicode escape codes because R CMD check won't allow non-ASCII characters in the package.
How do you create these strings? My guess is that they are in the native encoding, which is latin1 on R 4.1.x Windows that you are using on GHA. You are probably using R 4.2.x locally, which is UTF-8 by default.
I think a good fix is to create UTF-8 strings, even on non-UTF-8 platforms. If that's possible.
The strings were manually entered into Excel, and I am using the same R version (4.1.2) both locally and on the GitHub runners. But it would make sense that this is an issue with the Excel encoding. I can try checking that, thanks!
Well, that does not necessarily mean that the strings that you read from Excel are UTF-8 in R. So please check if they are, both locally and on GitHub.
I'm not sure this is what you meant, but I added two print statements to the test suite and they print out the following - reminder, the character vector is "Test", "Tešt", "čšž" - (and you can see the action on gh here):
declared encodings with Encoding():
[1] "unknown" "UTF-8" "UTF-8"
detected encodings with stringi:
[[1]]
Encoding Language Confidence
1 ISO-8859-1 en 0.60
2 ISO-8859-2 ro 0.60
3 UTF-8 0.15
4 UTF-16BE 0.10
5 UTF-16LE 0.10
[[2]]
Encoding Language Confidence
1 UTF-8 0.8
2 UTF-16BE 0.1
3 UTF-16LE 0.1
4 GB18030 zh 0.1
5 EUC-JP ja 0.1
6 EUC-KR ko 0.1
7 Big5 zh 0.1
[[3]]
Encoding Language Confidence
1 windows-1252 no 0.85
2 UTF-8 0.80
3 UTF-16BE 0.10
4 UTF-16LE 0.10
5 Shift_JIS ja 0.10
6 GB18030 zh 0.10
7 Big5 zh 0.10
Error: Error: R CMD check found ERRORs
Was that what you meant @Gabor or is there some other way I can check?
Yeah, that's what I meant. So the declared encoding is UTF-8, but the strings might not be UTF-8. You can use stringi::stri_enc_isutf8() to check whether they really are. Encoding detection is ambiguous: the same byte sequence can be valid in multiple encodings.
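To make the difference between the declared and the actual encoding concrete, here is a minimal sketch. It uses base R's validUTF8() as a stand-in for stringi::stri_enc_isutf8() so that it has no dependencies, and the mislabelled string `bad` is a made-up example, not something from the package in question.

```r
# The three test strings from the thread, written with \u escapes
x <- c("Test", "Te\u0161t", "\u010d\u0161\u017e")

declared <- Encoding(x)   # the mark R stores alongside each string
valid    <- validUTF8(x)  # whether the bytes really form valid UTF-8

# A deliberately mislabelled string (hypothetical): the byte 0x9a is "š"
# in windows-1250, but on its own it is not a valid UTF-8 sequence
bad <- rawToChar(as.raw(c(0x54, 0x65, 0x9a, 0x74)))
Encoding(bad) <- "UTF-8"        # changes only the mark, not the bytes
bad_valid <- validUTF8(bad)     # FALSE: marked UTF-8, but not valid UTF-8

print(declared)
print(valid)
print(bad_valid)
```

The mark is just a label; Encoding<- never converts the bytes, which is how a string can be "declared" UTF-8 without actually being UTF-8.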
If they are not UTF-8, then the question is why aren't they. If they are marked UTF-8, but they are not in fact UTF-8, that certainly seems like a bug.
Btw. I don't really understand how this is different locally and on GH. I see it locally as well.
OK, I've added the check if it's UTF-8 and I get [1] TRUE TRUE TRUE both locally and on gh-actions.
I'm getting more and more confused. One thing is the declared encoding, but what do you mean by the actual encoding? I understand that if there is no declared encoding there is ambiguity, but where are these declarations happening (or getting lost)?
And what do you mean when you say you see it locally as well, you mean the test is also failing for you locally?
Just to reiterate, all the declared and detected encoding outputs are the same for me locally as well as on gh-actions, but the test is passing locally and failing on gh-actions.
Aha, I may be on to something: I also tried changing the R version on the runner to 4.2.2, since 4.2 is meant to support native UTF-8 encoding (Upcoming Changes in R 4.2 on Windows - The R Blog) and now the check is failing for a different reason!!
My guess is that you don't see it, because your native encoding is different.
More importantly, my other guess is that the issue is that switch() (or parse() when your file is parsed?) converts the strings into the native encoding, and that is different for you than the one on GitHub. (And I have the same native encoding as GitHub.)
Of course this means (assuming I am right) that you cannot use non-ASCII strings in switch(). You'll need to use the old school if (...) ... else if (...) ... else ... construct, or match() or something else.
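A rough sketch of the if / else if alternative, assuming the same three test strings as above. The function name describe_string() and the returned labels are hypothetical. The point is that `==` compares strings after translating them to UTF-8, so it should not depend on the native encoding the way argument names in switch() do.

```r
# Hypothetical replacement for a switch() with non-ASCII case labels
describe_string <- function(x) {
  x <- enc2utf8(x)  # normalise the input; a no-op for UTF-8 marked strings
  if (x == "Test") {
    "ascii"
  } else if (x == "Te\u0161t") {
    "one non-ASCII character"
  } else if (x == "\u010d\u0161\u017e") {
    "all non-ASCII"
  } else {
    "unknown"
  }
}

print(describe_string("\u010d\u0161\u017e"))
```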
You are right about the switch being the culprit (or at least one of them). If I use a character that is not in the 1250 codepage, the switch doesn't work, but an if/else does.
But this is still so confusing. I'm going to check what the encoding is meant to be on the GitHub runner, but how do I change it? I mean, switching to R 4.2 changed something, but I don't even understand what.
And if 4.2 supports native UTF-8, why do I get "unable to translate 'Te<U+0161>t' to native encoding"? Why couldn't it translate <U+0161> to UTF-8?
And how am I supposed to solve this? I need non-ASCII characters in my code, and I want to run R CMD check on GitHub Actions. Surely these two things cannot be incompatible?
The actual switch that I have has 16 cases, and I'm not keen on rewriting it as an if/else chain.
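For checking what the native encoding of a session actually is (locally or on a runner), something along these lines could be added to the test suite. This is only a sketch: the helper name describe_native_encoding() is made up, and the codepage field is only present on Windows.

```r
# Hypothetical helper that summarises the session's native encoding
describe_native_encoding <- function() {
  info <- l10n_info()
  list(
    utf8     = isTRUE(info[["UTF-8"]]),   # TRUE if the native encoding is UTF-8
    latin1   = isTRUE(info[["Latin-1"]]), # TRUE if it is Latin-1
    codepage = info[["codepage"]],        # Windows only, NULL elsewhere
    lc_ctype = Sys.getlocale("LC_CTYPE"), # locale that determines the encoding
    # What a UTF-8 string turns into when converted to the native encoding;
    # characters that cannot be represented show up as <U+xxxx>
    native   = enc2native("Te\u0161t")
  )
}

res <- describe_native_encoding()
str(res)
```

Printing this on each runner shows directly whether the session that runs the tests is UTF-8 or not, rather than guessing from the R version.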
I am sorry to say, but you can't use switch() with non-ASCII strings, even if you find a workaround on GHA, because your users might have different native encodings. You can maybe use match().
OK, I have rewritten the function to use match instead of switch, and it is now passing, thanks for that Gabor.
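For readers with a similar many-case switch(), a match()-based lookup might look roughly like this. The keys, values and the function name lookup_code() are placeholders, not the actual code from the package. match() treats strings in different encodings as equal if they agree once translated to UTF-8, which is why it avoids the native-encoding problem.

```r
# Placeholder lookup table: keys written with \u escapes, so they are UTF-8
keys   <- c("Test", "Te\u0161t", "\u010d\u0161\u017e")
values <- c("first", "second", "third")

# Hypothetical vectorised replacement for a switch() with a default branch
lookup_code <- function(x, default = NA_character_) {
  idx <- match(enc2utf8(x), keys)  # position of each input in keys, NA if absent
  out <- values[idx]
  out[is.na(idx)] <- default       # plays the role of switch()'s fallthrough
  out
}

print(lookup_code("\u010d\u0161\u017e"))
print(lookup_code(c("Test", "nope"), default = "none"))
```

A side benefit over switch() is that the lookup is vectorised, and a 16-case table is just two vectors of length 16.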
Still, this leaves me unsatisfied: all the runners on GHA had UTF-8 as their local encoding, but it was only the Windows one where the code didn't work. And I still don't understand why.
Well, your test does pass on R 4.2.x Windows, there is only a warning coming from R CMD check, and that probably happens because R CMD check runs that particular check in the C locale, so only ASCII function argument names are allowed. I haven't really checked this in depth, but I am fairly sure that this is what happens. E.g. my native encoding is UTF-8 by default, so this works:
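A sketch of the UTF-8 case, using the same call as in the C locale run and guarded so that it only executes when the session's native encoding really is UTF-8:

```r
# Only meaningful when the native encoding is UTF-8 (an assumption of this sketch)
if (isTRUE(l10n_info()[["UTF-8"]])) {
  res <- switch("\u010d", "\u010d" = "match", "not match")
  print(res)  # expected to be "match" in a UTF-8 session
}
```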
But if I switch to the C locale, then it does not:
❯ LANG=C R -q -e 'switch("\u010d", "\u010d" = "match", "not match")'
> switch("\u010d", "\u010d" = "match", "not match")
[1] "not match"
Warning message:
unable to translate '<U+010D>' to native encoding
Whether this is a bug in R CMD check, I am not sure. Once we only need to support UTF-8 systems (if that ever happens), I guess we can "fix" R CMD check to run in a UTF-8 locale.