Encoding issues

Vangz · April 18, 2022, 7:22am

I am following the text mining tutorial in Introduction to corpus • corpus:

library("corpus")
palette(c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00", "#FFFF33"))
set.seed(0)
url <- "http://www.gutenberg.org/cache/epub/55/pg55.txt"
raw <- readLines(url, encoding = "UTF-8")

But when I inspect the raw object, it returns output like this:

raw[1]
[1] "\037\u008b\b\b\xff\u0086u_\002\xffpg55.txt.utf8.

raw[2]
[1] "�\xfb\xe7\xfed\u0087\xfa\u008f]�\xdf\xf0\xd7\xff\xdd\177\xfd\xef\xe7�=\xc6\b\xf8\xebŏvb\xf1\u0099\177\xfaG\xff\xccwg{\177\xd7\xe4Ǿ�c\xc4E\xf9�X\xcfo\017\xed\xd0\xd8W\177\xdfM\xd7\xd8/\xfbu,\xedOm3�\037\xf3\xde\xfeܭ\u009fl\xed~\xee�\xf5#?\xf7\u009ff\xeb\xc1\037\xc52\xfe\xdc\035w\xf6\xd9\037\xfb\xe3S{\xd5Kc\xf5\xfeЍ\xeb\xden��Qk\xd8\016C�\xda\xeb\031�\u0080?6�n�\xf80L9\001.\xd4\xf7\xe7\xc3\xea�\xe3'\xff�o.~\xf9M�\xdf\xf7�mE3.~hp\006Z.\xd9?aE?@\022\xf8r\xffԟ'\u008d\037\u008b\xf9a\u009a\032N\xcb\xe4\u0097\xd6r\xf78\xd9\xe0\027\u009f\u0087V�\xf5ϱ\u009e\177h��\xd3\xef\037�c\xc3c7"

raw[3]
[1] "W~ \u0096\023\033�\xf8�]\xf7\u0087v\xe4\xc3\xfe"

Is this an encoding issue? What encoding should I use to read the text properly?
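
A note on the output above: the first bytes of raw[1], \037 followed by \u008b, are the gzip magic number (0x1f 0x8b), and pg55.txt.utf8 sits where a gzip header stores the original file name. So this looks like a compressed response being read as text rather than a wrong encoding, in which case no value of the encoding argument would help. The sketch below is one possible way to handle it, not the tutorial's method. The helper name read_maybe_gzip is hypothetical, and a local temporary file stands in for the download so the example runs offline. Whether the server still returns gzip data for this URL is an assumption.

```r
# Hypothetical helper (not part of the corpus package): read text lines from a
# binary-mode connection whose content may or may not be gzip-compressed.
read_maybe_gzip <- function(con) {
  gz <- gzcon(con)      # decompresses gzip data; plain data passes through
  on.exit(close(gz))    # closing the wrapper also closes the wrapped connection
  readLines(gz, encoding = "UTF-8", warn = FALSE)
}

# Local demonstration: a gzip-compressed temporary file stands in for the
# downloaded text.
tmp <- tempfile(fileext = ".txt.gz")
out <- gzfile(tmp, "w")
writeLines(c("first line", "second line"), out)
close(out)

# Every gzip stream starts with the bytes 0x1f 0x8b. Misread as text, they
# print as "\037\u008b", which is what raw[1] shows in the question.
magic <- readBin(tmp, what = "raw", n = 2L)
print(magic)

demo <- read_maybe_gzip(file(tmp, "rb"))
print(demo)

# Assumed equivalent for the tutorial URL (needs network access, not run here):
# raw <- read_maybe_gzip(url("http://www.gutenberg.org/cache/epub/55/pg55.txt", "rb"))
```

The connection is opened in binary mode ("rb") because gzcon needs the bytes untouched. Since gzcon passes uncompressed input through by default, the same call should also work if the server sends plain text.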
