Hi, I'm trying to do some analysis on my Facebook Messages data, but I'm running into some encoding issues. Facebook lets you download your data as .json files. I want to analyze some of the emojis and text used, but I'm having difficulty converting them. For example, in the raw .json file, I see:
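The exact strings vary from message to message, but here's the pattern, recreated with a made-up example since I can't share my real messages:

```r
library(jsonlite)

# made-up example reproducing what I see: four \u00XX escapes
# in the raw file where a single emoji should be
x <- fromJSON('["\\u00f0\\u009f\\u0098\\u008a"]')
print(x)  # "ð\u009f\u0098\u008a" -- garbled characters, not an emoji
```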
From what I read in a similar post, the file is technically valid JSON, since \uXXXX escapes are allowed in strings. The problem is that Facebook writes each raw UTF-8 byte as its own \u00XX escape, so a parser decodes them as individual Latin-1 characters instead of the intended emoji. That's why it's not loading properly.
Maybe you can run a find-and-replace over those escape sequences in the files to turn them back into proper characters before loading.
Thanks for the link. Since I don't control how these .json files are created, and there are many of them, I guess I'd need to write a bash script to loop through them all and do the find/replace? Or is there a way I can do this processing within R?
You could do this in R (see the sketch below), but I think the easier solution might be to change the export itself. How did you download it? Is there any option to choose a different format or encoding when downloading?
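If the download options don't pan out, here's a minimal sketch of the R route; the folder name is a guess, so point it at wherever your export lives:

```r
# loop over every file in the export and apply whatever
# find/replace or re-encoding fix you settle on
files <- list.files("messages", pattern = "\\.json$",
                    recursive = TRUE, full.names = TRUE)
for (f in files) {
  txt <- readLines(f, warn = FALSE, encoding = "UTF-8")
  # ... fix txt here ...
  writeLines(txt, f, useBytes = TRUE)
}
```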
It looks like Facebook gives you the option of JSON or HTML for an export, but that's about it. I'm surprised a company like that would produce such badly encoded JSON. I can see what the HTML files look like, but I figured JSON would be easier to parse.
I found this, which might be related, but I'm not sure how I can solve it in R.
Hi PJ,
Thanks so much for the example. Since that's not the only messy Unicode in the files, I'll just work through them and replace the broken sequences as I encounter them.
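For the record, my stopgap looks something like this; the pattern/replacement pair is just the first one I've worked out, and `content` stands in for wherever your message text ends up:

```r
# swap each garbled sequence for the emoji it should have been;
# "\u00f0\u009f\u0098\u008a" is how U+1F60A (smiley) comes out of the parser
content <- gsub("\u00f0\u009f\u0098\u008a", "\U0001F60A", content, fixed = TRUE)
```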
I also found this thread on SO, so it looks like a known issue. Maybe I can use some of the code there via reticulate and fix this systematically.
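If I get that working, I'd expect it to look roughly like this. The latin1-to-utf8 round-trip is the idea from the SO answers; the wrapper function, the recursive walker, and the file path are my own guesses:

```r
library(reticulate)

# parse the JSON in Python, then round-trip every string through
# latin1 -> utf8 to turn the escaped bytes back into real characters
py_run_string("
import json

def load_fixed(path):
    with open(path, 'rb') as f:
        obj = json.load(f)
    def fix(o):
        if isinstance(o, str):
            return o.encode('latin1').decode('utf8')
        if isinstance(o, list):
            return [fix(v) for v in o]
        if isinstance(o, dict):
            return {k: fix(v) for k, v in o.items()}
        return o
    return fix(obj)
")

# hypothetical path -- point this at one of the export files
msgs <- py$load_fixed("messages/inbox/somechat/message_1.json")
```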