How to get the utf-8 codes from a text string?

mara · February 28, 2019, 2:14pm

OK, so I was able to scrape a data frame for you which has the binary and the UTF-8 codes (I'm just showing you a subset because the first several entries are <control> and blanks).

Because string encoding is, well, unpredictably weird, your results may vary, or you might want a different set of characters, etc., but the method I used should work for the various combinations available on the site:
https://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=bin

library(tidyverse)
library(janitor)
library(rvest)
#> Loading required package: xml2
#> 
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#> 
#>     pluck
#> The following object is masked from 'package:readr':
#> 
#>     guess_encoding

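url <- "https://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=bin"

# Read the page and pull out the code table; html_table() returns a list of data frames
utf8_enc <- url %>%
  read_html() %>%
  html_nodes(css = 'body > table.codetable') %>%
  html_table()

# Keep the first matched table
utf8_enc_tab <- utf8_enc[[1]]

# Convert the column headers to snake_case names
utf8_enc_tab <- utf8_enc_tab %>%
  janitor::clean_names()

# Show a few printable rows (the first rows are control characters)
utf8_enc_tab %>%
  slice(70:80)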
#>    unicodecode_point character utf_8_bin                   name
#> 1             U+0045         E  01000101 LATIN CAPITAL LETTER E
#> 2             U+0046         F  01000110 LATIN CAPITAL LETTER F
#> 3             U+0047         G  01000111 LATIN CAPITAL LETTER G
#> 4             U+0048         H  01001000 LATIN CAPITAL LETTER H
#> 5             U+0049         I  01001001 LATIN CAPITAL LETTER I
#> 6             U+004A         J  01001010 LATIN CAPITAL LETTER J
#> 7             U+004B         K  01001011 LATIN CAPITAL LETTER K
#> 8             U+004C         L  01001100 LATIN CAPITAL LETTER L
#> 9             U+004D         M  01001101 LATIN CAPITAL LETTER M
#> 10            U+004E         N  01001110 LATIN CAPITAL LETTER N
#> 11            U+004F         O  01001111 LATIN CAPITAL LETTER O

Created on 2019-02-28 by the reprex package (v0.2.1)
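
If you'd rather not pick the printable rows by position, you could filter on the name column instead. This is only a sketch: the small table below is a stand-in I typed from rows shown in this post (with the control character's `character` value assumed to be an empty string), not the real scraped result.

```r
library(dplyr)

# Stand-in for the scraped table: three rows copied from outputs in this post.
# The empty string for the control character is an assumption.
utf8_enc_tab <- tibble(
  unicodecode_point = c("U+0008", "U+0045", "U+0046"),
  character = c("", "E", "F"),
  utf_8_bin = c("00001000", "01000101", "01000110"),
  name = c("<control>", "LATIN CAPITAL LETTER E", "LATIN CAPITAL LETTER F")
)

# Keep only rows that are not control characters
printable <- utf8_enc_tab %>%
  filter(name != "<control>")

printable
```

The same `filter()` call should work on the real `utf8_enc_tab`, as long as the control rows are labelled `<control>` the way they are here.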

I did write it out to a CSV, but I suggest you do the scraping on your own machine, since encoding behavior can vary from OS to OS, etc.

https://gist.github.com/batpigandme/fb6bf75a0158b1f3bdda8530d0cb35ac

utf8_enc_tab.csv

unicodecode_point,character,utf_8_bin,name
U+0000,,00000000,<control>
U+0001,,00000001,<control>
U+0002,,00000010,<control>
U+0003,,00000011,<control>
U+0004,,00000100,<control>
U+0005,,00000101,<control>
U+0006,,00000110,<control>
U+0007,,00000111,<control>
U+0008,,00001000,<control>
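
If you save your own copy and read it back later, here is a rough sketch of the round trip in base R. The three-row data frame is a stand-in typed from the output earlier in this post, and the temp file path is just for illustration.

```r
# Stand-in for the scraped table (rows copied from the output earlier in this post)
utf8_enc_tab <- data.frame(
  unicodecode_point = c("U+0045", "U+0046", "U+0047"),
  character = c("E", "F", "G"),
  utf_8_bin = c("01000101", "01000110", "01000111"),
  name = c("LATIN CAPITAL LETTER E", "LATIN CAPITAL LETTER F",
           "LATIN CAPITAL LETTER G"),
  stringsAsFactors = FALSE
)

# Illustrative location; use whatever path suits you
path <- tempfile(fileext = ".csv")
write.csv(utf8_enc_tab, path, row.names = FALSE, fileEncoding = "UTF-8")

# Read every column as character so the bit strings keep their leading zeros
utf8_enc_back <- read.csv(path, colClasses = "character",
                          fileEncoding = "UTF-8", stringsAsFactors = FALSE)

utf8_enc_back$utf_8_bin
```

The `colClasses = "character"` part matters: without it, `read.csv()` would guess that `utf_8_bin` is numeric and drop the leading zeros.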


You can then basically use this to do a lookup:

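# Start again from the scraped table and clean the column names
utf8_enc_tab <- utf8_enc[[1]]

utf8_enc_tab <- as_tibble(utf8_enc_tab) %>%
  janitor::clean_names()

# Split the input string into single characters
x <- "abc"
characters <- strsplit(x, "")[[1]]

char_frame <- tibble(chars = characters)

# Add each character's bits, then look up its row in the scraped table
char_frame <- char_frame %>%
  mutate(bits = pryr::bits(chars)) %>%
  left_join(utf8_enc_tab, by = c("chars" = "character"))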

char_frame
#> # A tibble: 3 x 5
#>   chars bits     unicodecode_point utf_8_bin name                
#>   <chr> <chr>    <chr>             <chr>     <chr>               
#> 1 a     01100001 U+0061            01100001  LATIN SMALL LETTER A
#> 2 b     01100010 U+0062            01100010  LATIN SMALL LETTER B
#> 3 c     01100011 U+0063            01100011  LATIN SMALL LETTER C
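
If you only need the code points and bits, and not the character names, you may not need the scraped table at all. Here's a minimal sketch using base R only; `utf8_codes()` is a helper name I made up for illustration, and I've only checked the logic against a handful of characters.

```r
# Hypothetical helper (not part of any package): one row per character, with
# its Unicode code point and the bits of its UTF-8 bytes.
utf8_codes <- function(x) {
  # Make sure the string is UTF-8, then split it into single characters
  chars <- strsplit(enc2utf8(x), "")[[1]]

  # Bits of one byte value (0-255), most significant bit first
  byte_bits <- function(b) {
    paste(rev(as.integer(intToBits(b))[1:8]), collapse = "")
  }

  data.frame(
    chars = chars,
    code_point = sprintf("U+%04X",
                         vapply(chars, utf8ToInt, integer(1), USE.NAMES = FALSE)),
    utf_8_bin = vapply(chars, function(ch) {
      # A character can take more than one byte; separate bytes with a space
      bytes <- as.integer(charToRaw(ch))
      paste(vapply(bytes, byte_bits, character(1)), collapse = " ")
    }, character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

utf8_codes("abc")
```

For "abc" this gives the same `utf_8_bin` values as the lookup above (01100001, 01100010, 01100011). For characters outside ASCII you get several space-separated bytes, e.g. two bytes for "\u00e9".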