Hi all—
I am hoping to get some advice/assurance on parsing some text data in R.
Background
To the best of my Googling, there's no obvious way to generate a parser in R based on an (E)BNF grammar of some sort (happy to be corrected about this). I recently came across ropenscilabs/gramr and saw that the package uses a JavaScript package, write-good, to do the heavy lifting (nifty!). So I thought I'd try something similar using Nearley, a JavaScript parser generator toolkit.
Reprex: checking a French-to-English dictionary
The data that I work with are dictionaries formatted as backslash-coded lines, which is a relatively common format within [endangered] language documentation work (see a longer example here). Below, I've made a toy French-to-English dictionary:
library(tidyverse)
library(zoo)
library(V8)

# Toy lexicon: entries are separated by blank lines
lexicon <-
'\\lx rouge
\\ps adjective
\\de red
\\xv La chaise est rouge
\\xe The chair is red

\\lx bonjour
\\de hello
\\ps exclamation

\\lx parler
\\ps verb
\\de speak
\\xv Parlez-vous français?
'

lexicon_df <-
  read_lines(file = lexicon) %>%
  tibble(line = 1:length(.), data = .) %>%
  # split "\code value" into two columns; blank lines become NA
  extract(col = data,
          regex = "\\\\([a-z]+)\\s(.*)",
          into = c("code", "value"),
          remove = FALSE) %>%
  # tag every line with the line number of the \lx line that opens its entry
  mutate(lx_id = ifelse(code == "lx", line, NA) %>% na.locf(na.rm = FALSE))
I've found tidyverse a great way to work with a lot of aspects of the data, so a lot of my workflow consists of working on a data frame that looks like:
line | data | code | value | lx_id |
---|---|---|---|---|
1 | \lx rouge | lx | rouge | 1 |
2 | \ps adjective | ps | adjective | 1 |
3 | \de red | de | red | 1 |
4 | \xv La chaise est rouge | xv | La chaise est rouge | 1 |
5 | \xe The chair is red | xe | The chair is red | 1 |
6 |  | NA | NA | 1 |
For example, I can use assertr::verify to make sure all the part-of-speech values (adjective, noun, etc.) in the ps codes are valid. Besides value validation, it is also important to validate the order of the code column, and this is the part I haven't quite worked out how to do [well] in R.
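To make the value check concrete, here is a minimal sketch of the kind of assertr::verify call I mean. The valid_ps whitelist and the small codes_df data frame are made up for illustration; in practice the check would run on lexicon_df.

```r
library(dplyr)
library(assertr)

# Hypothetical whitelist of parts of speech (illustration only)
valid_ps <- c("adjective", "exclamation", "noun", "verb")

# Small stand-in for the code/value columns of lexicon_df
codes_df <- tibble(
  code  = c("lx", "ps", "de", "lx", "de", "ps"),
  value = c("rouge", "adjective", "red", "bonjour", "hello", "exclamation")
)

# verify() returns the data unchanged if the expression holds for every row
# and raises an error otherwise
checked_ps <- codes_df %>%
  filter(code == "ps") %>%
  verify(value %in% valid_ps)
```

This only looks at one value at a time, so it says nothing about whether the codes appear in the right order.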
Question/code review: how can the following be done better?
Following Jeroen Ooms's 'Using NPM packages in V8' vignette, I experimented with writing a compile_grammar R function (GitHub gist here). The function takes a Nearley grammar, such as lexicon_grammar below, and uses V8 and Nearley to compile the grammar into "R code":
lexicon_grammar <- '
entry -> "lx" _ "ps" _ "de" _ examples:?
examples -> ("xv" _ "xe" _):+
_ -> " " | null
'
source("https://git.io/vAFux") # source compile_grammar function from GitHub gist
parser <- compile_grammar(lexicon_grammar)
To check whether our dictionary entries are valid, we can use the generated parser function within a mutate call:
lexicon_df %>%
  filter(!is.na(code)) %>%
  group_by(lx_id) %>%
  summarise(code_sequence = paste0(code, collapse = " ")) %>%
  rowwise() %>%
  mutate(
    parsed_sequence = parser(code_sequence, stop_on_error = FALSE),
    valid_sequence = is.list(parsed_sequence)
  )
lx_id | code_sequence | parsed_sequence | valid_sequence |
---|---|---|---|
1 | lx ps de xv xe | list("lx", " ", "ps", " ", "de", " ", list(list(list("xv", " ", "xe", character(0))))) | TRUE |
7 | lx de ps | Error: invalid syntax at line 1 col 4: lx de ps ^ Unexpected "d" | FALSE |
11 | lx ps de xv | Error: Parse incomplete, expecting more text at end of string: 'lx ps de xv' | FALSE |
As we can see, only our \lx rouge ... entry block is valid within the grammar. The second entry, \lx bonjour ..., has its ps and de lines inverted, and the third has an example sentence (\xv Parlez-vous français?) without the required English translation line, xe.
I was wondering if anyone knows a more robust or more R-native way to do the same or a similar thing. One issue I've already encountered with the V8 package is that it uses an older version of the V8 engine, so this method can't take full advantage of the Nearley parsing toolkit, and compiling not-so-toy-example grammars is actually quite frustrating.
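For comparison, the only R-native fallback I can think of for a grammar this small is a plain regular expression over the code sequence. This is just a sketch: code_sequences and entry_pattern are names I made up here, and the pattern is my hand translation of the toy grammar, assuming codes are always separated by a single space.

```r
# Code sequences as produced by the summarise() step, typed in by hand here
code_sequences <- c("lx ps de xv xe", "lx de ps", "lx ps de xv")

# Regular-expression version of the toy grammar:
# lx, ps, de, then zero or more "xv xe" pairs
entry_pattern <- "^lx ps de( xv xe)*$"

valid_sequence <- grepl(entry_pattern, code_sequences)
valid_sequence
```

This flags the same two entries as invalid, but it gives no hint about where a sequence goes wrong, and it won't extend to grammars that need nesting or recursion, which is why I'd still like a proper parser.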
Thanks for reading!