Best practices for generating parsers in R

fauxneticien · March 7, 2018, 7:14am

Hi all—

I am hoping to get some advice/assurance on parsing some text data in R.

Background

To the best of my Googling, there's no obvious way to generate a parser in R based on an (E)BNF grammar of some sort (happy to be corrected about this). I recently came across ropenscilabs/gramr, and saw that the package was using a JavaScript package, write-good, to do the heavy lifting (nifty!). So, I thought I'd try out a similar thing using Nearley, a JavaScript parser generator toolkit.

Reprex: checking a French-to-English dictionary

The data that I work with are dictionaries formatted as backslash-coded lines, which is a relatively common format within [endangered] language documentation work (see a longer example here). Below, I've made a toy French-to-English dictionary:

library(tidyverse)
library(zoo)
library(V8)

lexicon <-
'\\lx rouge
\\ps adjective
\\de red
\\xv La chaise est rouge
\\xe The chair is red

\\lx bonjour
\\de hello
\\ps exclamation

\\lx parler
\\ps verb
\\de speak
\\xv Parlez-vous français?
'

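lexicon_df <-
    read_lines(file = lexicon) %>%                  # literal string with newlines is read as data
    tibble(line = seq_along(.), data = .) %>%       # one row per line, numbered from 1
    extract(col = data,
            regex = "\\\\([a-z]+)\\s(.*)",          # backslash code, whitespace, then the value
            into = c("code", "value"),
            remove = FALSE) %>%
    # carry each \lx line number forward so every row knows which entry it belongs to
    mutate(lx_id = ifelse(code == "lx", line, NA) %>% na.locf(na.rm = FALSE))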

I've found tidyverse a great way to work with a lot of aspects of the data, so a lot of my workflow consists of working on a data frame that looks like:

line	data	code	value	lx_id
1	\lx rouge	lx	rouge	1
2	\ps adjective	ps	adjective	1
3	\de red	de	red	1
4	\xv La chaise est rouge	xv	La chaise est rouge	1
5	\xe The chair is red	xe	The chair is red	1
6		NA	NA	1

For example, I can use assertr::verify to make sure all the parts of speech values (adjective, noun, etc.) in the ps codes are valid. Other than value validation, validation of the order of the code column is also something important to check, and this is the part I haven't quite worked out how to do [well] in R.
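To make the value check concrete, here is a minimal base R sketch of the kind of condition I mean. The `entries` data frame just mirrors the `ps` rows of `lexicon_df`, and `valid_ps` is a made-up whitelist for illustration, not a real standard; with assertr, the same `value %in% valid_ps` condition is what I would hand to `verify`.

```r
# Toy data mirroring the ps rows of lexicon_df (line numbers as in the reprex)
entries <- data.frame(
  line  = c(2L, 9L, 12L),
  code  = c("ps", "ps", "ps"),
  value = c("adjective", "exclamation", "verb"),
  stringsAsFactors = FALSE
)

# Hypothetical whitelist of allowed parts of speech (an assumption for this sketch)
valid_ps <- c("adjective", "exclamation", "noun", "verb")

# Keep only the ps rows whose value is NOT in the whitelist
invalid_ps <- entries[entries$code == "ps" & !(entries$value %in% valid_ps), ]

# Fail loudly if any invalid part of speech was found
stopifnot(nrow(invalid_ps) == 0)
```

This only looks at values one row at a time, which is why it doesn't help with checking the order of codes within an entry.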

Question/code review: how can the following be done better?

Following Jeroen Ooms's 'Using NPM packages in V8' vignette, I experimented writing a compile_grammar R function (GitHub gist here). The function takes a Nearley grammar, such as lexicon_grammar below, and uses V8 and Nearley to compile the grammar into "R code":

lexicon_grammar <- '
entry    -> "lx" _ "ps" _ "de" _ examples:?

examples -> ("xv" _ "xe" _):+

_        -> " " | null
'

source("https://git.io/vAFux") # source compile_grammar function from GitHub gist
parser <- compile_grammar(lexicon_grammar)
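For anyone who hasn't used V8 before, the general pattern of calling JavaScript from R looks roughly like the sketch below. This is not the contents of `compile_grammar`; `splitCodes` is a hypothetical stand-in function, and the only things assumed are the V8 package's `v8()` context with its `eval` and `call` methods.

```r
library(V8)

# Create a fresh JavaScript context
ct <- v8()

# Define a toy JavaScript function inside the context
# (hypothetical; it stands in for whatever a compiled parser would expose)
ct$eval("
function splitCodes(s) {
  return String(s).split(' ');
}
")

# Call the function from R; arguments and return values are passed as JSON
codes <- ct$call("splitCodes", "lx ps de")
codes
```

The appeal of this route is that any JavaScript library that runs in the embedded engine can, in principle, be wrapped in an R function this way.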

To check whether our dictionary entries are valid, we can use the generated parser function within a mutate call:

lexicon_df %>%
    filter(!is.na(code)) %>%
    group_by(lx_id) %>%
    summarise(code_sequence = paste0(code, collapse = " ")) %>%
    rowwise() %>%
    mutate(
        parsed_sequence  = parser(code_sequence, stop_on_error = FALSE),
        valid_sequence   = is.list(parsed_sequence)
    )

lx_id	code_sequence	parsed_sequence	valid_sequence
1	lx ps de xv xe	list("lx", " ", "ps", " ", "de", " ", list(list(list("xv", " ", "xe", character(0)))))	TRUE
7	lx de ps	Error: invalid syntax at line 1 col 4: lx de ps ^ Unexpected "d"	FALSE
11	lx ps de xv	Error: Parse incomplete, expecting more text at end of string: 'lx ps de xv'	FALSE

As we can see, only our \lx rouge ... entry block is valid within the grammar. The 2nd item, \lx bonjour ... has its ps and de lines inverted, and the third is missing a required English sentence xe for its example sentence, \xv Parlez-vous français?.
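As a cross-check on the results above: this toy grammar happens to be regular, so the same valid/invalid verdicts can be reproduced with a plain regular expression in base R. The sketch below assumes codes are joined by exactly one space (as `paste0(code, collapse = " ")` produces), which is stricter than the grammar's optional `_`, and the pattern is my own translation of the grammar rather than anything Nearley generates.

```r
# Code sequences per entry, keyed by lx_id, as produced by the summarise step
code_sequences <- c(
  "1"  = "lx ps de xv xe",
  "7"  = "lx de ps",
  "11" = "lx ps de xv"
)

# entry -> lx ps de, optionally followed by one or more (xv xe) pairs
entry_pattern <- "^lx ps de( xv xe)*$"

# grepl() returns an unnamed logical vector, so restore the lx_id names
valid_sequence <- grepl(entry_pattern, code_sequences)
names(valid_sequence) <- names(code_sequences)

valid_sequence
```

Unlike the generated parser, this gives no parse tree and no indication of where a sequence went wrong, and it stops being an option as soon as the grammar needs nesting.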

I was wondering if anyone knew a more robust/R-native way to do the same/a similar thing. One issue I've already encountered with using V8 is that the package uses an older version of the V8 engine, so this method isn't quite able to take full advantage of the Nearley parsing toolkit. Compiling not-so-toy-example grammars is also quite frustrating.

Thanks for reading!