My n-gram dictionary looks like this:
word1       word2      frequency  nprefix
during the  day        1206       2
during the  christmas  566        2
during the  pm         480        2
during the  recovery   440        2
during the  night      406        2
during the  dayi       395        2
during the  month      373        2
during the  weekend    321        2
during the  campaign   239        2
during the  scottish   217        2
Here nprefix is the number of words in word1.
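To make the example reproducible, the ten rows above can be rebuilt as a data frame. This is only a sketch: dat is the name the functions in this post expect, while the plain character and numeric column types are an assumption about how the real dictionary is stored.

```r
# Rebuild the ten sample rows; column types are assumed, not confirmed.
dat <- data.frame(
  word1 = "during the",
  word2 = c("day", "christmas", "pm", "recovery", "night",
            "dayi", "month", "weekend", "campaign", "scottish"),
  frequency = c(1206, 566, 480, 440, 406, 395, 373, 321, 239, 217),
  nprefix = 2,
  stringsAsFactors = FALSE
)

# Inspect the structure: 10 rows, 4 columns.
str(dat)
```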
My function is designed to match a typed-in phrase against word1 in the dictionary. Once a match is found, it looks up the next word (word2) and prints it. If no match is found, the phrase is shortened by dropping its first word and looked up again.
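The shortening step can be illustrated on its own. The sketch below lists every phrase the back-off could try, longest first. Both function names are hypothetical, and base R's sub() is used here as a stand-in so the snippet runs without extra packages.

```r
# Drop the first word of a phrase; a one-word phrase is returned unchanged.
drop_first_word <- function(x) sub("^[^ ]+ ", "", x)

# List every phrase the back-off could try, longest first.
backoff_candidates <- function(phrase) {
  out <- phrase
  while (grepl(" ", phrase, fixed = TRUE)) {
    phrase <- drop_first_word(phrase)
    out <- c(out, phrase)
  }
  out
}

print(backoff_candidates("the faith during the"))
# [1] "the faith during the" "faith during the"     "during the"           "the"
```

With the sample dictionary, only "during the" is a known prefix, so the first two candidates produce no match.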
Sometimes there is more than one candidate for the next word. Before printing, the candidates are ranked by how often they occurred in the training data. Here are the functions used:
library(dplyr)    # %>%, filter(), select()
library(stringr)  # str_replace(), str_count()

# Look up the candidate next words for a phrase
nxtword1 <- function(word) {
  dat %>% filter(word1 == word) %>% select(-word1)
}

# Reduce an n-gram to an (n-1)-gram by dropping its first word
less1gram <- function(x) {
  str_replace(x, "^[^ ]+ ", "")
}

whatsnext1 <- function(phrase) {
  nwords <- str_count(phrase, pattern = " ")
  while (!(phrase %in% dat$word1) && nwords >= 1) {
    phrase <- less1gram(phrase)
    nwords <- str_count(phrase, pattern = " ")
    print(nxtword1(phrase)[1:5, 1], col.names = FALSE)
  }
}
I tried to complete this phrase:
whatsnext1("the faith during the")
When the function prints the top five word choices, an extra row of NAs appears above them:
[1] NA NA NA NA NA
[1] "day" "christmas" "pm" "recovery" "night"
How can I get rid of the row of NAs?
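One likely cause, offered as a sketch and not a definitive answer: the print call sits inside the while loop, so it also runs for "faith during the", which has no match. Indexing an empty lookup result with 1:5 pads it with NA. The revision below backs off first and looks up only once. whatsnext2 is a hypothetical name, the toy dat holds only the first five sample rows, and base R replaces the dplyr and stringr calls so the snippet is self-contained.

```r
# Toy dictionary: the first five sample rows from the post.
dat <- data.frame(
  word1 = "during the",
  word2 = c("day", "christmas", "pm", "recovery", "night"),
  frequency = c(1206, 566, 480, 440, 406),
  nprefix = 2,
  stringsAsFactors = FALSE
)

# Drop the first word of a phrase (base-R stand-in for str_replace).
less1gram <- function(x) sub("^[^ ]+ ", "", x)

# Hypothetical revision: back off first, then look up and return once.
whatsnext2 <- function(phrase, n = 5) {
  # Shorten until the phrase is a known prefix or only one word is left.
  while (!(phrase %in% dat$word1) && grepl(" ", phrase, fixed = TRUE)) {
    phrase <- less1gram(phrase)
  }
  hits <- dat[dat$word1 == phrase, ]
  hits <- hits[order(-hits$frequency), ]  # rank by frequency, highest first
  # head() returns at most n values and never pads with NA.
  head(hits$word2, n)
}

print(whatsnext2("the faith during the"))
# [1] "day"       "christmas" "pm"        "recovery"  "night"

# Where the NA row comes from: an empty lookup indexed with 1:5.
print(dat[dat$word1 == "faith during the", ][1:5, "word2"])
# [1] NA NA NA NA NA
```

If nothing matches even after backing off to a single word, whatsnext2 returns an empty character vector, so no NA values are printed.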