I'm trying to clean a dataset that has misspelled words. I couldn't solve it with regex since there is no pattern to work from, so I'm trying to use the stringdist package to find high-percentage matches against another dataset I created. I don't know much about for loops and couldn't figure out how to apply one here. Here is an example of my data:
library(stringdist)
library(data.table)

data1 <- c("neeyork","dalas","houson","new york")
data2 <- c("houston","newyork","dallas","washington")
for (i in 1:length(data1)) {
  data99 <- data.table(stringsim(data1[i], data2, method = 'cosine'))
}
# It gives me something like this:
#>1 0.3535534
#>2 0.9354143
#>3 0.0000000
#>4 0.4082483
Correct me if I'm wrong, but I think my for loop is failing here because it only seems to match "neeyork" against the data2 values. How can I fix it? Also, even if I fix this, how am I supposed to know which data1 value matched which data2 value? I have over 10k values in data1 and 50k+ values in data2.
Right, if you need to do all those pairwise comparisons, it will take a long time. A couple of options: you could run the comparisons in parallel, or, if you have additional information about which elements are potential matches, you could limit the comparisons to those candidates. This technique is sometimes called "blocking"; you can read more about it and see an implementation in the fastLink package.
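To make the blocking idea concrete, here is a minimal sketch using only base R and stringdist (no fastLink). It groups strings by their first letter as a toy blocking key (a real key depends on your data), then, within each block, computes a distance matrix with `stringdistmatrix` and picks the closest data2 string for each data1 string, so you can see exactly which value matched which:

```r
library(stringdist)

data1 <- c("neeyork", "dalas", "houson", "new york")
data2 <- c("houston", "newyork", "dallas", "washington")

match_block <- function(queries, candidates, method = "cosine") {
  # One row per query, one column per candidate
  d <- stringdistmatrix(queries, candidates, method = method)
  best <- apply(d, 1, which.min)  # column index of the closest candidate
  data.frame(
    original = queries,
    match    = candidates[best],
    sim      = 1 - d[cbind(seq_along(queries), best)],
    stringsAsFactors = FALSE
  )
}

# Blocking: only compare strings that share a first letter,
# which avoids the full 10k x 50k comparison.
blocks <- intersect(substr(data1, 1, 1), substr(data2, 1, 1))
result <- do.call(rbind, lapply(blocks, function(b) {
  match_block(data1[substr(data1, 1, 1) == b],
              data2[substr(data2, 1, 1) == b])
}))
result
```

On the example data this pairs "neeyork" and "new york" with "newyork", "dalas" with "dallas", and "houson" with "houston". If you just want a one-to-one lookup without building the matrix yourself, `stringdist::amatch(data1, data2, method = "cosine", maxDist = 0.3)` returns, for each data1 value, the index of the closest data2 value within the threshold (or NA if none qualifies).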