String match with misspelled words

darkstormx1 · June 9, 2020, 3:20pm

Hello dear community.

I'm trying to clean a dataset that has misspelled words. I couldn't solve it with regex since there is no pattern to work from, so I'm trying to use the stringdist package to find high-percentage matches in another dataset that I created. I don't know much about for loops and couldn't figure out how to apply one here. Here is an example of my data.

data1 <- c("neeyork","dalas","houson","new york")
data2 <- c("houston","newyork","dallas","washington")

for (i in 1:length(data1)) {
  data99 <- data.table(stringsim(data1[i],data2, method = 'cosine'))
}

# It gives me something like this:

#> 1 0.3535534
#> 2 0.9354143
#> 3 0.0000000
#> 4 0.4082483

Correct me if I'm wrong, but I think I'm getting the for loop wrong here because it's only trying to match "neeyork" with the data2 values. How can I fix it? Also, even if I fix this, how am I supposed to know which data1 value matched which data2 value? I have over 10k values in data1 and 50k+ values in data2.

mfherman · June 9, 2020, 3:35pm

If you're trying to compare all elements to all other elements, you might want stringdistmatrix() instead of stringsim():

library(stringdist)

data1 <- c("neeyork","dalas","houson","new york")
data2 <- c("houston","newyork","dallas","washington")

stringdistmatrix(data1, data2, method = "cosine", useNames = "strings")
#>             houston    newyork     dallas washington
#> neeyork  0.66666667 0.11808290 1.00000000  0.7113249
#> dalas    0.87401184 1.00000000 0.04381711  0.6726732
#> houson   0.05719096 0.59910814 0.88819660  0.3876276
#> new york 0.64644661 0.06458565 1.00000000  0.5917517

Created on 2020-06-09 by the reprex package (v0.3.0)
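As a follow-up on "which data1 value matched which data2 value": the loop in the question overwrites data99 on every iteration, so only the result for the last element ("new york") survives. With the distance matrix no loop is needed. Below is a minimal sketch, assuming the smallest cosine distance in each row is the match you want; the names `d`, `best`, and `matches` are made up for illustration.

```r
library(stringdist)

data1 <- c("neeyork", "dalas", "houson", "new york")
data2 <- c("houston", "newyork", "dallas", "washington")

# Distance matrix: rows follow data1, columns follow data2.
# Smaller distance means more similar strings.
d <- stringdistmatrix(data1, data2, method = "cosine")

# For each row (each data1 value), the column index of the smallest distance
best <- apply(d, 1, which.min)

# One row per data1 value, with its closest data2 value and the distance
matches <- data.frame(
  original = data1,
  match    = data2[best],
  distance = d[cbind(seq_along(data1), best)]
)
matches
```

With the example vectors this pairs "neeyork" and "new york" with "newyork", "dalas" with "dallas", and "houson" with "houston". Note that which.min() always returns something, so a data1 value with no good counterpart still gets its least-bad match; filtering on the distance column is one way to drop those.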

darkstormx1 · June 9, 2020, 3:45pm

Yeah, this might work, but I don't think my PC can handle it. Like I said, if I try this with

data1 > 12k
data2 > 50k

I'll need to look through 600 million values to find the highest in each... How can I find the top values in that mess?

mfherman · June 9, 2020, 3:54pm

Right, if you need to do all those pairwise comparisons, it will take a long time. A couple of options might be to do this in parallel or, if you have additional information about which elements are potential matches, to limit the comparisons. This technique is sometimes called "blocking", and you can read more about it and see an implementation in the fastLink package.
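To make the blocking idea concrete, here is a rough sketch that uses stringdist's amatch() inside each block rather than fastLink. The blocking key (first letter of each string) and the maxDist cutoff of 0.2 are assumptions chosen only for this toy data; a misspelled first letter would put a value in the wrong block, so a real key needs more thought.

```r
library(stringdist)

data1 <- c("neeyork", "dalas", "houson", "new york")
data2 <- c("houston", "newyork", "dallas", "washington")

# Hypothetical blocking key: the first letter of each string
key1 <- substr(data1, 1, 1)
key2 <- substr(data2, 1, 1)

result <- rep(NA_character_, length(data1))

for (k in unique(key1)) {
  rows       <- which(key1 == k)
  candidates <- data2[key2 == k]
  if (length(candidates) == 0) next  # no data2 value shares this key

  # amatch() returns the index of the closest candidate within maxDist,
  # or NA when nothing is close enough
  idx <- amatch(data1[rows], candidates, method = "cosine", maxDist = 0.2)
  result[rows] <- candidates[idx]
}

data.frame(original = data1, match = result)
```

Each data1 value is only compared against data2 values that share its key, so the number of comparisons drops roughly in proportion to the number of blocks. Calling amatch(data1, data2, ...) directly, without blocking, is also worth trying first: it still does every comparison, but it returns one index per data1 value instead of a 12k x 50k matrix.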

darkstormx1 · June 9, 2020, 3:55pm

Thank you for your help. Have a nice day
