pathos
February 4, 2022, 10:40am
1
Let's say I have the following sample data:
  postcode postcode_city
  <chr>    <chr>
1 3069 XJ  3069 XJ Rotterdam
2 3076 BJ  3076 BJ Rotterdam
3 3037 EA  3037 EA Rotterdam
4 3043 KC  3043 KC Rotterdam
5 3031 AM  3031 AM Rotterdam
6 3039 ZK  3039 ZK Rotterdam
I found a package that doesn't install into the current version of R, so I looked at the source code here: OmicsMarkeR source: R/stability.R
With a small deletion, essentially, this is the code:
sorensen <- function(x, y) {
  index <-
    2 * length(intersect(x, y)) / (2 * length(intersect(x, y)) +
      length(setdiff(x, y)) +
      length(setdiff(y, x)))
  return(index)
}
### the goal:
sorensen(df$postcode, df$postcode_city)
# [1] 0
### since above isn't working, attempting individual parts
intersect(df$postcode[1], df$postcode_city[1])
# character(0)
setdiff(df$postcode[1], df$postcode_city[1])
# [1] "3069 XJ"
setdiff(df$postcode_city[1], df$postcode[1]) # just reversed x:y to y:x
# [1] "3069 XJ Rotterdam"
So setdiff seems to be off, and intersect doesn't seem to work at all.
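For what it's worth, these outputs are what base R's set functions are expected to return: intersect and setdiff compare whole vector elements, never substrings. A minimal sketch using the first row of the sample data (the variable names are illustrative only), contrasting whole-string comparison with comparison of single characters:

```r
# Set operations treat each string as one indivisible element.
x <- "3069 XJ"
y <- "3069 XJ Rotterdam"

# No element of x is exactly equal to an element of y, so the result is empty.
whole_overlap <- intersect(x, y)

# Splitting into single characters first gives the set functions something to match.
chars_x <- strsplit(x, "")[[1]]
chars_y <- strsplit(y, "")[[1]]
char_overlap <- intersect(chars_x, chars_y)

print(whole_overlap)
print(char_overlap)
```

So a result of 0 from sorensen() on the two columns means that no postcode string is exactly equal to any postcode_city string.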
Hi,
You cannot compare strings like that using intersect or setdiff, because both functions compare whole elements, not substrings.
"3069 XJ" is not the same string as "3069 XJ Rotterdam", so there is no intersection and every element counts as different. It's not clear what your goal is here: the Sørensen-Dice (SD) coefficient measures the similarity between lists, but I don't see which lists you are trying to compare.
You can look at which postcodes have the same city for example, or the number of unique postcodes etc, but for that you'd first need to transform your data. For example:
library(tidyverse)
myData = data.frame(
  stringsAsFactors = FALSE,
  postcode = c("3069 XJ", "3076 BJ", "3037 EA",
               "3043 KC", "3031 AM", "3039 ZK"),
  postcode_city = c("3069 XJ Rotterdam", "3076 BJ Rotterdam", "3037 EA Rotterdam",
                    "3043 KC Rotterdam", "3031 AM Rotterdam", "3039 ZK Rotterdam")
)
myData
#>   postcode     postcode_city
#> 1  3069 XJ 3069 XJ Rotterdam
#> 2  3076 BJ 3076 BJ Rotterdam
#> 3  3037 EA 3037 EA Rotterdam
#> 4  3043 KC 3043 KC Rotterdam
#> 5  3031 AM 3031 AM Rotterdam
#> 6  3039 ZK 3039 ZK Rotterdam
myData %>% separate(postcode, c("postcode", "abbr")) %>%
  mutate(city = str_remove(postcode_city, "^\\d+\\s\\w+\\s")) %>%
  select(-postcode_city)
#>   postcode abbr      city
#> 1     3069   XJ Rotterdam
#> 2     3076   BJ Rotterdam
#> 3     3037   EA Rotterdam
#> 4     3043   KC Rotterdam
#> 5     3031   AM Rotterdam
#> 6     3039   ZK Rotterdam
Created on 2022-02-04 by the reprex package (v2.0.1)
Now you can do more analyses based on any of the three variables. Please explain a bit more about what you would like to do if needed.
Hope this helps,
PJ
pathos
February 4, 2022, 1:53pm
3
What I want to do is fuzzy string matching using Dice's coefficient. postcode and postcode_city are the lists I would like to compare.
Essentially, with the currently shown sample lists, there should be 100% similarity or close (I'm assuming).
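For reference, when Dice's coefficient is used for fuzzy string matching it is commonly computed on character bigrams (pairs of adjacent characters) rather than on whole strings. A minimal base R sketch, assuming sets of unique bigrams; the helper names bigrams and dice_bigrams are hypothetical and do not come from OmicsMarkeR:

```r
# Hypothetical helper: the set of unique adjacent character pairs in one string.
bigrams <- function(s) {
  if (is.na(s) || nchar(s) < 2) return(character(0))
  n <- nchar(s)
  unique(substring(s, 1:(n - 1), 2:n))
}

# Hypothetical helper: Dice's coefficient, 2 * |A and B| / (|A| + |B|), on bigram sets.
dice_bigrams <- function(a, b) {
  x <- bigrams(a)
  y <- bigrams(b)
  if (length(x) + length(y) == 0) return(NA_real_)
  2 * length(intersect(x, y)) / (length(x) + length(y))
}

score <- dice_bigrams("3069 XJ", "3069 XJ Rotterdam")
print(score)
```

For the first sample row this gives 12/22 (about 0.55), not 100%: all 6 bigrams of the postcode occur in the longer string, but the 10 extra bigrams contributed by " Rotterdam" lower the score.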
Hi,
In the dataset I created, you can extract that information directly, because the last column (city) is identical for all samples, so no other special logic is needed. Of course, if this is just a dummy example and you want to compare string similarity (e.g. when there might be typos or other string variations), there are other methods that compare strings character by character.
Does this make sense? Please provide more examples if this is not what you are looking for.
PJ
pathos
February 7, 2022, 6:06am
5
Yes they are all identical -- this is just an example. As I mentioned, it should result in 100% similarity.
And you are right that I would like to do string comparisons, but I would specifically like to try the SD method.
library(tidyverse)

# Example data: each car name paired with the name from the previous row
(example_df <- enframe(rownames(mtcars)) %>% mutate(val2 = lag(value)))

# Element-wise Sorensen-Dice score on the sets of unique characters in each string
sorensen <- function(fullx, fully) {
  purrr::map2_dbl(fullx, fully, ~ {
    x <- strsplit(x = .x, split = "") %>% unlist()
    y <- strsplit(x = .y, split = "") %>% unlist()
    2 * length(intersect(x, y)) / (2 * length(intersect(x, y)) +
      length(setdiff(x, y)) +
      length(setdiff(y, x)))
  })
}

### the goal:
sorensen(example_df$value, example_df$val2)
# or
example_df %>% mutate(
  myscore = sorensen(value, val2))
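To tie this back to the original postcode data, here is a minimal base R sketch of the same character-set score applied row by row, without purrr. The helper name char_dice is hypothetical, and the two vectors simply repeat the sample data from the first post:

```r
postcode <- c("3069 XJ", "3076 BJ", "3037 EA",
              "3043 KC", "3031 AM", "3039 ZK")
postcode_city <- paste(postcode, "Rotterdam")

# Hypothetical helper: Sorensen-Dice score on the sets of unique characters.
char_dice <- function(a, b) {
  x <- unique(strsplit(a, "")[[1]])
  y <- unique(strsplit(b, "")[[1]])
  2 * length(intersect(x, y)) / (length(x) + length(y))
}

# One score per row, pairing postcode[i] with postcode_city[i].
scores <- mapply(char_dice, postcode, postcode_city, USE.NAMES = FALSE)
print(scores)
```

Every score here is below 1, because "Rotterdam" adds characters that never occur in the postcode. To get 100% for this sample, the city would have to be stripped from postcode_city before comparing.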