Hi,
I'm not sure if this is the right forum, please delete it if it isn't appropriate.
I am currently trying to model fraud based on, among other things, street addresses and phone numbers. I have a dataset where 10% of the data is confirmed fraud. The data comprises addresses and phone numbers. The addresses are unordered free text, so anyone can put anything in there. For example, I could have "1 RStudio Road" or "RStudio Road 1" or "One Rstdio Rd" as addresses.
My plan was to first split the data into a train and test set.
Using just the training set, I wanted to use Levenshtein distance to compare confirmed shipper addresses with new customer shipper addresses. I would also use the same reference table to compare receivers with previous receivers of suspicious goods.
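Roughly what I had in mind for that step is sketched below (just a rough sketch; `orders`, `shipper_address`, `receiver_address` and `fraud` are placeholder names for my real columns, and I'm assuming the stringdist package for the Levenshtein similarity):

```r
library(stringdist)

# split the labelled data into train and test sets
set.seed(42)
train_idx <- sample(nrow(orders), size = 0.8 * nrow(orders))
train <- orders[train_idx, ]
test  <- orders[-train_idx, ]

# Levenshtein-based similarity in [0, 1]; take the best match against the
# reference addresses, e.g. "1 RStudio Road" vs "RStudio Road 1"
best_sim <- function(addr, reference_addrs) {
  max(stringsim(tolower(addr), tolower(reference_addrs), method = "lv"))
}

# reference tables built from confirmed fraud in the training set only
fraud_shippers  <- train$shipper_address[train$fraud == 1]
fraud_receivers <- train$receiver_address[train$fraud == 1]

test$orig_add_sim <- sapply(test$shipper_address,  best_sim, reference_addrs = fraud_shippers)
test$dest_add_sim <- sapply(test$receiver_address, best_sim, reference_addrs = fraud_receivers)
```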
In the end a row of data might look something like
id, orig_add_sim, dest_add_sim, valid_phone, fraud
1, 0.11, 0.22, 1, 0
2, 0.42, 0.95, 0, 1
So in the case above an algorithm might decide that record 1 is OK because of its low similarity with previous point-to-point locations, while record 2, although it has a relatively low origin similarity score, has an invalid phone number and a very high similarity score with a previous receiver. I would then use something like logistic regression to do this at scale.
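For the modelling step I was picturing something along these lines (again just a sketch, assuming `train_df` and `test_df` are data frames with the columns from the example rows above):

```r
# logistic regression on the similarity and phone-validity features
fit <- glm(fraud ~ orig_add_sim + dest_add_sim + valid_phone,
           data = train_df, family = binomial)
summary(fit)

# predicted probability of fraud for each held-out record
test_df$fraud_prob <- predict(fit, newdata = test_df, type = "response")

# flag anything above a threshold (0.5 here is purely a placeholder)
test_df$flagged <- test_df$fraud_prob > 0.5
```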
I can't help thinking that by using existing reference data to compare against confirmed cases we are, in a way, cheating by proxy and biasing the model.
My hope was to maybe identify new fraud based on the network effect, i.e. people who commit fraud probably know other people doing the same thing, and so there would be a network of the same people ordering from the same addresses, and maybe from addresses I had not considered before.
However, if, for example, I train on 1,000 records, I may get a new dataset which doesn't contain any previously identified addresses, and so the similarity scores would be low for all of them, tending to bias the model towards existing knowledge.
Does anyone have any thoughts on whether this would be the correct approach, or any ideas for a better approach?
Thanks very much