Modelling Approach to fraud using street addresses


I'm not sure if this is the right forum; please delete this if it isn't appropriate.

I am currently trying to model fraud based on, among other things, street addresses and phone numbers. I have a dataset where 10% of the records are confirmed fraud. The data consists of addresses and phone numbers. The addresses are unstructured free text, so anyone can put anything in there. For example, I could have "1 RStudio Road", "RStudio Road 1", or "One Rstdio Rd" as addresses.

My plan was to first split the data into a train and test set.

Using just the training set, I wanted to use Levenshtein distance to compare confirmed shipper addresses with new customer shipper addresses. I would also use the same reference table to compare receivers with previous receivers of suspicious goods.
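In case a sketch helps, here is a minimal pure-Python version of that similarity feature. The names `levenshtein` and `address_similarity` are mine, not anything standard, and the normalisation choice (divide by the longer string's length) is just one reasonable option:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic Wagner-Fischer edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def address_similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1]; 1.0 means identical after lowercasing."""
    a, b = a.lower().strip(), b.lower().strip()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

One caveat: plain Levenshtein penalises word reordering ("1 RStudio Road" vs "RStudio Road 1" are far apart character-wise), so sorting the tokens of each address before comparing may give a more forgiving score for that failure mode.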

In the end, a row of data might look something like this:

id, orig_add_sim, dest_add_sim, valid_phone, fraud
1,  0.11,         0.22,         1,           0
2,  0.42,         0.95,         0,           1

So in the case above, an algorithm might decide that record 1 is OK because of its low similarity with previous point-to-point locations. Record 2, although it has a relatively low origin similarity score, has an invalid phone number and a very high similarity score with a previous receiver. I would then use something like logistic regression to do this at scale. I can't help thinking that by using existing reference data to compare confirmed cases, we are by proxy kind of cheating and biasing the model.
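For what it's worth, the logistic-regression step can be sketched in plain Python on the two toy rows above. This is only an illustration with made-up hyperparameters (learning rate, epoch count), not a production fit; in practice a library implementation with regularisation would be preferable:

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Plain stochastic gradient descent on the log-loss; fine for toy data."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of log-loss w.r.t. the linear score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict(w, b, x) -> float:
    """Probability that a record with features x is fraudulent."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy rows from the table: [orig_add_sim, dest_add_sim, valid_phone]
X = [[0.11, 0.22, 1], [0.42, 0.95, 0]]
y = [0, 1]
w, b = fit_logistic(X, y)
```

After fitting, `predict(w, b, [0.11, 0.22, 1])` comes out below 0.5 and `predict(w, b, [0.42, 0.95, 0])` above 0.5, matching the intuition described above. With only 10% confirmed fraud, class weighting or resampling would also be worth considering.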

My hope was to identify new fraud through a network effect, i.e. people who commit fraud probably know other people doing the same thing, so there would be a network of the same people ordering from the same addresses, and maybe from addresses I had not considered before.

However, if, for example, I train on 1,000 records, I may get a new dataset which doesn't contain any previously identified addresses. The similarity scores would then be low for all of them, biasing the model towards existing knowledge.

Does anyone have any thoughts on whether this is the correct approach, or any ideas for a better one?

Thanks very much

Can you encode the address as longitude and latitude?

Hi @Max

Thanks for coming back to me. It's an interesting topic that I keep coming back to every couple of months or so.

We do have lat/lon for some of the addresses (roughly 2%). However, I have about a million records to play around with, so I don't think batch uploading them to something like OSM will be very quick (they probably wouldn't appreciate me hammering their servers), unless you know of a way to do it without impacting other people's use of the service?

Thanks again for your time
Have a nice weekend

I was involved in a similar project earlier, maybe our approach could be helpful.

What we did was score each postal code based on the relative frequency of fraud attempts, i.e. postal codes with no recorded frauds got a perfect score.
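A minimal sketch of that scoring, assuming "score" means 1 minus the observed fraud rate per postal code (the function name and the toy data below are mine, not from the original project):

```python
from collections import Counter


def postal_code_scores(records):
    """records: iterable of (postal_code, is_fraud) pairs.

    Returns {postal_code: score}, where 1.0 means no recorded fraud
    and 0.0 means every recorded attempt from that code was fraud.
    """
    totals, frauds = Counter(), Counter()
    for code, is_fraud in records:
        totals[code] += 1
        frauds[code] += int(is_fraud)
    return {code: 1.0 - frauds[code] / totals[code] for code in totals}


records = [("1010", 0), ("1010", 0), ("2020", 1), ("2020", 0), ("3030", 1)]
scores = postal_code_scores(records)
# scores["1010"] -> 1.0 (no fraud), scores["2020"] -> 0.5, scores["3030"] -> 0.0
```

One thing to watch: postal codes with very few records get extreme scores (like "3030" above, scored 0.0 from a single observation), so some form of smoothing or a minimum-count threshold may be worth adding.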

This was then included as a variable in the fraud-detection model, which was fitted using logistic regression. However, note that if you do not have the right covariates, this model can suffer from overfitting.

For example, the model initially flagged everyone from the worst postal code as a fraud, regardless of their other parameters. After introducing more variables to the model (e.g. annual income), this was no longer an issue and the partial effect of the postal-code indicator diminished.

Hi @taran

Thank you very much for your reply. I will try out your technique and let you know if it improves the quality of our classification.

Thanks again for taking the time out to answer