Retrieving an old cap as a corporate lawyer—it depends.
On the one hand there may be distinct legal entities, such as Farmer's Propane of Shawnee Mission, L.P. and Farmer's Propane of Hutchinson, L.P. Assuming they don't have overlapping market areas, they could both be doing business as Farmer's Propane. They might be controlled by siblings: Farmer's Propane was a parent's proprietorship, and each sibling inherited part of the business and operates it separately. Or Farmer's Propane may have been a general partnership to which the siblings succeeded, so that the partnership owns both businesses and they are operated under common control for the benefit of the partnership.
The possibilities multiply—I won't go into the offshore shell company variations.
That raises the question about the goal: legal distinctiveness, operational distinctiveness or just minimal nominal distinctiveness. And that depends on the purpose of the analysis.
Let's suppose that nominal distinctiveness is the goal, because attempting either of the other two is madness for 150K names.
Here's a framework:
Every R problem can be thought of with advantage as the interaction of three objects: an existing object, x, a desired object, y, and a function, f, that will return a value of y given x as an argument. In other words, school algebra: f(x) = y. Any of the objects can be composites.
Here x can be a list of two vectors, vendor and payee, and y is a vector consisting of the intersection of vendor and payee after each has been subjected to a function, f, to be composed, that makes each element of vendor and payee nominally distinct internally.
Let f_1 be a function that removes stopwords, given as a vector, stops, of the common suffix identifiers
stops <- c("Inc", "Inc.", "Incorporated", "Corporation", "Corp", "Corp.", "Company", "LP", "L.P.", "LLC", "L.L.C.")
(Scanning the ends of names helps identify the candidates present in each vector. Use a {stringr} function or other regex.)
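As a rough illustration of that scan, here is a minimal base R sketch (no {stringr} needed) that tabulates the last word of each name. The sample names and the object names last_word and suffix_counts are made up for the example; the real vendor and payee vectors would take their place.

```r
# Hypothetical sample names; substitute the real vendor or payee vector
vendor <- c("Farmer's Propane of Shawnee Mission, L.P.",
            "Acme Supply Inc.",
            "Prairie Grain Company",
            "Sunflower Trucking LLC")

# Keep only the last whitespace-delimited token of each name
last_word <- sub("^.*[[:space:]]", "", trimws(vendor))

# Tabulate the trailing tokens, most frequent first, to spot stopword candidates
suffix_counts <- sort(table(last_word), decreasing = TRUE)
print(suffix_counts)
```

On 150K names the top of that table should show which suffix spellings are actually in use, and those can then be added to stops.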
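# f_1 drops any stopword tokens from a single name
f_1 <- function(name) {
  words <- strsplit(name, "[ ,]+")[[1]]
  paste(words[!words %in% stops], collapse = " ")
}
# x and y here stand for the two character vectors of names (vendor and payee)
x_clean <- character(length(x))
for (i in seq_along(x)) x_clean[i] <- f_1(x[i])
y_clean <- character(length(y))
for (i in seq_along(y)) y_clean[i] <- f_1(y[i])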
With those results, an f_2 could use intersect(., .) to look for identical names in both lists without regard to the stopwords. Those can be set aside and then removed from the vectors to reduce the search space.
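A minimal sketch of that step, assuming intersect() is the right tool for names common to both lists and setdiff() for what is left over. The vectors vendor_clean and payee_clean are hypothetical stand-ins for the names after stopword removal.

```r
# Hypothetical cleaned vectors, i.e. names with the stopwords already removed
vendor_clean <- c("Acme Supply", "Prairie Grain", "Sunflower Trucking")
payee_clean  <- c("Prairie Grain", "Acme Supply", "Wheatland Feed")

# f_2: names that appear identically in both lists
f_2 <- function(a, b) intersect(a, b)
matched <- f_2(vendor_clean, payee_clean)

# Set the exact matches aside; what remains is the reduced search space
vendor_left <- setdiff(vendor_clean, matched)
payee_left  <- setdiff(payee_clean, matched)
```

Only vendor_left and payee_left then need the more expensive, fuzzier comparisons.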
Another f_n would be to extract the records that begin with Kansas, chop off the stopwords, and work from the last word back to find the unique names.
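One possible reading of "work from the last word back" is to grow a key from the end of each name until it is unique among the Kansas records. The sketch below does that in base R; the sample names and the helpers tail_key and shortest_key are hypothetical, and stops repeats the suffix list so the block runs on its own.

```r
stops <- c("Inc", "Inc.", "Incorporated", "Corporation", "Corp", "Corp.",
           "Company", "LP", "L.P.", "LLC", "L.L.C.")

# Hypothetical sample names
payee <- c("Kansas Gas Service Inc.", "Kansas Gas Supply Corp",
           "Kansas City Power Company", "Acme Supply Inc.",
           "Kansas Wheat Service LLC")

# Keep only the records that begin with "Kansas"
ks <- payee[startsWith(payee, "Kansas")]

# Split each name into words and drop the stopwords
words <- lapply(strsplit(ks, "[[:space:]]+"), function(w) w[!w %in% stops])

# Last k words of a name, pasted back together
tail_key <- function(w, k) paste(tail(w, k), collapse = " ")

# Grow the key from the last word backwards until it is unique in the set
shortest_key <- function(i) {
  for (k in seq_along(words[[i]])) {
    candidates <- vapply(words, tail_key, character(1), k = k)
    if (sum(candidates == candidates[i]) == 1) return(candidates[i])
  }
  tail_key(words[[i]], length(words[[i]]))  # fall back to the full name
}
keys <- vapply(seq_along(ks), shortest_key, character(1))
```

This recomputes the candidate keys for every record, which is fine for a subset such as the Kansas names but would want a smarter approach if run over all 150K at once.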
Etc.