find similar or nearly duplicate records

Hi

I'm a newbie to R and have a business problem to solve.

We would like to analyze the near duplicate requests for materials posted by our end users to our procurement department.

This will help us to identify most commonly requested materials and to codify them as a stock item, and possibly identify the suppliers who give good rates.

The material description is a free flow text and different interpretations exists, like, for eg, "3 inch" is written as "3 inch", "3 in", 3" (double quotes), "3 inches", "3 inch", etc.

Your help is much appreciated.

Regards
Ilyas

1 Like

You could try fuzzyjoin:
https://cran.r-project.org/web/packages/fuzzyjoin/index.html

1 Like

I would use: -

library(stringr)

a = c("3inch3in3''3inches3inchesss5inch4inch2in2.0001''300''13Inch99inch300Inch13Inch099inch")

b = str_extract_all(a,"[-+]?[0-9]*[.,]?[0-9]+(?:[eE][-+]?[0-9]+)?")  # regex for all integer and float values

c = data.frame(b)    #  Converting the vector b to data frame
colnames(c) <- "Value"

c$Value =as.numeric(as.character(c$Value)) # Changing values from factor type to numeric for removal of duplicates. Here first factor are converted to character then numeric.

c = unique(c)

c$Value = paste(c$Value,"inch") # adding 'inch' after each value

O/P

 Value
1       3 inch
6       5 inch
7       4 inch
8       2 inch
9  2.0001 inch
10    300 inch
11     13 inch
12     99 inch

Will this help you.

Thanks Martin, Im gonna have a look on this

HI Prateek
Thanks for taking your time to give a script, I appreciate.

However, this is not limited to only this example.
Like, below are some sample materials requests. If you can see, two requestor has requested the same kind of material with different terminologies.

REDUCER CONC BW, SIZE 1(NB):25, Size 2: 15, SCHEDULE/THICKNESS: 40, DIMENSIONAL STD: ASME B16.9, MATERIAL GRADE:ASTM A234 WPB, CERTIFICATION: EN10204 2.2

REDUCER CONC BW, ASME B16.9, size1: DN50, size2: DN40,THK: S-40XS-80, ASTM A234 WPB, Others Requirements: EN 10204-3.1 CERTIFICATE

I want to sort of the maximum occuring materials. In this case, REDUCER CONC BW, is a kind of material that would be requested often, regardless of their sizes.

Hope this is clear.
Regards, Ilyas

Another option is using tidystringdist, take a look to this related thread to see an example.

If you need more specific help please provide a minimal REPRoducible EXample (reprex) illustrating your issue. A reprex makes it much easier for others to understand your issue and figure out how to help.

If you've never heard of a reprex before, you might want to start by reading this FAQ:

1 Like

Here, is one way I would follow to get the required output

text <- c("Material compositions	main_element	Trg	Density_composition_average	IsBoron_composition_average	IsDBlock_composition_average
Ag10Cu35Zr55	Zr	0.534	0.579069323	-0.585599963	1.029882069
Ag10Cu50Zr40	Cu	0.608	0.748401453	-0.585599963	1.029882069
Ag10Cu55Zr35	Cu	0.609	0.804845496	-0.585599963	1.029882069
Ag20Cu40Zr40	Cu	0.577	0.821973182	-0.585599963	1.029882069
Ag35Ca65	Ca	0.515	-0.860991846	-0.585599963	-0.9989612
Al13.3Co26.7Zr60	Zr	0.573	0.058889078	0.464229426	0.614749523
Al15.4Co26Zr58.6	Zr	0.565	0.013549211	0.629991961	0.549202279
Al15Ca70Mg15	Ca	0.545	-2.233221755	0.598418145	-2.091415267
Al16.9Co25.6Zr57.5	Zr	0.572	-0.0177169	0.748393772	0.502382819
Al18Sr82	Sr	0.527	-1.815268726	0.835221766	-2.091415267
Al19.4Co24.8Zr55.8	Zr	0.577	-0.071319762	0.945730123	0.424350386
Al20Ca65Mg15	Ca	0.597	-2.206276695	0.993090847	-2.091415267
Al21.2Co21.2Zr57.6	Zr	0.575	-0.143767766	1.087812296	0.368167033
Al22.3Co22.3Zr55.4	Zr	0.578	-0.151097759	1.17464029	0.333832763
Al22Co24Zr54	Zr	0.584	-0.126708497	1.150959928	0.343196655
Al23.5Co23.5Zr53	Zr	0.593	-0.159094116	1.269361739	0.296377195
Al30Ca60Mg10	Ca	0.582	-2.156791506	1.782436252	-2.091415267
Al30Sr70	Sr	0.556	-1.811332404	1.782436252	-2.091415267
Al32Ca58Mg10	Ca	0.588	-2.146013482	1.940305333	-2.091415267
Al35Ca65	Ca	0.688	-2.138656309	2.177108955	-2.091415267
Al35La50Ni15	La	0.548	-0.544586548	2.177108955	-1.623220667
Al35Sr65	Sr	0.493	-1.80969227	2.177108955	-2.091415267
Al40La45Ni15	La	0.492	-0.625328007	2.571781658	-1.623220667
Al85Co5Y10	Al	0.48	-1.56006386	6.123835981	-1.623220667
Al85Ni5Y10	Al	0.43	-1.559876416	6.123835981	-1.623220667
Al86.5Co4.5Y9	Al	0.441	-1.582894527	6.242237792	-1.670040127
Al86.5Ni4.5Y9	Al	0.413	-1.582725827	6.242237792	-1.670040127
Al88Co4Y8	Al	0.43	-1.605725193	6.360639602	-1.716859587
Al88Ni4Y8	Al	0.39	-1.605575238	6.360639602	-1.716859587
Al89.5Co3.5Y7	Al	0.39	-1.62855586	6.479041413	-1.763679047
Al89.5Ni3.5Y7	Al	0.37	-1.628424649	6.479041413	-1.763679047
Au35Ca65	Ca	0.508	0.583966294	-0.585599963	-0.9989612
Au49Ag5.5Pd2.3Cu26.9Si16.3	Au	0.623	3.080367076	-0.585599963	0.521110603
Au55Cu25Si20	Au	0.532	3.18404745	-0.585599963	0.405622601
Au77.8Si8.4Ge13.8	Au	0.466	4.418690726	-0.585599963	0.33695406
Au77Si9.4Ge13.6	Au	0.469	4.35226717	-0.585599963	0.311983681
")
### Took some sample table as example from a website ###

### Reading the text file ###
#### One can use read.csv() for comma seperated files

dataframe <- read.table(text = text,sep = "\t",header = TRUE)
###  In this dataset, will calculate repeation of same kinds of Alloys irrespective of composition.
library(stringr)

#### DATA FORMATTING ####
dataframe$GROUPBY <- as.character(str_extract_all(dataframe$Material.compositions,"[a-zA-Z]+"))
dataframe$GROUPBY <- str_replace_all(dataframe$GROUPBY,'"',"")
dataframe$GROUPBY <- str_replace_all(dataframe$GROUPBY,'c\\(',"")
dataframe$GROUPBY <- str_replace_all(dataframe$GROUPBY,'\\)',"")
dataframe$GROUPBY <- str_replace_all(dataframe$GROUPBY,',',"")

### GENERATING REQUIRED OUTPUT ###
library(dplyr)
finalreport <- dataframe %>% group_by(GROUPBY) %>% summarise(length(GROUPBY))

### ORDERING DATA w.r.t FREQUENCY ###
finalreport <- finalreport[order(finalreport$`length(GROUPBY)`,decreasing = TRUE),]

O/P
1       Al Co Zr 8 ### 8 implies frequency of Al Co Zr 
2	Ag Cu Zr 4
3	Al Ca Mg 4
4	Al Co Y 4
5	Al Ni Y 4
6	Al Sr	 3
7	Al La Ni 2
8	Au Si Ge 2
9	Ag Ca 1
10	Al Ca 1
11	Au Ag Pd Cu Si 1
12	Au Ca 1
13	Au Cu Si 1

Is this helpful :slightly_smiling_face:

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.