I'm a newbie to R and have a business problem to solve.
We would like to analyze the near duplicate requests for materials posted by our end users to our procurement department.
This will help us to identify most commonly requested materials and to codify them as a stock item, and possibly identify the suppliers who give good rates.
The material description is a free flow text and different interpretations exists, like, for eg, "3 inch" is written as "3 inch", "3 in", 3" (double quotes), "3 inches", "3 inch", etc.
library(stringr)
a = c("3inch3in3''3inches3inchesss5inch4inch2in2.0001''300''13Inch99inch300Inch13Inch099inch")
b = str_extract_all(a,"[-+]?[0-9]*[.,]?[0-9]+(?:[eE][-+]?[0-9]+)?") # regex for all integer and float values
c = data.frame(b) # Converting the vector b to data frame
colnames(c) <- "Value"
c$Value =as.numeric(as.character(c$Value)) # Changing values from factor type to numeric for removal of duplicates. Here first factor are converted to character then numeric.
c = unique(c)
c$Value = paste(c$Value,"inch") # adding 'inch' after each value
O/P
Value
1 3 inch
6 5 inch
7 4 inch
8 2 inch
9 2.0001 inch
10 300 inch
11 13 inch
12 99 inch
HI Prateek
Thanks for taking your time to give a script, I appreciate.
However, this is not limited to only this example.
Like, below are some sample materials requests. If you can see, two requestor has requested the same kind of material with different terminologies.
I want to sort of the maximum occuring materials. In this case, REDUCER CONC BW, is a kind of material that would be requested often, regardless of their sizes.
Another option is using tidystringdist, take a look to this related thread to see an example.
If you need more specific help please provide a minimal REPRoducible EXample (reprex) illustrating your issue. A reprex makes it much easier for others to understand your issue and figure out how to help.
If you've never heard of a reprex before, you might want to start by reading this FAQ:
Here, is one way I would follow to get the required output
text <- c("Material compositions main_element Trg Density_composition_average IsBoron_composition_average IsDBlock_composition_average
Ag10Cu35Zr55 Zr 0.534 0.579069323 -0.585599963 1.029882069
Ag10Cu50Zr40 Cu 0.608 0.748401453 -0.585599963 1.029882069
Ag10Cu55Zr35 Cu 0.609 0.804845496 -0.585599963 1.029882069
Ag20Cu40Zr40 Cu 0.577 0.821973182 -0.585599963 1.029882069
Ag35Ca65 Ca 0.515 -0.860991846 -0.585599963 -0.9989612
Al13.3Co26.7Zr60 Zr 0.573 0.058889078 0.464229426 0.614749523
Al15.4Co26Zr58.6 Zr 0.565 0.013549211 0.629991961 0.549202279
Al15Ca70Mg15 Ca 0.545 -2.233221755 0.598418145 -2.091415267
Al16.9Co25.6Zr57.5 Zr 0.572 -0.0177169 0.748393772 0.502382819
Al18Sr82 Sr 0.527 -1.815268726 0.835221766 -2.091415267
Al19.4Co24.8Zr55.8 Zr 0.577 -0.071319762 0.945730123 0.424350386
Al20Ca65Mg15 Ca 0.597 -2.206276695 0.993090847 -2.091415267
Al21.2Co21.2Zr57.6 Zr 0.575 -0.143767766 1.087812296 0.368167033
Al22.3Co22.3Zr55.4 Zr 0.578 -0.151097759 1.17464029 0.333832763
Al22Co24Zr54 Zr 0.584 -0.126708497 1.150959928 0.343196655
Al23.5Co23.5Zr53 Zr 0.593 -0.159094116 1.269361739 0.296377195
Al30Ca60Mg10 Ca 0.582 -2.156791506 1.782436252 -2.091415267
Al30Sr70 Sr 0.556 -1.811332404 1.782436252 -2.091415267
Al32Ca58Mg10 Ca 0.588 -2.146013482 1.940305333 -2.091415267
Al35Ca65 Ca 0.688 -2.138656309 2.177108955 -2.091415267
Al35La50Ni15 La 0.548 -0.544586548 2.177108955 -1.623220667
Al35Sr65 Sr 0.493 -1.80969227 2.177108955 -2.091415267
Al40La45Ni15 La 0.492 -0.625328007 2.571781658 -1.623220667
Al85Co5Y10 Al 0.48 -1.56006386 6.123835981 -1.623220667
Al85Ni5Y10 Al 0.43 -1.559876416 6.123835981 -1.623220667
Al86.5Co4.5Y9 Al 0.441 -1.582894527 6.242237792 -1.670040127
Al86.5Ni4.5Y9 Al 0.413 -1.582725827 6.242237792 -1.670040127
Al88Co4Y8 Al 0.43 -1.605725193 6.360639602 -1.716859587
Al88Ni4Y8 Al 0.39 -1.605575238 6.360639602 -1.716859587
Al89.5Co3.5Y7 Al 0.39 -1.62855586 6.479041413 -1.763679047
Al89.5Ni3.5Y7 Al 0.37 -1.628424649 6.479041413 -1.763679047
Au35Ca65 Ca 0.508 0.583966294 -0.585599963 -0.9989612
Au49Ag5.5Pd2.3Cu26.9Si16.3 Au 0.623 3.080367076 -0.585599963 0.521110603
Au55Cu25Si20 Au 0.532 3.18404745 -0.585599963 0.405622601
Au77.8Si8.4Ge13.8 Au 0.466 4.418690726 -0.585599963 0.33695406
Au77Si9.4Ge13.6 Au 0.469 4.35226717 -0.585599963 0.311983681
")
### Took some sample table as example from a website ###
### Reading the text file ###
#### One can use read.csv() for comma seperated files
dataframe <- read.table(text = text,sep = "\t",header = TRUE)
### In this dataset, will calculate repeation of same kinds of Alloys irrespective of composition.
library(stringr)
#### DATA FORMATTING ####
dataframe$GROUPBY <- as.character(str_extract_all(dataframe$Material.compositions,"[a-zA-Z]+"))
dataframe$GROUPBY <- str_replace_all(dataframe$GROUPBY,'"',"")
dataframe$GROUPBY <- str_replace_all(dataframe$GROUPBY,'c\\(',"")
dataframe$GROUPBY <- str_replace_all(dataframe$GROUPBY,'\\)',"")
dataframe$GROUPBY <- str_replace_all(dataframe$GROUPBY,',',"")
### GENERATING REQUIRED OUTPUT ###
library(dplyr)
finalreport <- dataframe %>% group_by(GROUPBY) %>% summarise(length(GROUPBY))
### ORDERING DATA w.r.t FREQUENCY ###
finalreport <- finalreport[order(finalreport$`length(GROUPBY)`,decreasing = TRUE),]
O/P
1 Al Co Zr 8 ### 8 implies frequency of Al Co Zr
2 Ag Cu Zr 4
3 Al Ca Mg 4
4 Al Co Y 4
5 Al Ni Y 4
6 Al Sr 3
7 Al La Ni 2
8 Au Si Ge 2
9 Ag Ca 1
10 Al Ca 1
11 Au Ag Pd Cu Si 1
12 Au Ca 1
13 Au Cu Si 1