Matching items based on text descriptions in a dataset

a.jain · January 7, 2020, 2:43pm

I have a long list of item descriptions. Some of them are duplicates, but the text does not match exactly. For example, "The Brawn White Bolt Laser 10W" and "Laser for 10 Watt White Bolt" describe the same item in different words. There are many such cases in the list. My objective is to identify these duplicate items from their descriptions.

My idea is to extract the nouns from each description and compare them across descriptions, grouping together the descriptions that share the most nouns. Does this make sense? Is there an algorithm for this in R? Is there any other way to do this kind of matching?
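
A rough sketch of a simpler relative of the noun-matching idea: instead of tagging nouns, it compares the sets of words in two descriptions (Jaccard similarity). It uses base R only. The helper names `tokenize` and `jaccard`, the stop-word list, and the rule that "10 Watt" and "10W" mean the same thing are all assumptions made for illustration, not part of the original data or of any package.

```r
# Hypothetical helper: turn a description into a set of lower-case tokens.
tokenize <- function(x, stop_words = c("the", "for", "a", "an", "of", "to")) {
  x <- tolower(x)
  # Assumed domain rule: "10 watt" and "10w" are the same rating.
  x <- gsub("([0-9]+)\\s*watts?\\b", "\\1w", x, perl = TRUE)
  tokens <- unlist(strsplit(x, "[^a-z0-9]+"))
  tokens <- tokens[nzchar(tokens)]          # drop empty pieces from the split
  setdiff(unique(tokens), stop_words)
}

# Jaccard similarity: shared tokens divided by all distinct tokens.
jaccard <- function(a, b) {
  ta <- tokenize(a)
  tb <- tokenize(b)
  length(intersect(ta, tb)) / length(union(ta, tb))
}

# 0.8 under these assumptions: 4 shared tokens out of 5 distinct ones.
jaccard("The Brawn White Bolt Laser 10W", "Laser for 10 Watt White Bolt")

# 0: no tokens in common.
jaccard("The Brawn White Bolt Laser 10W", "ROBOT, HYD BASIN")
```

Word-set comparison ignores word order, which suits descriptions that list the same terms in a different sequence, but the similarity threshold for calling two items "the same" still has to be chosen by inspecting real pairs.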

woodward · January 7, 2020, 10:49pm

You need to provide more information. If there are only a few different descriptions you can standardise them manually. Or maybe you can use simple string manipulation to standardise them. If you have an item code (e.g. a stock keeping unit, SKU) you can use that. In the worst case it is a difficult language processing problem.

Please provide a sample of your data using dput.

andresrcs · January 8, 2020, 3:07am

You could try fuzzy matching using tidystringdist; see this related thread.
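
To give a feel for what edit-distance fuzzy matching does, here is a small sketch on four descriptions from this kind of data. It deliberately swaps in base R's `adist()` (generalised Levenshtein distance) for tidystringdist so that it runs without extra packages. The threshold `max_dist` is a made-up value for illustration.

```r
desc <- c(
  "HOSE/GUN MODULE 1028328A",
  "1028328A HOSE/GUN MODULE",
  "VN007052, CYLINDER RETURN BRACKET",
  "VN007052 C, CYLINDER RETURN BRACKET"
)

# Pairwise edit distances as a square matrix (base R, utils::adist).
d <- adist(desc)

# Keep each pair once (upper triangle) if it is within the assumed threshold.
max_dist <- 3
idx <- which(d <= max_dist & upper.tri(d), arr.ind = TRUE)
close_pairs <- data.frame(
  a = desc[idx[, "row"]],
  b = desc[idx[, "col"]],
  dist = d[idx],
  stringsAsFactors = FALSE
)
print(close_pairs)
```

Only the two bracket descriptions come out as close (distance 2). The two hose/gun descriptions contain the same words in a different order, and plain edit distance treats them as far apart, so character-level fuzzy matching alone will miss reordered duplicates.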

a.jain · January 8, 2020, 10:24am

Thanks, woodward, for your reply. There are 36451 item descriptions.
Please see the sample data below, produced with dput():

dput(head(data,20))

structure(list(Item # = c("22300406", "2G10089L12", "2G10156",
"2G10198", "2G10201L12", "2G10244L12", "2G10278", "2G10303",
"2G10306", "2G10307", "2G10308", "2G10316", "2G10327", "2G10341L12",
"2G10458", "2G10483", "2G1055", "2G10557", "2G10610", "2G10694"
), Description = c("SGM-02B312PY MOTOR", "VE-001572 TURNBUCKLE, KNUCKLE JOINT",
"ROBOT, HYD BASIN", "SGM-02B312PY-X1 MOTOR", "HOSE/GUN MODULE 1028328A",
"Laser for 10 Watt White Bolt", "BUSHING, 4429K425, REDUCER 1-1/2" MALE NPT TO 3/4"",
"HOSE/GUN MODULE 1028328A", "VN007052, CYLINDER RETURN BRACKET",
"G020900, HYDRATION PS2 EOAT", "BUSHING, 4429K425, REDUCER 1-1/4" MALE NPT TO 3/4"",
"VN007052 C, CYLINDER RETURN BRACKET", "1028328A HOSE/GUN MODULE",
"VE-001572 TURNBUCKLE, KNUCKLE JOINT", "MY1H20-200 CYLINDER",
"INTERFACE 8 PORT BLOCK", "The Brawn White Bolt Laser 10W", "MY1H20-200-10 CYLINDER",
"SGM-02B312PY-X2 MOTOR", "BLOCK, INTERFACE 8 PORT, C04T8N-00-PB-050M"
)), row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"
))

Thanks...

a.jain · January 8, 2020, 11:35am

Thanks, andresrcs, for pointing me to the related thread. I tried fuzzy matching using tidystringdist, but this line of code:

match <- tidy_comb_all(data$Description)

ran for around 20 minutes, after which I got the error "Error: cannot allocate vector of size 7.6 Gb".

The code works fine on a small data set, but with 36451 item descriptions it does not finish. Is there a workaround for this? Thanks...
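
One possible workaround, sketched under assumptions: comparing every description with every other one means about 664 million pairs for 36451 strings (`choose(36451, 2)`), which is most likely what exhausts memory. A cheaper first pass is to remove exact duplicates and then group descriptions by a normalised key, so that no all-pairs table is built at all. The helper `token_key` and its normalisation rule (lower-case, split on non-alphanumerics, sort the unique words) are hypothetical choices for illustration.

```r
desc <- c(
  "HOSE/GUN MODULE 1028328A",
  "1028328A HOSE/GUN MODULE",
  "HOSE/GUN MODULE 1028328A",
  "INTERFACE 8 PORT BLOCK",
  "MY1H20-200 CYLINDER"
)

# Hypothetical key: lower-case tokens, de-duplicated, sorted, re-joined.
token_key <- function(x) {
  vapply(strsplit(tolower(x), "[^a-z0-9]+"), function(tok) {
    paste(sort(unique(tok[nzchar(tok)])), collapse = " ")
  }, character(1))
}

uniq <- unique(desc)                      # exact duplicates are removed first
groups <- split(uniq, token_key(uniq))    # one group per distinct key
dupes <- Filter(function(g) length(g) > 1, groups)
print(dupes)
```

Memory use grows linearly with the number of descriptions here. The groups only catch reordered or re-punctuated duplicates; if finer fuzzy matching is still needed, it can then be run inside much smaller blocks (for example, descriptions sharing a part number) rather than across the whole list.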

woodward · January 8, 2020, 3:49pm

This is going to be tough for an algorithm. Even as a human I can't tell whether some of these items are the same or different. Item names can differ by a single character and still be different items,
e.g.
"BUSHING, 4429K425, REDUCER 1-1/2" MALE NPT TO 3/4"" = "BUSHING, 4429K425, REDUCER 1-1/4" MALE NPT TO 3/4""?
"SGM-02B312PY MOTOR" = "SGM-02B312PY-X1 MOTOR"? = "SGM-02B312PY-X2 MOTOR"?

Are the Item # unique? Can't you just use that?

Your structure() is malformed, maybe because of the " symbols.
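
A sketch of how a few of the posted rows could be entered so that R parses them, assuming the problems are the unquoted column name `Item #` and the unescaped inch marks inside the strings. The column name is wrapped in backticks and the descriptions containing `"` are written with single quotes. Only four rows from the post are reproduced; the two checks at the end are one way to answer the uniqueness question.

```r
data <- data.frame(
  `Item #` = c("2G10201L12", "2G10278", "2G10303", "2G10308"),
  Description = c(
    "HOSE/GUN MODULE 1028328A",
    'BUSHING, 4429K425, REDUCER 1-1/2" MALE NPT TO 3/4"',
    "HOSE/GUN MODULE 1028328A",
    'BUSHING, 4429K425, REDUCER 1-1/4" MALE NPT TO 3/4"'
  ),
  check.names = FALSE,        # keep "Item #" as the column name
  stringsAsFactors = FALSE
)

# 0 means every item number occurs exactly once.
anyDuplicated(data[["Item #"]])

# Rows whose description is repeated verbatim under another item number.
repeated <- duplicated(data$Description) |
  duplicated(data$Description, fromLast = TRUE)
data[repeated, ]
```

In these rows the item numbers are unique, yet the same description appears under two different item numbers, so the item number alone would not flag that pair.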
