Need to extract a specific text number format from a string

msrose · February 25, 2022, 7:04pm

I have thousands of descriptions with numbers and text combined in a column EX: AB CD PINE_1_200_CL_EB_7X240 and BLS MAN_3_200_LD_BN_10X27_KD. In one column i need just the first number underscore and second number EX: 1_200 and 3_200. In a different column i need just the third number X and fourth number EX: 7X240 and 10X27. I have tried several different methods and it either extracts all of the digits or only some of the digits. I'm trying to specify a way to indicate "digit_digit" for the first column and "digitXdigit" for the second column.

Thanks!

FJCC · February 25, 2022, 7:13pm

Here is one method.

TEXT <- data.frame(Strings= c("AB CD PINE_1_200_CL_EB_7X240", 
"BLS MAN_3_200_LD_BN_10X27_KD"))
library(stringr)
TEXT$First <- str_extract(TEXT$Strings,"\\d+_\\d+")
TEXT$SECOND <- str_extract(TEXT$Strings,"\\d+X\\d+")
TEXT
#>                        Strings First SECOND
#> 1 AB CD PINE_1_200_CL_EB_7X240 1_200  7X240
#> 2 BLS MAN_3_200_LD_BN_10X27_KD 3_200  10X27

^{Created on 2022-02-25 by the reprex package (v2.0.1)}

msrose · February 25, 2022, 8:28pm

This worked for the vast majority of my data. I am having one problem though, there are a few data points that are "BLS MAN_3x6_200_LD_BN_10X27_KD" which in the first column is returning an "NA" and in the second column is returning "3x6". Not sure if there is a fix for this by being able to add in an additional statement to the first column extract such as str_extract(TEXT$Strings,"\d+\d+" & "\d+X\d+\_").

If not, is there a way to have the second column value be filled with NA ONLY IF the first column value is NA?

FJCC · February 25, 2022, 8:55pm

I added your new problematic string to my code and I do not get the result you describe.

TEXT <- data.frame(Strings= c("AB CD PINE_1_200_CL_EB_7X240",
                               "BLS MAN_3_200_LD_BN_10X27_KD", 
                               "BLS MAN_3x6_200_LD_BN_10X27_KD"))
library(stringr)
TEXT$First <- str_extract(TEXT$Strings,"\\d+_\\d+")
TEXT$SECOND <- str_extract(TEXT$Strings,"\\d+X\\d+")
TEXT
                         Strings First SECOND
1   AB CD PINE_1_200_CL_EB_7X240 1_200  7X240
2   BLS MAN_3_200_LD_BN_10X27_KD 3_200  10X27
3 BLS MAN_3x6_200_LD_BN_10X27_KD 6_200  10X27

Is that correct or do you expect a different result?

msrose · February 25, 2022, 9:27pm

Hi! No, that doesn't fix the issue. What that does is in the first column it gives 6_200 and then in the second column it prints 3X6

FJCC · February 25, 2022, 9:38pm

Please post the result of running my code. We seem to be getting different results from the same code. Run each line of my code and when you are done, copy the resulting console output.

system · March 18, 2022, 9:38pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.