Separate specific values in string vectors across various columns

ysc · October 25, 2021, 2:38pm

Hi all. I was wondering how to identify and then split certain values in string vectors. Say we have the simplified dataframe:

df <- data.frame(
  var1 = c("1", "2,3", "4", "5"),
  var2 = c("3", "4", "5,6", "7,8"),
  var3 = c("2", "5", "6", "9,10")
)

Notice that some values are two integers separated by a comma. I would like to identify all cases in which this occurs and then replace that case using the highest integer, so "2,3" becomes "3" instead and so on. I have been trying with separate() and strsplit() to no avail.

Ultimately, my goal is to transform these columns into numeric variables, but need to first identify those pesky commas throughout the dataframe. Perhaps there is an easiest way to do this without splitting strings? Thanks!

DavoWW · October 26, 2021, 3:15am

Hi @ysc,
I tried "splitting" whole columns but it got messy!
Here is a way of achieving your result using matrices and operating on each element. It also works with more than 2 values per string:

df <- data.frame(
  var1 = c("1", "2,3", "4", "5"),
  var2 = c("3", "4", "5,6", "7,8"),
  var3 = c("2,3,4", "5", "6", "9,10,11,12")
)

mat <- as.matrix(df)
new <- matrix(data=NA, ncol=ncol(mat), nrow=nrow(mat))

for (ii in 1:nrow(mat)) {
  for (jj in 1:ncol(mat)) {
    tmp <- mat[ii, jj]
    xx <- unlist(strsplit(tmp, ","))
    yy <- max(as.integer(xx))
    new[ii, jj] <- yy
  }
}

df1 <- as.data.frame(new)
names(df1) <- names(df)

df
#>   var1 var2       var3
#> 1    1    3      2,3,4
#> 2  2,3    4          5
#> 3    4  5,6          6
#> 4    5  7,8 9,10,11,12
df1
#>   var1 var2 var3
#> 1    1    3    4
#> 2    3    4    5
#> 3    4    6    6
#> 4    5    8   12

^{Created on 2021-10-26 by the reprex package (v2.0.1)}

Maybe someone else has a nice tidyverse solution?

ysc · October 26, 2021, 7:07am

Thanks so much for the reply. It certainly works but you're right it got a bit messy! I wonder if I'm missing something much simpler and obvious. Ultimately the goal is to transform these columns to numeric, so perhaps there's another specialised function that saves us the "splitting"?

Andrzej · October 26, 2021, 6:54pm

Would you please care to elaborate a bit what is happening here in this code:

for (ii in 1:nrow(mat)) {
  for (jj in 1:ncol(mat)) {
    tmp <- mat[ii, jj]
    xx <- unlist(strsplit(tmp, ","))
    yy <- max(as.integer(xx))
    new[ii, jj] <- yy
  }
}

kaijabean · October 27, 2021, 4:11am

Here's a tidyverse solution, @ysc!

library(tidyverse) # loading tidyverse

# create the sample data frame
df <- data.frame(
  var1 = c("1", "2,3", "4", "5"),
  var2 = c("3", "4", "5,6", "7,8"),
  var3 = c("2", "5", "6", "9,10")
)

# create a new data frame with the commas and first numbers removed
df1 <- df %>%
  mutate(across(everything(), function(x){
    stringr::str_remove(x, "[0-9]*,") # using `str_remove` to get rid of a number followed by a comma
  })) %>%
# now that we've removed the numbers and commas, convert all columns to numeric
  mutate(across(everything(), as.numeric))

# Let's take a look at the new data frame:
> df1
var1 var2 var3
1    1    3    2
2    3    4    5
3    4    6    6
4    5    8   10
# awesome, looks like we got rid of those numbers and commas!

# Check the structure of the data frame to make sure all the cols are numeric:
str(df1)

'data.frame':	4 obs. of  3 variables:
 $ var1: num  1 3 4 5
 $ var2: num  3 4 6 8
 $ var3: num  2 5 6 10
# looks good!

This solution relies on a regular expression in the line stringr::str_remove(x, "[0-9]*,")
Here's what it's saying.

stringr::str_remove: I'm using the str_remove function in the stringr package to remove a specific pattern from all strings where it occurs. For any strings where the pattern doesn't occur, nothing will happen--the pattern isn't found, so it won't be removed.

"[0-9]*,": this means "any digits any number of times, followed by a comma". [0-9] looks for any digit; * means "any number of times, and ,` is a literal comma.

Note: if your real data has a different format than the toy example (i.e. if the thing you're trying to remove isn't always digits followed by a comma) then you may need to re-write the regular expression to fit your use case. If you need help with that, let me know! Here is a regex cheatsheet (second page).

The mutate(across(everything())) syntax is also a little confusing. across() is a relatively new function in dplyr, and it can be hard to get used to. But basically this code is just saying "For each column, apply this function" and then you define the function, which in our case is the str_remove section.

I hope that's helpful! Let me know if you need any more help here.

system · November 3, 2021, 4:11am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.