How to remove undesired space before certain observations in a dataframe column

cp1998 · October 22, 2023, 1:12pm

Hi,

So I have an excel file with a column that is peoples surnames. Certain entries have an undesired blank space at the start of the entry. I want to remove this space from all the impacted entries. (Not real data, only for demo purposes)

So I have read in the excel file to R and stored the data in a data frame called EU_DATA. No issues here.

I now want to remove the blank space from the start of the relevant entries

I am using the following code to trim the white space but it doesn't do anything
EU_DATA$Op_s.name <- trimws(EU_DATA$Op_s.name, which = "left")

I have als tried the following code using gsub but it also doesn't do anything
EU_DATA$Op_s.name <- gsub("^\s+","",EU_DATA$Op_s.name)

I then created my own character vector a <- c(" Lyons"," Murphy")
and then when I run the above code on the vector a, then it successfully removes the blank spaces. So I am really confused as to what is going on.

Any help in solving this would be much appreciated

Regards,
Colm

jrkrideau · October 22, 2023, 1:41pm

Have a look at trimws().

?trimws

technocrat · October 23, 2023, 6:48am

In our minds content is our first concern. R is different. Its first concern is

what object have I been handed?

So, although the contents of a vector and a data frame column may be identical, they aren't identical objects.

mtcars[1]  |> str()
#> 'data.frame':    32 obs. of  1 variable:
#>  $ mpg: num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
mtcars$mpg |> str()
#>  num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
identical(mtcars$mpg,mtcars[1])
#> [1] FALSE

^{Created on 2023-10-22 with reprex v2.0.2}

It's not always easy to mentally map.

For the simple case of leading and trailing whitespace, trimws() beats regular expressions, but there are less straightforward situations when gsub() and friends are the right tool. The good news is that regular expressions are very powerful and the bad news is that they are easy to mess up. But the joyous news is that if you ask nicely, AI bots will do it for you. For example, I asked: "In the R programming language, give me a regular expression in base that will match whitespace at either the beginning or the end of a character string." I got

In base R, you can use the gsub() function with regular expressions to match and remove whitespace at the beginning or the end of a string. To match whitespace at the beginning or the end of a string, you can use the regular expression ^\\s+|\\s+$. The caret symbol ^ matches the start of a string, \\s+ matches one or more whitespace characters, and the dollar symbol $ matches the end of a string. The pipe symbol | represents "or" in the regular expression, so it matches either the beginning or the end of the string.

The bot remembered that \ has to be escaped.

But once the right function or regular expression is found, getting it back into its origin can also be tricky. Another example:

d <- data.frame(
  the_names = c(" Tisa Casper","Jaimee Brekke ", 
                "Lizabeth Emard","Dorian Grant ",
                " Damaris Hahn"," Jeane Conroy ",
                " Dow Senger-Lind "," Lucile Turner-Christiansen ",
                " Alvera Hoeger ","Clarabelle Schuppe"))
d
#>                       the_names
#> 1                   Tisa Casper
#> 2                Jaimee Brekke 
#> 3                Lizabeth Emard
#> 4                 Dorian Grant 
#> 5                  Damaris Hahn
#> 6                 Jeane Conroy 
#> 7              Dow Senger-Lind 
#> 8   Lucile Turner-Christiansen 
#> 9                Alvera Hoeger 
#> 10           Clarabelle Schuppe

strips <- "^\\s+|\\s+$"
gsub(strips,"",d$the_names)
#>  [1] "Tisa Casper"                "Jaimee Brekke"             
#>  [3] "Lizabeth Emard"             "Dorian Grant"              
#>  [5] "Damaris Hahn"               "Jeane Conroy"              
#>  [7] "Dow Senger-Lind"            "Lucile Turner-Christiansen"
#>  [9] "Alvera Hoeger"              "Clarabelle Schuppe"
trimws(d$the_names)
#>  [1] "Tisa Casper"                "Jaimee Brekke"             
#>  [3] "Lizabeth Emard"             "Dorian Grant"              
#>  [5] "Damaris Hahn"               "Jeane Conroy"              
#>  [7] "Dow Senger-Lind"            "Lucile Turner-Christiansen"
#>  [9] "Alvera Hoeger"              "Clarabelle Schuppe"
sapply(d,trimws)
#>       the_names                   
#>  [1,] "Tisa Casper"               
#>  [2,] "Jaimee Brekke"             
#>  [3,] "Lizabeth Emard"            
#>  [4,] "Dorian Grant"              
#>  [5,] "Damaris Hahn"              
#>  [6,] "Jeane Conroy"              
#>  [7,] "Dow Senger-Lind"           
#>  [8,] "Lucile Turner-Christiansen"
#>  [9,] "Alvera Hoeger"             
#> [10,] "Clarabelle Schuppe"
apply(d,2,trimws)
#>       the_names                   
#>  [1,] "Tisa Casper"               
#>  [2,] "Jaimee Brekke"             
#>  [3,] "Lizabeth Emard"            
#>  [4,] "Dorian Grant"              
#>  [5,] "Damaris Hahn"              
#>  [6,] "Jeane Conroy"              
#>  [7,] "Dow Senger-Lind"           
#>  [8,] "Lucile Turner-Christiansen"
#>  [9,] "Alvera Hoeger"             
#> [10,] "Clarabelle Schuppe"
d |> dplyr::mutate(the_names = trimws(the_names))
#>                     the_names
#> 1                 Tisa Casper
#> 2               Jaimee Brekke
#> 3              Lizabeth Emard
#> 4                Dorian Grant
#> 5                Damaris Hahn
#> 6                Jeane Conroy
#> 7             Dow Senger-Lind
#> 8  Lucile Turner-Christiansen
#> 9               Alvera Hoeger
#> 10         Clarabelle Schuppe
d |> dplyr::transmute(the_names = trimws(the_names))
#>                     the_names
#> 1                 Tisa Casper
#> 2               Jaimee Brekke
#> 3              Lizabeth Emard
#> 4                Dorian Grant
#> 5                Damaris Hahn
#> 6                Jeane Conroy
#> 7             Dow Senger-Lind
#> 8  Lucile Turner-Christiansen
#> 9               Alvera Hoeger
#> 10         Clarabelle Schuppe
# doesn't do what you might think
# d[1] <- trimws(d[1])
# but these do
d[,1] <- trimws(d[,1])
d$the_names <- trimws(d$the_names)

^{Created on 2023-10-22 with reprex v2.0.2}

To my view dplyr::mutate() probably provides most users with the smoothest path to modifying a data frame in place, although I prefer working in vector and matrix or data.table objects.

system · December 4, 2023, 6:49am

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.