In our minds, content is our first concern. R is different. Its first concern is: what object have I been handed?
So, although the contents of a vector and of a one-column data frame may be identical, they aren't identical objects.
mtcars[1] |> str()
#> 'data.frame': 32 obs. of 1 variable:
#> $ mpg: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
mtcars$mpg |> str()
#> num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
identical(mtcars$mpg,mtcars[1])
#> [1] FALSE
Created on 2023-10-22 with reprex v2.0.2
It's not always easy to keep a mental map of which kind of object each expression returns.
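A quick way to check what R has handed back is to ask for the class of each extraction form. The sketch below uses only base R and the built-in mtcars data; it is an illustration, not an exhaustive list of subsetting forms.

```r
# Three ways to reach the mpg column of the built-in mtcars data
class(mtcars[1])      # single bracket: a one-column data frame
class(mtcars[[1]])    # double bracket: the numeric vector itself
class(mtcars[, 1])    # row/column indexing: also the numeric vector

# $ and [[ ]] return the same object, so this comparison is TRUE
identical(mtcars$mpg, mtcars[[1]])

# drop = FALSE keeps the data frame wrapper even with [, 1]
class(mtcars[, 1, drop = FALSE])
```

When a comparison with identical() unexpectedly returns FALSE, checking class() or str() on both sides is usually the fastest way to see which wrapper differs.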
For the simple case of leading and trailing whitespace, trimws() beats regular expressions, but there are less straightforward situations when gsub() and friends are the right tool. The good news is that regular expressions are very powerful and the bad news is that they are easy to mess up. But the joyous news is that if you ask nicely, AI bots will do it for you. For example, I asked: "In the R programming language, give me a regular expression in base that will match whitespace at either the beginning or the end of a character string." I got
In base R, you can use the gsub() function with regular expressions to match and remove whitespace at the beginning or the end of a string. To match whitespace at the beginning or the end of a string, you can use the regular expression ^\\s+|\\s+$. The caret symbol ^ matches the start of a string, \\s+ matches one or more whitespace characters, and the dollar symbol $ matches the end of a string. The pipe symbol | represents "or" in the regular expression, so it matches either the beginning or the end of the string.
The bot remembered that \ has to be escaped inside an R string, which is why the pattern is written with \\s.
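To make the pieces of that pattern concrete, here is a small base R sketch. The padded string x is made up for illustration; the comparison with sub() is my addition and shows why gsub() is the function to pair with an "either end" pattern.

```r
x <- "  Tisa Casper  "       # hypothetical string padded on both sides

strips <- "^\\s+|\\s+$"      # typed with doubled backslashes in R source
nchar(strips)                # 9: each \\ is stored as a single backslash
cat(strips, "\n")            # shows the pattern as the regex engine sees it

gsub(strips, "", x)          # replaces every match: both ends are trimmed
sub(strips, "", x)           # replaces only the first match: trailing spaces remain

trimws(x, which = "left")    # trimws() can also target one side only
```

Because the alternation can match at two different places in the same string, sub() stops after the leading whitespace, while gsub() goes on to remove the trailing whitespace as well.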
But once the right function or regular expression is found, getting the result back into the object it came from can also be tricky. Another example:
d <- data.frame(
the_names = c(" Tisa Casper","Jaimee Brekke ",
"Lizabeth Emard","Dorian Grant ",
" Damaris Hahn"," Jeane Conroy ",
" Dow Senger-Lind "," Lucile Turner-Christiansen ",
" Alvera Hoeger ","Clarabelle Schuppe"))
d
#> the_names
#> 1 Tisa Casper
#> 2 Jaimee Brekke
#> 3 Lizabeth Emard
#> 4 Dorian Grant
#> 5 Damaris Hahn
#> 6 Jeane Conroy
#> 7 Dow Senger-Lind
#> 8 Lucile Turner-Christiansen
#> 9 Alvera Hoeger
#> 10 Clarabelle Schuppe
strips <- "^\\s+|\\s+$"
gsub(strips,"",d$the_names)
#> [1] "Tisa Casper" "Jaimee Brekke"
#> [3] "Lizabeth Emard" "Dorian Grant"
#> [5] "Damaris Hahn" "Jeane Conroy"
#> [7] "Dow Senger-Lind" "Lucile Turner-Christiansen"
#> [9] "Alvera Hoeger" "Clarabelle Schuppe"
trimws(d$the_names)
#> [1] "Tisa Casper" "Jaimee Brekke"
#> [3] "Lizabeth Emard" "Dorian Grant"
#> [5] "Damaris Hahn" "Jeane Conroy"
#> [7] "Dow Senger-Lind" "Lucile Turner-Christiansen"
#> [9] "Alvera Hoeger" "Clarabelle Schuppe"
sapply(d,trimws)
#> the_names
#> [1,] "Tisa Casper"
#> [2,] "Jaimee Brekke"
#> [3,] "Lizabeth Emard"
#> [4,] "Dorian Grant"
#> [5,] "Damaris Hahn"
#> [6,] "Jeane Conroy"
#> [7,] "Dow Senger-Lind"
#> [8,] "Lucile Turner-Christiansen"
#> [9,] "Alvera Hoeger"
#> [10,] "Clarabelle Schuppe"
apply(d,2,trimws)
#> the_names
#> [1,] "Tisa Casper"
#> [2,] "Jaimee Brekke"
#> [3,] "Lizabeth Emard"
#> [4,] "Dorian Grant"
#> [5,] "Damaris Hahn"
#> [6,] "Jeane Conroy"
#> [7,] "Dow Senger-Lind"
#> [8,] "Lucile Turner-Christiansen"
#> [9,] "Alvera Hoeger"
#> [10,] "Clarabelle Schuppe"
d |> dplyr::mutate(the_names = trimws(the_names))
#> the_names
#> 1 Tisa Casper
#> 2 Jaimee Brekke
#> 3 Lizabeth Emard
#> 4 Dorian Grant
#> 5 Damaris Hahn
#> 6 Jeane Conroy
#> 7 Dow Senger-Lind
#> 8 Lucile Turner-Christiansen
#> 9 Alvera Hoeger
#> 10 Clarabelle Schuppe
d |> dplyr::transmute(the_names = trimws(the_names))
#> the_names
#> 1 Tisa Casper
#> 2 Jaimee Brekke
#> 3 Lizabeth Emard
#> 4 Dorian Grant
#> 5 Damaris Hahn
#> 6 Jeane Conroy
#> 7 Dow Senger-Lind
#> 8 Lucile Turner-Christiansen
#> 9 Alvera Hoeger
#> 10 Clarabelle Schuppe
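# doesn't do what you might think: d[1] is a one-column data frame, not a character vector
# d[1] <- trimws(d[1])
# but these do, because d[,1] and d$the_names are character vectors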
d[,1] <- trimws(d[,1])
d$the_names <- trimws(d$the_names)
In my view, dplyr::mutate() probably gives most users the smoothest path to modifying a data frame column, as long as the result is assigned back to d, although I prefer working with vector, matrix, or data.table objects.
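To see why some of the forms above hand back something other than a cleaned data frame, it can help to inspect each result. The following is a minimal base R sketch; the two-row data frame d2 and its values are made up for illustration, and the d2[] <- lapply(...) idiom is one common alternative that is not shown above.

```r
# Hypothetical two-row data frame with padded names
d2 <- data.frame(the_names = c("  Tisa Casper", "Jaimee Brekke  "),
                 stringsAsFactors = FALSE)

is.matrix(sapply(d2, trimws))     # sapply() simplifies to a character matrix
is.matrix(apply(d2, 2, trimws))   # apply() also returns a character matrix

# d2[1] is a one-column data frame, so trimws() coerces the whole column
# to a single deparsed string instead of trimming each element
length(trimws(d2[1]))

# Assigning into d2[] keeps the data frame and trims every column
d2[] <- lapply(d2, trimws)
class(d2)
d2$the_names
```

The empty brackets in d2[] tell R to replace the contents while keeping the existing data frame structure, which is handy when several character columns need the same cleanup.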