gsub and regex issue

aarontimo · September 30, 2022, 6:50am

Hi there,

I am trying to remove all A-Z characters in my column names after the last digit. Here is a sample dataset:

df <- data.frame("C1 Lorem ipsum dolor sit amet" = c("Jon", "Bill", "Maria", "Ben", "Tina"),
                 "C3 001 commodo ligula eget dolor" = c(23, 41, 32, 58, 26)
                 "C3 002 Maecenas nec odio et ante tincidunt tempus" = c(23, 41, 32, 58, 26)
)

print(df)

I have tried to achieve this with the following variations, but I am not having any success and am not sure where I am going wrong.

gsub("C[0-9].[0-9]", "", colnames(df))

gsub("CC\d\s\d+", "", colnames(df))

Any help or advice is much appreciated.

nirgrahamuk · September 30, 2022, 9:59am

Maybe it's a typo, but you omitted a critical comma that's needed to define the df.
Furthermore, by default data.frame() doesn't keep column names with spaces etc.; it converts those characters to . instead (unless you pass check.names = FALSE). You should use tibble() if the names with spaces are important to you.

library(tidyverse)
(tb_1 <- tibble(`long name with spaces`=5))
(df_1 <- data.frame(tb_1))
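
For anyone who would rather stay in base R, here is a minimal sketch of the renaming behaviour described above. It reuses the made-up column name from the snippet; the object names df_default and df_kept are only for illustration. check.names is a documented argument of data.frame().

```r
# By default, data.frame() passes names through make.names(),
# so each space is replaced by a dot.
df_default <- data.frame("long name with spaces" = 5)
names(df_default)

# check.names = FALSE keeps the name exactly as written.
df_kept <- data.frame("long name with spaces" = 5, check.names = FALSE)
names(df_kept)
```

Columns with spaces then have to be referenced with backticks or as strings, e.g. df_kept[["long name with spaces"]].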

cmeuli07 · October 3, 2022, 7:25pm

The code below uses nirgrahamuk's tip with tibble and then some regex to trim the column names to your specifications.

library(stringr)
library(tibble)

df <- tibble("C1 Lorem ipsum dolor sit amet" = c("Jon", "Bill", "Maria", "Ben", "Tina"),
             "C3 001 commodo ligula eget dolor" = c(23, 41, 32, 58, 26),
             "C3 002 Maecenas nec odio et ante tincidunt tempus" = c(23, 41, 32, 58, 26)
)

vec_col_names <- colnames(df)

# The regex here reads in English as:
# Match the substring that:
#   1) Is immediately preceded by a digit 0-9 (a lookbehind, so the digit itself is not matched)
#   2) Consists of zero or more characters that are NOT digits 0-9
#   3) Runs through to the end of the string
vec_col_names2 <- str_replace_all(vec_col_names, '(?<=[[0-9]])[^[0-9]]*$', '')

colnames(df) <- vec_col_names2
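
Since the question was about gsub, here is a rough base R sketch of the same idea, in case it helps. It works on a plain character vector of the sample names rather than on the tibble, and the variable names (col_names, trimmed, trimmed_alt) are just placeholders. The pattern is a simplified spelling of the one above, not the exact same string.

```r
col_names <- c("C1 Lorem ipsum dolor sit amet",
               "C3 001 commodo ligula eget dolor",
               "C3 002 Maecenas nec odio et ante tincidunt tempus")

# Option 1: same lookbehind idea as the stringr version.
# perl = TRUE is required because base R's default regex engine
# does not support lookbehind.
trimmed <- sub("(?<=[0-9])[^0-9]*$", "", col_names, perl = TRUE)

# Option 2: no lookbehind. The greedy .* backtracks to the last digit,
# and everything up to and including that digit is kept via \\1.
trimmed_alt <- sub("^(.*[0-9]).*$", "\\1", col_names)

print(trimmed)
print(trimmed_alt)
```

Both should give "C1", "C3 001" and "C3 002" for the sample names. As for the original attempts: "C[0-9].[0-9]" matches the leading code, so gsub deletes the part you want to keep, and "\d" inside an R string has to be written "\\d", otherwise R stops with an unrecognized escape error.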

aarontimo · October 4, 2022, 1:55am

Thank you @cmeuli07 and @nirgrahamuk for your help. That worked!!

In trying to better understand how @cmeuli07's regex worked, I came across RStudio's cheatsheets (RStudio Cheatsheets - RStudio). I share the link here for others (like me) new to RStudio and regex. There is a great cheatsheet on stringr, which helped me understand the (?<=[[0-9]])[^[0-9]]*$ bit.

Thank you, again, @cmeuli07 and @nirgrahamuk for your help!
