Removing columns by index numbers

Nile · December 26, 2022, 5:47pm

Hi,

I have a dataset with column names such as state, education, school_1, school_2, school_3, year_1, year_2...

I want to remove the columns where the number after the underscore is higher than 1.

Please let me know is there is any way to achieve this.

Thank you!

EconProf · December 26, 2022, 6:02pm

select(yourdata, -ends_with("_1")) should do it

Nile · December 26, 2022, 6:43pm

Hi,

Thanks so much for your reply.

This method removes the columns that end with _1 and keeps the higher ones (e.g. _2 and _3), which is the opposite of what I want to have!

Also, when I try

select(dd, !ends_with("_1"))

it keeps those with '_1', but removes (unwantedly) columns that do not contain an underscore.

EconProf · December 26, 2022, 7:04pm

Sorry, I misread your first post!

Would select(dd, !(ends_with("2") | ends_with("3"))) work?

technocrat · December 26, 2022, 7:41pm

# setup to demonstrate one solution
varnames <- c("state", "education", "school_1", "school_2", "school_3", "year_1", "year_2")
length(varnames)
#> [1] 7
dumb <- mtcars[1:6,1:7]
colnames(dumb) <- varnames
dumb
#>                   state education school_1 school_2 school_3 year_1 year_2
#> Mazda RX4          21.0         6      160      110     3.90  2.620  16.46
#> Mazda RX4 Wag      21.0         6      160      110     3.90  2.875  17.02
#> Datsun 710         22.8         4      108       93     3.85  2.320  18.61
#> Hornet 4 Drive     21.4         6      258      110     3.08  3.215  19.44
#> Hornet Sportabout  18.7         8      360      175     3.15  3.440  17.02
#> Valiant            18.1         6      225      105     2.76  3.460  20.22

# solution
# regular expression to match colnames indexes ending in _1
pat <- "^.*_[1]$"
# subset by indexes of columns that do not match pat
dumb[grep(pat,colnames(dumb),invert = TRUE)]
#>                   state education school_2 school_3 year_2
#> Mazda RX4          21.0         6      110     3.90  16.46
#> Mazda RX4 Wag      21.0         6      110     3.90  17.02
#> Datsun 710         22.8         4       93     3.85  18.61
#> Hornet 4 Drive     21.4         6      110     3.08  19.44
#> Hornet Sportabout  18.7         8      175     3.15  17.02
#> Valiant            18.1         6      105     2.76  20.22

^{Created on 2022-12-26 with reprex v2.0.2}

EconProf · December 26, 2022, 8:42pm

@technocrat: I knew that using a regular expression was the solution, but it is not something that I have taken the time to learn. With my retirement at the end of fall term, I finally started down that path a couple of weeks ago! Thanks for the example.

technocrat · December 26, 2022, 8:48pm

In retirement you’ll wonder how you ever had time to go to work.

Nile · December 26, 2022, 10:14pm

Hi,

Thanks so much for the detailed steps! My goal is actually to keep school_1, year_1... and remove those with _2 and higher. I am sorry, if it was not clear in the description. I look forward to your reply.

Thank you.

Nile · December 27, 2022, 3:44pm

Hi,

This seems to be the way to solve this. The only thing is that my target columns are those with *_1 e.g. school_1 and year_1 (not the higher ones e.g. school_2 or year_3) as well as the ones that do not have that underscore pattern.

I look forward to your kind reply.

Thank you!

technocrat · December 27, 2022, 5:20pm

should be


dumb[grep(pat,colnames(dumb))]

technocrat · December 27, 2022, 6:03pm

varnames <- c("state", "education", "school_1", "school_2", "school_3", "year_1", "year_2")
dumb <- mtcars[1:6,1:7]
colnames(dumb) <- varnames

# solution
# regular expression to match colnames indexes ending in _1 or a letter
pat <- "^.*_[1]$|[a-z]$"
# subset by indexes of columns that do not match pat
dumb[grep(pat,colnames(dumb))]
#>                   state education school_1 year_1
#> Mazda RX4          21.0         6      160  2.620
#> Mazda RX4 Wag      21.0         6      160  2.875
#> Datsun 710         22.8         4      108  2.320
#> Hornet 4 Drive     21.4         6      258  3.215
#> Hornet Sportabout  18.7         8      360  3.440
#> Valiant            18.1         6      225  3.460

Nile · December 28, 2022, 3:26pm

That's a great piece of poetry! Thanks so much!

I'd like to ask an additional question that you might know. Is there any way to save the 'pat' function permanently?

technocrat · December 28, 2022, 5:52pm

Either as a function in a self-designed utility, package match_1() or, since it is an object, and objects can be serialized `saveRDS(pat, file = "somewhere_convenient.Rds").

RuiBarradas · December 29, 2022, 6:05pm

Hello,

First, make up a data set. In fact there are two example data sets below, a matrix and a data.frame. The code to get the columns ending with a number after an underscore is the same for both. Just substitute df1 for mat1.

# make up a data set
mat1 <- matrix(1:(7*4), ncol = 7)
colnames(mat1) <- c("state", " education", " school_1", " school_2", 
                    " school_3", " year_1", " year_2")
# show that it works with data.frames too
df1 <- as.data.frame(mat1)

^{Created on 2022-12-29 with reprex v2.0.2}

Now, the regular expression means:

".*_" any character repeated any number of times followed by an underscore;
the capture group "(\\d+$)" is a digit repeated at least once until the end ($)

To substitute this pattern by the first (\\1) and only capture group effectively removes everything else keeping only the numbers. Coerce the numbers to numeric and test for > 1. But be careful, if the search pattern doesn't exist in the column names the coercion will return NA's so test for them too.

The index gives columns with numbers greater than one after the underscore and to negate it gives the wanted columns.

# keep only the numbers after an underscore and coerce to numeric
i_col <- as.numeric(sub(".*_(\\d+$)", "\\1", colnames(mat1)))
#> Warning: NAs introduced by coercion
# this is the logical index giving the answer
i_col <- !is.na(i_col) & i_col > 1

mat1[, !i_col]
#>      state  education  school_1  year_1
#> [1,]     1          5         9      21
#> [2,]     2          6        10      22
#> [3,]     3          7        11      23
#> [4,]     4          8        12      24
df1[, !i_col]
#>   state  education  school_1  year_1
#> 1     1          5         9      21
#> 2     2          6        10      22
#> 3     3          7        11      23
#> 4     4          8        12      24

^{Created on 2022-12-29 with reprex v2.0.2}

Nile · December 30, 2022, 6:50pm

This is fantastic! Thanks so much for the step-by-step descriptions. I learned a lot from your answer!
Happy new year!

system · January 6, 2023, 6:50pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.