How to subset a data set based on column names

Jack19 · March 6, 2021, 5:28pm

I am not good at R coding. I have a problem subsetting a data set based on column names. In my data set, the first 14 columns have words as names, and the remaining 1000 columns have numbers as names (not in order). When I read the data in, I guess all the column names become strings. How do I select certain columns among those 1000 based on the numeric value of their names (for example, names between 750 and 850) while still keeping the first 14 columns? Is there an easy way to do it? Your help is much appreciated.

Jack

technocrat · March 6, 2021, 5:53pm

Column names have to be strings; they can't be numeric. Although you can subset based on name, there's no point in this case since the names are number-like anyway. We can use the numeric indices, instead.

the_subset <- DF[, c(1:14, 243, 546, 547)]

The comma separates rows from columns; in this case you want all rows, so there is just the comma.
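As a minimal sketch of that positional indexing, assuming a small made-up data frame (the name DF, its columns, and its size are hypothetical stand-ins, with 2 word-named columns instead of 14):

```r
# Hypothetical toy data: 2 word-named columns followed by 4 number-named columns.
# check.names = FALSE keeps the number-like names as they are.
DF <- data.frame(id = 1:3, group = c("a", "b", "c"),
                 `820` = 4:6, `101` = 7:9, `760` = 10:12, `900` = 13:15,
                 check.names = FALSE)

# Nothing before the comma keeps all rows;
# the vector after the comma picks columns by position.
the_subset <- DF[, c(1:2, 3, 5)]
names(the_subset)
```

Note that positions 3 and 5 are where the columns sit in the data frame, which is unrelated to the numbers used as their names.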

lars · March 6, 2021, 6:07pm

Aside from the perfectly working base R solution already provided, you may want to consider the tidyverse package as part of your toolkit:

The tidyverse is a coherent system of packages for data manipulation, exploration and visualization that share a common design philosophy.

For instance, in the following chapter - of the excellent online free R for Data Science book - the select() function is introduced and several approaches of selecting your column variables are described.

This book is definitely worthwhile to get you started.

HTH as well

Jack19 · March 6, 2021, 10:02pm

Thank you for your reply! But that is not what I wanted because I have to type 300 numbers in your way if I have 300 columns to select.

Jack

lars · March 6, 2021, 10:59pm

It is possible to achieve this in base R by using a regular expression.
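A rough sketch of one way this could look, assuming the numbered columns have names made up of digits only and using a hypothetical toy data frame (DF, n_lead and the other variable names are illustrative, not taken from the original data):

```r
# Hypothetical toy data standing in for the real data set
DF <- data.frame(id = 1:3, group = c("a", "b", "c"),
                 `820` = 4:6, `101` = 7:9, `760` = 10:12, `900` = 13:15,
                 check.names = FALSE)

n_lead <- 2  # number of leading word-named columns (14 in the question)

nms <- names(DF)

# Regular expression: TRUE for names that consist of digits only
is_num_name <- grepl("^[0-9]+$", nms)

# Convert only the digit-only names to numbers; all other names stay NA
num_val <- rep(NA_real_, length(nms))
num_val[is_num_name] <- as.numeric(nms[is_num_name])

# Keep the leading columns plus the number-named columns within [750, 850]
in_range <- !is.na(num_val) & num_val >= 750 & num_val <= 850
keep <- seq_along(nms) <= n_lead | in_range

result <- DF[, keep, drop = FALSE]
names(result)
```

The selected columns keep their original order in the data, so it does not matter that the numbered names are unsorted.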

As I happen to favour the tidyverse approach, I still recommend having a look at section 5.4 of the R for Data Science book, where selecting columns based on pattern matching is introduced.

There are a number of helper functions you can use within select() :

starts_with("abc") : matches names that begin with “abc”.

ends_with("xyz") : matches names that end with “xyz”.

contains("ijk") : matches names that contain “ijk”.

matches("(.)\\1") : selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.

num_range("x", 1:3) : matches x1 , x2 and x3 .

See ?select for more details.
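For the range question in this thread, a possible sketch with select() uses the any_of() helper, which takes a character vector of names and ignores the ones that do not exist. The toy data frame below is hypothetical, and it assumes the numbered columns are named with plain digits:

```r
library(dplyr)

# Hypothetical toy data: 2 word-named columns followed by 4 number-named columns
DF <- data.frame(id = 1:3, group = c("a", "b", "c"),
                 `820` = 4:6, `101` = 7:9, `760` = 10:12, `900` = 13:15,
                 check.names = FALSE)

# Keep the first 2 columns by position (1:14 in the question), plus every
# column whose name is one of "750", "751", ..., "850".
# any_of() skips names in that vector that are not present in the data.
result <- DF %>%
  select(1:2, any_of(as.character(750:850)))

names(result)
```

This only catches names that are whole numbers written without padding or decimals; other naming schemes would need a different pattern.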

If you're still struggling, please provide more detailed information on the column names and which ones you're trying to select. Then we can help construct the right pattern and solution in either base R or the tidyverse.

Jack19 · March 7, 2021, 4:02pm

Thank you, Lars! I will take a look.

Jack

technocrat · March 8, 2021, 9:56pm

That's true if all 300 numbers are discontinuous. If they are blocks ...

c(267,290:321,415:682 ...)

If the columns are in a sequence without much order, the suggested regex methods based on the character representation of the names are appropriate, but possibly just as much work, depending on the naming convention.

Zack83 · March 9, 2021, 8:42am

You can also convert the desired numbers to strings before using them to subset.
Something like dataset[, as.character(5 * (80:90))]

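as.character can be replaced by str_pad (from the stringr package) or formatC, which allow fixing the string width and the padding character. Note that base R has no as.string function, and toString collapses the whole vector into a single comma-separated string, so neither works for subsetting.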
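A small sketch of the idea, using a hypothetical toy data frame in which one column name is zero-padded (all names here are made up for illustration):

```r
# Hypothetical toy data with number-like column names, one of them zero-padded
DF <- data.frame(id = 1:3,
                 `400` = 4:6, `405` = 7:9, `410` = 10:12, `007` = 13:15,
                 check.names = FALSE)

# Numbers converted to strings, then used as column names
wanted <- as.character(5 * (80:81))   # "400" "405"
by_name <- DF[, wanted]

# formatC can pad to a fixed width, here with leading zeros
padded <- formatC(7L, width = 3, flag = "0")   # "007"
by_padded <- DF[, padded, drop = FALSE]

names(by_name)
names(by_padded)
```

Subsetting by name fails with an "undefined columns selected" error if any requested name is missing, so wrapping the names in intersect(wanted, names(DF)) may be a safer choice when not every number in the range exists as a column.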
