Subsetting Data

tarkom · July 17, 2018, 10:23pm

Dear All,
I am trying to subset particular rows of the data shown in the attached picture. My goal is to (1) list the "Make.and.model" values whose model names end with 'R'.

(2) List the "Make.and.model" values that might be smaller than 'liter' bikes (engine size < 1000 cc), based on their names. I want to achieve this by first excluding motorcycles with a 1 in the name (these will mostly be 1000+ numbers). From that set of names, I will then select those with a digit in the range 2-9 in their names.
Please, how do I achieve these?

mishabalyasin · July 18, 2018, 8:33am

It is not a good idea to share your data as a screenshot like this. Here you can find what you should do instead:

FAQ: What's a reproducible example (`reprex`) and how do I create one?

Why reprex? Getting unstuck is hard. Your first step here is usually to create a reprex, or reproducible example. The goal of a reprex is to package your code, and information about your problem so that others can run it and feel your pain. Then, hopefully, folks can more easily provide a solution.

What's in a Reproducible Example? Parts of a reproducible example:
- background information - Describe what you are trying to do. What have you already done?
- complete set up - include any library() calls and data to reproduce your issue.
- data for a reprex: Here's a discussion on setting up data for a reprex
- make it run - include the minimal code required to reproduce your error on the data…

This way it'll be much easier for others to help you.

I think both of your questions can be answered with the stringr package, specifically stringr::str_detect().
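As a rough sketch of how str_detect() could cover both questions: the `bikes` data frame below is a made-up stand-in for the screenshot (only the column name "Make.and.model" comes from the original post), so the patterns may need adjusting for the real names.

```r
library(stringr)

# Hypothetical sample data; the real data was only shared as a screenshot
bikes <- data.frame(
  Make.and.model = c("Honda CBR600RR", "Suzuki GSX-R1000", "Yamaha YZF-R6",
                     "Ducati Diavel", "Kawasaki Ninja ZX-10R", "BMW S1000RR"),
  stringsAsFactors = FALSE
)

# (1) Names ending in "R": "$" anchors the match at the end of the string
ends_in_r <- bikes$Make.and.model[str_detect(bikes$Make.and.model, "R$")]

# (2) Names with no "1" anywhere, but with at least one digit from 2 to 9
no_one      <- !str_detect(bikes$Make.and.model, "1")
has_2_to_9  <- str_detect(bikes$Make.and.model, "[2-9]")
maybe_small <- bikes$Make.and.model[no_one & has_2_to_9]

print(ends_in_r)
print(maybe_small)
```

Note that a name without any digits (such as the Diavel here) is dropped by step (2), even though nothing in the name says how large the engine is.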

mara · July 18, 2018, 1:04pm

As @mishabalyasin mentioned, it will be much easier to help you with a reprex. However, a general suggestion would be to take a look at the stringr package and its Regular Expression helpers (likely in combination with tidyr, or a similar tool, to help you separate out the multiple variables you currently have in a single column). See, for example, the section on separating multiple vars in the Tidy data vignette.

The stringr cheat sheet can be an invaluable asset as you go, too.

Once you've done that (which is the majority of the work), you'll be able to begin to subset the data. For example, if you had a reliable displacement figure, you could use dplyr::filter() to select rows with engine displacements >= 1000. (I don't know how accurate you want to be: the Diavel, for example, is 1200cc, which isn't in the name, and R can also mean different things depending on the make.)
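A minimal sketch of that idea, assuming the displacement can be read off as the first run of three or four digits in the name. That rule, the `cc` column, and the sample `bikes` data are all assumptions made for illustration, not something taken from the original data.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data standing in for the screenshot
bikes <- data.frame(
  Make.and.model = c("Honda CBR600RR", "Suzuki GSX-R1000", "Yamaha YZF-R6",
                     "Ducati Diavel", "Kawasaki Ninja ZX-10R", "BMW S1000RR"),
  stringsAsFactors = FALSE
)

# Pull the first run of 3-4 digits out of the name as a guessed displacement;
# names without such a run get NA
bikes_cc <- bikes %>%
  mutate(cc = as.numeric(str_extract(Make.and.model, "[0-9]{3,4}")))

# filter() drops rows where the condition is NA, so unknown sizes are excluded
litre_plus <- bikes_cc %>% filter(cc >= 1000)

print(bikes_cc)
print(litre_plus)
```

In this sample the Diavel, the YZF-R6 and the ZX-10R all end up with NA, which shows why a name-based figure is only as reliable as the naming scheme.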

tarkom · July 19, 2018, 3:02am

Thank you all for your suggestions. I got it figured out. I used sqldf and it was helpful.

tarkom · July 19, 2018, 3:03am

I got it figured out, Mishabalyasin. Thanks very much. I used sqldf and it helped me out.

mishabalyasin · July 19, 2018, 8:18am

Great! Can you share your solution here as well? It is very likely that someone else will have a similar problem in the future, so it will be helpful to have it in one place.

tarkom · July 19, 2018, 4:12pm

Sure!
So for the first part of my problem, this is how I got it done:
library(sqldf)
sqldf("select * from data where variable like '%R'")

This is the second part of the problem and the code:

# names that contain the digit 1
data$variable[grepl("[1]", data$variable)]

# drop the names that contain the digit 1
exlud <- data$variable[!grepl("[1]", data$variable)]
exlud
# of the remaining names, keep those containing a digit from 2 to 9
exlud[grepl("[2-9]", exlud)]
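For comparison, both parts can be sketched with base R alone. The `variable` vector below is a hypothetical stand-in for data$variable. One difference worth knowing: sqldf uses SQLite by default, where LIKE is case-insensitive for ASCII letters, so '%R' also matches names ending in a lowercase 'r', while grepl("R$", ...) does not unless ignore.case = TRUE is set.

```r
# Hypothetical stand-in for data$variable
variable <- c("Honda CBR600RR", "Suzuki GSX-R1000", "Yamaha YZF-R6",
              "Ducati Monster", "Kawasaki Ninja ZX-10R")

# Part 1: names ending in an uppercase "R" (case-sensitive)
ends_in_R <- variable[grepl("R$", variable)]

# Same test ignoring case, closer to what SQLite's LIKE '%R' returns
ends_in_r_any_case <- variable[grepl("r$", variable, ignore.case = TRUE)]

# Part 2 in one step: no "1" in the name, and at least one digit from 2 to 9
small <- variable[!grepl("1", variable) & grepl("[2-9]", variable)]

print(ends_in_R)
print(ends_in_r_any_case)
print(small)
```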

Hope it helps...

Thanks