Selecting data in a range

kcuestas · June 28, 2018, 7:27pm

I have a table of counties, states and their minimum and maximum temperatures. I need to select only the counties that have temperatures in a range such as -15 degrees to 40 degrees. What function would I use?

mishabalyasin · June 28, 2018, 7:31pm

Hi, can you put your question into a reprex?

FAQ: What's a reproducible example (`reprex`) and how do I create one?

A reprex would make it much easier for everyone here to help you.

One possible approach is to group by county and summarize each group with the minimum and maximum of its temperatures. You can then use that summary to keep only the counties that fall within the range, and join the result back to your original table by county name.

kcuestas · June 28, 2018, 7:43pm

@mishabalyasin

library(dplyr)

climate.minmax <-
  climate.data %>%
  group_by(County, State) %>%
  summarise(temp_min = min(temp_min),
            temp_max = max(temp_max))

I did summarize it and that is all I have. Now I need to filter out the ones that do not fit in my desired range. How do I do that?

mishabalyasin · June 28, 2018, 8:01pm

You can use temp_min and temp_max in your new dataset to create a new variable with `mutate`, something like `mutate(include = temp_min >= -15 & temp_max <= 40)`.

Then filter to keep only the rows where include is TRUE, and use dplyr::semi_join to keep the rows of your original data that match those counties.
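Putting those steps together, here is a minimal sketch of the whole pipeline. The sample rows are made up for illustration (the real climate.data was never posted), and the names in_range and climate.in.range are hypothetical; the column names follow the snippet earlier in the thread.

```r
library(dplyr)

# Hypothetical sample data standing in for the real climate.data
climate.data <- data.frame(
  County   = c("Adams", "Adams", "Baker", "Clark", "Dane"),
  State    = c("CO", "CO", "OR", "NV", "WI"),
  temp_min = c(-10, -12, -20, 5, -15),
  temp_max = c(35, 38, 30, 45, 40)
)

# Summarise per county, flag the counties inside the range, keep only those
in_range <- climate.data %>%
  group_by(County, State) %>%
  summarise(temp_min = min(temp_min),
            temp_max = max(temp_max)) %>%
  ungroup() %>%
  mutate(include = temp_min >= -15 & temp_max <= 40) %>%
  filter(include)

# Keep the original rows whose County/State appear in the filtered summary
climate.in.range <- semi_join(climate.data, in_range,
                              by = c("County", "State"))
```

Note that `>=` and `<=` make the range inclusive, so a county sitting exactly at -15 or 40 is kept. If you only need the list of counties, in_range already has it and the semi_join step can be skipped.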

Anantadinath · June 28, 2018, 8:15pm

If you are comfortable with SQL, you could also try the data.table package, which is known for being very fast on large datasets.

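library(data.table)

# Convert the data frame to a data.table in place (no pipe needed)
setDT(climate.data)

# Summarise per county, then filter the summary in a chained second step
climate.data[, .(temp_min = min(temp_min),
                 temp_max = max(temp_max)),
             by = .(County, State)][
  (temp_min > -15) & (temp_max < 40), ]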

FYI, data.table's syntax resembles SQL, roughly like this:

from[where, select, group by]
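In data.table's own terms that pattern is usually written DT[i, j, by]. A small sketch of how the three slots line up with SQL, using made-up rows (the table climate.dt and its values are hypothetical):

```r
library(data.table)

# Hypothetical sample data
climate.dt <- data.table(
  County   = c("Adams", "Adams", "Baker", "Clark", "Dane"),
  State    = c("CO", "CO", "OR", "NV", "WI"),
  temp_min = c(-10, -12, -20, 5, -15),
  temp_max = c(35, 38, 30, 45, 40)
)

# i  = where    : rows to keep before grouping
# j  = select   : what to compute
# by = group by : grouping columns
result <- climate.dt[State != "NV",
                     .(temp_min = min(temp_min),
                       temp_max = max(temp_max)),
                     by = .(County, State)]

# A condition on the summarised values (like SQL's HAVING) goes in a
# second, chained bracket, because i is applied before the grouping
in_range <- result[temp_min > -15 & temp_max < 40]
```

This is why the answer above chains two brackets: the first one aggregates, and the second filters the aggregated result.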