cole
January 28, 2019, 7:49pm
21
Another random tidbit: tally() is a "summarize" function (meaning it drops all variables other than your group_by vars). You can actually do grouped mutates in dplyr that behave like the dplyr::add_count above. This saves the join and gives you a lot of flexibility. Here, I do a manual count like add_count, but I also get the mean HP for each group (by number of cylinders).
This is actually behavior that I first learned in PROC SQL, if memory serves me correctly, and I was very happy to find it in R as well.
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  mutate(count = n(), avg_hp = mean(hp)) %>%
  select(mpg, cyl, hp, count, avg_hp)
#> # A tibble: 32 x 5
#> # Groups:   cyl [3]
#>      mpg   cyl    hp count avg_hp
#>    <dbl> <dbl> <dbl> <int>  <dbl>
#>  1  21       6   110     7  122.
#>  2  21       6   110     7  122.
#>  3  22.8     4    93    11   82.6
#>  4  21.4     6   110     7  122.
#>  5  18.7     8   175    14  209.
#>  6  18.1     6   105     7  122.
#>  7  14.3     8   245    14  209.
#>  8  24.4     4    62    11   82.6
#>  9  22.8     4    95    11   82.6
#> 10  19.2     6   123     7  122.
#> # ... with 22 more rows
Created on 2019-01-28 by the reprex package (v0.2.1)
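To make the comparison with add_count concrete, here is a minimal sketch that computes the per-group count both ways and checks that they agree. The object names (cars, via_add_count, via_mutate) are just illustrative, and the name argument of add_count assumes a reasonably recent dplyr.

```r
library(dplyr)

# Work on a tibble copy of mtcars so row names do not get in the way.
cars <- as_tibble(mtcars)

# Shortcut: add_count() appends a per-group count without collapsing rows.
via_add_count <- cars %>%
  add_count(cyl, name = "count")

# Manual version: grouped mutate, then ungroup so later steps are not grouped.
via_mutate <- cars %>%
  group_by(cyl) %>%
  mutate(count = n()) %>%
  ungroup()

# Compare the count column produced by the two approaches.
identical(via_add_count$count, via_mutate$count)
```

The grouped mutate is the more flexible of the two, since any other per-group summary (like mean(hp)) can be added in the same call.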
Instead of age - mean(age) you can also do scale(age, scale = FALSE): not more compact, but no variable repetition.
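One thing to watch for with this approach: scale() returns a one-column matrix (with a "scaled:center" attribute) rather than a plain vector, so it is often worth wrapping it in as.numeric(). A small sketch, using made-up ages purely for illustration:

```r
# Hypothetical ages, only for illustration.
age <- c(23, 35, 41, 58, 62)

# Manual centering repeats the variable name.
centered_manual <- age - mean(age)

# scale() centers without repeating the name, but returns a matrix.
centered_scale <- scale(age, scale = FALSE)
is.matrix(centered_scale)

# Dropping the matrix structure recovers the same values as the manual version.
all.equal(centered_manual, as.numeric(centered_scale))
```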
Leon
February 1, 2019, 5:24pm
23
I've begun using between(), which works quite nicely:
> tibble(x = rnorm(10)) %>% filter(x %>% between(-1, 1))
# A tibble: 6 x 1
        x
    <dbl>
1  0.463
2  0.891
3 -0.254
4  0.0976
5 -0.819
6  0.596
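Since the example above uses random numbers, here is a small deterministic sketch of what between() does. It is shorthand for x >= left & x <= right, so both endpoints are included. The vector x is made up for illustration.

```r
library(dplyr)

# Fixed values so the result does not depend on a random seed.
x <- c(-2, -1, 0, 1, 2)

# between() includes both endpoints.
inside_between <- between(x, -1, 1)

# The equivalent pair of comparisons.
inside_manual <- x >= -1 & x <= 1

identical(inside_between, inside_manual)
```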
nwerth
February 1, 2019, 7:50pm
24
There are two versions of between: one from the dplyr package (which I assume you're using), and another from the data.table package.
dplyr's version has the benefit of being translatable to SQL by the dbplyr package.
library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mycars <- copy_to(con, mtcars)
mycars %>% filter(between(hp, 10, 30)) %>% show_query()
# <SQL>
# SELECT *
# FROM `mtcars`
# WHERE (`hp` BETWEEN 10.0 AND 30.0)
But the data.table version allows vectors for the lower and upper bounds. This is useful when the bounds depend on the observation.
campaigns <- data.frame(
  name = c("Super sale!", "Crazy cuts!", "Delirious deals!"),
  start = as.Date(c("2018-08-01", "2018-12-20", "2019-01-15")),
  end = as.Date(c("2018-08-06", "2019-01-02", "2019-02-15"))
)
campaigns %>%
  filter(data.table::between(Sys.Date(), start, end))
#               name      start        end
# 1 Delirious deals! 2019-01-15 2019-02-15
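If data.table is not installed, the same per-row bounds check can be sketched with plain vectorized comparisons in base R. Note that this swaps in a different technique from the data.table call above, and it uses a fixed date (2019-02-01, chosen here only so the result is reproducible) instead of Sys.Date().

```r
campaigns <- data.frame(
  name = c("Super sale!", "Crazy cuts!", "Delirious deals!"),
  start = as.Date(c("2018-08-01", "2018-12-20", "2019-01-15")),
  end = as.Date(c("2018-08-06", "2019-01-02", "2019-02-15")),
  stringsAsFactors = FALSE
)

# Assumed reference date, fixed so the example is reproducible.
today <- as.Date("2019-02-01")

# Each row is compared against its own start and end date.
is_active <- today >= campaigns$start & today <= campaigns$end

active <- campaigns[is_active, ]
active$name
```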
Leon
February 2, 2019, 8:19am
25
I have never used data.table. I'm a total dplyr/tidyverse fan.