Somebody PLEASE save me from loops & mapply()!...

kooshkashkai · March 16, 2020, 10:44am

I'm trying to calculate some historic averages on a reasonably large data frame (size 50,000r x 150c) and to do so, I first need to subset the data according to two criteria then apply some arithmetic function to one of the subsetted columns (e.g. calculate an average between current date and date-3 months). I then write the result to another data frame as a new column of counts / averages / weighted averages over certain periods. Due to the size of the datasets (and the limitations of my coding skill) this is taking a long time (loop after loop after loop...) and my question is, is there a better way to approach this kind of problem in R?

Here's a simple example to illustrate what I'm trying to do:

library(lubridate)

###  Create dataframe Df

date <- c("01/01/2020", "02/01/2020", "02/01/2020","02/01/2020", "03/01/2020", 
          "03/01/2020", "03/01/2020", "03/01/2020", "04/01/2020", "04/01/2020")
date <- dmy(date)
name <- c("john", "paul", "john", "peter", "peter", 
          "john", "andrew", "john", "peter", "peter")
visits <- c(1, 3, 2, 1, 3, 
            4, 6, 1 ,1, 9)
Df <- data.frame(date, name, visits)
Df


###  Create dataframe Df1

date1 <- c("01/01/2020", "02/01/2020", "03/01/2020", "04/01/2020")
date1 <- dmy(date1)
name1 <- c("john", "paul", "andrew", "peter")
totvisits <- c(0, 0, 0, 0)
Df1 <- data.frame(date1, name1, totvisits)
Df1

Df$name <- as.character(Df$name)
Df1$name1 <- as.character(Df1$name1)

In this example I want to (for each row name1/date1 pair in Df1) subset Df according to date == date1 / name==name1 and return the number of visits each 'name1' has made prior to each 'date1' value ie by summing the 'visits' column for date < date1. I then want to save this value to the relevant row of a new column ('Df1$totvisits' in this example). As far as I can tell this requires some kind of loop which on bigger datasets is clunky and takes ages. I've tried mapply() too but that's no quicker and doesn't get away from the fact that my solution isn't very elegant.

### loop 
for (i in 1:dim(Df1)[1]) {
  Df1[i, 3] <- sum(subset(Df, Df$name == Df1$name1[i] & Df$date <= Df1$date1[i])[,3])
}
Df1

### apply()
f <- function(x, y) {
  sum(subset(Df, (Df$name == x) & (Df$date <= y))[,3])
  }
Df1[, 3] <- mapply(f, x = Df1$name1, y = Df1$date1)
Df1

To make the above clearer, what I'm trying to do is to add a new column to Df1 ('Df1$totvisits') where each entry is the result of looking up date1/name1 in the other data frame ('Df), and returning the sum of the visits that occurred before 'date1' for each 'name1'. For example, in the "2020-01-02 paul" row in Df1, I need to get values of visits from Df where 'name == paul' and 'date <= 02/01/2020', and then put the sum of them in the third column, second row of Df1. Since there is only one instance of paul in Df with date <= "2020-01-02", this entry becomes = 3.

I would like to be able to extend this to looking up value from a range of dates e.g. dates between x and y (specified by 2 columns of Df1 such as date1/date2) where person z has made a visit. I'd like to do this so that I can calculate average visits in per day over say a rolling 3m window.

In a dplyr context, I think what I'm trying to do is 'mutate' Df1 with a new column that contains an expression that returns values of sums of Df$visits before (/ between date1/date2 in my more advanced case) each date. It's just I cant seem to get it to work and it's driving me nuts!

Thanks in advance for any help with this.

technocrat · March 16, 2020, 9:16pm

Hi, and welcome!

Great reprex, thanks.

The lag part of the problem will require some head scratching. I think the place to start, however, is by consolidating the visits in Df1, which is done simply.

suppressPackageStartupMessages(library(dplyr)) 
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date

date <- c("01/01/2020", "02/01/2020", "02/01/2020","02/01/2020", "03/01/2020", 
          "03/01/2020", "03/01/2020", "03/01/2020", "04/01/2020", "04/01/2020")
date <- dmy(date)
name <- c("john", "paul", "john", "peter", "peter", 
          "john", "andrew", "john", "peter", "peter")
visits <- c(1, 3, 2, 1, 3, 
            4, 6, 1 ,1, 9)
Df <- data.frame(date, name, visits)

Df %>% group_by(date,name) %>% summarize(tot_visits = sum(visits)) -> Df_grouped
Df_grouped
#> # A tibble: 8 x 3
#> # Groups:   date [4]
#>   date       name   tot_visits
#>   <date>     <fct>       <dbl>
#> 1 2020-01-01 john            1
#> 2 2020-01-02 john            2
#> 3 2020-01-02 paul            3
#> 4 2020-01-02 peter           1
#> 5 2020-01-03 andrew          6
#> 6 2020-01-03 john            5
#> 7 2020-01-03 peter           3
#> 8 2020-01-04 peter          10

^{Created on 2020-03-16 by the reprex package (v0.3.0)}

technocrat · March 16, 2020, 9:26pm

OK, so this is actually, now that I've thought about it for a little, a time series problem. The library forecast has a moving average function that should be able to be made to produce tot_visits over an n-month ma window for name. I was just about to return to forecast for another problem. This will make a good warm-up.

technocrat · March 16, 2020, 11:22pm

Here's what I'll be using to try the time series approach

library(babynames)
suppressPackageStartupMessages(library(dplyr)) 
suppressPackageStartupMessages(library(lubridate)) 
suppressPackageStartupMessages(library(purrr)) 
suppressPackageStartupMessages(library(tibble)) 
suppressPackageStartupMessages(library(tsibble)) 

# convenience functions
n_defactor <- function(x) {as.numeric(levels(x))[x]}
c_defactor <- function(x) {as.character(levels(x))[x]}

# create larger test data set
dates <- as.character(seq(as.Date("1995/1/1"), by = "day", length.out = 10000))

set.seed(42)
namelist  <- filter(babynames, year == 1995) # millennial given names
namelist <- sample(namelist$name, 100) # limit to 100
set.seed(137)
names <- sample(namelist, 10000, replace = TRUE)
set.seed(727)
visits <- c(1,2,3,4,5,6,7) # assumed numbers of single-day observed visits
visits <- sample(visits, 10000, replace = TRUE)
dat <- cbind(dates,namelist,visits) %>% 
  as.data.frame() %>%  
  as_tibble() %>% 
  mutate(dates = c_defactor(dates)) %>%
  mutate(namelist = c_defactor(namelist)) %>%
  mutate(visits = n_defactor(visits)) %>% 
  mutate(dates = ymd(dates)) %>% 
  rename(name = namelist, date = dates)
  
# convert to time series
  
as_tsibble(dat, key = name) -> dat_ts
#> Using `date` as index variable.

dat_ts
#> # A tsibble: 10,000 x 3 [1D]
#> # Key:       name [100]
#>    date       name  visits
#>    <date>     <chr>  <dbl>
#>  1 1995-01-10 Abdul      1
#>  2 1995-04-20 Abdul      5
#>  3 1995-07-29 Abdul      5
#>  4 1995-11-06 Abdul      7
#>  5 1996-02-14 Abdul      3
#>  6 1996-05-24 Abdul      7
#>  7 1996-09-01 Abdul      2
#>  8 1996-12-10 Abdul      6
#>  9 1997-03-20 Abdul      2
#> 10 1997-06-28 Abdul      7
#> # … with 9,990 more rows

^{Created on 2020-03-16 by the reprex package (v0.3.0)}

I'm not sure it will be necessary, but to sort in date order

dat_ts %>% arrange(-desc(date))

Update 2020-03-17

to convert dat_ts to an actual time series, use forecast::as.ts(dat_ts)
I think the names will need to be spread to create a usable ts

further 2020-03-17
3, Since a ts is a matrix, the names need to be cast as dim names

kooshkashkai · March 17, 2020, 10:48am

Hi Technocrat & thanks for your detailed response. I've been going round in circles with this for a while and it's really helpful to see a novel approach. Looking forward to seeing if 'forecast' package can work quickly on this new data!

technocrat · March 17, 2020, 5:34pm

Great! Come back if you have more questions!

irene · March 17, 2020, 5:55pm

@kooshkashkai building off of @technocrat's work, this is how I would approach the problem:

Sort by date and then get the cumulative sum of visits for each person
Get the maximum number of totvisits per person per day
Join to your dataframe of date/name combinations by both date and name

suppressPackageStartupMessages(library(dplyr)) 
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date

# visit data
date <- c("01/01/2020", "02/01/2020", "02/01/2020","02/01/2020", "03/01/2020", 
          "03/01/2020", "03/01/2020", "03/01/2020", "04/01/2020", "04/01/2020")
date <- dmy(date)
name <- c("john", "paul", "john", "peter", "peter", 
          "john", "andrew", "john", "peter", "peter")
visits <- c(1, 3, 2, 1, 3, 
            4, 6, 1 ,1, 9)
Df <- data.frame(date, name, visits, stringsAsFactors = FALSE)
# stringsAsFactors = FALSE means that all character vectors do not automatically become factors
# if you use tibble() instead of data.frame(), it will not convert strings to factors by default

# target dates/names
date <- c("01/01/2020", "02/01/2020", "03/01/2020", "04/01/2020")
date <- dmy(date)
name <- c("john", "paul", "andrew", "peter")
Df1 <- data.frame(date, name, stringsAsFactors = FALSE)

# cumulative sum per name over time (date)
df_cumulative <- Df %>% 
  filter(between(date, ymd("2020-01-01"), ymd("2020-01-04"))) %>% #modify this as needed
  arrange(date) %>% 
  group_by(name) %>% 
  mutate(totvisits = cumsum(visits)) %>% #cumulative sum
  ungroup() #removes grouping, may affect later operations if you don't do it
df_cumulative
#> # A tibble: 10 x 4
#>    date       name   visits totvisits
#>    <date>     <chr>   <dbl>     <dbl>
#>  1 2020-01-01 john        1         1
#>  2 2020-01-02 paul        3         3
#>  3 2020-01-02 john        2         3
#>  4 2020-01-02 peter       1         1
#>  5 2020-01-03 peter       3         4
#>  6 2020-01-03 john        4         7
#>  7 2020-01-03 andrew      6         6
#>  8 2020-01-03 john        1         8
#>  9 2020-01-04 peter       1         5
#> 10 2020-01-04 peter       9        14

# get maximum totvisits per day per name
df_max_per_day <- df_cumulative %>% 
  select(-visits) %>% 
  group_by(date, name) %>% 
  filter(totvisits == max(totvisits)) %>% 
  ungroup()
df_max_per_day
#> # A tibble: 8 x 3
#>   date       name   totvisits
#>   <date>     <chr>      <dbl>
#> 1 2020-01-01 john           1
#> 2 2020-01-02 paul           3
#> 3 2020-01-02 john           3
#> 4 2020-01-02 peter          1
#> 5 2020-01-03 peter          4
#> 6 2020-01-03 andrew         6
#> 7 2020-01-03 john           8
#> 8 2020-01-04 peter         14

Df1 %>% 
  left_join(df_max_per_day, by = c("date", "name"))
#>         date   name totvisits
#> 1 2020-01-01   john         1
#> 2 2020-01-02   paul         3
#> 3 2020-01-03 andrew         6
#> 4 2020-01-04  peter        14

^{Created on 2020-03-17 by the reprex package (v0.2.1)}

I would also recommend checking out the slider package for moving averages/etc, which I actually discovered myself yesterday. This 5 minute talk introduces the package.

kooshkashkai · March 17, 2020, 8:12pm

Hi Irene & thank you very much. I'll give this a try on my actual problem dataset and report back but both methods (yours and Technocrats) look promising. I'm quite new to all this and it's quite easy to get lost in all the jargon etc. so any help is appreciated!

technocrat · March 17, 2020, 9:42pm

It may not ultimately work as a time series object if there are missing dates, unless we can populate them with NAs. I set it up for 10,000 records assuming that there would be at least one for each date; and to preserve the date associated with a particular name may be a challenge. This may make @irene's slider approach preferable.

technocrat · March 18, 2020, 4:18am

Here's what I'm using currently:

library(babynames)
suppressPackageStartupMessages(library(dplyr)) 

# create large test data set

# for reproducibility
set.seed(42)
namelist  <- filter(babynames, year == 1995) # millennial given names
namelist <- sample(namelist$name, 100) # limit to 100
set.seed(137)
# set number of rows and columns
r = 10000
c = 100
# assumed numbers of visit on any given day, repeated numbers = greater weight
visitnums <- c(0,0,0,0,0,1,1,1,1,2,2,2,3,4,5,7,9)
# create a matrix and randomly fill it with number of visits
m <- matrix(sample(visitnums,r*c, replace=TRUE),r,c)
# assign names to columns, representing names
dimnames(m) <- list(seq(1:10000),namelist)
head(m)
#>   Payden Vashawn Monai Deidra Tallie Apolonia Rheannon Akili Jehan Abdul
#> 1      0       0     7      1      0        1        4     4     2     1
#> 2      1       7     4      0      4        1        0     4     0     0
#> 3      5       5     2      0      0        3        2     0     1     1
#> 4      1       1     3      1      7        0        1     0     7     3
#> 5      5       0     5      0      0        1        0     1     0     1
#> 6      1       2     7      7      3        1        0     0     5     0
#>   Harmoni Rosy Jozlyn Sharese Sherry Mikala Okechukwu Terrina Jamarri Kristofor
#> 1       7    1      1       3      2      1         4       5       1         1
#> 2       4    4      0       2      1      0         0       4       1         1
#> 3       5    3      2       0      5      9         2       1       1         9
#> 4       2    1      0       0      0      2         3       0       4         0
#> 5       1    1      9       4      1      0         0       4       9         1
#> 6       1    2      2       0      0      0         0       0       2         9
#>   Darrisha Mckenna Erasto Kaleigh Armonie Linzi Curley Yuri Kinsey Lanina Kecia
#> 1        1       0      0       9       4     3      4    0      1      9     1
#> 2        9       1      1       0       0     0      2    0      0      1     1
#> 3        7       0      0       1       0     4      1    1      0      2     1
#> 4        5       1      0       7       9     1      1    2      2      0     4
#> 5        0       0      1       0       1     0      0    1      7      1     2
#> 6        0       7      0       7       0     1      0    1      7      1     3
#>   Hendrick Yaniv Quin Abigael Jaden Shirly Lilibet Ryker Isis Sabrinia Flavio
#> 1        1     5    0       7     0      0       2     3    7        1      0
#> 2        0     2    1       0     5      0       7     0    2        1      3
#> 3        1     5    1       5     9      0       4     1    5        4      0
#> 4        0     2    4       2     1      1       2     1    4        5      9
#> 5        0     0    1       0     5      1       0     4    4        0      4
#> 6        1     1    1       1     2      0       3     1    0        1      5
#>   Batsheva Davione Jennika Arlen Lyman Leonarda Joseph Kiran Lincoln Gabryel
#> 1        9       0       1     3     1        1      7     0       0       1
#> 2        2       7       0     1     3        0      3     0       1       3
#> 3        0       0       1     4     5        2      0     3       2       1
#> 4        7       1       0     7     2        1      0     2       0       2
#> 5        9       0       5     1     0        0      0     0       4       1
#> 6        2       0       2     0     2        1      2     3       0       0
#>   Pressley Jerline Julianne Zedric Wynter Leah Lyndee Shanesha Yeshaya Zanae
#> 1        1       4        0      5      1    2      0        0       5     0
#> 2        1       7        0      1      5    3      0        1       2     2
#> 3        2       1        1      5      0    0      1        0       2     1
#> 4        5       7        2      4      0    1      0        0       2     7
#> 5        0       2        2      0      0    3      0        3       3     5
#> 6        0       3        2      2      4    4      4        2       2     4
#>   Kolin Magnolia Shaheen Shadasia Niccole Tegan Jaicee Stephanie Baotran
#> 1     1        0       3        0       2     1      0         3       9
#> 2     0        5       7        0       1     9      0         0       5
#> 3     7        1       2        1       7     0      0         0       0
#> 4     2        4       2        2       9     3      1         0       2
#> 5     0        9       2        2       0     2      1         0       0
#> 6     1        1       1        0       5     9      0         2       1
#>   Shametria Flabio Nikita Ricky Kristeen Elizah Hedit Gennie Liron Yarlin
#> 1         3      1      0     1        1      7     5      2     0      2
#> 2         0      0      5     2        0      1     7      0     2      1
#> 3         9      4      1     1        1      2     1      1     3      3
#> 4         0      0      5     0        2      1     1      1     0      0
#> 5         4      0      2     1        1      1     2      1     7      1
#> 6         0      0      1     0        0      2     1      9     1      2
#>   Savita Aide Valerie Chellsea Salah Nick Andrew Shai Selena Niccolas Mandolin
#> 1      9    1       0        2     0    1      0    0      2        2        0
#> 2      7    9       2        2     2    1      4    1      2        0        9
#> 3      5    0       0        4     3    7      1    0      1        1        0
#> 4      9    1       1        3     5    0      0    0      2        0        0
#> 5      0    9       0        5     7    1      3    4      0        7        4
#> 6      5    1       2        1     2    4      1    2      1        1        1
#>   Quentavious Rivers Sena Paiton Jiovanna Symeon Shaana Lataria
#> 1           0      4    0      9        5      0      3       5
#> 2           5      0    3      2        4      4      0       0
#> 3           0      1    5      2        2      1      2       0
#> 4           2      1    2      0        2      1      0       0
#> 5           0      3    7      0        0      1      4       0
#> 6           2      0    9      2        3      5      0       1

^{Created on 2020-03-17 by the reprex package (v0.3.0)}

This is a 10000x100 matrix with 100 names as columns and rows as the number of visits on a sequence of 10,000 dates. In converting this to a time series object, we can specify a start date c(1995,1) and frequency = 1 for daily.

As a relational database table you'd see this

create visits table in a database named mbt

CREATE TABLE visits (
ID INT NOT NULL PRIMARY KEY,
visit_date date,
visitor VARCHAR(10),
visits INT(2)
)


MariaDB [mbt]> describe visits;
+------------+-------------+------+-----+---------+-------+
| Field      | Type        | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+-------+
| ID         | int(11)     | NO   | PRI | NULL    |       |
| visit_date | date        | YES  |     | NULL    |       |
| visitor    | varchar(10) | YES  |     | NULL    |       |
| visits     | int(2)      | YES  |     | NULL    |       |
+------------+-------------+------+-----+---------+-------+
4 rows in set (0.008 sec)

Now that I've got a structure, I can turn to the logic of creating the functions to query the object by date criteria and create summaries

technocrat · March 18, 2020, 5:38am

2020-03-19

Well, it took me long enough.

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(lubridate)) 
suppressPackageStartupMessages(library(purrr)) 
suppressPackageStartupMessages(library(readr))

# test data set
dat<- read_csv("https://careaga.s3-us-west-2.amazonaws.com/visits.csv")
#> Parsed with column specification:
#> cols(
#>   .default = col_double(),
#>   date = col_date(format = "")
#> )
#> See spec(...) for full column specifications.

# specify dates and names

date1 <- c("01/01/2000", "02/01/2000", "03/01/2000", "04/01/2000")
date1 <- dmy(date1)
name1 <- c("Rivers", "Sena", "Paiton","Jiovanna")

# find number of visits prior to dates for each name
dat %>% filter(date < date1) %>% select(name1) %>% colSums()
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(name1)` instead of `name1` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.
#>   Rivers     Sena   Paiton Jiovanna 
#>     4137     4059     4100     4015

^{Created on 2020-03-19 by the reprex package (v0.3.0)}

historical garbage
Well, time series turned out to be not fruitful, but I did get a data frame out of it with a usable structure, that can be downloaded from AWS S3 as a csv.

Update 2020-03-18

I've been successful querying a tibble constructed from visits.csv based on date1 and name1. The approach I'm now pursuing is not to update DF1 but to recreate it by extracting date1 and name1 to filter and select from the visits data frame and purrr:map2__(date1,name1,fun) Using Df %>% mutate(tovisits = function(totvisits) doesn't work because it isn't really feasible rowwise

system · April 8, 2020, 5:38am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.