My problem is a little bit difficult to explain. I have a dataframe my_df
, grouped by variable character
. I also have some more columns: in particular, I have a last_event
column which retains the last event happened, and a column date_time
. For each character
, I want to find the "chain of events" which happened to her/him. This means looking at the last_event
column and store only the rows where last_event
changes values, i.e., the times where an event happens, while quickly removing all the other rows. Hopefully the example will make things more clear:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
library(tibble)
# set seed for reproducibility
set.seed(2)
create_input_dataframe <- function(){
# generate input data frame
character <- c("Jake", "Elwood", "Sister Mary")
character <- factor(character, levels = character)
m <- length(character)
n <- 10^4
ntot <- n * m
stop_date <- now()
period <- minutes(ntot)
start_date <- stop_date - period
date_time <- seq(start_date, stop_date, length.out = ntot)
event <- c("There was an earthquake", "A terrible flood", "Locusts")
event <- factor(event, levels = event)
x <- runif(ntot)
y <- rnorm(ntot)
probabilities <- rev(seq(0.1, 0.5, length.out = m))
my_df <- data.frame(character = sample(character, ntot, replace = TRUE,
prob = probabilities),
last_event = sample(event, ntot, replace = TRUE),
x = x,
y = y,
date_time = date_time)
my_df$last_event[sample(seq_len(ntot), n/10)] <- NA
return(my_df)
}
create_first_10_rows_of_output <- function(){
top <- structure(list(rowname = c("1", "2", "4", "7", "10", "12", "16",
"17", "19", "20"), character = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = c("Jake", "Elwood", "Sister Mary"
), class = "factor"), last_event = structure(c(1L, 3L, 1L, 2L,
1L, 2L, 3L, 2L, NA, 3L), .Label = c("There was an earthquake",
"A terrible flood", "Locusts"), class = "factor"), x = c(0.18488225992769,
0.943839338840917, 0.833448815625161, 0.23889475944452, 0.976398489437997,
0.387549542589113, 0.962644048267975, 0.132372003281489, 0.868861036840826,
0.514281762996688), y = c(-0.820686814466583, -0.143422434763756,
-0.264525547407185, -1.47082074646914, -0.142277039055211, 0.0521512918828832,
0.336463740931448, -0.5464779779466, 1.82319608312785, -0.218062867744612
), date_time = structure(c(1534790107.19217, 1534790347.20017,
1534790527.20617, 1534790767.21417, 1534791067.22417, 1534791367.23417,
1534791787.24817, 1534791847.25017, 1534792087.25818, 1534792147.26018
), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = c(1L,
2L, 4L, 7L, 10L, 12L, 16L, 17L, 19L, 20L), class = "data.frame")
return(top)
}
# create dataframe
my_df <- create_input_dataframe()
# show the first 20 rows of the test dataframe
head(my_df, 20)
#> character last_event x y
#> 1 Jake There was an earthquake 0.18488226 -0.8206868
#> 2 Sister Mary A terrible flood 0.70237404 0.1662410
#> 3 Sister Mary A terrible flood 0.57332633 0.1747081
#> 4 Elwood <NA> 0.16805192 1.0416555
#> 5 Jake Locusts 0.94383934 -0.1434224
#> 6 Elwood A terrible flood 0.94347496 1.2078019
#> 7 Jake There was an earthquake 0.12915898 0.1982122
#> 8 Jake There was an earthquake 0.83344882 -0.2645255
#> 9 Jake A terrible flood 0.46801852 -0.1216778
#> 10 Jake A terrible flood 0.54998374 1.9110023
#> 11 Elwood There was an earthquake 0.55267407 -0.4122948
#> 12 Jake A terrible flood 0.23889476 -1.4708207
#> 13 Elwood There was an earthquake 0.76051331 -1.1537385
#> 14 Jake There was an earthquake 0.18082010 2.2176152
#> 15 Elwood A terrible flood 0.40528218 -1.2535423
#> 16 Jake There was an earthquake 0.85354845 0.9300886
#> 17 Jake There was an earthquake 0.97639849 -0.1422770
#> 18 Elwood A terrible flood 0.22582546 1.6098654
#> 19 Sister Mary A terrible flood 0.44480923 -1.0891327
#> 20 Elwood A terrible flood 0.07497942 1.4692903
#> date_time
#> 1 2018-08-20 21:36:24
#> 2 2018-08-20 21:37:24
#> 3 2018-08-20 21:38:24
#> 4 2018-08-20 21:39:24
#> 5 2018-08-20 21:40:24
#> 6 2018-08-20 21:41:24
#> 7 2018-08-20 21:42:24
#> 8 2018-08-20 21:43:24
#> 9 2018-08-20 21:44:24
#> 10 2018-08-20 21:45:24
#> 11 2018-08-20 21:46:24
#> 12 2018-08-20 21:47:24
#> 13 2018-08-20 21:48:24
#> 14 2018-08-20 21:49:24
#> 15 2018-08-20 21:50:24
#> 16 2018-08-20 21:51:24
#> 17 2018-08-20 21:52:24
#> 18 2018-08-20 21:53:24
#> 19 2018-08-20 21:54:24
#> 20 2018-08-20 21:55:24
# show the first 10 rows of the desired output
(top <- create_first_10_rows_of_output())
#> rowname character last_event x y
#> 1 1 Jake There was an earthquake 0.1848823 -0.82068681
#> 2 2 Jake Locusts 0.9438393 -0.14342243
#> 4 4 Jake There was an earthquake 0.8334488 -0.26452555
#> 7 7 Jake A terrible flood 0.2388948 -1.47082075
#> 10 10 Jake There was an earthquake 0.9763985 -0.14227704
#> 12 12 Jake A terrible flood 0.3875495 0.05215129
#> 16 16 Jake Locusts 0.9626440 0.33646374
#> 17 17 Jake A terrible flood 0.1323720 -0.54647798
#> 19 19 Jake <NA> 0.8688610 1.82319608
#> 20 20 Jake Locusts 0.5142818 -0.21806287
#> date_time
#> 1 2018-08-20 20:35:07
#> 2 2018-08-20 20:39:07
#> 4 2018-08-20 20:42:07
#> 7 2018-08-20 20:46:07
#> 10 2018-08-20 20:51:07
#> 12 2018-08-20 20:56:07
#> 16 2018-08-20 21:03:07
#> 17 2018-08-20 21:04:07
#> 19 2018-08-20 21:08:07
#> 20 2018-08-20 21:09:07
Created on 2018-09-10 by the reprex package (v0.2.0).
You don't need to look at the functions create_input_dataframe
and create_first_10_rows_of_output
. You can just concentrate on the input dataframe my_df
(the first 20 rows are also printed) and on the output dataframe: since building the whole output dataframe would have been too long, I only show the first 10 rows (dataframe top
). As you can see, the output has the following characteristics:
- it's ordered by
character
: this is not strictly a requirement, I can alwaysarrange(character)
- for each
character
, two consecutive rows inlast_event
never have the same value: only the rows corresponding to a change inlast_event
have been retained, while all the others have been removed
This is the output I need (of course, not just the first 10 lines, but all of it). Thanks!
PS the reason why I didn't build a smaller example is that the solution must be fast - I need to work on larg-ish dataframes ( ~ 10^6 x 100 ), thus a slow solution won't cut it.