Hello everyone,
I wanted to share a quick solution for a problem I often ran into while programming in R. I spent quite a bit of time searching for an efficient way to handle this, but couldn't find a straightforward answer online.
I eventually came up with a simple, generic solution that works well, and I thought it might save some of you a few headaches. Here are some details about the problem and my approach (simple R code is provided at the end of the post).
The problem:
Technically, the problem is the following: I have a column in a data frame (or even just a vector) with many missing values (NAs). I’d like to replace each NA with the last preceding non-missing value in the vector. For example:
# transforming:
myV
c(88, NA, NA, NA, 13, NA, 73, NA, NA, 111, NA)
# into
myV
c(88, 88, 88, 88, 13, 13, 73, 73, 73, 111, 111)
(I had been calling this operation “value smearing”. I just found out that it has an official name, LOCF: last observation carried forward.)
Classic use cases are data frames with events on a timeline, some of which affect particular resources, where I would like to keep track of the status of these resources at each timestamp. For example, consider a data frame of financial operations on various accounts, each operation showing the resulting balance on that account, where I would like to track the balance on a specific account at each timestamp:
# initial cash for both accounts: $1000
Account Amount AccountBalance
1 Cash 200 1200
2 Visa -100 900
3 Cash -400 800
4 Visa 300 1200
5 Cash -150 650
6 Cash -100 550
and I would like to add a "Cash" column tracking the amount of cash I have at each timestamp. (I have plenty of other examples if this seems artificial.)
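To make the example concrete, here is a small sketch that rebuilds the table above and creates a column holding only the Cash balances. The data frame name ops and the column name Cash are my own choices for illustration:

```r
# Hypothetical data frame mirroring the table above
ops <- data.frame(
  Account        = c("Cash", "Visa", "Cash", "Visa", "Cash", "Cash"),
  Amount         = c(200, -100, -400, 300, -150, -100),
  AccountBalance = c(1200, 900, 800, 1200, 650, 550)
)

# Keep the balance on Cash rows, NA everywhere else
ops$Cash <- ifelse(ops$Account == "Cash", ops$AccountBalance, NA)
ops$Cash
# 1200 NA 800 NA 650 550
```

The NAs on the Visa rows are exactly the gaps that need to be filled with the last known Cash balance.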
Using ifelse, it is trivial to create a column with only the balances for Cash and NAs elsewhere, but "remembering" the last observation for the Cash account is not that trivial without iterating row by row. For a while I did not come across any efficient non-iterative solution. Until:
Key observation: if the values in the vector with NAs were monotonically increasing, then a simple cummax operation that ignores the NAs (e.g. by replacing them with -Inf) would do the trick.
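A minimal sketch of that observation, using a made-up increasing vector x (not the myV from the example above):

```r
# Toy vector that is monotonically increasing apart from the NAs
x <- c(3, NA, NA, 8, NA, 15, NA)

# Replace NAs with -Inf so that cummax() simply skips over them
filled <- cummax(ifelse(is.na(x), -Inf, x))
filled
# 3 3 3 8 8 15 15
```

Note that a leading NA would come out as -Inf here, so the beginning of the vector needs separate handling.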
Obviously, we cannot count on the values to be "smeared" being monotonically increasing. However, the values of the index vector are! And the index vector that results from cummax contains, at each position, exactly the index of the original value that I would like to see there in the smeared vector:
> myV
c(88, NA, NA, NA, 13, NA, 73, NA, NA, 111, NA)
> data.frame(myV=myV,indx=ifelse(is.na(myV),0,1:length(myV)))
myV indx
1 88 1
2 NA 0
3 NA 0
4 NA 0
5 13 5
6 NA 0
7 73 7
8 NA 0
9 NA 0
10 111 10
11 NA 0
# adding cummax on index (mutate and %>% come from dplyr):
> data.frame(myV=myV,indx=ifelse(is.na(myV),0,1:length(myV))) %>%
mutate(indx = cummax(indx))
myV indx
1 88 1
2 NA 1
3 NA 1
4 NA 1
5 13 5
6 NA 5
7 73 7
8 NA 7
9 NA 7
10 111 10
11 NA 10
# and finally using indx on myV:
> data.frame(myV=myV,indx=ifelse(is.na(myV),0,1:length(myV))) %>%
mutate(indx = cummax(indx)) %>%
mutate(myV = myV[indx])
myV indx
1 88 1
2 88 1
3 88 1
4 88 1
5 13 5
6 13 5
7 73 7
8 73 7
9 73 7
10 111 10
11 111 10
Very simple: a short function (really just 2 lines of logic; the rest deals with NAs at the beginning of the vector, which have no previous value to carry forward):
# Using cummax on index vector for LOCF:
smear_vec <- function(vec) {
  # Build the index vector, zeroing out the positions where vec is NA
  # (0L keeps idx an integer vector)
  idx <- seq_along(vec)
  idx[is.na(vec)] <- 0L
  # Use cummax to "carry forward" the last valid index
  combined_idx <- cummax(idx)
  # Replace leading zeros with NA so we don't index at position 0
  combined_idx[combined_idx == 0L] <- NA
  return(vec[combined_idx])
}
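As a sanity check, the index trick can be compared against a plain row-by-row loop on the example vector. This is only a sketch: locf_loop is a hypothetical reference helper I wrote for the comparison, and the index trick is repeated inline so the snippet runs on its own:

```r
myV <- c(88, NA, NA, NA, 13, NA, 73, NA, NA, 111, NA)

# Reference: plain row-by-row loop (hypothetical helper, for comparison only)
locf_loop <- function(vec) {
  for (i in seq_along(vec)[-1]) {
    if (is.na(vec[i])) vec[i] <- vec[i - 1]
  }
  vec
}

# The cummax index trick, written inline
idx <- cummax(ifelse(is.na(myV), 0L, seq_along(myV)))
idx[idx == 0L] <- NA
smeared <- myV[idx]

smeared
# 88 88 88 88 13 13 73 73 73 111 111
identical(smeared, locf_loop(myV))
# TRUE
```

Both versions give the same result; the difference is that the index version is fully vectorized.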
Final notes about the approach:
- Memory Efficiency: It doesn't duplicate the actual data multiple times; it just manipulates an integer vector of indices.
- Generic: It works for characters, numbers, and factors because R's [ indexing doesn't care about the underlying data type.
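A quick sketch of the "generic" point, including what happens with NAs at the start of the vector. The helper locf_idx is a compact, hypothetical restatement of the same index trick, included only so the snippet is self-contained:

```r
# Compact restatement of the index trick (hypothetical name)
locf_idx <- function(vec) {
  idx <- cummax(ifelse(is.na(vec), 0L, seq_along(vec)))
  idx[idx == 0L] <- NA   # leading NAs have nothing to carry forward
  vec[idx]
}

# Character vector with a leading NA: the leading NA stays NA
chr <- locf_idx(c(NA, "a", NA, "b", NA))
chr
# NA "a" "a" "b" "b"

# Factor: the result is still a factor with the same levels
fct <- locf_idx(factor(c("lo", NA, "hi", NA)))
as.character(fct)
# "lo" "lo" "hi" "hi"

# Logical vector
lgl <- locf_idx(c(TRUE, NA, FALSE, NA))
lgl
# TRUE TRUE FALSE FALSE
```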
I hope people find it useful.