Value smearing (aka LOCF)

Hello everyone,

I wanted to share a quick solution for a problem I often ran into while programming in R. I spent quite a bit of time searching for an efficient way to handle this, but couldn't find a straightforward answer online.

I eventually came up with a simple, generic solution that works well, and I thought it might save some of you a few headaches. Here are some details about the problem and my approach (simple R code is provided at the end of the post).

The problem:
Technically, the problem is the following: I have a column in a data frame (or even just a vector) with many missing values (NAs). I’d like to replace each NA with the last non-missing value that precedes it in the vector. For example:


# transforming:
myV <- c(88, NA, NA, NA, 13, NA, 73, NA, NA, 111, NA)

# into
myV <- c(88, 88, 88, 88, 13, 13, 73, 73, 73, 111, 111)

(I had been calling this operation “value smearing”. I just found out that it has an official name, LOCF: last observation carried forward.)
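
For reference, the transformation above can be written as a plain row-by-row loop. This is only a minimal sketch of that baseline (the name locf_loop is made up for illustration and does not come from any package); it is the iterative version that the rest of the post tries to avoid.

```r
# Hypothetical helper: row-by-row LOCF, shown only as a baseline for comparison
locf_loop <- function(vec) {
  # seq_along(vec)[-1] is empty for vectors of length 0 or 1, so the loop is skipped
  for (i in seq_along(vec)[-1]) {
    # An NA takes the (already filled) value just before it
    if (is.na(vec[i])) vec[i] <- vec[i - 1]
  }
  vec
}

myV <- c(88, NA, NA, NA, 13, NA, 73, NA, NA, 111, NA)
smeared <- locf_loop(myV)
print(smeared)
# [1]  88  88  88  88  13  13  73  73  73 111 111
```

A leading NA stays NA here, since there is no earlier observation to carry forward.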

Classic use cases are data frames with events on a timeline, where each event may affect a particular resource and I want to keep track of the status of these resources at every timestamp. For example, consider a data frame of financial operations on various accounts, where each operation shows the resulting balance on its account, and I would like to track the balance of one specific account at each timestamp:

# initial balance for both accounts: $1000

  Account Amount AccountBalance
1    Cash    200           1200
2    Visa   -100            900
3    Cash   -400            800
4    Visa    300           1200
5    Cash   -150            650
6    Cash   -100            550

and I would like to add a "Cash" column tracking the amount of cash I have at each timestamp. (I have plenty of other examples if this seems artificial.)

Using ifelse, it is trivial to create a column with the balances for Cash and NAs elsewhere, but "remembering" the last observation for the Cash account is not that trivial without iterating row by row. For a long time I did not come across an efficient, non-iterative solution. Until:

Key observation: if the values in the vector with NAs were monotonically increasing, then a simple cummax operation that ignores the NAs (e.g. by replacing them with -Inf) would do the trick.
Obviously, we cannot count on the values to be "smeared" being monotonically increasing. However, the values of the index vector are! So if I set the index to 0 wherever the value is NA, the index vector after cummax holds, at each position, exactly the index of the original value that I would like to see there in the smeared vector:

> myV <- c(88, NA, NA, NA, 13, NA, 73, NA, NA, 111, NA)
> data.frame(myV=myV,indx=ifelse(is.na(myV),0,1:length(myV)))
   myV indx
1   88    1
2   NA    0
3   NA    0
4   NA    0
5   13    5
6   NA    0
7   73    7
8   NA    0
9   NA    0
10 111   10
11  NA    0

# adding cummax on index (%>% and mutate() come from dplyr):

> data.frame(myV=myV,indx=ifelse(is.na(myV),0,1:length(myV))) %>%
   mutate(indx = cummax(indx))
   myV indx
1   88    1
2   NA    1
3   NA    1
4   NA    1
5   13    5
6   NA    5
7   73    7
8   NA    7
9   NA    7
10 111   10
11  NA   10

# and finally using indx on myV:

> data.frame(myV=myV,indx=ifelse(is.na(myV),0,1:length(myV))) %>%
   mutate(indx = cummax(indx)) %>%
   mutate(myV = myV[indx])
   myV indx
1   88    1
2   88    1
3   88    1
4   88    1
5   13    5
6   13    5
7   73    7
8   73    7
9   73    7
10 111   10
11 111   10

The result is a very simple four-line function (really two; the rest only deals with NAs at the beginning of the vector, which have no earlier value to carry forward):

# Using cummax on index vector for LOCF:
smear_vec <- function(vec) {
  # Create a mask of indices, keeping only where values are NOT NA
  idx <- seq_along(vec)
  idx[is.na(vec)] <- 0
  
  # Use cummax to "carry forward" the last valid index
  combined_idx <- cummax(idx)
  
  # Replace leading zeros with NA so we don't index at position 0
  combined_idx[combined_idx == 0] <- NA
  
  return(vec[combined_idx])
}
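
To tie this back to the accounts example, here is a small sketch of how the "Cash" column could be built. The data frame name ops is my own choice for illustration, and smear_vec is restated in compact form only so that the snippet runs on its own.

```r
# Compact restatement of smear_vec from this post, so the snippet is self-contained
smear_vec <- function(vec) {
  idx <- seq_along(vec)
  idx[is.na(vec)] <- 0      # NA positions contribute nothing to the running maximum
  idx <- cummax(idx)        # last non-NA index seen so far
  idx[idx == 0] <- NA       # leading NAs have nothing to carry forward
  vec[idx]
}

# The account operations from the example (ops is a hypothetical name)
ops <- data.frame(
  Account        = c("Cash", "Visa", "Cash", "Visa", "Cash", "Cash"),
  Amount         = c(200, -100, -400, 300, -150, -100),
  AccountBalance = c(1200, 900, 800, 1200, 650, 550)
)

# Keep the balance on Cash rows only (NA elsewhere), then carry it forward
ops$Cash <- smear_vec(ifelse(ops$Account == "Cash", ops$AccountBalance, NA))
print(ops$Cash)
# [1] 1200 1200  800  800  650  550
```

If the first operation were on the Visa account, the Cash column would start with NA until the first Cash row; whether to replace that with the known initial balance is a separate decision.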

Final notes about the approach:

  • Memory Efficiency: It doesn't duplicate the actual data multiple times; it just manipulates an integer vector of indices.
  • Generic: It works for characters, numbers, and factors because R's [ indexing doesn't care about the underlying data type.
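
A quick way to check the "generic" claim is to run the function on a character vector and on a factor. The snippet below is just such a sanity check, with smear_vec restated compactly so it can be run in isolation; the sample values are made up.

```r
# Compact restatement of smear_vec from this post
smear_vec <- function(vec) {
  idx <- seq_along(vec)
  idx[is.na(vec)] <- 0
  idx <- cummax(idx)
  idx[idx == 0] <- NA       # a leading NA stays NA
  vec[idx]
}

# Character vector with a leading NA
chr <- smear_vec(c(NA, "a", NA, "b", NA))
print(chr)
# [1] NA  "a" "a" "b" "b"

# Factor: the result is still a factor with the same levels
fct <- smear_vec(factor(c("lo", NA, "hi", NA)))
print(as.character(fct))
# [1] "lo" "lo" "hi" "hi"
print(levels(fct))
# [1] "hi" "lo"
```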

I hope people find it useful.

Just FYI: ready-made implementations of LOCF also exist, e.g. data.table::nafill (with type = "locf") or tidyr::fill.
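
For completeness, this is roughly how the two packaged alternatives are called on the example vector. It is a minimal sketch that assumes the tidyr and data.table packages are installed; the data frame name df is only for illustration.

```r
myV <- c(88, NA, NA, NA, 13, NA, 73, NA, NA, 111, NA)

# tidyr::fill works on data frame columns; the default direction is "down", i.e. LOCF
df <- data.frame(myV = myV)
filled_df <- tidyr::fill(df, myV)

# data.table::nafill works directly on a vector
filled_vec <- data.table::nafill(myV, type = "locf")

print(filled_df$myV)
print(filled_vec)
# both print: [1]  88  88  88  88  13  13  73  73  73 111 111
```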