Conditional summation inside a for loop

djmangen · March 5, 2024, 6:49pm

Hello. You're dealing with a comparative newbie here.

I am trying to write a for loop to calculate some weighted sums across a number of variables. Here is some sample data.

library(tidyverse)
# Define weights

wt1 <- c(0.806, 0.586, 0.785)

# Create test data

col1 <- c(0, 4, 3, NA, 1)
col2 <- c(1, 3, 4, 2, NA)
col3 <- c(NA, 0, 3, 2, 2)
df <- as.data.frame(cbind(col1, col2, col3))

# Initialize summary values.

df$Num1 <- 0 
df$Denom <- 0

# Create loop
for(i in nrows(df)) {
    for(j in ncol(df) {
        if(!is.na(df[i,j]) 
            df$Num1[i] = df$Num1[i] + (df[i,j] * wt1[j])
            df$Denom[i] = df$Denom[i] + (4 * wt1[j])
    }
}

I'm trying to loop across both rows and columns and create two summary values, both of which are conditional on valid data for the column. I'm expecting that Num1 in Row 1 should equal 0.806 * 0 + 0.586 * 1 = 0.586, and that the Denom equals 4 * 0.806 + 4 * 0.586 = 5.568.

EDIT: I found errors in the code regarding referencing the new variables with [i,j] when it should have been simply [i]. Updated now.

The error I receive is:

Error: unexpected '{' in:
"for(i in nrows(df)) {
for(j in ncol(df) {"

Any help would be greatly appreciated, including helping me learn a way to do this without using a loop. I have seen and used a matrix-based solution that requires complete data.

Thank you in advance.

prubin · March 5, 2024, 7:57pm

You have all sorts of syntax errors here.

The correct function to get the number of rows is nrow, not nrows.
To index over all rows, you want i in 1:nrow(df), not i in nrow(df) (which will cause the outer loop to be executed just once, with i set to the number of rows in df).
You have mismatched parentheses. An old school debugging trick is to start counting from 0, adding 1 for each opening parenthesis and subtracting 1 for each closing parenthesis. When you reach the end of an expression, the count should be 0.
After supplying a missing closing parenthesis for the if statement, you need to surround the next two lines with braces. Otherwise, the df$Num1 line is inside the if statement but the next line is not.

Then there is at least one logic error. The problem with index j running from 1 to ncol(df) is that ncol(df) now includes the two added columns (Num1 and Denom), which I'm pretty sure you do not want to include in the loop.
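For reference, the fixes listed above can be put together as in the sketch below. It is one possible corrected version of the loop, not necessarily the code the original poster ended up with. The name data_cols is an assumption introduced here so that the inner loop only visits col1 to col3 and skips the two summary columns.

```r
# Weights, one per data column (from the original post)
wt1 <- c(0.806, 0.586, 0.785)

# Test data (from the original post)
df <- data.frame(
  col1 = c(0, 4, 3, NA, 1),
  col2 = c(1, 3, 4, 2, NA),
  col3 = c(NA, 0, 3, 2, 2)
)

# Record the data columns BEFORE adding the summary columns, so the
# inner loop never touches Num1 or Denom (data_cols is a made-up name)
data_cols <- seq_len(ncol(df))

# Initialize summary values
df$Num1 <- 0
df$Denom <- 0

for (i in seq_len(nrow(df))) {      # every row, not just the last one
  for (j in data_cols) {            # data columns only
    if (!is.na(df[i, j])) {         # balanced parentheses, braced body
      df$Num1[i] <- df$Num1[i] + df[i, j] * wt1[j]
      df$Denom[i] <- df$Denom[i] + 4 * wt1[j]
    }
  }
}
```

For row 1 this gives Num1 = 0.586 and Denom = 5.568, matching the values expected in the original post.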

djmangen · March 5, 2024, 8:18pm

Thank you for your help. You have helped me solve this problem, so your answer can be checked as the best solution. I certainly appreciate the help.

If there are alternative solutions I'm open to hearing about them, if only because that loop is frightfully slow.

Thanks again.

prubin · March 5, 2024, 8:25pm

Your calculation seems to treat NAs as zeros. Is that intentional? (The answer will influence finding a faster way.)

djmangen · March 5, 2024, 8:43pm

In the calculation of the numerator variable, yes 0 is the equivalent. However, when calculating the denominator we have a somewhat different situation where if the numerator variable is NA then we need to add 0 to the denominator.

prubin · March 5, 2024, 8:59pm

How is adding 0 in the denominator (i.e., no change) different from the numerator?

djmangen · March 6, 2024, 5:36pm

You know, I suspect you're right.

I was going down the path of thinking that since the element in wt1 would never be missing that it would likely be multiplied and summed regardless of what was going on in the numerator calculation.

FWIW a step after this is to create an index where the numerator is divided by the denominator to get a weighted proportion, and I want to adjust the denominator downward to account for missing data where I am uncomfortable inferring a value. I'll add that "too much" missing data is also factored into the mix to push the scale score to missing as well if that threshold is exceeded.

After many years of coding in SAS I am trying to really wrap my head around the vector-based model of R, with varying degrees of success. Thank you very much for your input.
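The index step described in this post (numerator divided by denominator, set to missing when too much data is absent) might look roughly like the sketch below. The cutoff max_missing and its value are purely hypothetical, since the thread never states the actual threshold. The Num1 and Denom values are the ones the sample data in the original post produces.

```r
# Sample data from the original post
col1 <- c(0, 4, 3, NA, 1)
col2 <- c(1, 3, 4, 2, NA)
col3 <- c(NA, 0, 3, 2, 2)

# Num1 and Denom for the sample data (weights 0.806, 0.586, 0.785)
Num1  <- c(0.586, 4.982, 7.117, 2.742, 2.376)
Denom <- c(5.568, 8.708, 8.708, 5.484, 6.364)

# Count the missing items in each row
n_missing <- rowSums(is.na(cbind(col1, col2, col3)))

# Hypothetical cutoff: rows with more missing items than this get NA
max_missing <- 1

# Weighted proportion, pushed to NA when the cutoff is exceeded
index <- ifelse(n_missing > max_missing, NA_real_, Num1 / Denom)
```

With this sample data no row has more than one missing item, so no index value is set to NA. A row with every item missing would have Denom equal to 0, and the cutoff keeps that row from producing NaN.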

prubin · March 6, 2024, 8:38pm

R can be a bit quirky, to put it mildly. The following code does (I think) what you requested in the original post. I won't swear it's the most efficient route, but it works. It uses two auxiliary functions (f1 and f2) that calculate your Num1 and Denom on a row by row basis. The mutate operator creates new columns based on computational formulas and rowwise gets R to apply the formulas on a row by row basis.

f1 <- function(x) sum(x * wt1, na.rm = TRUE)
f2 <- function(x) (wt1[!is.na(x)] * 4) |> sum()

df <- df |> rowwise() |> mutate(Num1 = f1(c(col1, col2, col3)), Denom = f2(c(col1, col2, col3)))
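Since the original post mentions a matrix-based solution that needs complete data, here is a sketch of a matrix variant that tolerates NAs. This is an alternative technique, not the rowwise code above: missing values are replaced by 0 for the numerator, and a 0/1 indicator of valid cells is used for the denominator. The names m, m0 and valid are assumptions introduced for illustration, and the block uses base R only.

```r
# Weights and test data (from the original post)
wt1 <- c(0.806, 0.586, 0.785)
df <- data.frame(
  col1 = c(0, 4, 3, NA, 1),
  col2 = c(1, 3, 4, 2, NA),
  col3 = c(NA, 0, 3, 2, 2)
)

# Work on the data columns as a numeric matrix
m <- as.matrix(df[, c("col1", "col2", "col3")])

# Numerator: treat NA as 0 so it contributes nothing to the weighted sum
m0 <- m
m0[is.na(m0)] <- 0
df$Num1 <- as.vector(m0 %*% wt1)

# Denominator: 1 where the cell is valid, 0 where it is NA,
# so only the weights of non-missing items are counted
valid <- 1 - is.na(m)
df$Denom <- as.vector(valid %*% (4 * wt1))
```

Both products handle all rows in one call each, which should scale better than a cell-by-cell loop on larger data, though that has not been timed here.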

djmangen · March 6, 2024, 8:51pm

WOW! Thank you; this is extremely generous of you. I greatly appreciate it. It certainly looks like it will be vastly more efficient than the spaghetti code loops.

With your presumed permission, I will download this code and test it. I probably won't be able to do that immediately, but I will test it, compare it to the looping model, and let you know.

THANK YOU!

system · March 13, 2024, 8:52pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.