Function from several for loops

jrdubbleu · April 23, 2023, 6:17pm

I have these several consecutive loops in some data-wrangling code. Obviously, this is not the most efficient method for completing these tasks. I want to convert them into a function I can call with each of the variable names. In addition, to help with this, I'm curious if someone can walk me through their thought process on converting the loop into a function. What do you look at first, and then how do you get to the end product? I don't have a technical know-how issue with the function, so much as I'm not quite sure where to start with dissecting the code logically. Sorry for no reprex.

# Loop through post_mice_sas to check for previous 
# value of > max_percent_missing and if it is greater than that value to
# reset the variable for that specific case to "NA" to prepare for imputing
# total scores only

# Loop through each row of the post_mice_sas data frame
for (i in 1:nrow(post_mice_sas)) {
  # Check if the value in the lazy_sas column of the current row is greater 
  # than max_percent_missing
  if (post_mice_sas$lazy_sas[i] > max_percent_missing) {
    # If the condition is true, loop through each column of the current row
    for (j in 2:ncol(post_mice_sas)) { 
      # Set the value of the current cell to NA
      post_mice_sas[i, j] <- NA 
    }
  }
}

# Repeat same test for each measure's variables
# PMS
for (i in 1:nrow(post_mice_pms)) {
  if (post_mice_pms$lazy_pms[i] > max_percent_missing) {
    for (j in 2:ncol(post_mice_pms)) { 
      post_mice_pms[i, j] <- NA 
    }
  }
}

# GAD
for (i in 1:nrow(post_mice_gad)) {
  if (post_mice_gad$lazy_gad[i] > max_percent_missing) {
    for (j in 2:ncol(post_mice_gad)) { 
      post_mice_gad[i, j] <- NA 
    }
  }
}

# PHQ
for (i in 1:nrow(post_mice_phq)) {
  if (post_mice_phq$lazy_phq[i] > max_percent_missing) {
    for (j in 2:ncol(post_mice_phq)) { 
      post_mice_phq[i, j] <- NA 
    }
  }
}

# DTS
for (i in 1:nrow(post_mice_dts)) {
  if (post_mice_dts$lazy_dts[i] > max_percent_missing) {
    for (j in 2:ncol(post_mice_dts)) { 
      post_mice_dts[i, j] <- NA 
    }
  }
}

# RTS
for (i in 1:nrow(post_mice_rts)) {
  if (post_mice_rts$lazy_rts[i] > max_percent_missing) {
    for (j in 2:ncol(post_mice_rts)) { 
      post_mice_rts[i, j] <- NA 
    }
  }
}

# UPPS
for (i in 1:nrow(post_mice_upps)) {
  if (post_mice_upps$lazy_upps[i] > max_percent_missing) {
    for (j in 2:ncol(post_mice_upps)) { 
      post_mice_upps[i, j] <- NA 
    }
  }
}

# BSMAS
for (i in 1:nrow(post_mice_bsmas)) {
  if (post_mice_bsmas$lazy_bsmas[i] > max_percent_missing) {
    for (j in 2:ncol(post_mice_bsmas)) { 
      post_mice_bsmas[i, j] <- NA 
    }
  }
}

technocrat · April 23, 2023, 7:34pm

Here's my mechanical approach; I'll come back with the thought process on the overall pattern.

In the snippet, f(x,y) is a function that takes the data frame and the value for max_percent_missing as its arguments. It is a wrapper that calls the eight loops sequentially, and those loops are encapsulated in separate functions f1 .. f8.

f <- function(x,y){
  f1(x,y)
  f2(x,y)
  f3(x,y)
  f4(x,y)
  f5(x,y)
  f6(x,y)
  f7(x,y)
  f8(x,y)
  }
  
f1 <- function(x,y) {
  # Loop through each row of the x data frame
  for (i in seq_len(nrow(x))) {
    # Check if the value in the lazy_sas column of the current row is greater 
    # than max_percent_missing (passed in as y)
    if (x$lazy_sas[i] > y) {
      # If the condition is true, loop through each column of the current row
      for (j in 2:ncol(x)) { 
        # Set the value of the current cell to NA
        x[i, j] <- NA 
      }
    }
  }
  # Return the modified copy; R does not change the caller's data frame in place
  x
}

# f2 .. similar

Lacking representative data, this is untested, so I assume each for loop, and therefore each function expressing it, works satisfactorily. I created the prototype f1 function with the RStudio menu bar feature Code | Extract Function, then removed the two internal loop-index arguments it had guessed.
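
Since f1 .. f8 differ only in the data frame and the name of the lazy_* column, one way to collapse them is to pass the column name as a third argument. This is only a sketch: the name blank_rows, the lazy_col argument, and the toy data are made up for illustration and are not from the original post. Because R functions return a modified copy, the result has to be assigned back.

```r
# Hypothetical helper: lazy_col is the name of the "lazy_*" column, as a string
blank_rows <- function(x, lazy_col, y) {
  for (i in seq_len(nrow(x))) {
    # isTRUE() treats an NA comparison as "do nothing" instead of raising an error
    if (isTRUE(x[[lazy_col]][i] > y)) {
      # Same effect as the inner j loop: blank columns 2 .. ncol(x) of row i
      x[i, 2:ncol(x)] <- NA
    }
  }
  x
}

# Made-up toy data standing in for post_mice_sas
post_mice_sas <- data.frame(
  id       = 1:3,
  sas_1    = c(1, 2, 3),
  lazy_sas = c(0.1, 0.6, 0.2)
)
max_percent_missing <- 0.5

# Assign the result back; the original object is not changed in place
post_mice_sas <- blank_rows(post_mice_sas, "lazy_sas", max_percent_missing)
```

The same call then works for the other measures by swapping the data frame and the column name, for example blank_rows(post_mice_pms, "lazy_pms", max_percent_missing).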

technocrat · April 23, 2023, 7:40pm

Here's part one of the philosophical approach. You've probably heard me on this anyway. Part two will be how I would refactor the code based on that approach.

The Tao of Analysis

Any R task benefits from an over-arching mental model: f(x) = y, just as in school algebra. x is what is to hand, y is what is required and f is what is available to transform x into y. Each of these objects may be, and usually is, composite. x may contain columns and rows of numeric and character values, y may be a table of summary statistics and f might be in the form of f(g(x)). In R everything is an object, including functions, and because functions can be arguments to other functions, it is said that in R, functions are "first class" objects.
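
A small illustration of the composite form f(g(x)) and of a function being passed as an argument. The names apply_twice, f, g and the numbers are invented for the example.

```r
x <- c(4, 9, 16)          # what is to hand
g <- sqrt                 # a function is an object, so it can be given another name
f <- function(v) sum(v)   # a second transformation
y <- f(g(x))              # composite f(g(x)): the sum of the square roots, 9

# A function passed as an argument to another function
apply_twice <- function(fun, v) fun(fun(v))
z <- apply_twice(sqrt, 16)  # sqrt(sqrt(16)) is 2
```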

Notice that this model is missing how. In programming, this type of model is called a functional style. R as it presents to the user is a functional programming language. The other principal style of programming is called procedural or imperative: do this, do that, then do this other thing the first way and that other thing the second way, and so on. A functional orientation helps to keep the eyes on the ball and the goal. When it is important to do something very specific, a procedural orientation helps to keep the eyes on the patch of grass beneath it.
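
To make the contrast concrete, here is the same small task written both ways. The vector and the threshold are made up; replace() is a base R function.

```r
x <- c(3, 8, 1, 9)
k <- 5

# Procedural: spell out how, one element at a time
y_loop <- x
for (i in seq_along(y_loop)) {
  if (y_loop[i] > k) y_loop[i] <- NA
}

# Functional: state what is wanted, values above k become NA
y_vec <- replace(x, x > k, NA)

same <- identical(y_loop, y_vec)  # both give 3 NA 1 NA
```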

This type of model also is called analysis. Analysis is hard because it is unnatural. We do not go through our daily lives minutely examining every situation and breaking it down into its smallest pieces. Rather, we are constantly scanning and integrating sources of information in our environment simultaneously. Legal education in the United States is heavily focused on analysis in its first year. Some students have had university experience in chemistry, linguistics or philosophy and have less difficulty. Most, however, lack previous exposure and will go to great lengths to avoid learning analysis. That tendency is countered by assignment of more court cases to review than can be comfortably read, with the threat of being called upon in lecture to present a case with no notice. Despite the severe pressure of time and materials, students will spend extra hours reading student guidebooks and meeting with other students to try collectively to find the right answers. It only slowly becomes apparent that while the answer to any particular type of case turns on the specific facts, the questions remain the same.

technocrat · April 23, 2023, 8:15pm

Here is my refactoring of the OP code based on my Tao.

x is a tabular object, preferably a numeric matrix
y is ultimately the same object modified to impute values when certain conditions are satisfied
y for the immediate purpose is to flag the row:column entries that satisfy the conditions
f is the function or functions to accomplish the immediate purpose

The conditions are met based on a logical inequality test between an element of x and a numeric constant, k, so the return value of f in the first instance is of type logical. Accordingly, f requires two arguments, f = f(x,k), and each element of the return value y is TRUE or FALSE.

Given the return value, f can be composed of a single logical test applied to x, in which the positions where the test returns TRUE are assigned NA (which in R is of type logical by default).

Demonstration

x <- structure(
  c(
    6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8,
    4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4, 4, 4, 4, 3, 3, 3,
    3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 5, 4, 4, 3, 3, 4, 3, 4, 4,
    4, 4, 4, 4, 4, 3, 3, 2, 3, 3, 3, 4, 3, 3, 3, 3, 4, 4, 4, 5, 5,
    5, 2, 2, 2, 2, 4, 3, 4, 4, 2, 2, 2, 3, 3, 4, 3, 4, 4, 4, 3, 3,
    3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4,
    5, 5, 5, 5, 5, 4, 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4,
    4, 4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 2, 2, 4, 6, 8, 2
  ),
  dim = c(32L, 5L), dimnames = list(NULL, NULL)
)

x
#>       [,1] [,2] [,3] [,4] [,5]
#>  [1,]    6    4    3    4    4
#>  [2,]    6    4    3    4    4
#>  [3,]    4    4    2    4    1
#>  [4,]    6    3    3    3    1
#>  [5,]    8    3    3    3    2
#>  [6,]    6    3    3    3    1
#>  [7,]    8    3    4    3    4
#>  [8,]    4    4    3    4    2
#>  [9,]    4    4    3    4    2
#> [10,]    6    4    3    4    4
#> [11,]    6    4    3    4    4
#> [12,]    8    3    4    3    3
#> [13,]    8    3    4    3    3
#> [14,]    8    3    4    3    3
#> [15,]    8    3    5    3    4
#> [16,]    8    3    5    3    4
#> [17,]    8    3    5    3    4
#> [18,]    4    4    2    4    1
#> [19,]    4    5    2    4    2
#> [20,]    4    4    2    4    1
#> [21,]    4    4    2    3    1
#> [22,]    8    3    4    3    2
#> [23,]    8    3    3    3    2
#> [24,]    8    4    4    3    4
#> [25,]    8    3    4    3    2
#> [26,]    4    4    2    4    1
#> [27,]    4    4    2    5    2
#> [28,]    4    4    2    5    2
#> [29,]    8    4    3    5    4
#> [30,]    6    4    3    5    6
#> [31,]    8    4    4    5    8
#> [32,]    4    4    3    4    2

k <- 5

x[x > k] <- NA
x
#>       [,1] [,2] [,3] [,4] [,5]
#>  [1,]   NA    4    3    4    4
#>  [2,]   NA    4    3    4    4
#>  [3,]    4    4    2    4    1
#>  [4,]   NA    3    3    3    1
#>  [5,]   NA    3    3    3    2
#>  [6,]   NA    3    3    3    1
#>  [7,]   NA    3    4    3    4
#>  [8,]    4    4    3    4    2
#>  [9,]    4    4    3    4    2
#> [10,]   NA    4    3    4    4
#> [11,]   NA    4    3    4    4
#> [12,]   NA    3    4    3    3
#> [13,]   NA    3    4    3    3
#> [14,]   NA    3    4    3    3
#> [15,]   NA    3    5    3    4
#> [16,]   NA    3    5    3    4
#> [17,]   NA    3    5    3    4
#> [18,]    4    4    2    4    1
#> [19,]    4    5    2    4    2
#> [20,]    4    4    2    4    1
#> [21,]    4    4    2    3    1
#> [22,]   NA    3    4    3    2
#> [23,]   NA    3    3    3    2
#> [24,]   NA    4    4    3    4
#> [25,]   NA    3    4    3    2
#> [26,]    4    4    2    4    1
#> [27,]    4    4    2    5    2
#> [28,]    4    4    2    5    2
#> [29,]   NA    4    3    5    4
#> [30,]   NA    4    3    5   NA
#> [31,]   NA    4    4    5   NA
#> [32,]    4    4    3    4    2

impute <- 6

x[is.na(x)] <- impute

x
#>       [,1] [,2] [,3] [,4] [,5]
#>  [1,]    6    4    3    4    4
#>  [2,]    6    4    3    4    4
#>  [3,]    4    4    2    4    1
#>  [4,]    6    3    3    3    1
#>  [5,]    6    3    3    3    2
#>  [6,]    6    3    3    3    1
#>  [7,]    6    3    4    3    4
#>  [8,]    4    4    3    4    2
#>  [9,]    4    4    3    4    2
#> [10,]    6    4    3    4    4
#> [11,]    6    4    3    4    4
#> [12,]    6    3    4    3    3
#> [13,]    6    3    4    3    3
#> [14,]    6    3    4    3    3
#> [15,]    6    3    5    3    4
#> [16,]    6    3    5    3    4
#> [17,]    6    3    5    3    4
#> [18,]    4    4    2    4    1
#> [19,]    4    5    2    4    2
#> [20,]    4    4    2    4    1
#> [21,]    4    4    2    3    1
#> [22,]    6    3    4    3    2
#> [23,]    6    3    3    3    2
#> [24,]    6    4    4    3    4
#> [25,]    6    3    4    3    2
#> [26,]    4    4    2    4    1
#> [27,]    4    4    2    5    2
#> [28,]    4    4    2    5    2
#> [29,]    6    4    3    5    4
#> [30,]    6    4    3    5    6
#> [31,]    6    4    4    5    6
#> [32,]    4    4    3    4    2

Created on 2023-04-23 with reprex v2.0.2

Once tested, this could be converted to f(x,k,im) where

x is the original tabular object
k is the threshold constant; values greater than k are set to NA
im is the value to impute: a constant, or the return value of a function still to be developed
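
A minimal sketch of what f(x,k,im) could look like, wrapping the two assignment steps from the demonstration. It assumes im is a single constant, and the small matrix m is made up for the example.

```r
# Sketch only: flag values above k as NA, then fill every NA with im
f <- function(x, k, im) {
  x[x > k] <- NA        # logical test on the whole object, TRUE positions become NA
  x[is.na(x)] <- im     # note: this also fills any NA that was already in x
  x
}

m <- matrix(c(6, 4, 8, 3), nrow = 2)
out <- f(m, k = 5, im = 6)
```

If values that were already missing should stay missing, the flag from the first step would have to be kept and used for the second step instead of is.na(x).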

Intended lesson: Once the what questions are answered, the functions to accomplish each step are obvious. Ask what do I have, what do I want and what off-the-shelf function might be available to get me there in one or more steps? That is easier than asking "how do I improve this for loop?"
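
Applied to the original question, where a whole row (every column after the first) is blanked when its lazy_* value exceeds the threshold, the same what-first reasoning might lead to something like the sketch below. The helper name flag_rows, the list layout, and the toy data are assumptions for illustration; Map() and paste0() are base R.

```r
# Hypothetical helper: blank columns 2 .. ncol(df) of every row whose
# lazy column exceeds k
flag_rows <- function(df, lazy_col, k) {
  over <- !is.na(df[[lazy_col]]) & df[[lazy_col]] > k  # which rows exceed k
  df[over, -1] <- NA                                    # everything but column 1
  df
}

# Made-up stand-ins for two of the post_mice_* data frames
measures <- list(
  sas = data.frame(id = 1:3, sas_1 = c(2, 5, 1), lazy_sas = c(0.1, 0.7, 0.3)),
  pms = data.frame(id = 1:3, pms_1 = c(4, 4, 2), lazy_pms = c(0.9, 0.2, 0.0))
)
max_percent_missing <- 0.5

# One call per measure; the list names supply the lazy_* column suffix
cleaned <- Map(
  function(df, nm) flag_rows(df, paste0("lazy_", nm), max_percent_missing),
  measures, names(measures)
)
```

Keeping the data frames in a named list means the eight copies of the loop become one function and one Map() call.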
