Creating a new variable that is an average of two out of three measurements

bkrishna · July 29, 2019, 10:37am

In a dataset of blood pressure readings, I'm trying to create a new average variable for blood pressure that takes two out of three based on the best readings. If a third reading is present, I take that with the first and average it. If not, it's an average of the first and second. The variables are continuous. This is the code I'm using, but I get a really long error message:

Code in R:

CARRS$sbp_avg = NA

CARRS[which(is.na(CARRS$sbp3)==F),]$sbp_avg = (CARRS[which(is.na(CARRS$sbp3)==F),]$sbp2+
  CARRS[which(is.na(CARRS$sbp3)==F),]$sbp3)/2

CARRS[which(is.na(CARRS$sbp3)==T),]$sbp_avg = (CARRS[which(is.na(CARRS$sbp3)==T),]$sbp2+
  CARRS[which(is.na(CARRS$sbp3)==T),]$sbp1)/2

Error message:

Coerced double RHS to logical to match the type of the target column (column 69 named 'sbp_avg'). If the target column's type logical is correct, it's best for efficiency to avoid the coercion and create the RHS as type logical. To achieve that consider R's type postfix: typeof(0L) vs typeof(0), and typeof(NA) vs typeof(NA_integer_) vs typeof(NA_real_). You can wrap the RHS with as.logical() to avoid this warning, but that will still perform the coercion. If the target column's type is not correct, it's best to revisit where the DT was created and fix the column type there; e.g., by using colClasses= in fread(). Otherwise, you can change the column type now by plonking a new column (of the desired type) over the top of it; e.g. DT[, sbp_avg :=as.double( sbp_avg )]. If the RHS of := has nrow(DT) elements then the assignment is called a column plonk and is the way to change a column's type. Column types can be observed with sapply(DT,typeof).

pieterjanvc · July 29, 2019, 11:19am

Hi,

I'm not entirely sure if I understand your question, but this is what I came up with:

#Generate fake data (the second column is sbp2, so names stay unique)
CARRS = data.frame(sbp1 = runif(10), sbp2 = runif(10), sbp3 = runif(10))
CARRS$sbp3[c(2, 7, 9)] = NA

#Use row-apply (1 = row, 2 is column) to do calculations for each row of data
CARRS$sbp_avg = apply(CARRS, 1, function(x){
  
  #x will be a row in the data frame
  #Check if third value (sbp3) is NA
  if(is.na(x[3])){
    mean(x[1:2])
  } else {
    mean(x[c(1,3)])
  }
  
})

The output will look something like this (I use random numbers so they will be different):

round(CARRS, 2)
   sbp1 sbp2 sbp3 sbp_avg
1  0.09 0.79 0.60    0.35
2  0.62 0.41   NA    0.51
3  0.67 0.77 0.54    0.61
4  0.12 0.71 0.71    0.42
5  0.96 0.70 0.75    0.85
6  0.97 0.77 0.21    0.59
7  0.13 0.98   NA    0.55
8  0.29 0.80 0.02    0.15
9  0.27 0.59   NA    0.43
10 0.55 0.08 0.68    0.62
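
The same rule can also be written without apply(), since arithmetic and ifelse() work on whole columns at once. This is only a minimal sketch: it assumes the second column is named sbp2, that the rule is "first and third if the third exists, otherwise first and second", and it uses made-up values; second_value is a hypothetical helper name.

```r
# Small made-up data set (values are illustrative, not from the real CARRS)
CARRS <- data.frame(
  sbp1 = c(120, 110, 130),
  sbp2 = c(118, 112, 128),
  sbp3 = c(122, NA, 126)
)

# Pick sbp3 where it exists, otherwise fall back to sbp2
second_value <- ifelse(is.na(CARRS$sbp3), CARRS$sbp2, CARRS$sbp3)

# Average it with the first reading, for all rows at once
CARRS$sbp_avg <- (CARRS$sbp1 + second_value) / 2
CARRS$sbp_avg
#> [1] 121 111 128
```

On a large table this avoids calling a function once per row, and the result is a numeric (double) column.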

Hope this helps,
PJ

bkrishna · July 29, 2019, 3:27pm

Thank you so much for taking the time to respond, pieterjanvc! I modified the code a bit, as shown below, and while the syntax makes sense to me, it returns the following error:

Error in `[<-.data.table`(x, j = name, value = value) : Supplied 70 items to be assigned to 15432 items of column 'sbp_avg'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code.

This is the code I used:

CARRS$sbp_avg = apply(CARRS, 2, function(x){
  
  #x will be a row in the data frame
  #Check if third value (sbp3) is NA
  if(is.na(x[CARRS$sbp3])){
    mean(x[CARRS$sbp1:CARRS$sbp2])
  } else {
    mean(x[c(CARRS$sbp1,CARRS$sbp3)])
  }
  
})

Here's a screenshot of what the data looks like. My guess is that the multiple NAs are causing some issue. Is there any way you can guide me in addressing that?

Yarnabrina · July 29, 2019, 3:35pm

Please do not post screenshots. People on this community are trying to help you, so you can try to help them to help you easily by sharing your dataset in a copy paste friendly format.

If you don't know how to do so, please share dput(head(CARRS)). It's better to produce a minimal reproducible example for your problem, which is nicely described in the following post:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

Also, just out of curiosity, how do you manage to get such detailed error messages in R?

Risfun · July 29, 2019, 5:35pm

Hi bkrishna,

How would you like to handle rows with all NAs?

You could use the code below to remove the rows with all NAs first:
CARRS=CARRS[rowSums(is.na(CARRS))<ncol(CARRS),]

You may also want to change your apply from column apply to row apply and then run the code you posted.

CARRS$sbp_avg = apply(CARRS, 1, function(x){

  #x will be a row in the data frame
  #Check if third value (sbp3) is NA
  if(is.na(x[CARRS$sbp3])){
    mean(x[CARRS$sbp1:CARRS$sbp2])
  } else {
    mean(x[c(CARRS$sbp1,CARRS$sbp3)])
  }

})
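
Two details about apply() matter here, sketched below on made-up data rather than the real CARRS. With margin 2 the function is called once per column and one value per column comes back, which would explain a length mismatch like 70 items for 15432 rows. Inside the function, x is a single row, so its elements can be picked by column name; CARRS$sbp3, by contrast, is the entire column. The names per_column and per_row are hypothetical.

```r
# Made-up data: 4 rows, 3 columns
CARRS <- data.frame(
  sbp1 = c(91, 116, NA, 105),
  sbp2 = c(95, 118, NA, 104),
  sbp3 = c(NA, 118, NA, NA)
)

# Margin 2: the function sees one column at a time
per_column <- apply(CARRS, 2, function(x) mean(x, na.rm = TRUE))

# Margin 1: the function sees one row at a time, as a named numeric vector
per_row <- apply(CARRS, 1, function(x) {
  if (is.na(x["sbp3"])) {
    mean(x[c("sbp1", "sbp2")])
  } else {
    mean(x[c("sbp1", "sbp3")])
  }
})

length(per_column)  # 3, one value per column
length(per_row)     # 4, one value per row, so it fits as a new column
```

Rows where the needed readings are missing come out as NA, which can be filtered afterwards.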

bkrishna · July 29, 2019, 5:57pm

Thank you for the feedback, yarnabrina! I've produced a minimal reproducible dataset below, as advised. I've also included the additional participant ID (pid) column for reference.

datapasta::df_paste(head(CARRS, 10)[, c('pid', 'sbp1', 'sbp2', 'sbp3')])


data.frame(
            pid = c(20001, 20001, 20001, 20002, 20002, 20002, 20003, 20003, 20003,
                    20004),
           sbp1 = c(91, NA, NA, 119, 123, NA, 103, 105, 116, 105),
           sbp2 = c(95, NA, NA, 111, 125, NA, 103, 107, 118, 104),
           sbp3 = c(NA, NA, NA, NA, NA, NA, NA, NA, 118, NA)
   )

As for the detailed error messages, I have no clue! I just copy-pasted the error message from RStudio.

bkrishna · July 29, 2019, 6:15pm

Hi Risfun,

I plan to omit all NAs from the new sbp_avg variable during analysis. The data is also in a long format since it is for longitudinal analysis - so would changing from a column apply to a row apply affect that?

Risfun · July 29, 2019, 7:06pm

It won't affect that. Add a step at the very end to omit the NA rows, and the code should then work.


datapasta::df_paste(head(CARRS, 10)[, c('pid', 'sbp1', 'sbp2', 'sbp3')])


CARRS=data.frame(
  pid = c(20001, 20001, 20001, 20002, 20002, 20002, 20003, 20003, 20003,
          20004),
  sbp1 = c(91, NA, NA, 119, 123, NA, 103, 105, 116, 105),
  sbp2 = c(95, NA, NA, 111, 125, NA, 103, 107, 118, 104),
  sbp3 = c(NA, NA, NA, NA, NA, NA, NA, NA, 118, NA)
)


CARRS$sbp_avg = apply(CARRS,1, function(x){
  #x will be a row in the data frame
  #Check if third value (sbp3) is NA
  if(is.na(x[4])){
    mean(x[2:3])
  } else {
    mean(x[c(3,4)])
  }
})

# omit rows having NAs in sbp_avg (referenced by name; column 4 would be sbp3)
CARRS=CARRS[-which(is.na(CARRS$sbp_avg)),]

pieterjanvc · July 29, 2019, 9:41pm

Another way of writing what @Risfun suggested:

CARRS = CARRS[!is.na(CARRS$sbp_avg),]
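
One reason to prefer the logical index is what happens when no value is missing. The sketch below uses a made-up two-row data frame named df (a hypothetical name). In that case which() returns an empty vector, and negating an empty index selects nothing, so every row is dropped:

```r
# Made-up data with no missing averages
df <- data.frame(sbp_avg = c(93, 115))

# No NAs, so which() returns integer(0); -integer(0) is still empty
nrow(df[-which(is.na(df$sbp_avg)), , drop = FALSE])
#> [1] 0

# The logical index keeps both rows
nrow(df[!is.na(df$sbp_avg), , drop = FALSE])
#> [1] 2
```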

andresrcs · July 29, 2019, 10:01pm

This would be a tidyverse-based solution, which I personally find more human-readable.

library(tidyverse)

CARRS <- data.frame(
    pid = c(20001, 20001, 20001, 20002, 20002, 20002, 20003, 20003, 20003,
            20004),
    sbp1 = c(91, NA, NA, 119, 123, NA, 103, 105, 116, 105),
    sbp2 = c(95, NA, NA, 111, 125, NA, 103, 107, 118, 104),
    sbp3 = c(NA, NA, NA, NA, NA, NA, NA, NA, 118, NA)
)


CARRS %>% 
    rowwise() %>% 
    mutate(sbp_avg = if_else(is.na(sbp3), mean(c(sbp1, sbp2)), mean(c(sbp1, sbp3)))) %>% 
    drop_na(sbp_avg) %>% 
    ungroup()
#> # A tibble: 7 x 5
#>     pid  sbp1  sbp2  sbp3 sbp_avg
#>   <dbl> <dbl> <dbl> <dbl>   <dbl>
#> 1 20001    91    95    NA     93 
#> 2 20002   119   111    NA    115 
#> 3 20002   123   125    NA    124 
#> 4 20003   103   103    NA    103 
#> 5 20003   105   107    NA    106 
#> 6 20003   116   118   118    117 
#> 7 20004   105   104    NA    104.

Created on 2019-07-29 by the reprex package (v0.3.0.9000)

bkrishna · July 30, 2019, 6:02am

Hi @andresrcs,

This worked! My only question is that the output shows up as a logical object and not as numeric. Any idea why this may be the case?

andresrcs · July 30, 2019, 12:22pm

At least with your sample data the result is numeric (double), not logical, so to help you with this you would have to provide a minimal REPRoducible EXample (reprex). A reprex makes it much easier for others to understand your issue and figure out how to help.

If you've never heard of a reprex before, you might want to start by reading this FAQ:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs
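
If it helps to narrow this down on the full data, here is a sketch of one possible cause. It is a guess, not something the sample data shows: it assumes CARRS still holds the sbp_avg column created earlier with CARRS$sbp_avg = NA, and that the pipeline result was printed without being assigned back to CARRS. For brevity it uses a vectorized if_else() in place of rowwise(), and the data are made up.

```r
library(dplyr)

CARRS <- data.frame(
  sbp1 = c(91, NA, 116),
  sbp2 = c(95, NA, 118),
  sbp3 = c(NA, NA, 118)
)

# A bare NA is logical, so a column initialised with it starts out logical
CARRS$sbp_avg <- NA
typeof(CARRS$sbp_avg)
#> [1] "logical"

# Running the pipeline without assignment only prints; CARRS is unchanged
CARRS %>%
  mutate(sbp_avg = if_else(is.na(sbp3), (sbp1 + sbp2) / 2, (sbp1 + sbp3) / 2))
typeof(CARRS$sbp_avg)
#> [1] "logical"

# Assigning the result back replaces the column with a double
CARRS <- CARRS %>%
  mutate(sbp_avg = if_else(is.na(sbp3), (sbp1 + sbp2) / 2, (sbp1 + sbp3) / 2))
typeof(CARRS$sbp_avg)
#> [1] "double"
```

Checking typeof(CARRS$sbp_avg) before and after the pipeline on the real data would show whether this is what is happening.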
