Saving z-scores to original data

Jason.C · April 30, 2019, 7:44pm

Greetings,
I am trying to use scale() to create z-scores for several variables at once.
Although I have figured out how to create the z-scores, I don't know how to save them to my data without overwriting the original variables. How can I add a prefix "z" to the new variables and save them to the original dataset?

Jason

#rstatsnewbie

# title: "reprex_z-scores"
# author: "Jason"
# date: "2019_04_30"

library(tidyverse)
library(reprex)
library(purrr)
library(here)

## IMPORT DATA FOR THE PILOT ANALYSES
dat <- read.csv(url('https://raw.githubusercontent.com/BrainStormCenter/ASQ_pilot/master/ASQ_pain_pilot_2019_04_19_v2.csv'), header = TRUE)
dat <- as_tibble(dat)
dat <- filter(dat, Groups != "other")

#       CREATE A SMALLER DATA SET
dat.asq <- select(dat, ID, sex, asq_1a, asq_2a, asq_3a, asq_4a)

#		CLACULATE Z-SCORES
scale(dat.asq[,c(grep("a$", colnames(dat.asq)))], center = TRUE, scale = TRUE)
#>             asq_1a       asq_2a     asq_3a      asq_4a
#>   [1,]          NA           NA         NA          NA
#>   [2,]          NA           NA         NA          NA
#>   [3,] -1.02572353 -0.530116086 -0.4462601  0.11837965
#>   [4,] -0.06098897  0.256197347  0.1792068  0.11837965
...
#> [136,]  0.90374560  0.256197347  0.1792068  0.11837965
#> [137,]  0.90374560           NA  0.1792068  1.28737874
#> attr(,"scaled:center")
#>   asq_1a   asq_2a   asq_3a   asq_4a 
#> 2.063218 2.674179 2.713483 1.898734 
#> attr(,"scaled:scale")
#>    asq_1a    asq_2a    asq_3a    asq_4a 
#> 1.0365546 1.2717575 1.5988057 0.8554327

^{Created on 2019-04-30 by the reprex package (v0.2.1)}

FJCC · April 30, 2019, 9:27pm

I do not know of an elegant way to do this. Here is a multi-step method.

library(dplyr)

df <- data.frame(Dat1 = rnorm(8, 3, 4), Dat2 = rnorm(8, 6,2))
df
#>         Dat1     Dat2
#> 1 -1.1584662 5.347810
#> 2  5.6154816 6.066798
#> 3  0.9517124 2.258065
#> 4  7.8930756 5.177651
#> 5  9.6121578 8.828259
#> 6  6.2444510 3.703474
#> 7  1.0889977 7.890299
#> 8 -1.0091224 6.407307
dftmp <- as.data.frame(scale(df))
names(dftmp) <- paste0("z", names(dftmp))
df <-  bind_cols(df, dftmp)
summary(df)
#>       Dat1              Dat2           zDat1              zDat2          
#>  Min.   :-1.1585   Min.   :2.258   Min.   :-1.14912   Min.   :-1.628599  
#>  1st Qu.: 0.4615   1st Qu.:4.809   1st Qu.:-0.76237   1st Qu.:-0.425020  
#>  Median : 3.3522   Median :5.707   Median :-0.07223   Median :-0.001252  
#>  Mean   : 3.6548   Mean   :5.710   Mean   : 0.00000   Mean   : 0.000000  
#>  3rd Qu.: 6.6566   3rd Qu.:6.778   3rd Qu.: 0.71666   3rd Qu.: 0.503927  
#>  Max.   : 9.6122   Max.   :8.828   Max.   : 1.42227   Max.   : 1.471211

^{Created on 2019-04-30 by the reprex package (v0.2.1)}

andresrcs · April 30, 2019, 10:33pm

You can use dplyr::mutate_at() and add a suffix

library(tidyverse)
dat <- read.csv(url('https://raw.githubusercontent.com/BrainStormCenter/ASQ_pilot/master/ASQ_pain_pilot_2019_04_19_v2.csv'), header = TRUE)
dat %>%
    as_tibble() %>% 
    filter(Groups != "other") %>% 
    select(ID, sex, asq_1a, asq_2a, asq_3a, asq_4a) %>% 
    mutate_at(vars(starts_with("asq")), list(z = ~as.vector(scale(.))))
#> # A tibble: 137 x 10
#>    ID      sex asq_1a asq_2a asq_3a asq_4a asq_1a_z asq_2a_z asq_3a_z
#>    <fct> <int>  <dbl>  <dbl>  <dbl>  <dbl>    <dbl>    <dbl>    <dbl>
#>  1 IP003     1     NA     NA     NA     NA  NA        NA       NA    
#>  2 IP005     1     NA     NA     NA     NA  NA        NA       NA    
#>  3 IP008     2      1      2      2      2  -1.03     -0.530   -0.446
#>  4 IP009     2      2      3      3      2  -0.0610    0.256    0.179
#>  5 IP010     1     NA     NA     NA     NA  NA        NA       NA    
#>  6 IP013     1     NA     NA     NA     NA  NA        NA       NA    
#>  7 IP015     1      1      2      2      1  -1.03     -0.530   -0.446
#>  8 IP016     1      2      4      3      2  -0.0610    1.04     0.179
#>  9 IP017     1      3      3      3      1   0.904     0.256    0.179
#> 10 IP019     2      2      4      3     NA  -0.0610    1.04     0.179
#> # … with 127 more rows, and 1 more variable: asq_4a_z <dbl>

Jason.C · April 30, 2019, 11:10pm

Thank you. This seem to be exactly what I was hoping to find. Now I just need to better understand what 'list is doing.

Question Can I use group_by to calculate the z-scores by sex and have the results combined into a single variable/vector?

Answer Yes. Simply adding - group_by(sex) - calculated the z-scores for each sex independently and put the scores in the same variable/vector.

dat <- read.csv(url('https://raw.githubusercontent.com/BrainStormCenter/ASQ_pilot/master/ASQ_pain_pilot_2019_04_19_v2.csv'), header = TRUE)
dat %>%
    as_tibble() %>% 
    filter(Groups != "other") %>% 
    select(ID, sex, asq_1a, asq_2a, asq_3a, asq_4a) %>% 
	group_by(sex) %>%
    mutate_at(vars(starts_with("asq")), list(z = ~as.vector(scale(.))))

#rstatsnewbie

Jason.C · May 3, 2019, 3:56pm

I just learned that I was scoring the questionnaire incorrectly. Instead of calculating the z-scores by vector, they have to be calculated by individual (i.e., row-wise). I’ve tried this and was only successful in creating the new variables. However, all the values are NaN. Any help or guidance on what I’ve done wrong is greatly appreciated.
Cheers,
Jason

# title: "reprex_Zscores"
# author: "Jason"
# date: "2019_04_30"
#modified: "03/05/2019"

library(tidyverse)
library(reprex)
# library(purrr)
# library(here)

## Community answer
dat <- read.csv(url('https://raw.githubusercontent.com/BrainStormCenter/ASQ_pilot/master/ASQ_pain_pilot_2019_04_19_v2.csv'), header = TRUE)
#       This creates z-scores for a variable/vector and works
# z.dat1 <- dat %>%
#   as_tibble() %>%
#   filter(Groups != "other") %>%
#   select(ID, sex, asq_1a, asq_2a, asq_3a, asq_4a, asq_5a, asq_6a,
#          asq_7a, asq_8a, asq_9a) %>%
#   mutate_at(vars(starts_with("asq")), list(z1 = ~as.vector(scale(.))))

#       Trying to calculate z-scores rowwise
#       Colums are made but valvues are NaN
z.dat2 <- dat %>%
    as_tibble() %>%
    filter(Groups != "other") %>%
    select(ID, sex, asq_1a, asq_2a, asq_3a, asq_4a, asq_5a, asq_6a,
           asq_7a, asq_8a, asq_9a) %>%
    rowwise() %>%
    mutate_at(vars(starts_with("asq")), list(zz = ~as.vector(scale(.))))

^{Created on 2019-05-03 by the reprex package (v0.2.1)}

andresrcs · May 3, 2019, 5:03pm

Your code is triying to scale a single number for each column and that doesn't makes sense, that is why you get NaNs, one way to do this would be by reshaping your data
(Note: I'm using the new pivot_wider() function from the development version of dplyr)

library(tidyverse)

dat <- read.csv(url('https://raw.githubusercontent.com/BrainStormCenter/ASQ_pilot/master/ASQ_pain_pilot_2019_04_19_v2.csv'), header = TRUE)
z.dat2 <- dat %>%
    as_tibble() %>%
    filter(Groups != "other") %>%
    select(ID, sex, asq_1a, asq_2a, asq_3a, asq_4a, asq_5a, asq_6a,
           asq_7a, asq_8a, asq_9a) %>%
    gather(asq, value, -ID, -sex) %>% 
    group_by(ID) %>% 
    mutate(z = as.vector(scale(value))) %>% 
    pivot_wider(names_from = asq, values_from = c(value, z))

z.dat2
#> # A tibble: 137 x 20
#>    ID      sex value_asq_1a value_asq_2a value_asq_3a value_asq_4a
#>    <fct> <int>        <dbl>        <dbl>        <dbl>        <dbl>
#>  1 IP003     1           NA           NA           NA           NA
#>  2 IP005     1           NA           NA           NA           NA
#>  3 IP008     2            1            2            2            2
#>  4 IP009     2            2            3            3            2
#>  5 IP010     1           NA           NA           NA           NA
#>  6 IP013     1           NA           NA           NA           NA
#>  7 IP015     1            1            2            2            1
#>  8 IP016     1            2            4            3            2
#>  9 IP017     1            3            3            3            1
#> 10 IP019     2            2            4            3           NA
#> # … with 127 more rows, and 14 more variables: value_asq_5a <dbl>,
#> #   value_asq_6a <dbl>, value_asq_7a <dbl>, value_asq_8a <dbl>,
#> #   value_asq_9a <dbl>, z_asq_1a <dbl>, z_asq_2a <dbl>, z_asq_3a <dbl>,
#> #   z_asq_4a <dbl>, z_asq_5a <dbl>, z_asq_6a <dbl>, z_asq_7a <dbl>,
#> #   z_asq_8a <dbl>, z_asq_9a <dbl>

Jason.C · May 3, 2019, 5:27pm

Thank you. Because I use rowwise() to average these variables/subject, I, incorrectly, thought I could use scale() the same way. Thanks for the note about the pivot_wider() function. How do I get the development version of dplyr so I can try this solution?

andresrcs · May 3, 2019, 5:39pm

Install it from GitHub

devtools::install_github("tidyverse/tidyr")

Jason.C · May 3, 2019, 6:25pm

Thank you. You have been very helpful. I have managed to actually include all the asq variables to scale. I have been trying to alter your code so that the results are added to z.dat2, but haven't had any success. I’ve tried different variations of mutate_at and select(everything()) but nothing has worked. Any suggestions on what I am missing or not understanding?

z.dat2 <- dat %>% 
	as_tibble() %>%
	filter(Groups != "other") %>%
	select(ID, sex, ends_with("a"), everything()) %>% 
	gather(asq, value, -ID, -sex, -everything(), na.rm = TRUE) %>%
	group_by(ID) %>%
	mutate(z = as.vector(scale(value))) %>%
	pivot_wider(names_from = asq, values_from = c(value, z))

#rstatsnewbie

andresrcs · May 3, 2019, 11:01pm

The problem is here, by excluding -everything() you are getting no effect with this command, what are you trying to accomplish by doing this?

I think that in this case it would be better to select the variables you want to gather instead of excluding the ones you don't. Try with something like this.

# This gathers all variables starting with "asq" and ending with "a"
gather(asq, value, matches("^asq.+a$"), na.rm = TRUE) %>%

Jason.C · May 4, 2019, 8:59pm

My goal was to add the z-score variables to the original data set. I figured out how with the code below but I thought there might be an easier way.

Thanks for all the help. I really appreciate the effort.

Cheers,
Jason

dat <- filter(dat, Groups != "other")
# Create z-scores by row
z.dat2 <- dat %>%
	as_tibble() %>%
	#filter(Groups != "other") %>%
	select(ID, sex, ends_with("a")) %>%
	gather(asq, value, -ID, -sex, na.rm = FALSE) %>%
	group_by(ID) %>%
	mutate(z = as.vector(scale(value))) %>%
	pivot_wider(names_from = asq, values_from = c(value, z))

# New dataset with ID and z-scores
z.dat3 <- select(z.dat2, ID, starts_with("z"))
# Add the z-scores to the original dataset
dat <- merge(dat, z.dat3, by = "ID")

system · May 11, 2019, 8:59pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.