Recoding range of columns based on criteria set from another column

cactus · March 10, 2022, 2:36am

Hi community,

I have an example dataset "student_score" where I have variables - age and score for 9 years for 11 different individuals .

I want to change the value in the column score1:score9 based on the value in the age column (age1:age9). The condition would be that if someone is less then age 20, then all the values of their score for that year will not be counted and will thus be converted into "0".
i.e if age1:age9 < 20, then values for those years will be = 0 and won't be counted.

e.g for subjectid "b", their score for the first 3 year will not be counted as during those year their age was 18,19 and 20 years.

I need to do this across a large number of columns so looking for a way to do it at once rather than individually - column by column which i figured out.

Thankyou for the help !

student_score <- data.frame(
  subjectid = c("a","b","c","d","e","f","g","h","i","j","k"),
  age1 = c(27,18,28,14,11,32,20,19,12,23,34),
  age2 = c(28,19,29,15,12,33,21,20,13,24,35),
  age3 = c(29,20,30,16,13,34,22,21,14,25,36),
  age4 = c(30,21,31,17,14,35,23,22,15,26,37),
  age5 = c(31,22,32,18,15,36,24,23,16,27,38),
  age6 = c(32,23,33,19,16,37,25,24,17,28,39),
  age7 = c(33,24,34,20,17,38,26,25,18,29,40),
  age8 = c(34,25,35,21,18,39,27,26,19,30,41),
  age9 = c(35,26,36,22,19,40,28,27,20,31,42),
  score1 = c(1,4,2,1,1,0,0,1,0,0,0),
  score2 = c(1,0,0,1,0,0,2,1,0,2,0),
  score3 = c(1,5,2,1,0,0,1,1,1,0,0),
  score4 = c(1,1,0,1,2,0,1,1,5,1,1),
  score5 = c(0,4,2,1,0,0,0,1,1,1,1),
  score6 = c(1,2,0,1,0,2,1,1,0,1,0),
  score7 = c(1,1,0,0,1,2,1,0,0,1,1),
  score8 = c(1,1,0,0,2,5,1,0,0,3,1),
  score9 = c(1,1,3,0,1,0,1,4,5,1,1))

rdr · March 10, 2022, 5:46am

I spent way more time on this than I thought I would and tried a bunch of fancy things before reverting to some really basic wrangling techniques. See what you think, @cactus.

First, I split your student_score data frame into two separate data frames, one with ages and one with scores:

library(dplyr)

age_data <- student_score %>% 
  select(subjectid, starts_with("age"))

score_data <- student_score %>% 
  select(subjectid, starts_with("score"))

These data frames have the same dimensions—and this process requires that to be the case. In other words, you have to have as many age variables (i.e., columns) as you do scores variables and the same number of observations (i.e., rows) in both sets.

With the this setup done it was a simple matter to set all scores to zero where the corresponding age was under 20. Recall that age_data < 20 yields a logical vector indicating where that condition is TRUE and where it's not:

      subjectid  age1  age2  age3  age4  age5  age6  age7  age8  age9
 [1,]     FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [2,]     FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [3,]     FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [4,]     FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
 [5,]     FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [6,]     FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [7,]     FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [8,]     FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [9,]     FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
[10,]     FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[11,]     FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

We can use that result to update score_data, setting its values to zero where the corresponding values from age_data so indicate:

score_data[age_data < 20] = 0

Now we can simply recombine the data with a simple join:

result <- left_join(age_data, score_data)

Here's the result:

   subjectid age1 age2 age3 age4 age5 age6 age7 age8 age9 score1 score2 score3 score4 score5 score6 score7 score8 score9
1          a   27   28   29   30   31   32   33   34   35      1      1      1      1      0      1      1      1      1
2          b   18   19   20   21   22   23   24   25   26      0      0      5      1      4      2      1      1      1
3          c   28   29   30   31   32   33   34   35   36      2      0      2      0      2      0      0      0      3
4          d   14   15   16   17   18   19   20   21   22      0      0      0      0      0      0      0      0      0
5          e   11   12   13   14   15   16   17   18   19      0      0      0      0      0      0      0      0      0
6          f   32   33   34   35   36   37   38   39   40      0      0      0      0      0      2      2      5      0
7          g   20   21   22   23   24   25   26   27   28      0      2      1      1      0      1      1      1      1
8          h   19   20   21   22   23   24   25   26   27      0      1      1      1      1      1      0      0      4
9          i   12   13   14   15   16   17   18   19   20      0      0      0      0      0      0      0      0      5
10         j   23   24   25   26   27   28   29   30   31      0      2      0      1      1      1      1      3      1
11         k   34   35   36   37   38   39   40   41   42      0      0      0      1      1      0      1      1      1

dvetsch75 · March 10, 2022, 1:33pm

I would start by getting this into a tidy format i.e. where every variable is a column and every observation is a row. In your student_score dataset, I see three variables: year, age, and score. My opinion is that it would be a lot easier to work with this dataset if it were shaped something like this:

library(dplyr)
library(tidyr)
library(stringr)

student_score <- data.frame(
    subjectid = c("a","b","c","d","e","f","g","h","i","j","k"),
    age1 = c(27,18,28,14,11,32,20,19,12,23,34),
    age2 = c(28,19,29,15,12,33,21,20,13,24,35),
    age3 = c(29,20,30,16,13,34,22,21,14,25,36),
    age4 = c(30,21,31,17,14,35,23,22,15,26,37),
    age5 = c(31,22,32,18,15,36,24,23,16,27,38),
    age6 = c(32,23,33,19,16,37,25,24,17,28,39),
    age7 = c(33,24,34,20,17,38,26,25,18,29,40),
    age8 = c(34,25,35,21,18,39,27,26,19,30,41),
    age9 = c(35,26,36,22,19,40,28,27,20,31,42),
    score1 = c(1,4,2,1,1,0,0,1,0,0,0),
    score2 = c(1,0,0,1,0,0,2,1,0,2,0),
    score3 = c(1,5,2,1,0,0,1,1,1,0,0),
    score4 = c(1,1,0,1,2,0,1,1,5,1,1),
    score5 = c(0,4,2,1,0,0,0,1,1,1,1),
    score6 = c(1,2,0,1,0,2,1,1,0,1,0),
    score7 = c(1,1,0,0,1,2,1,0,0,1,1),
    score8 = c(1,1,0,0,2,5,1,0,0,3,1),
    score9 = c(1,1,3,0,1,0,1,4,5,1,1))

tidy_scores <- student_score %>% 
    pivot_longer(cols = -subjectid) %>% 
    mutate(
        year = str_sub(name, start = nchar(name)) %>% #Extract the year from the original variable name and encode it as its own variable
            as.numeric,
        name = str_sub(name, start = 1, end = nchar(name) - 1) # remove the year from the name since it now has the 'year' column
    ) %>% 
    pivot_wider(
        id_cols = c(subjectid, year),
        names_from = name,
        values_from = value
    )

tidy_scores
#> # A tibble: 99 x 4
#>    subjectid  year   age score
#>    <fct>     <dbl> <dbl> <dbl>
#>  1 a             1    27     1
#>  2 a             2    28     1
#>  3 a             3    29     1
#>  4 a             4    30     1
#>  5 a             5    31     0
#>  6 a             6    32     1
#>  7 a             7    33     1
#>  8 a             8    34     1
#>  9 a             9    35     1
#> 10 b             1    18     4
#> # ... with 89 more rows

From this comes a clear use for dplyr::case_when:

tidy_scores %>% 
    mutate(
        score = case_when(
            age < 20 ~ 0,
            T ~ score
        )
    )
#> # A tibble: 99 x 4
#>    subjectid  year   age score
#>    <fct>     <dbl> <dbl> <dbl>
#>  1 a             1    27     1
#>  2 a             2    28     1
#>  3 a             3    29     1
#>  4 a             4    30     1
#>  5 a             5    31     0
#>  6 a             6    32     1
#>  7 a             7    33     1
#>  8 a             8    34     1
#>  9 a             9    35     1
#> 10 b             1    18     0
#> # ... with 89 more rows

Hopefully that helps!

cactus · March 10, 2022, 2:30pm

Thank you @rdr ! This worked perfectly. I had to do some tweaks but this did the trick

cactus · March 10, 2022, 2:31pm

thank you @dvetsch75 . Agreed that setting into long makes this task easier !

system · March 17, 2022, 2:31pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.