New column in df1 based on two columns in df2

ManyQuestions · August 20, 2022, 3:53pm

Hi! I have thought of a pretty ugly way to solve this problem but am wondering if there are any solutions that are more elegant.

I have scores that correspond to a certain rank in df2 and want to create a new column in df1 called "rank" based off the information in df2. I thought about using "recode" for this and manually typing 171 = 1, 172 = 2, 173 = 3, and so on and so forth (and will do this if it's the only way) but there has to be a better way. I tired to short cut this by taking the columns in df2 as vectors and setting them equal in the recode function but it didn't work and I'm honestly not surprised. Here is the example:

# Example

score <- c(171, 175, 176, 188, 196)
participant <- c(1021, 1023, 1027, 1024, 1005)

df1 <- tibble(participant, score)


Rscore <- c(171, 172, 173, 174, 175, 176, 177, 178, 179,180, 181, 182, 183, 184, 185, 186, 187, 188, 190, 191, 192, 193, 194, 195, 196)
Rrank <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 ,14 ,15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25)

df2 <- tibble(Rscore, Rrank)

df1 <- df1 %>% mutate(rank = recode(rank, Rscore = Rrank ))
## ^ this, not unsurprisingly, did not work

In the real data set I have thousands of scores that need to be aligned with a rank. Any thoughts would be warmly welcomed!

Sincerely,
M

FJCC · August 20, 2022, 4:30pm

I am not sure I understand your goal. Is this what you want?

library(dplyr)

score <- c(171, 175, 176, 188, 196)
participant <- c(1021, 1023, 1027, 1024, 1005)

df1 <- tibble(participant, score)


Rscore <- c(171, 172, 173, 174, 175, 176, 177, 178, 179,180, 181, 182, 183, 184, 185, 186, 187, 188, 190, 191, 192, 193, 194, 195, 196)
Rrank <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 ,14 ,15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25)

df2 <- tibble(Rscore, Rrank)

df1 <- left_join(df1,df2,by=c(score="Rscore"))
df1
#> # A tibble: 5 x 3
#>   participant score Rrank
#>         <dbl> <dbl> <dbl>
#> 1        1021   171     1
#> 2        1023   175     5
#> 3        1027   176     6
#> 4        1024   188    18
#> 5        1005   196    25

^{Created on 2022-08-20 by the reprex package (v2.0.1)}

technocrat · August 20, 2022, 9:24pm

Like @FJCC, I am uncertain as to your intent, because there is no obvious criterion by which to match the score variables of df1, which has only 5 rows, with the vectors RScore and Rrank which are of length 25. If they were equally sized with df1, the direct way would be

score <- c(171, 175, 176, 188, 196)
participant <- c(1021, 1023, 1027, 1024, 1005)

df1 <- data.frame(participant, score)

Rscore <- c(192, 193, 194, 195, 196)
Rrank <- c(1, 2, 3, 4, 5)

df1 <- cbind(df1,Rscore,Rrank)
df1
#>   participant score Rscore Rrank
#> 1        1021   171    192     1
#> 2        1023   175    193     2
#> 3        1027   176    194     3
#> 4        1024   188    195     4
#> 5        1005   196    196     5

ManyQuestions · August 23, 2022, 2:58am

@technocrat , @FJCC thank you both for your replies. My apologies for the lack of clarity. I am thinking of it this way:

d2 is a reference data frame. It contains all the possible scores and corresponding ranks. df1 is the sample population with the actual score 'score' and a participant Id number 'participant'. I want to use the reference data to apply the correct rank to the sample population data. FJCC did work well but unearthed another issue. The reference data frame actually looks more like this (note I will be using a shorter vector length for this example):

# The sample population data remains the same
score <- c(171, 175, 176, 188, 196)
participant <- c(1021, 1023, 1027, 1024, 1005)

df1 <- tibble(participant, score) # sample population data 

# Notice the reference data does not explicitly list 172, 175, or  176,  though they correspond to a rank of 1, 3, and 3 respectively. 

Rscore <- c(171, 173, 174, 177, 178, 179, 180, 181, 182, 183, )
Rrank <- c(.  1,        2,      3,     4,      5,      6,      7,      7,       7,     8)

df2 <- tibble(Rscore, Rrank) # reference data

So not all rank values are listed and some Rrank values correspond to the same Rscore so if a participant has a score of 175, @FJCC method introduces NA's. and duplicates. I am less worried about the duplicates and more wondering about the NA's. I think the best way to handle it is to maybe add rows to explicitly list the unlisted values, run @FJCC's method and filter out duplicates but I'm open to any creative thoughts or insights.

Cheers,
M

FJCC · August 23, 2022, 4:29am

There is probably a more elegant solution than this.

library(dplyr)
library(tidyr)
score <- c(171, 175, 176, 188, 196)
participant <- c(1021, 1023, 1027, 1024, 1005)

df1 <- tibble(participant, score) # sample population data 

# Notice the reference data does not explicitly list 172, 175, or  176,  though they correspond to a rank of 1, 3, and 3 respectively. 

Rscore <- c(171, 173, 174, 177, 178, 179, 180, 181, 182, 183 )
Rrank <- c(1, 2,3, 4, 5,6,  7,  7,  7, 8)

df2 <- tibble(Rscore, Rrank) 
df2
#> # A tibble: 10 x 2
#>    Rscore Rrank
#>     <dbl> <dbl>
#>  1    171     1
#>  2    173     2
#>  3    174     3
#>  4    177     4
#>  5    178     5
#>  6    179     6
#>  7    180     7
#>  8    181     7
#>  9    182     7
#> 10    183     8
AllScore <- min(df2$Rscore):max(df1$score)
Missing <- AllScore[!AllScore %in% df2$Rscore]
MissingDF <- tibble(Rscore = Missing, Rrank = NA)
df3 <- rbind(df2, MissingDF) |> arrange(Rscore) |> 
  fill(Rrank)
df3
#> # A tibble: 26 x 2
#>    Rscore Rrank
#>     <dbl> <dbl>
#>  1    171     1
#>  2    172     1
#>  3    173     2
#>  4    174     3
#>  5    175     3
#>  6    176     3
#>  7    177     4
#>  8    178     5
#>  9    179     6
#> 10    180     7
#> # ... with 16 more rows

df1 <- left_join(df1,df3,by=c(score="Rscore"))
df1
#> # A tibble: 5 x 3
#>   participant score Rrank
#>         <dbl> <dbl> <dbl>
#> 1        1021   171     1
#> 2        1023   175     3
#> 3        1027   176     3
#> 4        1024   188     8
#> 5        1005   196     8

^{Created on 2022-08-22 by the reprex package (v2.0.1)}

ManyQuestions · August 23, 2022, 3:56pm

FJCC:

AllScore <- min(df2$Rscore):max(df1$score)
Missing <- AllScore[!AllScore %in% df2$Rscore]
MissingDF <- tibble(Rscore = Missing, Rrank = NA)
df3 <- rbind(df2, MissingDF) |> arrange(Rscore) |> 
  fill(Rrank)
df3

@FJCC Even if there is a more elegant solution, this works beautifully. Thank you so much

ManyQuestions · August 23, 2022, 4:22pm

@FJCC So riddle me this...

The reference data now has the missing Rscore values filled in but there are Rranks that correspond to the same Rscore so when I go to left join I get duplicates. I'd like to take the average of the duplicated participant ranks but the best way I know how to do this is export into excel and manually slog through it.

Rscore <- c(171, 172, 173, 174, 175, 176, 177, 178, 179, 180)
Rrank <- c(1, 1, 2, 2, 2, 3, 4, 5, 6, 7)

df1 <- tibble(Rscore, Rrank) # reference data

score <- c( 171, 172, 178, 180, 180, 172, 175)
participant <- c( 1002, 1003, 1045, 1067, 1045, 1091, 1034)

df2 <- tibble (participant, score) # sample population data

df2 <- left_join (df1, df2, by =c (score = "Rscore")

So I'm wondering if theres a bit better option than manually crunching the average of the duplicates. Hope that makes sense.

FJCC · August 23, 2022, 4:35pm

Do you want to calculate the average score withing each Rrank? Note that I changed the code of the left_join because what you posted threw an error.

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.2
Rscore <- c(171, 172, 173, 174, 175, 176, 177, 178, 179, 180)
Rrank <- c(1, 1, 2, 2, 2, 3, 4, 5, 6, 7)

df1 <- tibble(Rscore, Rrank) # reference data

score <- c( 171, 172, 178, 180, 180, 172, 175)
participant <- c( 1002, 1003, 1045, 1067, 1045, 1091, 1034)

df2 <- tibble (participant, score) # sample population data

df2 <- left_join (df2, df1, by =c(score = "Rscore"))

df2
#> # A tibble: 7 x 3
#>   participant score Rrank
#>         <dbl> <dbl> <dbl>
#> 1        1002   171     1
#> 2        1003   172     1
#> 3        1045   178     5
#> 4        1067   180     7
#> 5        1045   180     7
#> 6        1091   172     1
#> 7        1034   175     2
dfAvg <- df2 |> group_by(Rrank) |> summarize(Avg = mean(score))                  
dfAvg
#> # A tibble: 4 x 2
#>   Rrank   Avg
#>   <dbl> <dbl>
#> 1     1  172.
#> 2     2  175 
#> 3     5  178 
#> 4     7  180

^{Created on 2022-08-23 by the reprex package (v2.0.1)}

ManyQuestions · August 23, 2022, 5:05pm

Ah yes I left the last ) off.

I also wrote the example wrong:

score <- c(171, 175, 173, 174, 175)
participant <- c(1021, 1023, 1027, 1024, 1005)

df1 <- tibble(participant, score) # sample population

Rscore <- c(171, 171, 172, 173, 173, 174, 174, 175, 175, 176, 177 )
Rrank <- c(1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

df2 <- tibble(Rscore, Rrank) # reference data 

df3 <- left_join(df1, df2, by = c(score = "Rscore"))

So it generates duplicates now.

I need the average of the duplicated participant's ranks. So averages of rank by rows of the duplicated participant's. It's a sticky one.

FJCC · August 23, 2022, 5:49pm

Like this?

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.2
score <- c(171, 175, 173, 174, 175)
participant <- c(1021, 1023, 1027, 1024, 1005)

df1 <- tibble(participant, score) # sample population

Rscore <- c(171, 171, 172, 173, 173, 174, 174, 175, 175, 176, 177 )
Rrank <- c(1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

df2 <- tibble(Rscore, Rrank) # reference data 

df3 <- left_join(df1, df2, by = c(score = "Rscore"))

dfAvg <- df3 |> group_by(participant) |> summarize(AvgRank = mean(Rrank)) 
dfAvg
#> # A tibble: 5 x 2
#>   participant AvgRank
#>         <dbl>   <dbl>
#> 1        1005     7.5
#> 2        1021     1  
#> 3        1023     7.5
#> 4        1024     5.5
#> 5        1027     3.5

^{Created on 2022-08-23 by the reprex package (v2.0.1)}

ManyQuestions · August 23, 2022, 5:51pm

YES!!!! THANK YOU!

system · August 30, 2022, 5:52pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.