mutate+c_across + matching with case_when

data %>%
rowwise() %>%
    mutate(match = case_when(c_across(cols = years2:years50),
    any(abs(years_standard - .x) < 1) ~ "MATCHED",
    TRUE ~ "NO_MATCH")

above is my code. I take my data and for each row I want to see if there is a match between "years_standard" and one of the variables between "years2:years50". I want to allow for 1 year tolerance as well . so if it one year more or less, it should be ok. my error is "must be a two-sided formula, not a double vector". is there any way u could assist?<3thankss. I also want to create a column called "first_match" that tells me what column there was the FIRST match in between "yeares2:years50" with a one year tolerance . if someone could assist there that would be so helpful too.

the output dataset would look like below

name years_standard years2 years3 ...... years50  match      first_match
Ryan  10              8       9            9      MATCHED      years3
Adam   100            94     99           100     MATCHED       years50

Assume in Adam example that the first match was in "years50".please let me know if they are questions. I appreciate.

if someone could help with the second part "I also want to create a column called "first_match" that tells me what column there was the FIRST match in between "yeares2:years50" with a one year tolerance . if someone could assist there that would be so helpful too."

I would be so grateful.

Happy to help. Can you post a sample of your data please? You can use dput(head(data, 100)) for this if you'd like.

thank you so much <3
I'll make a sample of the data. I only need assistance now with: i want to create a column called "first_match" that tells me what column there was the FIRST match in between "years2:years50" with a one year tolerance .if there is no match, a simple "NA" or "NO MATCH" is fine.

years2:years50 are all numeric some are decimal like 10.4 , 40.4, etc.

name years_standard years2 years3 ...... years50  match      first_match
Ryan  10              8       9            9      MATCHED      years3         
Adam   100            94     99           100     MATCHED      years50
Ali    383           100    2323          382     MATCHED      years50
LIa     48            34     46           54      NOT MATCHED   NA
TIA    45            45      45           45      MATCHED       years2
#example data
set.seed(42)
datalong <- data.frame(
  years_standard=c(rep(28,50),rep(29,50),rep(30,50)),
  yearsnum=c(sample.int(100,100,replace=TRUE),rep(10,50))
) |> group_by(years_standard) |> mutate(rn=row_number())

data <- pivot_wider(datalong,
                    names_from = "rn",
                    values_from = "yearsnum",
                    names_prefix = "years")

d2 <- data %>%
  rowwise() %>%
  mutate( ml = list(abs(years_standard - c_across(cols = years2:years50)) <= 1),
    match = case_when( any(ml) ~ "MATCHED",
                           TRUE ~ "NO_MATCH"),
    fm_1 = list(head(which(ml),n=1)),
    fm_2 = 1+ifelse(identical(fm_1,integer(0)),NA,fm_1), #shift by one because cols scanned are only years2:50 not year1 etc.
    first_match = ifelse(is.na(fm_2),NA,paste0("year",fm_2))
    )
1 Like

nigra, thank u but the columns in between years2:years50 are not all just named years2, 3,, 4. there can be a column named "numbers_49" and random things like that.is there a way to account for in "first_match"? thank u so much. I just dont totally get what fm_2 does

fm_2 accounts for the fact that if theres no match, you cant report the first match. Also as in the data the matches align with the year numbers its easy to go from the column that matched number to the year name by adding 1.

If you have significantly different data so that you think the solution would need to be different and you would like additional help, i recommend you make a reprex.

1 Like

the output dataset would look like below

name years_standard years2 years3 x_cats_ISD...... years50  match      first_match
Ryan  10              8       9      9              48    MATCHED      years3
Adam   100            94     99     100             44    MATCHED      years50
Neil    132           23    132.9   233             34    MATCHED       years3
Ali      193         33     191    28823           8328   UNMATCHED     NA

thank u. I made a reprex. so the variables between years2 and years50 could have any name possible essentially but I know where the start and end is

what you shared is not the output from dput() or anything similar.
If I copy and paste what you provide into my R session, I will not have anything meaningful...

> name years_standard years2 years3 x_cats_ISD...... years50  match      first_match
Error: unexpected symbol in "name years_standard"
> Ryan  10              8       9      9              48    MATCHED      years3
Error: unexpected numeric constant in "Ryan  10"
> Adam   100            94     99     100             44    MATCHED      years50
Error: unexpected numeric constant in "Adam   100"
> Neil    132           23    132.9   233             34    MATCHED       years3
Error: unexpected numeric constant in "Neil    132"
> Ali      193         33     191    28823           8328   UNMATCHED     NA
Error: unexpected numeric constant in "Ali      193"

... hence the advice to use dput() etc

1 Like

here it is. omg thank u lol im dumb <3

structure(list(name = c("Ryan", "Adam", "Neil", "Ali", "Ari"),
years_standard = c(10, 100, 132, 193, 939.3), years2 = c(8,
94, 23, 33, 102.3), years3 = c(9, 99, 134, 191, 392), x_cats_ISD = c(9,
99, 132.9, 191, 9409.2), years_cats_DDP = c(11, 83, 32, 32939,
3993), years50 = c(48, 44, 34, 8328, 939.8), match = c("MATCHED",
"MATCHED", "MATCHED", "UNMATCHED", "MATCHED"), first_match = c("years3",
"years3", "x_cats_ISD", "NA", "years50")), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))

Hope this helps !
It was an interesting question for me to work on.

library(tidyverse)
df_target <- structure(list(name = c("Ryan", "Adam", "Neil", "Ali", "Ari"),
               years_standard = c(10, 100, 132, 193, 939.3), years2 = c(8,
                                                                        94, 23, 33, 102.3), years3 = c(9, 99, 134, 191, 392), x_cats_ISD = c(9,
                                                                                                                                             99, 132.9, 191, 9409.2), years_cats_DDP = c(11, 83, 32, 32939,
                                                                                                                                                                                         3993), years50 = c(48, 44, 34, 8328, 939.8), match = c("MATCHED",
                                                                                                                                                                                                                                                "MATCHED", "MATCHED", "UNMATCHED", "MATCHED"), first_match = c("years3",
                                                                                                                                                                                                                                                                                                               "years3", "x_cats_ISD", "NA", "years50")), row.names = c(NA,
                                                                                                                                                                                                                                                                                                                                                                        -5L), class = c("tbl_df", "tbl", "data.frame"))

(df_start <- df_target |> select(-match,-first_match))


(names_of_cols <- df_start |> select(years2:years50) |> names())
 

d2 <- df_start %>%
  rowwise() %>%
  mutate( ml = list(abs(years_standard - c_across(cols = all_of(names_of_cols))) <= 1),
          match = case_when( any(ml) ~ "MATCHED",
                             TRUE ~ "UNMATCHED"),
          fm_1 = list(head(which(ml),n=1)),
          first_match = ifelse(identical(integer(0),fm_1),NA,names_of_cols[fm_1])) |> 
  ungroup() |>
  select(-ml,-fm_1)


#to check its the same
#install.packages("waldo")
waldo::compare(df_target,d2)
1 Like

thank u so much. sorry its kind of complicated:(

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.