mutate+c_across + matching with case_when

help · August 15, 2022, 7:29pm

data %>%
rowwise() %>%
    mutate(match = case_when(c_across(cols = years2:years50),
    any(abs(years_standard - .x) < 1) ~ "MATCHED",
    TRUE ~ "NO_MATCH")

above is my code. I take my data and for each row I want to see if there is a match between "years_standard" and one of the variables between "years2:years50". I want to allow for 1 year tolerance as well . so if it one year more or less, it should be ok. my error is "must be a two-sided formula, not a double vector". is there any way u could assist?<3thankss. I also want to create a column called "first_match" that tells me what column there was the FIRST match in between "yeares2:years50" with a one year tolerance . if someone could assist there that would be so helpful too.

the output dataset would look like below

name years_standard years2 years3 ...... years50  match      first_match
Ryan  10              8       9            9      MATCHED      years3
Adam   100            94     99           100     MATCHED       years50

Assume in Adam example that the first match was in "years50".please let me know if they are questions. I appreciate.

help · August 15, 2022, 9:47pm

if someone could help with the second part "I also want to create a column called "first_match" that tells me what column there was the FIRST match in between "yeares2:years50" with a one year tolerance . if someone could assist there that would be so helpful too."

I would be so grateful.

dvetsch75 · August 16, 2022, 3:03am

Happy to help. Can you post a sample of your data please? You can use dput(head(data, 100)) for this if you'd like.

help · August 16, 2022, 3:08am

thank you so much <3
I'll make a sample of the data. I only need assistance now with: i want to create a column called "first_match" that tells me what column there was the FIRST match in between "years2:years50" with a one year tolerance .if there is no match, a simple "NA" or "NO MATCH" is fine.

years2:years50 are all numeric some are decimal like 10.4 , 40.4, etc.

name years_standard years2 years3 ...... years50  match      first_match
Ryan  10              8       9            9      MATCHED      years3         
Adam   100            94     99           100     MATCHED      years50
Ali    383           100    2323          382     MATCHED      years50
LIa     48            34     46           54      NOT MATCHED   NA
TIA    45            45      45           45      MATCHED       years2

nirgrahamuk · August 16, 2022, 9:29am

#example data
set.seed(42)
datalong <- data.frame(
  years_standard=c(rep(28,50),rep(29,50),rep(30,50)),
  yearsnum=c(sample.int(100,100,replace=TRUE),rep(10,50))
) |> group_by(years_standard) |> mutate(rn=row_number())

data <- pivot_wider(datalong,
                    names_from = "rn",
                    values_from = "yearsnum",
                    names_prefix = "years")

d2 <- data %>%
  rowwise() %>%
  mutate( ml = list(abs(years_standard - c_across(cols = years2:years50)) <= 1),
    match = case_when( any(ml) ~ "MATCHED",
                           TRUE ~ "NO_MATCH"),
    fm_1 = list(head(which(ml),n=1)),
    fm_2 = 1+ifelse(identical(fm_1,integer(0)),NA,fm_1), #shift by one because cols scanned are only years2:50 not year1 etc.
    first_match = ifelse(is.na(fm_2),NA,paste0("year",fm_2))
    )

help · August 16, 2022, 3:12pm

nigra, thank u but the columns in between years2:years50 are not all just named years2, 3,, 4. there can be a column named "numbers_49" and random things like that.is there a way to account for in "first_match"? thank u so much. I just dont totally get what fm_2 does

nirgrahamuk · August 16, 2022, 4:43pm

fm_2 accounts for the fact that if theres no match, you cant report the first match. Also as in the data the matches align with the year numbers its easy to go from the column that matched number to the year name by adding 1.

If you have significantly different data so that you think the solution would need to be different and you would like additional help, i recommend you make a reprex.

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

help · August 16, 2022, 4:56pm

the output dataset would look like below

name years_standard years2 years3 x_cats_ISD...... years50  match      first_match
Ryan  10              8       9      9              48    MATCHED      years3
Adam   100            94     99     100             44    MATCHED      years50
Neil    132           23    132.9   233             34    MATCHED       years3
Ali      193         33     191    28823           8328   UNMATCHED     NA

thank u. I made a reprex. so the variables between years2 and years50 could have any name possible essentially but I know where the start and end is

nirgrahamuk · August 16, 2022, 5:11pm

what you shared is not the output from dput() or anything similar.
If I copy and paste what you provide into my R session, I will not have anything meaningful...

> name years_standard years2 years3 x_cats_ISD...... years50  match      first_match
Error: unexpected symbol in "name years_standard"
> Ryan  10              8       9      9              48    MATCHED      years3
Error: unexpected numeric constant in "Ryan  10"
> Adam   100            94     99     100             44    MATCHED      years50
Error: unexpected numeric constant in "Adam   100"
> Neil    132           23    132.9   233             34    MATCHED       years3
Error: unexpected numeric constant in "Neil    132"
> Ali      193         33     191    28823           8328   UNMATCHED     NA
Error: unexpected numeric constant in "Ali      193"

... hence the advice to use dput() etc

help · August 16, 2022, 5:20pm

here it is. omg thank u lol im dumb <3

help · August 16, 2022, 5:20pm

structure(list(name = c("Ryan", "Adam", "Neil", "Ali", "Ari"),
years_standard = c(10, 100, 132, 193, 939.3), years2 = c(8,
94, 23, 33, 102.3), years3 = c(9, 99, 134, 191, 392), x_cats_ISD = c(9,
99, 132.9, 191, 9409.2), years_cats_DDP = c(11, 83, 32, 32939,
3993), years50 = c(48, 44, 34, 8328, 939.8), match = c("MATCHED",
"MATCHED", "MATCHED", "UNMATCHED", "MATCHED"), first_match = c("years3",
"years3", "x_cats_ISD", "NA", "years50")), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))

nirgrahamuk · August 16, 2022, 5:42pm

Hope this helps !
It was an interesting question for me to work on.

library(tidyverse)
df_target <- structure(list(name = c("Ryan", "Adam", "Neil", "Ali", "Ari"),
               years_standard = c(10, 100, 132, 193, 939.3), years2 = c(8,
                                                                        94, 23, 33, 102.3), years3 = c(9, 99, 134, 191, 392), x_cats_ISD = c(9,
                                                                                                                                             99, 132.9, 191, 9409.2), years_cats_DDP = c(11, 83, 32, 32939,
                                                                                                                                                                                         3993), years50 = c(48, 44, 34, 8328, 939.8), match = c("MATCHED",
                                                                                                                                                                                                                                                "MATCHED", "MATCHED", "UNMATCHED", "MATCHED"), first_match = c("years3",
                                                                                                                                                                                                                                                                                                               "years3", "x_cats_ISD", "NA", "years50")), row.names = c(NA,
                                                                                                                                                                                                                                                                                                                                                                        -5L), class = c("tbl_df", "tbl", "data.frame"))

(df_start <- df_target |> select(-match,-first_match))


(names_of_cols <- df_start |> select(years2:years50) |> names())
 

d2 <- df_start %>%
  rowwise() %>%
  mutate( ml = list(abs(years_standard - c_across(cols = all_of(names_of_cols))) <= 1),
          match = case_when( any(ml) ~ "MATCHED",
                             TRUE ~ "UNMATCHED"),
          fm_1 = list(head(which(ml),n=1)),
          first_match = ifelse(identical(integer(0),fm_1),NA,names_of_cols[fm_1])) |> 
  ungroup() |>
  select(-ml,-fm_1)


#to check its the same
#install.packages("waldo")
waldo::compare(df_target,d2)

help · August 16, 2022, 9:19pm

thank u so much. sorry its kind of complicated:(

system · September 6, 2022, 9:20pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.