library(tidyverse)
# toy data
df <- tibble(v = c(123456, 12045604, 45678, 89079, 20345, 2034512, 12345604))
df
#> # A tibble: 7 x 1
#> v
#> <dbl>
#> 1 123456
#> 2 12045604
#> 3 45678
#> 4 89079
#> 5 20345
#> 6 2034512
#> 7 12345604
How do I keep those observations for whom only the last two digits do not match? For example, Row 1 is 12345 and Row 7 is 12345604: everything matches except the last two digits. So, I want to keep them. Similary, row 5 and 6.
Here is my desired output:
#> # A tibble: 4 x 1
#> v
#> <dbl>
#> 1 123456
#> 2 12345604
#> 3 20345
#> 4 2034512
This works with your toy data. I hope it is not to specialized to work with you actual data. Note that the final values are characters but they could be changed to number with as.numeric().
library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.2
# toy data
df <- tibble(v = c(123456, 12045604, 45678, 89079, 20345, 2034512, 12345604))
df
#> # A tibble: 7 x 1
#> v
#> <dbl>
#> 1 123456
#> 2 12045604
#> 3 45678
#> 4 89079
#> 5 20345
#> 6 2034512
#> 7 12345604
df <- df |> mutate(v = as.character(v))
NewDF <- df |> mutate(TRUNC = str_sub(v,1,-3))
df |> inner_join(NewDF, by = c(v = "TRUNC")) |>
pivot_longer(cols = everything()) |>
select(value)
#> # A tibble: 4 x 1
#> value
#> <chr>
#> 1 123456
#> 2 12345604
#> 3 20345
#> 4 2034512
@FJCC Many thanks for solving my problem! Could please explain this part inner_join(NewDF, by = c(v = "TRUNC"))? In particular, I would like to know what v = "TRUNC" is doing. Thanks again!
takes each row of df and looks for a row in NewDF where the v value of df matches the TRUNC value of NewDF. That condition is expressed as by = c(v = "TRUNC"). That is, do the matching by the condition that v = TRUNC. The result of that is this
df |> inner_join(NewDF, by = c(v = "TRUNC"))
# A tibble: 2 x 2
v v.y
<chr> <chr>
1 123456 12345604
2 20345 2034512
Two possibly surprising things about that result are the column named v.y and the lack of a TRUNC column. The v.y column is just the v column from NewDF. The inner_join function avoids returning a data frame with identical column names by appending .y to the columns from the second data frame in the function call. The TRUNC column has disappeared because its values are identical to the v values of df. We know that because that was the matching condition used in the join. We could keep the TRUNC column by setting the keep argument in inner_join.
You can see there that the v value from df (now named v.x) matches TRUNC. I didn't do that in the original code because TRUNC gets in the way of reshaping the data with pivot_longer().
The inner_join has the property that it only keeps rows where the matching condition is met. There are other kinds of joins. left_join, right_join, semi_join, etc., with different behavior. Joins are very useful for preparing data and are well worth studying.