I have a database with many variables that I need to recode their values, I have an excel table with the recoded values. How can I do to recode quickly and automatically without having to do it manually?
I give a small and easy example. I have this base:
sport music city
<chr> <chr> <chr>
1 soccer tango San Pablo
2 voley pop Buenos Aires
3 rugby heavy metal Buenos Aires
4 basket bossa nova Buenos Aires
5 tenis rock San Pablo
6 tenis pop Buenos Aires
7 voley bossa nova London
8 rugby heavy metal London
9 basket tango San Pablo
10 tenis bossa nova San Pablo
11 basket heavy metal Paris
12 basket tango Paris
13 basket tango Buenos Aires
14 voley tango Buenos Aires
15 tenis heavy metal London
And here are the excel charts with how to recode.
sport
sport_bis
rugby
sport1
soccer
sport2
basquet
sport1
voley
sport2
tenis
sport3
music
music_bis
rock
genre1
tango
genre1
pop
genre2
heavy metal
genre2
bossa nova
genre2
city
city_bis
london
city_1
barcelona
city_2
paris
city_1
san pablo
city_1
buenos aires
city_2
This example can be done manually without problem, but I'm posing the question for a 20-variable scenario where each variable has 20 values ββto recode.
You can do this using the left_join() function from dplyr. If your original data frame is named DF and the Excel table has been imported as the object NAMES, you can do
library(dplyr)
DF <- left_join(DF, NAMES, by = c(sport = "sport")
DF will now have a column named sport_bis with the appropriate value from NAMES in each row. This assumes that NAMES has columns named sport and sport_bis. If there is no matching value of sport in NAMES, the sport_bis column will have NA. I particularly mention this because your original data has the value basket in some rows and the Excel data has the value basquet.
You can reshape the data and use a single list for the renaming process. I adjusted the capitalization of the cities so that they would match.
library(dplyr)
library(tidyr)
DF <- read.csv("~/R/Play/Dummy.csv")
DF
#> sport music city
#> 1 soccer tango San Pablo
#> 2 voley pop Buenos Aires
#> 3 rugby heavy metal Buenos Aires
#> 4 basket bossa nova Buenos Aires
#> 5 tenis rock San Pablo
#> 6 tenis pop Buenos Aires
#> 7 voley bossa nova London
#> 8 rugby heavy metal London
#> 9 basket tango San Pablo
#> 10 tenis bossa nova San Pablo
#> 11 basket heavy metal Paris
#> 12 basket tango Paris
#> 13 basket tango Buenos Aires
#> 14 voley tango Buenos Aires
#> 15 tenis heavy metal London
DF2 <- read.csv("~/R/Play/Dummy2.csv")
DF2
#> Name1 Name2
#> 1 rugby sport1
#> 2 soccer sport2
#> 3 basquet sport1
#> 4 voley sport2
#> 5 tenis sport3
#> 6 rock genre1
#> 7 tango genre1
#> 8 pop genre2
#> 9 heavy metal genre2
#> 10 bossa nova genre2
#> 11 London city_1
#> 12 Barcelona city_2
#> 13 Paris city_1
#> 14 San Pablo city_1
#> 15 Buenos Aires city_2
DFlong <- DF |> pivot_longer(cols = everything())
DFlong
#> # A tibble: 45 Γ 2
#> name value
#> <chr> <chr>
#> 1 sport soccer
#> 2 music tango
#> 3 city San Pablo
#> 4 sport voley
#> 5 music pop
#> 6 city Buenos Aires
#> 7 sport rugby
#> 8 music heavy metal
#> 9 city Buenos Aires
#> 10 sport basket
#> # β¦ with 35 more rows
DFlong <- left_join(DFlong,DF2, by = c(value="Name1"))
DFlong
#> # A tibble: 45 Γ 3
#> name value Name2
#> <chr> <chr> <chr>
#> 1 sport soccer sport2
#> 2 music tango genre1
#> 3 city San Pablo city_1
#> 4 sport voley sport2
#> 5 music pop genre2
#> 6 city Buenos Aires city_2
#> 7 sport rugby sport1
#> 8 music heavy metal genre2
#> 9 city Buenos Aires city_2
#> 10 sport basket <NA>
#> # β¦ with 35 more rows
DFWide <- DFlong |> select(-value) |>
mutate(ROW=rep(1:15,each=3)) |>
pivot_wider(names_from = "name",values_from = "Name2")
DFWide
#> # A tibble: 15 Γ 4
#> ROW sport music city
#> <int> <chr> <chr> <chr>
#> 1 1 sport2 genre1 city_1
#> 2 2 sport2 genre2 city_2
#> 3 3 sport1 genre2 city_2
#> 4 4 <NA> genre2 city_2
#> 5 5 sport3 genre1 city_1
#> 6 6 sport3 genre2 city_2
#> 7 7 sport2 genre2 city_1
#> 8 8 sport1 genre2 city_1
#> 9 9 <NA> genre1 city_1
#> 10 10 sport3 genre2 city_1
#> 11 11 <NA> genre2 city_1
#> 12 12 <NA> genre1 city_1
#> 13 13 <NA> genre1 city_2
#> 14 14 sport2 genre1 city_2
#> 15 15 sport3 genre2 city_1
That line of code puts in a row number so that the original rows can be reconstructed with pivot_wider(). It would have made more sense to put the row number in earlier, like this: