recode variables based on a table in excel

juandmaz · September 5, 2022, 9:15pm

I have a database with many variables that I need to recode their values, I have an excel table with the recoded values. How can I do to recode quickly and automatically without having to do it manually?
I give a small and easy example. I have this base:

sport  music       city        
   <chr>  <chr>       <chr>       
 1 soccer tango       San Pablo   
 2 voley  pop         Buenos Aires
 3 rugby  heavy metal Buenos Aires
 4 basket bossa nova  Buenos Aires
 5 tenis  rock        San Pablo   
 6 tenis  pop         Buenos Aires
 7 voley  bossa nova  London      
 8 rugby  heavy metal London      
 9 basket tango       San Pablo   
10 tenis  bossa nova  San Pablo   
11 basket heavy metal Paris       
12 basket tango       Paris       
13 basket tango       Buenos Aires
14 voley  tango       Buenos Aires
15 tenis  heavy metal London

And here are the excel charts with how to recode.

sport	sport_bis
rugby	sport1
soccer	sport2
basquet	sport1
voley	sport2
tenis	sport3


music	music_bis
rock	genre1
tango	genre1
pop	genre2
heavy metal	genre2
bossa nova	genre2


city	city_bis
london	city_1
barcelona	city_2
paris	city_1
san pablo	city_1
buenos aires	city_2

This example can be done manually without problem, but I'm posing the question for a 20-variable scenario where each variable has 20 values to recode.

FJCC · September 5, 2022, 9:38pm

You can do this using the left_join() function from dplyr. If your original data frame is named DF and the Excel table has been imported as the object NAMES, you can do

library(dplyr)
DF <- left_join(DF, NAMES, by = c(sport = "sport")

DF will now have a column named sport_bis with the appropriate value from NAMES in each row. This assumes that NAMES has columns named sport and sport_bis. If there is no matching value of sport in NAMES, the sport_bis column will have NA. I particularly mention this because your original data has the value basket in some rows and the Excel data has the value basquet.

juandmaz · September 6, 2022, 1:47am

Thanks, but if I have 20 variables to recode, I have to make 20 left_join? Isn't there an easier way?

FJCC · September 6, 2022, 4:57am

You can reshape the data and use a single list for the renaming process. I adjusted the capitalization of the cities so that they would match.

library(dplyr)
library(tidyr)
DF <- read.csv("~/R/Play/Dummy.csv")
DF
#>     sport       music         city
#> 1  soccer       tango    San Pablo
#> 2   voley         pop Buenos Aires
#> 3   rugby heavy metal Buenos Aires
#> 4  basket  bossa nova Buenos Aires
#> 5   tenis        rock    San Pablo
#> 6   tenis         pop Buenos Aires
#> 7   voley  bossa nova       London
#> 8   rugby heavy metal       London
#> 9  basket       tango    San Pablo
#> 10  tenis  bossa nova    San Pablo
#> 11 basket heavy metal        Paris
#> 12 basket       tango        Paris
#> 13 basket       tango Buenos Aires
#> 14  voley       tango Buenos Aires
#> 15  tenis heavy metal       London
DF2 <- read.csv("~/R/Play/Dummy2.csv")
DF2
#>           Name1  Name2
#> 1         rugby sport1
#> 2        soccer sport2
#> 3       basquet sport1
#> 4         voley sport2
#> 5         tenis sport3
#> 6          rock genre1
#> 7         tango genre1
#> 8           pop genre2
#> 9   heavy metal genre2
#> 10   bossa nova genre2
#> 11       London city_1
#> 12    Barcelona city_2
#> 13        Paris city_1
#> 14    San Pablo city_1
#> 15 Buenos Aires city_2
DFlong <- DF |> pivot_longer(cols = everything())
DFlong
#> # A tibble: 45 × 2
#>    name  value       
#>    <chr> <chr>       
#>  1 sport soccer      
#>  2 music tango       
#>  3 city  San Pablo   
#>  4 sport voley       
#>  5 music pop         
#>  6 city  Buenos Aires
#>  7 sport rugby       
#>  8 music heavy metal 
#>  9 city  Buenos Aires
#> 10 sport basket      
#> # … with 35 more rows
DFlong <- left_join(DFlong,DF2, by = c(value="Name1"))
DFlong
#> # A tibble: 45 × 3
#>    name  value        Name2 
#>    <chr> <chr>        <chr> 
#>  1 sport soccer       sport2
#>  2 music tango        genre1
#>  3 city  San Pablo    city_1
#>  4 sport voley        sport2
#>  5 music pop          genre2
#>  6 city  Buenos Aires city_2
#>  7 sport rugby        sport1
#>  8 music heavy metal  genre2
#>  9 city  Buenos Aires city_2
#> 10 sport basket       <NA>  
#> # … with 35 more rows
DFWide <- DFlong |> select(-value) |> 
  mutate(ROW=rep(1:15,each=3)) |> 
  pivot_wider(names_from = "name",values_from = "Name2")
DFWide
#> # A tibble: 15 × 4
#>      ROW sport  music  city  
#>    <int> <chr>  <chr>  <chr> 
#>  1     1 sport2 genre1 city_1
#>  2     2 sport2 genre2 city_2
#>  3     3 sport1 genre2 city_2
#>  4     4 <NA>   genre2 city_2
#>  5     5 sport3 genre1 city_1
#>  6     6 sport3 genre2 city_2
#>  7     7 sport2 genre2 city_1
#>  8     8 sport1 genre2 city_1
#>  9     9 <NA>   genre1 city_1
#> 10    10 sport3 genre2 city_1
#> 11    11 <NA>   genre2 city_1
#> 12    12 <NA>   genre1 city_1
#> 13    13 <NA>   genre1 city_2
#> 14    14 sport2 genre1 city_2
#> 15    15 sport3 genre2 city_1

^{Created on 2022-09-05 with reprex v2.0.2}

juandmaz · September 6, 2022, 6:52pm

Thanks a lot, I don't unterstand this line of code:
mutate(ROW=rep(1:15,each=3))

What does it does?

FJCC · September 6, 2022, 7:17pm

That line of code puts in a row number so that the original rows can be reconstructed with pivot_wider(). It would have made more sense to put the row number in earlier, like this:

library(dplyr)
library(tidyr)
DF <- read.csv("~/R/Play/Dummy.csv")
DF2 <- read.csv("~/R/Play/Dummy2.csv")
DFlong <- DF |> mutate(ROW=row_number()) |> 
  pivot_longer(cols = -ROW)
DFlong <- left_join(DFlong,DF2, by = c(value="Name1"))
DFWide <- DFlong |> select(-value) |> 
  pivot_wider(names_from = "name",values_from = "Name2")

system · September 13, 2022, 7:18pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.