Convert Dummy Variables to Factors

carolineL · December 31, 2019, 10:47pm

I have cases where I need to convert dummy variables to factors. My first question is:
A) Does anyone know of a function that does this? I have not been able to find one.
Assuming no such function exists, I have written one. For demonstration I am using the German Credit data from the UCI Machine Learning Repository, as included in the caret package; it has many dummy variables, so it works nicely. The function works well for small numbers of observations, but it is quite slow on larger data sets with many dummy variables: the German Credit data (1000 obs, with 52 dummy variables comprising 11 factor variables) takes almost 3 minutes. Does anyone have suggestions for improving the speed? The bottleneck is the last step, where the nested variables are converted. I am planning on putting this in a package, which is why I have not used library() or pipes.

data(GermanCredit, package = "caret")
tibble::glimpse(GermanCredit)
#> Rows: 1,000
#> Columns: 62
#> $ Duration                               <int> 6, 48, 12, 42, 24, 36, 24, 36,…
#> $ Amount                                 <int> 1169, 5951, 2096, 7882, 4870, …
#> $ InstallmentRatePercentage              <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, …
#> $ ResidenceDuration                      <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, …
#> $ Age                                    <int> 67, 22, 49, 45, 53, 35, 53, 35…
#> $ NumberExistingCredits                  <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, …
#> $ NumberPeopleMaintenance                <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, …
#> $ Telephone                              <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, …
#> $ ForeignWorker                          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ Class                                  <fct> Good, Bad, Good, Good, Bad, Go…
#> $ CheckingAccountStatus.lt.0             <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
#> $ CheckingAccountStatus.0.to.200         <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, …
#> $ CheckingAccountStatus.gt.200           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ CheckingAccountStatus.none             <dbl> 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, …
#> $ CreditHistory.NoCredit.AllPaid         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ CreditHistory.ThisBank.AllPaid         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ CreditHistory.PaidDuly                 <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, …
#> $ CreditHistory.Delay                    <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
#> $ CreditHistory.Critical                 <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, …
#> $ Purpose.NewCar                         <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, …
#> $ Purpose.UsedCar                        <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
#> $ Purpose.Furniture.Equipment            <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, …
#> $ Purpose.Radio.Television               <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, …
#> $ Purpose.DomesticAppliance              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Purpose.Repairs                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Purpose.Education                      <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, …
#> $ Purpose.Vacation                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Purpose.Retraining                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Purpose.Business                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Purpose.Other                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ SavingsAccountBonds.lt.100             <dbl> 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, …
#> $ SavingsAccountBonds.100.to.500         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ SavingsAccountBonds.500.to.1000        <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
#> $ SavingsAccountBonds.gt.1000            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
#> $ SavingsAccountBonds.Unknown            <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
#> $ EmploymentDuration.lt.1                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ EmploymentDuration.1.to.4              <dbl> 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, …
#> $ EmploymentDuration.4.to.7              <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, …
#> $ EmploymentDuration.gt.7                <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
#> $ EmploymentDuration.Unemployed          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
#> $ Personal.Male.Divorced.Seperated       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
#> $ Personal.Female.NotSingle              <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Personal.Male.Single                   <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, …
#> $ Personal.Male.Married.Widowed          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
#> $ Personal.Female.Single                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ OtherDebtorsGuarantors.None            <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, …
#> $ OtherDebtorsGuarantors.CoApplicant     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ OtherDebtorsGuarantors.Guarantor       <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
#> $ Property.RealEstate                    <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, …
#> $ Property.Insurance                     <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, …
#> $ Property.CarOther                      <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …
#> $ Property.Unknown                       <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
#> $ OtherInstallmentPlans.Bank             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ OtherInstallmentPlans.Stores           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ OtherInstallmentPlans.None             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ Housing.Rent                           <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
#> $ Housing.Own                            <dbl> 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, …
#> $ Housing.ForFree                        <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, …
#> $ Job.UnemployedUnskilled                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Job.UnskilledResident                  <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, …
#> $ Job.SkilledEmployee                    <dbl> 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, …
#> $ Job.Management.SelfEmp.HighlyQualified <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …
dummy_to_factor <- function(data, 
                              variables = everything(), 
                              sep = '.') {
  variables <- rlang::enquo(variables)
  # get the variables names for included variables  
  data_names <- names(dplyr::select(data, !!variables))
  
  # create a names list that can be used in nest with the group and
  # the variables that are in that group
  groups <- 
    dplyr::tibble(var_names = 
             data_names[dplyr::contains(sep, vars = data_names)])
  if(!all(dplyr::select(data, groups$var_names) == 0 |
        dplyr::select(data, groups$var_names) == 1)) {
        stop('All dummy values must be 0 or 1')
    }
  groups <- 
    dplyr::mutate(groups, 
                  group = stringr::str_remove(var_names, 
                              paste0("[", sep, "].*$"))
    )
  groups <- 
    dplyr::group_by(groups, group)
  groups <- 
    tidyr::nest(groups, grouped_cols = var_names)
  groups <- 
    dplyr::mutate(groups, grouped_cols = purrr::map(grouped_cols, c))
  groups <- 
    tidyr::unnest(groups, cols = grouped_cols)
  groups <- 
    tibble::deframe(groups)
   
  # function for determining which column has a 1 and retrieving that column 
  # name (and drop the group name)
   convert <- function(x){
     if(sum(x) > 1) return('multiple')
     if(sum(x) <= 0) return(NA_character_)
     x <- dplyr::rename_all(x, stringr::str_remove, 
                            paste0('^[^', sep, ']*[', sep, ']'))
     x <- tidyr::pivot_longer(x, cols = everything(), 
                       names_to = 'V1', 
                       values_to = 'V2')
     x <- dplyr::filter(x, V2 == 1) 
     return(x$V1)
   }
  
   # nest the dummy groups and convert them to factors
   data <- dplyr::group_by(data, id = dplyr::row_number())
   data <- 
    tidyr::nest(data, !!!groups) 
   data <- dplyr::mutate_at(data, names(groups), purrr::map_chr, convert)
   data <- dplyr::ungroup(data)
   # return the result visibly rather than ending on an assignment
   dplyr::select(data, -id)
}
new_dat <- dummy_to_factor(GermanCredit[1:10, ])
tibble::glimpse(new_dat)
#> Rows: 10
#> Columns: 21
#> $ Duration                  <int> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30
#> $ Amount                    <int> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6…
#> $ InstallmentRatePercentage <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4
#> $ ResidenceDuration         <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2
#> $ Age                       <int> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28
#> $ NumberExistingCredits     <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2
#> $ NumberPeopleMaintenance   <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1
#> $ Telephone                 <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 1, 1
#> $ ForeignWorker             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
#> $ Class                     <fct> Good, Bad, Good, Good, Bad, Good, Good, Goo…
#> $ CheckingAccountStatus     <chr> "lt.0", "0.to.200", "none", "lt.0", "lt.0",…
#> $ CreditHistory             <chr> "Critical", "PaidDuly", "Critical", "PaidDu…
#> $ Purpose                   <chr> "Radio.Television", "Radio.Television", "Ed…
#> $ SavingsAccountBonds       <chr> "Unknown", "lt.100", "lt.100", "lt.100", "l…
#> $ EmploymentDuration        <chr> "gt.7", "1.to.4", "4.to.7", "4.to.7", "1.to…
#> $ Personal                  <chr> "Male.Single", "Female.NotSingle", "Male.Si…
#> $ OtherDebtorsGuarantors    <chr> "None", "None", "None", "Guarantor", "None"…
#> $ Property                  <chr> "RealEstate", "RealEstate", "RealEstate", "…
#> $ OtherInstallmentPlans     <chr> "None", "None", "None", "None", "None", "No…
#> $ Housing                   <chr> "Own", "Own", "Own", "ForFree", "ForFree", …
#> $ Job                       <chr> "SkilledEmployee", "SkilledEmployee", "Unsk…
system.time(dummy_to_factor(GermanCredit[1:10, ]))
#>    user  system elapsed 
#>   1.673   0.006   1.679
system.time(dummy_to_factor(GermanCredit[1:50, ]))
#>    user  system elapsed 
#>   8.148   0.033   8.201
system.time(dummy_to_factor(GermanCredit))
#>    user  system elapsed 
#> 159.749   0.715 160.927

Created on 2019-12-31 by the reprex package (v0.3.0)

technocrat · January 1, 2020, 4:18am

I'd be inclined to simplify.

data(GermanCredit, package = "caret")
# variables can be easily obtained
variables <- colnames(GermanCredit)
variables
#>  [1] "Duration"                              
#>  [2] "Amount"                                
#>  [3] "InstallmentRatePercentage"             
#>  [4] "ResidenceDuration"                     
#>  [5] "Age"                                   
#>  [6] "NumberExistingCredits"                 
#>  [7] "NumberPeopleMaintenance"               
#>  [8] "Telephone"                             
#>  [9] "ForeignWorker"                         
#> [10] "Class"                                 
#> [11] "CheckingAccountStatus.lt.0"            
#> [12] "CheckingAccountStatus.0.to.200"        
#> [13] "CheckingAccountStatus.gt.200"          
#> [14] "CheckingAccountStatus.none"            
#> [15] "CreditHistory.NoCredit.AllPaid"        
#> [16] "CreditHistory.ThisBank.AllPaid"        
#> [17] "CreditHistory.PaidDuly"                
#> [18] "CreditHistory.Delay"                   
#> [19] "CreditHistory.Critical"                
#> [20] "Purpose.NewCar"                        
#> [21] "Purpose.UsedCar"                       
#> [22] "Purpose.Furniture.Equipment"           
#> [23] "Purpose.Radio.Television"              
#> [24] "Purpose.DomesticAppliance"             
#> [25] "Purpose.Repairs"                       
#> [26] "Purpose.Education"                     
#> [27] "Purpose.Vacation"                      
#> [28] "Purpose.Retraining"                    
#> [29] "Purpose.Business"                      
#> [30] "Purpose.Other"                         
#> [31] "SavingsAccountBonds.lt.100"            
#> [32] "SavingsAccountBonds.100.to.500"        
#> [33] "SavingsAccountBonds.500.to.1000"       
#> [34] "SavingsAccountBonds.gt.1000"           
#> [35] "SavingsAccountBonds.Unknown"           
#> [36] "EmploymentDuration.lt.1"               
#> [37] "EmploymentDuration.1.to.4"             
#> [38] "EmploymentDuration.4.to.7"             
#> [39] "EmploymentDuration.gt.7"               
#> [40] "EmploymentDuration.Unemployed"         
#> [41] "Personal.Male.Divorced.Seperated"      
#> [42] "Personal.Female.NotSingle"             
#> [43] "Personal.Male.Single"                  
#> [44] "Personal.Male.Married.Widowed"         
#> [45] "Personal.Female.Single"                
#> [46] "OtherDebtorsGuarantors.None"           
#> [47] "OtherDebtorsGuarantors.CoApplicant"    
#> [48] "OtherDebtorsGuarantors.Guarantor"      
#> [49] "Property.RealEstate"                   
#> [50] "Property.Insurance"                    
#> [51] "Property.CarOther"                     
#> [52] "Property.Unknown"                      
#> [53] "OtherInstallmentPlans.Bank"            
#> [54] "OtherInstallmentPlans.Stores"          
#> [55] "OtherInstallmentPlans.None"            
#> [56] "Housing.Rent"                          
#> [57] "Housing.Own"                           
#> [58] "Housing.ForFree"                       
#> [59] "Job.UnemployedUnskilled"               
#> [60] "Job.UnskilledResident"                 
#> [61] "Job.SkilledEmployee"                   
#> [62] "Job.Management.SelfEmp.HighlyQualified"
# the test for a dummy column is also simple
max(GermanCredit$Housing.Own) == 1 && min(GermanCredit$Housing.Own) == 0
#> [1] TRUE

Created on 2019-12-31 by the reprex package (v0.3.0)

So, wrap that test in a function and apply it to each column in variables to identify the ones that need to be piped to mutate(a_variable = as.factor(a_variable)).
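A minimal sketch of that idea in base R, using a small made-up data frame rather than GermanCredit. The names `is_dummy` and `toy` are chosen for illustration only. Note that a genuine 0/1 variable such as Telephone passes the same test, so the check alone cannot tell a standalone binary column from a member of a dummy group.

```r
# Hypothetical helper: TRUE when a column is numeric and holds only 0s and 1s.
is_dummy <- function(x) {
  is.numeric(x) && all(x %in% c(0, 1))
}

# Toy data standing in for GermanCredit (values are invented).
toy <- data.frame(
  Age          = c(67, 22, 49),
  Telephone    = c(0, 1, 1),
  Housing.Rent = c(0, 0, 1),
  Housing.Own  = c(1, 1, 0)
)

# Apply the test to every column; vapply() guarantees a logical vector.
dummy_flags <- vapply(toy, is_dummy, logical(1))
dummy_cols  <- names(toy)[dummy_flags]
dummy_cols
```

Unlike the max()/min() comparison, `%in%` also rejects columns containing values strictly between 0 and 1, or missing values.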

carolineL · January 2, 2020, 6:31pm

I'm not sure I understand. Identifying the dummy variables is not the challenge; identifying their groups so that they can be combined into one factor is.
mutate_at(vars_that_are_dummies, as.factor) would give you a bunch of 0/1 factor variables, which is not what I am trying to accomplish. The output should look like new_dat in the original post. Can you post an example of your solution working?

technocrat · January 2, 2020, 7:49pm

I should have been clear that I was suggesting simpler, potentially more efficient, ways of dealing with column names and identifying the columns that were dummies.

For the second part,

data(GermanCredit, package = "caret")
library(stringr)
(split_vars <- str_split(colnames(GermanCredit),"[.]"))
#> [[1]]
#> [1] "Duration"
#> 
#> [[2]]
#> [1] "Amount"
#> 
#> [[3]]
#> [1] "InstallmentRatePercentage"
#> 
#> [[4]]
#> [1] "ResidenceDuration"
#> 
#> [[5]]
#> [1] "Age"
#> 
#> [[6]]
#> [1] "NumberExistingCredits"
#> 
#> [[7]]
#> [1] "NumberPeopleMaintenance"
#> 
#> [[8]]
#> [1] "Telephone"
#> 
#> [[9]]
#> [1] "ForeignWorker"
#> 
#> [[10]]
#> [1] "Class"
#> 
#> [[11]]
#> [1] "CheckingAccountStatus" "lt"                    "0"                    
#> 
#> [[12]]
#> [1] "CheckingAccountStatus" "0"                     "to"                   
#> [4] "200"                  
#> 
#> [[13]]
#> [1] "CheckingAccountStatus" "gt"                    "200"                  
#> 
#> [[14]]
#> [1] "CheckingAccountStatus" "none"                 
#> 
#> [[15]]
#> [1] "CreditHistory" "NoCredit"      "AllPaid"      
#> 
#> [[16]]
#> [1] "CreditHistory" "ThisBank"      "AllPaid"      
#> 
#> [[17]]
#> [1] "CreditHistory" "PaidDuly"     
#> 
#> [[18]]
#> [1] "CreditHistory" "Delay"        
#> 
#> [[19]]
#> [1] "CreditHistory" "Critical"     
#> 
#> [[20]]
#> [1] "Purpose" "NewCar" 
#> 
#> [[21]]
#> [1] "Purpose" "UsedCar"
#> 
#> [[22]]
#> [1] "Purpose"   "Furniture" "Equipment"
#> 
#> [[23]]
#> [1] "Purpose"    "Radio"      "Television"
#> 
#> [[24]]
#> [1] "Purpose"           "DomesticAppliance"
#> 
#> [[25]]
#> [1] "Purpose" "Repairs"
#> 
#> [[26]]
#> [1] "Purpose"   "Education"
#> 
#> [[27]]
#> [1] "Purpose"  "Vacation"
#> 
#> [[28]]
#> [1] "Purpose"    "Retraining"
#> 
#> [[29]]
#> [1] "Purpose"  "Business"
#> 
#> [[30]]
#> [1] "Purpose" "Other"  
#> 
#> [[31]]
#> [1] "SavingsAccountBonds" "lt"                  "100"                
#> 
#> [[32]]
#> [1] "SavingsAccountBonds" "100"                 "to"                 
#> [4] "500"                
#> 
#> [[33]]
#> [1] "SavingsAccountBonds" "500"                 "to"                 
#> [4] "1000"               
#> 
#> [[34]]
#> [1] "SavingsAccountBonds" "gt"                  "1000"               
#> 
#> [[35]]
#> [1] "SavingsAccountBonds" "Unknown"            
#> 
#> [[36]]
#> [1] "EmploymentDuration" "lt"                 "1"                 
#> 
#> [[37]]
#> [1] "EmploymentDuration" "1"                  "to"                
#> [4] "4"                 
#> 
#> [[38]]
#> [1] "EmploymentDuration" "4"                  "to"                
#> [4] "7"                 
#> 
#> [[39]]
#> [1] "EmploymentDuration" "gt"                 "7"                 
#> 
#> [[40]]
#> [1] "EmploymentDuration" "Unemployed"        
#> 
#> [[41]]
#> [1] "Personal"  "Male"      "Divorced"  "Seperated"
#> 
#> [[42]]
#> [1] "Personal"  "Female"    "NotSingle"
#> 
#> [[43]]
#> [1] "Personal" "Male"     "Single"  
#> 
#> [[44]]
#> [1] "Personal" "Male"     "Married"  "Widowed" 
#> 
#> [[45]]
#> [1] "Personal" "Female"   "Single"  
#> 
#> [[46]]
#> [1] "OtherDebtorsGuarantors" "None"                  
#> 
#> [[47]]
#> [1] "OtherDebtorsGuarantors" "CoApplicant"           
#> 
#> [[48]]
#> [1] "OtherDebtorsGuarantors" "Guarantor"             
#> 
#> [[49]]
#> [1] "Property"   "RealEstate"
#> 
#> [[50]]
#> [1] "Property"  "Insurance"
#> 
#> [[51]]
#> [1] "Property" "CarOther"
#> 
#> [[52]]
#> [1] "Property" "Unknown" 
#> 
#> [[53]]
#> [1] "OtherInstallmentPlans" "Bank"                 
#> 
#> [[54]]
#> [1] "OtherInstallmentPlans" "Stores"               
#> 
#> [[55]]
#> [1] "OtherInstallmentPlans" "None"                 
#> 
#> [[56]]
#> [1] "Housing" "Rent"   
#> 
#> [[57]]
#> [1] "Housing" "Own"    
#> 
#> [[58]]
#> [1] "Housing" "ForFree"
#> 
#> [[59]]
#> [1] "Job"                 "UnemployedUnskilled"
#> 
#> [[60]]
#> [1] "Job"               "UnskilledResident"
#> 
#> [[61]]
#> [1] "Job"             "SkilledEmployee"
#> 
#> [[62]]
#> [1] "Job"             "Management"      "SelfEmp"         "HighlyQualified"

Created on 2020-01-02 by the reprex package (v0.3.0)

That gets you a list of character vectors holding the name components. I'd have to think about classifying them based on the length() of each vector and then rejoining to make a hash. I'll see what I come up with. The motivation behind my suggestions is to speed things up by vectorizing as much as possible.
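One way the classification could go, sketched in base R on a hand-picked subset of the column names: split each name only at its first "." so that the prefix becomes the group and the remainder (which may itself contain dots) becomes the level. The object names `group_of`, `level_of` and `groups` are illustrative, not something from the posts above.

```r
# Column names in the style of GermanCredit (a hand-picked subset).
cols <- c("Age", "Class", "Housing.Rent", "Housing.Own",
          "Purpose.Radio.Television", "Purpose.NewCar")

# Only names containing the separator are candidate dummy columns.
dummy_cols <- cols[grepl(".", cols, fixed = TRUE)]

# Everything before the first "." is the group; everything after is the level.
group_of <- sub("[.].*$", "", dummy_cols)
level_of <- sub("^[^.]*[.]", "", dummy_cols)

# Named list: one element per group, holding that group's column names.
groups <- split(dummy_cols, group_of)
groups
```

Splitting at the first separator only means there is no need to reason about how many components each name has.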

cderv · January 2, 2020, 8:59pm

@carolineL I did not follow your function closely, but I tried to reproduce its logic using more classic data manipulation.

Here is how I would have done it, and it is pretty quick. I'll let you try it and maybe work it into the function you are after.

data(GermanCredit, package = "caret")

# convert to tibble for printing
tab <- tibble::as_tibble(GermanCredit)

# transform to long format the dummy columns
tab_long <- tidyr::pivot_longer(tab, 
                                cols = tidyselect::contains("."),
                                names_to = c("groups", "levels"),
                                names_pattern = "^([^.]*)[.](.*)")
# get the group names for column selection later
groups <- unique(tab_long$groups)
# keep only the rows where the dummy is 1 and drop the temporary value column
tab_filter <- dplyr::select(
  dplyr::filter(tab_long, value == 1),
  -value)
# transform to wide format
tab_wide <- tidyr::pivot_wider(
  tab_filter,
  names_from = groups, 
  values_from = levels)
# convert the group columns to factors
new_tab <- dplyr::mutate_at(
  tab_wide,
  groups,
  ~ forcats::as_factor(.)
)
dplyr::glimpse(new_tab)
#> Observations: 1,000
#> Variables: 21
#> $ Duration                  <int> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, …
#> $ Amount                    <int> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6…
#> $ InstallmentRatePercentage <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2…
#> $ ResidenceDuration         <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4…
#> $ Age                       <int> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25,…
#> $ NumberExistingCredits     <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1…
#> $ NumberPeopleMaintenance   <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ Telephone                 <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1…
#> $ ForeignWorker             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ Class                     <fct> Good, Bad, Good, Good, Bad, Good, Good, Goo…
#> $ CheckingAccountStatus     <fct> lt.0, 0.to.200, none, lt.0, lt.0, none, non…
#> $ CreditHistory             <fct> Critical, PaidDuly, Critical, PaidDuly, Del…
#> $ Purpose                   <fct> Radio.Television, Radio.Television, Educati…
#> $ SavingsAccountBonds       <fct> Unknown, lt.100, lt.100, lt.100, lt.100, Un…
#> $ EmploymentDuration        <fct> gt.7, 1.to.4, 4.to.7, 4.to.7, 1.to.4, 1.to.…
#> $ Personal                  <fct> Male.Single, Female.NotSingle, Male.Single,…
#> $ OtherDebtorsGuarantors    <fct> None, None, None, Guarantor, None, None, No…
#> $ Property                  <fct> RealEstate, RealEstate, RealEstate, Insuran…
#> $ OtherInstallmentPlans     <fct> None, None, None, None, None, None, None, N…
#> $ Housing                   <fct> Own, Own, Own, ForFree, ForFree, ForFree, O…
#> $ Job                       <fct> SkilledEmployee, SkilledEmployee, Unskilled…

Created on 2020-01-02 by the reprex package (v0.3.0.9001)

I'm not sure I fully understood your issue with converting to factors, but this seems to give the same result, and it does not take 3 minutes.

Hope it helps.
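For comparison, the same collapse can also be sketched without pivoting, using one matrix operation per group. This is only a rough base-R sketch under some assumptions: `dummy_to_factor_base` is a made-up name, the toy data is invented, and the "multiple" / NA handling simply mirrors the conventions of the function in the original post. As far as I can tell, pivot_wider() identifies rows by the remaining columns, so rows that are identical on every non-dummy column may need an explicit row id first; working on the rows in place avoids that question.

```r
# Hypothetical sketch: collapse each dummy group with matrix operations.
dummy_to_factor_base <- function(data, sep = ".") {
  nms <- names(data)
  # position of the first separator in each name, -1 when absent
  pos      <- as.integer(regexpr(sep, nms, fixed = TRUE))
  is_dummy <- pos > 0
  d_names  <- nms[is_dummy]
  group_of <- substr(d_names, 1, pos[is_dummy] - 1)
  level_of <- substr(d_names, pos[is_dummy] + nchar(sep), nchar(d_names))

  out <- data[, !is_dummy, drop = FALSE]
  for (g in unique(group_of)) {
    in_group <- group_of == g
    m <- as.matrix(data[, d_names[in_group], drop = FALSE])
    if (!all(m %in% c(0, 1))) stop("All dummy values must be 0 or 1")
    n_ones <- rowSums(m)
    # max.col() returns, for each row, the index of the column holding the 1
    picked <- level_of[in_group][max.col(m, ties.method = "first")]
    picked[n_ones == 0] <- NA_character_  # no level set in this row
    picked[n_ones > 1]  <- "multiple"     # more than one level set
    out[[g]] <- factor(picked)
  }
  out
}

# Toy data (invented): row 4 has no Housing level and two Job levels.
toy <- data.frame(
  Age                  = c(67, 22, 49, 45),
  Housing.Rent         = c(0, 0, 1, 0),
  Housing.Own          = c(1, 1, 0, 0),
  Job.Skilled.Employee = c(1, 0, 1, 1),
  Job.Unskilled        = c(0, 1, 0, 1)
)
res <- dummy_to_factor_base(toy)
res
```

The loop runs once per group rather than once per row, so the cost should grow with the number of groups (11 for GermanCredit) instead of the number of observations. I have not timed it on the full data.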

carolineL · January 7, 2020, 5:22pm

cderv:

data(GermanCredit, package = "caret") 
# convert to tibble for printing 
tab <- tibble::as_tibble(GermanCredit) 
# transform to long format the dummy columns 
tab_long <- tidyr::pivot_longer(tab, 
                                cols = tidyselect::contains("."), 
                                names_to = c("groups", "levels"), 
                                names_pattern = "^([^.]*)[.](.*)") 
# get the groups name for column selection after 
groups <- unique(tab_long$groups) 
# keep only the rows where the dummy is 1 and drop the temporary value column
tab_filter <- dplyr::select(
                                dplyr::filter(tab_long, value == 1),
                                -value)
# transform to wide format
tab_wide <- tidyr::pivot_wider(
                              tab_filter, 
                              names_from = groups, 
                              values_from = levels) 
# convert to factors the groups column 
new_tab <- dplyr::mutate_at(tab_wide, groups, 
                             ~ forcats::as_factor(.) ) 
dplyr::glimpse(new_tab)

This is much faster. It's similar logic to what I had, but I had put the pivot longer and pivot wider into a purrr::map. I think I can work this in nicely, thanks!
