turn all variables into numeric if compatible

abeavers · July 10, 2023, 6:31pm

I have a dataset with nearly 200 variables. Some are numeric and others are character. Some of the character variables contain letters and/or special characters, while others contain only numbers and could safely be converted to numeric.

I am looking for an efficient way to turn all of my character variables into numeric if they are numeric-compatible, while leaving the class of all other variables unchanged.

FJCC · July 10, 2023, 7:29pm

This isn't elegant but it works. I count the number of NA values in each column, apply as.numeric() to all of the columns, count the number of NA values in the transformed columns, and then apply as.numeric() only to those columns where the number of NA values did not change.

DF <- data.frame(A = c("1.2","2.4"), B = c("1A3", "555"), c = 1:2)
summary(DF)
#>       A                  B                   c       
#>  Length:2           Length:2           Min.   :1.00  
#>  Class :character   Class :character   1st Qu.:1.25  
#>  Mode  :character   Mode  :character   Median :1.50  
#>                                        Mean   :1.50  
#>                                        3rd Qu.:1.75  
#>                                        Max.   :2.00
library(dplyr)
OrigNa <- apply(DF,2, function(x) sum(is.na(x)))

tmp <- mutate(DF, across(.cols = everything(), .fns = as.numeric))
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `across(.cols = everything(), .fns = as.numeric)`.
#> Caused by warning:
#> ! NAs introduced by coercion
NewNa <- apply(tmp,2, function(x) sum(is.na(x)))
DF <- DF |> mutate(across(.cols = which(OrigNa == NewNa), .fns = as.numeric))
summary(DF)
#>        A            B                   c       
#>  Min.   :1.2   Length:2           Min.   :1.00  
#>  1st Qu.:1.5   Class :character   1st Qu.:1.25  
#>  Median :1.8   Mode  :character   Median :1.50  
#>  Mean   :1.8                      Mean   :1.50  
#>  3rd Qu.:2.1                      3rd Qu.:1.75  
#>  Max.   :2.4                      Max.   :2.00

Created on 2023-07-10 with reprex v2.0.2
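
A side note on the approach above: as.numeric() is applied to every column that passes the NA check, so the integer column c ends up as double, and a factor column would be replaced by its integer codes (that coercion produces no NA values, so the check passes). A minimal base R sketch that only touches character columns might look like the following. convert_if_numeric() is a hypothetical helper name, not a function from dplyr or from the post above.

```r
# Sketch only: convert_if_numeric() is a hypothetical helper, not part of any package.
convert_if_numeric <- function(col) {
  # Leave non-character columns (integer, factor, logical, ...) untouched
  if (!is.character(col)) return(col)
  converted <- suppressWarnings(as.numeric(col))
  # Convert only when coercion introduced no new NA values
  if (sum(is.na(converted)) == sum(is.na(col))) converted else col
}

DF <- data.frame(A = c("1.2", "2.4"), B = c("1A3", "555"), c = 1:2,
                 stringsAsFactors = FALSE)
# DF[] keeps the data frame structure while replacing each column
DF[] <- lapply(DF, convert_if_numeric)
sapply(DF, class)  # A is now numeric, B stays character, c stays integer
```

Assigning into DF[] rather than DF keeps the result a data frame instead of a plain list.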

jrauser · July 11, 2023, 1:21am

This is exactly the same idea as @FJCC's, but a bit more readable, IMO. The map_df() call takes advantage of the fact that a data frame is just a list of columns.

library(tidyverse)
DF <- data.frame(A = c("1.2","2.4"), B = c("1A3", "555"), c = 1:2)

safely_make_numeric <- function(col) {
  col_numeric <- suppressWarnings(as.numeric(col))
  # Convert only if coercion introduced no new NA values
  if (sum(is.na(col)) == sum(is.na(col_numeric))) {
    return(col_numeric)
  } else {
    return(col)
  }
}

map_df(DF, ~safely_make_numeric(.))

Beware that e.g. "1 " with a trailing whitespace will be converted to numeric by this function. I'm not sure if that's desirable.
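
To illustrate the whitespace caveat: as.numeric() ignores leading and trailing whitespace, so a value like "1 " does not become NA and the NA-count check passes. If that is not wanted, one option is to refuse the conversion whenever any value is padded. The sketch below assumes that rule is what you want; strictly_make_numeric() is a hypothetical name used only for illustration.

```r
# Sketch only: strictly_make_numeric() is a hypothetical, stricter variant.
strictly_make_numeric <- function(col) {
  if (!is.character(col)) return(col)
  # TRUE if any non-missing value starts or ends with whitespace
  has_padding <- any(grepl("^\\s|\\s$", col[!is.na(col)]))
  col_numeric <- suppressWarnings(as.numeric(col))
  no_new_na <- sum(is.na(col)) == sum(is.na(col_numeric))
  if (!has_padding && no_new_na) col_numeric else col
}

as.numeric("1 ")                      # 1: the trailing space is silently accepted
strictly_make_numeric(c("1 ", "2"))   # stays character because of the padded value
strictly_make_numeric(c("1", "2"))    # converted to numeric
```

Whether padded values should block the conversion or simply be trimmed first (e.g. with trimws()) depends on the data, so treat this as one possible policy rather than the right one.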
