[HELP] Changing characters to numbers in dataframes to perform calculations with them

Ayerbe · August 27, 2023, 4:17pm

Hello,
I am having issues with the class and mode of the variables in a dataframe (df_a):

glimpse(df_a)

Rows: 116
Columns: 9

$ `Provincias y Comunidades Autonomas` <chr> "A Coruña", "Lugo", "Ourense", "Pontevedra", "GALICIA", "P. DE ASTURIAS", …
$ `Superficie Secano (ha)` <chr> "1.897", "3.271", "9.413", "326", "14.907", "50", "462", "26.040", "26.040…
$ `Superficie Regadio (ha)` <chr> "–", "–", "–", "–", "–", "–", "–", "–", "–", "13.374", "5.309", "16.828", …
$ `Superficie Total (ha)` <chr> "1.897", "3.271", "9.413", "326", "14.907", "50", "462", "26.040", "26.040…
$ `Produccion de grano (T)` <chr> "6.317", "8.537", "29.086", "1.187", "45.127", "50", "1.132", "154.938", "…
$ `Paja cosechada (T)` <chr> "3.980", "5.849", "16.780", "810", "27.419", "100", "2.060", "113.700", "1…
$ Año <chr> "2010", "2010", "2010", "2010", "2010", "2010", "2010", "2010", "2010", "2…
$ `Rendimiento (kg/ha) secano regadio_1` <chr> "3.330", "2.610", "3.090", "3.640", "3.027", "1.000", "2.450", "5.950", "5…
$ `Rendimiento (kg/ha) secano regadio_2` <chr> "–", "–", "–", "–", "–", "–", "–", "–", "–", "5.359", "5.000", "4.680", "4…

After converting the df to numeric values:

df_a2<- as.data.frame(lapply(df_a, as.numeric))

When I do this I also lose the first column, which holds the names,

and checking that values are converted to numbers:

sapply(df_a2,mode)

Provincias.y.Comunidades.Autonomas Superficie.Secano..ha. Superficie.Regadio..ha.
"numeric" "numeric" "numeric"
Superficie.Total..ha. Produccion.de.grano..T. Paja.cosechada..T.
"numeric" "numeric" "numeric"
Año Rendimiento..kg.ha..secano.regadio_1 Rendimiento..kg.ha..secano.regadio_2
"numeric" "numeric" "numeric"

sapply(df_a2, class)

Provincias.y.Comunidades.Autonomas Superficie.Secano..ha. Superficie.Regadio..ha.
"numeric" "numeric" "numeric"
Superficie.Total..ha. Produccion.de.grano..T. Paja.cosechada..T.
"numeric" "numeric" "numeric"
Año Rendimiento..kg.ha..secano.regadio_1 Rendimiento..kg.ha..secano.regadio_2
"numeric" "numeric" "numeric"

When I try to perform any calculation (in this example, the mean of "Superficie.Secano..ha."), I get NA and the following warning:

df_a2 %>% drop_na() %>% summarize(mean_bl = mean("Superficie.Secano..ha."))

mean_bl
1 NA

Warning message:
There was 1 warning in summarize().
In argument: mean_bl = mean("Superficie.Secano..ha.").
Caused by warning in mean.default():
! argument is not numeric or logical: returning NA

Does anyone know:
1: Why is the program not able to work with the fields once they are converted to numbers? Is there an issue?
2: Is it possible to change the class of all the columns in a dataframe from character to numeric, except one?

Thank you in advance.

HanOostdijk · August 27, 2023, 7:22pm

See the code:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df_a <-  data.frame(
   ProvinciasyComunidadesAutonomas= c("ACoruña","Lugo","Ourense","Pontevedra") ,
   `Superficie Secano (ha)` = c( "1.897", "3.271", "9.413", "326"),
   `SuperficieRegadio(ha)` =  c("–","13.374","5.309","16.0")
  )

print(df_a)
#>   ProvinciasyComunidadesAutonomas Superficie.Secano..ha. SuperficieRegadio.ha.
#> 1                         ACoruña                  1.897                     –
#> 2                            Lugo                  3.271                13.374
#> 3                         Ourense                  9.413                 5.309
#> 4                      Pontevedra                    326                  16.0

# convert to numeric all columns except one
df_b <- df_a |>
  dplyr::mutate(dplyr::across(!ProvinciasyComunidadesAutonomas, as.numeric))
#> Warning: There was 1 warning in `dplyr::mutate()`.
#> ℹ In argument: `dplyr::across(!ProvinciasyComunidadesAutonomas, as.numeric)`.
#> Caused by warning:
#> ! NAs introduced by coercion
print(df_b)
#>   ProvinciasyComunidadesAutonomas Superficie.Secano..ha. SuperficieRegadio.ha.
#> 1                         ACoruña                  1.897                    NA
#> 2                            Lugo                  3.271                13.374
#> 3                         Ourense                  9.413                 5.309
#> 4                      Pontevedra                326.000                16.000

df_b |> tidyr:: drop_na() # this drops one observation
#>   ProvinciasyComunidadesAutonomas Superficie.Secano..ha. SuperficieRegadio.ha.
#> 1                            Lugo                  3.271                13.374
#> 2                         Ourense                  9.413                 5.309
#> 3                      Pontevedra                326.000                16.000

df_b |> tidyr:: drop_na() |> summarize(mean_bl = mean("Superficie.Secano..ha.")) # not okay: mean of character value
#> Warning: There was 1 warning in `summarize()`.
#> ℹ In argument: `mean_bl = mean("Superficie.Secano..ha.")`.
#> Caused by warning in `mean.default()`:
#> ! argument is not numeric or logical: returning NA
#>   mean_bl
#> 1      NA

df_b |> tidyr:: drop_na() |> summarize(mean_bl = mean(Superficie.Secano..ha.)) # okay : mean of numeric column
#>    mean_bl
#> 1 112.8947

df_b |> summarize(mean_bl = mean(Superficie.Secano..ha.)) # observation with missing `SuperficieRegadio(ha)` included
#>    mean_bl
#> 1 85.14525

df_b |> summarize(mean_bl = mean(Superficie.Secano..ha.,na.rm = TRUE)) # in case some values are NA
#>    mean_bl
#> 1 85.14525
Created on 2023-08-27 with reprex v2.0.2
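
A minimal base R sketch of the three effects shown in the reprex above, in case it helps to see them without dplyr. The short column names (`prov`, `sec`, `reg`) and the `col_name` variable are assumptions made for brevity; the values are the ones from `df_b`. Inside dplyr verbs, the `.data[["name"]]` pronoun should give the same lookup-by-string behaviour as `df_b[[col_name]]` does here.

```r
# Same values as df_b above, with shortened (assumed) column names
df_b <- data.frame(
  prov = c("ACoruña", "Lugo", "Ourense", "Pontevedra"),
  sec  = c(1.897, 3.271, 9.413, 326),
  reg  = c(NA, 13.374, 5.309, 16)
)

# A quoted name is only a one-element character vector, so mean() cannot
# average it; this is what triggers "argument is not numeric or logical"
is.character("sec")

# Look the column up by its name instead
col_name <- "sec"
mean(df_b[[col_name]])

# Names coerced with as.numeric() all become NA, which is why the first
# column seemed to be lost after lapply(df_a, as.numeric)
suppressWarnings(as.numeric(df_b$prov))

# Dropping incomplete rows first changes the mean of `sec`: the row with a
# missing `reg` is removed even though its `sec` value is valid
mean(df_b$sec[complete.cases(df_b)])
```

So `drop_na()` on the whole data frame and `na.rm = TRUE` on one column are not interchangeable: the first filters rows on every column, the second only ignores missing values in the column being averaged.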

technocrat · August 28, 2023, 9:13am

# an analyst familiar with the data can use abbreviated
# variable names, which are more convenient and less
# error-prone; descriptive names can be re-introduced in
# presentation tables for readers who need more
# explanatory labels

# data frame with second and third columns to be converted to numeric
df_a <-  data.frame(
  prov = c("ACoruña","Lugo","Ourense","Pontevedra") ,
  sec = c( "1.897", "3.271", "9.413", "326"),
  reg =  c("–","13.374","5.309","16.0")
)

# replace the second and third columns with their
# numeric representations; the purpose of converting
# to a matrix is to avoid the internal representation
# of arrays as lists in data frames; whenever dealing
# with blocks of all-numeric data, a matrix provides
# a much more tractable data format

df_a[,c(2,3)] <- as.numeric(as.matrix(df_a[,c(2,3)]))
#> Warning: NAs introduced by coercion
df_a
#>         prov     sec    reg
#> 1    ACoruña   1.897     NA
#> 2       Lugo   3.271 13.374
#> 3    Ourense   9.413  5.309
#> 4 Pontevedra 326.000 16.000

Created on 2023-08-28 with reprex v2.0.2

The warning message is caused by the dash ("–") in the first row of the reg column. It has no numeric equivalent, so as.numeric() converts it to NA.
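
One more thing that may be worth checking, offered as a sketch rather than a definitive fix: in the original `glimpse()` output the GALICIA row ("14.907") matches the sum of its four provinces only if the period is a thousands separator (1897 + 3271 + 9413 + 326 = 14907). If that assumption holds, a plain `as.numeric()` reads "1.897" as 1.897 instead of 1897. The helper `parse_es_number()` below is hypothetical (it is not from any package), and the column names `prov`, `sec` and `reg` are shortened for illustration. It turns the "–" placeholder into NA explicitly, so no coercion warning is raised, and converts every column except the first.

```r
# Hypothetical helper: parse Spanish-formatted figures.
# Assumes "." is a thousands separator, "," a decimal mark,
# and "–" (en dash, U+2013) a placeholder for a missing value.
parse_es_number <- function(x) {
  x <- trimws(x)
  x[x %in% c("\u2013", "-", "")] <- NA_character_  # placeholders become NA up front
  x <- gsub(".", "", x, fixed = TRUE)              # drop thousands separators
  x <- sub(",", ".", x, fixed = TRUE)              # decimal comma, if any, to a point
  as.numeric(x)
}

# First five rows of two columns from the original post, with short names
df_a <- data.frame(
  prov = c("A Coruña", "Lugo", "Ourense", "Pontevedra", "GALICIA"),
  sec  = c("1.897", "3.271", "9.413", "326", "14.907"),
  reg  = c("\u2013", "\u2013", "\u2013", "\u2013", "\u2013")
)

# Convert every column except the first one
df_a[-1] <- lapply(df_a[-1], parse_es_number)

# Sanity check: the four provinces should add up to the regional total
sum(df_a$sec[1:4]) == df_a$sec[5]
```

If the period turns out to be a decimal mark after all, the `gsub()` line should be removed. Note also that rows such as GALICIA are regional subtotals, so they would need to be excluded before averaging over provinces.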

system · September 4, 2023, 9:14am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
