Converting data set with factor and integer variables to numeric

I have a data set (below) that contains several factor and integer variables. I want to calculate descriptive statistics (mean, median, standard deviation, range and variance) across all columns. How do I convert the data set to make these statistics easier to compute? And what code can I use to calculate the mean, median, standard deviation, range and variance?

Observations: 285
Variables: 10
no.recurrence.events <fct> no-recurrence-events, no-recurrence-events, no-recu...
X30.39 40-49, 40-49, 60-69, 40-49, 60-69, 50-59, 60-69, 40...
premeno <fct> premeno, premeno, ge40, premeno, ge40, premeno, ge4...
X30.34 20-24, 20-24, 15-19, 0-4, 15-19, 25-29, 20-24, 50-5...
X0.2 <fct> 0-2, 0-2, 0-2, 0-2, 0-2, 0-2, 0-2, 0-2, 0-2, 0-2, 0...
no no, no, no, no, no, no, no, no, no, no, no, no, no,...
X3 <int> 2, 2, 2, 2, 2, 2, 1, 2, 2, 3, 2, 1, 3, 3, 1, 2, 3, ...
left right, left, right, right, left, left, left, left, ...
left_low <fct> right_up, left_low, left_up, right_low, left_low, l...
no.1 no, no, no, no, no, no, no, no, no, no, no, no, no,...

There are several ways to take a quick look at summary statistics for a whole dataset. Many are collected in a nice blog post, below.

In base R there's `summary()` (note: I'm just attaching the ggplot2 library for the diamonds dataset, since it has several factor variables). I'm also a big fan of skimr, as it separates out variables by type; see its documentation for more details.

``````
library(ggplot2)
diamonds
#> # A tibble: 53,940 x 10
#>    carat cut       color clarity depth table price     x     y     z
#>    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
#>  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
#>  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#>  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#>  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
#>  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#>  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
#>  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
#>  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
#> 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
#> # … with 53,930 more rows
summary(diamonds)
#>      carat               cut        color        clarity
#>  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065
#>  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258
#>  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194
#>  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171
#>  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066
#>  Max.   :5.0100                     I: 5422   VVS1   : 3655
#>                                     J: 2808   (Other): 2531
#>      depth           table           price             x
#>  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000
#>  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710
#>  Median :61.80   Median :57.00   Median : 2401   Median : 5.700
#>  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731
#>  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540
#>  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740
#>
#>        y                z
#>  Min.   : 0.000   Min.   : 0.000
#>  1st Qu.: 4.720   1st Qu.: 2.910
#>  Median : 5.710   Median : 3.530
#>  Mean   : 5.735   Mean   : 3.539
#>  3rd Qu.: 6.540   3rd Qu.: 4.040
#>  Max.   :58.900   Max.   :31.800
#>

skimr::skim(diamonds)
#> Skim summary statistics
#>  n obs: 53940
#>  n variables: 10
#>
#> ── Variable type:factor ────────────────────────────────────────────────────────────────
#>  variable missing complete     n n_unique
#>   clarity       0    53940 53940        8
#>     color       0    53940 53940        7
#>       cut       0    53940 53940        5
#>                                     top_counts ordered
#>   SI1: 13065, VS2: 12258, SI2: 9194, VS1: 8171    TRUE
#>            G: 11292, E: 9797, F: 9542, H: 8304    TRUE
#>  Ide: 21551, Pre: 13791, Ver: 12082, Goo: 4906    TRUE
#>
#> ── Variable type:integer ───────────────────────────────────────────────────────────────
#>  variable missing complete     n   mean      sd  p0 p25  p50     p75  p100
#>     price       0    53940 53940 3932.8 3989.44 326 950 2401 5324.25 18823
#>
#> ── Variable type:numeric ───────────────────────────────────────────────────────────────
#>  variable missing complete     n  mean   sd   p0   p25   p50   p75  p100
#>     carat       0    53940 53940  0.8  0.47  0.2  0.4   0.7   1.04  5.01
#>     depth       0    53940 53940 61.75 1.43 43   61    61.8  62.5  79
#>     table       0    53940 53940 57.46 2.23 43   56    57    59    95
#>         x       0    53940 53940  5.73 1.12  0    4.71  5.7   6.54 10.74
#>         y       0    53940 53940  5.73 1.14  0    4.72  5.71  6.54 58.9
#>         z       0    53940 53940  3.54 0.71  0    2.91  3.53  4.04 31.8
``````

Created on 2019-03-21 by the reprex package (v0.2.1)
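
If you want the specific statistics from the question (including variance, which `summary()` does not print), here is a minimal base R sketch. The data frame `df`, its column names, and the helper `describe()` are all made up for illustration; swap in your own data. It only summarises columns that are already numeric. A factor like `40-49` is a category rather than a number, so `as.integer()` on it just returns level codes, and a frequency table is usually the more honest summary.

```r
# Hypothetical toy data standing in for the real data set
df <- data.frame(
  age_band = factor(c("40-49", "60-69", "40-49", "50-59")),
  grade    = c(2L, 3L, 1L, 2L),
  size_mm  = c(22.5, 17.0, 3.5, 27.0)
)

# Hypothetical helper: descriptive statistics for one numeric vector.
# min and max together give the range.
describe <- function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    var    = var(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE))
}

# Keep only integer/double columns; factor columns are skipped
is_num <- vapply(df, is.numeric, logical(1))

# One column of statistics per numeric variable
stats <- sapply(df[is_num], describe)
stats

# Level codes of a factor: positions in levels(), not measurements
age_codes <- as.integer(df$age_band)

# Counts per level are usually the meaningful summary for a factor
table(df$age_band)
```

If a factor really does hold numbers stored as text (for example levels `"1"`, `"2"`, `"10"`), `as.numeric(as.character(x))` recovers the values, whereas `as.numeric(x)` alone would return the level codes.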
