Converting Characters to Numeric

jrmorse · March 6, 2020, 6:37pm

Hi!
I am trying to scrape a table from the web, and ultimately convert a column from a character string to numeric.

My complete set up:

library(readxl)
library(janitor)
library(tidyverse)
library(gt)
library(rvest)
library(reprex)

Scraping the table didn't provide any errors, but here's the code I used:

table_costs_messy <- list()

table_costs_messy <- read_html("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2426671/")%>% 
  html_node("table") %>% 
  html_table(header = TRUE) %>% 
  clean_names() %>% 
  slice(-1) %>% 
  slice(1:17)

Where the problem starts:
I've scraped a table and cleaned it, and have tried to remove special characters from two columns so that I will be able to convert those values to integers later. This part of the code does not throw error messages, but I think this may be where my problem starts.

table_costs_messy[table_costs_messy == "not applicable"] <-NA
colnames(table_costs_messy)[2:3] <- c("xF", "xM")
gsub(x = table_costs_messy, pattern = "\\$|\\*", "")

Now that I've removed the unwanted characters, I'd hope to be able to make a
new table.This is where I get an error that I have not properly removed special characters.

table_costs_clean <- table_costs_messy %>%
  pivot_longer(cols = starts_with("x"),
               names_to = "Sex",
               names_prefix = "x",
               values_to = "Cost",
               values_ptypes = list(Cost = integer()),
               values_drop_na = FALSE
               )

table_costs_clean
#> Error: Lossy cast from <character> to <integer>.
* Locations: 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17

Thanks very much for any help or advice you might have!

^{Created on 2020-03-06 by the reprex package (v0.3.0)}

mfherman · March 6, 2020, 6:50pm

Looks like you're pretty close! My approach would be to explicitly do the character to numeric conversion after the pivot_longer() step. My favorite function for this is readr::parse_number() which does a lot of the parsing work for you!

The only rows it is unable to parse are the "not applicable" rows, and it returns NA helpfully.

library(tidyverse)
library(rvest)
library(janitor)

table_costs_messy <- read_html("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2426671/")%>% 
  html_node("table") %>% 
  html_table(header = TRUE) %>% 
  clean_names() %>% 
  slice(-1) %>% 
  slice(1:17)

table_costs_messy %>% 
  rename(xF = value_applied, xM = x) %>% 
  pivot_longer(
    cols = starts_with("x"),
    names_to = "Sex",
    names_prefix = "x",
    values_to = "Cost",
    values_drop_na = FALSE
  ) %>% 
  mutate(Cost = parse_number(Cost))
#> Warning: 2 parsing failures.
#> row col expected         actual
#>   2  -- a number not applicable
#>   3  -- a number not applicable
#> # A tibble: 34 x 4
#>    parameter                                       range_applied Sex    Cost
#>    <chr>                                           <chr>         <chr> <dbl>
#>  1 Average cost per case of PID [23–25]            ± 50%         F      1995
#>  2 Average cost per case of PID [23–25]            ± 50%         M        NA
#>  3 Average cost per case of epididymitis [26]      ± 50%         F        NA
#>  4 Average cost per case of epididymitis [26]      ± 50%         M       274
#>  5 Average sequelae costs per case of syphilis [5] ± 50%         F       572
#>  6 Average sequelae costs per case of syphilis [5] ± 50%         M       572
#>  7 Average cost per case of chlamydia [5]          ± 50%         F       315
#>  8 Average cost per case of chlamydia [5]          ± 50%         M        26
#>  9 Average cost per case of gonorrhea [5]          ± 50%         F       343
#> 10 Average cost per case of gonorrhea [5]          ± 50%         M        68
#> # ... with 24 more rows

^{Created on 2020-03-06 by the reprex package (v0.3.0)}

Nate · March 6, 2020, 7:02pm

Not a full answer, but your original gsub() call doesn't look like it was doing what you wanted it to (and wasn't being saved). Since gsub() only works on vectors it coerces the whole data frame to a vector. Instead you could use something like this to perform the gsub() on each column and keep the data frame.

table_costs_messy <- table_costs_messy %>%
  mutate_all(~ gsub(x = ., pattern = "\\$|\\*", ""))

dromano · March 6, 2020, 7:15pm

Thanks for pointing out parse_number()! I should take a better look at the readr pakcage to see what other useful functions it has

jrmorse · March 6, 2020, 8:05pm

Thanks very much to both of you!

system · March 13, 2020, 8:05pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.