str_extract not working on data collected using rvest but does work on same strings when copy/pasted.

Hi all,

So I think this problem might be to do with how I'm using rvest rather than str_extract() but I'm not sure as this is the first time I've used either... I think maybe it's some sort of encoding problem but I have no idea.

So I'm pulling temperature values from Wikipedia. The temperature column has both Celsius and Fahrenheit. I'm trying to extract the Celsius component. I've written a regex that works as expected when applied with str_extract() to data that is not pulled from Wikipedia. When I apply str_extract() to the data from Wikipedia, negative values are lost.

Here's a reprex that I hope illustrates the problem:

library(rvest)
#> Loading required package: xml2
library(tidyverse)
#> Warning: package 'forcats' was built under R version 3.6.3

# url of temperatures
url <- "https://en.wikipedia.org/wiki/List_of_cities_by_average_temperature"

# Import and clean data
temps <-
  url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table') %>%
  map_dfr(html_table) %>%
  janitor::clean_names() %>%
  select(-ref, -year) %>%
  pivot_longer(
    -c(country, city),
    values_to = "temp",
    names_to = "month"
  )

# Select some rows so I have a mix of +ve and -ve values in temp
test <-
  temps %>%
  filter(country == "Afghanistan") %>%
  slice(1:4)

### Problem starts here ###
# Regex for pulling out temperature in C
temp_regex <- "^(|-)\\d{1,2}([.]\\d{1,2}|)"


# This doesn't work. Note the str_extract returns NA for rows 1 and 2
test %>%
  mutate(temp_extract = str_extract(temp, temp_regex))
#> # A tibble: 4 x 5
#>   country     city  month temp       temp_extract
#>   <chr>       <chr> <chr> <chr>      <chr>       
#> 1 Afghanistan Kabul jan   -2.3(27.9) <NA>        
#> 2 Afghanistan Kabul feb   -0.7(30.7) <NA>        
#> 3 Afghanistan Kabul mar   6.3(43.3)  6.3         
#> 4 Afghanistan Kabul apr   12.8(55.0) 12.8

# This does work. Compare row 1 and 2 with above.
#dput(test)
structure(list(country = c("Afghanistan", "Afghanistan", "Afghanistan", 
"Afghanistan"), city = c("Kabul", "Kabul", "Kabul", "Kabul"), 
    month = c("jan", "feb", "mar", "apr"), temp = c("-2.3(27.9)", 
    "-0.7(30.7)", "6.3(43.3)", "12.8(55.0)")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -4L)) %>%
  mutate(temp_extract = str_extract(temp, temp_regex)) 
#> # A tibble: 4 x 5
#>   country     city  month temp       temp_extract
#>   <chr>       <chr> <chr> <chr>      <chr>       
#> 1 Afghanistan Kabul jan   -2.3(27.9) -2.3        
#> 2 Afghanistan Kabul feb   -0.7(30.7) -0.7        
#> 3 Afghanistan Kabul mar   6.3(43.3)  6.3         
#> 4 Afghanistan Kabul apr   12.8(55.0) 12.8


# Pull temperature values out from df
pulled_val <- test %>%
  pull(temp)

# This fails in same way as above
str_extract(pulled_val, temp_regex)
#> [1] NA     NA     "6.3"  "12.8"

# Copy/pasted pulled_val output
copy_pasted <- c("-2.3(27.9)", "-0.7(30.7)", "6.3(43.3)", "12.8(55.0)")

# This now works...
str_extract(copy_pasted, temp_regex)
#> [1] "-2.3" "-0.7" "6.3"  "12.8"

For some reason, the character for negation is different in the extracted temperature -- it's longer than a minus, but I can't tell how to process it properly.
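
One way to see which character is actually there is to look at its Unicode code point. The following is only a sketch in base R on made-up sample strings (`scraped` and `typed` are hypothetical names, and the first string is assumed to start with U+2212, as later turns out to be the case in this thread):

```r
# Hypothetical samples: one value as it might arrive from the web page,
# one typed on a keyboard. They print alike but start with different characters.
scraped <- paste0("\u2212", "2.3(27.9)")  # assumed to start with U+2212
typed   <- "-2.3(27.9)"                   # starts with ASCII hyphen-minus

# Code point of the first character of each string
first_scraped <- utf8ToInt(substr(scraped, 1, 1))
first_typed   <- utf8ToInt(substr(typed, 1, 1))

first_scraped  # 8722, i.e. U+2212 MINUS SIGN
first_typed    # 45, i.e. U+002D HYPHEN-MINUS

# The strings look the same on screen but are not equal
scraped == typed  # FALSE
```

Running `utf8ToInt()` on the first character of a real scraped value should show whether it is 45 or something else.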

Thanks for the reprex. Would have been hard to debug without it.

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(stringr)) 
# created by filtering temps from rvest, to get some negatives
njal <- structure(list(country = c("Iceland", "Iceland", "Iceland", "Iceland", 
"Iceland", "Iceland", "Iceland", "Iceland", "Iceland", "Iceland", 
"Iceland", "Iceland"), city = c("Reykjavík", "Reykjavík", "Reykjavík", 
"Reykjavík", "Reykjavík", "Reykjavík", "Reykjavík", "Reykjavík", 
"Reykjavík", "Reykjavík", "Reykjavík", "Reykjavík"), month = c("jan", 
"feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", 
"nov", "dec"), temp = c("-0.5(31.1)", "0.4(32.7)", "0.5(32.9)", 
"2.9(37.2)", "6.3(43.3)", "9.0(48.2)", "10.6(51.1)", "10.3(50.5)", 
"7.4(45.3)", "4.4(39.9)", "1.1(34.0)", "-0.2(31.6)")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -12L))

# simplify regex by discarding ºF, everything from ( to end of line
fahr <-  "[(].*$"
ditch_fahr <- function(x) {str_remove(x,fahr)}
ditch_fahr(njal$temp) %>% as.numeric()
#>  [1] -0.5  0.4  0.5  2.9  6.3  9.0 10.6 10.3  7.4  4.4  1.1 -0.2

Created on 2020-03-22 by the reprex package (v0.3.0)

I thought it was U+2013, an en dash (–), or U+2014, an em dash (—), too, but I couldn't detect either.

I was thinking it might be something like this but then I was confused as to why it wasn't a problem when using dput() which seemed to interpret the dashes correctly. I also wasn't sure how to check for en/em-dashes; would you mind showing me how you checked for those?

fahr <- "[(].*$"

Thanks for that as well... It is also the first time I've used regex, so I was haphazardly making it up as I went along :slight_smile:

I used

en_dash <- "\U2013"
str_detect(OBJECT,en_dash)
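
The same check can be run for several dash-like characters at once. This is a sketch in base R rather than stringr, so it needs no packages; the vector `dashes` and the sample `x` are made up for illustration:

```r
# Candidate characters that can all look like a minus sign on screen
dashes <- c(
  hyphen_minus = "-",
  en_dash      = "\u2013",
  em_dash      = "\u2014",
  true_minus   = "\u2212"
)

# Hypothetical sample values; the first is assumed to start with U+2212
x <- c(paste0("\u2212", "2.3(27.9)"), "6.3(43.3)")

# One row per string, one column per candidate character.
# fixed = TRUE treats each candidate as a literal, not as a regex.
found <- sapply(dashes, function(d) grepl(d, x, fixed = TRUE))
found
```

For this sample only the `true_minus` column is TRUE, and only in the first row.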

The whole regex thing is like

With great power comes great headaches

I've learned over my 35-year migraine with it to take advantage of gimmes like lopping off unneeded parts at the end!

Hi @mrblobby and @technocrat, I think I found breadcrumbs to the rabbit hole: It seems that the web data had a true minus (\u2212) rather than a hyphen-minus, which is what the hyphen key on my (most?) keyboards produces and which R recognizes as a minus:

library(rvest)
#> Loading required package: xml2
library(tidyverse)

# url of temperatures
url <- "https://en.wikipedia.org/wiki/List_of_cities_by_average_temperature"

# extract desired test sample 
temps <-
  url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table') %>%
  map_dfr(html_table) %>%
  janitor::clean_names() %>%
  select(-ref, -year) %>%
  pivot_longer(
    -c(country, city),
    values_to = "temp",
    names_to = "month"
  ) %>% filter(country == "Afghanistan") %>%
  slice(1:4) %>% 
  pull(temp)

# inspect temps
temps
#> [1] "−2.3(27.9)" "−0.7(30.7)" "6.3(43.3)"  "12.8(55.0)"

# Regexes for pulling out temperature in C: one with hyphen-minus, one with true minus
temp_regex_hm <- "^(|-)\\d{1,2}([.]\\d{1,2}|)"
temp_regex_tm <- "^(|\u2212)\\d{1,2}([.]\\d{1,2}|)"

str_extract(temps, temp_regex_hm)
#> [1] NA     NA     "6.3"  "12.8"
str_extract(temps, temp_regex_tm)
#> [1] "−2.3" "−0.7" "6.3"  "12.8"

Created on 2020-03-23 by the reprex package (v0.3.0)
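
Rather than keeping two patterns, a single character class can accept either sign. The sketch below uses base R on made-up sample data (`temp_regex_both` is a hypothetical name); the same pattern string should also be usable with str_extract(), though that is not tested here:

```r
minus <- "\u2212"  # true minus, U+2212

# Hypothetical sample mixing both kinds of minus sign
temps <- c(
  paste0(minus, "2.3(27.9)"),  # true minus
  "-0.7(30.7)",                # hyphen-minus
  "6.3(43.3)",
  "12.8(55.0)"
)

# Optional sign from the class [-<true minus>], then 1-2 digits,
# then an optional decimal part. The hyphen comes first in the class,
# so it is taken literally rather than as a range.
temp_regex_both <- paste0("^[-", minus, "]?\\d{1,2}([.]\\d{1,2})?")

# regmatches() returns only the matched parts; every element matches here
matches <- regmatches(temps, regexpr(temp_regex_both, temps, perl = TRUE))
matches
```

The extracted values keep whichever sign they came with, so a true minus still has to be replaced before as.numeric() will accept it.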

I thought it might be an encoding issue, but it turns out the negative temps are the 'good' ones!

Encoding(temps)
#> [1] "UTF-8"   "UTF-8"   "unknown" "unknown"

Created on 2020-03-23 by the reprex package (v0.3.0)
So maybe it would be a good idea to set the encoding early, but I'm not sure how.

A couple of rabbit-hole guides I found helpful: String Encoding and R by Kevin Ushey, and README documentation for the signs package.

Good find, @dromano. I've spent the last couple of hours trying to figure this out, but am no closer to understanding what's going on. I know we can pass encoding = "UTF-8" to read_html() but that doesn't seem to help. I've resorted to using your improved regex and then using str_replace(temps, "\u2212", "-") before converting temps to numeric and continuing with my analysis.
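
For reference, the replace-then-convert workaround can be sketched end to end. This version uses base R on made-up sample data instead of the stringr calls mentioned above, and it borrows the "drop everything from the opening parenthesis" idea from earlier in the thread:

```r
minus <- "\u2212"  # true minus, U+2212

# Hypothetical sample in the same shape as the scraped column
temps <- c(
  paste0(minus, "2.3(27.9)"),
  paste0(minus, "0.7(30.7)"),
  "6.3(43.3)",
  "12.8(55.0)"
)

# Step 1: replace every true minus with the hyphen-minus that R parses
normalised <- gsub(minus, "-", temps, fixed = TRUE)

# Step 2: drop the Fahrenheit part, i.e. everything from "(" to the end
celsius_chr <- sub("[(].*$", "", normalised)

# Step 3: convert to numeric
celsius <- as.numeric(celsius_chr)
celsius  # -2.3 -0.7  6.3 12.8
```

Doing the replacement first means the original hyphen-minus regex works unchanged afterwards.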

That doesn't feel optimal but is a solution nonetheless, so I'll mark your post as such. Thanks :slight_smile:

@dromano: Who knew Unicode drew such fine distinctions? Odd that Wikipedia would go to the trouble.

You're welcome, @mrblobby. Although I did mention I initially thought it might be an encoding issue, I discovered it wasn't -- it's just that the webpage had used the rendering of minus that is preferred in mathematical typesetting, 'true minus', whereas R uses the keyboard variant, 'hyphen-minus'. True minus is a valid UTF-8 character, so it's not a problem from an encoding standpoint; the strings containing it are marked 'UTF-8' simply because they are not plain ASCII.
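
The mixed Encoding() output seen earlier can be reproduced without any scraping. A small base R sketch on made-up strings: R only attaches an encoding mark to strings that contain non-ASCII characters, so pure-ASCII values always report "unknown".

```r
# Hypothetical sample: first value starts with U+2212, second is pure ASCII
x <- c(paste0("\u2212", "2.3(27.9)"), "6.3(43.3)")

# Only the non-ASCII string carries an encoding mark
enc <- Encoding(x)
enc  # "UTF-8" "unknown"

# Both strings are nevertheless valid UTF-8
valid <- validUTF8(x)
valid  # TRUE TRUE
```

So "unknown" here does not signal a damaged string, only an ASCII one.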
