Unable to use {rvest} with {targets} in a Nix environment

I have a simple {targets} pipeline where I read the content of a webpage:

list(
  targets::tar_target(
    html_content,
    rvest::read_html("https://www.worldatlas.com/na/us/area-codes.html")
  )
)

When I run the pipeline (i.e. targets::tar_make()) and check the content of the html_content target, this is what I get:

> tar_read(html_content)
$node
<pointer: (nil)>

$doc
<pointer: (nil)>

attr(,"class")
[1] "xml_document" "xml_node"    

Odd! However, I am able to read the webpage when I run the rvest::read_html(...) code in the console. I dug around a little bit and learned that the function does not do well in "saved environments" (source).

Now, speaking of saved environments, my entire Linux (Pop!_OS) configuration is managed by the reproducibility powerhouse Nix. It is not yet widely known among useRs, but it is gaining traction thanks to the efforts of Bruno Rodrigues and co. with their great {rix} package (an R package which creates Nix-based local reproducible environments for R projects). I am no longer using {rix} myself, but I spent some time learning how Nix works and now use it directly. I suspect that this setup constitutes a "saved environment", which causes problems for the rvest::read_html() function.

Does anyone know how I can resolve this issue?
Also @wlandau apologies for tagging you.

> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS

> packageVersion("rvest")
[1] '1.0.4'
> packageVersion("targets")
[1] '1.7.1'

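tar_make() saves the object to disk using saveRDS(), and then tar_read() reads the object into a different R session. An object of class "xml_document" is a thin wrapper around external pointers to data held in the memory of the session that created it. Those pointers do not survive serialization, so the object cannot be used in a different session than the one where it was created. In fact, when I run your pipeline and then call tar_read(), I see: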

Error in `doc_type()`:
! external pointer is not valid

This is what I see using rvest without targets:

temp <- tempfile()
saveRDS(rvest::read_html("https://www.worldatlas.com/na/us/area-codes.html"), temp)
object <- readRDS(temp)
object
#> Error in doc_type():
#> ! external pointer is not valid

So I would recommend converting the data into a format that can be serialized. For example, the target could return as.character(rvest::read_html(url)), where url is the page address, and then you could call rvest::read_html(tar_read(html_content)) to parse it again.
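
Here is a minimal sketch of that round trip without targets or a network connection. It assumes the xml2 package is installed (rvest::read_html() is re-exported from xml2), and the inline HTML string and the variable names are made up for illustration:

```r
library(xml2)

# Hypothetical stand-in for the downloaded page.
html <- "<html><body><p id='msg'>hello</p></body></html>"
doc <- read_html(html)

# Save the document as a plain character string, not as the pointer-backed object.
temp <- tempfile(fileext = ".rds")
saveRDS(as.character(doc), temp)

# Re-parse the string after reading it back; this yields a fresh, valid document.
restored <- read_html(readRDS(temp))
restored_text <- xml_text(xml_find_first(restored, "//p"))
restored_text
#> [1] "hello"
```

In the pipeline, the same idea would mean wrapping the read_html() call inside the target in as.character(), and re-parsing with rvest::read_html(tar_read(html_content)) wherever the document is needed.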

Your prompt response, as always, is much appreciated. Thank you.
