rvest inserting text in webform?

Hello,
I need to download a table from a website.
The table I'm interested in, as rvest shows it to me:

  `Income`                    `Income`
  <chr>                       <chr>
1 General                     610
2 Under 18                    740
3 No info                     429
4 No info 2                   410
5 select other day (mmaaaa):  select other day (mmaaaa):

As you can see, the code should insert the date in the fifth row of the table and then make the query.
That is, in order to get the data for January 2020, I must insert "012020" into the table.
And I can't do it: ChatGPT, DeepSeek, and Gemini all fail to assign the date/text.

The code referring to the table is:
html_table(html_nodes(content, "table"), fill = TRUE)

Specifically, it is:
html_table(html_nodes(content, "table"), fill = TRUE)[[4]]

What can I do? I fail every time I try to solve it.
Thanks for your time and interest.

As you are most likely dealing with some dynamic content, try replacing rvest::read_html() with rvest::read_html_live().
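A minimal sketch of that swap, keeping the rest of your pipeline the same (read_html_live() drives a headless browser through the chromote package, so that needs to be installed):

library(rvest)

# read_html_live() renders the page in a real (headless) browser via
# chromote, so JavaScript-generated content ends up in the document
content <- read_html_live("https://ventanilla.dirtrab.cl/indicadores/webform1.aspx")
html_table(html_elements(content, "table"))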

For more detailed answers, you could perhaps share the page you are working with?

https://ventanilla.dirtrab.cl/indicadores/webform1.aspx

We can use rvest's form handling: html_form() + html_form_set() + html_form_submit()

Though it needs a little extra help. As that form uses an image button for submission, i.e. <input type="image" name="btn_im" ...>, the posted payload should also include the coordinates of the click on that image, btn_im.x & btn_im.y, and apparently the presence of those values is checked server-side.
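For illustration, a hypothetical httr2 sketch of roughly what the browser posts on such a click; the real payload also carries the ASP.NET view-state fields copied from the page, so this exact call would most likely be rejected:

library(httr2)

# Illustrative only: coordinate values are made up, and the actual
# request would also need __VIEWSTATE & related hidden fields
request("https://ventanilla.dirtrab.cl/indicadores/webform1.aspx") |>
  req_body_form(
    txt_im     = "012020", # the date field
    `btn_im.x` = "10",     # x coordinate of the click on the image button
    `btn_im.y` = "10"      # y coordinate of the click on the image button
  ) |>
  req_perform()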

As rvest by itself does not add those fields, nor does it let us add new fields directly through html_form_set(), we need to alter the form's field list ourselves; that's what form_add_xy() does.

Another helper, table_without_factor(), first removes the table row(s) with form controls and then parses the table.

library(rvest)

form_add_xy <- function(form, input_image_name = ""){
  # build <name>.x / <name>.y click-coordinate fields through rvest's
  # internal field constructor and append them to the form's field list
  xy <-
    paste0(input_image_name, c(".x", ".y")) |>
    lapply(\(name) rvest:::rvest_field(type = "text", name = name, value = 10, attr = NA))
  form$fields <- c(form$fields, xy)
  form
}

table_without_factor <- function(x){
  # drop the row(s) that hold form controls (class "factor"),
  # then parse what remains
  html_elements(x, ".factor") |> xml2::xml_remove()
  html_table(x)
}

im_date <- "012020"

read_html("https://ventanilla.dirtrab.cl/indicadores/webform1.aspx") |> 
  html_form() |> 
  # html_form() always returns a list,
  # html_form_set() expects a single element from that list
  getElement(1) |> 
  html_form_set(txt_im = im_date) |> 
  form_add_xy("btn_im") |> 
  html_form_submit(submit = "btn_im") |> 
  read_html() |> 
  # pick table by table header text
  html_element(xpath = "//th[text() = 'INGRESO MÍNIMO']/ancestor::table") |> 
  table_without_factor()

Resulting frame:

#> # A tibble: 4 × 2
#>   `INGRESO MÍNIMO` `INGRESO MÍNIMO`
#>   <chr>            <chr>           
#> 1 General          $ 301000        
#> 2 Menor 18         $ 224704        
#> 3 No remunerado    $ 194164        
#> 4 Casa particular  $ 301000

It worked!
Can you recommend a book for learning rvest and scraping, so I can process websites similar to the one I provided? I only find rvest applied to simple queries like Wikipedia and so on, but nothing working with input boxes (text/numbers/dates).
Thanks for your patience.

If you haven't gone through the Web scraping chapter of R4DS yet, I'd start there.
You could also check the slides and examples from the UseR 2024 rvest tutorial by Hadley Wickham - GitHub - hadley/web-scraping

And of course the rvest documentation and vignettes; it's a relatively small package with not too many user-facing functions and methods, so it makes sense to read all of it. Make sure you understand the examples and follow the links to external resources; after that you should have no issue describing in your own words the differences between read_html() & read_html_live(), or naming at least one read_html_live() alternative. BTW, any resource that still uses the deprecated html_node(s) & co. instead of the current html_element(s) is probably somewhat outdated, though it may still include some nice strategies and general approaches.

Some knowledge about the underlying packages like httr, xml2 & chromote can also help. Though it's less about the actual scraping framework or package you are using (rvest, chromote, httr2 / httr, even plain jsonlite at times; the bs4 or scrapy Python libraries; something based on Selenium) and more about general web tech knowledge. You don't need to be a full-stack web dev for scraping, but you should be fairly comfortable with your browser's dev tools (and not just the element inspector), know your way around different HTTP requests & headers / cookies / HTML / CSS / JS / XPath, understand what happens on the server side & client side, and preferably have some general knowledge about UI frameworks, etc. I'm afraid that without the basics it's also quite difficult to work with LLM-assisted solutions.
Much of this can be gathered from developer.mozilla.org


Though the process and steps for the previous answer were something like this:

  • check the request payload in the browser's dev tools
  • does it work with JavaScript disabled? (yes)
  • can I replicate it with an rvest session by setting just one form field? (no)
  • can I replicate it with httr2, using a payload identical to the browser request but without setting cookies? (yes, but making that request with hardcoded view states does not look robust and is way too verbose)
  • so it works without a session cookie, i.e. I can use rvest without a session
  • use httr verbose mode to check the payload difference between the rvest & httr2 requests (btn_im.x & btn_im.y were missing; see the sketch after this list)
  • I know from past experience that rvest does not allow adding new form fields, but it's easy to alter the form object directly; having an old code snippet saved a few minutes
  • I know enough XPath to know I can identify elements by text content and find their ancestors; I checked a cheat sheet for the exact syntax to identify the table by its title
  • checking the table structure in dev tools: form controls are kept only in rows with the factor class
  • I know enough xml2 to remove the factor rows before rather than after parsing the table
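
Regarding the verbose-mode step: rvest submits forms through httr, so wrapping the submission is enough. A sketch, where filled_form is a hypothetical name standing for the html_form_set() result from the answer above:

# print the outgoing request, including the posted form body;
# `filled_form` is hypothetical -- the output of html_form_set() above
httr::with_verbose(
  html_form_submit(filled_form, submit = "btn_im"),
  data_out = TRUE
)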