Scraping web page

jynusmac · August 16, 2023, 5:46pm

I'm total newcomer to this topic. I want to scrape this web: http://saih.chminosil.es/index.php?url=/datos/resumenPluviometria.

I'm able to see, with SelectorGadget, where the info is:

//*[contains(concat( " ", @class, " " ), concat( " ", "ancho100", " " ))]
But I have no idea how to follow up and include this in a script. Any advice or ideas on where to start?

Destix · August 16, 2023, 6:48pm

It depends on the language you prefer to use.

If you like python, I'd recommend you to work through some Beautiful Soup tutorials.
If you prefer R, I'd recommend you to work through some rvest tutorials.

M_AcostaCH · August 17, 2023, 9:37pm

Hi @jynusmac, Im try to make with rvest but dont get the html. In this case you need use Rselenium. or maybe I dont select the correct node

But at the final of this table, this page show a bottom for download the data in .xlsx format if you want download.

library(rvest)
library(tidyverse)

url_data <- "http://saih.chminosil.es/index.php?url=/datos/resumenPluviometria"

url_data2 <- url_data %>%
  read_html() %>% 
  html_nodes(xpath='/html/body/div[1]/div[2]/div[3]/div/div/div/div/table') %>%  # xpath
  html_table() |> 
  data.frame() # show empty data frame
###
url_data2 <- url_data %>%
  read_html() %>% 
  html_nodes(css='body > div.container > div.contenido.conSombra > div.left_center > div > div > div > div > table') %>%  # css
  html_table() |> 
  data.frame() # show empty data frame

jynusmac · August 18, 2023, 8:21am

Thanks for the approach @M_AcostaCH. All all attempts with rvest or rselenium end up with an empty data frame. i have already seen the download button and with some Chrome extension like Instant Data Scraper it is easy to download the data, but it would be more interesting for me with R. I will keep testing.

jynusmac · August 19, 2023, 12:39pm

I try:

url <- "http://saih.chminosil.es/index.php?url=/datos/resumenPluviometria"
tr <- read_html(url)
tables <- tr %>% html_table(fill= TRUE)

and also

table <- tr %>% 
        rvest::html_nodes("table") %>% 
        html_table(fill = T)

but show also empty frame.

Destix · August 20, 2023, 8:10pm

First, load the page and take a look at what you've got:

library(rvest)

url <- "http://saih.chminosil.es/index.php?url=/datos/resumenPluviometria"
html <- read_html(url)

html %>% html_text()
html %>% html_nodes("*") %>% html_attr("class") %>% unique()

If you run this, you'll see that the data is not returned corretly. This can be caused by some bot-defense. In this case, however, there is something wrong entirely.

Starting a web session gives

session("http://saih.chminosil.es/index.php?url=/datos/resumenPluviometria")
<session> http://saih.chminosil.es/index.php?url=/datos/mapas/mapa:H1/area:HID/acc:
  Status: 200
  Type:   text/html
  Size:   45929

Your link "http://saih.chminosil.es/index.php?url=/datos/resumenPluviometria" actually sends the session to "http://saih.chminosil.es/index.php?url=/datos/mapas/mapa:H1/area:HID/acc:". Thus, you first need to navigate to the resumenPluviometria page and then scrape it.

homepage <- session("http://saih.chminosil.es/index.php?url=/datos/mapas/mapa:H1/area:HID/acc:")
resumenPluviometria <- homepage %>% session_follow_link(xpath = "/html/body/div/div[2]/div[1]/div/div[2]/ul/li[1]/div/div[2]/div/div[1]/ul/li[3]/a")
# Navigating to index.php?url=/datos/resumenPluviometria

html <- resumenPluviometria %>% read_html()
table <- html %>% html_table()
print(table)

# A tibble: 279 x 9
   Provincia Estación                    Señal `Umbral Alerta (mm)` Umbral Prealerta (mm~1
   <chr>     <chr>                       <chr> <chr>                <chr>                 
 1 León      E003 - LAS ROZAS            Prec~ 60,00                30,00                 
 2 León      E003 - LAS ROZAS            Prec~ 120,00               80,00                 
 3 León      E003 - LAS ROZAS            Prec~ -                    -                     
 4 León      E005 - MATALAVILLA          Prec~ 60,00                30,00                 
 5 León      E005 - MATALAVILLA          Prec~ 120,00               80,00                 
 6 León      E005 - MATALAVILLA          Prec~ -                    -                     
 7 León      P008 - COLINAS DEL CAMPO    Prec~ 60,00                30,00                 
 8 León      P008 - COLINAS DEL CAMPO    Prec~ 120,00               80,00                 
 9 León      P008 - COLINAS DEL CAMPO    Prec~ -                    -                     
10 León      N036 - RIO TREMOR EN ALMAG~ Prec~ 60,00                30,00                 
# i 269 more rows
# i abbreviated name: 1: `Umbral Prealerta (mm)`
# i 4 more variables: `Umbral Activación (mm)` <chr>, `Valor actual (mm)` <chr>,
#   Fecha <chr>, Tendencia <lgl>
# i Use `print(n = ...)` to see more rows

jynusmac · August 21, 2023, 3:57pm

Thank you @Destix , I have learned a bunch of new tricks to use in the future.

M_AcostaCH · August 22, 2023, 6:46pm

Very interesting way for make this:

Im use this xpath and run well:

xpath = "//*[@id='maximenuck']/div[2]/ul/li[1]/div/div[2]/div/div[1]/ul/li[3]/a"

Using session is a form for avoid do with Rselenium?

Destix · August 22, 2023, 8:22pm

I do not write the xpath myself, I just copy-paste it. Just go into the element viewer, right-click the element, copy, and copy as xpath. This is what I got from that.

With regards to what session does, I'm afraid I can't give you much advice. I usually web scrape with Python so Rvest is novel to me. At first sight, It looks like a thinner version of selenium, but for most use cases, it provides a sufficient set of functions. You can take a look at the session functions here.

system · August 29, 2023, 8:22pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.