Scraping web page

I'm a total newcomer to this topic. I want to scrape this web page: http://saih.chminosil.es/index.php?url=/datos/resumenPluviometria.

I'm able to see, with SelectorGadget, where the info is:

//*[contains(concat( " ", @class, " " ), concat( " ", "ancho100", " " ))]
But I have no idea how to follow up and include this in a script. Any advice or ideas on where to start?

It depends on the language you prefer to use.

If you like Python, I'd recommend working through some Beautiful Soup tutorials.
If you prefer R, I'd recommend working through some rvest tutorials.
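
To give a rough idea of what the rvest route looks like, here is a minimal sketch that plugs the SelectorGadget XPath from the question into a script. The HTML fragment is made up for illustration (it is not the real page, and the second class name `datos` is an assumption), and it assumes rvest 1.0 or later for `html_element()`.

```r
library(rvest)

# Made-up page fragment standing in for the real site (assumption).
page <- minimal_html('
  <table class="ancho100 datos">
    <tr><th>Estacion</th><th>Senal</th></tr>
    <tr><td>LAS ROZAS</td><td>Precipitacion</td></tr>
  </table>
')

# XPath copied from SelectorGadget: any element whose class list contains "ancho100".
xp <- '//*[contains(concat( " ", @class, " " ), concat( " ", "ancho100", " " ))]'

tbl <- page %>%
  html_element(xpath = xp) %>%  # first node matching the XPath
  html_table()                  # parse the <table> into a tibble

print(tbl)
```

For a live page you would replace `minimal_html(...)` with `read_html(url)`, which only works if the server actually returns the table for that URL.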

Hi @jynusmac, I tried this with rvest but couldn't get the HTML. In this case you may need to use RSelenium, or maybe I'm not selecting the correct node.

At the bottom of the table, though, the page has a button to download the data in .xlsx format, if that is enough for you.

library(rvest)
library(tidyverse)

url_data <- "http://saih.chminosil.es/index.php?url=/datos/resumenPluviometria"

# Attempt 1: select the table by XPath
url_data2 <- url_data %>%
  read_html() %>%
  html_nodes(xpath = "/html/body/div[1]/div[2]/div[3]/div/div/div/div/table") %>%
  html_table() %>%
  data.frame() # returns an empty data frame

# Attempt 2: select the table by CSS selector
url_data2 <- url_data %>%
  read_html() %>%
  html_nodes(css = "body > div.container > div.contenido.conSombra > div.left_center > div > div > div > div > table") %>%
  html_table() %>%
  data.frame() # returns an empty data frame


Thanks for the approach, @M_AcostaCH. All my attempts with rvest or RSelenium end up with an empty data frame. I had already seen the download button, and with a Chrome extension like Instant Data Scraper it is easy to download the data, but doing it in R would be more interesting for me. I will keep testing.

I tried:

url <- "http://saih.chminosil.es/index.php?url=/datos/resumenPluviometria"
tr <- read_html(url)
tables <- tr %>% html_table(fill= TRUE)

and also

table <- tr %>%
  rvest::html_nodes("table") %>%
  html_table(fill = TRUE)

but this also returns an empty data frame.

First, load the page and take a look at what you've got:

library(rvest)

url <- "http://saih.chminosil.es/index.php?url=/datos/resumenPluviometria"
html <- read_html(url)

html %>% html_text()
html %>% html_nodes("*") %>% html_attr("class") %>% unique()

If you run this, you'll see that the data is not returned correctly. Sometimes that is caused by bot defenses. In this case, however, something else entirely is going wrong.

Starting a web session gives

session("http://saih.chminosil.es/index.php?url=/datos/resumenPluviometria")
<session> http://saih.chminosil.es/index.php?url=/datos/mapas/mapa:H1/area:HID/acc:
  Status: 200
  Type:   text/html
  Size:   45929

Your link "http://saih.chminosil.es/index.php?url=/datos/resumenPluviometria" actually sends the session to "http://saih.chminosil.es/index.php?url=/datos/mapas/mapa:H1/area:HID/acc:". Thus, you first need to navigate to the resumenPluviometria page and then scrape it.
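
As a rough sketch, this kind of redirect can also be checked in code by comparing the URL you asked for with the `url` field of the session object. The helper names `landed_elsewhere()` and `report_redirect()` are hypothetical, made up for this example; only `session()` and the session's `url` field come from rvest.

```r
library(rvest)

# Hypothetical helper: TRUE when the final URL differs from the requested one.
landed_elsewhere <- function(requested, final) {
  !identical(requested, final)
}

# Hypothetical wrapper: open a session and report where it actually landed.
report_redirect <- function(url) {
  s <- session(url)  # performs the HTTP request and follows redirects
  if (landed_elsewhere(url, s$url)) {
    message("Requested: ", url, "\nLanded on: ", s$url)
  }
  invisible(s)
}

# Needs network access, so the call is left commented out:
# report_redirect("http://saih.chminosil.es/index.php?url=/datos/resumenPluviometria")
```

Once the redirect is confirmed, the navigation itself looks like this: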

homepage <- session("http://saih.chminosil.es/index.php?url=/datos/mapas/mapa:H1/area:HID/acc:")
resumenPluviometria <- homepage %>% session_follow_link(xpath = "/html/body/div/div[2]/div[1]/div/div[2]/ul/li[1]/div/div[2]/div/div[1]/ul/li[3]/a")
# Navigating to index.php?url=/datos/resumenPluviometria

html <- resumenPluviometria %>% read_html()
table <- html %>% html_table()
print(table)
# A tibble: 279 x 9
   Provincia Estación                    Señal `Umbral Alerta (mm)` Umbral Prealerta (mm~1
   <chr>     <chr>                       <chr> <chr>                <chr>
 1 León      E003 - LAS ROZAS            Prec~ 60,00                30,00
 2 León      E003 - LAS ROZAS            Prec~ 120,00               80,00
 3 León      E003 - LAS ROZAS            Prec~ -                    -
 4 León      E005 - MATALAVILLA          Prec~ 60,00                30,00
 5 León      E005 - MATALAVILLA          Prec~ 120,00               80,00
 6 León      E005 - MATALAVILLA          Prec~ -                    -
 7 León      P008 - COLINAS DEL CAMPO    Prec~ 60,00                30,00
 8 León      P008 - COLINAS DEL CAMPO    Prec~ 120,00               80,00
 9 León      P008 - COLINAS DEL CAMPO    Prec~ -                    -
10 León      N036 - RIO TREMOR EN ALMAG~ Prec~ 60,00                30,00
# i 269 more rows
# i abbreviated name: 1: `Umbral Prealerta (mm)`
# i 4 more variables: `Umbral Activación (mm)` <chr>, `Valor actual (mm)` <chr>,
#   Fecha <chr>, Tendencia <lgl>
# i Use `print(n = ...)` to see more rows
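
Note that the threshold columns come back as character, with decimal commas and "-" for missing values. Below is a minimal base R sketch of one way to turn them into numbers. The helper name `to_number()` is made up for this example, the sample data frame only mimics the shape of the scraped table, and it assumes the values contain no thousands separators.

```r
# Hypothetical helper: "60,00" -> 60, "-" -> NA
to_number <- function(x) {
  x <- trimws(x)
  x[x == "-"] <- NA_character_                # "-" marks a missing value
  as.numeric(sub(",", ".", x, fixed = TRUE))  # decimal comma -> decimal point
}

# Small made-up sample in the same shape as one scraped column
sample_tbl <- data.frame(
  `Umbral Alerta (mm)` = c("60,00", "120,00", "-"),
  check.names = FALSE
)

sample_tbl[["Umbral Alerta (mm)"]] <- to_number(sample_tbl[["Umbral Alerta (mm)"]])
str(sample_tbl)
```

The same function could be applied to each of the "(mm)" columns of the real table.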

Thank you, @Destix. I have learned a bunch of new tricks to use in the future.

A very interesting way to do this.

I used this XPath and it works well:

xpath = "//*[@id='maximenuck']/div[2]/ul/li[1]/div/div[2]/div/div[1]/ul/li[3]/a"

Is using session a way to avoid RSelenium?

I do not write the xpath myself, I just copy-paste it. Just go into the element viewer, right-click the element, copy, and copy as xpath. This is what I got from that.

With regard to what session does, I'm afraid I can't give you much advice. I usually web scrape with Python, so rvest is new to me. At first sight, it looks like a thinner version of Selenium, and for most use cases it provides a sufficient set of functions. You can take a look at the session functions here.
