Scraping messages in a forum using rvest

I'm trying to scrape all the messages on this board:

https://www.healthboards.com/boards/aspergers-syndrome/

For each post I'm trying to get the date, author, number of views, and the actual post with its corresponding replies, for the whole Asperger's board.

I was able to get the title of each post, the author, and the views:

library(rvest)
library(dplyr)
library(stringr)

    url<-"https://www.healthboards.com/boards/aspergers-syndrome/"
    h <- read_html(url)

    threads <- h %>%
      html_nodes("#threadslist .alt1 a") %>%
      html_text()

    threads

    authors <-  h %>%
      html_nodes("#threadslist .alt1 .smallfont") %>%
      html_text()
    authors <- gsub('\\s+',' ',authors)

    authors

    views <-  h %>%
      html_nodes(".alt2:nth-child(6)") %>%
      html_text()

    views

To get the messages I'm using this:

url<-"https://www.healthboards.com/boards/aspergers-syndrome/index2.html"

messages <- read_html(url) 
messages

threads<- cbind(messages %>% html_nodes("iframe , #threadslist .alt1 a") %>% html_text() )
threads

But I cannot get the body of the messages.

Welcome to the RStudio Community forum, @anacho!

The following code scrapes all the data you need. You clearly have some experience with web scraping, so I will not spend much time explaining what the code does for now, but feel free to ask questions and I will gladly provide more details. For the messages in the threads, I scrape the link to each thread, then use those links to open the thread pages and scrape the messages from them. The final result of this script is a tidy data frame with one list-column (since there are several messages in each thread).

library(rvest)
library(dplyr)
library(stringr)
library(purrr)

# Scrape thread titles, thread links, authors and number of views

url <- "https://www.healthboards.com/boards/aspergers-syndrome/"
h <- read_html(url)

threads <- h %>%
  html_nodes("#threadslist .alt1 a") %>%
  html_text()

thread_links <- h %>%
  html_nodes("#threadslist .alt1 a") %>%
  html_attr(name = "href")

authors <- h %>%
  html_nodes("#threadslist .alt1 .smallfont") %>%
  html_text() %>%
  str_replace_all(pattern = "\t|\r|\n", replacement = "")

views <- h %>%
  html_nodes(".alt2:nth-child(6)") %>%
  html_text() %>%
  str_replace_all(pattern = ",", replacement = "") %>%
  as.numeric()


# Custom function to scrape messages in each thread

scrape_messages <- function(link){
  read_html(link) %>%
    html_nodes(css = ".smallfont~ hr+ div") %>%
    html_text() %>%
    str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
    str_trim()
}

# Create master dataset (and scrape messages in each thread in process)

master_data <- 
  tibble(threads, authors, views, thread_links) %>%
  mutate(messages = map(thread_links, scrape_messages)) %>%
  select(threads:views, messages, thread_links)

head(master_data)

  threads                            authors      views messages thread_links                                                            
  <chr>                              <chr>        <dbl> <list>   <chr>                                                                   
1 ADHD And Aspergers                 MyNameIsCra~  4973 <chr [3~ https://www.healthboards.com/boards/aspergers-syndrome/1035173-adhd-asp~
2 Adult Pants Pooping and Asperger'~ poopypants21  1680 <chr [4~ https://www.healthboards.com/boards/aspergers-syndrome/1037809-adult-pa~
3 I did NOT spoil him!               mery          5939 <chr [7~ https://www.healthboards.com/boards/aspergers-syndrome/921652-i-did-not~
4 ASD Assessment as an adult, how?   Dragonfly W~  1243 <chr [2~ https://www.healthboards.com/boards/aspergers-syndrome/1032212-asd-asse~
5 Sex and the single woman with AS   Madeofglass   7040 <chr [4~ https://www.healthboards.com/boards/aspergers-syndrome/973625-sex-singl~
6 I have aspergers and very severe ~ joe398        1445 <chr [2~ https://www.healthboards.com/boards/aspergers-syndrome/1029904-i-have-a~
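
To see what the `messages` list-column holds without hitting the site, here is a minimal offline sketch of the same pattern. The HTML strings, the `.post` class, and the name `scrape_messages_offline` are hypothetical stand-ins, not the real HealthBoards markup; only the shape of the pipeline mirrors `scrape_messages()`.

```r
library(rvest)
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical stand-ins for two thread pages (NOT the real site markup)
pages <- c(
  thread_a = "<div class='post'> First message\n</div><div class='post'>A reply</div>",
  thread_b = "<div class='post'>Only message</div>"
)

# Same shape as scrape_messages(), but parsing an HTML string instead of a URL
scrape_messages_offline <- function(html) {
  read_html(html) %>%
    html_nodes(".post") %>%
    html_text() %>%
    str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
    str_trim()
}

# map() returns one character vector per thread, stored as a list-column
demo <- tibble(threads = names(pages), page = unname(pages)) %>%
  mutate(messages = map(page, scrape_messages_offline)) %>%
  select(threads, messages)

# Number of messages found in each thread
lengths(demo$messages)
```

Each element of `demo$messages` is a character vector: two messages for the first toy thread and one for the second. `master_data$messages[[1]]` can be inspected the same way.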

Hope this helps.
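
The script above reads only the first listing page. A rough sketch for covering the rest of the board follows, assuming later pages use the `indexN.html` pattern seen in the question (`index2.html`). The helper `build_page_urls` and the value of `n_pages` are hypothetical; the real page count would have to be read off the board's pager.

```r
base_url <- "https://www.healthboards.com/boards/aspergers-syndrome/"

# Assumption: page 1 is the bare board URL and page N (N >= 2) is "indexN.html"
build_page_urls <- function(base_url, n_pages) {
  if (n_pages < 2) {
    return(base_url)
  }
  c(base_url, paste0(base_url, "index", 2:n_pages, ".html"))
}

# n_pages = 3 is a placeholder, not the board's actual page count
page_urls <- build_page_urls(base_url, n_pages = 3)
page_urls
```

Each URL in `page_urls` could then go through the same scraping steps as the first page, ideally with a short `Sys.sleep()` between requests to avoid hammering the server.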


Thanks a lot! You are a genius!

One last thing, which I'm not sure whether I should ask in a separate post:
I'm trying to save the messages (with their author, title, and views) to a CSV file.
I think I first need to un-nest the messages:

master_data <- 
  tibble(threads, authors, views, thread_links) %>%
  mutate(messages = map(thread_links, scrape_messages)) %>%
  select(threads:views, messages, thread_links)%>%
  unnest()
  write.csv("C:/asperger/resul.csv", na="")

But this is not working.

Hey @anacho,

Next time you ask a question, do not forget to tag me on the post so I get a notification. The only issue with your code is that the first argument of the write.csv() function should be master_data.

write.csv(master_data, "C:/asperger/resul.csv", na = "")
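
For context, the pipeline in the question ends at `unnest()`, so `write.csv()` runs as a separate call and receives the file path as its data argument. A small sketch of the flatten-then-write step on toy data is below. The `toy` tibble and the temporary file are hypothetical, `unnest(cols = messages)` assumes tidyr 1.0.0 or later, and `row.names = FALSE` is an addition that was not in the original call.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for master_data: one row per thread, messages as a list-column
toy <- tibble(
  threads  = c("Thread A", "Thread B"),
  authors  = c("user1", "user2"),
  views    = c(10, 20),
  messages = list(c("first post", "a reply"), "only post")
)

# One row per message; thread-level columns are repeated for each message
flat <- toy %>%
  unnest(cols = messages)

# Write to a temporary file here; replace with your own path
out_file <- tempfile(fileext = ".csv")
write.csv(flat, out_file, na = "", row.names = FALSE)

nrow(flat)
```

After unnesting, the data frame has one row per message rather than one per thread, which is the shape a CSV file can hold.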

Thanks so much, @gueyenono!

You're very welcome @anacho. I would also suggest that you mark the first response as the solution because it directly addresses your original request.

Happy coding!
