The following code scrapes all the data you need. You clearly have some experience with web scraping, so I won't spend much time explaining what the code does for now, but feel free to ask questions and I'll be glad to provide more details. For the messages in the threads, I scraped the link to each thread, then used those links to visit each thread page and scrape its messages. The final result of this script is a tidy data frame with one list-column (since there are several messages in each thread).
library(rvest)
library(dplyr)
library(stringr)
library(purrr)
# Scrape thread titles, thread links, authors and number of views
url <- "https://www.healthboards.com/boards/aspergers-syndrome/"
h <- read_html(url)
threads <- h %>%
  html_nodes("#threadslist .alt1 a") %>%
  html_text()
thread_links <- h %>%
  html_nodes("#threadslist .alt1 a") %>%
  html_attr(name = "href")
authors <- h %>%
  html_nodes("#threadslist .alt1 .smallfont") %>%
  html_text() %>%
  str_replace_all(pattern = "\t|\r|\n", replacement = "")
views <- h %>%
  html_nodes(".alt2:nth-child(6)") %>%
  html_text() %>%
  str_replace_all(pattern = ",", replacement = "") %>%
  as.numeric()
# Custom function to scrape messages in each thread
scrape_messages <- function(link) {
  read_html(link) %>%
    html_nodes(css = ".smallfont~ hr+ div") %>%
    html_text() %>%
    str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
    str_trim()
}
# Create master dataset (and scrape messages in each thread in process)
master_data <-
  tibble(threads, authors, views, thread_links) %>%
  mutate(messages = map(thread_links, scrape_messages)) %>%
  select(threads:views, messages, thread_links)
head(master_data)
threads authors views messages thread_links
<chr> <chr> <dbl> <list> <chr>
1 ADHD And Aspergers MyNameIsCra~ 4973 <chr [3~ https://www.healthboards.com/boards/aspergers-syndrome/1035173-adhd-asp~
2 Adult Pants Pooping and Asperger'~ poopypants21 1680 <chr [4~ https://www.healthboards.com/boards/aspergers-syndrome/1037809-adult-pa~
3 I did NOT spoil him! mery 5939 <chr [7~ https://www.healthboards.com/boards/aspergers-syndrome/921652-i-did-not~
4 ASD Assessment as an adult, how? Dragonfly W~ 1243 <chr [2~ https://www.healthboards.com/boards/aspergers-syndrome/1032212-asd-asse~
5 Sex and the single woman with AS Madeofglass 7040 <chr [4~ https://www.healthboards.com/boards/aspergers-syndrome/973625-sex-singl~
6 I have aspergers and very severe ~ joe398 1445 <chr [2~ https://www.healthboards.com/boards/aspergers-syndrome/1029904-i-have-a~
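Because messages is a list-column, each cell holds a character vector with all the messages of one thread. Here is a minimal sketch of how you could inspect such a column. Note that toy_data and its values are made up for illustration and are not scraped from the site; with the real result you would use master_data instead.

```r
library(tibble)

# Toy stand-in for master_data (hypothetical values, not scraped data)
toy_data <- tibble(
  threads  = c("Thread A", "Thread B"),
  messages = list(c("first post", "a reply"), c("only post"))
)

# All messages of the first thread, as a character vector
first_thread_messages <- toy_data$messages[[1]]
first_thread_messages

# Number of messages in each thread
messages_per_thread <- lengths(toy_data$messages)
messages_per_thread
```

Double brackets (`[[1]]`) pull out the character vector itself, whereas single brackets would return a list of length one.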
One last thing, which I'm not sure should go in a separate post:
I'm trying to save the messages (with their author, title, and views) to a CSV file.
I think that first I must un-nest the messages.
Next time you ask a question, do not forget to tag me on the post so I get a notification. The only issue with your code is that the first argument of the write.csv() function should be master_data.
write.csv(master_data, "C:/asperger/resul.csv", na = "")
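One caveat: if master_data still has messages as a list-column at this point, write.csv() will most likely fail, because it cannot write list-columns. In that case you would un-nest first, as you suspected. Below is a rough sketch of that step, assuming the tidyr package is installed. The toy_data tibble, its values, and the temporary output file are placeholders I made up for illustration; with your data you would un-nest master_data and keep your own file path.

```r
library(tibble)
library(tidyr)

# Toy stand-in for master_data (hypothetical values, not scraped data)
toy_data <- tibble(
  threads  = c("Thread A", "Thread B"),
  authors  = c("user1", "user2"),
  views    = c(10, 20),
  messages = list(c("first post", "a reply"), c("only post"))
)

# One row per message; the thread-level columns are repeated on each row
flat_data <- unnest(toy_data, cols = messages)

# A temporary file is used here; swap in your own path
out_file <- tempfile(fileext = ".csv")
write.csv(flat_data, out_file, row.names = FALSE, na = "")
```

After un-nesting, every message gets its own row, so the title, author, and view count of a thread appear once per message in the CSV.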
You're very welcome @anacho. I would also suggest that you mark the first response as the solution because it directly addresses your original request.