1- As the code stands, I'm storing only the user ID that started each thread, not the IDs of the users who replied to it, so it would be great if the "author" column of the csv file held the author of each post.
2- This forum has information about each user. For example, the page for this user, https://www.medhelp.org/personal_pages/user/20824631,
has an "About me" section, and I'm trying to create an "About me" column in the csv file.
But not all users have filled in this information. For those who haven't, I'm trying to leave the value as NULL or NA, but I couldn't get it to work.
I was able to complete your first request, which was to scrape the author IDs in each thread. I had to change a few variable and function names. I also used the RCurl::getURL() function to download the HTML of every link once, store it in a variable, and then scrape the data of interest from that variable. This is good practice because the original code scraped directly from the website on every call, and some websites will lock you out for making repeated requests.
library(rvest)
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
library(RCurl)
# Scrape thread titles, thread links, authors and number of views
url <- "https://www.healthboards.com/boards/aspergers-syndrome/"
h <- read_html(url)
threads <- h %>%
html_nodes("#threadslist .alt1 a") %>%
html_text()
thread_links <- h %>%
html_nodes("#threadslist .alt1 a") %>%
html_attr(name = "href")
thread_starters <- h %>%
html_nodes("#threadslist .alt1 .smallfont") %>%
html_text() %>%
str_replace_all(pattern = "\t|\r|\n", replacement = "")
views <- h %>%
html_nodes(".alt2:nth-child(6)") %>%
html_text() %>%
str_replace_all(pattern = ",", replacement = "") %>%
as.numeric()
# Custom functions to scrape author IDs and posts
scrape_posts <- function(link){
read_html(link) %>%
html_nodes(css = ".smallfont~ hr+ div") %>%
html_text() %>%
str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
str_trim()
}
scrape_author_ids <- function(link){
h <- read_html(link) %>%
html_nodes("div")
id_index <- h %>%
html_attr("id") %>%
str_which(pattern = "postmenu")
h %>%
`[`(id_index) %>%
html_text() %>%
str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
str_trim()
}
# Create master dataset
htmls <- map(thread_links, getURL)
master_data <-
tibble(threads, thread_starters, views, thread_links) %>%
mutate(
post_author_id = map(htmls, scrape_author_ids),
post = map(htmls, scrape_posts)
) %>%
select(threads:views, post_author_id, post, thread_links) %>%
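unnest(cols = c(post_author_id, post)) # name the list-columns explicitly; a bare unnest() is deprecated in tidyr >= 1.0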
head(master_data)
threads thread_starters views thread_links post_author_id post
<chr> <chr> <dbl> <chr> <chr> <chr>
1 ADHD And Aspergers MyNameIsCrazy 5021 https://www.healthboards.com/boards/asperge~ MyNameIsCrazy I have adhd and asperger syndrome and was wondering abou~
2 ADHD And Aspergers MyNameIsCrazy 5021 https://www.healthboards.com/boards/asperge~ Dragonfly Win~ Hi there,My son has both, I have Inattentive ADHD and un~
3 ADHD And Aspergers MyNameIsCrazy 5021 https://www.healthboards.com/boards/asperge~ DuckyBaby03 Hello, I understand what your going through. I also have~
4 Adult Pants Pooping~ poopypants21 1705 https://www.healthboards.com/boards/asperge~ poopypants21 I am a 42 year old male with Asperger's Syndrome and occ~
5 Adult Pants Pooping~ poopypants21 1705 https://www.healthboards.com/boards/asperge~ 7ash7 Hi, to help answer your question, do you conciously and/~
6 Adult Pants Pooping~ poopypants21 1705 https://www.healthboards.com/boards/asperge~ poopypants21 Accidentally. My GF does wear cloth diapers because she ~
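To make the reshaping step above easier to follow, here is a minimal sketch of how list-columns built with map() turn into one row per post after unnest(). The data in `fake_threads` is made up for illustration and stands in for the scraped threads; nothing is downloaded.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Made-up stand-ins for two scraped threads (not real forum data)
fake_threads <- list(
  list(authors = c("alice", "bob"), posts = c("Question?", "Answer.")),
  list(authors = "carol", posts = "Another question?")
)

toy <- tibble(thread = c("Thread A", "Thread B")) %>%
  mutate(
    # each cell holds a character vector: one element per post in the thread
    post_author_id = map(fake_threads, "authors"),
    post = map(fake_threads, "posts")
  ) %>%
  # expand both list-columns in parallel, repeating the thread-level columns
  unnest(cols = c(post_author_id, post))

toy
```

Both list-columns must have the same length within each thread, otherwise unnest() cannot pair authors with posts.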
As for your second request, I am not sure how you accessed the "About me" page on the website.
Here is code that scrapes the thread titles, thread links, poster IDs and posts from the forum. However, it is important to note that:
The code will run for a long time because there is A LOT to scrape! For this reason, I only scrape the first page; to scrape everything, use page_urls instead of page_urls[1] in the map_chr() call that builds page_htmls.
There are often comments under posts in each thread, and those are not scraped here.
library(dplyr)
library(rvest)
library(purrr)
library(RCurl)
library(stringr)
library(tidyr)
# Estimate the number of pages on the forum by dividing the total count shown in the forum title by 20 (assuming 20 threads per page)
page1_html <- getURL("https://www.medhelp.org/forums/Aspergers-Syndrome/show/191?page=1")
n_pages <- page1_html %>%
read_html() %>%
html_node("div.forum_title") %>%
html_text() %>%
str_extract_all("\\d+") %>%
flatten_chr() %>%
as.numeric() %>%
`[`(3) %>%
{ceiling(. / 20)} # round up so that a partially filled last page is still counted
# Get all thread titles and thread links
page_urls <- paste0("https://www.medhelp.org/forums/Aspergers-Syndrome/show/191?page=", seq_len(n_pages))
page_htmls <- map_chr(page_urls[1], getURL) # use page_urls instead of page_urls[1] if you want to scrape everything!
scrape_thread_titles <- function(html){
read_html(html) %>%
html_nodes(".subj_title a") %>%
html_text()
}
scrape_thread_links <- function(html){
read_html(html) %>%
html_nodes(".subj_title a") %>%
html_attr("href") %>%
paste0("https://www.medhelp.org", .)
}
thread_titles <- map(page_htmls, scrape_thread_titles) %>%
discard(~ length(.x) == 0)
correct_n_pages <- length(thread_titles)
thread_titles <- thread_titles %>%
flatten_chr()
thread_links <- map(page_htmls, scrape_thread_links) %>%
`[`(seq_len(correct_n_pages)) %>%
flatten_chr()
master_data <- tibble(thread_titles, thread_links)
# Scrape all thread posts and poster's IDs
thread_htmls <- map_chr(master_data$thread_links, getURL)
html <- thread_htmls[1]
link <- master_data$thread_links[1]
scrape_poster_ids <- function(html){
read_html(html) %>%
html_nodes(css = "span span") %>%
html_text()
}
scrape_posts <- function(html){
read_html(html) %>%
html_nodes(".resp_body , #subject_msg") %>%
html_text() %>%
str_replace_all("\r|\n", "") %>%
str_trim()
}
master_data <- master_data %>%
mutate(
poster_ids = map(thread_htmls, scrape_poster_ids),
posts = map(thread_htmls, scrape_posts)
) %>%
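unnest(cols = c(poster_ids, posts)) # name the list-columns explicitly; a bare unnest() is deprecated in tidyr >= 1.0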
head(master_data, 15)
thread_titles thread_links poster_ids posts
<chr> <chr> <chr> <chr>
1 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ LearningGF My boyfriend has Asperger's Sydrome. If he gets too confused, uncomfortable or hurt. ~
2 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ MJIthewriter When I shut down it's feeling overwhelmed. imagine if you were thrown out in a hughw~
3 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ Sally44 I have a son who will be 8 in February. When he gets overstimulated, or his expectat~
4 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ MaryannesMom "My Aspie husband would go through cycles, every couple of months he would need to be~
5 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ MJIthewriter Also headaches seem to trigger shutdowns. I had a bad one yesterday. Though the heada~
6 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ SueNYC "Though I would say that my husband definitely does not have Asperger's, he definitel~
7 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ teburgan hi Sue, I wanted to let you know I u derstand. I should never have married my husban~
8 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ ryans93 "I have had various shut downs. Our minds simply cannot comprehend or deal with the s~
9 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ nbarslou "My boyfriend of 9 months told me an old girlfriend said he had aspergers. My comment~
10 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ Debraydebor~ "So happy to read your post. I have been desperate for more information to help me in~
11 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ kristlep I saw that its been awhile since you made this post um are you still with him because~
12 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ MadMaddox999 "kristlep,\" ... it hurts a hole lot because when we first got together he was out go~
13 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ RaeMinKai "hello there, i have been married to my husband for close to 7 years and we have 3 ki~
14 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ aerosmich My boyfriend has asperger's and we have been living together for just about 9 months.~
15 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ RUNNINGCATS how long does a shut down last for, if the person works with NT'smy friend started t~
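If you do scrape every page, the number of requests grows quickly, so it may be worth pausing between them. The following is only a sketch: `fetch_politely()` is a hypothetical helper (not part of RCurl or purrr), and the one-second default delay is an arbitrary assumption rather than a documented limit of the site. The demonstration uses a stand-in fetcher so it runs offline.

```r
library(purrr)

# Hypothetical helper: fetch each URL once, pausing before every request.
# `fetch` defaults to RCurl::getURL (as used above); `delay` is in seconds.
fetch_politely <- function(urls, fetch = RCurl::getURL, delay = 1) {
  map_chr(urls, function(u) {
    Sys.sleep(delay)
    fetch(u)
  })
}

# Offline demonstration with a stand-in fetcher, so nothing is downloaded
fake_fetch <- function(u) paste0("<html><body>", u, "</body></html>")
pages <- fetch_politely(c("page-1", "page-2"), fetch = fake_fetch, delay = 0)
length(pages)
```

With the real fetcher, the calls above would become something like fetch_politely(page_urls) and fetch_politely(master_data$thread_links).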
As for the "About me" page, I am not sure exactly what you want to pull for that.
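On the NA part of the question, one approach that might work is to use html_node() (singular) instead of html_nodes(): it returns a missing node when nothing matches, and html_text() turns that into NA. This is a minimal sketch on made-up HTML. The helper name `scrape_about_me()` and the `.about_me` selector are placeholders I invented; the real selector on the user pages would have to be looked up.

```r
library(rvest)

# Hypothetical helper; ".about_me" is a placeholder selector, not the real site markup
scrape_about_me <- function(html, css = ".about_me") {
  read_html(html) %>%
    html_node(css) %>%      # singular: yields a missing node when nothing matches
    html_text(trim = TRUE)  # a missing node becomes NA_character_
}

# Made-up profile pages: one with an "About me" block, one without
with_about <- "<html><body><div class='about_me'> I like trains. </div></body></html>"
without_about <- "<html><body><p>No profile text here.</p></body></html>"

about_me <- purrr::map_chr(c(with_about, without_about), scrape_about_me)
about_me
```

Because every profile yields exactly one value, the result lines up with the vector of users and can be added as a column directly.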
@gueyenono this is great!!! Thanks!
With this code as a starting point, I'll try to work out how to include the "About me" info myself, and I'll create a new post if I can't manage it. I'll mark this one as solved. Thanks a lot!