



Hello @tamara,

Welcome to the wonderful RStudio Community.
The code below should help you scrape the www.essentialbaby.com.au forum. Be warned, however, that I use slightly different column names in my code, so if anything is unclear, do not hesitate to let me know.
It's important that you know this: you want to scrape not only the first 20 threads/posts, but all of them. This can be problematic, as it might take a really long time to run (there are 166 pages with 20 threads on each page). It is, however, not an impossible task. For this reason, I created a function, scrape_ebaby_bypage(), in which you specify the pages to scrape through its only argument, page_numbers. For example, scrape_ebaby_bypage(page_numbers = 1:5) scrapes the first 5 pages, so calling the function with page_numbers = 1:166 scrapes all pages.
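Since scraping all 166 pages in one call can take a long time, one option is to work through the pages in small batches. The sketch below is only an illustration: scrape_in_batches() and its arguments batch_size and pause are hypothetical names, and scrape_fun stands in for any function that takes a vector of page numbers and returns a tibble.

```r
library(purrr)
library(dplyr)

# Hypothetical wrapper (not part of the original code).
# `scrape_fun` is any function taking page numbers and returning a tibble.
scrape_in_batches <- function(page_numbers, scrape_fun, batch_size = 10, pause = 1) {
  # Split the page numbers into consecutive groups of at most `batch_size`
  batches <- split(page_numbers, ceiling(seq_along(page_numbers) / batch_size))

  # A batch that throws an error returns NULL instead of stopping the whole run
  safe_scrape <- possibly(scrape_fun, otherwise = NULL)

  results <- map(batches, function(pages) {
    out <- safe_scrape(pages)
    Sys.sleep(pause)  # short pause between batches to go easy on the server
    out
  })

  # Drop failed (NULL) batches and stack the rest
  bind_rows(compact(results))
}
```

Under these assumptions, a call such as scrape_in_batches(1:166, scrape_ebaby_bypage) would cover all pages while skipping any batch that fails.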
Your code returns only the first 20 threads because it reads only the first page. My code builds the links to the other pages so that they can be scraped too.
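To make the link-building idea concrete, here is a minimal sketch. The helper name build_page_urls() is hypothetical; the base URL and suffix are the ones used in the main function in this post, and the sketch assumes each page holds 20 threads, so page n starts at offset (n - 1) * 20.

```r
# Base URL of the forum and the suffix that precedes the thread offset
base_url <- "http://www.essentialbaby.com.au/forums/index.php?/forum/232-sleeping/"
suffix <- "page__prune_day__100__sort_by__Z-A__sort_key__last_post__topicfilter__all__st__"

# Hypothetical helper (not part of the original code):
# page 1 is the bare forum URL; page n appends the offset (n - 1) * 20
build_page_urls <- function(page_numbers) {
  offsets <- (page_numbers - 1) * 20
  ifelse(offsets == 0, base_url, paste0(base_url, suffix, offsets))
}

build_page_urls(1:3)
```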
The first function below, scrape_thread_data(), is a helper that is called inside the main scrape_ebaby_bypage() function. So run both function definitions, in that order, before using the latter for your scraping needs.
Finally, the first two pages are scraped at the end of the code. The output of the main function is nested, but you can easily unnest it, as the code also shows.
- Functions to scrape the data
# Load required packages
pacman::p_load(rvest, dplyr, stringr, purrr, lubridate, tibble, tidyr)
# Secondary custom function which scrapes data in each thread
# Input:
# - thread_link <chr>: link of the thread to scrape
# Output: tibble with the following columns:
# - participant <chr>: name of the poster
# - post_date <dttm>: date of the post
# - post <chr>: content of the post
scrape_thread_data <- function(thread_link){
  thread_html <- read_html(thread_link)

  # Names of the posters
  participant <- thread_html %>%
    html_nodes(css = ".guest , .vcard") %>%
    html_text() %>%
    str_squish()

  # Post timestamps: split the text into its parts and rebuild a date-time.
  # Note: time_of_day is separated out but not used when building the date-time.
  post_date <- thread_html %>%
    html_nodes(css = ".published") %>%
    html_text() %>%
    enframe(name = "id") %>%
    mutate(value = str_replace_all(string = value, pattern = " -", replacement = "")) %>%
    separate(col = value, into = c("day", "month", "year", "time", "time_of_day"), sep = " ") %>%
    separate(col = time, into = c("hour", "min"), sep = ":") %>%
    mutate_at(vars(day, year, hour, min), as.integer) %>%
    mutate(month = match(month, month.name)) %>%
    transmute(time = ISOdatetime(year = year, month = month, day = day, hour = hour, min = min, sec = 0)) %>%
    pull()

  # Content of each post
  post <- thread_html %>%
    html_nodes(css = ".entry-content") %>%
    html_text() %>%
    str_trim()

  tibble(participant, post_date, post)
}
# Main function which creates a master data set
# Input:
# - page_numbers <numeric>: numeric vector specifying the pages to scrape (default is 1)
# Output is a tibble with the following columns:
# - thread_creator <chr>: name of the creator of the thread
# - date <date>: date of creation of thread
# - thread_title <chr>: title of the thread
# - thread_url <chr>: Link of the thread (serves as input to the scrape_thread_data() function above)
# - thread_data <list>: a list column containing the output of the scrape_thread_data() function for each thread
scrape_ebaby_bypage <- function(page_numbers = 1){
  # URLs of all 166 forum pages: page 1 has no offset, later pages use offsets 20, 40, ...
  page_urls <- c("http://www.essentialbaby.com.au/forums/index.php?/forum/232-sleeping/",
                 paste0("http://www.essentialbaby.com.au/forums/index.php?/forum/232-sleeping/page__prune_day__100__sort_by__Z-A__sort_key__last_post__topicfilter__all__st__", 1:165 * 20))
  urls <- page_urls[page_numbers]
  htmls <- map(urls, read_html)

  # Scrape thread links
  thread_url <- map(htmls, function(html){
    html %>%
      html_nodes(css = ".topic_title") %>%
      html_attr(name = "href")
  }) %>%
    flatten_chr()

  # Scrape thread titles
  thread_title <- map(htmls, function(html){
    html %>%
      html_nodes(css = ".topic_title") %>%
      html_text() %>%
      str_trim()
  }) %>%
    flatten_chr()

  # Scrape thread creators and creation dates
  thread_creator_and_date <- map_dfr(htmls, function(html){
    html %>%
      html_nodes(css = ".lighter") %>%
      html_text() %>%
      str_trim() %>%
      enframe(name = "id") %>%
      mutate(value = str_replace_all(string = value, pattern = "Started by |\n|\t", replacement = "")) %>%
      separate(col = value, into = c("thread_creator", "date"), sep = ", ") %>%
      mutate(date = dmy(date)) %>%
      select(-id)
  })

  # Combine the thread-level columns and scrape the posts of every thread
  master_data <- bind_cols(thread_creator_and_date, thread_title = thread_title, thread_url = thread_url) %>%
    mutate(thread_data = map(thread_url, scrape_thread_data))
  master_data
}
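One thing to check in scrape_thread_data(): the timestamp is split into day, month, year, time and time_of_day, but time_of_day does not feed into the final date-time. If the forum prints times on a 12-hour clock with an AM/PM marker (an assumption I have not verified), afternoon posts would be stored with morning hours. A minimal sketch of the conversion, using a hypothetical helper to_24h():

```r
# Hypothetical helper (not part of the original code): convert a 12-hour
# clock reading plus an AM/PM marker into a 24-hour hour value.
to_24h <- function(hour, time_of_day) {
  hour <- as.integer(hour)
  is_pm <- toupper(time_of_day) == "PM"
  # 12 AM becomes 0 and 12 PM stays 12; other PM hours get 12 added
  hour %% 12L + ifelse(is_pm, 12L, 0L)
}

to_24h(c(12, 1, 11, 12), c("AM", "AM", "PM", "PM"))
```

If that assumption holds for the site, adding mutate(hour = to_24h(hour, time_of_day)) right after the mutate_at() step would apply the correction; if the site already prints 24-hour times, no change is needed.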
- Using the functions to scrape
dat <- scrape_ebaby_bypage(page_numbers = 1:2)
dat
# A tibble: 40 x 5
thread_creator date thread_title thread_url thread_data
<chr> <date> <chr> <chr> <list>
1 lucky 2 2014-06-03 Sleep Schools (Early Parenting Centres)- members… http://www.essentialbaby.com.au/forums/index.php?/topic/1130252-sleep-sch… <tibble [10 …
2 Shellby 2010-02-25 Control Crying Alternatives http://www.essentialbaby.com.au/forums/index.php?/topic/770816-control-cr… <tibble [2 ×…
3 Shellby 2009-11-08 New Moderator http://www.essentialbaby.com.au/forums/index.php?/topic/736132-new-modera… <tibble [1 ×…
4 .Ally. 2008-06-04 Read this before posting! http://www.essentialbaby.com.au/forums/index.php?/topic/546955-read-this-… <tibble [1 ×…
5 Caribou 2019-06-04 Farewell, Au revoir, Auf Wiedersehen, To Day Sle… http://www.essentialbaby.com.au/forums/index.php?/topic/1204169-farewell-… <tibble [25 …
6 Zeppelina 2019-05-13 8yo and sleep anxiety http://www.essentialbaby.com.au/forums/index.php?/topic/1203750-8yo-and-s… <tibble [8 ×…
7 PandoBox 2019-05-06 I completely ruined her sleep , how do I fix it? http://www.essentialbaby.com.au/forums/index.php?/topic/1203593-i-complet… <tibble [16 …
8 Davidoff-sensei 2019-04-24 4 month old absolutely hates nap/bed time. Screa… http://www.essentialbaby.com.au/forums/index.php?/topic/1203358-4-month-o… <tibble [25 …
9 joeyinthesky 2017-09-02 13mo crazy sleep issues http://www.essentialbaby.com.au/forums/index.php?/topic/1189771-13mo-craz… <tibble [22 …
10 Kattikat 2019-03-21 18 Mo old thinks she's a newborn http://www.essentialbaby.com.au/forums/index.php?/topic/1202716-18-mo-old… <tibble [3 ×…
# … with 30 more rows
unnest(dat, cols = thread_data)
# A tibble: 552 x 7
thread_creator date thread_title thread_url participant post_date post
<chr> <date> <chr> <chr> <chr> <dttm> <chr>
1 lucky 2 2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… lucky 2 2014-06-03 11:06:00 "Hi,\nA thread has been suggested wh…
2 lucky 2 2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… Ellen101 2014-06-05 09:51:00 "One for the neutral camp \nWe recen…
3 lucky 2 2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… Muffintop 2014-06-05 10:22:00 "Neutral again I think.\nWe attended…
4 lucky 2 2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… silverbubb… 2014-07-27 10:12:00 "Amazing, positive results. Have att…
5 lucky 2 2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… libbylu 2014-07-27 10:28:00 Positive - I attended a day stay pro…
6 lucky 2 2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… BeakyHoney… 2014-08-17 08:00:00 "These replies are great. \nI have a…
7 lucky 2 2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… RockLobster 2014-08-18 09:59:00 "FERALfoxgirls, on 17 August 2014 - …
8 lucky 2 2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… Charli73 2014-08-18 10:13:00 "I was in a public melbourne sleep s…
9 lucky 2 2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… Natttmumm 2014-08-18 10:29:00 "We went to Tresillian in Sydney qui…
10 lucky 2 2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… nup 2016-04-21 06:40:00 "A very strong negative from me on a…
# … with 542 more rows
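If the nested structure is unfamiliar, the toy example below shows what unnesting does without touching the website. The data are made up for illustration; only the column names thread_title, thread_data, participant and post mirror the real output.

```r
library(tibble)
library(tidyr)

# Made-up nested data: one row per thread, with the posts stored in a list column
toy <- tibble(
  thread_title = c("Thread A", "Thread B"),
  thread_data = list(
    tibble(participant = c("ann", "bob"), post = c("first post", "a reply")),
    tibble(participant = "cat", post = "only post")
  )
)

# One row per post; the thread-level columns are repeated for each post
flat <- unnest(toy, cols = thread_data)
flat
```

Naming the list column through cols keeps the call explicit, which recent versions of tidyr expect.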
Hope this helps, and, once again, welcome to the community!