I'm following the post that gueyenono shared here, but it isn't working. I'm too much of a novice to figure it out on my own. Would it be possible to run this again and adapt it to the current structure of the forum?
# Load packages
library(tidyverse) # already attaches tibble, tidyr, dplyr, stringr and purrr
library(rvest)
library(lubridate)
scrape_page_info <- function(page_url){
  html <- read_html(page_url)

  # Topic titles on the listing page
  topics <- html %>%
    html_nodes(".title a") %>%
    html_text()

  # Relative links to each topic, made absolute
  topic_urls <- html %>%
    html_nodes(".title a") %>%
    html_attr(name = "href") %>%
    paste0("https://forums.tesla.com", .)

  # Creation info of the form "<date> by <author>"
  created_info <- html %>%
    html_nodes(".created") %>%
    html_text() %>%
    str_squish()

  tibble(topics, created_info, topic_urls) %>%
    separate(col = created_info, into = c("date_of_creation", "thread_author"), sep = " by ")
}
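# A minimal sanity check for the function above -- this assumes the
# ".title a" and ".created" selectors still match the current forum
# markup, which may be exactly what changed:
first_page <- scrape_page_info("https://forums.tesla.com/categories/tesla-model-3")
glimpse(first_page) # expect: topics, date_of_creation, thread_author, topic_urls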
scrape_thread_info <- function(thread_html){
  # Turn each post into one row of author | date | content
  thread_html %>%
    html_nodes(".clearfix") %>%
    html_text() %>%
    str_squish() %>%
    str_replace(pattern = "^(.*?(\\|.*?){1})\\|", replacement = "\\1") %>% # Remove second "|"
    str_replace(pattern = "(^.*?\\d{4})", replacement = "\\1 \\|") %>%     # Add "|" after first date
    enframe(name = NULL, value = "content") %>%
    separate(col = "content", into = c("author", "date", "content"), sep = "\\|")
}
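# A quick single-thread test, reusing a topic URL from the check above
# (assumes the ".clearfix" selector still matches individual posts):
one_thread_html <- read_html(first_page$topic_urls[1])
scrape_thread_info(one_thread_html)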
scrape_thread <- function(url){
  html <- read_html(url)

  # Number of comment pages (NA when the thread has a single page)
  n_pages <- html %>%
    html_node("#article_content > div.panel-pane.pane-node-comments > div > div.item-list > ul > li.pager-last.last > a") %>%
    html_attr("href") %>%
    str_extract(pattern = "(\\d+)$") %>%
    as.numeric()

  # Posts on the first page
  df_page_1 <- html %>%
    html_nodes(".clearfix") %>%
    html_text() %>%
    str_squish() %>%
    str_replace(pattern = "^(.*?(\\|.*?){1})\\|", replacement = "\\1") %>% # Remove second "|"
    str_replace(pattern = "(^.*?\\d{4})", replacement = "\\1 \\|") %>%     # Add "|" after first date
    enframe(name = NULL, value = "content") %>%
    separate(col = "content", into = c("author", "date", "content"), sep = "\\|")

  # The opening post is laid out differently, so patch its author, date and content
  df_page_1$author[1] <- html %>%
    html_node(".username") %>%
    html_text()

  df_page_1$date[1] <- html %>%
    html_node(".submitted") %>%
    html_text() %>%
    str_squish() %>%
    str_extract(pattern = "\\w+\\s\\d+\\W\\s\\d{4}$")

  df_page_1$content[1] <- html %>%
    html_node(".clearfix") %>%
    html_text() %>%
    str_squish() %>%
    str_replace(pattern = "^.*\\d+\\,\\s\\d{4}\\s", replacement = "")

  # Scrape the remaining pages, if any (they follow the pattern <url>p1, <url>p2, ...)
  extra_page_data <- NULL
  if(!is.na(n_pages)){
    extra_urls <- paste0(url, "p", seq_len(n_pages - 1))
    other_htmls <- lapply(extra_urls, read_html)
    df <- lapply(other_htmls, function(x){
      scrape_thread_info(x)[-1, ] # drop the first row, already captured above
    })
    extra_page_data <- do.call(rbind, df)
  }

  rbind(df_page_1, extra_page_data)
}
scrape_thread_possibly <- possibly(scrape_thread, otherwise = NA)
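# possibly() returns NA instead of aborting the whole run when one thread
# fails. A one-thread test before the full scrape:
scrape_thread_possibly(first_page$topic_urls[1])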
# Actual scraping
page_urls <- c(
  "https://forums.tesla.com/categories/tesla-model-3",
  paste0("https://forums.tesla.com/categories/tesla-model-3/p", 1:2)
)

master_data <- map_dfr(page_urls[1:2], function(url){
  scrape_page_info(url) %>%
    mutate(forum_data = map(topic_urls, scrape_thread_possibly))
})
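# Sketch for flattening the nested results into one row per post, dropping
# threads where scrape_thread_possibly() returned NA instead of a data frame:
all_posts <- master_data %>%
  filter(map_lgl(forum_data, is.data.frame)) %>%
  unnest(cols = forum_data)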