Solution for feeding ~200 URLs from a data frame into `read_html()` (rvest)

Hello,

My new hobby is using R, and I'm enjoying it. I don't have a CE or CS background, so some of my questions are hard to articulate in web searches. I scraped around 200 URLs that I put into a data frame. I'd like to feed the data frame URLs into rvest one at a time in the style of my code below. My goal is to have a big data frame with all the text that I'm trying to scrape from multiple pages.

pages <- read_html(df_links) %>%
  html_elements("The Elements of Interest") %>%
  html_text()

If I feed 1 page into the code above, with the appropriate elements, I get what I want. I get the text from the website that I can later do word counts on, graph with ggplot2, and make a word cloud.

However, I'm struggling to understand how I can use read_html to go through each successive URL in the data frame I made. After a few hours of searching on Stack Exchange, I'm guessing I need some type of for loop, or to learn lapply. I'm just looking for some other input before I dedicate time to learning either for loops or lapply.

Thanks for any input.

That will continue to be true long after you've reached expert level. Asking the right question is always the hardest part.

You've done the hard part of the coding: from a single URL you have the result you're looking for, an object, pages, that you can work with. Now you want to do the same for the 199 other URLs without repeating yourself by doing it one by one.

I like to use school algebra as the key to posing questions like this: f(x) = y, where x is what you have, in this case the_url, and y is what you want, in this case pages. The function f that did this was the chained operation html_text(html_elements(read_html(the_url))), implemented with the pipe %>%. Now the next step is g(f(x)) = Y, where g is some function, or composite of functions, that applies f to the various values of x to produce Y: some structure to hold all the results, the individual pages y.
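For concreteness, f can be written as a small helper function. This is just a sketch: the "p" selector is a placeholder for whatever elements you're actually after.

library(rvest)

# f: one URL in, one character vector of text out
f <- function(the_url) {
  read_html(the_url) %>%
    html_elements("p") %>%   # placeholder selector; use your elements of interest
    html_text()
}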

Let's start with where to end up, Y. Each y is the return value from html_text(): a character vector with the text of, say, all the <p> elements. A convenient place to collect those is another list, which takes the role of Y.

holder <- vector(mode = "list", length = 200)

There it is: an empty list with room for 200 results.

Next, g: how do we populate holder?

Your instinct to use a for loop or lapply is spot on. We need a g such that g(X) = Y, where X is the whole collection of URLs. In R that could be as simple as lapply(df_links, f), which applies f to each element of X and collects the results. We could do that, but a for loop is simpler to start with.

First, df_links doesn't need to be a data frame. Turn it into a vector.
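For example, if the URLs live in a column named txt (that column name is an assumption, borrowed from the lapply code later in this thread; use whatever yours is called):

# overwrite df_links with just the character vector of URLs,
# so that df_links[i] in the loop below picks out a single URL
df_links <- df_links$txt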

for (i in seq_along(df_links)) {
  holder[[i]] <- read_html(df_links[i]) |>
    html_elements("p") |> # or whatever element
    html_text()
}

This works at all because the results are assigned into holder, an object that lives outside the loop, so they survive after the loop finishes; and it works efficiently because the receiving object is pre-allocated rather than grown one element at a time.
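From there, if you want the single big data frame described in the original post, one possible way to flatten holder is a sketch like this (all_text and the column names are just illustrative):

# one row per scraped element, tagged with the index of the page it came from
all_text <- data.frame(
  page = rep(seq_along(holder), lengths(holder)),
  text = unlist(holder)
)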

Your reply was very thoughtful and pedagogical; I have no words.

The "for loop" format looks intimidating, but I will trust your judgement that I should learn it. I'm going to save your reply in my notes. I will definitely take your advice. Otherwise, I found a parsimonious lapply explanation on stack overflow a few moments before your reply came to my inbox.


For others reading who may have a similar issue, the solution I used was this:

pages <- lapply(df_links$txt, function(x) {
  read_html(x) %>%
    html_text()
})
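Note that this version extracts the text of the whole document. If you only want particular elements, as in the single-page code at the top of the thread, the html_elements() step can stay in (the "p" selector below is just a placeholder):

# same idea, but restricted to the elements of interest
pages <- lapply(df_links$txt, function(x) {
  read_html(x) %>%
    html_elements("p") %>%   # placeholder selector
    html_text()
})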

I had an error pop up when I tried unnesting the tokens with unnest_tokens; the fix was turning pages into a tibble:

pages_tibble <- tibble(txt = pages)

Then I fed pages_tibble into tidytext:

tidy_pages <- pages_tibble %>% 
  unnest_tokens(word, txt, format = "text")

Then I did a simple chart:

tidy_pages %>%
  count(word, sort = TRUE) %>% # top_n() and reorder() need a count column, n
  top_n(50) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab("Words") +
  ylab("Count") +
  coord_flip()

I hope others can benefit from this post. Thank you again, Technocrat

