Problems when scraping data using R

cao_xueyan · December 30, 2023, 3:39am

I am trying to scrape article information (title, authors, abstract), but I run into a problem when scraping the abstracts. I have 261 web links, but only 19 abstracts are retrieved before an error occurs. Can anyone help? Thanks!

The following is the code:

read_html(url[1], encoding = 'utf-8') %>%
  html_nodes('#search-results > section.search-results-list > div.search-results-chunks > div > article:nth-child(2) > div.docsum-wrap > div.docsum-content > a') %>%
  html_text(trim = TRUE)

read_html(url[1], encoding = 'utf-8') %>%
  html_nodes('.docsum-title') %>%
  html_text(trim = TRUE)

title <- c()
for (i in url) {
  title <- c(title, read_html(i, encoding = 'utf-8') %>% html_nodes(".docsum-title") %>% html_text(trim = TRUE))
}

# check numbers

length(title)

author <- c()
for (i in url) {
  author <- c(author, read_html(i, encoding = 'utf-8') %>%
    html_nodes('.full-authors') %>%
    html_text())
}
length(author)

web <- c()
for (i in url) {
  web <- c(web, read_html(i, encoding = 'utf-8') %>% html_nodes('.docsum-title') %>% html_attr(name = 'href'))
}
length(web)

web_link <- paste('https://pubmed.ncbi.nlm.nih.gov',web,sep = '')
web_link

abstract <- list()
for (i in web_link) {
  abstract[[i]] <- read_html(i, encoding = 'utf-8') %>% html_nodes("#eng-abstract > p") %>% html_text(trim = TRUE)
}

Error in open.connection(x, "rb") : HTTP error 404.

joesho112358 · December 30, 2023, 4:06am

Hi,

I'm not sure about this because I can't see the full picture, but given that the error is a 404:
Error in open.connection(x, "rb") : HTTP error 404.
I would guess that web_link <- paste('https://pubmed.ncbi.nlm.nih.gov',web,sep = '') is not forming the URLs properly and there may be a / out of place or missing. Do you get any 404 Not Found errors when you open the URLs in web_link manually?
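
One way to rule out a misplaced or missing slash is to build the links defensively and flag any that do not have the expected shape. This is only a rough sketch: build_links is a hypothetical helper, the href values are made-up placeholders, and the pattern assumes that article links look like /<digits>/.

```r
# Hypothetical href values, standing in for what html_attr('href') returned
web <- c("/12345678/", "23456789/", " /34567890/")

base_url <- "https://pubmed.ncbi.nlm.nih.gov"

# Join base and href with exactly one "/" between them
build_links <- function(href, base = base_url) {
  href <- trimws(href)                 # drop stray whitespace
  base <- sub("/+$", "", base)         # remove trailing slashes from the base
  href <- sub("^/+", "", href)         # remove leading slashes from the href
  paste0(base, "/", href)
}

web_link <- build_links(web)

# Assumed shape of an article URL; anything else is worth inspecting by hand
pattern <- "^https://pubmed\\.ncbi\\.nlm\\.nih\\.gov/[0-9]+/?$"
suspicious <- web_link[!grepl(pattern, web_link)]
print(suspicious)
```

If suspicious comes back empty, the URLs are probably well formed and the 404 has another cause.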

cao_xueyan · December 30, 2023, 4:08am

I tried opening the URLs in web_link manually, and every link works.

cao_xueyan · December 30, 2023, 4:49am

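The following code works. It wraps each request in tryCatch, so a link that fails produces a warning instead of stopping the loop, and it pauses for 2 seconds after each successful request: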

abstract <- list()

for (i in web_link) {
  tryCatch({
    page_content <- read_html(i, encoding = 'utf-8')
    abstracts <- page_content %>% html_nodes("#eng-abstract > p") %>% html_text(trim = TRUE)
    abstract[[i]] <- abstracts
    Sys.sleep(2)
  }, error = function(e) {
    warning(paste("Failed to retrieve data from", i, "Error:", conditionMessage(e)))
  })
}
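
The loop above skips any link that fails, so some abstracts may still be missing. Here is a possible extension, as a sketch only, that retries each failing link a few times and records the ones that never succeed. with_retry and scrape_abstracts are hypothetical helper names, and the retry count and pause length are arbitrary assumptions. The thread does not establish why the 404 occurred, so retrying may or may not recover those links.

```r
# Call f(); on error, wait `pause` seconds and try again, up to `tries` times.
# Returns NULL (with a warning) if every attempt fails.
with_retry <- function(f, tries = 3, pause = 2) {
  for (attempt in seq_len(tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt < tries) Sys.sleep(pause)
  }
  warning("Giving up after ", tries, " attempts: ", conditionMessage(result))
  NULL
}

# Requires the rvest package; same selector as in the thread.
scrape_abstracts <- function(web_link, pause = 2) {
  abstract <- list()
  failed <- character(0)
  for (i in web_link) {
    res <- with_retry(function() {
      page <- rvest::read_html(i, encoding = "utf-8")
      nodes <- rvest::html_nodes(page, "#eng-abstract > p")
      rvest::html_text(nodes, trim = TRUE)
    }, pause = pause)
    if (is.null(res)) {
      failed <- c(failed, i)      # remember links that never succeeded
    } else {
      abstract[[i]] <- res        # may be character(0) if the page has no abstract
    }
    Sys.sleep(pause)              # pause between requests
  }
  list(abstract = abstract, failed = failed)
}
```

Calling out <- scrape_abstracts(web_link) would then give the abstracts in out$abstract and the links to recheck by hand in out$failed.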
