I am trying to fetch one website link through web scraping in RStudio Cloud with the code shared below. But the result in the console shows the website name twice, each time with NA appended. How can I remove this NA from the website name?
install.packages("rvest")
install.packages("dplyr")
library(rvest)
library(dplyr)
link = "https://tu■■■■a.info/"
page = read_html(link)
website_links = page %>% html_nodes("h1")%>% html_attr("href") %>% paste("http://www.tu■■■■a.info",.,sep="")
website_links
> website_links
[1] "http://www.tu■■■■a.infoNA" "http://www.tu■■■■a.infoNA"
Welcome to the community @Rekha_Verma! I am unable to see the website you are trying to scrape. I'm not sure if it is blurred intentionally, but can you share the link again?
It looks like page %>% html_nodes("h1") %>% html_attr("href") is returning a vector with two NA values, which is why NA is being added in your paste statement (as shown below).
c(NA, NA) %>% paste("http://www.tu■■■■a.info",.,sep="")
#> [1] "http://www.tu■■■■a.infoNA" "http://www.tu■■■■a.infoNA"
If you are able to share the link, then I/we can troubleshoot further.
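To illustrate the point about paste, here is a minimal base R sketch. The href vector and the example.com base URL are made up for illustration and are not taken from the real site. It shows that paste turns a missing value into the literal text "NA", and that dropping missing values first avoids the stray suffix.

```r
# Hypothetical href vector: one missing value and one relative path.
hrefs <- c(NA, "/about-us/")

# paste() coerces NA to the string "NA" instead of propagating it,
# so the first element ends in "NA".
with_na <- paste("https://example.com", hrefs, sep = "")

# Dropping the missing values before pasting avoids the problem.
valid <- hrefs[!is.na(hrefs)]
full_links <- paste0("https://example.com", valid)
full_links
#> [1] "https://example.com/about-us/"
```

Filtering with !is.na() only hides the symptom, though. If every value is NA, the selector is probably pointing at the wrong elements.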
Hi, thank you for the reply. I am trying to send you the website link: https://tu■■■■a.info/
but again, a few letters are not visible.
Hi, are you after this?
page %>%
html_nodes("a") %>%
html_attr("href")
# [1] "https://tu■■■■a.info"
# [2] "https://www.facebook.com/Tu■■■■a.Meditation.Centre"
# [3] "https://www.youtube.com/user/Tu■■■■aMcLeodGanj"
# [4] "https://tu■■■■a.info/"
# [5] "https://tu■■■■a.info/about-us/"
# [6] "https://tu■■■■a.info/about-us/"
# [7] "https://tu■■■■a.info/about-us/our-spiritual-guides/"
# [8] "https://tu■■■■a.info/about-us/holy-objects-at-tu■■■■a/"
# [9] "https://tu■■■■a.info/about-us/history-of-tu■■■■a/"
# [10] "https://tu■■■■a.info/about-us/board-of-directors/"
I'm assuming that the link is censored here because of sh** being in the domain name.
Thanks; I am new to the R language. Can we still solve the actual problem with a censored link?
I think the censoring is just on this forum. It isn't an issue within R itself.
Ok. I am getting the website's name twice, and the name of the website links up with NA because of the censoring on this forum. Can we still resolve it with the forum issue?
Your code
website_links = page %>% html_nodes("h1")%>% html_attr("href") %>% paste("http://www.tu■■■■a.info",.,sep="")
website_links
doesn't pull the right links: NA is not a link. The h1 elements on the website are headings, not links, so they have no href attribute and html_attr("href") returns NA for each of them:
page %>%
html_nodes("h1")
# {xml_nodeset (2)}
# [1] <h1>Tu■■■■a Meditation Centre</h1>
# [2] <h1>Tu■■■■a Meditation Centre</h1>
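The same behaviour can be reproduced without the real site. This is a rough sketch on a made-up HTML snippet (the markup and the example.com URL are assumptions for illustration only), contrasting what html_attr("href") returns for h1 and a elements, and showing that html_text is the call that retrieves the heading text.

```r
library(rvest)

# Hypothetical stand-in for the real page: a heading with no link, plus one anchor.
html <- "<html><body><h1>Example Centre</h1><a href='https://example.com/about-us/'>About</a></body></html>"
page <- read_html(html)

# <h1> has no href attribute, so html_attr() returns NA for it.
h1_href <- page %>% html_nodes("h1") %>% html_attr("href")

# <a> elements are the ones that carry href.
a_href <- page %>% html_nodes("a") %>% html_attr("href")
a_href
#> [1] "https://example.com/about-us/"

# To get the heading itself, extract its text rather than an attribute.
h1_text <- page %>% html_nodes("h1") %>% html_text()
h1_text
#> [1] "Example Centre"
```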
Great, it works now. I used the code for the links instead of the heading text. Thank you so much.
Or, if you have already pulled a lot of links, you can strip the NA afterwards:
link <- "http://www.tu■■■■a.infoNA"
gsub("NA","",link)
#> [1] "http://www.tu■■■■a.info"
Created on 2023-01-25 with reprex v2.0.2
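One caveat with the gsub approach: the pattern "NA" matches anywhere in the string, so a URL that legitimately contains those two letters would be altered as well. A possible refinement, sketched here with made-up example.com links, is to anchor the pattern to the end of the string with sub and "NA$".

```r
# Hypothetical links; the second contains "NA" as a genuine part of the path.
links <- c("https://example.comNA", "https://example.com/DNA-testNA")

# Unanchored: removes every "NA", including the one inside "DNA".
over_cleaned <- gsub("NA", "", links)
over_cleaned[2]
#> [1] "https://example.com/D-test"

# Anchored: removes only a trailing "NA".
cleaned <- sub("NA$", "", links)
cleaned[2]
#> [1] "https://example.com/DNA-test"
```

Even the anchored version would trim a URL that really ends in "NA", so filtering out missing values before calling paste is probably the safer route when that is an option.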