Hi everyone, I am posting this on behalf of a student of mine who is currently located in China. He is having trouble scraping American websites because of the firewall there. He has a VPN to bypass the firewall, but when he tries to scrape through the VPN, the request times out. He is using rvest, and his VPN client is called "Clash for Windows". I am not very familiar with how VPNs, proxies, and networks work, especially in conjunction with web scraping, so I would really appreciate any insight.
He was not able to create a formal reprex, but here is the code used and the error output:
> library(rvest)
> url <- "https://trends.google.com/trends/?geo=US"
> wiki <- read_html(url)
Error in open.connection(x, "rb") :
Timeout was reached: [trends.google.com] Connection timed out after 10000 milliseconds
In the example above he is trying to scrape Google Trends, but he has the same problem with other websites, including Wikipedia.
Here is his sessionInfo():
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=Chinese (Simplified)_China.936
#> [2] LC_CTYPE=Chinese (Simplified)_China.936
#> [3] LC_MONETARY=Chinese (Simplified)_China.936
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=Chinese (Simplified)_China.936
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] reprex_1.0.0 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.3
#> [5] purrr_0.3.4 readr_1.4.0 tidyr_1.1.2 tibble_3.0.5
#> [9] ggplot2_3.3.3 tidyverse_1.3.0 rvest_0.3.6 xml2_1.3.2
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.6 cellranger_1.1.0 pillar_1.4.7 compiler_4.0.3
#> [5] dbplyr_2.0.0 highr_0.8 tools_4.0.3 digest_0.6.27
#> [9] lubridate_1.7.9.2 jsonlite_1.7.2 gtable_0.3.0 evaluate_0.14
#> [13] lifecycle_0.2.0 pkgconfig_2.0.3 rlang_0.4.10 cli_2.2.0
#> [17] DBI_1.1.1 rstudioapi_0.13 curl_4.3 yaml_2.2.1
#> [21] haven_2.3.1 xfun_0.20 withr_2.4.0 styler_1.3.2
#> [25] httr_1.4.2 knitr_1.30 hms_1.0.0 generics_0.1.0
#> [29] fs_1.5.0 vctrs_0.3.6 grid_4.0.3 tidyselect_1.1.0
#> [33] glue_1.4.2 R6_2.5.0 fansi_0.4.2 readxl_1.3.1
#> [37] rmarkdown_2.6 modelr_0.1.8 magrittr_2.0.1 scales_1.1.1
#> [41] backports_1.2.0 ellipsis_0.3.1 htmltools_0.5.1 assertthat_0.2.1
#> [45] colorspace_2.0-0 stringi_1.5.3 munsell_0.5.0 broom_0.7.4
#> [49] crayon_1.3.4
Here is the info from Sys.getenv():
Sys.getenv()
#> AGSDESKTOPJAVA C:\Program Files (x86)\ArcGIS\Desktop10.6\
#> ALLUSERSPROFILE C:\ProgramData
#> APPDATA C:\Users\Raymond\AppData\Roaming
#> BESIEGE_GAME_ASSEMBLIES
#> D:/Games/SteamLibrary/steamapps/common/Besiege/Besiege_Data\Managed/
#> BESIEGE_UNITY_ASSEMBLIES
#> D:/Games/SteamLibrary/steamapps/common/Besiege/Besiege_Data\Managed/
#> CLICOLOR_FORCE 1
#> CommonProgramFiles C:\Program Files\Common Files
#> CommonProgramFiles(x86)
#> C:\Program Files (x86)\Common Files
#> CommonProgramW6432 C:\Program Files\Common Files
#> COMPUTERNAME DESKTOP-094HOAL
#> ComSpec C:\WINDOWS\system32\cmd.exe
#> CYGWIN nodosfilewarning
#> DISPLAY :0
#> DriverData C:\Windows\System32\Drivers\DriverData
#> FPS_BROWSER_APP_PROFILE_STRING
#> Internet Explorer
#> FPS_BROWSER_USER_PROFILE_STRING
#> Default
#> GFORTRAN_STDERR_UNIT -1
#> GFORTRAN_STDOUT_UNIT -1
#> GIT_ASKPASS rpostback-askpass
#> GOOGLE_API_KEY no
#> GOOGLE_DEFAULT_CLIENT_ID
#> no
#> GOOGLE_DEFAULT_CLIENT_SECRET
#> no
#> HOME C:\Users\Raymond\Documents
#> HOMEDRIVE C:
#> HOMEPATH \Users\Raymond
#> LOCALAPPDATA C:\Users\Raymond\AppData\Local
#> LOGONSERVER \\DESKTOP-094HOAL
#> MPLENGINE tkAgg
#> MSYS2_ENV_CONV_EXCL R_ARCH
#> NUMBER_OF_PROCESSORS 4
#> NVIDIAWHITELISTED 0x01
#> OneDrive C:\Users\Raymond\OneDrive
#> OS Windows_NT
#> PATH D:\R-4.0.3\bin\x64;C:\Program Files
#> (x86)\Common
#> Files\Oracle\Java\javapath;C:\Program Files
#> (x86)\Intel\iCLS Client\;C:\Program
#> Files\Intel\iCLS
#> Client\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program
#> Files (x86)\Intel\Intel(R) Management Engine
#> Components\DAL;C:\Program Files\Intel\Intel(R)
#> Management Engine Components\DAL;C:\Program
#> Files (x86)\Intel\Intel(R) Management Engine
#> Components\IPT;C:\Program Files\Intel\Intel(R)
#> Management Engine Components\IPT;C:\Program
#> Files (x86)\NVIDIA
#> Corporation\PhysX\Common;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Python27;C:\WINDOWS\System32\OpenSSH\;D:\anal
#> application\phantomjs-2.1.1-windows\bin;D:\anal
#> application\selenium;C:\Program Files\Mozilla
#> Firefox;C:\Program Files
#> (x86)\Java\jre1.8.0_181\bin;C:\Program
#> Files\Microsoft VS Code\bin;C:\Program
#> Files\NVIDIA Corporation\NVIDIA
#> NvDLISR;D:\Git\cmd;C:\Users\Raymond\AppData\Local\Microsoft\WindowsApps;D:\bin\;D:\wind\bin\;
#> PATHEXT .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
#> PROCESSOR_ARCHITECTURE
#> AMD64
#> PROCESSOR_IDENTIFIER Intel64 Family 6 Model 78 Stepping 3,
#> GenuineIntel
#> PROCESSOR_LEVEL 6
#> PROCESSOR_REVISION 4e03
#> PROCESSX_PSWJOXQBNTYY_1615860287
#> YES
#> ProgramData C:\ProgramData
#> ProgramFiles C:\Program Files
#> ProgramFiles(x86) C:\Program Files (x86)
#> ProgramW6432 C:\Program Files
#> PSModulePath C:\Program
#> Files\WindowsPowerShell\Modules;C:\WINDOWS\system32\WindowsPowerShell\v1.0\Modules
#> PUBLIC C:\Users\Public
#> QT_D3DCREATE_MULTITHREADED
#> 1
#> R_ARCH /x64
#> R_BROWSER false
#> R_COMPILED_BY gcc 8.3.0
#> R_DOC_DIR D:/R-4.0.3/doc
#> R_HOME D:/R-4.0.3
#> R_LIBS_USER C:/Users/Raymond/Documents/R/win-library/4.0
#> R_PDFVIEWER false
#> R_USER C:/Users/Raymond/Documents
#> RMARKDOWN_MATHJAX_PATH
#> D:/RStudio/resources/mathjax-27
#> RS_LOCAL_PEER \\.\pipe\42860-rsession
#> RS_RPOSTBACK_PATH D:/RStudio/bin/rpostback
#> RS_SHARED_SECRET 63341846741
#> RSTUDIO 1
#> RSTUDIO_CONSOLE_COLOR 256
#> RSTUDIO_CONSOLE_WIDTH 62
#> RSTUDIO_MSYS_SSH D:/RStudio/bin/msys-ssh-1000-18
#> RSTUDIO_PANDOC D:/RStudio/bin/pandoc
#> RSTUDIO_PROGRAM_MODE desktop
#> RSTUDIO_SESSION_PORT 42860
#> RSTUDIO_USER_IDENTITY Raymond
#> RSTUDIO_WINUTILS D:/RStudio/bin/winutils
#> SESSIONNAME Console
#> SHIM_MCCOMPAT 0x810000001
#> SSH_ASKPASS rpostback-askpass
#> SynaProgDir Synaptics\SynTP
#> SystemDrive C:
#> SystemRoot C:\WINDOWS
#> TEMP C:\Users\Raymond\AppData\Local\Temp
#> TERM xterm-256color
#> TMP C:\Users\Raymond\AppData\Local\Temp
#> TMPDIR C:\Users\Public\Documents\Wondershare\CreatorTemp
#> TZDIR D:/R-4.0.3/share/zoneinfo
#> USERDOMAIN DESKTOP-094HOAL
#> USERDOMAIN_ROAMINGPROFILE
#> DESKTOP-094HOAL
#> USERNAME Raymond
#> USERPROFILE C:\Users\Raymond
#> windir C:\WINDOWS
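One thing worth checking, based on the Sys.getenv() output above: no http_proxy or https_proxy variable appears in it, so R may not be routing its requests through the VPN client at all. The following is a minimal sketch of how that could be tested, not a confirmed fix. It assumes the VPN client exposes a local HTTP proxy; the address 127.0.0.1:7890, the 30-second timeout, and the helper name fetch_page are placeholders chosen for illustration and would need to be replaced with the values shown in the client's settings.

```r
# Hypothetical local proxy endpoint (an assumption, not taken from the post);
# replace the host and port with whatever the VPN client reports.
proxy_url <- "http://127.0.0.1:7890"

# Show which proxy variables R currently sees (empty strings mean "not set").
Sys.getenv(c("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"))

# Set the proxy for this R session only; libcurl, which sits underneath
# xml2/rvest and httr, reads these environment variables.
Sys.setenv(http_proxy = proxy_url, https_proxy = proxy_url)

# Hypothetical helper: fetch a page through httr with a longer timeout than
# the 10 seconds in the error message, then parse the HTML with xml2.
fetch_page <- function(url, seconds = 30) {
  resp <- httr::GET(url, httr::timeout(seconds))
  httr::stop_for_status(resp)  # turn HTTP errors into R errors
  xml2::read_html(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Example call (commented out because it needs network access):
# page <- fetch_page("https://trends.google.com/trends/?geo=US")
```

If the request still times out with the variables set, that would suggest the proxy address or port is wrong, or that the client is not running in a mode that accepts local HTTP proxy connections.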
I'm happy to try to provide other context. If there is somewhere more relevant to ask, please let me know that as well.