Reading consecutive digits

So there are a million digits of pi here that I’d like to read into a vector or column, but I’m not sure how to read (say) a CSV file with no delimiter. Can anyone help?

I don't know what will happen if you try to read in one million digits, but I did test read.table() with a file containing just a single number, and it reads it in as a data frame with one row and one column. If the line has no terminating newline, it raises a warning, but it still works.
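A minimal sketch of that test (the file name digits.txt is just an assumption):

```r
# Write a single number with no trailing newline
writeChar("3.14159", "digits.txt", eos = NULL)

# read.table() warns about the incomplete final line but still
# returns a 1-row, 1-column data frame
df <- read.table("digits.txt")
str(df)
```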

I gave this one a try; I didn't get the full million, but about 50,000 it seems:

# Loading the rvest package
library(rvest)

# Specifying the URL of the website to be scraped
url <- ''

# Reading the HTML code from the website
webpage <- read_html(url)

# Selecting the node that holds the digits
pi_xml <- html_nodes(webpage, '#million_pi')

pi_data <- html_text(pi_xml)
substr(pi_data, 1, 100)

I think the difficulty with taking it from the web page is that more lines are loaded as you scroll down, so read_html probably only sees the initial chunk. As a workaround, you can scroll to the end of the page, copy everything to the clipboard, and then (on Windows) try PI <- readClipboard().

If you stored everything in a file (txt or csv), read_file() from the readr package might do the job.
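A sketch of that approach, assuming the digits were saved to pi.txt (the file name is mine):

```r
library(readr)

# read_file() returns the entire file as one string
pi_text <- read_file("pi.txt")

# How many characters did we get?
nchar(pi_text)
```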

I can't seem to tell how many digits I have. It's all one number, rather than a single column of digits, which is what I'm after.

I tried converting it to a string to get its length, but R truncated it. I'm not sure length() works on numbers.
I want to get each digit as a single row so that I can test the digits 0–9 for randomness.
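For the randomness check itself, once you have a vector of digits, a chi-squared test of the digit frequencies is one simple option (the digits vector here is just a stand-in for the scraped digits):

```r
# Hypothetical digit vector standing in for the real data
digits <- sample(0:9, 1000, replace = TRUE)

# Count each digit (factor() guarantees all ten levels appear)
counts <- table(factor(digits, levels = 0:9))

# Test whether the ten digits occur with equal frequency
chisq.test(counts)
```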

Yeah, I need one column of one digit numbers, the digits being the ones on the web page.

Just split it into characters and then make a data frame:

# totally based on @valeri's solution, as I don't know web scraping at all
library(rvest)
#> Loading required package: xml2
url_to_be_scrapped <- ''
webpage_html <- read_html(x = url_to_be_scrapped)
pi_xml <- html_nodes(x = webpage_html,
                     css = '#million_pi')
pi_text <- html_text(x = pi_xml)
pi_vector <- strsplit(x = pi_text,
                      split = "")[[1]]
pi_digits_after_decimal_dataframe <- data.frame(digits = as.integer(x = pi_vector[-(1:2)]))
str(object = pi_digits_after_decimal_dataframe)
#> 'data.frame':    51197 obs. of  1 variable:
#>  $ digits: int  1 4 1 5 9 2 6 5 3 5 ...

Created on 2019-09-16 by the reprex package (v0.3.0)

And just to add a bit to this here ... since the html_text function returns a single string, length will return 1. If you want to count the number of characters in a string, use nchar. And as @Yarnabrina shows, if you want a vector of digits instead, you need to split the string into its digits.
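A quick illustration of the difference:

```r
s <- "3.14159"

length(s)                     # 1: a single string
nchar(s)                      # 7: number of characters
strsplit(s, split = "")[[1]]  # character vector: "3" "." "1" "4" "1" "5" "9"
```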

And linking to this post regarding programmatically scrolling down the page - apparently it can be done using RSelenium: how to scrape, do not load whole page until we scroll down?

Is it surprising or weird that it only grabs 51K characters? I don't know enough about R internals to know the difference, but I was hoping that data.table, which is supposedly designed for huge data sets, would be able to handle it.

It looks like html_text is the bottleneck that's only grabbing 51K digits. I have no idea why that would be the case.

Your skillful use of splitting was very helpful. Having no delimiter meant certain doom, I thought!

Hi @leecreighton,

the roughly 51,000-character "limit" is not in any way related to R or data frames as such. The page we are scraping is set up so that only about 50,000 characters are rendered on first load; to get the rest you need to scroll down. This is a common tactic to make pages load faster, rendering further content only when needed (here, when the user scrolls down). That is why I linked to the article discussing programmatic scrolling with RSelenium (see above).
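A rough sketch of what that could look like (untested; the URL is a placeholder, and RSelenium needs a compatible browser driver installed):

```r
library(RSelenium)

# Start a browser session (assumes a working driver setup)
rD <- rsDriver(browser = "firefox")
remDr <- rD$client

remDr$navigate("https://example.com/million-digits-of-pi")

# Scroll to the bottom repeatedly so the page renders more digits
for (i in 1:20) {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(1)  # give the page time to load the next chunk
}

# Hand the fully rendered page back to rvest
page <- rvest::read_html(remDr$getPageSource()[[1]])
```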


This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.