Reading consecutive digits

leecreighton · September 16, 2019, 1:24pm

So there are a million digits of pi here that I’d like to read into a vector or column, but I’m not sure how to read (say) a CSV file with no delimiter. Can anyone help?

FJCC · September 16, 2019, 1:49pm

I don't know what will happen if you try to read in one million digits but I did test read.table() with a file with just a single number and it reads it in as a data frame with one row and one column. If there is no termination to the line, it raises a warning but the process works.
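A minimal sketch of the read.table() behaviour described above, using a short stand-in file rather than the full million digits (the temporary file and the variable names are assumptions for illustration). By default a long run of digits is parsed as a single floating-point number, so asking for a character column is one way to keep every digit:

```r
# Write a short run of digits, with no delimiter, to a temporary file.
digits_string <- "31415926535897932384626433832795"
tmp <- tempfile(fileext = ".txt")
writeLines(digits_string, tmp)

# Default behaviour: one row, one column, parsed as a (lossy) double.
as_number <- read.table(tmp)
str(as_number)

# Keep the digits intact by reading the column as character instead.
as_text <- read.table(tmp, colClasses = "character")
nchar(as_text$V1)
```

Whether this scales comfortably to a million digits on one line is not something this sketch checks.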

valeri · September 16, 2019, 2:03pm

I gave this one a try. I didn't get the full million, only about 50,000 digits it seems:

#Loading the rvest package
library('rvest')

#Specifying the url for desired website to be scraped
url <- 'https://www.piday.org/million/'

#Reading the HTML code from the website
webpage <- read_html(url)

pi_xml <- html_nodes(webpage,'#million_pi')

pi_data <- html_text(pi_xml)
substr(pi_data, 1, 100)

Matthias · September 16, 2019, 2:13pm

I think the difficulty with taking it from the web page is that more lines are loaded as you scroll down, so read_html probably doesn't get everything.
If you scroll down to the end and copy everything to the clipboard, you could (conveniently) try:
PI = readClipboard() if you use Windows.

If you stored everything in a file (txt or csv) read_file() from the readr library might do the job.
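A rough sketch of the file route, assuming the digits were saved to a plain text file. The post suggests read_file() from readr; base R's readChar() is swapped in here only to avoid the extra dependency, and the temporary file stands in for the saved copy:

```r
# Stand-in for a saved copy of the page text (the file is hypothetical).
pi_file <- tempfile(fileext = ".txt")
writeLines("3.14159\n26535", pi_file)

# Read the whole file as one string, including any line breaks.
raw_text <- readChar(pi_file, nchars = file.size(pi_file))

# Drop everything that is not a digit (decimal point, newlines, spaces).
digits_only <- gsub("[^0-9]", "", raw_text)

# Split into single characters and convert them to integers.
pi_digits <- as.integer(strsplit(digits_only, split = "")[[1]])
pi_digits
```

Stripping non-digits first means stray line breaks from copy and paste do not turn into NA values.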

leecreighton · September 16, 2019, 3:23pm

I can't seem to tell how many digits I have. It's all one number, rather than a single column of digits, which is what I'm after.

I tried changing it to a string to get its length, but R truncated it. I'm not sure that length() works on numbers.
I want to get each digit as a single row so that I can test for randomness of the digits 0–9.
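One possible way to set up such a test, sketched on the first 50 decimal digits only: a chi-squared goodness-of-fit test of the digit counts against equal frequencies. The choice of test is an assumption, not something settled in this thread, and equal frequencies are only one aspect of randomness:

```r
# First 50 decimal digits of pi, used here only as a small stand-in sample.
pi_decimals <- "14159265358979323846264338327950288419716939937510"
digits <- as.integer(strsplit(pi_decimals, split = "")[[1]])

# One row per digit.
digit_df <- data.frame(digits = digits)

# Count each digit 0-9, keeping digits that never occur as zero counts.
counts <- table(factor(digit_df$digits, levels = 0:9))

# Chi-squared goodness-of-fit test against equal probabilities (1/10 each).
uniform_test <- chisq.test(counts)
uniform_test$p.value
```

With the full set of digits the same calls apply; only the input string changes.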

leecreighton · September 16, 2019, 3:24pm

Yeah, I need one column of one digit numbers, the digits being the ones on the web page.

Yarnabrina · September 16, 2019, 4:01pm

Just split it into characters and then make a dataframe:

# totally based on @valeri's solution, as I don't know web scraping at all
library(rvest)
#> Loading required package: xml2
url_to_be_scraped <- 'https://www.piday.org/million/'
webpage_html <- read_html(x = url_to_be_scraped)
pi_xml <- html_nodes(x = webpage_html,
                     css = '#million_pi')
pi_text <- html_text(x = pi_xml)
pi_vector <- strsplit(x = pi_text,
                      split = "")[[1]]
pi_digits_after_decimal_dataframe <- data.frame(digits = as.integer(x = pi_vector[-(1:2)]))
str(object = pi_digits_after_decimal_dataframe)
#> 'data.frame':    51197 obs. of  1 variable:
#>  $ digits: int  1 4 1 5 9 2 6 5 3 5 ...

Created on 2019-09-16 by the reprex package (v0.3.0)

valeri · September 16, 2019, 4:56pm

And just to add a bit to this here ... since the html_text function returns a string (and only one string), then length will return 1. Generally, if you would like to count the number of characters in a string, you can use nchar. And as @Yarnabrina mentions, if you want to make a vector of digits instead, then you need to split the string into its digits.
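A small illustration of the difference, using a short stand-in string instead of the scraped text:

```r
pi_text <- "3.14159"   # stand-in for the scraped string
length(pi_text)        # 1: a character vector holding one string
nchar(pi_text)         # 7: characters in that string, including the "."

# Splitting on the empty string gives one element per character.
pi_chars <- strsplit(pi_text, split = "")[[1]]
length(pi_chars)       # 7
```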

And linking to this post about programmatically scrolling down the page. Apparently it can be done using RSelenium: "how to scrape, do not load whole page until we scroll down?"

leecreighton · September 16, 2019, 5:14pm

Is it surprising or weird that it only grabs 51K characters? I don't know enough about the R internals to know the difference, but I was hoping that the data.table primitive, supposedly designed for huge data sets, would be able to do it.

leecreighton · September 16, 2019, 5:20pm

It looks like html_text is the bottleneck that's only grabbing 51K digits. I have no idea why that would be the case.

Your skillful use of splitting etc. is very helpful. Having no delimiter meant certain doom, I thought!

valeri · September 16, 2019, 5:43pm

Hi @leecreighton,

The "limit" of around 51,000 characters is not related to R or data frames in any way. The page we are scraping is set up so that only about 50,000 characters are rendered on first load; to get the rest you need to scroll down. This is a common "tactic" so that web pages load faster, with further content rendered only when needed (here, when the user scrolls down). That is why I linked to the post that discusses programmatic scrolling using RSelenium (see above).
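A quick way to notice a partial load like this is to count the digits that actually arrived. The helper below is hypothetical (its name and the expected count of one million are assumptions for illustration) and is shown on a short stand-in string:

```r
# Hypothetical helper: count the decimal digits in a scraped string and
# compare against the number the page is expected to contain.
count_decimals <- function(pi_text, expected = 1e6) {
  digits_only <- gsub("[^0-9]", "", pi_text)
  n_decimals <- nchar(digits_only) - 1  # do not count the leading "3"
  list(n_decimals = n_decimals, complete = n_decimals >= expected)
}

check <- count_decimals("3.14159", expected = 1e6)
check$n_decimals
check$complete
```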
