Rvest code problem.

After watching countless videos...help!
I am trying to scrape the "winners" from the Massachusetts Lottery website using rvest, with no luck. Here is the webpage: Massachusetts Lottery
Any help with the code would be appreciated.

I think you may need {RSelenium} rather than {rvest}, but I cannot check because my machine throws an error when I try to run it. This looks like a good tutorial: Web Scraping in R: Selenium, FireFox, and PhantomJS | Christopher Belanger, PhD

Thanks! I'll check out RSelenium.

You might want to concentrate on the Network tab of your browser's dev tools to figure out exactly how that data is fetched. In this particular case it comes from API calls like https://www.masslottery.com/api/v1/winners/query?start_index=0&count=25&sort=newestFirst, which you can make yourself through httr / httr2, or just point jsonlite at the URL:

api_query <- "https://www.masslottery.com/api/v1/winners/query?start_index=0&count=25&sort=newestFirst"
winners_resp <- jsonlite::fromJSON(api_query)
str(winners_resp)
#> List of 2
#>  $ pageOfWinners       :'data.frame':    25 obs. of  7 variables:
#>   ..$ date_of_win         : chr [1:25] "2024-08-26" "2024-08-26" "2024-08-26" "2024-08-26" ...
#>   ..$ prize_amount_display: chr [1:25] "$100,000" "$100,000" "$20,000" "$20,000" ...
#>   ..$ prize_amount_usd    : int [1:25] 100000 100000 20000 20000 20000 20000 20000 20000 15000 10000 ...
#>   ..$ identifier          : chr [1:25] "mass_cash" "mass_cash" "433" "433" ...
#>   ..$ name                : chr [1:25] "Mass Cash" "Mass Cash" "Lifetime Millions" "Lifetime Millions" ...
#>   ..$ retailer            : chr [1:25] "Gulf Foodmart" "Gulf Foodmart" "Highland Farm" "Bridgeview Convenience Store" ...
#>   ..$ retailer_location   : chr [1:25] "Lanesboro" "Lanesboro" "Provincetown" "Tyngsboro" ...
#>  $ totalNumberOfWinners: int 686169
tibble::as_tibble(winners_resp$pageOfWinners)
#> # A tibble: 25 × 7
#>    date_of_win prize_amount_display prize_amount_usd identifier   name  retailer
#>    <chr>       <chr>                           <int> <chr>        <chr> <chr>   
#>  1 2024-08-26  $100,000                       100000 mass_cash    Mass… Gulf Fo…
#>  2 2024-08-26  $100,000                       100000 mass_cash    Mass… Gulf Fo…
#>  3 2024-08-26  $20,000                         20000 433          Life… Highlan…
#>  4 2024-08-26  $20,000                         20000 433          Life… Bridgev…
#>  5 2024-08-26  $20,000                         20000 433          Life… Abc Min…
#>  6 2024-08-26  $20,000                         20000 billion-dol… BILL… 7-Eleve…
#>  7 2024-08-26  $20,000                         20000 433          Life… Alltown…
#>  8 2024-08-26  $20,000                         20000 billion-dol… BILL… Saratog…
#>  9 2024-08-26  $15,000                         15000 keno         Keno  Amvets …
#> 10 2024-08-26  $10,000                         10000 100x-cash-2… 100X… Colbea …
#> # ℹ 15 more rows
#> # ℹ 1 more variable: retailer_location <chr>
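
If you'd rather go the httr2 route mentioned above, here is a minimal sketch of the same call. The endpoint and the parameter names are taken from the URL above; the helper names winners_request() and fetch_winners_page() are made up for illustration, and I'm assuming the API still answers with the same JSON structure.

```r
library(httr2)

# Build (but do not send) a request for one page of winners.
# start_index, count and sort mirror the query string of the URL above.
winners_request <- function(start_index = 0, count = 25, sort = "newestFirst") {
  request("https://www.masslottery.com/api/v1/winners/query") |>
    req_url_query(start_index = start_index, count = count, sort = sort)
}

# Send the request and parse the JSON body.
# simplifyVector = TRUE lets the page of winners come back as a data frame,
# much like jsonlite::fromJSON() does by default.
fetch_winners_page <- function(...) {
  resp <- req_perform(winners_request(...))
  resp_body_json(resp, simplifyVector = TRUE)
}
```

Something like tibble::as_tibble(fetch_winners_page(count = 50)$pageOfWinners) should then give you the table directly. The main gain over plain jsonlite is that httr2 lets you add headers, retries, or throttling later if the site needs them.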

Feel free to play with the start_index and count parameters in the API request. It's often worth testing with your own values: in this case, for example, the record count is not fixed at 25 per request.
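
To collect more than one page, you could loop over start_index. This is only a sketch under a few assumptions I have not verified against the API: that start_index is a record offset (not a page number), that a short or empty page means there is nothing left, and that a one-second pause between requests is polite enough. The names winners_url() and fetch_winners() are hypothetical helpers, not part of any package.

```r
# Build the query URL for one page; parameter names come from the URL above.
winners_url <- function(start_index, count, sort = "newestFirst") {
  sprintf(
    "https://www.masslottery.com/api/v1/winners/query?start_index=%d&count=%d&sort=%s",
    as.integer(start_index), as.integer(count), sort
  )
}

# Fetch up to `max_records` winners in pages of `page_size`.
# Assumption: start_index is a record offset, so it advances by rows received.
# `fetch` can be swapped out, which lets you check the paging logic offline.
fetch_winners <- function(max_records = 100, page_size = 50,
                          fetch = jsonlite::fromJSON, pause = 1) {
  pages <- list()
  start_index <- 0
  while (start_index < max_records) {
    count <- min(page_size, max_records - start_index)
    resp <- fetch(winners_url(start_index, count))
    page <- resp$pageOfWinners
    if (is.null(page) || NROW(page) == 0) break  # nothing more to read
    pages[[length(pages) + 1]] <- page
    start_index <- start_index + NROW(page)
    if (NROW(page) < count) break                # short page: assume the end
    if (start_index < max_records) Sys.sleep(pause)  # be gentle with the server
  }
  if (length(pages) == 0) return(data.frame())
  do.call(rbind, pages)
}
```

For example, winners <- fetch_winners(max_records = 200, page_size = 100) would make at most two requests. Keep max_records modest: totalNumberOfWinners in the output above is in the hundreds of thousands.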
