I'm new to the httr2 package and to working with APIs more generally. I'm looking for help/tips to optimize my requests.
For context, I'm trying to fetch data for a set of lat/lon coordinates. I can make 10 requests per second, and each request can handle up to 512 points, sent in the form lat1,lon1|lat2,lon2, e.g.:
# data storing 2 points
data <- list(
latlons = "49.55,-113.76|49.99,-113.84"
)
Currently, my coordinates are stored in tabular format:
df |>
  head() |>
  str()
#> tibble [6 × 2] (S3: tbl_df/tbl/data.frame)
#> $ lat: num [1:6] 49.6 50 49.6 49.7 49.7 ...
#> $ lon: num [1:6] -114 -114 -114 -114 -114 ...
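For a self-contained example, here's a made-up stand-in for df (the coordinates are arbitrary, just in the same general area):

set.seed(1)
df <- tibble::tibble(
  lat = runif(1300, min = 49.5, max = 50),
  lon = runif(1300, min = -114.5, max = -113.5)
)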
Here's the code I've been experimenting with on a small subset of the entire dataset:
library(tidyverse)
library(httr2)

# base request: authentication, retries, and throttling to 10 requests/second
req <- request("url") |>
  req_headers(`api-key` = keyring::key_get("secret")) |>
  req_retry(max_tries = 3L) |>
  req_throttle(rate = 10L)

reqs <- df |>
  reframe(latlons = paste(lat, lon, sep = ",")) |>
  # make chunks of 512 points (the per-request maximum)
  group_split(
    ceiling(row_number() / 512L),
    .keep = FALSE
  ) |>
  # generate one request per chunk
  map(\(tbl) {
    pts <- pull(tbl) |> paste(collapse = "|")
    req_body_json(req, list(latlons = pts))
  })
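# optional sanity check: preview the first chunked request without
# sending it (req_dry_run() shows what would go over the wire)
reqs[[1]] |> req_dry_run()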
# fetch data
resps <- req_perform_sequential(reqs, on_error = "continue")

# extract the values of interest from the responses
# and store them in a .parquet file
resps |>
  # with on_error = "continue", failed requests are kept as error objects,
  # so drop them before extracting data
  resps_successes() |>
  resps_data(\(resp) {
    resp |>
      resp_body_json() |>
      pluck("results")
  }) |>
  as_tibble_col() |>
  unnest_wider(value) |>
  arrow::write_parquet("my_data.parquet")
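Since on_error = "continue" keeps failed requests around as error objects, they can also be inspected afterwards with resps_failures(), e.g.:

# how many requests failed, if any
resps |>
  resps_failures() |>
  length()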
This works, but there's a catch: I need to fetch data for more than 13 million points, which means sending tens of thousands of requests. With the code above, I have to wait until all requests have completed before saving any data, which isn't ideal if a problem occurs before the end.
What's the best strategy to manage this?
A possible solution is to pre-chunk the initial dataset into batches of, say, 200,000 rows and run the code above (with some adjustments) on each batch rather than on the entire dataset at once. That would let me save intermediate results in case I lose access to the server at some point. It should also be more memory efficient, since it limits the size of the objects holding requests and responses, which can then be garbage collected.
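Here's a rough sketch of what I have in mind. The batch size and file naming are placeholders, and build_requests() is a hypothetical helper that would wrap the chunk-and-map logic from above:

batch_size <- 200000L

batches <- df |>
  group_split(ceiling(row_number() / batch_size), .keep = FALSE)

for (i in seq_along(batches)) {
  out <- sprintf("my_data_%04d.parquet", i)
  # skip batches already fetched during a previous (interrupted) run
  if (file.exists(out)) next
  # build_requests(): hypothetical helper applying the chunking/request
  # logic from above to a single batch
  reqs_i <- build_requests(batches[[i]])
  resps_i <- req_perform_sequential(reqs_i, on_error = "continue")
  resps_i |>
    resps_successes() |>
    resps_data(\(resp) resp_body_json(resp) |> pluck("results")) |>
    as_tibble_col() |>
    unnest_wider(value) |>
    arrow::write_parquet(out)
}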
I'm wondering if there's a better alternative, though, or if there are httr2 features I may have missed that could help with this use case.
Any input/feedback would be greatly appreciated.