Any tips to fetch data from thousands of requests efficiently?

I'm new to using the httr2 package and to APIs more generally. I'm looking for help/tips to optimize my requests.

For context, I'm trying to fetch data for lat/lon coordinates. I can make 10 requests per second and each request can handle up to 512 points. I can send data in the form lat1,lon1|lat2,lon2, e.g.:

# data storing 2 points
data <- list(
  latlons = "49.55,-113.76|49.99,-113.84"
)

Currently, I have my coordinates stored in a tabular format.

df |> 
  head() |> 
  str()
#> tibble [6 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ lat: num [1:6] 49.6 50 49.6 49.7 49.7 ...
#>  $ lon: num [1:6] -114 -114 -114 -114 -114 ...

Here's the code I've been experimenting with on a small subset of the entire dataset:

library(tidyverse)
library(httr2)

# base request
req <- request("url") |> 
  req_headers(`api-key` = keyring::key_get("secret")) |>
  req_retry(max_tries = 3L) |> 
  req_throttle(rate = 10L)

reqs <- df |> 
  reframe(latlons = paste(lat, lon, sep = ",")) |>
  # make chunks of 512 points
  group_split(
    ceiling(row_number() / 512L),
    .keep = FALSE
  ) |> 
  # generate requests for every chunk
  map(\(tbl) {
    pts <- pull(tbl) |> paste(collapse = "|")
    req_body_json(req, list(latlons = pts))
  })

# fetch data
resps <- req_perform_sequential(reqs, on_error = "continue")

# extract the values of interest from the responses
# and store them in a .parquet file
resps |> 
  resps_data(\(resp) {
    resp |>
      resp_body_json() |> 
      pluck("results")
  }) |> 
  as_tibble_col() |> 
  unnest_wider(value) |> 
  arrow::write_parquet("my_data.parquet")

This works but there's a catch. I need to fetch data for more than 13 million points, meaning that I'll have to send tens of thousands of requests. With the code above I need to wait until all requests are completed before saving the data, which isn't optimal if a problem occurs before the end.

What's the best strategy to manage this?

A possible solution I can think of is to pre-chunk the initial dataset into subsets of, say, 200,000 rows and run the code above (with some adjustments) on each subset rather than on the entire dataset at once. This would allow me to save intermediate results in case I lose access to the server at some point. It should also be more memory efficient, since it limits the size of the objects needed to store requests and responses, which can then be garbage collected.

I'm wondering, though, if there's a better alternative, or if there are some httr2 features I may have missed that could help with this use case.

Any input/feedback would be greatly appreciated.

Explore the tryCatch() function. You'll likely want to wrap your script in a function and then wrap that function in tryCatch(). This will let your run continue even if an individual query errors. You might also want to include a loop in this function that inserts a timeout period before re-querying the same record and its error-causing lat/lon coordinates. Again, if this loop is all nested inside tryCatch(), repeated failures won't interrupt things.
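A rough sketch of that pattern (query_chunk() and chunks are placeholders for your own request function and data splits, not part of your code):

library(purrr)

# hypothetical wrapper: retry one chunk a few times, pausing between
# attempts, and give up with NULL instead of stopping the whole run
safe_query <- function(chunk, max_attempts = 3L, pause = 5) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(
      query_chunk(chunk),  # your function performing a single request
      error = function(e) {
        message("Attempt ", attempt, " failed: ", conditionMessage(e))
        NULL
      }
    )
    if (!is.null(result)) return(result)
    Sys.sleep(pause)       # timeout period before re-querying
  }
  NULL
}

results <- map(chunks, safe_query)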

And not addressing your question per se... you might also explore the future (and furrr) packages. Querying 13 million records is time consuming. The future package lets you distribute this task across machine cores. Your Mac or PC likely has a dozen or more cores. If you dedicate 10 cores to the problem using the future::plan() function, the time to complete your 13-million-record query will drop to roughly the time it takes to complete a 1.3-million-record query (plus a small bit of overhead). I mention the furrr package because it complements future: it contains a series of functions mimicking those in the purrr package (but geared for future). So, if you've wrapped your script in a function with tryCatch(), you'd call it with furrr::future_pmap() rather than purrr::pmap().
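A rough sketch of that setup, reusing the hypothetical safe_query() and chunks from above (the worker count is illustrative):

library(future)
library(furrr)

plan(multisession, workers = 10)  # spread the chunks across 10 local cores

results <- future_map(chunks, safe_query)

plan(sequential)                  # back to single-core execution when done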

Best of luck.

Thanks for the suggestions!

I can't use parallelization because I'm limited to 10 requests/sec. Retries and errors are actually handled by req_retry() and req_perform_sequential(), respectively. With my current code, the script shouldn't stop running, but if I lose access to the server the whole run will essentially be wasted time. That's probably something I should change…

How is the rate limit enforced? Do they block your API key for some period, close the offending connection(s), or use some other means?

What's the average response time? Have you tried req_perform_parallel() with a custom pool to indirectly limit the maximum number of requests per second?
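For example, something roughly like this (an untested sketch, assuming a httr2 version where req_perform_parallel() accepts a pool argument):

# a pool with very few connections indirectly caps how many
# requests can be in flight at the same time
pool <- curl::new_pool(total_con = 2, host_con = 2)
resps <- req_perform_parallel(reqs, pool = pool, on_error = "continue")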

Can you share what kind of API / service this is? Is it by any chance related to elevation / bathymetry lookups? If yes, are the points scattered all over the world, or is there an area of interest or a fixed set of well-defined areas? Have you perhaps considered local lookups instead, either from some free and open or purchased dataset(s)?

Yes, I'm fetching elevation data. I'm using the GPXZ API. The points are spread all over the globe.

From the doc:

Requests above the limit receive a response with a 429 status code: these requests don't count towards your quota, so feel free to retry rate-limited requests until they succeed.

So I'm not risking much, but it would be rather pointless because I'd definitely exceed that limit if I used parallelization.
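For reference, here's a sketch of how I could make that retry behaviour explicit on the base request (I believe req_retry() already treats 429 as transient by default, so this mostly documents the intent):

req <- request("url") |>
  req_headers(`api-key` = keyring::key_get("secret")) |>
  req_throttle(rate = 10L) |>
  req_retry(
    max_tries = 5L,
    is_transient = \(resp) resp_status(resp) == 429L,  # retry rate-limited requests
    backoff = \(attempt) 2L^attempt                    # wait a bit longer each attempt
  )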

I don't know what the average response time is.

Time isn't my biggest concern: I can let the computer work while I do something else. I'm doing this on my work computer, though, where memory and disk space are limited, so I need to be careful with the size of the objects I handle without compromising speed too much.

I'm thinking of doing something like this:

fetch_elevations <- function(data, base_request, chunk_size = 1e6L) {
  cuts <- seq(1L, nrow(data), chunk_size) - 1L
  row_seq <- seq_len(chunk_size)
  iwalk(cuts, \(cut, i) {
    rows <- cut + row_seq
    data <- slice(data, rows)
    data <- unite(data, col = "value", matches("lat|lon"), sep = ",")
    tbls <- group_split(data, ceiling(row_number() / 512L), .keep = FALSE)
    reqs <- map(tbls, \(tbl) {
      pts <- paste(pull(tbl), collapse = "|")
      req_body_json(base_request, list(
        latlons = pts,
        interpolation = "nearest",
        bathymetry = TRUE
      ))
    })
    # probably better to use on_error = "return"
    resps <- req_perform_sequential(reqs, on_error = "continue")
    # keep only successful responses; failed ones could be logged or retried later
    out <- resps_data(resps_successes(resps), \(resp) {
      resp <- resp_body_json(resp)
      pluck(resp, "results")
    })
    out <- as_tibble_col(out)
    out <- unnest_wider(out, value)
    arrow::write_parquet(out, glue::glue("data/gpxz-{i}.parquet"))
  })
}

Basically, this handles the requests in chunks of 1 million rows, which corresponds to slightly fewer than 2000 requests of 512 points each.
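The function would then be called on the full dataset with the base request from my first post, something like:

fetch_elevations(df, base_request = req, chunk_size = 1e6L)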