I don't work with gridded data very often, by climate science standards, but a recent analysis had me doing some small operations on gridded NetCDF data: monthly occurrences of records in grid cells over Australia.
I wanted to derive from this the age of records for each grid cell and timestep. I chose to use the super-handy tidync package to pull out a data frame using hyper_tibble, group by grid cell (longitude + latitude) and use tidyr::fill to work it out:
library(tidync)
library(dplyr)
library(tidyr)
library(lubridate)  # for months()
library(magrittr)   # for the tee pipe, %T>%
library(here)

rec_txa =
  # get the netcdf into a data frame
  tidync(
    here('analysis', 'data', 'record-ts', 'monthly-TXmean-records.nc')) %>%
  activate('hot_record_time') %>%
  hyper_tibble() %T>%
  # tidy up a little
  print() %>%
  mutate(record = as.logical(hot_record_time)) %>%
  select(-hot_record_time) %>%
  arrange(time) %>%
  # cell-by-cell: find the time since the last record
  group_by(longitude, latitude) %>%
  mutate(last_record = if_else(record, true = time, false = NA_integer_)) %>%
  fill(last_record) %>%
  mutate(
    record_age = time - last_record,
    record_age_clamped = case_when(
      record_age > 120 ~ NA_integer_,
      TRUE ~ record_age
    )) %>%
  select(-last_record) %>%
  ungroup() %>%
  # convert time column to actual dates
  mutate(date = as.Date('1950-01-15') + months(time - 1)) %T>%
  print()
In this case it was fast enough for my needs: on the order of 10 seconds on my 2014 MBP for what was a ~2M row data frame (2000 cells * 1000 months). However, I'm aware that tidync also has a hyper_array function, and I'm not very familiar with R's matrix tools. Would it be quicker to dump to an array and use other tools to do this work? Or is there some possible performance benefit from hyper_tibble accepting a grouping argument to skip dumping the whole thing into a single data frame before splitting it up again?
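For reference, my naive guess at the array route is below. It's an untested sketch: it assumes hyper_array() hands the variable back as a lon x lat x time array (I haven't checked the dimension order), and it doesn't deal with missing cells.

library(tidync)
library(magrittr)
library(here)

arr_list <- tidync(
  here('analysis', 'data', 'record-ts', 'monthly-TXmean-records.nc')) %>%
  activate('hot_record_time') %>%
  hyper_array()
# assuming the variable comes back as a lon x lat x time array;
# treat nonzero as a record (the data frame version used as.logical)
rec_arr <- arr_list[['hot_record_time']] != 0

# time since the most recent record, for one cell's time series
record_age <- function(rec) {
  idx <- seq_along(rec)
  last_rec <- cummax(ifelse(rec, idx, 0L))  # index of the last record so far
  age <- idx - last_rec
  age[last_rec == 0L] <- NA_integer_        # no record seen yet
  age[age > 120] <- NA_integer_             # same 120-month clamp as above
  age
}

# apply() over the two spatial margins returns time x lon x lat,
# so permute back to lon x lat x time
ages <- aperm(apply(rec_arr, c(1, 2), record_age), c(2, 3, 1))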
I mostly ask because it's pretty typical for me to do embarrassingly parallelisable work on gridded data (either by grid cell, working on time series one at a time, or grouping by timestep to aggregate spatially), and in my PhD work I've mostly outsourced this kinda thing with bigger datasets to dedicated tools like CDO, at the expense of readability. But many of my colleagues who're Python users rely on xarray's similar split-apply-combine workflow for analysis on model output.
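To make that concrete, the per-cell pattern I have in mind looks something like the sketch below. It's hypothetical, using furrr as one example of a parallel backend; rec_df is an illustrative name standing in for the hyper_tibble() output above, after the same tidying into a logical record column.

library(dplyr)
library(tidyr)
library(furrr)
library(future)
plan(multisession)

# split by grid cell and farm the per-cell time series work out in parallel;
# rec_df: the tidied hyper_tibble() output (hypothetical name)
rec_ages <- rec_df %>%
  group_split(longitude, latitude) %>%
  future_map_dfr(function(cell) {
    cell %>%
      arrange(time) %>%
      mutate(last_record = if_else(record, true = time, false = NA_integer_)) %>%
      fill(last_record) %>%
      mutate(record_age = time - last_record)
  })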