I've used R for geospatial data and for time series data separately, and I wanted to know whether there is a standard way of dealing with temporal geospatial data when the shapefiles/geometries are constant over time.
My current understanding is that, generally, if one transforms the data into a tidy format, one can really leverage the power of the tidyverse (ggplot2, tidymodels, etc.). However, transforming temporal geospatial data into a tidy format is very expensive memory-wise, because every time index gets its own copy of the geometry even when the geometry has not changed between time indices.
Is there a way to leverage a constant geometry so that the memory cost is closer to additive (geometry + observations) rather than multiplicative (geometry × time indices), while still using the tidyverse?
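For illustration, this is the shape of workflow I'd like to end up with, using df_raw and counties from the reproducible example below and attaching the (constant) geometry only to the slice that is actually plotted. The DSCI/date column names come from the drought data, and st_as_sf() assumes the geometry column survives the joins as an sfc list column, so treat this as a sketch rather than working code:

library(dplyr)
library(ggplot2)
library(sf)

# Work on the time series with tidyverse verbs, then join the geometry
# only onto the subset that is actually needed (a single week here).
one_week <- df_raw |>
  filter(date == max(date)) |>
  left_join(counties, by = "code") |>
  st_as_sf()

ggplot(one_week) +
  geom_sf(aes(fill = DSCI), colour = NA)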
More concretely, with last week's TidyTuesday data: processing the data in a tidy way creates a very large object (dt_merged, 135 Gb) from left joins of a somewhat large time series (df_raw, 115 Mb) and geospatial component (counties, 118 Mb). Is there an already built/established way of processing the data that leverages the fact that the shapefiles don't change (or change very little)?
library(tidytuesdayR)
library(tigris)
library(data.table)
library(purrr)

# County geometries for the year 2000, keyed by the 5-digit FIPS code
counties <- as.data.table(counties(year = 2000))[, .(
  code = paste0(STATEFP, COUNTYFP),
  geometry, INTPTLAT00, INTPTLON00
)]
setkey(counties, code)

# State/county FIPS lookup table, keyed the same way
fips <- as.data.table(fips_codes)[, code := paste0(state_code, county_code)]
setkey(fips, code)

# County-level drought time series from TidyTuesday 2022-06-14
df_raw <- as.data.table(tt_load('2022-06-14')$`drought-fips`)
setnames(df_raw, 'FIPS', 'code')
setkey(df_raw, code)

# Left-join the time series onto the FIPS lookup and the geometries
merge_order <- list(df_raw, fips, counties)
dt_merged <- reduce(merge_order, ~merge(.x, .y, all.x = TRUE, by = 'code'))
setkey(dt_merged, code)
format(object.size(counties), units = "auto") # "118.7 Mb"
format(object.size(fips), units = "auto") # "492.7 Kb"
format(object.size(df_raw), units = "auto") # "115.3 Mb"
format(object.size(dt_merged), units = "auto") # "135.1 Gb"
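As a rough sanity check on where the 135 Gb comes from (back-of-the-envelope only; it just spreads the total geometry size evenly across counties):

# Each merged row carries its county's geometry, so the reported size is
# roughly (average geometry size per county) x (number of rows in df_raw).
bytes_per_county <- as.numeric(object.size(counties$geometry)) / nrow(counties)
bytes_per_county * nrow(df_raw) / 1024^3  # on the order of the 135 Gb above

(Note that object.size counts the geometry once per row even if R happens to share memory internally, so the reported size may overstate actual RAM use, but it still reflects the multiplicative structure I'd like to avoid.)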