I am developing an RShiny application with the goal of serving about 1,000 daily users. I have a data frame that is too big to bundle into the application itself: 33 million observations and 3 variables (chr, Date, num). I am looking for a solution for where to host this data frame. The data can be publicly accessible and users do not need to edit it in any way. The data frame contains about 10,000 unique character strings, and the application subsets it by filtering roughly 10 of them into a new data frame. No complex queries are needed.
Do you have evidence or a further argument for this claim?
I think that if the string values were stored as factors and you used a file format with high compression, you could store such data very efficiently and bundle it with your app.
In my case, using qs::qsave() does not result in any compression, and I know why: the third column is numeric rather than integer.
How I overcame it: as.integer(x * 1000), which scales the values and stores them as integers.
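For reference, a minimal sketch of that conversion and save step, assuming the data frame is called df with columns id, date, and value (those names are placeholders, not from the original post):

```r
library(qs)

# Store the repeated strings as a factor and the numeric column as a
# scaled integer so the serialized file compresses much better.
df$id    <- as.factor(df$id)
df$value <- as.integer(df$value * 1000)  # keeps three decimal places

# Save with a high-compression preset; qs::qread("observations.qs") restores it.
qsave(df, "observations.qs", preset = "high")
```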
Now that we have a compressed file bundled with the app, it does indeed make the application smaller, but once we load the file into a data frame it uses the same amount of RAM as it did originally. Is there a way to subset the .qs file while reading it?
You might check out DuckDB; it has a lot of features built in for accessing parts of a dataset rather than holding all of it in memory.
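As a rough sketch of what that could look like (the table name, column name, and file paths here are placeholders): the data frame is written to a DuckDB file once, and the Shiny server then pulls only the rows matching the selected strings.

```r
library(DBI)
library(duckdb)

# One-time setup: write the full data frame into a persistent DuckDB file.
con <- dbConnect(duckdb(), dbdir = "observations.duckdb")
dbWriteTable(con, "observations", df)
dbDisconnect(con, shutdown = TRUE)

# Inside the Shiny server: connect read-only and fetch only the selected ids.
con <- dbConnect(duckdb(), dbdir = "observations.duckdb", read_only = TRUE)
selected     <- c("id_001", "id_002")  # e.g. taken from a selectInput()
placeholders <- paste(rep("?", length(selected)), collapse = ", ")
sql          <- sprintf("SELECT * FROM observations WHERE id IN (%s)", placeholders)
subset_df    <- dbGetQuery(con, sql, params = as.list(selected))
dbDisconnect(con, shutdown = TRUE)
```

Only the matching rows are returned to R, so memory use in the app stays proportional to the subset, not the 33M-row table.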
Even something as simple as SQLite would probably work well. The SQLite file will take whatever space it requires on disk, but querying it from R only loads the data needed for the query, not the entire file.
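A minimal sketch with RSQLite, under the same assumed table and column names as above; the index on the string column is what keeps lookups from scanning all 33M rows.

```r
library(DBI)
library(RSQLite)

# One-time setup: write the data frame to a SQLite file and index the id column.
con <- dbConnect(SQLite(), "observations.sqlite")
dbWriteTable(con, "observations", df)
dbExecute(con, "CREATE INDEX idx_obs_id ON observations (id)")
dbDisconnect(con)

# In the app: pull only the rows for the chosen strings.
con          <- dbConnect(SQLite(), "observations.sqlite")
selected     <- c("id_001", "id_002")
placeholders <- paste(rep("?", length(selected)), collapse = ", ")
subset_df    <- dbGetQuery(
  con,
  sprintf("SELECT * FROM observations WHERE id IN (%s)", placeholders),
  params = as.list(selected)
)
dbDisconnect(con)
```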
Overall, that dataset doesn't seem big enough to require an external database. I would probably store it as a Parquet file or an Arrow dataset and then use arrow to run the queries from Shiny.
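A sketch of the arrow approach, again with made-up file and column names: open_dataset() only reads metadata up front, and the filter is pushed down so collect() brings just the matching rows into memory.

```r
library(arrow)
library(dplyr)

# One-time setup: write the data frame to a Parquet file bundled with the app.
write_parquet(df, "observations.parquet")

# In the Shiny server: filter lazily, then collect only the selected rows.
selected  <- c("id_001", "id_002")
subset_df <- open_dataset("observations.parquet") |>
  filter(id %in% selected) |>
  collect()
```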
The main reason to use an external data store is if you want to update the data without redeploying the app.