I am developing an RShiny application with the goal of serving about 1,000 daily users. I have a data frame that is too big to bundle into the application itself: 33 million observations and 3 variables (chr, Date, num). I am looking for a solution for where to host this data frame. The data can be publicly accessible and users do not need to edit it in any way. The data frame contains about 10,000 unique character strings, and the application subsets it by filtering roughly 10 of them into a new data frame. No complex queries are needed.
Do you have evidence or a further argument for this claim?
I think that if the string values were stored as factors and you used a file format with high compression, you could store such data very efficiently and bundle it with your app.
In my case, using qs::qsave() does not result in any compression, and I know why: the third column is numeric rather than integer.
How I overcame it: as.integer(x * 1000), which scales the values and stores them as integers.
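For reference, a minimal sketch of that conversion and save step, assuming the data frame is called df with columns id, date, and value (those names are placeholders, not from the original post):

```r
library(qs)

# Store the repeated strings as a factor and the numeric column as a
# scaled integer so the serialized file compresses much better.
df$id    <- as.factor(df$id)
df$value <- as.integer(df$value * 1000)  # keeps three decimal places

# Save with a high-compression preset; qs::qread("observations.qs") restores it.
qsave(df, "observations.qs", preset = "high")
```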
Now that we have a compressed file bundled with the app, it does indeed make the application smaller, but once we load the file into a data frame it uses the same amount of RAM as it did originally. Is there a way to subset the .qs file while reading it?
You might check out DuckDB; it has a lot of features built in for accessing parts of a dataset rather than holding all of it in memory.
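As a rough sketch of what that could look like (the table name, column name, and file paths here are placeholders): the data frame is written to a DuckDB file once, and the Shiny server then pulls only the rows matching the selected strings.

```r
library(DBI)
library(duckdb)

# One-time setup: write the full data frame into a persistent DuckDB file.
con <- dbConnect(duckdb(), dbdir = "observations.duckdb")
dbWriteTable(con, "observations", df)
dbDisconnect(con, shutdown = TRUE)

# Inside the Shiny server: connect read-only and fetch only the selected ids.
con <- dbConnect(duckdb(), dbdir = "observations.duckdb", read_only = TRUE)
selected     <- c("id_001", "id_002")  # e.g. taken from a selectInput()
placeholders <- paste(rep("?", length(selected)), collapse = ", ")
sql          <- sprintf("SELECT * FROM observations WHERE id IN (%s)", placeholders)
subset_df    <- dbGetQuery(con, sql, params = as.list(selected))
dbDisconnect(con, shutdown = TRUE)
```

Only the matching rows are returned to R, so memory use in the app stays proportional to the subset, not the 33M-row table.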
Even something as simple as SQLite would probably work well. The SQLite file will take whatever space it requires on disk, but querying it from R only loads the data needed for the query, not the entire file.
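A minimal sketch with RSQLite, under the same assumed table and column names as above; the index on the string column is what keeps lookups from scanning all 33M rows.

```r
library(DBI)
library(RSQLite)

# One-time setup: write the data frame to a SQLite file and index the id column.
con <- dbConnect(SQLite(), "observations.sqlite")
dbWriteTable(con, "observations", df)
dbExecute(con, "CREATE INDEX idx_obs_id ON observations (id)")
dbDisconnect(con)

# In the app: pull only the rows for the chosen strings.
con          <- dbConnect(SQLite(), "observations.sqlite")
selected     <- c("id_001", "id_002")
placeholders <- paste(rep("?", length(selected)), collapse = ", ")
subset_df    <- dbGetQuery(
  con,
  sprintf("SELECT * FROM observations WHERE id IN (%s)", placeholders),
  params = as.list(selected)
)
dbDisconnect(con)
```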
Overall, that dataset doesn't seem big enough to require an external database. I would probably store it as a Parquet file or an Arrow dataset and then use arrow to run the queries from Shiny.
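A sketch of the arrow approach, again with made-up file and column names: open_dataset() only reads metadata up front, and the filter is pushed down so collect() brings just the matching rows into memory.

```r
library(arrow)
library(dplyr)

# One-time setup: write the data frame to a Parquet file bundled with the app.
write_parquet(df, "observations.parquet")

# In the Shiny server: filter lazily, then collect only the selected rows.
selected  <- c("id_001", "id_002")
subset_df <- open_dataset("observations.parquet") |>
  filter(id %in% selected) |>
  collect()
```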
The main reason to use an external data store is if you want to update the data without redeploying the app.