New To R Studio (student of Google Data Analytics course) Question about size limitation of Uploads

Hey Everyone,

im very new to R Studio and analysis outside of excel. Im having fun learning and i am up to the capstone project in my course - however they advise to download the last 12 months of data to use. It seems like to me that the data they want me to work with will be too much for R Studio (12 CSV average 100+mb each). I can see the RAM use going into the red. I wish to perform some data cleaning and then if feasible merge into one file. Being so new to this im not sure whats possible.

Maybe showing just how new i am but im attempting to load the CSV into the environment using the below.

'Removing NA from data frames'

dt <- read_csv("202505-divvy-tripdata.csv")
dt505_nona <- drop_na(dt, "end_station_id")
rm(dt)

dt <- read_csv("202506-divvy-tripdata.csv")
dt506_nona <- drop_na(dt, "end_station_id")
rm(dt)

dt <- read_csv("202507-divvy-tripdata.csv")
dt507_nona <- drop_na(dt, "end_station_id")
rm(dt)

ETC

1 Like

Just go ahead and try it. If it works, then you're okay :grinning_face:

1 Like

Seems like its going to be a problem, is it more memory intensive because i am using an R Script and not a markdown?

It crashed trying to process "202510-divvy-tripdata.csv"

I'm assuming there's a good reason for you to not install RStudio (or Positron) locally and you are after free options.

If you are fine with Jupyter notebooks, you'd get 12GB RAM in Google Colab(link chooses R runtime) or 30GB with R kernels in Kaggle.

For alternative RStudio Server instances, perhaps consider GitHub Codespaces, free options used to be mostly for those who qualify for GitHub Education, but now all free tier personal accounts get first 120 compute hours + 15GB storage per month for free and non-pro accounts can choose between 8GB and 16GB machines.

To try out RStudio Server in Codespaces :

(the first badge, Open example-rig-rstudio)
It starts VS Code and can take some time to finalize the setup; once finished, you can open RStudio through a link in VSCode Ports tab:


If you'd rather stick with the free plan in Posit Cloud and want to try to get most out of that 1GB limit, you should probably look into workflows and packages that are designed for working with larger-than-memory dataset.

I'd start with DuckDB. If your course was mostly about using Tidyverse packages, you should be rather comfortable using DuckDB through duckplyr package, a (mostly) drop-in replacement of dplyr that uses DuckDB as a backend.

Now we can also skip manual download-extract-upload of input files and point DuckDB directly to files in S3 bucket (assuming you are working with files from divvy-tripdata ), save resulting dataset as a single Parquet file (so we don't have to download and parse files again; and no fully materialized dataset is kept in memory for the whole time) and process and analyze data with regular dplyr verbs. With duckplyr, most of the heavy lifting is handled by DuckDB and ideally only aggregated data and/or relatively small data subsets ahould end up in R (think of metrics youcplan to present and data needed for visualizations). And to make the best use of DuckDB , you should be careful not to accidentally pull data into R when it can be avoided, especially when working with tight memory limits.

You should probably go though A DuckDB-Backed Version of dplyr • duckplyr and Articles • duckplyr , but example session starting with 6M trips could look something like this:

library(duckplyr)
library(tidyr)

# load duckdb extensions to import CSVs files 
# from zip archives stored in a public S3 bucket
db_exec("INSTALL httpfs")
db_exec("INSTALL zipfs FROM community")
db_exec("LOAD zipfs")

# build a list of files, as some zip-archives include some metadata files, 
# make sure to only read (presumably single) csv file from archive's root
# https://divvy-tripdata.s3.amazonaws.com/index.html 

files <- glue::glue(
  "zip://s3://divvy-tripdata/{yyyymm}-divvy-tripdata.zip/*.csv", 
  yyyymm = c(202504:202512, 202601:202604)
)
files
#> zip://s3://divvy-tripdata/202504-divvy-tripdata.zip/*.csv
#> zip://s3://divvy-tripdata/202505-divvy-tripdata.zip/*.csv
#> zip://s3://divvy-tripdata/202506-divvy-tripdata.zip/*.csv
#> zip://s3://divvy-tripdata/202507-divvy-tripdata.zip/*.csv
#> zip://s3://divvy-tripdata/202508-divvy-tripdata.zip/*.csv
#> zip://s3://divvy-tripdata/202509-divvy-tripdata.zip/*.csv
#> zip://s3://divvy-tripdata/202510-divvy-tripdata.zip/*.csv
#> zip://s3://divvy-tripdata/202511-divvy-tripdata.zip/*.csv
#> zip://s3://divvy-tripdata/202512-divvy-tripdata.zip/*.csv
#> zip://s3://divvy-tripdata/202601-divvy-tripdata.zip/*.csv
#> zip://s3://divvy-tripdata/202602-divvy-tripdata.zip/*.csv
#> zip://s3://divvy-tripdata/202603-divvy-tripdata.zip/*.csv
#> zip://s3://divvy-tripdata/202604-divvy-tripdata.zip/*.csv

# Generate sql for duckdb using `files`, execute it and save result as 
# divvy-tripdata.parquet file (a single table including all tripdata records),
# return it as a lazy duckplyr frame.
# `read_csv()`  in this context is https://duckdb.org/docs/current/data/csv/overview#csv-functions , 
# called with a list of files.
tripdata_tbl <- 
  glue::glue_sql("from read_csv([{files*}])", .con = duckplyr:::get_default_duckdb_connection()) %>% 
  read_sql_duckdb() %>% 
  compute_parquet("divvy-tripdata.parquet")
tripdata_tbl %>% print(n = 3)
#> # A duckplyr data frame: 13 variables
#>   ride_id          rideable_type started_at          ended_at           
#>   <chr>            <chr>         <dttm>              <dttm>             
#> 1 AF3863596DF9D94B classic_bike  2025-04-27 14:29:34 2025-04-27 14:36:23
#> 2 8B38081EBE918800 electric_bike 2025-04-23 17:48:51 2025-04-23 17:59:06
#> 3 1C7F1DE826BBBC8D electric_bike 2025-04-05 17:55:30 2025-04-05 18:05:40
#> # ℹ more rows
#> # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
#> #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
#> #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>

# next time just read it directly from the file
tripdata_tbl <- 
  read_parquet_duckdb("divvy-tripdata.parquet")

# example queries (fast, `count()` is translated and handled by DuckDB):
tripdata_tbl %>% 
  count()
#> # A duckplyr data frame: 1 variable
#>         n
#>     <int>
#> 1 6068796

tripdata_tbl %>% 
  count(rideable_type)
#> # A duckplyr data frame: 2 variables
#>   rideable_type       n
#>   <chr>           <int>
#> 1 classic_bike  2074957
#> 2 electric_bike 3993839

# `drop_na(end_station_id)` would need to pull whole dataset into R, 
# row count exceeds default limits and triggers an error (good);
# we would explicitly need to call collect() which in turn would 
# probably hit 1GB limit (bad) and crash R session :
tripdata_tbl %>% 
  # collect() %>% 
  drop_na(end_station_id) %>% 
  count()
#> Error:
#> ! Materialization would result in more than 76923 rows. Use `collect()` or `as_tibble()` to materialize.
#> ℹ Context: GetQueryResult

# instead, use dplyr verb `filter()` with `is.na()`, this will again be translated and handled by DuckDB:
tripdata_tbl %>% 
  filter(!is.na(end_station_id)) %>% 
  count(rideable_type)
#> # A duckplyr data frame: 2 variables
#>   rideable_type       n
#>   <chr>           <int>
#> 1 classic_bike  2068724
#> 2 electric_bike 2646574

2 Likes