After merging 12 Excel spreadsheets - one for each month of the year, each with identical column names - and running this code: ggplot(merged_data, aes(x = ride_length, y = day_of_week)) + geom_point(), the visualization in the lower-right pane is seriously fuzzy. The x axis is ride_length (in minutes) and the y axis is day_of_week. Is there a better package than ggplot2 for this type of work? Or maybe something other than geom_point - a visualization better suited to a dataset with over 4 million merged rows? Check out the data at the link below.
A link to Posit Cloud is of no use to anyone who does not subscribe. I do not.
If I understand the problem, you have too many data points to graph cleanly. You might want to consider randomly subsetting your data set and plotting the subsets.
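For example, something along these lines - an untested sketch, using merged_data, ride_length, and day_of_week from your post:

library(ggplot2)
# plot a random 5% sample rather than all 4 million plus rows
set.seed(42)
idx <- sample(nrow(merged_data), size = round(0.05 * nrow(merged_data)))
ggplot(merged_data[idx, ], aes(x = ride_length, y = day_of_week)) +
  geom_point(alpha = 0.1)

Re-running with a different seed (or a 10% sample) is a quick way to check that whatever pattern you see isn't an artifact of the particular sample.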
Thank you for your suggestion. Merging 12 spreadsheets with a total of over 4 million rows is what the instructions said to do, although even this newbie was skeptical; it's part of the capstone project to ascertain patterns. I will probably attempt to subset only the two columns I'm interested in.
Posit Cloud is what was 'sold' to me as the new cloud version of RStudio, which I'm using for capstone project #1. I didn't know I would need to spend money to upgrade to 8 GB, only for that not to be enough (since I'm testing a lot of code in order to learn what I'm doing), and, to add insult to injury, I'm not able to post a link that everyone on the forum can use.
To be honest it sounds a bit strange, but if you have the memory it is doable. I'd have thought a database such as DuckDB would be preferable, but I don't normally deal with data with millions of rows.
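Roughly what I have in mind, as a sketch only - it assumes the {DBI} and {duckdb} packages and that the monthly files have been exported to CSV with the column names from the original post:

library(DBI)
library(duckdb)
con <- dbConnect(duckdb())
# let DuckDB scan every monthly CSV in the working directory
# without pulling all 4 million rows into R's memory at once
rides <- dbGetQuery(con, "
  SELECT ride_length, day_of_week
  FROM read_csv_auto('*.csv')
")
dbDisconnect(con, shutdown = TRUE)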
My sampling theory is close to non-existent, but I would think that a 5% or 10% sample would give you some feel for what's happening, unless you have a really unusual distribution. Your comment that the visualization in the lower-right corner is seriously fuzzy is a bit worrying.
There may be a way to display this much data sensibly but I am not aware of any.
What you might want to try is plotting each day of the week separately and then using something like {patchwork} or {cowplot} to glue the plots together.
Also, you might want to look into breaking the data down by day of the week and trying a box plot or violin plot.
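For instance, something like this - untested, since I don't have the data, and it assumes the column names from the original post:

library(ggplot2)
# distribution of ride length for each weekday; a boxplot summarizes
# millions of rows far more legibly than an overplotted scatterplot
ggplot(merged_data, aes(x = day_of_week, y = ride_length)) +
  geom_boxplot(outlier.shape = NA) +   # hide the outliers that swamp the scale
  coord_cartesian(ylim = c(0, 60))     # zoom in on rides under an hour (minutes)

Swapping geom_boxplot() for geom_violin() shows the shape of each day's distribution rather than just its quartiles.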
The {inspectdf} package might be handy for some general stats on the dataset.
If you have both the 12 monthly data sets and the merged data in the environment, you should be able to free up memory by deleting the former. Each month can be extracted from the merged data if you need it again.
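Something like this, for example - be careful, though, because the first line deletes every object except the one named, so adjust the name to whatever your merged data frame is actually called:

# keep only the merged data frame and delete everything else in the environment
rm(list = setdiff(ls(), "merged_data"))
gc()   # ask R to release the freed memory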
y is an ordinal variable that takes on 7 distinct values, and x is continuous unless it's been binned. That has some implications for how the data can be displayed. Because the linked Posit Cloud data is password-protected, I've taken similar data to illustrate. The three plots, particularly the scatterplot, don't show anything that can't be seen more precisely from the various summaries and statistical tests. The great strength of visualizations is their power to surprise, and these don't.
That's not to say that different visualizations might not be useful. But first, the question needs asking: What is it we want to know about length of trips by day of week?
At the most basic: are the lengths the same or different? With 5 million plus trips measured in seconds, we can confidently guess that they will be different. (Spoiler: they are.) So, the next question is how different? One unreasonably useful measure of difference is the difference in means, and indeed the days of the week have different mean trip lengths. Next question: are the differences just the result of random variation? Some are, some aren't (see below).
But we're not done, because the plural of number isn't data. We have some records, each recorded for a single day, showing durations greater than 24 hours; the largest is 72 days. And we have 60-second records. A few hundred in either case, out of over 5 million. How do we feel about that? Do we really want either the seemingly impossible long trips or the trivially short flip-the-switch trips? Does it make a difference? (Ecclesiastes 3:1 provides one heuristic.)
library(ggplot2)
library(rstatix)
# data derived from https://s3.amazonaws.com/tripdata/201307-201402-citibike-tripdata.zip
# dur is length of ride in seconds
# dt is date of beginning of ride
d <- read.csv("https://careaga.s3-us-west-2.amazonaws.com/bike.csv")
d[,1] <- as.numeric(d[,1])
d[,2] <- lubridate::ymd_hms(d[,2]) |>
  lubridate::wday(x = _, label = TRUE) |>
  as.factor()
d <- d[complete.cases(d),]
dim(d)
#> [1] 5562321 2
str(d)
#> 'data.frame': 5562321 obs. of 2 variables:
#> $ dur: num 634 1547 178 1580 757 ...
#> $ dt : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 2 2 2 2 2 2 2 2 2 2 ...
summary(d)
#> dur dt
#> Min. : 60 Sun:674524
#> 1st Qu.: 398 Mon:806726
#> Median : 634 Tue:839971
#> Mean : 879 Wed:857451
#> 3rd Qu.: 1042 Thu:820588
#> Max. :6250750 Fri:829008
#> Sat:734053
mean(d$dur) - median(d$dur)
#> [1] 244.8093
hist(d$dur)
# save out initial set
e <- d
# restrict range to aid visualization
# number of rides lasting exactly 5 minutes (300 seconds)
sum(d[,1] == 300)
#> [1] 6097
# trim to drop rides of 5 minutes or less (executive decision)
d <- d[which(d$dur > 300),]
# number of rides longer than 45 minutes
sum(d[,1] > 60*45)
#> [1] 97172
# trim to eliminate rides longer than 45 minutes (also arbitrary)
d <- d[which(d$dur <= 60*45),]
# rerun summaries
dim(d)
#> [1] 4695502 2
summary(d)
#> dur dt
#> Min. : 301.0 Sun:565787
#> 1st Qu.: 484.0 Mon:677543
#> Median : 708.0 Tue:709553
#> Mean : 853.1 Wed:725912
#> 3rd Qu.:1096.0 Thu:692125
#> Max. :2700.0 Fri:701755
#> Sat:622827
quantile(d$dur,probs = seq(0,1,0.05))
#> 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75%
#> 301 338 375 411 447 484 522 563 607 655 708 766 831 906 994 1096
#> 80% 85% 90% 95% 100%
#> 1220 1370 1563 1853 2700
mean(d$dur) - median(d$dur)
#> [1] 145.1044
hist(d$dur)
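One way to check the earlier "some are, some aren't" claim is a pairwise comparison of the weekday means - sketched here rather than run in full, since the output is long:

# pairwise comparisons of mean duration between weekdays; with n this large,
# expect most, but not all, pairs to show statistically significant differences
d |> rstatix::pairwise_t_test(dur ~ dt, p.adjust.method = "bonferroni")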
Footnote: here is the shell script used to pre-process the source CSV files and pull out only the two columns of interest. It cuts down the size of the import into R by an order of magnitude.
#!/bin/bash
# Initialize the output file with a header row
echo "dur,dt" > bike
# Loop through all CSV files in the current directory
for file in *.csv; do
  # Extract the first two columns and append them to the output file
  awk -F, -v OFS=, 'NR > 1 {print $1, $2}' "$file" >> bike
done
mv bike bike.csv
Thanks - it makes sense to break the data down into manageable amounts, especially since doing so will also afford a better opportunity to evaluate patterns. I will take a look at {patchwork} and {cowplot}.
Thanks. I'm still working on the nomenclature of RStudio. When you refer to the environment, is that the whole right half, or just the lower-right? I'm still not clear on the individual names for the two right-side panes. I believe the left side is the console - or is that the word for the whole set of panes, both left and right? Anyway, if I understand your meaning, I need to delete the files in the lower-right while keeping the files in the upper-right, which got there after I read and/or merged previously uploaded files.
Thanks - I already cleaned all 12 spreadsheets by removing all records under 5 minutes and all records that ended after the start date; even rides that started at 11:50 pm and ended at 12:05 am were removed. Depending on the month, the percentage of records removed varied from about 15% in the warmer months to about 32% in the colder months. In other words, all remaining records show rides beginning and ending on the same day, so no rides are over 24 hours. All told, there are still over 4 million rows for the year, but I'm not working at the year level anymore - just by month and, especially, by day of the week. So the forthcoming visualizations should be more relevant and less prone to fuzziness from over-saturation of data.
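(For reference, the cleaning rule above is roughly equivalent to the following in R; started_at and ended_at are placeholder column names, not necessarily what the spreadsheets use.)

library(dplyr)
# drop rides under 5 minutes and rides that did not end
# on the same calendar day they started
cleaned <- merged_data |>
  filter(ride_length >= 5,
         as.Date(started_at) == as.Date(ended_at))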