geom_point() in ggplot is plotting a distorted chart

After merging 12 Excel spreadsheets - one for each month of the year, each with identical column names - and running this code: ggplot(merged_data, aes(x = ride_length, y = day_of_week)) + geom_point(), the visualization in the lower-right corner is seriously fuzzy. The x axis is ride_length (in minutes) and the y axis is day_of_week. Is there a better package than ggplot for this type of work? Or maybe something other than geom_point? A visualization more suited to datasets with over 4 million merged rows? Check out the project at the link below.

Here is the domain: Posit Cloud

A link to Posit Cloud is of no use to anyone who does not subscribe. I do not.

If I understand the problem, you have too many data points to be graphed. You might want to consider randomly sub-setting your data set and plotting the subsets.
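
For example, a minimal sketch, assuming the merged data frame is called merged_data and has the ride_length and day_of_week columns from your call above:

library(ggplot2)

set.seed(42)                                 # make the sample reproducible
idx <- sample(nrow(merged_data), 100000)     # e.g. 100,000 of the 4 million+ rows
sampled <- merged_data[idx, ]

ggplot(sampled, aes(x = ride_length, y = day_of_week)) +
  geom_point(alpha = 0.1)                    # transparency also helps with overplotting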

Thank you for your suggestion - merging 12 spreadsheets with a total of over 4 million rows is what the instructions said to do, although even this newbie was skeptical. It's part of the capstone project to ascertain patterns. I will probably attempt to subset two columns only, since they are the ones I'm interested in.

Posit Cloud is what was 'sold' to me as the new cloud version of RStudio, which I'm using for capstone project #1. I didn't know I would need to spend money to upgrade to 8GB, only for that not to be enough (since I'm testing a lot of code in order to learn what I'm doing), and, to add insult to injury, I'm not able to post a link that everyone using the forum can use.

To be honest, it sounds a bit strange, but if you have the memory it is doable. I'd have thought a database like DuckDB would be preferable, but I don't normally deal with data with millions of rows.
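
For what it's worth, here is a rough sketch of the DuckDB route, assuming the 12 monthly files have been exported to CSV (the glob pattern below is only a placeholder for whatever the files are actually called; the column names are taken from your ggplot call):

library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
# pull only the two columns of interest straight from the monthly files;
# read_csv_auto() accepts a glob, so all 12 files are scanned in one query
rides <- dbGetQuery(
  con,
  "SELECT ride_length, day_of_week FROM read_csv_auto('tripdata_*.csv')"
)
dbDisconnect(con, shutdown = TRUE)

That way only the two columns ever reach R's memory, which should take a lot of pressure off the 8GB instance.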

My sampling theory is close to non-existent, but I would think that a 5% or 10% sample would give you some feel for what's happening unless you have a really unusual distribution. Your comment that the visualization in the lower-right corner is seriously fuzzy is a bit worrying.
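
If you go the sampling route, dplyr makes a proportional sample a one-liner (assuming the merged data frame is called merged_data):

library(dplyr)
sample_5pct <- slice_sample(merged_data, prop = 0.05)   # random 5% of the rows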

There may be a way to display this much data sensibly, but I am not aware of one.
What you might want to try is plotting each day of the week separately and then using something like {patchwork} or {cowplot} to glue them together.

Also, you might want to look into breaking the data down by day-of-the-week and trying a box-plot or violin-plot.
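
Here is a rough sketch of both ideas, assuming the merged data frame is called merged_data with the ride_length and day_of_week columns from the original post:

library(ggplot2)

# one violin (swap in geom_boxplot() if preferred) per day of week, using the full data
ggplot(merged_data, aes(x = day_of_week, y = ride_length)) +
  geom_violin() +
  theme_minimal()

# or one histogram panel per weekday, glued together with {patchwork}
library(patchwork)
panels <- lapply(split(merged_data, merged_data$day_of_week), function(one_day) {
  ggplot(one_day, aes(x = ride_length)) +
    geom_histogram(bins = 30) +
    ggtitle(as.character(unique(one_day$day_of_week)))
})
wrap_plots(panels)

A box plot or violin collapses millions of points per weekday into a handful of summary shapes, which is usually easier to read than an over-plotted scatter.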

The inspectdf library might be handy for some general stats on the dataset.
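
For example (merged_data again standing in for your merged data frame):

library(inspectdf)

inspect_num(merged_data)                 # summary stats for numeric columns such as ride_length
inspect_cat(merged_data) |> show_plot()  # level frequencies for categorical columns such as day_of_week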

If you have both the 12 monthly data sets and the merged data in the environment, you should be able to free up memory by deleting the former. Each of the months can be extracted from the merged data if you need them.
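
Something like this, where the monthly object names are hypothetical stand-ins for whatever yours are called:

# drop the 12 monthly data frames once the merged one exists
rm(jan_rides, feb_rides, mar_rides)   # ... and so on for the remaining months
gc()                                  # ask R to return the freed memory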

y is an ordinal variable that takes on 7 distinct values and x is continuous unless it's been binned. That's going to have some implications for how the data can be displayed. Because the linked Posit Cloud data is password-protected, I've used similar data to illustrate. The plots below, particularly the scatterplot, don't illustrate anything that can't be seen more precisely from the various summaries and statistical tests. The great strength of visualizations is the power to surprise, and these don't.

That's not to say that different visualizations might not be useful. But first, the question needs asking: What is it we want to know about length of trips by day of week?

At the most basic: are the lengths the same or different? With over 5 million trips measured in seconds, we can confidently guess that they will be different. (Spoiler: they are.) So, the next question is how different? One unreasonably useful measure of difference is the difference in means. Indeed, the days of the week have different mean trip lengths. So, next question: are the differences just the result of random variation? Some are, some aren't (see below).

But we're not done, because the plural of number isn't data. We have some records with durations greater than 24 hours; the longest is 72 days. And we have 60-second records. A few hundred in either case, out of over 5 million. How do we feel about that? Do we really want either the seemingly impossible long trips or the trivially short flip-the-switch trips? Does it make a difference? (Ecclesiastes 3:1 provides one heuristic.)

library(ggplot2)
library(rstatix)
# data derived from https://s3.amazonaws.com/tripdata/201307-201402-citibike-tripdata.zip
# dur is length of ride in seconds
# dt is date of beginning of ride
d <- read.csv("https://careaga.s3-us-west-2.amazonaws.com/bike.csv")
d[,1] <- as.numeric(d[,1])
d[,2] <- lubridate::ymd_hms(d[,2]) |> 
  lubridate::wday(x = _, label = TRUE) |>
  as.factor()
d <- d[complete.cases(d),]
dim(d)
#> [1] 5562321       2
str(d)
#> 'data.frame':    5562321 obs. of  2 variables:
#>  $ dur: num  634 1547 178 1580 757 ...
#>  $ dt : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 2 2 2 2 2 2 2 2 2 2 ...
summary(d)
#>       dur            dt        
#>  Min.   :     60   Sun:674524  
#>  1st Qu.:    398   Mon:806726  
#>  Median :    634   Tue:839971  
#>  Mean   :    879   Wed:857451  
#>  3rd Qu.:   1042   Thu:820588  
#>  Max.   :6250750   Fri:829008  
#>                    Sat:734053
mean(d$dur) - median(d$dur)
#> [1] 244.8093
hist(d$dur)

# save out initial set
e <- d
# restrict range to aid visualization
# number of rides lasting exactly 5 minutes (300 seconds)
sum(d[,1] == 300)
#> [1] 6097
# trim to eliminate rides of 5 minutes or shorter (executive decision)
d <- d[which(d$dur > 300),]
# number of rides longer than 45 minutes
sum(d[,1] > 60*45)
#> [1] 97172
# trim to eliminate rides longer than 45 minutes (also arbitrary)
d <- d[which(d$dur <= 60*45),]
# rerun summaries
dim(d)
#> [1] 4695502       2
summary(d)
#>       dur           dt        
#>  Min.   : 301.0   Sun:565787  
#>  1st Qu.: 484.0   Mon:677543  
#>  Median : 708.0   Tue:709553  
#>  Mean   : 853.1   Wed:725912  
#>  3rd Qu.:1096.0   Thu:692125  
#>  Max.   :2700.0   Fri:701755  
#>                   Sat:622827
quantile(d$dur,probs = seq(0,1,0.05))
#>   0%   5%  10%  15%  20%  25%  30%  35%  40%  45%  50%  55%  60%  65%  70%  75% 
#>  301  338  375  411  447  484  522  563  607  655  708  766  831  906  994 1096 
#>  80%  85%  90%  95% 100% 
#> 1220 1370 1563 1853 2700
mean(d$dur) - median(d$dur)
#> [1] 145.1044
hist(d$dur)

aggregate(dur ~ dt, d, mean)
#>    dt      dur
#> 1 Sun 925.6559
#> 2 Mon 826.3782
#> 3 Tue 824.4831
#> 4 Wed 824.9405
#> 5 Thu 815.4007
#> 6 Fri 835.7804
#> 7 Sat 943.1217
aggregate(dur ~ dt, d, median)
#>    dt dur
#> 1 Sun 780
#> 2 Mon 684
#> 3 Tue 683
#> 4 Wed 685
#> 5 Thu 676
#> 6 Fri 694
#> 7 Sat 801
aggregate(dur ~ dt, d, quantile)
#>    dt dur.0% dur.25% dur.50% dur.75% dur.100%
#> 1 Sun    301     508     780    1234     2700
#> 2 Mon    301     474     684    1048     2700
#> 3 Tue    301     475     683    1044     2700
#> 4 Wed    301     476     685    1044     2700
#> 5 Thu    301     472     676    1027     2700
#> 6 Fri    301     480     694    1063     2700
#> 7 Sat    301     521     801    1260     2700
aggregate(dur ~ dt, d, range)
#>    dt dur.1 dur.2
#> 1 Sun   301  2700
#> 2 Mon   301  2700
#> 3 Tue   301  2700
#> 4 Wed   301  2700
#> 5 Thu   301  2700
#> 6 Fri   301  2700
#> 7 Sat   301  2700

ggplot(d,aes(dur,dt,group = dt)) + 
  geom_boxplot() +
  coord_flip() +
  theme_minimal()


ggplot(d,aes(dur,dt,group = dt)) +
  geom_col() +
  coord_flip() +
  theme_minimal()


ggplot(d,aes(dur,dt)) +
  geom_point() +
  theme_minimal()


ggplot(d,aes(dur)) +
  geom_histogram() +
  facet_wrap(~ dt) +
  theme_minimal()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.



# Perform the non-parametric Kruskal-Wallis test
# because days have unequal numbers of records;
# the null is that the distribution (and therefore
# the median) of ride length is the same for each
# day of the week
kruskal.test(dur ~ dt, data = e)
#> 
#>  Kruskal-Wallis rank sum test
#> 
#> data:  dur by dt
#> Kruskal-Wallis chi-squared = 41558, df = 6, p-value < 2.2e-16
# results indicate that at least one day of the
# week is different from the others
# post-hoc test
# results indicate that the medians of each
# day of week differ except Monday and Tuesday
# and Tuesday and Wednesday
pairwise.wilcox.test(e$dur, e$dt, p.adjust.method = "bonferroni")
#> 
#>  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
#> 
#> data:  e$dur and e$dt 
#> 
#>     Sun    Mon    Tue    Wed    Thu    Fri   
#> Mon <2e-16 -      -      -      -      -     
#> Tue <2e-16 1.000  -      -      -      -     
#> Wed <2e-16 0.014  0.283  -      -      -     
#> Thu <2e-16 <2e-16 <2e-16 <2e-16 -      -     
#> Fri <2e-16 <2e-16 <2e-16 <2e-16 <2e-16 -     
#> Sat <2e-16 <2e-16 <2e-16 <2e-16 <2e-16 <2e-16
#> 
#> P value adjustment method: bonferroni

# parametric test of means
# Mon-Tue, Mon-Fri, Tue-Wed, Tue-Thu, Wed-Thu means don't differ significantly
tukey_hsd(e,dur ~ dt) |> print(n = Inf)
#> # A tibble: 21 × 9
#>    term  group1 group2 null.value estimate conf.low conf.high        p.adj
#>  * <chr> <chr>  <chr>       <dbl>    <dbl>    <dbl>     <dbl>        <dbl>
#>  1 dt    Sun    Mon             0  -145.    -164.     -125.   0           
#>  2 dt    Sun    Tue             0  -153.    -172.     -133.   0           
#>  3 dt    Sun    Wed             0  -168.    -187.     -148.   0           
#>  4 dt    Sun    Thu             0  -166.    -186.     -147.   0           
#>  5 dt    Sun    Fri             0  -131.    -150.     -112.   0           
#>  6 dt    Sun    Sat             0    26.5      6.66     46.4  0.00161     
#>  7 dt    Mon    Tue             0    -7.60   -26.0      10.8  0.886       
#>  8 dt    Mon    Wed             0   -22.7    -41.0      -4.40 0.00474     
#>  9 dt    Mon    Thu             0   -21.3    -39.8      -2.84 0.0119      
#> 10 dt    Mon    Fri             0    13.9     -4.57     32.3  0.286       
#> 11 dt    Mon    Sat             0   171.     152.      190.   0           
#> 12 dt    Tue    Wed             0   -15.1    -33.2       3.02 0.175       
#> 13 dt    Tue    Thu             0   -13.7    -32.0       4.57 0.289       
#> 14 dt    Tue    Fri             0    21.5      3.22     39.7  0.00944     
#> 15 dt    Tue    Sat             0   179.     160.      198.   0           
#> 16 dt    Wed    Thu             0     1.36   -16.8      19.6  1           
#> 17 dt    Wed    Fri             0    36.5     18.4      54.7  0.0000000616
#> 18 dt    Wed    Sat             0   194.     175.      213.   0           
#> 19 dt    Thu    Fri             0    35.2     16.8      53.5  0.000000331 
#> 20 dt    Thu    Sat             0   193.     174.      212.   0           
#> 21 dt    Fri    Sat             0   158.     139.      176.   0           
#> # ℹ 1 more variable: p.adj.signif <chr>

# but is that due to unrealistically long duration records?
# such as this 72 day trip

max(e$dur) / 3600/24
#> [1] 72.34664
# reduce effect of very long and very short trips by sampling
es <- sample(dim(e)[1],1e4)
es <- e[es,]
summary(es)
#>       dur            dt      
#>  Min.   :   60.0   Sun:1196  
#>  1st Qu.:  400.0   Mon:1434  
#>  Median :  632.0   Tue:1454  
#>  Mean   :  845.3   Wed:1537  
#>  3rd Qu.: 1038.0   Thu:1563  
#>  Max.   :49468.0   Fri:1468  
#>                    Sat:1348
tukey_hsd(es,dur ~ dt) |> print(n = Inf)
#> # A tibble: 21 × 9
#>    term  group1 group2 null.value estimate conf.low conf.high        p.adj
#>  * <chr> <chr>  <chr>       <dbl>    <dbl>    <dbl>     <dbl>        <dbl>
#>  1 dt    Sun    Mon             0  -158.   -284.       -32.5  0.0039      
#>  2 dt    Sun    Tue             0  -208.   -333.       -82.6  0.0000212   
#>  3 dt    Sun    Wed             0  -164.   -288.       -39.7  0.00191     
#>  4 dt    Sun    Thu             0  -245.   -369.      -122.   0.0000000977
#>  5 dt    Sun    Fri             0  -129.   -254.        -3.82 0.0384      
#>  6 dt    Sun    Sat             0   -51.0  -179.        76.7  0.902       
#>  7 dt    Mon    Tue             0   -49.6  -169.        70.0  0.885       
#>  8 dt    Mon    Wed             0    -5.27 -123.       113.   1           
#>  9 dt    Mon    Thu             0   -87.1  -205.        30.4  0.303       
#> 10 dt    Mon    Fri             0    29.4   -89.9      149.   0.991       
#> 11 dt    Mon    Sat             0   107.    -14.5      229.   0.127       
#> 12 dt    Tue    Wed             0    44.4   -73.2      162.   0.925       
#> 13 dt    Tue    Thu             0   -37.5  -155.        79.6  0.965       
#> 14 dt    Tue    Fri             0    79.0   -39.9      198.   0.441       
#> 15 dt    Tue    Sat             0   157.     35.5      279.   0.00267     
#> 16 dt    Wed    Thu             0   -81.8  -197.        33.6  0.359       
#> 17 dt    Wed    Fri             0    34.7   -82.6      152.   0.977       
#> 18 dt    Wed    Sat             0   113.     -7.26     233.   0.0819      
#> 19 dt    Thu    Fri             0   116.     -0.324    233.   0.0512      
#> 20 dt    Thu    Sat             0   194.     75.0      314.   0.000033    
#> 21 dt    Fri    Sat             0    78.0   -43.2      199.   0.482       
#> # ℹ 1 more variable: p.adj.signif <chr>

Created on 2023-08-22 with reprex v2.0.2

Footnote: here is the shell script used to pre-process the source CSV files to pull out only the two columns of interest. It cuts down the size of the import into R by an order of magnitude.

#!/bin/bash

# Initialize the output file
echo "dur","dt" > bike

# Loop through all CSV files in the current directory
for file in *.csv; do
  # Extract the first two columns (duration and start date) and append them to the output file
  awk -F, -v OFS=, 'NR > 1 {print $1, $2}' "$file" >> bike
done
mv bike bike.csv

Thanks - it makes sense to break down the data into manageable amounts, especially since doing so will also afford the opportunity to evaluate patterns better. I will take a look at patchwork and cowplot.

Thanks. I'm still working on the nomenclature of the RStudio world. When you refer to the environment, is that the whole right half, or just the lower-right? I'm still not clear on what the individual names for the two right-side panes are. I believe the left side is the console, or is that the word for the whole set of panes, both left and right? Anyway, if I get your meaning, I need to delete the files in the lower-right while keeping the files in the upper-right, which got there after I read and/or merged previously uploaded files.

Took another look just now and I see that the environment is the top-right pane, while the lower-right pane holds the files, visualizations, et al.

Thanks - I already cleaned all 12 spreadsheets by removing all records under 5 minutes and all records that ended on a date after the start date. Even rides that started at 11:50 pm and ended at 12:05 am were removed. Depending on the month, the percentage of records removed varied from about 15% in the warmer months to about 32% in the colder months. In other words, all remaining records show rides beginning and ending on the same day, so no rides are over 24 hours. All told, there are still over 4 million rows for one year, but I'm not working at the year level anymore - just by the month and, especially, by the day of the week. So the forthcoming visualizations should be more relevant and less prone to fuzziness from overplotting.
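
For anyone following along, that kind of cleaning can be written as a single dplyr filter; started_at and ended_at below are hypothetical stand-ins for the actual timestamp columns, and ride_length is assumed to be in minutes as in the original post:

library(dplyr)

cleaned <- merged_data |>
  filter(ride_length >= 5,                           # drop rides under 5 minutes
         as.Date(started_at) == as.Date(ended_at))   # keep only same-day rides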

