Hi, this is my code, I'm actually using csv files which I have edited in excel before linking to variable in R.
title: "R Notebook"
author: "Edwin Tay"
date: "26/7/2023"
output: html_notebook
Case Study for Cyclistic
Introduction
Cyclists is a bike-share offering company that launched since 2016 and has grown to a fleet of 5824 bicycles that are geotracked and locked into a network of 692 stations across Chicago.
They offer different pricing plans for both casual riders and annual members.
Their finance analysts have concluded that annual members are much more profitable than casual riders.
STEP 1: ASK
Business Task/Question:
How do annual members and casual riders use Cyclistic bikes differently?
Stakeholders:
- Lily Moreno (Director of Marketing and Manager)
- Cyclistic Executive Team
# Packages required for this analysis
install.packages("tidyverse") # for data import
install.packages("lubridate") # for date functions
install.packages("ggplot2") # for visualization
library(tidyverse)
library(lubridate)
library(ggplot2)
library(dplyr)
library(readr)
STEP 2: PREPARE
qw months of data were collected (Jul 2022 to Jun 2023) instead of 12.
R Studio keeps crashing if I uploaded too much datasets even with 6 datasets.
The data collected is first-party type as they were collected by Cyclistic themselves.
The data collected is also under license by Lyft Bikes and Scooters, LLC ("Bikeshare").
Therefore, the data is reliable.
Data preparation and cleaning will be done in spreadsheet.
STEP 3: PROCESS
R Programming was chosen to handle the process as it can help to break down large data and show great visualization.
Compared between the original and new datasets and the changelog to ensure that the data is clean and properly documented.
Verified that the data is clean (unnecessary data filtered out).
Data Cleaning done in Spreadsheets before importing into R Studio
First change of datasets: Removed lattitude and longitude columns of the datasets (irrelevant in our analysis)
Second change of datasets: Removed station names/id columns from datasets (there are majority of missing data in some months)
Third change of datasets: Change the time format to dd/mm/yyyy hh:mm
Fourth change of datasets: Added a new column for the trip_duration where end time - start time in hh:mm format
Linking to datasets
d202207 <- read_csv("C:/Users/edwin/Desktop/Case_Studies/001/Data/202207-divvy-tripdata-remove.csv")
d202208 <- read_csv("C:/Users/edwin/Desktop/Case_Studies/001/Data/202208-divvy-tripdata-remove.csv")
d202209 <- read_csv("C:/Users/edwin/Desktop/Case_Studies/001/Data/202209-divvy-tripdata-remove.csv")
d202210 <- read_csv("C:/Users/edwin/Desktop/Case_Studies/001/Data/202210-divvy-tripdata-remove.csv")
d202211 <- read_csv("C:/Users/edwin/Desktop/Case_Studies/001/Data/202211-divvy-tripdata-remove.csv")
d202212 <- read_csv("C:/Users/edwin/Desktop/Case_Studies/001/Data/202212-divvy-tripdata-remove.csv")
d202301 <- read_csv("C:/Users/edwin/Desktop/Case_Studies/001/Data/202301-divvy-tripdata-remove.csv")
d202302 <- read_csv("C:/Users/edwin/Desktop/Case_Studies/001/Data/202302-divvy-tripdata-remove.csv")
d202303 <- read_csv("C:/Users/edwin/Desktop/Case_Studies/001/Data/202303-divvy-tripdata-remove.csv")
d202304 <- read_csv("C:/Users/edwin/Desktop/Case_Studies/001/Data/202304-divvy-tripdata-remove.csv")
d202305 <- read_csv("C:/Users/edwin/Desktop/Case_Studies/001/Data/202305-divvy-tripdata-remove.csv")
d202306 <- read_csv("C:/Users/edwin/Desktop/Case_Studies/001/Data/202306-divvy-tripdata-remove.csv")
STEP 4: ANALYZE
Datasets will be formatted to match each other in order to combine them.
# Checking the columns of all datasets (must be same)
colnames(d202207)
colnames(d202208)
colnames(d202209)
colnames(d202210)
colnames(d202211)
colnames(d202212)
colnames(d202301)
colnames(d202302)
colnames(d202303)
colnames(d202304)
colnames(d202305)
colnames(d202306)
# Check if there is any abnormalities
str(d202207)
str(d202208)
str(d202209)
str(d202210)
str(d202211)
str(d202212)
str(d202301)
str(d202302)
str(d202303)
str(d202304)
str(d202305)
str(d202306)
# Convert ride_id and rideable_type to character type
d202207 <- mutate(d202207, ride_id=as.character(ride_id), rideable_type=as.character(rideable_type))
d202208 <- mutate(d202208, ride_id=as.character(ride_id), rideable_type=as.character(rideable_type))
d202209 <- mutate(d202209, ride_id=as.character(ride_id), rideable_type=as.character(rideable_type))
d202210 <- mutate(d202210, ride_id=as.character(ride_id), rideable_type=as.character(rideable_type))
d202211 <- mutate(d202211, ride_id=as.character(ride_id), rideable_type=as.character(rideable_type))
d202212 <- mutate(d202212, ride_id=as.character(ride_id), rideable_type=as.character(rideable_type))
d202301 <- mutate(d202301, ride_id=as.character(ride_id), rideable_type=as.character(rideable_type))
d202302 <- mutate(d202302, ride_id=as.character(ride_id), rideable_type=as.character(rideable_type))
d202303 <- mutate(d202303, ride_id=as.character(ride_id), rideable_type=as.character(rideable_type))
d202304 <- mutate(d202304, ride_id=as.character(ride_id), rideable_type=as.character(rideable_type))
d202305 <- mutate(d202305, ride_id=as.character(ride_id), rideable_type=as.character(rideable_type))
d202306 <- mutate(d202306, ride_id=as.character(ride_id), rideable_type=as.character(rideable_type))
# Combine the datasets together
all_trips <- bind_rows(d202207, d202208, d202209, d202210, d202211, d202212, d202301, d202302, d202303, d202304, d202305, d202306)
all_trips <-mutate(all_trips, trip_time=as.numeric(trip_time))
Inspecting the new table for any irregularities or abnormalities.
colnames(all_trips)
nrow(all_trips)
dim(all_trips)
head(all_trips)
str(all_trips)
summary(all_trips)
# Checking if there are abnormalities in member_casual and trip_duration columns.
check_trips <- all_trips %>%
group_by(member_casual) %>%
summarize(max = max(trip_time))
head(check_trips)