Trying to execute Logistic Regression in R: Predicting Customer Conversions

AyomideA · March 15, 2024, 3:43am

OK, I can send the raw data, but I did not proceed. I just thought I needed to change the data format before I attempted to run the split test.

Also, in your example, as I understand it, you are using head() to preview the first 10 rows of the dataset. How would I perform this on the entire dataset, which is 365,069 rows of data?

Here is the output of the code:

> dataset2 <- read.csv("Bike_Trips_2019.csv")
> dput(head(dataset2,10))
structure(list(trip_id = 21742443:21742452, start_time = c("2019-01-01 0:04:37", 
"2019-01-01 0:08:13", "2019-01-01 0:13:23", "2019-01-01 0:13:45", 
"2019-01-01 0:14:52", "2019-01-01 0:15:33", "2019-01-01 0:16:06", 
"2019-01-01 0:18:41", "2019-01-01 0:18:43", "2019-01-01 0:19:18"
), end_time = c("2019-01-01 0:11:07", "2019-01-01 0:15:34", "2019-01-01 0:27:12", 
"2019-01-01 0:43:28", "2019-01-01 0:20:56", "2019-01-01 0:19:09", 
"2019-01-01 0:19:03", "2019-01-01 0:20:21", "2019-01-01 0:47:30", 
"2019-01-01 0:24:54"), bikeid = c(2167L, 4386L, 1524L, 252L, 
1170L, 2437L, 2708L, 2796L, 6205L, 3939L), tripduration = c("390", 
"441", "829", "1,783.00", "364", "216", "177", "100", "1,727.00", 
"336"), from_station_id = c(199L, 44L, 15L, 123L, 173L, 98L, 
98L, 211L, 150L, 268L), from_station_name = c("Wabash Ave & Grand Ave", 
"State St & Randolph St", "Racine Ave & 18th St", "California Ave & Milwaukee Ave", 
"Mies van der Rohe Way & Chicago Ave", "LaSalle St & Washington St", 
"LaSalle St & Washington St", "St. Clair St & Erie St", "Fort Dearborn Dr & 31st St", 
"Lake Shore Dr & North Blvd"), to_station_id = c(84L, 624L, 644L, 
176L, 35L, 49L, 49L, 142L, 148L, 141L), to_station_name = c("Milwaukee Ave & Grand Ave", 
"Dearborn St & Van Buren St (*)", "Western Ave & Fillmore St (*)", 
"Clark St & Elm St", "Streeter Dr & Grand Ave", "Dearborn St & Monroe St", 
"Dearborn St & Monroe St", "McClurg Ct & Erie St", "State St & 33rd St", 
"Clark St & Lincoln Ave"), user_type = c("Subscriber", "Subscriber", 
"Subscriber", "Subscriber", "Subscriber", "Subscriber", "Subscriber", 
"Subscriber", "Subscriber", "Subscriber"), gender = c("Male", 
"Female", "Female", "Male", "Male", "Female", "Male", "Male", 
"Male", "Male"), birthyear = c(1989L, 1990L, 1994L, 1993L, 1994L, 
1983L, 1984L, 1990L, 1995L, 1996L), ride_length = c("0:06:30", 
"0:07:21", "0:13:49", "0:29:43", "0:06:04", "0:03:36", "0:02:57", 
"0:01:40", "0:28:47", "0:05:36"), day_of_week = c(3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L)), row.names = c(NA, 10L), class = "data.frame")

FJCC · March 15, 2024, 4:28am

It turns out the code does work on the raw data, so you can run the following code on your full data set. The column transformations are exactly the same as I used on the first ten rows.

library(tidyverse)
library(tidymodels)
library(hms)
dataset2 <- read.csv("Bike_Trips_2019.csv")
dataset2$user_type <- factor(dataset2$user_type, levels = c("Customer", "Subscriber"), labels = c("no","yes"))
dataset2$trip_id <- as.character(dataset2$trip_id)
dataset2$start_time <- as.POSIXct(dataset2$start_time)
dataset2$end_time <- as.POSIXct(dataset2$end_time)
dataset2$tripduration <- parse_number(dataset2$tripduration)
dataset2$ride_length <- as_hms(dataset2$ride_length)
set.seed(421)
split <- initial_split(dataset2, prop = 0.8, strata = user_type)
train <- split %>% training()
test <- split %>% testing()

AyomideA · March 16, 2024, 7:51pm

Ok no prob! I tried to run that code and received this output:

Error in `abort_lossy_cast()`:
! Lossy cast from <character> to <hms> at position(s) 101, 146, 854, 1405, 7935, ... (and 187 more)
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/rlang_error>
Error in `abort_lossy_cast()`:
! Lossy cast from <character> to <hms> at position(s) 101, 146, 854, 1405, 7935, ... (and 187 more)
---
Backtrace:
    ▆
 1. ├─hms::as_hms(dataset2$ride_length)
 2. └─hms:::as_hms.default(dataset2$ride_length)
 3.   └─vctrs::vec_cast(x, new_hms())
 4.     └─vctrs (local) `<fn>`()
 5.       └─hms:::vec_cast.hms.character(...)
 6.         └─hms:::abort_lossy_cast(x, to, ..., lossy = lossy)
Run rlang::last_trace(drop = FALSE) to see 1 hidden frame.
>

FJCC · March 16, 2024, 9:05pm

The data at the given positions is probably malformed. The code only uses as_hms() on the ride_length column and the error message gives you some row numbers, so you know where to look for a problem. You can do

dataset2[c(101,146,854,1405,7935), "ride_length"]

to see the specific values causing the problem.
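
The error message prints only the first five positions and mentions 187 more. One way to list every affected row is to test each value with base R's strptime(), which returns NA for strings it cannot read as a clock time. This is a minimal sketch on a toy vector whose values are copied from this thread. It assumes that as_hms() rejects the same strings that strptime() does with the "%H:%M:%S" format, which is not confirmed here.

```r
# Toy stand-in for dataset2$ride_length (values copied from the thread)
ride_length <- c("0:06:30", "31:14:26", "0:29:43", "26:30:30", "71:23:36")

# strptime() yields NA wherever the string is not a valid clock time;
# tz = "UTC" avoids daylight-saving gaps affecting the check
parsed <- strptime(ride_length, format = "%H:%M:%S", tz = "UTC")

# Positions and values that failed to parse
bad_rows <- which(is.na(parsed))
bad_rows
ride_length[bad_rows]
```

On the full data, replacing the toy vector with dataset2$ride_length should give the complete list of positions rather than only the first five.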

AyomideA · March 16, 2024, 9:06pm

Ahh thanks, will do, one moment, please!

AyomideA · March 16, 2024, 9:09pm

That output is:

[1] "31:14:26" "26:30:30" "71:23:36" "27:56:32" "40:56:47"

FJCC · March 16, 2024, 9:50pm

I didn't know that as_hms() will not handle hours above 24, and I haven't found a simple way to deal with it after looking for a few minutes. The first thing to consider is whether you need that column to be converted at all. The tripduration column already stores the duration of the ride in seconds. If you don't need ride_length converted, just drop that line of the code.
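
If the ride_length column is needed after all, one possible workaround is to split each string on the colons and compute the total seconds by hand, which places no limit on the hours field. The sketch below is an illustration only: hms_to_seconds() is a hypothetical helper written for this example, not part of hms or base R, and the sample values are copied from this thread.

```r
# Hypothetical helper: convert "H:MM:SS" strings to total seconds,
# allowing an hours field of 24 or more
hms_to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    # Anything that is not exactly three fields (including NA) becomes NA
    if (length(p) != 3L) return(NA_real_)
    p <- as.numeric(p)
    p[1] * 3600 + p[2] * 60 + p[3]
  }, numeric(1))
}

ride_length <- c("0:06:30", "31:14:26", "71:23:36")
secs <- hms_to_seconds(ride_length)
secs

# Optional: hms::hms() builds an hms vector from a number of seconds
if (requireNamespace("hms", quietly = TRUE)) {
  print(hms::hms(seconds = secs))
}
```

For the first row of the sample data, "0:06:30" works out to 390 seconds, which matches the tripduration value of "390" in the same row, so the two columns appear to carry the same information.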

AyomideA · March 16, 2024, 11:06pm

I found an article that mentioned using the function strptime() instead.

So I applied it and ran this code instead:

library(tidymodels)
library(hms)
dataset2 <- read.csv("Bike_Trips_2019.csv")
dataset2$user_type <- factor(dataset2$user_type, levels = c("Customer", "Subscriber"), labels = c("no","yes"))
dataset2$trip_id <- as.character(dataset2$trip_id)
dataset2$start_time <- strptime(dataset2$start_time, format = "%m/%d/%Y %H/%M")
dataset2$end_time <- strptime(dataset2$end_time,  format = "%m/%d/%Y %H/%M")
dataset2$tripduration <- parse_number(dataset2$tripduration)
dataset2$ride_length <- strptime(dataset2$ride_length, format = "%m/%d/%Y %H/%M" )
set.seed(421)
split <- initial_split(dataset2, prop = 0.8, strata = user_type)
train <- split %>% training()
test <- split %>% testing()

Then, per the instructions, I moved on to train the logistic regression model so that it would produce the coefficients.

# Train a logistic regression model
model <- logistic_reg(mixture = double(1), penalty = double(1)) %>%
  set_engine("glmnet") %>%
  set_mode("classification") %>%
  fit(user_type ~ ., data = train)

# Model summary
tidy(model)

Output:

Error in model.frame.default(formula, data) : 
  invalid type (list) for variable 'start_time'
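
A note on this error: strptime() returns a POSIXlt object, which R stores as a list of date-time fields, and model.frame() rejects list columns. Separately, the format string "%m/%d/%Y %H/%M" does not match timestamps such as "2019-01-01 0:04:37", so every value would be parsed as NA. The sketch below shows both effects using two timestamps copied from the dput() output earlier in the thread. The tz = "UTC" argument is an assumption made to keep the example reproducible, not something stated in the thread.

```r
# Two timestamps copied from the thread's dput() output
x <- c("2019-01-01 0:04:37", "2019-01-01 0:08:13")

# The format from the post does not match the data, so every value is NA
wrong <- strptime(x, format = "%m/%d/%Y %H/%M", tz = "UTC")
all(is.na(wrong))

# Even with a matching format, strptime() returns POSIXlt, which is a list
lt <- strptime(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
class(lt)
is.list(lt)

# as.POSIXct() stores one number per timestamp, so it is not a list
ct <- as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
class(ct)
is.list(ct)
```

Under these assumptions, keeping as.POSIXct() for start_time and end_time, as in the earlier reply, avoids the list-column error. The ride_length line has the same problem, since a date format will not parse a value like "0:06:30".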
