Ecommerce dataset using R

karthikpaikota · November 28, 2023, 1:09pm

https://www.kaggle.com/datasets/shriyashjagtap/e-commerce-customer-for-behavior-analysis/discussion/453676

So we are planning to do project using R for the above dataset Can someone guide us
What criteria are used to determine whether a customer is categorized as 'retained' (assigned a value of 0) or 'churned' (assigned a value of 1) in the binary churn column?

technocrat · November 28, 2023, 8:15pm

There's a good outline of the problem, especially the flow chart.

This is a problem in logistic regression because churn is a binary variable, typically called the response variable. From the code page, there are additional variables available to model the response. Conventionally, these are called treatment variables.

Customer ID: A unique identifier for each customer.
Customer Name: The name of the customer (generated by Faker).
Customer Age: The age of the customer (generated by Faker).
Gender: The gender of the customer (generated by Faker).
Purchase Date: The date of each purchase made by the customer.
Product Category: The category or type of the purchased product.
Product Price: The price of the purchased product.
Quantity: The quantity of the product purchased.
Total Purchase Amount: The total amount spent by the customer in each transaction.
Payment Method: The method of payment used by the customer (e.g., credit card, PayPal).
Returns: Whether the customer returned any products from the order (binary: 0 for no return, 1 for return).

Among these are a Customer ID. IRL, these might be random, in which case we wouldn't expect there to be any useful information as a treatment variable. But they could also be sequential, partially overlapping with Purchase Date leading to an issue called colinearity for which model adjusts may need to be made. There may also be other overlapping variables.

You can find worked examples of models using the similar German bank credit dataset in R on various websites. Here are some examples:

Logistic Regression Model: A logistic regression model is fitted to the dataset using selected covariates like Account Balance, Payment Status of Previous Credit, Purpose, Length of current employment, and Sex/Marital Status. The ROC curve and the AUC are computed for this model.
Logistic Regression with All Explanatory Variables: Another approach is to fit a logistic regression model using all explanatory variables in the dataset. This model's performance is also evaluated using the ROC curve and AUC.
Regression Tree Model: A regression tree model is created using all covariates in the dataset. The performance of this model is assessed with the ROC curve and AUC. Regression trees are visualized using the rpart and rpart.plot .
Random Forest Model: Finally, a random forest model is fitted to the dataset, and its performance is evaluated using ROC curves and AUC. This approach involves growing several trees using a bootstrap procedure and then aggregating those predictions.

These examples illustrate different modeling techniques applied to the German credit dataset in R, providing insights into the practical application of statistical models in finance and credit risk analysis.

cailynkerr321 · December 19, 2023, 11:10am

Regarding your eCommerce dataset, determining customer retention vs. churn is a crucial puzzle. Consider factors like purchase frequency, time since last purchase, or interactions with the platform. You might want to explore clustering or classification algorithms to unveil patterns.

system · January 9, 2024, 11:11am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.