Asking for feedback and suggestions on my first project coded in R with RStudio

I am relatively new to data analytics and have been teaching myself through the Google Data Analytics Professional Certificate course on Coursera.

I have completed my first ever data analytics project, which I published on GitHub. It is a very simple project where I clean, organise, transform and analyse a dataset on the food choices and preferences of college students. You can refer to this link for more details: GitHub - charlenelow/Food-Choices-and-Preferences-of-College-Students

I would like to get some feedback and suggestions on what I can improve on! I would greatly appreciate your help for that! Thank you in advance!

Please refer to my README.md file within the repository to get more details regarding the project.

Great work on data wrangling. As a follow-up, you might want to run the data through the {vcd} package, which is geared to visualizing categorical data such as questionnaire results.

A few tips/recommendations:

  1. Don't use absolute paths. I saw this in your code:

       food_survey_respond_df <- readr::read_csv("~/Desktop/My Projects/Food Choices and Preference of College Students/food_coded.csv")

     Will other people running this code have the data in the same place? Probably not. Instead, I recommend using the here package (https://here.r-lib.org/; see also "Project-oriented workflow"). This could be replaced with:

       food_survey_respond_df <- readr::read_csv(here::here("food_coded.csv"))
  2. Why are you deleting all rows with missing data? That removes more than half the rows.
  3. At the beginning of the analysis, before diving in, give some context. Where did this data come from? How was it collected? When was it collected? I have no idea whether this data is from the 1980s or 2023, or whether it covers one college, colleges across the US, or colleges across the world. Was it a random sample? A convenience sample?
  4. You're typing a lot more than you need to. Look into functions such as case_when() or case_match(). Here's an example of your code and a more compact form which also does a check of the recoding comparing the original and recoded variable:
weight_df <- weight_df %>% 
  mutate(exercise_Num = exercise) # create a duplicate column to preserve original data
weight_df$exercise <- str_replace_all(weight_df$exercise, "1", "Everyday")
weight_df$exercise <- str_replace_all(weight_df$exercise, "3", "once a week")
weight_df$exercise <- str_replace_all(weight_df$exercise, "2", "2-3 times a week")
weight_df$exercise <- str_replace_all(weight_df$exercise, "4", "Sometimes")
weight_df$exercise <- str_replace_all(weight_df$exercise, "5", "Never")

weight_df %>% 
  count(exercise)

# More compact form using case_match(), plus a check of the recoding
weight_df <- weight_df %>% 
  mutate(
    exercise_Num = exercise, # create a duplicate column to preserve original data
    exercise = case_match(
      exercise_Num,
      1 ~ "Everyday",
      2 ~ "2-3 times a week",
      3 ~ "Once a week",
      4 ~ "Sometimes",
      5 ~ "Never"
    )
  )
weight_df %>% 
  count(exercise_Num, exercise)
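
On point 2, here is a minimal sketch of how one might check missingness before dropping anything. The `survey` data frame and its column names are made up for illustration and are not taken from the actual dataset; the idea is only to compare how many rows survive when every column must be complete versus when only the columns needed for a given analysis must be complete.

```r
# Hypothetical toy survey data; column names are invented for illustration
survey <- data.frame(
  weight   = c(150, NA, 180, 165, NA, 140),
  exercise = c(1, 2, NA, 3, 1, 2),
  income   = c(3, 4, 5, NA, 2, 6)
)

# Share of missing values in each column
na_share <- colMeans(is.na(survey))
print(na_share)

# Rows kept if every column must be complete
n_all <- sum(complete.cases(survey))

# Rows kept if only the columns used in one analysis must be complete
n_subset <- sum(complete.cases(survey[, c("weight", "exercise")]))

cat("complete on all columns:", n_all, "\n")
cat("complete on weight + exercise:", n_subset, "\n")
```

If only a few columns account for most of the missing values, restricting the check to the columns each analysis actually uses will usually keep far more rows than dropping every incomplete row up front.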

Thanks for your kind input!! I managed to shorten my code using the case_when and case_match functions!! You made a very good point about not using absolute paths. I will be using the here package to import my raw datasets from now on!

As for the 2nd and 3rd points, I got this dataset from Kaggle's open data, and I will try to provide more context about it before the analysis.

Oh, I see! May I ask: if I am plotting a stacked bar chart, is there a difference between using the vcd package and the ggplot2 package? Or is the vcd package more useful for showing correlation between categorical variables, and not so much for the data viz itself?

Hi Steph,

Thank you for your valuable suggestions and feedback on my work. I have learnt a lot from all your input, and I have made some changes to my work. I am eager to continue refining it. I invite you to take a look at it here: GitHub - charlenelow/Food-Choices-and-Preferences-of-College-Students, and I would sincerely appreciate any further suggestions you might offer.

Here's what a typical plot using vcd::mosaic would look like:

library(data.table)
library(vcd)
#> Loading required package: grid
d <- fread("https://raw.githubusercontent.com/charlenelow/Food-Choices-and-Preferences-of-College-Students/main/Data/food_coded.csv")
e <- d[,c("marital_status","income","drink","eating_changes_coded",
          "ethnic_food","sports")]
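# lapply() keeps each column a factor; sapply() would collapse them to a character matrix
e <- as.data.frame(lapply(e, as.factor))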
mosaic(~ marital_status + income + drink, data = e,
       main = "Comforting", shade = TRUE, legend = TRUE)

mosaic(~ ethnic_food + sports, data = e,
       main = "Comforting", shade = TRUE, legend = TRUE)

mosaic(~ marital_status + sports, data = e,
       main = "Comforting", shade = TRUE, legend = TRUE)

mosaic(~ income + sports, data = e,
       main = "Comforting", shade = TRUE, legend = TRUE)

mosaic(~ marital_status + ethnic_food, data = e,
       main = "Comforting", shade = TRUE, legend = TRUE)

Created on 2023-08-15 with reprex v2.0.2
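
As far as I understand it, the shading in these mosaic plots (shade = TRUE) is based on Pearson residuals from an independence model, which is what makes them more about association between categorical variables than about the bar heights themselves. The sketch below computes those residuals by hand with base R on a small two-way table. The counts are hypothetical and only borrow the variable names `sports` and `ethnic_food` for readability; they are not results from the survey.

```r
# Hypothetical 2 x 2 table of counts (NOT taken from the survey data)
tab <- as.table(matrix(
  c(30, 10,
    20, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(sports = c("no", "yes"),
                  ethnic_food = c("low", "high"))
))

# Chi-squared test of independence, without continuity correction
test <- chisq.test(tab, correct = FALSE)

# Counts expected if the two variables were independent
print(test$expected)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson <- test$residuals
print(round(pearson, 2))

# The squared residuals add up to the chi-squared statistic
print(sum(pearson^2))
print(unname(test$statistic))
```

Cells with large positive residuals are over-represented relative to independence and cells with large negative residuals are under-represented, which is the pattern the mosaic shading is meant to highlight. A stacked bar chart in ggplot2 shows the same counts but leaves that comparison to the eye.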
