Note: knitting this document currently either produces the knit.md file or terminates the session.
I am using the brfss2013 dataset.
Setting up environment
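# Install ggplot2 only when it is missing; an unconditional install.packages()
# call reinstalls on every run and can fail during knitting if no CRAN mirror is set
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(ggplot2)
library(dplyr)
library(tidyr)
# Load the survey data; this creates the brfss2013 data frame
load("brfss2013.gz")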
Part 3: Exploratory Data Analysis
Research question 1
Based on the report, is a respondent’s body weight linked to their level of education? Can we discern any difference in BMI across income groups? This is an interesting question because it focuses on the link between a person’s body mass and their income and education status.
# Select adequate variables from dataset and omit NAs
dd <- select(brfss2013, X_educag, X_incomg, X_bmi5cat) %>%
  na.omit()
dim(dd)
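To see what the select and na.omit step does to the row count, here is a minimal sketch on a made-up data frame. The `toy` table and its values are hypothetical stand-ins for brfss2013, not real survey records; only the column names come from the analysis above.

```r
library(dplyr)

# Hypothetical toy data standing in for brfss2013 (values are made up)
toy <- data.frame(
  X_educag  = c("Graduated high school", NA, "Graduated from college", "Graduated high school"),
  X_incomg  = c("$50,000 or more", "Less than $15,000", NA, "$50,000 or more"),
  X_bmi5cat = c("Obese", "Overweight", "Normal weight", NA)
)

# Keep the three columns, then drop every row that has an NA in any of them
toy_complete <- toy %>%
  select(X_educag, X_incomg, X_bmi5cat) %>%
  na.omit()

dim(toy)           # 4 rows, 3 columns
dim(toy_complete)  # 1 row, 3 columns: only the first row is complete
```

Because na.omit drops a row when any selected column is missing, comparing dim() before and after shows how much data the later plots are based on.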
We can use 'group_by' and 'summarize' to view the number of respondents in each 'X_educag' category together with a summary of body mass. Note that X_bm holds the numeric code of the BMI category, so the mean below is a mean category code rather than a mean BMI value:
ddd <- dd %>%
  mutate(X_bm = as.numeric(X_bmi5cat))
ddd %>%
  group_by(X_educag) %>%
  summarize(mean_bm = mean(X_bm), count = n())
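The summary above relies on as.numeric() turning a factor into its integer level codes. A small sketch with made-up data shows what that means in practice. The `toy` data frame, the education labels "A" and "B", and the assumed level order (Underweight, Normal weight, Overweight, Obese) are illustrative assumptions, not values taken from brfss2013.

```r
library(dplyr)

# Assumed ordering of the BMI categories (hypothetical, for illustration only)
bmi_levels <- c("Underweight", "Normal weight", "Overweight", "Obese")

toy <- data.frame(
  X_educag  = c("A", "A", "B", "B"),
  X_bmi5cat = factor(c("Normal weight", "Obese", "Overweight", "Obese"),
                     levels = bmi_levels)
)

# as.numeric() on a factor returns the position of each value in levels()
as.numeric(toy$X_bmi5cat)  # 2 4 3 4

# Mean category code and group size per education level
toy_summary <- toy %>%
  mutate(X_bm = as.numeric(X_bmi5cat)) %>%
  group_by(X_educag) %>%
  summarize(mean_bmi_code = mean(X_bm), count = n())

toy_summary  # A: mean code 3, count 2; B: mean code 3.5, count 2
```

A higher mean code means the group leans toward the heavier categories, but the number is on a 1 to 4 ordinal scale and should not be read as a BMI.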
Before drawing a bar graph, let's convert the variables to factors with the as.factor function:
# Specify a column type to be factor (also called categorical or enumerative)
dd$X_educag <- as.factor(dd$X_educag)
dd$X_incomg <- as.factor(dd$X_incomg)
dd$X_bmi5cat <- as.factor(dd$X_bmi5cat)
Now, let's check the graph:
# Visualize the relationship of variables
cc <- ggplot(dd, aes(x = X_educag, fill = X_incomg)) +
  geom_bar() + facet_wrap(~X_bmi5cat, ncol = 2)
# Enhance readability
cc <- cc + xlab("Level of education completed") + ylab("Count") +
  scale_fill_discrete(name = "Income categories") +
  ggtitle("BMI related with the level of education and Income categories") +
  theme(axis.text.x = element_text(face = "bold", size = 7, angle = 90),
        plot.title = element_text(face = "italic", hjust = 0.5, size = 12))
cc
Research question 2
Based on the report, we can examine whether high blood pressure is linked to a respondent’s marital status and gender. The outcome could indicate whether high blood pressure differs by marital status and by gender. Would it be better to get married for the sake of healthy blood pressure? Let’s check it out.
# Select adequate variables from dataset and omit NAs
qq <- select(brfss2013, bpmeds, marital, sex) %>%
  na.omit()
We can use 'filter' and 'group_by' to view the 'bpmeds' value counts for the 'Married' category of the 'marital' variable as a summary statistic. This gives the number of married respondents who are and are not taking medicine for high blood pressure:
qq %>%
  filter(marital == "Married") %>%
  group_by(bpmeds) %>%
  summarize(count = n())
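Raw counts depend on how many respondents fall into each marital group, so comparing groups may be easier with within-group shares. The following is a minimal sketch of that idea on made-up data; the `toy` data frame and the `share` column are hypothetical and are not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  marital = c("Married", "Married", "Married", "Divorced", "Divorced"),
  bpmeds  = c("Yes", "Yes", "No", "Yes", "No")
)

# Count each marital/bpmeds combination, then divide by the marital group total
bp_share <- toy %>%
  count(marital, bpmeds, name = "count") %>%
  group_by(marital) %>%
  mutate(share = count / sum(count)) %>%
  ungroup()

bp_share
```

In this toy example two of the three married respondents take medicine, so their share is 2/3, while the divorced group splits evenly.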
Let's visualize the three variables together:
# Visualize in a bar plot
gg <- ggplot(qq) + aes(x = marital, fill = sex) + geom_bar() + facet_grid(. ~ bpmeds)
# Enhance readability
gg <- gg + ylab("Count") + scale_fill_discrete(name = "Sex") + ggtitle("Taking high blood pressure medicine") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(face = "bold", size = 7, angle = 45),
        plot.title = element_text(face = "bold", hjust = 0.5, size = 13, color = "darkblue"),
        axis.title.y = element_text(angle = 0, face = "italic", size = 11),
        legend.title = element_text(face = "italic", size = 12))
gg
Research question 3
Based on the report, is the length of a respondent's sleep time associated with their marital status? The answer to this question could prompt further research inquiries into the relationship between marital status and respondents' health in general.
# Select the relevant variables and count the missing (NA) sleep time values
am <- brfss2013[,c("marital","sleptim1")]
sum(is.na(am$sleptim1))
Using the group_by function, we can see the number of respondents in each marital status:
am %>%
  group_by(marital) %>%
  summarize(count = n())
We can also calculate respondents' mean sleep time by marital status:
# Calculate the mean sleep time per marital status and set new column names with colnames().
# colnames() gets or sets the column names of a matrix or data frame.
am <- am[!is.na(am$sleptim1), ]
am <- aggregate(am[, "sleptim1"], list(am$marital), mean)
colnames(am) <- c('Marital_status', 'Average_sleep_time')
am <- am[am$Marital_status != 0, ]
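Since dplyr is already loaded, the same per-group mean could also be written as a pipeline. This is a sketch of an alternative to the aggregate() call, not the method used above, and it runs on a made-up `toy` data frame rather than on brfss2013.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  marital  = c("Married", "Married", "Divorced", "Divorced", "Widowed"),
  sleptim1 = c(7, 8, 6, NA, 9)
)

# Drop rows with missing values, average sleep time per marital status,
# and rename the grouping column to match the names used in the plot
sleep_by_status <- toy %>%
  filter(!is.na(sleptim1), !is.na(marital)) %>%
  group_by(marital) %>%
  summarize(Average_sleep_time = mean(sleptim1)) %>%
  rename(Marital_status = marital)

sleep_by_status  # Divorced 6, Married 7.5, Widowed 9
```

The pipeline names the output columns directly, so no separate colnames() step is needed.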
Let's visualize the average sleep time by marital status:
# Visualize the relationship of variables and enhance readability.
bb <- ggplot(am, aes(x = Marital_status, y = Average_sleep_time, group = 1)) +
  geom_point() + geom_line() +
  xlab("Marital status") + ylab("Average sleep time") +
  ggtitle("Plot of sleep time per Marital status") +
  theme(axis.text.x = element_text(face = "bold", size = 10, angle = 90),
        axis.title.x = element_blank(),
        plot.title = element_text(face = "bold", hjust = 0.5, size = 13, color = "darkblue"),
        axis.title.y = element_text(size = 11))
bb