Every R
problem can be thought of to advantage as the interaction of three objects: an existing object, x , a desired object, y , and a function,f that will return a value of y given x as an argument. In R
everything is an object and objects may contain other objects. Functions are first class objects, just as in f(g(x)).
With f(x)=y in mind, this problem has
x, a data frame of student records arranged in a data frame
by course and semester and, optionally year. If there are separate data frames
, additional preprocessing will be needed to use the script below.
y, a data frame
containing summary statistics by course and semester for n
, the number of students with grading.progressPercent
below a cutoff and the mean of grading.progressPercent
for such students. The cutoff is based on those values greater than the second quartile minus 1.5 times the interquartile difference of the fourth and second quartile. NOTE: this is the Tukey IQR implemented in the fivenum
function. The IQR
function uses a different algorithm. Cf. the difference between the boxplot shown from ggplot2::geom_boxplot
with that returned by the boxplot
function.
f could be created b composing the steps below into a single function, but it is preferable to leave them standalone to be able to follow the composition step by step.
In outline:
- Read in x and discard variables not relevant to the task
- Create a composite course/semester variable
- Determine the cutoff for classifying the
variable
- Identify the
percent
values less than that cutoff
- Take the mean of those values
- Roll everything into a single data frame, keeping a list for each course/semester combination of the
percent
values
An intermediate byproduct of the aggregation method is the creation of multi-level objects. Hence, the ugly subset operators. It will be worthwhile reviewing the syntax to see when to use object[row,column], object[element], object[row,] (all columns) and object[[1]], etc. Once internalized, it simplifies this class of problem.
There is an unresolved situation with course_nest[5,]
that should be hand-checked.
suppressPackageStartupMessages({
library(dplyr)
library(ggplot2)
library(purrr)
library(tibble)
})
# identify points more than 1.5 IQR less than lower hinge, using
# Tukey method
get_cutoff <- function(x) {
fivenum(x)[2] -
(fivenum(x)[4] - fivenum(x)[2]) * 1.5
}
get_course_cutoff <- function(x) get_cutoff(course_nest$data[[x]][[1]])
get_out_mean <- function(x) {
mean(course_nest$data[[x]][[1]][which(course_nest$data[[x]][[1]] < course_nest[3][[1]][x])], na.rm = TRUE)
}
get_out_no <- function(x) {
length(which(course_nest$data[[x]][[1]] < course_nest[3][[1]]))
}
# data downloaded from "this workspace by OP"
readr::read_csv("/home/roc/Desktop/grist.csv") %>%
mutate(course =
paste0(grading.courseNumber,"_",grading.termName)) %>%
rename(percent = grading.progressPercent) %>%
select(course, percent) -> x
#>
#> ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
#> cols(
#> student.enrollmentID = col_double(),
#> student.gender = col_character(),
#> student.grade = col_double(),
#> grading.courseNumber = col_character(),
#> grading.courseName = col_character(),
#> grading.teacherDisplay = col_character(),
#> grading.termName = col_character(),
#> grading.progressScore = col_character(),
#> grading.progressPercent = col_double()
#> )
# display global outliers
ggplot(x,aes(percent)) +
geom_boxplot() +
coord_flip() +
theme_minimal()

# extract lower outlier threshold of all records, if needed
global_cut_off <- get_cutoff(x$percent)
x %>% nest_by(course) -> course_nest
map(1:nrow(course_nest),get_course_cutoff) %>%
unlist() %>%
add_column(cutoff =., course_nest) -> course_nest
map(1:nrow(course_nest),get_out_no) %>%
unlist() %>%
add_column(n =., course_nest) -> course_nest
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
map(1:nrow(course_nest),get_out_mean) %>%
unlist() %>%
add_column(mu =., course_nest) -> course_nest
course_nest %>% mutate(mu = ifelse(is.nan(mu),NA,mu)) -> y
y %>% print(n = Inf)
#> # A tibble: 23 x 5
#> # Rowwise: course
#> course data cutoff n mu
#> <chr> <list<tbl_df[,1]>> <dbl> <int> <dbl>
#> 1 SST110_S1 [440 Γ 1] 34.9 37 27.2
#> 2 SST111_S2 [431 Γ 1] 37.9 32 27.6
#> 3 SST210_S1 [339 Γ 1] 41.0 24 30.4
#> 4 SST211_S2 [335 Γ 1] 38.0 28 22.1
#> 5 SST306_S2 [13 Γ 1] -49.6 6 NA
#> 6 SST310_S1 [198 Γ 1] 43.7 11 30.8
#> 7 SST311_S2 [204 Γ 1] 43.1 15 26.9
#> 8 SST325_S1 [58 Γ 1] 59.6 0 54.2
#> 9 SST325_S2 [88 Γ 1] 38.4 5 27.9
#> 10 SST345_S1 [62 Γ 1] 42.8 3 38.1
#> 11 SST345_S2 [28 Γ 1] 51.7 0 NA
#> 12 SST408_S1 [62 Γ 1] 61.0 1 48.2
#> 13 SST409_S2 [61 Γ 1] 60.3 2 37.0
#> 14 SST415_S1 [92 Γ 1] 42.4 4 34.2
#> 15 SST415_S2 [162 Γ 1] 48.7 3 21.5
#> 16 SST420_S1 [25 Γ 1] 45.3 1 38.8
#> 17 SST421_S2 [23 Γ 1] 38.7 1 NA
#> 18 SST422_S1 [57 Γ 1] 53.1 3 40.3
#> 19 SST423_S2 [47 Γ 1] 63.6 2 51.2
#> 20 SST425_S1 [38 Γ 1] 63.7 0 NA
#> 21 SST430_S1 [38 Γ 1] 60.9 0 57.4
#> 22 SST431_S2 [39 Γ 1] 60.1 0 58.2
#> 23 SST451_S1 [41 Γ 1] 59.1 0 NA
Created on 2020-12-29 by the reprex package (v0.3.0.9001)