Hi RStudio Community,
When applying multiple exclusion criteria to a dataset, I often want to report the number of observations remaining after each exclusion, either in text or in a flowchart in the R Markdown document.
However, in my typical data cleaning workflow, I apply all of my exclusion criteria/filters before saving to a new object, which does not allow me to report intermediate numbers (see code chunk `combined-filter`).
My current workaround is to either:
- Create a new dataframe for each filter (see code chunk `stepwise-filter-multiple-df`), or
- Resave into the same dataframe for each filter after storing the row count (see the last filtering code chunk in the sample below)
However, neither looks particularly tidy, and the former could use a lot of memory if the dataframe is large and there are numerous exclusion steps.
How do you tackle reporting step-wise on data exclusions? Any best practices or suggestions are appreciated!
I checked out Emily Riederer's RMarkdown Driven Development (RmdDD) and the documentation for some flowchart packages, e.g. PRISMAstatement, but have yet to find any suggestions.
Sample Rmd
*since I'm not sure how to reprex an Rmd
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(dplyr)
library(DiagrammeR)
filtered_mtcars <-
  mtcars %>%
  filter(hp < 150) %>%
  filter(wt < 3) %>%
  filter(cyl > 4)
The original `mtcars` dataset has `r nrow(mtcars)` observations. We removed `r nrow(mtcars) - nrow(filtered_mtcars)` observations with a horsepower of 150 or more, a weight of 3,000 lbs or more, or 4 or fewer cylinders. The filtered dataset has `r nrow(filtered_mtcars)` observations.
v1 <-
  mtcars %>%
  filter(hp < 150)
v2 <-
  v1 %>%
  filter(wt < 3)
filtered_mtcars <-
  v2 %>%
  filter(cyl > 4)
The original `mtcars` dataset has `r nrow(mtcars)` observations. We removed `r nrow(mtcars) - nrow(v1)` observations with a horsepower of 150 or more, `r nrow(v1) - nrow(v2)` observations with a weight of 3,000 lbs or more, and `r nrow(v2) - nrow(filtered_mtcars)` observations with 4 or fewer cylinders. The filtered dataset has `r nrow(filtered_mtcars)` observations.
filtered_mtcars <-
  mtcars %>%
  filter(hp < 150)
hp_1 <- nrow(filtered_mtcars)
filtered_mtcars <-
  filtered_mtcars %>%
  filter(wt < 3)
wt_2 <- nrow(filtered_mtcars)
filtered_mtcars <-
  filtered_mtcars %>%
  filter(cyl > 4)
The original `mtcars` dataset has `r nrow(mtcars)` observations. We removed `r nrow(mtcars) - hp_1` observations with a horsepower of 150 or more, `r hp_1 - wt_2` observations with a weight of 3,000 lbs or more, and `r wt_2 - nrow(filtered_mtcars)` observations with 4 or fewer cylinders. The filtered dataset has `r nrow(filtered_mtcars)` observations.
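The stepwise bookkeeping above might also be written as a loop over a named list of criteria, so that only one dataframe and one vector of counts are kept. This is only a minimal sketch under that assumption, not an established best practice: the names `criteria`, `n_after`, and `n_removed` are hypothetical, and it relies on `rlang::exprs()` and the `!!` operator to pass the stored expressions to `filter()`.

```r
library(dplyr)

# Hypothetical named list of inclusion criteria (rows that pass are kept)
criteria <- rlang::exprs(
  hp  = hp < 150,
  wt  = wt < 3,
  cyl = cyl > 4
)

# Apply each criterion in turn, recording the row count after each step
filtered_mtcars <- mtcars
n_after <- c(original = nrow(mtcars))
for (step in names(criteria)) {
  filtered_mtcars <- dplyr::filter(filtered_mtcars, !!criteria[[step]])
  n_after[step] <- nrow(filtered_mtcars)
}

# Rows removed at each step: count before the step minus count after it
n_removed <- setNames(head(n_after, -1) - tail(n_after, -1), names(criteria))
```

The counts could then be referenced inline, e.g. `r n_removed[["hp"]]`, or used in the flowchart substitutions in place of `hp_1` and `wt_2`.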
DiagrammeR::grViz("digraph {
graph [layout = dot, rankdir = TB]
node [shape = rectangle]
rec1 [label = 'Original mtcars (n = @@1)']
rec2 [label = 'Horsepower < 150 (n = @@2)']
rec3 [label = 'Weight < 3 (n = @@3)']
rec4 [label = 'Cyl > 4 (n = @@4)']
# edge definitions with the node IDs
rec1 -> rec2 -> rec3 -> rec4
}
[1]: nrow(mtcars)
[2]: hp_1
[3]: wt_2
[4]: nrow(filtered_mtcars)
")