What could possibly go wrong?
Every R problem can usefully be thought of as the interaction of three objects: an existing object, x; a desired object, y; and a function, f, that will return a value of y given x as an argument. In other words, school algebra: f(x) = y. Any of the objects can be composites.
Usually, we have the benefit that x is populated at the outset, and we can inspect it for properties that must be transformed through one or more fs before we can apply the f (or fs) that will yield y. Here, though, you must work backward from y to anticipate the transformations that will be needed.
This suggests that y is the place to start: select the questions that you wish to put to the data. Under the principle of lazy evaluation, no effort should be spent cleansing data that will not be used. So, identify all of the descriptive or test statistics first.
Next, phony up some data to apply to each of the tests (often you can crib this from the help page examples) and write your own functions to call them, such as
get_basics <- function(x) {
  Mean   <- mean(x, na.rm = TRUE)
  Median <- median(x, na.rm = TRUE)
  return(c(Mean, Median))
}
get_basics(mtcars$mpg)[1] - get_basics(mtcars$mpg)[2]
#> [1] 0.890625
(Not that you'd particularly need this example.)
From there, you'd note that some functions don't take an na.rm argument, and so add to the data-cleaning task list some way of handling NAs.
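cor() is one such function: it has no na.rm argument (it takes a use = argument instead), so NAs must be dealt with up front. A minimal sketch (the vectors and the helper drop_na_pairs are illustrative):

```r
x <- c(1, 2, NA, 4, 5)
y <- c(2, 4, 6, 8, NA)

cor(x, y)                        # NA propagates; result is NA
cor(x, y, use = "complete.obs")  # drops incomplete pairs first

# a generic pre-cleaning step works for any downstream function
drop_na_pairs <- function(x, y) {
  ok <- complete.cases(x, y)
  list(x = x[ok], y = y[ok])
}
clean <- drop_na_pairs(x, y)
cor(clean$x, clean$y)
```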
Note especially that at this point you care about the answers only insofar as the output of f(x) = y is correct in form. Of course, if you slam two random numeric vectors together with sufficiently large n, their correlation is likely to be close to zero most of the time.
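A quick check of that claim (the seed is arbitrary, chosen only for reproducibility):

```r
set.seed(42)                 # arbitrary seed
cor(rnorm(1e4), rnorm(1e4))  # close to zero
```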
With these goals in mind, we can begin the time-consuming part of the exercise, variously called scrubbing, munging, cleaning, rehabilitation, remediation, etc.
First, unless you know otherwise, assume that the otherwise correctly recorded data has passed through a spreadsheet before arriving in R. It will have one or more of the following defects:
- Multiline headers
- Mixing character and numeric types in the same column
- Variables as rows
- Illegal or cumbersome variable names
- Missing values
- Errors in computed values (e.g., division by zero)
- Obvious transcription errors (e.g., a 7-figure salary for a job title other than head football coach)
- Categorical variables that should be dummied
- Unknown
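Several of these can be knocked out mechanically at read time. A sketch against a fabricated spreadsheet export (the column names and values are made up):

```r
raw <- textConnection(
"Survey results for 2023
collected by the front office
respondent id,Annual Salary ($),Dept.
1,52000,Sales
2,n/a,IT
3,61000,Sales")

df <- read.csv(raw, skip = 2)                         # drop the multiline header
names(df) <- c("id", "salary", "dept")                # replace cumbersome names
df$salary <- suppressWarnings(as.numeric(df$salary))  # "n/a" becomes NA
str(df)
```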
Second, decide which of these you care about. For example, don't spend time curating a variable that won't be used to create y.
Third, take a stand on data imputation. Do it or not.
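If you do impute, record the rule as a function so the decision is explicit and repeatable. A sketch using median imputation (the choice of median is illustrative, not a recommendation):

```r
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
impute_median(c(3, NA, 5, 7))  # NA replaced by 5
```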
Fourth, write a workflow to make the transformations you anticipate.
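The workflow can be as simple as one function that composes the cleaning steps, so the same transformations apply to every data set you pull down. A sketch (the particular steps are hypothetical examples, not a prescription):

```r
clean_data <- function(df) {
  names(df) <- make.names(names(df), unique = TRUE)   # legal, unique names
  num_cols <- sapply(df, is.numeric)
  df[num_cols] <- lapply(df[num_cols], function(x) {
    x[is.infinite(x)] <- NA                           # e.g., division by zero -> NA
    x
  })
  df
}
summary(clean_data(mtcars))
```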
Fifth, apply it to some public data repositories.
Sixth (optional), design a report format.
If done well, you should get an honors paper out of it.
Come back with specific questions as they arise.