Beginner Data Analyst

Here's my general advice.

The Tao of Analysis

Any project benefits from an over-arching mental model—f(x) = y, just as in school algebra. x is what is to hand, y is what is required, and f is what is available to transform x into y. Each of these objects may be, and usually is, composite. x may contain columns and rows of numeric and character values, y may be a table of summary statistics, and f might take the form f(g(x)). In R everything is an object, including functions, and because functions can be arguments to other functions, it is said that in R, functions are "first class" objects.
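A minimal sketch of the f(g(x)) idea, with hypothetical f, g and x (a small character vector standing in for "what is to hand"):

```r
x <- c(" 1", "2 ", "3")                  # what is to hand: messy text
g <- function(v) as.numeric(trimws(v))   # clean: strip whitespace, coerce to numeric
f <- function(v) mean(v)                 # summarize
y <- f(g(x))                             # what is required
y
#> [1] 2
```

Because functions are first-class objects, f and g can themselves be handed to other functions as arguments.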

Notice that this mental model is missing how. In programming, this type of model is called a functional style, and R, as it presents to the user, is a functional programming language. The other principal style is procedural (imperative)—do this, do that, then do this other thing the first way and that other thing the second way, and so on. A functional orientation helps to keep the eyes on the ball and the goal. When it is important to do something very specific, a procedural orientation helps to keep the eyes on the patch of grass beneath it. Always remember that the data is infinitely variable but the tools aren't.
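To illustrate the contrast with a hypothetical task (squaring a vector): the procedural version spells out how, step by step, while the functional version states what and leaves the iteration to the function.

```r
x <- 1:5

# Procedural: say how, step by step
out <- numeric(length(x))
for (i in seq_along(x)) {
  out[i] <- x[i]^2
}

# Functional: say what; the iteration is the function's problem
out2 <- vapply(x, function(v) v^2, numeric(1))

identical(out, out2)
#> [1] TRUE
```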

This type of model also is called analysis. Analysis is hard because it is unnatural. We do not go through our daily lives minutely examining every situation and breaking it down into its smallest pieces. Rather, we are constantly scanning and integrating sources of information in our environment simultaneously. Legal education in the United States is heavily focused on analysis in its first year. Some students have had university experience in chemistry, linguistics or philosophy and have less difficulty. Most, however, lack previous exposure and will go to great lengths to avoid learning analysis. That tendency is countered by the assignment of more court cases to review than can be comfortably read, with the threat of being called upon in lecture, without notice, to present a case. Despite the severe pressure of time and materials, students will spend extra hours reading student guidebooks and meeting with other students to try collectively to find the right answers. It only slowly becomes apparent that while the answer to any particular case turns on its specific facts, the questions remain the same.

In studying data science, the equivalent evasions should be apparent—attempting to "learn" R (you can't; there's too much already, and more pops up every week), or trying to replicate the steps of a program found somewhere and burning hours before finding that your data doesn't support the steps used in the example.

More general advice

Leave presentation until the end. Designing a sound solution to an analytic project takes enough time without interrupting every few minutes to see "how it looks" and then finding that an hour has flown by in tweaking "just this one little thing." Better a plain report at the end than dazzling layout and graphics that lack substance.

Learning by doing is very effective, but also very inefficient. Obstacles will arise at every step. Most error messages are punctuation-related. Errors mentioning "closures" usually indicate that the name of an assigned object, such as `data`, conflicts with a built-in function of the same name. Errors also arise from missing arguments, such as calling f(x) when f(x, y) is needed and y has no default value, and from the wrong type of object given as an argument, such as a date in character-string form when a Date object is required.
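A sketch of all three error types. The messages come from base R; f is a hypothetical function, and try() is used here only so the script keeps running past each error.

```r
# 1. A name collision: in a fresh session `data` is the built-in
#    function utils::data(), and a function can't be subsetted with $
try(data$value)
#> Error in data$value : object of type 'closure' is not subsettable

# 2. A missing argument with no default
f <- function(x, y) x + y
try(f(1))
#> Error in f(1) : argument "y" is missing, with no default

# 3. The wrong type of object: a date in character form where a
#    Date (or date-time) object is required
try(weekdays("2023-01-01"))
#> Error in UseMethod("weekdays") :
#>   no applicable method for 'weekdays' applied to an object of class "character"
```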

Usually, the best way to approach all of this is to identify the specific function that is causing the error and read its help page with help(some_function). I used to complain that help needed its own help and was taken aback to find that it does. Look at the formal argument signature at the top, which lists the function's parameters. In using a function you satisfy the parameters with arguments, matched either by position or by name. Some parameters are shown with a default argument, foo = NULL for example; if a parameter has no default, you must supply an argument for it. Then look at the Arguments section, which discusses each parameter—at least those that are required. You can ignore `...` because it's always optional. Then read the Value section, which describes what the function returns. The return value sometimes comes packaged in an object class that you may need to unpack to extract the parts needed. Run one or two of the examples that look closest to what you are trying to do, save the result to some object, foo, and use str(foo) to examine its structure.
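For instance, using lm() purely as a stand-in for whichever function you are studying:

```r
foo <- lm(mpg ~ wt, data = mtcars)  # adapted from the examples in help(lm)
class(foo)                          # the return comes packaged in a class
#> [1] "lm"
str(foo, max.level = 1)             # unpack it to see the parts available
coef(foo)                           # extractor functions pull out what you need
```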

Tooling tips

Create a new project in RStudio and do all your work in that folder. Run install.packages("here") once and put library(here) in your scripts. That way, if you decide to have a data folder and a script folder, you can refer to their contents with `here("R/script1.R")` from whatever working directory you happen to be in.
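A sketch, assuming a hypothetical layout with data/ and R/ folders at the project root (the file names here are made up):

```r
library(here)                    # after install.packages("here")

here()                           # the project root, found automatically
path <- here("data", "pm25.csv") # a path anchored at the root, not the cwd
# raw <- read.csv(path)          # so this works from any working directory
# source(here("R", "script1.R"))
```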

Zip your project folder and save it to the cloud every day. Or twice a day. Consider GitHub.com as an alternative backup strategy, but don't burn too much time if it proves too challenging.

Pay attention to your namespace in the Environment panel. It's easy to lose track of where objects came from, and one often finds that something relied on in Script3.R was created in Script1.R and won't be found in the latter unless the former has been run. It's good practice to restart the R session after each save.

Look for opportunities to reduce repetition and the proliferation of object names by creating your own functions. The Extract Function option in the RStudio Code menu is great for this. Here's an example:

# find_seconds between two date-times

find_seconds <- function(x, y) {
  begin  <- lubridate::ymd_hms(x)
  finish <- lubridate::ymd_hms(y)
  lapsed <- lubridate::as.duration(
    lubridate::interval(begin, finish, tzone = lubridate::tz(begin))
  )
  just_seconds <- lapsed@.Data  # a Duration is an S4 object; .Data holds the seconds
  return(just_seconds)
}

(begin  <- "2022-12-31 04:07:31")
#> [1] "2022-12-31 04:07:31"
(finish <- "2023-01-01 14:07:56")
#> [1] "2023-01-01 14:07:56"

find_seconds(begin,finish)
#> [1] 122425

Created on 2023-01-02 with reprex v2.0.2. (When asking for specific help here, use a reprex like this.)

There are functions in the apply family that will run an appropriate function over the variables in a data frame.
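For example, sapply() treats a data frame as the list of columns it is and applies a function to each one; d here is a hypothetical two-variable data frame:

```r
d <- data.frame(pm25 = c(10, 12, 9), no2 = c(30, 28, 33))

sapply(d, mean)              # one result per column
#>     pm25      no2 
#> 10.33333 30.33333

vapply(d, max, numeric(1))   # like sapply(), but the return type is declared
#> pm25 no2 
#>   12  33
```

vapply() is the stricter sibling: declaring the expected return type catches mistakes earlier.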

Remember that for small datasets, it is easier to recreate objects from a script or function than to keep track of what they were named and where they were put. When dealing with one data frame at a time, choose a standard identifier, say d. That way the same code snippets that take d as an object won't have to be edited, as they would if you were using my_data_frame_for_particulates and my_data_frame_for_no2. Don't use . in your own names for anything; by convention it should be reserved for functions in {base} and other packages that are loaded automatically.
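A sketch of the payoff, with made-up data frames: the same snippet serves whichever dataset is currently bound to d.

```r
# A reusable snippet: mean of every numeric column of d
summarise_numeric <- function(d) sapply(d[sapply(d, is.numeric)], mean)

d <- data.frame(pm25 = c(10, 12), site = c("a", "b"))
summarise_numeric(d)
#> pm25 
#>   11

d <- data.frame(no2 = c(30, 34), site = c("a", "b"))
summarise_numeric(d)   # identical call, no editing needed
#> no2 
#>  32
```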

Keep variable names short and lower case (to save keystrokes). When it comes time for presentation, those names can be made more descriptive for readers who have not been living with them.
