About a year ago I started using R. Like many others, for lack of a better term, I am "self-taught". Much like a toddler in a spaceship, I thoroughly enjoy the ride, but have little idea what those shiny buttons and functions do. "Might as well start chewing on it, see what happens".
Naturally, the new information can at times be a bit overwhelming and confusing. For new R users, the solution is often to just look it up online. I would describe being "self-taught" as habitually and haphazardly outsourcing sensemaking to some entity across spacetime to equip oneself with the tools to deal with imminent disasters. In the first month of learning R, I would click on the first resource that popped up. Gradually I found more and more resources and, through experience, built a preference for certain ones.
For all intents and purposes, being "self-taught" is risky. Instead of chewing buttons, a toddler might find that other toddlers recommend headbutting them, which leads to better results and is more fun! This, however, does not mean the action itself is advisable or, indeed, recommended. Personally, for learning purposes, I defaulted to YouTube, StackOverflow and the occasional tutorial website during the first few months. It was not until much later that I encountered cheatsheets, R books, fora and webinars. To illustrate why this is an issue, consider the following (exaggerated) example where I want to create a vector with numeric value 1. Can you spot how many improvements can be made?
x = ((as.data.frame(1,"2")[[1]]) * 1L)
This is not a particularly aesthetically pleasing, or indeed practical, approach, yet the result is valid. In practice, the excess brackets, inefficient data type conversions, unnecessary function calls, mixed extractor operators et cetera are a source of problems when optimizing code or communicating results.
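To make those problems concrete, here is a rough sketch that unpacks the one-liner step by step. The intermediate names `df` and `v` are my own, introduced only for illustration:

```r
# Step-by-step unpacking of the convoluted one-liner (illustrative only).
df <- as.data.frame(1, "2")  # builds a 1x1 data frame; "2" is only used as the row name
v  <- df[[1]]                # [[ ]] extracts the single column as a plain double vector
x  <- v * 1L                 # multiplying by integer 1 changes nothing; the result stays double

typeof(x)        # "double"
identical(x, 1)  # TRUE
```

Every step either creates a structure that is thrown away immediately or performs a conversion that has no effect on the final value.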
x <- 1
To some people, the better approach is obvious. However, what if I asked a question where opinions are more divided? Should I use a for-loop, the apply family, or map? In an R script, or a C++ file? One finds different answers depending on the resource one defaults to, and the recommendations contradict each other. In this example, some say for-loops are bad and some say they are not. Apply is nice but not nice. Map is useful, and not useful. The result: learning three different approaches to arrive at one result.
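As a small illustration of how these approaches can converge on the same answer, here is a sketch that takes square roots of a toy vector. The data and variable names are made up for the example, I am assuming "map" refers to the purrr package (so that part only runs if purrr is installed), and the vectorised call at the end is an extra I added for comparison:

```r
values <- c(1, 4, 9, 16)

# 1. for-loop with a pre-allocated result vector
roots_loop <- numeric(length(values))
for (i in seq_along(values)) {
  roots_loop[i] <- sqrt(values[i])
}

# 2. apply family: vapply() declares the expected type of each result
roots_apply <- vapply(values, sqrt, numeric(1))

# 3. map: purrr::map_dbl(), skipped when purrr is not available
if (requireNamespace("purrr", quietly = TRUE)) {
  roots_map <- purrr::map_dbl(values, sqrt)
}

# Extra: sqrt() is already vectorised, so no explicit iteration is needed
roots_vec <- sqrt(values)

identical(roots_loop, roots_apply)  # TRUE
identical(roots_loop, roots_vec)    # TRUE
```

None of these is wrong here, which is exactly why a newcomer has a hard time telling which recommendation to follow.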
Toddlers unite! Let's chew on some of these practices! Here's a challenge:
Using fewer than 50 characters and 5 lines of code, write something that violates as many aesthetic and performance practices as possible, yet yields object x with numeric value 1.
The goal of the above challenge is to share what I think is an important thought process to improve the experience for "self-taught" R users. There exists an opportunity to implement a source of information where users can safely outsource sensemaking on aesthetic and performance choices. Note that the intent is not to draft a definitive selection of tools or packages for a given task. Rather, the intent is to provide an up-to-date and updatable trustworthy entry point to assist in the purpose-driven information search. My opinion is that a cheatsheet is an appropriate channel to deliver such information to R users, assuming that consensus on topics is possible.
A cheatsheet has two goals: first, to help users find essential information quickly, and second, to prevent confusion while doing the above -- Garrett Grolemund in Guide to write cheatsheets
Aesthetic programming practices cheatsheet
Good - R - code - practices - and - graphics - are - important. But should R users have to browse through every resource to find out how to achieve them? Some candidate topics for the sheet:
- General code formatting
  Commenting practices, style guide selection
- Visualization
  Themes/colors, scales, graph choices
- File format
  When to use markdown? shiny? R script? R presentation?
Performance programming cheatsheet
There - are - obviously - many - resources - dedicated - to - performance. Wouldn't it be nice if there were an accessible repository of information sources available to R users? Some candidate topics:
- Common bottlenecks
  What are common vectorized alternatives in R? When to consider C/C++?
- Data loading practices
  vroom, data.table, readr, open connection or something else? Which file formats work best?
- Memory management
  Which data types to use? What resources for larger-than-memory data handling are there?
- Benchmarks whens and hows
  bench, microbenchmark, rbenchmark, tictoc or something else?
- Profiler selection
  xrprof, Rprof, profvis, profmem, lineprof or something else?
- Parallelisation practices
  foreach, parallel, how to identify candidates for parallelisation?
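To give a flavour of what a "common bottlenecks" entry might look like, here is a minimal sketch comparing a vector grown inside a loop with a vectorised alternative. The function names `slow_squares` and `fast_squares` are hypothetical, and I use base R's `system.time()` only to avoid extra dependencies; one of the dedicated benchmarking packages listed above would likely give more reliable measurements:

```r
n <- 1e4

# Bottleneck: the result vector is copied and extended on every iteration
slow_squares <- function(n) {
  out <- c()
  for (i in seq_len(n)) {
    out <- c(out, i^2)
  }
  out
}

# Vectorised alternative: a single call on the whole sequence
fast_squares <- function(n) {
  seq_len(n)^2
}

# Time both versions; the results are stored for comparison
timing_loop <- system.time(res_loop <- slow_squares(n))
timing_vec  <- system.time(res_vec <- fast_squares(n))

isTRUE(all.equal(res_loop, res_vec))  # TRUE: same values either way
print(timing_loop["elapsed"])
print(timing_vec["elapsed"])
```

The exact timings depend on the machine, so the sketch prints them rather than promising a particular speed-up.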
Obviously, such a cheatsheet would require the input of many R users, and opinions fundamentally differ on certain topics. One such topic is the need for cheatsheets in the first place. Please fill in the poll!
- Performance best practice cheatsheet
- Aesthetics best practice cheatsheet
- None of these