Coming up with example and dummy data is an important part of a reprex, and it is not covered well in the "FAQ: What's a reproducible example (reprex) and how do I do one?" and "FAQ: Tips for writing R-related questions" guides.
The goal of this topic is to recap a private discussion sustainers had on this, and to draft a new section on creating dummy datasets for reprexes.
Borrowed heavily from:
- Stack Overflow's excellent "How to make a great R reproducible example"
- Best Practices: how to prepare your own data for use in a `reprex` if you can’t, or don’t know how to reproduce a problem with a built-in dataset?
Producing a minimal dataset
Built-in datasets
You can use one of the built-in datasets, which are provided with base R and with many packages.
A comprehensive list of the datasets that ship with base R can be seen with `library(help = "datasets")`. There is a short description of every dataset, and more information can be obtained with, for example, `?mtcars`, where `mtcars` is one of the datasets in the list. Other packages might contain additional datasets, for example ggplot2's `diamonds` dataset.
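As a rough sketch of how you might browse what is available before settling on a dataset (this assumes only base R and its bundled datasets package; the variable name `available` is purely illustrative):

```r
# Names of the datasets bundled with the 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Peek at the first rows of one of them before building a reprex around it
head(mtcars, 3)
```

If a few rows of a built-in dataset are enough to trigger your problem, you do not need to share your own data at all.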
Creating your own vector and data frame
Making a vector is easy. Sometimes it is necessary to add some randomness to it, and there are a number of functions for that. `sample()` can randomize a vector, or give a random vector with only a few values. `letters` is a useful vector containing the alphabet. This can be used for making factors.
A few examples:
- fixed values: `x <- c(1, 2, 3)`
- random values: `x <- rnorm(10)` for a normal distribution, `x <- runif(10)` for a uniform distribution. (Here's a list of all distributions in the stats package)
- a permutation of some values: `x <- sample(1:10)` for the vector 1:10 in random order.
- a random factor: `x <- factor(sample(letters[1:4], 20, replace = TRUE))` (without `factor()`, the result is a character vector)
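Because these vectors are random, anyone running your reprex will get different values unless the seed is fixed first. A minimal sketch, assuming base R only (the seed value 42 is arbitrary, and `x1`/`x2` are illustrative names):

```r
set.seed(42)        # fix the state of the random number generator
x1 <- rnorm(5)

set.seed(42)        # same seed again
x2 <- rnorm(5)

identical(x1, x2)   # both draws contain the same values
```

Putting a single `set.seed()` call at the top of the reprex is usually enough.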
You can make a data frame with the `data.frame()` function. Pay attention to naming the columns in the data frame, and do not make it overly complicated.
An example:
set.seed(1)
data <- data.frame(
  X = sample(1:5),
  Y = sample(c("yes", "no"), 5, replace = TRUE)
)
data
#>   X   Y
#> 1 2  no
#> 2 5  no
#> 3 4  no
#> 4 3  no
#> 5 1 yes
Created by the reprex package (v0.2.0.9000).
For some questions, specific formats can be needed. For these, one can use any of the provided `as.someType` functions: `as.factor`, `as.integer`, `as.numeric`, `as.character`, `as.Date`, `as.xts` (the last one comes from the xts package).
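As an illustration, here is a small sketch of converting hand-typed text columns to the types the real data would have. The data frame `raw` and its column names are made up for this example; only base R is assumed.

```r
# Values typed in as text, as often happens when writing data by hand
raw <- data.frame(
  day   = c("2016-10-27", "2016-10-28"),
  count = c("3", "7"),
  group = c("a", "b"),
  stringsAsFactors = FALSE
)

# Convert each column to the type it should have
raw$day   <- as.Date(raw$day)        # ISO dates are parsed without a format
raw$count <- as.integer(raw$count)
raw$group <- as.factor(raw$group)

str(raw)                             # check the resulting column types
```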
`tibble::tribble()` - Handy if you have the patience to type out some data by hand for your audience in a pretty format. There is a severe limitation in that not all data types can be represented in a `tribble()`.
library(tibble); library(dplyr)
df <- tibble::tribble(
  ~date,              ~id,       ~amount,
  "27/10/2016 21:00", "0001234", "$18.50",
  "28/10/2016 21:05", "0001235", "-$18.50"
) %>%
  # parse the text dates; requires the lubridate package to be installed
  mutate(date = lubridate::parse_date_time(date, orders = c("d!/m!/Y! H!:M!")))
df
#> # A tibble: 2 x 3
#>   date                id      amount
#>   <dttm>              <chr>   <chr>
#> 1 2016-10-27 21:00:00 0001234 $18.50
#> 2 2016-10-28 21:05:00 0001235 -$18.50
`readr::read_csv()` - It’s possible to represent your data, complete with type specification, as a `read_csv()` call. This can be helpful when you want to copy and paste from a CSV file.
The previous example would be:
library(readr)
df <- readr::read_csv(
'date, id, amount
27/10/2016 21:00, 0001234, $18.50
28/10/2016 21:05, 0001235, -$18.50',
  col_types = list(col_datetime(format = "%d/%m/%Y %H:%M"),
                   col_character(), col_character())
)
df
#> # A tibble: 2 x 3
#>   date                id      amount
#>   <dttm>              <chr>   <chr>
#> 1 2016-10-27 21:00:00 0001234 $18.50
#> 2 2016-10-28 21:05:00 0001235 -$18.50
`read.table()` - Worst case scenario, you can give a text representation that can be read in using the `text` parameter of `read.table()`:
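df_txt <- 'date, id, amount
27/10/2016 21:00, 0001234, $18.50
28/10/2016 21:05, 0001235, -$18.50'
# sep = "," splits on commas rather than whitespace; strip.white drops the
# padding spaces; colClasses keeps the leading zeros in id
df <- read.table(text = df_txt, header = TRUE, sep = ",",
                 strip.white = TRUE, colClasses = "character")
df
#>               date      id  amount
#> 1 27/10/2016 21:00 0001234  $18.50
#> 2 28/10/2016 21:05 0001235 -$18.50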
Copy your data
If you have some data that would be too difficult to construct using the tips above, then you can always make a subset of your original data, using e.g. `head()`, `subset()` or the indices. Then use e.g. `dput()` to give us something that can be put in R immediately.
For example, with the built-in dataset `iris`:
dput(head(iris,4))
will produce the output:
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6), Sepal.Width = c(3.5,
3, 3.2, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5), Petal.Width = c(0.2,
0.2, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L), .Label = c("setosa",
"versicolor", "virginica"), class = "factor")), row.names = c(NA,
4L), class = "data.frame")
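To check that a `dput()` result really can be pasted back into R, you can round-trip it yourself before posting. The following is only a sketch using base R; the names `small`, `txt` and `rebuilt` are illustrative, and `eval(parse(...))` stands in for a reader pasting the text at the console.

```r
small <- head(iris, 4)

# Capture the text that dput() would print
txt <- capture.output(dput(small))

# Evaluate that text, as a reader would by pasting it into R
rebuilt <- eval(parse(text = txt))

# Compare the rebuilt data frame with the original subset
all.equal(small, rebuilt)
```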
If your data frame has a factor with many levels, the `dput()` output can be unwieldy, listing all the possible factor levels even if they aren't present in the subset of your data.
To solve this issue, you can use the `droplevels()` function. Notice below how `Species` is a factor with only one level:
> dput(droplevels(head(iris, 4)))
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6), Sepal.Width = c(3.5,
3, 3.2, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5), Petal.Width = c(0.2,
0.2, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L), .Label = "setosa",
class = "factor")), .Names = c("Sepal.Length", "Sepal.Width",
"Petal.Length", "Petal.Width", "Species"), row.names = c(NA,
4L), class = "data.frame")
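A quick way to see the effect without reading the full `dput()` output is to count the factor levels before and after. This is a small sketch with base R only; `small`, `before` and `after` are illustrative names.

```r
small <- head(iris, 4)

# Levels kept from the full dataset, even though only "setosa" is present
before <- nlevels(small$Species)

# Levels remaining once the unused ones are dropped
after <- nlevels(droplevels(small)$Species)

c(before = before, after = after)
#> before  after 
#>      3      1
```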