Synthetic Data Generation

I am working on a simple synthetic data generator to whip up quick datasets I can play with. Is there an alternative to the rsn() function from the sn package that can skew and manipulate values but restrict the values to my minimum and maximum arguments?

This is what I have so far. If the "sig_result" argument is TRUE, it uses rsn(); otherwise, it draws random numbers between the min and max values. I apologize for the general lack of comments:

# Variable Data Generator

##### Chunk 1: Load Required Packages #####

library(random); library(tidyverse); library(moments); library(synthpop);
library(sn)  # provides rsn()

##### Chunk 2: Create the data_generator function #####

data_generator <- function(min_value, max_value, whole_values, dec_places,
                           sig_result, number_of_cases, visualize,
                           seed_number, xi, omega, alpha) {
  # Seed R's RNG so rsn() draws are reproducible (has no effect on random.org)
  set.seed(seed_number)
  if(sig_result == FALSE) {
    # Random integers between min_value and max_value, fetched from random.org
    data_values <- randomNumbers(n = number_of_cases,
                                 min = min_value,
                                 max = max_value,
                                 col = 1,
                                 base = 10)
  } else {
    # Skew-normal draws; these are NOT restricted to min_value/max_value
    data_values <- rsn(number_of_cases, xi, omega, alpha)
  }
  # Round to whole numbers or to dec_places decimal places
  if(whole_values == TRUE) {
    data_values <- round(data_values)
  } else {
    data_values <- round(data_values, digits = dec_places)
  }
  # Generate Histogram w/normal curve plotted
  if(visualize == TRUE) {
    hist(data_values, probability = TRUE,
         main = paste("Histogram of", number_of_cases, "Generated Cases"),
         xlab = "Generated Data Values", ylab = "Density")
    # Calculate mean and standard deviation
    m <- mean(data_values)
    s <- sd(data_values)
    # Add normal curve
    curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "darkblue", lwd = 2)
  }
  print(paste("Skewness:", round(skewness(data_values), digits = 2)))
  print(paste("Kurtosis:", round(kurtosis(data_values), digits = 2)))
  # Return the generated values so the result can be assigned
  return(data_values)
}

scale_total <- data_generator(0, 21, FALSE, 0, TRUE, 10000, TRUE, 1024, 0, 1, 0)

I'd say that by definition, the skew-normal distribution is unbounded: its support is the whole real line, so rsn() can always return values outside whatever range you are defining.

So depending on your goal, you could use a different, bounded, distribution (for example a beta or uniform); or you could simply reject any value outside the min-max range, and get a truncated skew-normal distribution.
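
To make the two options concrete, here is a minimal sketch of both. The helper names rsn_truncated() and rbeta_scaled(), the max_iter safeguard, and the parameter values in the usage lines are made up for illustration; they are not part of the sn package or of the original function. Only rsn() (from sn) and rbeta() (base R) are existing functions.

```r
library(sn)  # provides rsn(), as used in the question

# Hypothetical helper: truncated skew-normal via rejection sampling.
# Draw from rsn(), keep only values inside [min_value, max_value],
# and repeat until n values have been collected.
rsn_truncated <- function(n, min_value, max_value,
                          xi = 0, omega = 1, alpha = 0, max_iter = 1000) {
  stopifnot(min_value < max_value, n >= 0)
  kept <- numeric(0)
  iter <- 0
  while (length(kept) < n) {
    iter <- iter + 1
    if (iter > max_iter) {
      stop("Too few draws fall inside the range; check xi, omega and alpha.")
    }
    # as.numeric() drops the attributes that rsn() attaches to its result
    draws <- as.numeric(rsn(max(n, 100), xi = xi, omega = omega, alpha = alpha))
    kept <- c(kept, draws[draws >= min_value & draws <= max_value])
  }
  kept[seq_len(n)]
}

# Hypothetical helper: Beta draws rescaled from [0, 1] to [min_value, max_value].
# shape1 < shape2 gives right skew, shape1 > shape2 gives left skew.
rbeta_scaled <- function(n, min_value, max_value, shape1, shape2) {
  min_value + (max_value - min_value) * rbeta(n, shape1, shape2)
}

# Illustrative parameter values only
set.seed(1024)
x_trunc <- rsn_truncated(10000, 0, 21, xi = 5, omega = 6, alpha = 4)
x_beta  <- rbeta_scaled(10000, 0, 21, shape1 = 2, shape2 = 5)
range(x_trunc)
range(x_beta)
```

Note that rejection changes the distribution: the truncated sample will not have exactly the location, scale and skewness implied by xi, omega and alpha, and if little of the skew-normal mass lies inside the range the loop needs many passes. If min_value and max_value are whole numbers, rounding the result afterwards keeps it inside the range.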


Thanks, this is helpful.
