Resample a dataset so that certain columns meet specific criteria

billyi · September 13, 2023, 1:40am

I have a dataframe called 'df' defined as follows:

df <- data.frame(A = rnorm(1000, 100, 50), B = rnorm(1000, 20, 10))

I'd like to create a new dataframe, called 'new_df,' by resampling the original 'df.' I want the new dataframe to have its 'A' column with a mean of 80 and a standard deviation of 30.

How can I accomplish this in R?

technocrat · September 13, 2023, 6:23am

set.seed(42)
# Original data frame
d <- data.frame(
  A = rnorm(1000, 100, 50), 
  B = rnorm(1000, 20, 10))

# Function to rescale data
rescale_data <- function(x, new_mean, new_sd) {
  old_mean <- mean(x)
  old_sd <- sd(x)
  (((x - old_mean) / old_sd) * new_sd) + new_mean
}

# Rescale both columns to the new mean and standard deviation
# (a linear transformation of every value, not a resample of rows)
d_resampled <- data.frame(
  A = rescale_data(d$A, new_mean = 80, new_sd = 30),
  B = rescale_data(d$B, new_mean = 80, new_sd = 30)
)

# Check the mean and standard deviation of the resampled data
mean(d_resampled$A) # Should be close to 80
#> [1] 80
sd(d_resampled$A)   # Should be close to 30
#> [1] 30
mean(d_resampled$B) # Should be close to 80
#> [1] 80
sd(d_resampled$B)   # Should be close to 30
#> [1] 30

Created on 2023-09-12 with reprex v2.0.2

billyi · September 13, 2023, 7:44am

Thank you for your prompt reply, technocrat.

What I'm looking for is not a new dataset with rescaled columns, but rather a resampled version of the original dataset where at least one column meets certain criteria, such as mean and standard deviation.

technocrat · September 13, 2023, 8:06am

How would you express that mathematically?

nirgrahamuk · September 13, 2023, 9:45am

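library(dplyr)  # provides slice_sample()

set.seed(42)
# Original data frame
d <- data.frame(
  A = rnorm(1000, 100, 50), 
  B = rnorm(1000, 20, 10))

# Squared distance of column A's mean and sd from the targets (80 and 30);
# lower is better
fitness <- function(x){
  (mean(x$A)-80)^2+(sd(x$A)-30)^2
}

# Draw two candidate subsets, each dropping one random row,
# and keep whichever is closer to the targets
split_choose <- function(x){
  nr <- nrow(x) -1
  ch1 <- slice_sample(x, n = nr)
  ch2 <- slice_sample(x, n = nr)
  if(fitness(ch1)<fitness(ch2)){
    cat("\npick1")
    return(ch1)
  } 
  cat("\npick2")
  ch2
}

# Shrink the data one row at a time until 100 rows remain
new_sample <- split_choose(d)
while (nrow(new_sample)>100) {
  new_sample <- split_choose(new_sample)
}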
dim(d)
mean(d$A)
sd(d$A)
dim(new_sample)
mean(new_sample$A)
sd(new_sample$A)
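
An alternative to the greedy row-dropping above is weighted resampling with replacement (sometimes called sampling importance resampling). The sketch below is only an illustration under stated assumptions: it treats both the original and the desired distribution of A as normal, and the names w, idx and new_df are made up for the example. Each row's weight is the ratio of the target density to the estimated source density, so rows whose A value is more plausible under mean 80 and sd 30 are drawn more often. The resulting mean and sd will land near the targets but not exactly on them.

```r
set.seed(42)
# Original data frame, as in the posts above
d <- data.frame(
  A = rnorm(1000, 100, 50),
  B = rnorm(1000, 20, 10))

# Importance weights: target density divided by an estimate of the source
# density. Assumption: both are treated as normal distributions.
w <- dnorm(d$A, mean = 80, sd = 30) /
  dnorm(d$A, mean = mean(d$A), sd = sd(d$A))

# Draw whole rows with replacement, so each A value stays paired with its B.
# sample() normalises the weights passed via prob.
idx <- sample(nrow(d), size = nrow(d), replace = TRUE, prob = w)
new_df <- d[idx, ]

dim(new_df)
mean(new_df$A)
sd(new_df$A)
```

Because rows are drawn with replacement, new_df contains duplicated rows, and the same caution about not using this for research applies.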

nirgrahamuk · September 13, 2023, 11:21am

Just please don't use this for any academic science research; I'd hate to be involved in a bad-practices scandal.

jrkrideau · September 13, 2023, 3:31pm

Retraction Watch has been notified.
