Numbers Not Changing in R

omario · October 16, 2021, 4:15am

I am working with the R programming language. I have a dataset which contains a person's height and whether or not they play basketball.

I want to see if on average, people over the 80th percentile (height wise) play basketball.

To do this, I:

1. Randomly split the data into a 70% group (train) and a 30% group (test).
2. Calculate the 80th percentile of height in the train group, and use it as a cutoff to predict who in the test group plays basketball.
3. Calculate how accurate those predictions were on the test group.
4. Repeat this procedure many times (e.g. 100) and calculate the overall average.

Here is the R code that generates the data for this example:

set.seed(123)

height <- rnorm(1000,210,5)
status <- c("basketball", "not_basketball")
basketball_status <- as.character(sample(status, 1000, replace=TRUE, prob=c(0.80, 0.20)))
data_1 <- data.frame(height, basketball_status)

height <- rnorm(1000,190,1)
status <- c("basketball", "not_basketball")
basketball_status <- as.character(sample(status, 1000, replace=TRUE, prob=c(0.8, 0.2)))
data_2 <- data.frame(height, basketball_status)


height <- rnorm(1000,170,5)
status <- c("basketball", "not_basketball")
basketball_status <- as.character(sample(status, 1000, replace=TRUE, prob=c(0.20, 0.80)))
data_3 <- data.frame(height, basketball_status)


my_data <- rbind(data_1, data_2, data_3)

And here is the iterative process:

library(dplyr)

results <- list()
for (i in 1:100) {

  train_i<-sample_frac(my_data, 0.7)

  sid<-as.numeric(rownames(train_i))

  test_i<-my_data[-sid,]
 
  quantiles_i = data.frame( train_i %>% summarise (quant_1 = quantile(height, 0.80)))
 
 
  test_i$basketball_pred = as.character(ifelse(test_i$height > quantiles_i$quant_1 , "basketball",   "not_basketball" ))
 
  test_i$accuracy = ifelse(test_i$basketball_pred == test_i$basketball_status, 1, 0)
 
  
 
  results_tmp = data.frame(test_i %>%
                          
                           dplyr::summarize(Mean = mean(accuracy, na.rm=TRUE)))
 
  results_tmp$iteration = i
 
  results_tmp$total_mean = mean(test_i$accuracy)
  results[[i]] <- results_tmp
}

results

results_df <- do.call(rbind.data.frame, results)

But when I run the iterative process, all averages appear the same:

head(results_df)
       Mean iteration total_mean
1 0.8344444         1  0.8344444
2 0.8344444         2  0.8344444
3 0.8344444         3  0.8344444
4 0.8344444         4  0.8344444
5 0.8344444         5  0.8344444
6 0.8344444         6  0.8344444

Question: Does anyone know why this is happening?

Thanks

nirgrahamuk · October 16, 2021, 9:21am

I think this code doesn't do what you believe it does.
You sample rows from my_data and then build sid from the row names of the result. This means sid is always the same: whatever rows were sampled, the row names of the result will always be 1, 2, 3, 4, 5, and so on, so my_data[-sid,] drops the same rows and leaves the same test set on every iteration.
There are many alternatives you could code up. The one closest to your current approach would be to make the row IDs of my_data concrete by storing them in an actual column, so that they come along through sample_frac and can be pulled out of the result to determine sid.
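
A minimal sketch of that suggestion, not code from the original reply. It assumes a column named row_id (a hypothetical name) and uses a small toy stand-in for my_data with 20 iterations so that it runs quickly. The test set is taken as every row whose row_id was not drawn into the training sample.

```r
library(dplyr)

set.seed(123)

# Toy stand-in for my_data (assumption: smaller than the 3000-row original)
status <- c("basketball", "not_basketball")
my_data <- data.frame(
  height = c(rnorm(100, 210, 5), rnorm(100, 170, 5)),
  basketball_status = c(
    sample(status, 100, replace = TRUE, prob = c(0.8, 0.2)),
    sample(status, 100, replace = TRUE, prob = c(0.2, 0.8))
  )
)

# Store the row position in a real column so it survives sample_frac()
# (row_id is a hypothetical column name)
my_data$row_id <- seq_len(nrow(my_data))

results <- list()
for (i in 1:20) {
  # 70% training sample; row_id travels with the sampled rows
  train_i <- sample_frac(my_data, 0.7)

  # Test set: rows whose row_id is not in the training sample
  test_i <- my_data[!(my_data$row_id %in% train_i$row_id), ]

  # 80th percentile of height in the training sample, used as the cutoff
  cutoff_i <- quantile(train_i$height, 0.80)

  pred_i <- ifelse(test_i$height > cutoff_i, "basketball", "not_basketball")

  results[[i]] <- data.frame(
    iteration = i,
    Mean = mean(pred_i == test_i$basketball_status)
  )
}

results_df <- do.call(rbind.data.frame, results)
head(results_df)
```

Because the test rows now change with each random split, the Mean column should vary from one iteration to the next rather than repeating a single value.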
