Making legit RandomForest results reproducible with set.seed

Ruffybeo · January 8, 2020, 12:36am

I guess that my question is kinda weird but:

I'm working on a university project where I have to use a RandomForest model to predict if patients have depressive tendencies. And while I'm getting results, I'm not sure if they are really valid or legit because of the seed. Here is a code snippet:

for (t in 1:5) {
  set.seed(123)
  seed <- sample.int(100)
    set.seed(seed)  
      seeds <- vector(mode = "list", length = 50)
      for(i in 1:50){
       seeds[[i]] <- sample.int(1000, 12)}
   #For the last model:
      seeds[[50]] <- sample.int(1000, 1)

yourdata_neu$Depressiv <- as.factor(yourdata_neu$Depressiv)

inTraining <- createDataPartition(yourdata_neu$Depressiv[1:nrow(yourdata_neu)], p = 0.70, list = FALSE) # 70% of the subjects in training, 30% in test
training <- yourdata_neu[inTraining,] 
testing <- yourdata_neu[-inTraining,]

train_control <- trainControl(method="cv", number=10, verboseIter = TRUE, seeds = seeds, search = "grid") 
model <- train(training[,1:ncol(yourdata_neu)-1],as.factor(training[,ncol(yourdata_neu)]), method = "rf", type="classification", metric= "Accuracy", maximize= TRUE, trControl = train_control, importance = TRUE) 
model1 <- randomForest(training[,1:ncol(yourdata_neu)-1],as.factor(training[,ncol(yourdata_neu)]), type="classification", importance = TRUE, proximity = TRUE) 
prediction1 <- predict(model1, testing[,1:ncol(yourdata_neu)-1])
prediction2 <- predict(model, testing[,1:ncol(yourdata_neu)-1])
print(confusionMatrix(prediction2, as.factor(testing[,ncol(yourdata_neu)]),  positive = "1"))
}

Basically, I set my seed to 123 at the beginning of my loop. After that, I generate the numbers 1 to 100 in random order and save them in a variable, which becomes my new seed for the whole model. The variable "seeds" is for my trainControl and contains a list of 50 seed entries (entries 1 to 49 have 12 numbers each; the last one gets a single number after the seed loop). I repeat these steps for every iteration of the model.

My accuracy is constant across iterations. But because I have to write a scientific paper about my model, I'm not sure whether I can set my seed the way I did, or whether there is another way to make my results reproducible for academic purposes. I'm grateful for any input.

technocrat · January 8, 2020, 1:05am

Hi, a reproducible example, called a reprex, would help.

My suggestion would be to start simply with the example from help(trainControl):

library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
## Do 5 repeats of 10-Fold CV for the iris data. We will fit
## a KNN model that evaluates 12 values of k and set the seed
## at each iteration.

set.seed(123)
seeds <- vector(mode = "list", length = 51)
for(i in 1:50) seeds[[i]] <- sample.int(1000, 22)

## For the last model:
seeds[[51]] <- sample.int(1000, 1)

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5,
                     seeds = seeds)
ctrl
#> $method
#> [1] "repeatedcv"
#> 
#> $number
#> [1] 10
#> 
#> $repeats
#> [1] 5
#> 
#> $search
#> [1] "grid"
#> 
#> $p
#> [1] 0.75
#> 
#> $initialWindow
#> NULL
#> 
#> $horizon
#> [1] 1
#> 
#> $fixedWindow
#> [1] TRUE
#> 
#> $skip
#> [1] 0
#> 
#> $verboseIter
#> [1] FALSE
#> 
#> $returnData
#> [1] TRUE
#> 
#> $returnResamp
#> [1] "final"
#> 
#> $savePredictions
#> [1] FALSE
#> 
#> $classProbs
#> [1] FALSE
#> 
#> $summaryFunction
#> function (data, lev = NULL, model = NULL) 
#> {
#>     if (is.character(data$obs)) 
#>         data$obs <- factor(data$obs, levels = lev)
#>     postResample(data[, "pred"], data[, "obs"])
#> }
#> <bytecode: 0x7faab3531458>
#> <environment: namespace:caret>
#> 
#> $selectionFunction
#> [1] "best"
#> 
#> $preProcOptions
#> $preProcOptions$thresh
#> [1] 0.95
#> 
#> $preProcOptions$ICAcomp
#> [1] 3
#> 
#> $preProcOptions$k
#> [1] 5
#> 
#> $preProcOptions$freqCut
#> [1] 19
#> 
#> $preProcOptions$uniqueCut
#> [1] 10
#> 
#> $preProcOptions$cutoff
#> [1] 0.9
#> 
#> 
#> $sampling
#> NULL
#> 
#> $index
#> NULL
#> 
#> $indexOut
#> NULL
#> 
#> $indexFinal
#> NULL
#> 
#> $timingSamps
#> [1] 0
#> 
#> $predictionBounds
#> [1] FALSE FALSE
#> 
#> $seeds
#> $seeds[[1]]
#>  [1] 415 463 179 526 195 938 818 118 299 229 244  14 374 665 602 603 768 709  91
#> [20] 953 348 649
#> 
#> $seeds[[2]]
#>  [1] 989 355 840  26 519 426 649 766 211 932 590 593 555 871 373 844 143 544 490
#> [20] 621 775 905
#> 
#> $seeds[[3]]
#>  [1] 937 842  23 923 956 309 135 821 997 224 166 217 290 581  72 588 575 141 722
#> [20] 865 859 153
#> 
#> $seeds[[4]]
#>  [1] 294 277 463  41 431  90 316 223 528 116 606 774 747 456 598 854  39 159 752
#> [20] 209 374 818
#> 
#> $seeds[[5]]
#>  [1]  34 516  13  69 895 755 409 308 278  89 928 537 983 291 424 880 286 908 671
#> [20] 121 110 158
#> 
#> $seeds[[6]]
#>  [1]  64 483 910 477 480 711  67 663 890 847  85 165 648  51  74 178 362 236 610
#> [20] 330 726 127
#> 
#> $seeds[[7]]
#>  [1] 972 212 686 785 958 814 310 931 744 878 243 862 847 792 113 983 619 903 477
#> [20] 975 151 666
#> 
#> $seeds[[8]]
#>  [1] 614 767 160 391 155 426   5 326 784 280 800 789 567 843 932 238 764 339  39
#> [20] 822 137 455
#> 
#> $seeds[[9]]
#>  [1] 738 560 589  83 696 879  39 196 769 680 286 606 500 985 784 344 310 459 944
#> [20]  20 872 195
#> 
#> $seeds[[10]]
#>  [1] 861 164  52 876 534 177 554 827  84 523 633 951 392 302 597 877 706 619 589
#> [20] 430 710 761
#> 
#> $seeds[[11]]
#>  [1] 712 428 672 250 804 429 398 528 983 381 545  40 936 522 473 200 978 125 265
#> [20] 775 903 186
#> 
#> $seeds[[12]]
#>  [1] 573 252 458 152 831  54 919 538 235 289 185 765 413 627 522 309 995 205 875
#> [20] 779 537 564
#> 
#> $seeds[[13]]
#>  [1] 794 391 409 727 346 160 468 509 920  57 457 617 357 279 270 878 646 347 129
#> [20] 218 618 881
#> 
#> $seeds[[14]]
#>  [1] 698 337 797  26 539 981 519 956 757 666 553 724 390 498 222 671 861 657 960
#> [20] 421  57 660
#> 
#> $seeds[[15]]
#>  [1] 163 985 238 673 578 516 330 225 389 117 537 648  55 217 597 557 658 682 415
#> [20] 134 711 957
#> 
#> $seeds[[16]]
#>  [1] 873 688 913 757 941 988 447 821 104 993 831 711 468 210 349 401 737 258 177
#> [20] 386 141  24
#> 
#> $seeds[[17]]
#>  [1] 945 963 466 130 165 703 588 377 781 170 445 710 874 234 422 508 880  64  80
#> [20] 483 548 475
#> 
#> $seeds[[18]]
#>  [1] 291 765 343 323 479 560 450 111 791 963 905 317 807 222 287 734 585 292 226
#> [20] 790 890 684
#> 
#> $seeds[[19]]
#>  [1] 297 860 605 637 811  39 237 165 619  33  83 396 866 277 209  76  94 803  30
#> [20] 217 946 175
#> 
#> $seeds[[20]]
#>  [1] 374 323 115 377 850 608 465 358 682 424 938  96 538 397 404 742 148 980 862
#> [20] 937 392 935
#> 
#> $seeds[[21]]
#>  [1]  714  593  447  338  744  243  106  887   11  625  364  386  403  461  141
#> [16]   31  926  115  790   94 1000   16
#> 
#> $seeds[[22]]
#>  [1] 709 420 178 417 464 412 177 524 437 924 578 562 204 175 947 373 646 996 384
#> [20] 122 399 403
#> 
#> $seeds[[23]]
#>  [1] 315 259 494 865 760 289  48 331 100 108 301  10 170 280 348 402 209 468 827
#> [20] 649 309 395
#> 
#> $seeds[[24]]
#>  [1] 108   8 626 261 541 306 326  74 282 585 267 887 262 736 204 723 219 696 352
#> [20] 667 119 452
#> 
#> $seeds[[25]]
#>  [1] 856 924 579 622 936 646  36  55 490 240 891 632 862 304  10 665 422 612 105
#> [20] 793 388 463
#> 
#> $seeds[[26]]
#>  [1] 180 278 373 241  24 679 559 956 703  37 686 566 303 719 912  19 712 671 378
#> [20] 549 615 244
#> 
#> $seeds[[27]]
#>  [1]  48 188 958 464 393 139 299 371 670 189 970 311 991 418 569 382  38  84 319
#> [20] 686 846 838
#> 
#> $seeds[[28]]
#>  [1] 402 642 120 712 331 533 441 199 499 599  72 315 714 677  81  55 134 424 756
#> [20]   6 128 879
#> 
#> $seeds[[29]]
#>  [1] 668 800  49 739 476 239 340 193 709 459 303 148 898 190 624 191 446 119 627
#> [20] 522 982  59
#> 
#> $seeds[[30]]
#>  [1] 817 903  61 422 108 292 373 535 115 930 600 644 950 413 698 983 763 203 758
#> [20] 246 440 947
#> 
#> $seeds[[31]]
#>  [1] 690 251 560 643 545 990 162 322 576 168 442 788  78 665 493 199 424 445  95
#> [20] 918 464 379
#> 
#> $seeds[[32]]
#>  [1] 342 221 696 161 620 448 242 693 927 814 968 536 828 926 407 229 224 785 474
#> [20] 699 441 171
#> 
#> $seeds[[33]]
#>  [1]  23 218 484 301 648  79 511 507 164 237 579 807 929 422 493 730 796 209 599
#> [20] 693 358 650
#> 
#> $seeds[[34]]
#>  [1] 877 358  41 904 129 848 886 450 232 334 396 730 840 639 998 264 697 201  52
#> [20] 225  67 680
#> 
#> $seeds[[35]]
#>  [1] 770 577 457 903 973 541  20 206 124 592 775 740  45 332 281  91 653 980 138
#> [20] 606 127 425
#> 
#> $seeds[[36]]
#>  [1] 780   8 839 271 595 945 747 167 499 255 599 634 931 902  71 772 970  81 944
#> [20] 776 437 579
#> 
#> $seeds[[37]]
#>  [1] 876 896 437 750 270 412 646 137 673 628  46  64 531 229 610 129 220 692 222
#> [20] 836 507 602
#> 
#> $seeds[[38]]
#>  [1] 122 331 901 502 484 787 291 929 743 709 829 919 169 729 447 561 341  69 320
#> [20] 504  76   2
#> 
#> $seeds[[39]]
#>  [1] 886 786 772 106 111 855 374  72 449 888 971 229 523 719 335 953  56 618 271
#> [20] 207 436 876
#> 
#> $seeds[[40]]
#>  [1] 957 601 292 387 263  68 120 744 565 357 792 742 836 835 523 586 256 349 471
#> [20] 901  88 416
#> 
#> $seeds[[41]]
#>  [1] 857  11 586 463 755 700 287 842 685 827 280 512 803 242 778  64 328 172 298
#> [20] 160 679 903
#> 
#> $seeds[[42]]
#>  [1] 678 529 468 384 929 741 970 365 994 898 591 471 879 227 834 838 622 315 943
#> [20] 243 265 535
#> 
#> $seeds[[43]]
#>  [1] 793 911 982 112 456  93 489 789 631  48 969 482 248 105 171 696 459 516 839
#> [20] 312 562 892
#> 
#> $seeds[[44]]
#>  [1] 139 758 481 843 828 250 742 597 330 633 626 195  99  98  58 424 525   8 258
#> [20] 635 262 599
#> 
#> $seeds[[45]]
#>  [1]  529  928  206  199  589  840  870  459  234 1000  839   55  892  531  753
#> [16]  488  271  942  398  218  155  209
#> 
#> $seeds[[46]]
#>  [1] 545 818 786 660 347 810  74 565 831 329 671 339  55 247 370 530 876  44 860
#> [20] 533 541 376
#> 
#> $seeds[[47]]
#>  [1] 932  84 535 554 709 194 594 460 830 576  77 619 691 658 266 713 355 470 339
#> [20] 200 237 258
#> 
#> $seeds[[48]]
#>  [1] 380 939 893 766 112   5 469 479 612 411 407 899 655 746 615  58 391 290 287
#> [20] 466 530 488
#> 
#> $seeds[[49]]
#>  [1] 814 265 591 913 647 478 904 321   9 867 281 642 907 708  40 148 651  18 950
#> [20] 398 963 348
#> 
#> $seeds[[50]]
#>  [1]  67 878 732 605 383 551 151 591 471 344 168 161 421 444  29 633 241 966 665
#> [20]  42 922  44
#> 
#> $seeds[[51]]
#> [1] 224
#> 
#> 
#> $adaptive
#> $adaptive$min
#> [1] 5
#> 
#> $adaptive$alpha
#> [1] 0.05
#> 
#> $adaptive$method
#> [1] "gls"
#> 
#> $adaptive$complete
#> [1] TRUE
#> 
#> 
#> $trim
#> [1] FALSE
#> 
#> $allowParallel
#> [1] TRUE

<sup>Created on 2020-01-07 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)</sup>

Also, consider whether you need the optional seeds argument:

an optional set of integers that will be used to set the seed at each resampling iteration. This is useful when the models are run in parallel. A value of NA will stop the seed from being set within the worker processes while a value of NULL will set the seeds using a random set of integers. Alternatively, a list can be used. The list should have B+1 elements where B is the number of resamples, unless method is "boot632" in which case B is the number of resamples plus 1. The first B elements of the list should be vectors of integers of length M where M is the number of models being evaluated. The last element of the list only needs to be a single integer (for the final model)
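To make the B+1 / M shape concrete, here is a minimal sketch for plain 10-fold CV (B = 10). `make_seeds` is a hypothetical helper, not part of caret, and M = 3 is only an assumed number of candidate tuning values; check the size of your own tuning grid.

```r
# Hypothetical helper (not part of caret): build a `seeds` list for trainControl().
# B = number of resamples, M = number of candidate models per resample.
make_seeds <- function(B, M, master_seed = 123) {
  set.seed(master_seed)
  seeds <- vector(mode = "list", length = B + 1)
  for (i in seq_len(B)) seeds[[i]] <- sample.int(1000, M)  # M seeds per resample
  seeds[[B + 1]] <- sample.int(1000, 1)                    # single seed for the final model
  seeds
}

# 10-fold CV with an assumed 3 candidate tuning values
seeds <- make_seeds(B = 10, M = 3)
length(seeds)   # number of list elements, B + 1
lengths(seeds)  # M for the first B elements, 1 for the last
```

A list built this way could then be passed as `trainControl(method = "cv", number = 10, seeds = seeds)`.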

Ruffybeo · January 8, 2020, 2:11am

Hi technocrat, thank you so much for your reply!

I'm sorry for forgetting to upload a reproducible example with my question. (I tried to use the reprex package, but I didn't get quite the outcome that I saw in screenshots. I'm sorry, I'm quite a newbie in R)

(Additional information about my dataset: I have 44 observations (rows), 4 features and 1 column for my labels. With this setup, I get a constant accuracy of 80%. And while I would be happy if that's really the case, I'm not sure it's legit. I expected the accuracy to go up or down by at least 5% or so across iterations. That's why I'm quite sceptical about whether my seed setup is OK. But the example from help(trainControl) is kind of similar...)

yourdata_neu <- data.frame(df_test)
#> Error in data.frame(df_test): object 'df_test' not found
rownames(yourdata_neu) <- NULL
#> Error in rownames(yourdata_neu) <- NULL: object 'yourdata_neu' not found

set.seed(123)

###############################Random Forest Round 1
# Train/test split and model training
for (t in 1:5) {
  set.seed(123)
  for(i in 1:50){
    seeds[[i]] <- sample.int(1000, 12)}
    #For the last model:
  seeds[[50]] <- sample.int(1000, 1)
  
  yourdata_neu$Depressiv <- as.factor(yourdata_neu$Depressiv)
  
  inTraining <- createDataPartition(yourdata_neu$Depressiv[1:nrow(yourdata_neu)], p = 0.75, list = FALSE) # 75% of the subjects in training, 25% in test
  training <- yourdata_neu[inTraining,] 
  testing <- yourdata_neu[-inTraining,]
  
  # Cross-validation: 10-fold CV
  train_control <- trainControl(method="cv", number=10, verboseIter = TRUE, seeds = seeds, search = "grid") 
  model <- train(training[,1:ncol(yourdata_neu)-1],as.factor(training[,ncol(yourdata_neu)]), method = "rf", type="classification", metric= "Accuracy", maximize= TRUE, trControl = train_control, importance = TRUE) 
  model1 <- randomForest(training[,1:ncol(yourdata_neu)-1],as.factor(training[,ncol(yourdata_neu)]), type="classification", importance = TRUE, proximity = TRUE) # Does basically the same thing, just with randomForest directly
  prediction1 <- predict(model1, testing[,1:ncol(yourdata_neu)-1])
  prediction2 <- predict(model, testing[,1:ncol(yourdata_neu)-1])
  print(confusionMatrix(prediction2, as.factor(testing[,ncol(yourdata_neu)]),  positive = "1"))
  
  importance    <- importance(model1)
  varImportance <- data.frame(Variables = row.names(importance), 
                              Importance = round(importance[ ,'MeanDecreaseGini'],2))
  
  #Create a rank variable based on importance
  rankImportance <- varImportance %>%
    mutate(Rank = paste0(dense_rank(desc(Importance))))
  if(min(rankImportance$Importance) < 1.0){
    RankImportance_Filter <- rankImportance[rankImportance$Importance == min(rankImportance$Importance),]
    Importance_Table_Filter <- RankImportance_Filter$Variables
    Excluding_Channels <- names(yourdata_neu) %in% Importance_Table_Filter
    yourdata_neu <- yourdata_neu[!Excluding_Channels]
  }
}
#> Error in eval(expr, envir, enclos): object 'seeds' not found

technocrat · January 8, 2020, 3:15am

R has a notoriously steep learning curve, especially for anyone coming from a procedural/imperative programming background of

do this, then do that

R presents itself to the user mainly as school algebra: f(x) = y

Keeping that in mind will help, I hope.

As for reprex, it doesn't always work, at least in my experience. The specific problem with yours is that there is nothing to feed it: no data. The best solution is to find a standard R data set, using data(), that is structured similarly to yours or can be readily transformed.
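As a sketch of what that could look like here: the dataset described in this thread has 44 rows, 4 features and a label column, so a subset of the built-in iris data can act as a stand-in. The name `stand_in` and the 0/1 recoding of the label are assumptions made only for illustration.

```r
# Build a stand-in for the real data: 44 rows, 4 numeric features,
# and a binary factor label in the last column.
data(iris)
set.seed(42)  # arbitrary seed so the same 44 rows are picked on every run

two_class <- iris[iris$Species != "setosa", ]          # keep two of the three species
stand_in  <- two_class[sample(nrow(two_class), 44), ]  # 44 rows, like the real data

# Recode the label as 0/1 and name it like the column used in the thread
stand_in$Depressiv <- factor(as.integer(stand_in$Species == "virginica"),
                             levels = c(0, 1))
stand_in$Species   <- NULL
rownames(stand_in) <- NULL

str(stand_in)
```

With a stand-in like this, the rest of the code can be run and shared without the real patient data.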

Let's think about reproducibility that depends on set.seed().

Every time we run a random process without a seed we get (surprise) a different result, which is why we call set.seed(42) for reproducibility. Which specific seed is used for any given random call doesn't matter.

So, why not simply create a list of random integers?

sample.int(1000, 51)
#>  [1] 508 904 363 343 691 687 493 372 706 186 599 926 741 840 621 603 325 773 358
#> [20] 266  28 268 820  13  76 126 552 861 936  95 869 375 781 767 816 258 509 244
#> [39] 532  31 778 428 777 369 220 248 466 762 571 416 612

<sup>Created on 2020-01-07 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)</sup>

Then as you iterate models (you might look at the recipes package), just pick the next seed.
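A minimal sketch of "pick the next seed", using base R only. `run_once` is a hypothetical stand-in for whatever one model run does (split, fit, evaluate); here it just returns the mean of some random numbers.

```r
# Draw one seed per run, once, from a single master seed
set.seed(123)
run_seeds <- sample.int(1000, 5)

# Hypothetical stand-in for one model run: split, fit, return a metric
run_once <- function(seed) {
  set.seed(seed)
  mean(rnorm(20))
}

# Each iteration picks the next seed from the list
results <- numeric(length(run_seeds))
for (t in seq_along(run_seeds)) {
  results[t] <- run_once(run_seeds[t])
}
results
```

Each run gets its own seed, so the runs differ from each other, but rerunning the whole script gives the same `results` again.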

Does that make sense? Isn't the fact that κ (Kappa) is identical across all model runs evidence that the seeds are identical in every run?
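A quick way to check this without caret, as a sketch: resetting the seed inside the loop makes every iteration draw the same numbers, while setting it once before the loop lets the iterations differ and still keeps the whole run repeatable. The variable names are made up for the illustration.

```r
# Seed reset inside the loop: every iteration starts from the same RNG state
draws_reset <- vector("list", 3)
for (t in 1:3) {
  set.seed(123)
  draws_reset[[t]] <- sample.int(1000, 10)
}
identical(draws_reset[[1]], draws_reset[[2]])  # TRUE

# Seed set once before the loop: the RNG state advances between iterations
set.seed(123)
draws_once <- vector("list", 3)
for (t in 1:3) {
  draws_once[[t]] <- sample.int(1000, 10)
}
identical(draws_once[[1]], draws_once[[2]])    # compare first and second iteration
```

The first pattern is the one in the loop posted above, which would explain a metric that never changes between iterations.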

Ruffybeo · January 10, 2020, 8:09pm

.....finally, the light in my head went on!

And I tried to work in your example and it worked!

Thank you so much for your help

technocrat · January 10, 2020, 8:21pm

Great! Please mark the solution for the benefit of those to follow.

system · January 17, 2020, 8:25pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.