Any Help With Decision Tree Algorithm/Implementation Performance and Resource Utilization

Steven_ML · October 8, 2019, 9:02pm

Split from RStudio crash with ctree algorithm in party library - #3

The ctree decision tree algorithm from the party library crashed on my personal computer with a 500-record training set. I recorded the memory consumption of ctree while it ran and plotted it as a graph.

The graph makes it fairly clear that ctree consumed a large amount of memory. Many decision tree implementations are available on CRAN, and they all perform a similar function. Is there any information available on their performance and resource utilization, so that we can choose the right implementation based on the nature of the task and the resources we have? Any ideas?

Cheers !!!

Stephen

Steven_ML · October 10, 2019, 2:41am

The R Markdown code is attached below:

### get the data
#library(readr)
#temp = tempfile()
#download.file(url="https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016/downloads/master.csv/1", destfile=temp )
rm(list=ls())
setwd("/cloud/project")
suicideData = read.csv(file="./raw-suicide-data.csv", header=TRUE, as.is=FALSE)
suicideData$gdp_for_year.... = as.numeric(gsub(",", "", suicideData$gdp_for_year....))
suicideData$HDI.for.year[is.na(suicideData$HDI.for.year)] = mean(suicideData$HDI.for.year, na.rm = TRUE)
suicideData$classVar = cut(suicideData$suicides.100k.pop, breaks = c(0, 0.92, 5.99, 16.62, 225), labels = c(0, 1, 2, 3), include.lowest = TRUE)
suicideData$classVar = as.numeric(suicideData$classVar)
suicideData$suicides.100k.pop = NULL  
DTtrainingCount = 500
DTtestCount = 500
set.seed(123)
DTtraining_indices = sample(seq_len(nrow(suicideData)), size=DTtrainingCount)
DTtrainSet = suicideData[DTtraining_indices,]
RemainingSet = suicideData[-DTtraining_indices,]
DTtest_indices = sample (seq_len(nrow(RemainingSet)), size = DTtestCount)
DTtestSet = RemainingSet[DTtest_indices,]
library(party)
#> Loading required package: grid
#> Loading required package: mvtnorm
#> Loading required package: modeltools
#> Loading required package: stats4
#> Loading required package: strucchange
#> Loading required package: zoo
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric
#> Loading required package: sandwich
# this step crashes every time with more than 1,000 records
treeModel = ctree(classVar ~ ., data=DTtrainSet)
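
One thing that may be worth checking before the ctree call, offered as an assumption rather than a confirmed diagnosis: `read.csv` with `as.is=FALSE` turns every text column into a factor, and factors with many levels can make tree fitting expensive. Subsetting rows also keeps all factor levels, so a 500-row sample can still carry the levels of the full data set. The sketch below counts levels per factor column; `count_factor_levels` is a hypothetical helper name, and the toy data frame only stands in for `DTtrainSet`.

```r
# Hypothetical helper: number of levels for each factor column of a data frame.
count_factor_levels <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  vapply(df[is_fac], nlevels, integer(1))
}

# Toy data standing in for DTtrainSet (assumption: the real data has similar text columns).
toy <- data.frame(
  country = factor(c("A", "B", "C", "A")),
  sex     = factor(c("male", "female", "male", "female")),
  year    = c(1990, 1991, 1992, 1993)
)

levels_per_column <- count_factor_levels(toy)
print(sort(levels_per_column, decreasing = TRUE))
```

On the real data the call would be `count_factor_levels(DTtrainSet)`. Columns with hundreds or thousands of levels would be the first candidates to drop or recode before fitting.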

Steven_ML · October 22, 2019, 1:45am

After studying the ctree algorithm further, I found that there appears to be no correlation between memory usage and the number of records. I recorded both the minimum and the mean memory usage, sampled every second, while ctree ran with [10, 20, 30, 40, 50, 60, 70, 80, 90, 100] records and with [200, 300, 400, 500] records. Below are the plots of memory usage:
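
A measurement of this kind could be reproduced from inside R with something like the sketch below. It is not the method used for the plots in this post: it relies on base R's `gc()` counters rather than per-second sampling, and it only sees memory managed by R's own heap. `peak_extra_mb` is a hypothetical helper name, and the matrix allocation is a stand-in workload so the example runs without the party package.

```r
# Hypothetical helper: approximate extra peak R heap (in MB) used while fit_fun() runs.
peak_extra_mb <- function(fit_fun) {
  before <- gc(reset = TRUE)             # collect garbage and reset the "max used" counters
  baseline <- sum(before[, 2])           # column 2: memory currently in use, in MB
  result <- fit_fun()                    # keep the result alive until after the measurement
  after <- gc()
  sum(after[, ncol(after)]) - baseline   # last column: max used since the reset, in MB
}

# Stand-in workload (assumption): allocate an n-by-10 numeric matrix instead of fitting ctree.
sizes <- c(100, 200, 300, 400, 500)
usage <- vapply(
  sizes,
  function(n) peak_extra_mb(function() matrix(rnorm(n * 10), nrow = n)),
  numeric(1)
)

print(data.frame(records = sizes, peak_extra_mb = usage))
```

With party loaded and `DTtrainSet` available, the stand-in workload could be swapped for `function() ctree(classVar ~ ., data = DTtrainSet[seq_len(n), ])` to get one figure per training-set size.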
