Must subset columns with a valid subscript vector. Can't convert from <double> to <integer> due to loss of precision

joselugo · December 29, 2021, 11:14pm

I am trying to run the decision tree C5.0 model with the following dataset:

DT5_Example
id A B C D E PF

1 1 0.0045 0.765 0.0072 0.938 0.809 1
2 2 0.0022 1 0.0076 0.938 1 1
3 3 0.0030 1 0.0010 0.946 1 1
4 4 0.0054 1 0.0045 0.844 1 0
5 5 0.0046 1 0.0041 0.856 1 1
6 6 0.0048 1 0.0051 0.846 1 0
7 7 0.0038 1 0.0005 0.617 0.987 1
8 8 0.0275 1 0.0103 0.954 1 1
9 9 0.0017 1 0.0129 0.917 1 1
10 10 0.0139 1 0.0059 0.983 1 1

Below is my script:
A<-DT5_Example$A
B<-DT5_Example$B
C<-DT5_Example$C
D<-DT5_Example$D
E<-DT5_Example$E

vars<-c(A, B, C, D, E)

# Converting PF into a factor because it is the outcome variable

DT5_Example2<-DT5_Example %>%
mutate(PFcat=factor(PF, levels = c(0,1))) %>% collect()

# Fitting the C5.0 model to the data

install.packages("C50")
library(C50)
DT5_model<-C5.0(x=DT5_Example2[, vars], y = DT5_Example2$PFcat)
summary(DT5_model)

I received the following error message:
Error: Must subset columns with a valid subscript vector.
x Can't convert from <double> to <integer> due to loss of precision.

If you run the model with PF as an integer variable, you still receive the same message.

I have already googled this error and read related topics in the RStudio Community, but I have not been able to fix it. Any help will be appreciated. Thanks.

technocrat · December 30, 2021, 2:14am

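The error comes from the subsetting step, DT5_Example2[, vars], which calls for column indices. The indices must be whole-number positions (or column names). But vars contains the data values themselves: 50 doubles such as 0.0045, which cannot be converted to integer positions.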

library(C50)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
DT5_Example <- data.frame(A = c(
  0.0045, 0.0022, 0.003, 0.0054, 0.0046, 0.0048,
  0.0038, 0.0275, 0.0017, 0.0139
), B = c(
  0.765, 1, 1, 1, 1, 1,
  1, 1, 1, 1
), C = c(
  0.0072, 0.0076, 0.001, 0.0045, 0.0041, 0.0051,
  5e-04, 0.0103, 0.0129, 0.0059
), D = c(
  0.938, 0.938, 0.946, 0.844,
  0.856, 0.846, 0.617, 0.954, 0.917, 0.983
), E = c(
  0.809, 1, 1,
  1, 1, 1, 0.987, 1, 1, 1
), PF = c(1, 1, 1, 0, 1, 0, 1, 1, 1, 1))

A <- DT5_Example$A
B <- DT5_Example$B
C <- DT5_Example$C
D <- DT5_Example$D
E <- DT5_Example$E

vars <- c(A, B, C, D, E)

vars
#>  [1] 0.0045 0.0022 0.0030 0.0054 0.0046 0.0048 0.0038 0.0275 0.0017 0.0139
#> [11] 0.7650 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
#> [21] 0.0072 0.0076 0.0010 0.0045 0.0041 0.0051 0.0005 0.0103 0.0129 0.0059
#> [31] 0.9380 0.9380 0.9460 0.8440 0.8560 0.8460 0.6170 0.9540 0.9170 0.9830
#> [41] 0.8090 1.0000 1.0000 1.0000 1.0000 1.0000 0.9870 1.0000 1.0000 1.0000

DT5_Example2<-DT5_Example %>%
  dplyr::mutate(PFcat=factor(PF, levels = c(0,1))) %>% dplyr::collect()

# give required columns explicitly
DT5_model<-C5.0(x=DT5_Example2[, 1:5], y = DT5_Example2$PFcat)
summary(DT5_model)
#> 
#> Call:
#> C5.0.default(x = DT5_Example2[, 1:5], y = DT5_Example2$PFcat)
#> 
#> 
#> C5.0 [Release 2.07 GPL Edition]      Wed Dec 29 18:15:53 2021
#> -------------------------------
#> 
#> Class specified by attribute `outcome'
#> 
#> Read 10 cases (6 attributes) from undefined.data
#> 
#> Decision tree:
#> 
#> D <= 0.846: 0 (3/1)
#> D > 0.846: 1 (7)
#> 
#> 
#> Evaluation on training data (10 cases):
#> 
#>      Decision Tree   
#>    ----------------  
#>    Size      Errors  
#> 
#>       2    1(10.0%)   <<
#> 
#> 
#>     (a)   (b)    <-classified as
#>    ----  ----
#>       2          (a): class 0
#>       1     7    (b): class 1
#> 
#> 
#>  Attribute usage:
#> 
#>  100.00% D
#> 
#> 
#> Time: 0.0 secs
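
If the intent behind vars was to list the predictors, one alternative to the positional index 1:5 is to store the column names as strings rather than the column values. The following is a minimal sketch in base R using the data from this thread; the object names predictors and outcome are introduced here for illustration only, and the C5.0 call is left as a comment because it needs the C50 package.

```r
# Rebuild the example data from the thread (base R only).
DT5_Example <- data.frame(
  A = c(0.0045, 0.0022, 0.003, 0.0054, 0.0046, 0.0048, 0.0038, 0.0275, 0.0017, 0.0139),
  B = c(0.765, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  C = c(0.0072, 0.0076, 0.001, 0.0045, 0.0041, 0.0051, 5e-04, 0.0103, 0.0129, 0.0059),
  D = c(0.938, 0.938, 0.946, 0.844, 0.856, 0.846, 0.617, 0.954, 0.917, 0.983),
  E = c(0.809, 1, 1, 1, 1, 1, 0.987, 1, 1, 1),
  PF = c(1, 1, 1, 0, 1, 0, 1, 1, 1, 1)
)

# Store the predictor *names* (character strings), not the column values.
vars <- c("A", "B", "C", "D", "E")

# Subsetting by name selects the five predictor columns.
predictors <- DT5_Example[, vars]

# The outcome is converted to a factor, as in the original script.
outcome <- factor(DT5_Example$PF, levels = c(0, 1))

str(predictors)

# With the C50 package loaded, the model call would then read:
# DT5_model <- C5.0(x = predictors, y = outcome)
```

Selecting by name keeps working if the column order changes, whereas 1:5 depends on the predictors being the first five columns.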

joselugo · December 30, 2021, 10:22pm

Thank you so much for your help! It worked on my end with the real data set. I still have the following question. I understand that the subset of predictors [, 1:5] must be integers. However, the script that you used to fix the issue did not include any transformation from double to integer. The five predictors in DT5_Example2 are doubles. Therefore, "dplyr::mutate(PFcat=factor(PF, levels = c(0,1))) %>% dplyr::collect()" was the solution to this problem. Am I correct in my interpretation?

technocrat · December 30, 2021, 10:36pm

DT5_Example2[, 1:5] works to subset DT5_Example2 so long as there are at least 5 variables (columns). It does not matter what type the columns are: integer, double, character, logical, or a mix. They just have to be referred to with an integer index.
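
To illustrate the point, here is a small sketch with a made-up data frame (the object mixed and its columns are hypothetical, not from the thread). The index picks columns by position or by name, and the types of the selected columns are left untouched.

```r
# Toy data frame with mixed column types (hypothetical data).
mixed <- data.frame(
  n = c(1.5, 2.5, 3.5),        # double
  i = 1:3,                     # integer
  s = c("a", "b", "c"),        # character
  l = c(TRUE, FALSE, TRUE),    # logical
  stringsAsFactors = FALSE
)

# Select the first three columns by integer position.
by_position <- mixed[, 1:3]

# Select the same columns by name.
by_name <- mixed[, c("n", "i", "s")]

# Both approaches return the same columns.
identical(by_position, by_name)

# The column types are unchanged by subsetting.
sapply(by_position, class)
```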

joselugo · December 31, 2021, 1:40pm

Technocrat, thank you very much for your explanation! The solution of the problem and your explanation have been very much appreciated.
