initial_time_split creating wrong split

nealec · November 25, 2022, 11:12am

For some combinations of numbers, initial_time_split() produces a train/test split different from the one I intended. The use of n_train = floor(x) within the function seems to be driving this.

My script was generating prop as '1 - NROW(desired test set)/NROW(full data set)'. Using prop = NROW(desired training set)/NROW(full data set) instead does fix the problem, so this isn't a major issue, but it still caught me out. What is the reason for using floor(x) rather than round(x, digits = 0)?

If you run the code below, NROW(testData) returns 45, not 44.

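# 243 rows in time order: 199 intended for training, 44 for testing
data <- dplyr::as_tibble(c(
  rep("tr", 199),
  rep("te", 44)
))

# Proportion derived from the desired test set size
prop <- 1 - (44/243)

split <- rsample::initial_time_split(data = data, prop = prop)

testData <- rsample::testing(split)

# Returns 45 rather than the intended 44
NROW(testData)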

hannah · December 6, 2022, 1:02pm

initial_time_split() is working as intended, since prop is documented as

"The proportion of data to be retained for modeling/analysis."

or, in your words, "NROW(desired training set)/NROW(full data set)".
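The arithmetic behind the two ways of computing prop can be sketched in base R. This is only an illustration: it assumes the training size is computed as floor(n * prop), as the n_train = floor(x) line mentioned in the original post suggests, and the variable names are made up for the example. The complement-based value lands a hair below 199 in floating point, so floor() takes it down to 198.

```r
# Assumption: the training size is floor(n * prop), following the
# n_train = floor(x) line quoted in the original post.
n <- 243

prop_complement <- 1 - 44 / 243   # derived from the desired test set size
prop_direct     <- 199 / 243      # derived from the desired training set size

# Both products print as 199 at default precision, but the first one
# is stored slightly below 199.
n * prop_complement
n * prop_direct
n * prop_complement < 199         # TRUE

n_train_complement <- floor(n * prop_complement)
n_train_direct     <- floor(n * prop_direct)

n - n_train_complement            # 45 test rows
n - n_train_direct                # 44 test rows
```

Computing prop directly from the training set size avoids the extra subtraction, which is where the rounding error enters in this case.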

The reason we're working with floor() rather than round() is that round() sends values ending in .5 to the nearest even number, which users often find rather unexpected:

round(1:10 + 0.5)
#>  [1]  2  2  4  4  6  6  8  8 10 10

Created on 2022-12-06 with reprex v2.0.2
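For comparison, here is a small sketch that puts round() next to floor() on the same input. The round_half_up() helper is hypothetical, written only for this illustration, and is not part of base R or rsample.

```r
x <- 1:10 + 0.5

# round() sends ties to the even neighbour
round(x)

# floor() always rounds down, so the result never exceeds the input
floor(x)

# Hypothetical helper: the "round half up" behaviour many users expect
round_half_up <- function(x) floor(x + 0.5)
round_half_up(x)
```

floor() is predictable in one direction: the computed training size can never be larger than n * prop.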

nealec · December 6, 2022, 2:05pm

Thank you, and that output from round() is indeed 'unexpected'.
