I thought that rsample only stored the training and test indices rather than full copies of the data set - however, memory usage increases perfectly linearly with the number of splits. Am I missing something here?
print(dim(my_data))
print(format(object.size(my_data), units = "MiB"))
splits <- rsample::group_vfold_cv(my_data, v = 5, repeats = 1, group = group)
print(format(object.size(splits), units = "MiB"))
splits <- rsample::group_vfold_cv(my_data, v = 5, repeats = 10, group = group)
print(format(object.size(splits), units = "MiB"))
[1] 54 962
[1] "169.4 MiB"
[1] "847.1 MiB"
[1] "8471.4 MiB"
Ok, so apparently the issue is that I was using object.size instead of lobstr::obj_size to compute the memory footprint. If I understand it correctly, object.size does not account for shared references and therefore overestimates the memory footprint, while lobstr::obj_size reports the actual memory usage:
print(dim(my_data))
print(lobstr::obj_size(my_data))
splits <- rsample::group_vfold_cv(my_data, v = 5, repeats = 1, group = group)
print(lobstr::obj_size(splits))
splits <- rsample::group_vfold_cv(my_data, v = 5, repeats = 1000, group = group)
print(lobstr::obj_size(splits))
[1] 54 962
[1] "693.65 kB"
[1] "701.37 kB"
[1] "5.75 MB"
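The difference between the two functions is easy to reproduce without rsample. A minimal sketch (assuming the lobstr package is installed): a list holding several references to the same data frame costs almost nothing extra, but object.size counts every reference as a full copy, while lobstr::obj_size detects the sharing.

```r
library(lobstr)

df <- data.frame(x = runif(1e5))

# Ten references to the same data frame - R makes no copies here.
refs <- lapply(1:10, function(i) df)

# object.size() sums each element as if it were an independent copy,
# so it reports roughly 10x the size of df.
print(object.size(refs))

# lobstr::obj_size() accounts for shared references, so the list is
# barely larger than df itself.
print(obj_size(refs))

# Copy-on-modify: only once a reference is actually modified does R
# copy it, so the list grows by (roughly) one extra df.
refs[[1]]$x[1] <- 0
print(obj_size(refs))
```

This is presumably what rsample relies on: each split stores a reference to the same underlying data plus a small index vector, so obj_size grows only slowly with the number of splits.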
Also, I was a fool and did not know how R's copy-on-modify semantics work. I'm a changed man now.
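For anyone else who was confused by copy-on-modify: assigning an object to a new name does not copy it; R only copies the memory when one of the names is modified. A quick way to see this (again assuming lobstr is available) is to compare memory addresses with lobstr::obj_addr:

```r
library(lobstr)

x <- c(1, 2, 3)
y <- x                      # no copy yet: x and y point at the same memory
print(obj_addr(x) == obj_addr(y))   # TRUE - same address

y[1] <- 0                   # modifying y triggers the copy
print(obj_addr(x) == obj_addr(y))   # FALSE - y now has its own memory
```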