I thought that rsample only stored the training and test indices rather than full copies of the data set - however, memory usage increases perfectly linearly with the number of splits. Am I missing something here?
print(dim(my_data))
print(format(object.size(my_data), units = "MiB"))
splits <- rsample::group_vfold_cv(my_data, v = 5, repeats = 1, group = group)
print(format(object.size(splits), units = "MiB"))
splits <- rsample::group_vfold_cv(my_data, v = 5, repeats = 10, group = group)
print(format(object.size(splits), units = "MiB"))
[1] 54 962
[1] "169.4 MiB"
[1] "847.1 MiB"
[1] "8471.4 MiB"
Ok, so apparently the issue is that I was using object.size instead of lobstr::obj_size to compute the memory footprint. If I understand it correctly, object.size does not account for shared references and therefore overestimates the memory footprint, while lobstr::obj_size reports the actual memory usage:
print(dim(my_data))
print(lobstr::obj_size(my_data))
splits <- rsample::group_vfold_cv(my_data, v = 5, repeats = 1, group = group)
print(lobstr::obj_size(splits))
splits <- rsample::group_vfold_cv(my_data, v = 5, repeats = 1000, group = group)
print(lobstr::obj_size(splits))
[1] 54 962
[1] "693.65 kB"
[1] "701.37 kB"
[1] "5.75 MB"
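The difference between the two functions is easy to reproduce without rsample. A minimal sketch (assuming the lobstr package is installed): a list holding several references to the same data frame costs almost nothing extra, but object.size counts every reference as a full copy, while lobstr::obj_size detects the sharing.

```r
library(lobstr)

df <- data.frame(x = runif(1e5))

# Ten references to the same data frame - R makes no copies here.
refs <- lapply(1:10, function(i) df)

# object.size() sums each element as if it were an independent copy,
# so it reports roughly 10x the size of df.
print(object.size(refs))

# lobstr::obj_size() accounts for shared references, so the list is
# barely larger than df itself.
print(obj_size(refs))

# Copy-on-modify: only once a reference is actually modified does R
# copy it, so the list grows by (roughly) one extra df.
refs[[1]]$x[1] <- 0
print(obj_size(refs))
```

This is presumably what rsample relies on: each split stores a reference to the same underlying data plus a small index vector, so obj_size grows only slowly with the number of splits.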
Also, I was a fool and did not know how R's copy-on-modify semantics work. I'm a changed man now.
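For anyone else who was confused by copy-on-modify: assigning an object to a new name does not copy it; R only copies the memory when one of the names is modified. A quick way to see this (again assuming lobstr is available) is to compare memory addresses with lobstr::obj_addr:

```r
library(lobstr)

x <- c(1, 2, 3)
y <- x                      # no copy yet: x and y point at the same memory
print(obj_addr(x) == obj_addr(y))   # TRUE - same address

y[1] <- 0                   # modifying y triggers the copy
print(obj_addr(x) == obj_addr(y))   # FALSE - y now has its own memory
```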