partition a set of elements into subsets in such a way that the subsets are as similar as possible

Hi there, I'm pretty new to R and I need some advice on how to handle this particular topic.

I have a set of elements like this:

AR AC RG PR

1 1 4 7
2 3 9 12
3 3 9 12
.. .. .. ..

And I need to partition this set in 2 subsets as similar as possible, but I also want to ensure that each set contains an equal number of elements.

example:

nrow(dataTOT)
[1] 1000

table(dataTOT$SPLIT)
0 1
500 500

I also want that the subsets are similar as possible, so:

table(dataTOT$SPLIT, dataTOT$AR)
1 2 3 4 5
0 105 112 150 120 104
1 106 112 149 121 104

This is needed for all the above columns ( AR, AC, RG, PR) a perfect match will not exist, but I need to minimize the difference across the subsets, in the example above a +1/-1 difference is perfectly tolerated.

Thanks for your precious help...

Hi, and welcome.

A couple of things before getting started. First, if applicable, see the homework policy and second, provide a reproducible example, called a reprex, so that a common data frame is available for answers and an illustration of an attempted solution.

The starting point is a class of problem within combinatorics. Let n = nrow(dataTOT). You are looking to create the set of combinations of n/2 objects taken 4 at a time from dataTOT, \binom{n/2}{4}.

For example, if n/2 = 10

m <- combn(10,4)
m
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,]    1    1    1    1    1    1    1    1    1     1     1     1     1     1
#> [2,]    2    2    2    2    2    2    2    2    2     2     2     2     2     2
#> [3,]    3    3    3    3    3    3    3    4    4     4     4     4     4     5
#> [4,]    4    5    6    7    8    9   10    5    6     7     8     9    10     6
#>      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
#> [1,]     1     1     1     1     1     1     1     1     1     1     1     1
#> [2,]     2     2     2     2     2     2     2     2     2     2     2     2
#> [3,]     5     5     5     5     6     6     6     6     7     7     7     8
#> [4,]     7     8     9    10     7     8     9    10     8     9    10     9
#>      [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38]
#> [1,]     1     1     1     1     1     1     1     1     1     1     1     1
#> [2,]     2     2     3     3     3     3     3     3     3     3     3     3
#> [3,]     8     9     4     4     4     4     4     4     5     5     5     5
#> [4,]    10    10     5     6     7     8     9    10     6     7     8     9
#>      [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50]
#> [1,]     1     1     1     1     1     1     1     1     1     1     1     1
#> [2,]     3     3     3     3     3     3     3     3     3     3     3     4
#> [3,]     5     6     6     6     6     7     7     7     8     8     9     5
#> [4,]    10     7     8     9    10     8     9    10     9    10    10     6
#>      [,51] [,52] [,53] [,54] [,55] [,56] [,57] [,58] [,59] [,60] [,61] [,62]
#> [1,]     1     1     1     1     1     1     1     1     1     1     1     1
#> [2,]     4     4     4     4     4     4     4     4     4     4     4     4
#> [3,]     5     5     5     5     6     6     6     6     7     7     7     8
#> [4,]     7     8     9    10     7     8     9    10     8     9    10     9
#>      [,63] [,64] [,65] [,66] [,67] [,68] [,69] [,70] [,71] [,72] [,73] [,74]
#> [1,]     1     1     1     1     1     1     1     1     1     1     1     1
#> [2,]     4     4     5     5     5     5     5     5     5     5     5     5
#> [3,]     8     9     6     6     6     6     7     7     7     8     8     9
#> [4,]    10    10     7     8     9    10     8     9    10     9    10    10
#>      [,75] [,76] [,77] [,78] [,79] [,80] [,81] [,82] [,83] [,84] [,85] [,86]
#> [1,]     1     1     1     1     1     1     1     1     1     1     2     2
#> [2,]     6     6     6     6     6     6     7     7     7     8     3     3
#> [3,]     7     7     7     8     8     9     8     8     9     9     4     4
#> [4,]     8     9    10     9    10    10     9    10    10    10     5     6
#>      [,87] [,88] [,89] [,90] [,91] [,92] [,93] [,94] [,95] [,96] [,97] [,98]
#> [1,]     2     2     2     2     2     2     2     2     2     2     2     2
#> [2,]     3     3     3     3     3     3     3     3     3     3     3     3
#> [3,]     4     4     4     4     5     5     5     5     5     6     6     6
#> [4,]     7     8     9    10     6     7     8     9    10     7     8     9
#>      [,99] [,100] [,101] [,102] [,103] [,104] [,105] [,106] [,107] [,108]
#> [1,]     2      2      2      2      2      2      2      2      2      2
#> [2,]     3      3      3      3      3      3      3      4      4      4
#> [3,]     6      7      7      7      8      8      9      5      5      5
#> [4,]    10      8      9     10      9     10     10      6      7      8
#>      [,109] [,110] [,111] [,112] [,113] [,114] [,115] [,116] [,117] [,118]
#> [1,]      2      2      2      2      2      2      2      2      2      2
#> [2,]      4      4      4      4      4      4      4      4      4      4
#> [3,]      5      5      6      6      6      6      7      7      7      8
#> [4,]      9     10      7      8      9     10      8      9     10      9
#>      [,119] [,120] [,121] [,122] [,123] [,124] [,125] [,126] [,127] [,128]
#> [1,]      2      2      2      2      2      2      2      2      2      2
#> [2,]      4      4      5      5      5      5      5      5      5      5
#> [3,]      8      9      6      6      6      6      7      7      7      8
#> [4,]     10     10      7      8      9     10      8      9     10      9
#>      [,129] [,130] [,131] [,132] [,133] [,134] [,135] [,136] [,137] [,138]
#> [1,]      2      2      2      2      2      2      2      2      2      2
#> [2,]      5      5      6      6      6      6      6      6      7      7
#> [3,]      8      9      7      7      7      8      8      9      8      8
#> [4,]     10     10      8      9     10      9     10     10      9     10
#>      [,139] [,140] [,141] [,142] [,143] [,144] [,145] [,146] [,147] [,148]
#> [1,]      2      2      3      3      3      3      3      3      3      3
#> [2,]      7      8      4      4      4      4      4      4      4      4
#> [3,]      9      9      5      5      5      5      5      6      6      6
#> [4,]     10     10      6      7      8      9     10      7      8      9
#>      [,149] [,150] [,151] [,152] [,153] [,154] [,155] [,156] [,157] [,158]
#> [1,]      3      3      3      3      3      3      3      3      3      3
#> [2,]      4      4      4      4      4      4      4      5      5      5
#> [3,]      6      7      7      7      8      8      9      6      6      6
#> [4,]     10      8      9     10      9     10     10      7      8      9
#>      [,159] [,160] [,161] [,162] [,163] [,164] [,165] [,166] [,167] [,168]
#> [1,]      3      3      3      3      3      3      3      3      3      3
#> [2,]      5      5      5      5      5      5      5      6      6      6
#> [3,]      6      7      7      7      8      8      9      7      7      7
#> [4,]     10      8      9     10      9     10     10      8      9     10
#>      [,169] [,170] [,171] [,172] [,173] [,174] [,175] [,176] [,177] [,178]
#> [1,]      3      3      3      3      3      3      3      4      4      4
#> [2,]      6      6      6      7      7      7      8      5      5      5
#> [3,]      8      8      9      8      8      9      9      6      6      6
#> [4,]      9     10     10      9     10     10     10      7      8      9
#>      [,179] [,180] [,181] [,182] [,183] [,184] [,185] [,186] [,187] [,188]
#> [1,]      4      4      4      4      4      4      4      4      4      4
#> [2,]      5      5      5      5      5      5      5      6      6      6
#> [3,]      6      7      7      7      8      8      9      7      7      7
#> [4,]     10      8      9     10      9     10     10      8      9     10
#>      [,189] [,190] [,191] [,192] [,193] [,194] [,195] [,196] [,197] [,198]
#> [1,]      4      4      4      4      4      4      4      5      5      5
#> [2,]      6      6      6      7      7      7      8      6      6      6
#> [3,]      8      8      9      8      8      9      9      7      7      7
#> [4,]      9     10     10      9     10     10     10      8      9     10
#>      [,199] [,200] [,201] [,202] [,203] [,204] [,205] [,206] [,207] [,208]
#> [1,]      5      5      5      5      5      5      5      6      6      6
#> [2,]      6      6      6      7      7      7      8      7      7      7
#> [3,]      8      8      9      8      8      9      9      8      8      9
#> [4,]      9     10     10      9     10     10     10      9     10     10
#>      [,209] [,210]
#> [1,]      6      7
#> [2,]      8      8
#> [3,]      9      9
#> [4,]     10     10

Created on 2019-12-15 by the reprex package (v0.3.0)

There are 210 possible combinations. Obviously as n increases, the number of combinations increases.

The second part of the problem is best addressed by linear algebra, using the matrix m produced by combn.

Let's take two pairs from m and test for similarity:

m <- combn(10,4)
m[,136] - m[,136] <= 1
#> [1] TRUE TRUE TRUE TRUE
m[,136] - m[,143] <= 1
#> [1]  TRUE FALSE FALSE FALSE

Created on 2019-12-15 by the reprex package (v0.3.0)

1 Like

Thank you for the reply, I will try to work on your considerations,

for our pilot dataframe we will have approximately 1300 rows to partition in those 2 balanced subsets...

I'll stay in touch

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.