Calculate missing values based on total and sum of available values in groups

I have a dataset in the following form

library(data.table)

dt <- data.table("category" = c("G", "G", "G", "G", "G", "G", "G", "G45", "G45", 
                                "G45", "G45", "G45", "G45", "G45", "G46", "G46", 
                                "G46", "G46", "G46", "G46", "G46", "G47", "G47", 
                                "G47", "G47", "G47", "G47", "G47"),
                  "indicator" = c("V11110", "V11110", "V11110", "V11110", 
                                  "V11110", "V11110", "V11110", "V11110", 
                                  "V11110", "V11110", "V11110", "V11110", 
                                  "V11110", "V11110", "V11110", "V11110", 
                                  "V11110", "V11110", "V11110", "V11110", 
                                  "V11110", "V11110", "V11110", "V11110", 
                                  "V11110", "V11110", "V11110", "V11110"),
                  "size_class" = c("sc_1", "sc_3", "sc_2", "sc_4", "sc_5", 
                                   "sc_6", "sc_TOTAL", "sc_1", "sc_3", "sc_2", 
                                   "sc_4", "sc_5", "sc_6", "sc_TOTAL", "sc_1", 
                                   "sc_3", "sc_2", "sc_4", "sc_5", "sc_6", 
                                   "sc_TOTAL", "sc_1", "sc_3", "sc_2", "sc_4", 
                                   "sc_5", "sc_6", "sc_TOTAL"),
                  "geo" = c("IE", "IE", "IE", "IE", "IE", "IE", "IE", "IE", 
                            "IE", "IE", "IE", "IE", "IE", "IE", "IE", "IE", 
                            "IE", "IE", "IE", "IE", "IE", "IE", "IE", "IE", 
                            "IE", "IE", "IE", "IE"),
                  "year" = c(2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 
                             2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 
                             2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 
                             2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L),
                  "values" = c(19996, 3890, 19476, 2035, NA, 104, 46234, 4077, 
                               456, 2851, 280, NA, NA, 7736, 5864, 1159, 4935, 
                               715, NA, NA, 12991, NA, 2275, 11690, 1040, 378, 
                               69, NA),
                  "grp_id" = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
                               2L, 2L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 16L, 16L, 
                               16L, 16L, 16L, 16L, 16L))

Which gives us the following table:

    category indicator size_class geo year values grp_id
 1:        G    V11110       sc_1  IE 2008  19996      1
 2:        G    V11110       sc_3  IE 2008   3890      1
 3:        G    V11110       sc_2  IE 2008  19476      1
 4:        G    V11110       sc_4  IE 2008   2035      1
 5:        G    V11110       sc_5  IE 2008     NA      1
 6:        G    V11110       sc_6  IE 2008    104      1
 7:        G    V11110   sc_TOTAL  IE 2008  46234      1
 8:      G45    V11110       sc_1  IE 2008   4077      2
 9:      G45    V11110       sc_3  IE 2008    456      2
10:      G45    V11110       sc_2  IE 2008   2851      2
11:      G45    V11110       sc_4  IE 2008    280      2
12:      G45    V11110       sc_5  IE 2008     NA      2
13:      G45    V11110       sc_6  IE 2008     NA      2
14:      G45    V11110   sc_TOTAL  IE 2008   7736      2
15:      G46    V11110       sc_1  IE 2008   5864      7
16:      G46    V11110       sc_3  IE 2008   1159      7
17:      G46    V11110       sc_2  IE 2008   4935      7
18:      G46    V11110       sc_4  IE 2008    715      7
19:      G46    V11110       sc_5  IE 2008     NA      7
20:      G46    V11110       sc_6  IE 2008     NA      7
21:      G46    V11110   sc_TOTAL  IE 2008  12991      7
22:      G47    V11110       sc_1  IE 2008     NA     16
23:      G47    V11110       sc_3  IE 2008   2275     16
24:      G47    V11110       sc_2  IE 2008  11690     16
25:      G47    V11110       sc_4  IE 2008   1040     16
26:      G47    V11110       sc_5  IE 2008    378     16
27:      G47    V11110       sc_6  IE 2008     69     16
28:      G47    V11110   sc_TOTAL  IE 2008     NA     16
    category indicator size_class geo year values grp_id

Now there are some NA which can be calculated by substracting the sum of available values from the total value for each group. For example, the missing value in row 5 can be calculated as sc_TOTAL - sc_1 - sc_2 - sc_3 - sc_4 - sc_6. But this can only be done if there is one missing value, in case there are more than one like in row 12 and 13 it is not possible.

In addition to that there is also a column with the category and the category has sub-categories. That enables us to calculate the missing value in row 28 for example. By substracting the sc_TOTAL from G46 and G45 from the sc_TOTAL from G. This in turn enables us to calculate the missing value in row 22 like we did it before.

I would like to create a function for it that does this automatically, preferrably with data.table since the whole dataset is large. I made some attempts but they were not succesful.

In the end my table should look like this:

    category indicator size_class geo year values grp_id
 1:        G    V11110       sc_1  IE 2008  19996      1
 2:        G    V11110       sc_3  IE 2008   3890      1
 3:        G    V11110       sc_2  IE 2008  19476      1
 4:        G    V11110       sc_4  IE 2008   2035      1
 5:        G    V11110       sc_5  IE 2008    733      1
 6:        G    V11110       sc_6  IE 2008    104      1
 7:        G    V11110   sc_TOTAL  IE 2008  46234      1
 8:      G45    V11110       sc_1  IE 2008   4077      2
 9:      G45    V11110       sc_3  IE 2008    456      2
10:      G45    V11110       sc_2  IE 2008   2851      2
11:      G45    V11110       sc_4  IE 2008    280      2
12:      G45    V11110       sc_5  IE 2008     NA      2
13:      G45    V11110       sc_6  IE 2008     NA      2
14:      G45    V11110   sc_TOTAL  IE 2008   7736      2
15:      G46    V11110       sc_1  IE 2008   5864      7
16:      G46    V11110       sc_3  IE 2008   1159      7
17:      G46    V11110       sc_2  IE 2008   4935      7
18:      G46    V11110       sc_4  IE 2008    715      7
19:      G46    V11110       sc_5  IE 2008     NA      7
20:      G46    V11110       sc_6  IE 2008     NA      7
21:      G46    V11110   sc_TOTAL  IE 2008  12991      7
22:      G47    V11110       sc_1  IE 2008  10055     16
23:      G47    V11110       sc_3  IE 2008   2275     16
24:      G47    V11110       sc_2  IE 2008  11690     16
25:      G47    V11110       sc_4  IE 2008   1040     16
26:      G47    V11110       sc_5  IE 2008    378     16
27:      G47    V11110       sc_6  IE 2008     69     16
28:      G47    V11110   sc_TOTAL  IE 2008  25507     16
    category indicator size_class geo year values grp_id

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.