How I can create logic to get weighted sum of variables of separated two tables?

Hello, everyone. I have these two datasets.

           id      number  density
 1      467800075    7     47.4167
 2      488141762    7     44.5417
 3      488141768    5    156.9275
           id      number   density
 1      467800075    2     810.7656
 2      488141762    5    1057.1979
 3      488141768    3     865.4010

I want to calculate weighted sum of two tables per each id and the table I want to get ultimately is like this :

           id      number   weighted_density
 1      467800075     9           ...
 2      488141762    12           ...
 3      488141768     8           ...

Of course I know the equation that calculating weighted sum is
{(number of x1 * figure of x1) + (number of x2 * figure of x2)} / (number of x1 + number of x2)
and generally we use functions like 'group_by' or 'aggregate' to get summated value by each id,
but don't have any idea how to make the table above.

I would be very appreciate if you help me. How should I code to derive that?


there are two ways I present to you. The frist using base R but being pretty restrictive if your data has some IDs not present in one or the other data set. This will make it more prone to mistakes if you are not aware of that - but it is straight up matrix calculations so really fast. The othher uses collapse, a package designed for grouped and weighted data transformation. It is also very fast written in C and C++) and can handle e.g. missing IDs at the first place, so there won't be any wrong results if your data is not perfect.

As a side note: you are doing a weighted mean, not a weighted sum. But without further ado, this are the two ways:

df1 <- data.frame(
  id      = c(467800075,488141762,488141768),
  number  = c(7,7,5),
  density = c(47.4167,44.5417,156.9275)
df2 <- data.frame(
  id      = c(467800075,488141762,488141768),
  number  = c(2,5,3),
  density = c(810.7656,1057.1979,865.4010)

# base R with matrix calculations
# only valid if every ID is present twice and the matrices are ordered
mat1 <- as.matrix(df1[2:3])
rownames(mat1) <- df1$id
mat2 <- as.matrix(df2[2:3])
rownames(mat1) <- df2$id
w_dens <- (mat1[,1] * mat1[,2] + mat2[,1] * mat2[,2])/(mat1[,1] + mat2[,1])
w_num  <- mat1[,1] + mat2[,1]
w_mat <- matrix(c(w_num,w_dens),ncol = 2)
rownames(w_mat) <- df1$id
#>           [,1]     [,2]
#> 467800075    9 217.0498
#> 488141762   12 466.4818
#> 488141768    8 422.6051

# using fast grouped and weighted calculations from collapse
# more versatile, but not as fast as straight up matrix calculations
library(collapse); library(data.table)
# convert to data.table objects since it is faster
dt1 <- qDT(df1)
dt2 <- qDT(df2)
# combine the data.table objects
dt <- rbindlist(list(dt1,dt2))
# actual calculations
dt |>
  fgroup_by(id) |> 
    number        = fsum(number),
    weighted_mean = fmean(density, w = number)
#>           id number weighted_mean
#> 1: 467800075      9      217.0498
#> 2: 488141762     12      466.4818
#> 3: 488141768      8      422.6051

Created on 2022-11-10 by the reprex package (v2.0.1)

Kind regards

1 Like

Thank you. I applied your second way and there wasn't any error, but I have some further questions.

First of all, yeah I was confused the meaning of words and what I wanna calculate is still weighted mean. Thus, the equation that I wrote previously is still what I want to calculate. In that sense, your code is right, isn't it?

And I didn't tell this previously, but actually the number of objects in first table is about 110,000, while that in second one is 90,000. That is, about 20,000 values of certain ids should be just those values, because there is no other figure to calculate weighted mean with. I am not sure that 'there won't be any wrong results even if your data is not perfect' means like this..
What I am wondering is, won't it be matter in regard to those cases?

1 Like

That's correct, I used your formula (a weighted mean) and hence both results output the same (correct) result.

Then you should stick to the second solution, since collapse will not care. A weighted mean with just one entry is a regular mean. If the entry is /text{weight} = a and /text{density} = 5, we just have

/text{weighted mean} = /frac{a /cdot 5}{a} = 5

That's exactly what it means. :slight_smile: Your missing IDs will not matter in the second approach, but in the first they would.

Kind regards

thank you so much!!!

1 Like

Glad I was able to help you.

Consider accepting the answer that solved your problem (checkmark below the answer) to indicate a solution was found. :slight_smile:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.