I want to calculate the distribution of NA per row and per column of my data.frame. Shouldn't the average be the same? Although the difference is very small, I do not understand why it exists.
apply(base, 1, function(x) round(sum(is.na(x)/length(x)),2)) %>%
as.vector() %>%
summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
0,0700 0,2000 0,2300 0,2141 0,2300 0,2800
apply(base, 2, function(x) round(sum(is.na(x)/length(x)),2)) %>%
as.vector() %>%
summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
0,0000 0,0000 0,0000 0,2136 0,0550 1,0000
But in your full example above, both lines of code used round() and yet the values were different. Why are there two different results if both use the same function?
While round() can introduce numerical errors, here it looks to me like technocrat's explanation is the most important.
To make an even more obvious example,
base <- data.frame(A = rep(NA,5),
B = 1:5)
base
#> A B
#> 1 NA 1
#> 2 NA 2
#> 3 NA 3
#> 4 NA 4
#> 5 NA 5
rowSums(is.na(base))
#> [1] 1 1 1 1 1
colSums(is.na(base))
#> A B
#> 5 0
rowMeans(is.na(base))
#> [1] 0.5 0.5 0.5 0.5 0.5
colMeans(is.na(base))
#> A B
#> 1 0
So the total number of NAs is 5 either way, but the distributions are very different! Each column is either all NA or has no NA, whereas every row has one NA and one non-NA value.
There is no reason for the distribution of NAs within rows and columns to be the same (unless you make the assumption that the NAs are distributed randomly, which depends heavily on what your data represents, and even in that case you would expect the observed distributions not to be identical).
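To tie this back to the original question, here is a small check using the same toy `base` data frame as above (the variable names `row_mean` and `col_mean` are mine). It suggests that, without any rounding, the averages of the per-row and per-column NA proportions agree, even though the two distributions look nothing alike:

```r
# Same toy data as above: column A is all NA, column B has no NA
base <- data.frame(A = rep(NA, 5),
                   B = 1:5)

# Average of the per-row and per-column NA proportions, no rounding
row_mean <- mean(rowMeans(is.na(base)))
col_mean <- mean(colMeans(is.na(base)))

row_mean
#> [1] 0.5
col_mean
#> [1] 0.5

# Both equal the overall share of NA cells in the table
mean(is.na(base))
#> [1] 0.5
```

Every row (and every column) has the same length, so both averages reduce to the total number of NAs divided by the number of cells.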
If you round before summing, you are losing digits, for example:
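A minimal sketch of that effect, using a made-up 3 x 7 matrix `m` (not the original data): the unrounded row and column averages match, but rounding each proportion to 2 digits before averaging makes them drift apart slightly.

```r
# Hypothetical data: 3 rows, 7 columns, 3 NAs in total
m <- matrix(1, nrow = 3, ncol = 7)
m[1, 1]   <- NA
m[2, 1:2] <- NA

# Without rounding, both averages equal 3/21
exact_rows <- mean(rowMeans(is.na(m)))
exact_cols <- mean(colMeans(is.na(m)))
all.equal(exact_rows, exact_cols)
#> [1] TRUE

# Rounding each proportion first throws away digits before the average
rounded_rows <- mean(round(rowMeans(is.na(m)), 2))  # averages 0.14, 0.29, 0
rounded_cols <- mean(round(colMeans(is.na(m)), 2))  # averages 0.67, 0.33, 0, ...
rounded_rows
#> [1] 0.1433333
rounded_cols
#> [1] 0.1428571
```

Rounding once, after taking the mean, avoids the discrepancy.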