Why 2 different results?

juandmaz · October 23, 2023, 8:11pm

I want to calculate the distribution of NA per row and per column of my data.frame. Shouldn't the average be the same? Although the difference is very small, I do not understand why it exists.

apply(base, 1, function(x) round(sum(is.na(x)/length(x)),2)) %>%
  as.vector() %>% 
  summary()

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0,0700  0,2000  0,2300  0,2141  0,2300  0,2800

apply(base, 2, function(x) round(sum(is.na(x)/length(x)),2)) %>%
  as.vector() %>% 
  summary()

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0,0000  0,0000  0,0000  0,2136  0,0550  1,0000

technocrat · October 23, 2023, 8:47pm

Is the base object square (equal number of rows and columns)? If not, the length(x) divisors differ.

FJCC · October 23, 2023, 8:47pm

You are rounding to two decimal places but comparing the summaries to higher precision than that. See if this example makes the situation clearer.

base <- data.frame(A = c(1,NA,NA,4:10), B = c(NA,22,33,4:10), C = c(1:10))

#Do only the apply()
apply(base, 1, function(x) round(sum(is.na(x)/length(x)),2))
#>  [1] 0.33 0.33 0.33 0.00 0.00 0.00 0.00 0.00 0.00 0.00

apply(base, 2, function(x) round(sum(is.na(x)/length(x)),2))
#>   A   B   C 
#> 0.2 0.1 0.0

#Do the full example
apply(base, 1, function(x) round(sum(is.na(x)/length(x)),2)) |>
  as.vector() |> 
  summary()
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.0000  0.0000  0.0000  0.0990  0.2475  0.3300

apply(base, 2, function(x) round(sum(is.na(x)/length(x)),2)) |>
  as.vector() |> 
  summary()
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    0.00    0.05    0.10    0.10    0.15    0.20

#Remove the rounding
apply(base, 1, function(x) sum(is.na(x)/length(x))) |>
  as.vector() |> 
  summary()
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.0000  0.0000  0.0000  0.1000  0.2500  0.3333

apply(base, 2, function(x) sum(is.na(x)/length(x))) |>
  as.vector() |> 
  summary()
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    0.00    0.05    0.10    0.10    0.15    0.20

Created on 2023-10-23 with reprex v2.0.2
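
As a cross-check on the example above, here is a minimal sketch that assumes the same toy `base`. The names `na_mat`, `row_frac`, `col_frac` and `overall` are illustrative only. It suggests that, without rounding, the mean of the row-wise NA fractions and the mean of the column-wise NA fractions both equal the overall NA fraction, whatever the shape of the data frame, and that the gap only appears once each fraction is rounded before averaging.

```r
base <- data.frame(A = c(1, NA, NA, 4:10), B = c(NA, 22, 33, 4:10), C = 1:10)

na_mat <- is.na(base)            # logical matrix, TRUE where a value is missing

row_frac <- rowMeans(na_mat)     # fraction of NA in each row
col_frac <- colMeans(na_mat)     # fraction of NA in each column
overall  <- mean(na_mat)         # fraction of NA in the whole table

# Unrounded, the two means agree with the overall fraction
# (up to floating-point tolerance)
all.equal(mean(row_frac), overall)
all.equal(mean(col_frac), overall)

# Rounding every fraction first (1/3 becomes 0.33) shifts the row-wise mean
mean(round(row_frac, 2))
mean(round(col_frac, 2))
```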

juandmaz · October 30, 2023, 3:56am

Thanks for the answer.
So round() changes my results, but why? I don't understand.

technocrat · October 30, 2023, 4:47am

The result doesn't change, only its representation, unless the return value of round() is reassigned to the same name.

result = mtcars$mpg/mtcars$drat
summary(result)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   3.467   4.757   5.529   5.544   6.183   8.064
round(summary(result),2)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    3.47    4.76    5.53    5.54    6.18    8.06
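
To relate this to the NA question, here is a rough sketch that assumes FJCC's toy data frame; `row_frac` is an illustrative name. It contrasts rounding inside the calculation, where the rounded values are what get averaged, with rounding only the finished summary, which affects just the display.

```r
base <- data.frame(A = c(1, NA, NA, 4:10), B = c(NA, 22, 33, 4:10), C = 1:10)

row_frac <- rowMeans(is.na(base))   # per-row NA fractions, kept unrounded

# Rounding inside the calculation changes the numbers that get averaged
mean(round(row_frac, 2))

# Rounding only the finished summary changes the display, not the calculation
round(summary(row_frac), 2)

# round() returns a new value; row_frac itself is untouched
row_frac[1]
```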

juandmaz · October 30, 2023, 3:30pm

But in the full example above, both lines of code used round(), and yet the values were different. Why were there two different representations if both used the same function?

AlexisW · October 30, 2023, 5:10pm

While round() can introduce numerical errors, here it looks to me like technocrat's explanation is the most important.

To make an even more obvious example,

base <- data.frame(A = rep(NA,5),
                   B = 1:5)

base
#>    A B
#> 1 NA 1
#> 2 NA 2
#> 3 NA 3
#> 4 NA 4
#> 5 NA 5

rowSums(is.na(base))
#> [1] 1 1 1 1 1
colSums(is.na(base))
#> A B 
#> 5 0

rowMeans(is.na(base))
#> [1] 0.5 0.5 0.5 0.5 0.5
colMeans(is.na(base))
#> A B 
#> 1 0

so the total number of NAs is 5 in any case, but the distributions are very different! The columns have either all NA or no NA, the rows all have one NA and one non-NA.

There is no reason for the distribution of NAs within rows and columns to be the same (unless you make the assumption that the NAs are distributed randomly, which depends heavily on what your data represents, and even in that case you would expect the observed distributions not to be identical).
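
A small sketch of that point, assuming the two-column `base` defined above (`row_frac` and `col_frac` are illustrative names): the unrounded means coincide, while the spread of the two distributions does not.

```r
base <- data.frame(A = rep(NA, 5),
                   B = 1:5)

row_frac <- rowMeans(is.na(base))   # every row is half NA
col_frac <- colMeans(is.na(base))   # one column all NA, one with none

# Same mean in both directions
mean(row_frac)
mean(col_frac)

# Very different spread
range(row_frac)
range(col_frac)
```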

If you round before summing, you are losing digits, for example:

round(0.49 + 0.49)
#> [1] 1
round(0.49) + round(0.49)
#> [1] 0

but that should only create small differences.
