Hello there,
I have a dataframe with negative numbers that shouldn't be there (data errors). I was hoping to find a way to replace each negative value in the dataframe with the mean of the column that the value is found in.
Here's an example dataframe:
var1 var2 var3
Site1 1 3 -5
Site2 2 6 6
Site3 -3 4 3
Site4 4 -7 2
I thought about first replacing the negative values with 0:
df1[df1<0] <- 0
Which results in this:
var1 var2 var3
Site1 1 3 0
Site2 2 6 6
Site3 0 4 3
Site4 4 0 2
So that now I can replace the 0s with the mean of each column, so that it looks like this:
var1 var2 var3
Site1 1.00 3.00 2.75
Site2 2.00 6.00 6.00
Site3 1.75 4.00 3.00
Site4 4.00 3.25 2.00
But I don't know how to do this in any sort of quick or efficient manner. Any insight into how to approach this would be greatly appreciated.
Thank you so much!
FJCC
February 2, 2022, 8:57pm
2
Here is one method.
DF <- data.frame(Site = c("Site1", "Site2", "Site3", "site4"),
var1 = c(1,2,0,4), var2 = c(3,6,4,0),
var3 = c(0,6,3,2))
DF
#> Site var1 var2 var3
#> 1 Site1 1 3 0
#> 2 Site2 2 6 6
#> 3 Site3 0 4 3
#> 4 site4 4 0 2
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
DF <- DF |> mutate(across(.cols = var1:var3,
.fns = ~ifelse(.x == 0, mean(.x), .x)))
DF
#> Site var1 var2 var3
#> 1 Site1 1.00 3.00 2.75
#> 2 Site2 2.00 6.00 6.00
#> 3 Site3 1.75 4.00 3.00
#> 4 site4 4.00 3.25 2.00
Created on 2022-02-02 by the reprex package (v2.0.1)
Note that using this "mean" value might give strange results if many of the values in a column have been replaced with zero.
1 Like
Ahhh, yes, of course that would leave some strange results. I wonder if there's a way to set the values to NA and then replace the NA values with the mean - is there a way to incorporate an na.rm=T
or something like that?
Thanks so much for your help!
FJCC
February 2, 2022, 9:33pm
4
DF <- data.frame(Site = c("Site1", "Site2", "Site3", "site4"),
var1 = c(1,2,-4,4), var2 = c(3,6,4,-2),
var3 = c(-5,6,3,2))
DF
Site var1 var2 var3
1 Site1 1 3 -5
2 Site2 2 6 6
3 Site3 -4 4 3
4 site4 4 -2 2
library(dplyr)
DF <- DF |> mutate(across(.cols = var1:var3,
.fns = ~ifelse(.x < 0, NA, .x))) |>
mutate(across(.cols = var1:var3,
.fns = ~ifelse(is.na(.x), mean(.x,na.rm=TRUE), .x)))
DF
Site var1 var2 var3
1 Site1 1.000000 3.000000 3.666667
2 Site2 2.000000 6.000000 6.000000
3 Site3 2.333333 4.000000 3.000000
4 site4 4.000000 4.333333 2.000000
1 Like
Yaaaassss, you're the best! Thank you soooo much!
system
Closed
February 9, 2022, 9:40pm
6
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed. If you have a query related to it or one of the replies, start a new topic and refer back with a link.