# calculate new columns from existing column by group and assigned the calculated value for entire group (observations)

I am trying to calculate the mean, standard deviation and z score to calculate the outlier.
I need to remove outlier if my var2 is greater than or equal to 4.

My R code:

#var2 > = 4

ot_data = clean_df %>% group_by(var1) %>%

mutate(Mean = mean(var2), SD = sd(var2))%>%

mutate(lower_limit = -Inf) %>%

mutate(lower_limit = (ifelse(var2 >= 4, (Mean - 2*SD),-Inf)))%>%

mutate(upper_limit = Inf) %>%

mutate(upper_limit = (ifelse(var2 >= 4, (Mean + 2*SD),Inf)))%>%

mutate(z_score = (ifelse(var2 >=4, (var2-Mean)/SD, 0)))

#subset data and remove the outlier by using the standard deviation and mean (upper and lover limit)

ot1 = ot_data %>% group_by(var1) %>% filter(var2 <= upper_limit & var2 >= lower_limit)

Hi!

In your case `clean_df`

Please have a look at this guide, to see how to create one:

# Short Version

You can share your data in a forum friendly way by passing the data to share to the dput() function.
If your data is too large you can use standard methods to reduce it before sending to dput().
When you come to share the dput() text that represents your data, please be sure to format your post with triple backticks on the line before your code begins to format it appropriately.

`````````
( example_df <- structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6,
5, 4.4, 4.9), Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9, 3.4,
3.4, 2.9, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7,
1.4, 1.5, 1.4, 1.5), Petal.Width = c(0.2, 0.2, 0.2, 0.2, 0.2,
0.4, 0.3, 0.2, 0.2, 0.1), Species = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("setosa", "versicolor", "virginica"
), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame")))
```
``````
``````#dataframe create
var1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C")
var2 = c(3, 2,1, 0, 0, 2, 0, 0, 5, 1, 2, 3, 1, 1,0)
clean_df = data.frame(var1, var2)

#verify the data
clean_df

#Calculate lower_limit, upper_limit, and z_score if var2 >= 4
#Assign lower_limit, upper_limit, and z_score to entire group if var2 >=4

ot_data = clean_df %>% group_by(var1) %>%
mutate(Mean = mean(var2), SD = sd(var2)) %>%
mutate(lower_limit = -Inf) %>%
mutate(lower_limit = (ifelse(var2 >= 4, (Mean - 2*SD),-Inf)))%>%
mutate(upper_limit = Inf) %>%
mutate(upper_limit = (ifelse(var2 >= 4, (Mean + 2*SD),Inf))) %>%
mutate(z_score = (ifelse(var2 >=4, (var2-Mean)/SD, 0)))

**#need to assigned conditional values to the group instead of the individual observation**
ot_data
``````
``````
# Calculate lower_limit, upper_limit, and z_score
# for all groups of var1 after first having removed var2 >= 4

(ot_data <- clean_df %>%
filter(var2 < 4) %>%
group_by(var1) %>%
mutate(across(var2,
list(Mean = mean, SD = sd),
.names = "{.fn}"),
lower_limit = Mean - 2 * SD,
upper_limit = Mean + 2 * SD,
z_score = (var2 - Mean) / SD
))
``````

I need to keep all observations for var2
When i remove filter(var2 < 4) from above code, I am getting an error.

please be precise and state what the error is.
I have tested to see if I get an error, and I don't ...

``````
(ot_data <- clean_df %>%
# filter(var2 < 4) %>%
group_by(var1) %>%
mutate(across(var2,
list(Mean = mean, SD = sd),
.names = "{.fn}"),
lower_limit = Mean - 2 * SD,
upper_limit = Mean + 2 * SD,
z_score = (var2 - Mean) / SD
))
``````

It works!

I do need to assign NA (missing value) if var2 < 4 - So I just need to assign NA (missing value) if var2 < 4

Once I calculate outlier .... by using the Lower_limit and Upper_limit. I need to exclude observations if lower_limit > var2 or upper_limit < var2.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.