mutate new values in df based on condition that relates to different df

Salvardore · April 12, 2022, 12:30pm

Hello everyone,

I am currently involved in a tiny research project and I am facing a mutate-problem with dplyr relatively early in the process. I am trying to calculate the percentage of of a sub-group on a main group. The data looks like this:

Dataset 1:

area_id status count

1 01001 1 93928
2 01001 2 99368
3 01001 3 74463
4 01002 1 215774
5 01002 2 218672
6 01002 3 192941
7 01003 1 181151
8 01003 2 192095
9 01003 3 180232

Dataset 2

are_id count

1 01001 267759
2 01002 627387
3 01003 553478
4 01004 243065
5 01051 331176

So as you can see, table 1 splits the data of table 2 into various sub-groups, depending on a variable called "status". Therefore, table 1 has three times as much as rows as table 2. My goal is to compute a new column in table 1 that contains the percentage of each subgroup on the total count per area unit that is only included in table 2. I want it to look like this:
area_id status count percentage

1 01001 1 93928 0.3507967986
2 01001 2 99368 0.3711098413
3 01001 3 74463 etc.

I tried this, but it is not working:

table1 %>%
mutate(percentage=ifelse(table1$area_id == table2$area_id, table1$count/table2$count, "FALSE")

Can somebody help me on this one?

Thanks a lot.

pe2ju · April 12, 2022, 1:17pm

Can you provide minimal reproducible data for your case?

Just refer to https://reprex.tidyverse.org/
and https://reprex.tidyverse.org/

nirgrahamuk · April 12, 2022, 1:27pm

In general problems involving relationships between two or more tables are approached by applying an appropriate table join.

andresrcs · April 13, 2022, 12:51am

You don't need table2 at all, you can calculate all you need directly from table1

library(dplyr)

# Sample data on a copy/paste friendly format, replace this with your own data frame
table1 <- data.frame(
  stringsAsFactors = FALSE,
           area_id = c("01001","01001","01001",
                       "01002","01002","01002","01003","01003","01003"),
            status = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
             count = c(93928,99368,74463,215774,
                       218672,192941,181151,192095,180232)
)

# Relevant code
table1 %>% 
    group_by(area_id) %>% 
    mutate(percentage = count / sum(count))
#> # A tibble: 9 × 4
#> # Groups:   area_id [3]
#>   area_id status  count percentage
#>   <chr>    <dbl>  <dbl>      <dbl>
#> 1 01001        1  93928      0.351
#> 2 01001        2  99368      0.371
#> 3 01001        3  74463      0.278
#> 4 01002        1 215774      0.344
#> 5 01002        2 218672      0.349
#> 6 01002        3 192941      0.308
#> 7 01003        1 181151      0.327
#> 8 01003        2 192095      0.347
#> 9 01003        3 180232      0.326

^{Created on 2022-04-12 by the reprex package (v2.0.1)}

Note: Next time please provide a proper REPRoducible EXample (reprex) illustrating your issue.

system · May 4, 2022, 12:52am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.