Trying to find the proportion of a variable in sorting by a different variable

the_chid · December 4, 2021, 8:39pm

Working on a uni project that I don't necessarily need to do in R, but I'm trying to learn it. I have tree species, the forest type the tree was found in, and the species/hectare. I need to find the proportion of each species per forest type. I know I could easily do this by hand or in excel, but any thoughts would be appreciated. Site_ID is the forest type.

forest_data %>%
group_by(Site_ID, Species) %>%
summarise(Species_amount_per_ha = n()*100,
Species_proportion_in_Forest_Type = )

I cannot figure out what to put after that equals sign to get, for example, 13200/(13200+3100) to get the proportion of Betula pendula in Site_ID BL. NA = unknown.

I am thinking some sort of sort function but I'm not sure how to get it sorted specifically by the three Site_IDs.

Any thoughts would be appreciated, thanks.

JackDavison · December 4, 2021, 8:50pm

Hello, welcome to RStudio Community! First of all, congratulations to go above and beyond and try to learn some R - the initial difficulty curve will be worth it as it will make future work so much easier!

When posting here, it is useful to provide a reproducible example - we can't copy-paste a screenshot of data, so we are limited in how we can help. I'm going to make a fake version of your data on my machine:

library(tidyverse)

dat = crossing(site = c("bl", "bl_c", "c"),
               species = c("betula", "salix", "larix")) |> 
  mutate(amount = rnorm(9, 2000, 500))

# A tibble: 9 x 3
  site  species amount
  <chr> <chr>    <dbl>
1 bl    betula   1614.
2 bl    larix    1686.
3 bl    salix    1963.
4 bl_c  betula   1687.
5 bl_c  larix    1311.
6 bl_c  salix    1784.
7 c     betula   2251.
8 c     larix    2175.
9 c     salix    2330.

So with the data like this, if I want to know the percentage of each species in each forest, I'd group by the the factor where, within each value, the "percentage" column should add up to 100%. Then, I mutate a new column which is equal to the value column divided by the sum of the value column. As the data frame is grouped, the sum of the value column is just the sum of that given group.

dat |> 
  group_by(site) |> 
  mutate(perc = amount / sum(amount))

# A tibble: 9 x 4
# Groups:   site [3]
  site  species amount  perc
  <chr> <chr>    <dbl> <dbl>
1 bl    betula   1427. 0.296
2 bl    larix    1696. 0.352
3 bl    salix    1700. 0.352
4 bl_c  betula   2104. 0.328
5 bl_c  larix    2368. 0.369
6 bl_c  salix    1942. 0.303
7 c     betula   2965. 0.392
8 c     larix    2236. 0.296
9 c     salix    2364. 0.312

The way to test this could now be to sum up the percentage column - each should be 100% - which it is!

dat |> 
  group_by(site) |> 
  mutate(perc = amount / sum(amount)) |> 
  summarise(sum(perc))

# A tibble: 3 x 2
  site  `sum(perc)`
  <chr>       <dbl>
1 bl              1
2 bl_c            1
3 c               1

Or we could visualise it...

dat |> 
  group_by(site) |> 
  mutate(perc = amount / sum(amount)) |> 
  ggplot(aes(x = site, y = perc, fill = species)) +
  geom_col()

NB: If you want to get rid of NA values, you could drop them by doing something like:

dat |> 
  drop_na(species)

the_chid · December 6, 2021, 8:54pm

Thank you so much for your help! This solved my problem. I'm actually posting another question using the same data, if you have a moment I would appreciate the help!

system · December 13, 2021, 8:54pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.