creating a dataframe by querying other data frames

Sean1979 · September 19, 2023, 3:06am

Hi R gurus,
I want to create a data frame that calculates abundances of traits from two other related data frames but don’t know what code would give me the desired output.
The first data frame, species abundance, contains species (rows) by sites (columns). e.g.;

          BaCo_1 BaFa_1 BaSl_1 BrSl_1 CaCo_1 CaFa_1 CaSl_1
Abla.nota      0      0      3      2      9      0      0
Albo.woro      0      0      0      1      0      0      2
Aust.coll      0      2      0      0      2      0      0
Bibu.Kadj      0      0      1      0      0      0      0
cala.sp        0      0      0      0      0      7      1

The second dataframe, species traits, contains species (rows) by traits (columns);

          MFA2 MFA4 MFA5 MFA6 MFA12 MFA14 flow1 flow2 flow3
Abla.nota    1    0    0    0     0     0     1     0     0
Albo.woro    0    0    0    1     0     0     1     0     0
alel.aust    0    1    0    0     0     0     0     0     1
Aulo.stri    1    0    0    0     0     0     0     0     1
Aust.anac    0    0    0    0     1     0     1     0     0
Aust.coll    0    0    0    0     1     0     1     0     0
Aust.subt    0    1    0    0     0     0     0     0     1
bero.sp      0    1    0    0     0     0     1     0     0
Bibu.Kadj    1    0    0    0     0     0     0     0     1
Bran.sowe    0    0    1    0     0     0     1     0     0
cala.sp      0    0    0    1     0     0     1     0     0

Note that the “traits” data frame contains more species than the “sites” data frame. This is because the trait data frame contains all species in the study, of which the “sites” data frame is a subset. Both examples above are subsets of the full data.
From these, I want to create a third data frame, sites (rows) by traits (columns), a dataframe which has the trait abundances at each site, regardless of the species that possess those traits;

       MFA2 MFA4 MFA5 MFA6 MFA12 MFA14 flow1 flow2 flow3
BaCo_1    0    0    0    0     0     0     0     0     0
BaFa_1    0    0    0    0     2     0     2     0     0
BaSl_1    4    0    0    0     0     0     3     0     1
BrSl_1    2    0    0    1     0     0     3     0     0
CaCo_1    9    0    0    0     2     0    11     0     0
CaFa_1    0    0    0    7     0     0     7     0     0
CaSl_1    0    0    0    3     0     0     3     0     0

I created this output by hand because I don’t know what code would produce this output. Each value is a summation of the abundances of all individuals in the sample (row) that possess a trait (column). For example, CaCo_1 x flow1 in the example output, 11, above is the summation of the 9 alba.nota and 2 Aust.coll from the first data frame that possess the flow1 trait identified in the second dataframe.
What code/package should I use to produce this outcome?

FJCC · September 19, 2023, 4:15am

Here are two slightly different version of a solution. In the first, I filter out elements of the data that are zero. This makes the AllData data frame much smaller. However, that means that not all of the listed sites and traits appear in the final data. In the second version, the zeros are kept at the price of a larger data set.

library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
Traits <- read.csv("~/R/Play/Traits.csv")
Traits
#>      Species MFA2 MFA4 MFA5 MFA6 MFA12 MFA14 flow1 flow2 flow3
#> 1  Abla.nota    1    0    0    0     0     0     1     0     0
#> 2  Albo.woro    0    0    0    1     0     0     1     0     0
#> 3  alel.aust    0    1    0    0     0     0     0     0     1
#> 4  Aulo.stri    1    0    0    0     0     0     0     0     1
#> 5  Aust.anac    0    0    0    0     1     0     1     0     0
#> 6  Aust.coll    0    0    0    0     1     0     1     0     0
#> 7  Aust.subt    0    1    0    0     0     0     0     0     1
#> 8    bero.sp    0    1    0    0     0     0     1     0     0
#> 9  Bibu.Kadj    1    0    0    0     0     0     0     0     1
#> 10 Bran.sowe    0    0    1    0     0     0     1     0     0
#> 11   cala.sp    0    0    0    1     0     0     1     0     0

Sites <- read.csv("~/R/Play/Sites.csv")
Sites
#>     Species BaCo_1 BaFa_1 BaSl_1 BrSl_1 CaCo_1 CaFa_1 CaSl_1
#> 1 Abla.nota      0      0      3      2      9      0      0
#> 2 Albo.woro      0      0      0      1      0      0      2
#> 3 Aust.coll      0      2      0      0      2      0      0
#> 4 Bibu.Kadj      0      0      1      0      0      0      0
#> 5   cala.sp      0      0      0      0      0      7      1

#version 1
Traits_long <- Traits |> pivot_longer(cols = -Species, 
                                      names_to = "Traits", 
                                      values_to = "Present") |> 
  filter(Present == 1)
Traits_long
#> # A tibble: 22 × 3
#>    Species   Traits Present
#>    <chr>     <chr>    <int>
#>  1 Abla.nota MFA2         1
#>  2 Abla.nota flow1        1
#>  3 Albo.woro MFA6         1
#>  4 Albo.woro flow1        1
#>  5 alel.aust MFA4         1
#>  6 alel.aust flow3        1
#>  7 Aulo.stri MFA2         1
#>  8 Aulo.stri flow3        1
#>  9 Aust.anac MFA12        1
#> 10 Aust.anac flow1        1
#> # ℹ 12 more rows

Sites_long <- Sites |> pivot_longer(cols = -Species,
                                  names_to = "Sites", values_to = "Count") |> 
  filter(Count > 0)
Sites_long
#> # A tibble: 10 × 3
#>    Species   Sites  Count
#>    <chr>     <chr>  <int>
#>  1 Abla.nota BaSl_1     3
#>  2 Abla.nota BrSl_1     2
#>  3 Abla.nota CaCo_1     9
#>  4 Albo.woro BrSl_1     1
#>  5 Albo.woro CaSl_1     2
#>  6 Aust.coll BaFa_1     2
#>  7 Aust.coll CaCo_1     2
#>  8 Bibu.Kadj BaSl_1     1
#>  9 cala.sp   CaFa_1     7
#> 10 cala.sp   CaSl_1     1

AllData <- inner_join(Sites_long, Traits_long, by = "Species", relationship = "many-to-many")

Summs <- AllData |> group_by(Sites, Traits) |> summarize(Total = sum(Count)) |> 
  pivot_wider(names_from = Traits, values_from = Total)
#> `summarise()` has grouped output by 'Sites'. You can override using the
#> `.groups` argument.
Summs
#> # A tibble: 6 × 6
#> # Groups:   Sites [6]
#>   Sites  MFA12 flow1  MFA2 flow3  MFA6
#>   <chr>  <int> <int> <int> <int> <int>
#> 1 BaFa_1     2     2    NA    NA    NA
#> 2 BaSl_1    NA     3     4     1    NA
#> 3 BrSl_1    NA     3     2    NA     1
#> 4 CaCo_1     2    11     9    NA    NA
#> 5 CaFa_1    NA     7    NA    NA     7
#> 6 CaSl_1    NA     3    NA    NA     3

#version 2
Traits_long <- Traits |> pivot_longer(cols = -Species, 
                                      names_to = "Traits", 
                                      values_to = "Present")
Traits_long
#> # A tibble: 99 × 3
#>    Species   Traits Present
#>    <chr>     <chr>    <int>
#>  1 Abla.nota MFA2         1
#>  2 Abla.nota MFA4         0
#>  3 Abla.nota MFA5         0
#>  4 Abla.nota MFA6         0
#>  5 Abla.nota MFA12        0
#>  6 Abla.nota MFA14        0
#>  7 Abla.nota flow1        1
#>  8 Abla.nota flow2        0
#>  9 Abla.nota flow3        0
#> 10 Albo.woro MFA2         0
#> # ℹ 89 more rows

Sites_long <- Sites |> pivot_longer(cols = -Species,
                                    names_to = "Sites", values_to = "Count")
Sites_long
#> # A tibble: 35 × 3
#>    Species   Sites  Count
#>    <chr>     <chr>  <int>
#>  1 Abla.nota BaCo_1     0
#>  2 Abla.nota BaFa_1     0
#>  3 Abla.nota BaSl_1     3
#>  4 Abla.nota BrSl_1     2
#>  5 Abla.nota CaCo_1     9
#>  6 Abla.nota CaFa_1     0
#>  7 Abla.nota CaSl_1     0
#>  8 Albo.woro BaCo_1     0
#>  9 Albo.woro BaFa_1     0
#> 10 Albo.woro BaSl_1     0
#> # ℹ 25 more rows

AllData <- inner_join(Sites_long, Traits_long, by = "Species", relationship = "many-to-many")

Summs <- AllData |> mutate(Count = Count * Present) |> 
  group_by(Sites, Traits) |> summarize(Total = sum(Count)) |> 
  pivot_wider(names_from = Traits, values_from = Total)
#> `summarise()` has grouped output by 'Sites'. You can override using the
#> `.groups` argument.
Summs
#> # A tibble: 7 × 10
#> # Groups:   Sites [7]
#>   Sites  MFA12 MFA14  MFA2  MFA4  MFA5  MFA6 flow1 flow2 flow3
#>   <chr>  <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 BaCo_1     0     0     0     0     0     0     0     0     0
#> 2 BaFa_1     2     0     0     0     0     0     2     0     0
#> 3 BaSl_1     0     0     4     0     0     0     3     0     1
#> 4 BrSl_1     0     0     2     0     0     1     3     0     0
#> 5 CaCo_1     2     0     9     0     0     0    11     0     0
#> 6 CaFa_1     0     0     0     0     0     7     7     0     0
#> 7 CaSl_1     0     0     0     0     0     3     3     0     0

^{Created on 2023-09-18 with reprex v2.0.2}

Sean1979 · September 19, 2023, 5:34am

thank you for responding so quickly. I have run the code myself and had success

system · September 26, 2023, 5:35am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.