Complete beginner: species number from a data frame

Hey guys,

I am completely new to RStudio and statistical/data analysis. For a new project I need to analyse data with RStudio and I am lost... I have managed to learn some basics of RStudio, like reading in data (xlsx etc.) and working with clean data frames I got from people who had already finished their scripts, and some other things. But when it comes to my own data and differently structured data frames, nothing they did works...
Now I would like to count the number of insect species on different plots, which I identified myself (which means that I created the xlsx sheet myself).

I have a data frame with 576 rows and 6 columns. Column 1 is the plot ID (for example: Greece1); the following columns refer to the insects: Col 2 is family, Col 3 is sub-family, Col 4 is genus, Col 5 is subgenus and Col 6 is the species. For my question I only need the plot ID, Col4 and Col6 (Col4 and Col6 together form the insect species name).

What I would like to do now is count how many species I have per plot ID... and I have no idea how...
Would you have some advice or ideas? Do I need to change the structure of my xlsx? If so, how?
All the data in the columns are names, like Greece1 for the plot, Stenolophus for the genus and teutonus for the species. Do I need to convert them with as.numeric or as.factor? I read about those functions but I did not fully understand them.

I hope that I am not asking too much, but as I am lost in RStudio I have no idea how to continue...



To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

A good introduction to R is R for Data Science. There are now hundreds of books and online tutorials and R is so vast that one can't really "know" it. Take a good guide like this and then build out into specific topics as necessary.

To illustrate your question, we can borrow a built-in dataset that is analogous. mtcars has different content but the way columns are selected works the same.

head(mtcars)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
dim(mtcars)
#> [1] 32 11
# keep all rows, and only columns 1, 2 and 6
portion <- mtcars[,c(1,2,6)]
head(portion)
#>                    mpg cyl    wt
#> Mazda RX4         21.0   6 2.620
#> Mazda RX4 Wag     21.0   6 2.875
#> Datsun 710        22.8   4 2.320
#> Hornet 4 Drive    21.4   6 3.215
#> Hornet Sportabout 18.7   8 3.440
#> Valiant           18.1   6 3.460
dim(portion)
#> [1] 32  3

Created on 2022-12-22 by the reprex package (v2.0.1)

The car names to the left are row identifiers, not variables, so they don't count as columns in the data frame. To begin, we had 32 rows of 11 variables each, from which a new data frame was subset with all rows and just the first, second and sixth columns (mpg, cyl and wt). portion can be used for tabulating.

Try that with your data and come back with a reprex as @andresrcs suggests for help with tabulating.
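Applied to a data frame laid out like the one described in the question, the same subsetting might look like the sketch below. The column names and the placeholder values (fam1, sub1, ...) are made up for illustration, since the real data were not posted at this point; only the plot, genus and species names come from the thread.

```r
# Hypothetical miniature of the 6-column data frame from the question
insects <- data.frame(
  Plot_ID   = c("Greece1", "Greece1", "Spain4"),
  Family    = c("fam1", "fam2", "fam1"),
  Subfamily = c("sub1", "sub2", "sub3"),
  Genus     = c("Stenolophus", "Apion", "Amara"),
  Subgenus  = c("sg1", "sg2", "sg3"),
  Species   = c("teutonus", "rugicolle", "anthobia")
)

# Select all rows and columns 1, 4 and 6 by position ...
by_position <- insects[, c(1, 4, 6)]

# ... or by name, which keeps working if the column order changes
by_name <- insects[, c("Plot_ID", "Genus", "Species")]

dim(by_name)
```

Character columns such as these can be selected and tabulated as they are; no conversion with as.numeric or as.factor is needed for counting.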

Is this what you want:


library(dplyr)

# toy data: one row per recorded insect
mydf <- tibble(
  ID = c("A","B","A","A", "C"),
  Col4 = c("gen1", "gen2", "gen3", "gen4", "gen5"),
  Col6 = c("spec1", "spec2", "spec3", "spec4", "spec5")
)

# number of rows per ID
mydf %>% 
  group_by(ID) %>% 
  summarise(n = n())

# A tibble: 3 × 2
  ID        n
  <chr> <int>
1 A         3
2 B         1
3 C         1

Hey guys,

Thanks for all the answers, and sorry that I needed some time to create the df!

df <- data.frame(
  Site_ID = c("Greece1", "Greece1", "Greece1", "Muritz_2", "Muritz_2", "Muritz_2", "Muritz_2", "Spain4", "Spain4", "UK4"),
  Genus = c("Stenolophus", "Apion", "Cryptocephallus", "Apion", "Apion", "Apion", "Coccinula", "Microlestes", "Amara", "Amara"),
  species = c("teutonus", "rugicolle", "moreie", "pratense", "pratense", "pratense", "viridica", "minutulus", "anthobia", "aenea")
)

This is what I have. What I want to do for my 570+ rows and my numerous site IDs is to count, per site (first column), how many insect species I have. Columns 2 and 3 together form the species name. For example, "Stenolophus teutonus" is one species at the site Greece1. With this little data frame I can simply count by eye and see that the site Greece1 has 3 insect species or that Muritz_2 has 2 species.
But how do I do that for my whole big data frame?

I read something about the "vegan" library and the command "specnumber".

What I would also like to do afterwards is plot the different sites and their species numbers. Probably a barplot with the species number as the dependent variable on the y-axis and the site ID as the explanatory variable on the x-axis.

barplot(s~Site.ID, data = data.frame, ylab= "species number", x-lab = "examined plot")

Would that even work once I figure out how to calculate the species number for each Site.ID?
I think I know how to do basic plots after watching tutorials, and I have also read about ggplot, but beginning with RStudio is so difficult when you have a specific question...

Thank you very much guys

Hey Flm,

Thanks for your reply! It would be something like that, I guess... but in your example the count is for the ID only, isn't it?
I would need it per SiteID: for each SiteID, the number of species (which would be formed by gen1+spec1, gen3+spec3 and gen4+spec4 for site A in your nice example!)

I posted a comment about my problem; maybe it makes it clearer. Thank you for your help!!

Hello technocrat,

Thank you very much for the nice beginner's tutorial, I will try to use it! As time is always short and I am really, really slow at learning programming, I mostly rely on tutorial videos about specific questions and read commented scripts from colleagues (even though I don't really understand everything). But probably I will need to...

However, thanks for your example, but I think what you have done is count the rows and the columns, am I right? What I would need, in your example, is to count for each car (this would be my SiteID, and those can repeat several times) how many different "wt" values there are (those would be my species). The difference is that the species is formed by two columns together (see my reprex).


If I understand correctly this should be a solution:


library(dplyr)

df <- data.frame(
    Site_ID = c("Greece1", "Greece1", "Greece1", "Muritz_2", "Muritz_2", "Muritz_2", "Muritz_2", "Spain4", "Spain4", "UK4"),
    Genus = c("Stenolophus", "Apion", "Cryptocephallus", "Apion", "Apion", "Apion", "Coccinula", "Microlestes", "Amara", "Amara"),
    species = c("teutonus", "rugicolle", "moreie", "pratense", "pratense", "pratense", "viridica", "minutulus", "anthobia", "aenea")
  ) %>% as_tibble()

df %>% 
  mutate(gen_spec = paste(Genus, species, sep = "_")) %>% 
  count(Site_ID, gen_spec)
# A tibble: 8 × 3
  Site_ID  gen_spec                   n
  <chr>    <chr>                  <int>
1 Greece1  Apion_rugicolle            1
2 Greece1  Cryptocephallus_moreie     1
3 Greece1  Stenolophus_teutonus       1
4 Muritz_2 Apion_pratense             3
5 Muritz_2 Coccinula_viridica         1
6 Spain4   Amara_anthobia             1
7 Spain4   Microlestes_minutulus      1
8 UK4      Amara_aenea                1


Thank you Flm,

This is almost what I had in mind! I would like an absolute number now, so that I have the absolute number for Greece, as well as for the other sites!
Is it also possible to have this result in its own data frame or vector, so that I can go on and do other things with it? For example, calculate diversity indices?

Thank you very much

Is this what you want?

df %>% 
  mutate(gen_spec = paste(Genus, species, sep = "_")) %>% 
  count(Site_ID, gen_spec) %>% 
  group_by(Site_ID) %>% 
  summarise(sum = sum(n))
# A tibble: 4 × 2
  Site_ID    sum
  <chr>    <int>
1 Greece1      3
2 Muritz_2     4
3 Spain4       2
4 UK4          1

You can assign the result by putting mytable <- before the code, so that you can use the table later.

Hey Flm, sorry for the late answer.
That's almost it... With your code I now get the absolute number of individuals: we had 3 species in Greece, so I get 3 in the column, but for Muritz_2 we get 4. 4 is the number of individuals, but what I am looking for is the number of species, which would be 3 for Greece1, 2 for Muritz_2, 2 for Spain4 and 1 for UK4.
Do you have another idea?

Sorry for my late replies, but I am travelling for work right now and cannot check regularly.


Hi, try this:

df %>%
  select(Site_ID, species) %>%
  unique() %>%
  count(Site_ID)
# A tibble: 4 × 2
  Site_ID      n
  <chr>    <int>
1 Greece1      3
2 Muritz_2     2
3 Spain4       2
4 UK4          1
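A possible variant, sketched here on the reprex data from earlier in the thread: counting distinct Genus + species combinations instead of the species column alone, so that the same species epithet in two different genera is not merged into one. It assumes dplyr is installed; the name richness is just an illustrative choice.

```r
library(dplyr)

# Example data copied from the reprex earlier in the thread
df <- tibble(
  Site_ID = c("Greece1", "Greece1", "Greece1", "Muritz_2", "Muritz_2",
              "Muritz_2", "Muritz_2", "Spain4", "Spain4", "UK4"),
  Genus   = c("Stenolophus", "Apion", "Cryptocephallus", "Apion", "Apion",
              "Apion", "Coccinula", "Microlestes", "Amara", "Amara"),
  species = c("teutonus", "rugicolle", "moreie", "pratense", "pratense",
              "pratense", "viridica", "minutulus", "anthobia", "aenea")
)

# One row per site; n_distinct() counts the unique Genus + species pairs
richness <- df %>%
  group_by(Site_ID) %>%
  summarise(n_species = n_distinct(Genus, species))

richness
```

Because the result is stored in richness, it can be reused for later steps such as plotting or further calculations.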

Posing and understanding the problem is always trickiest.

Listing the different wt values is just

unique(mtcars$wt)
#>  [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 4.070 3.730 3.780
#> [13] 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840 3.845 1.935
#> [25] 2.140 1.513 3.170 2.770 2.780

Created on 2023-01-17 with reprex v2.0.2

For things that feel like they should be really basic, there is almost always a function, if you can find it.
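As a rough illustration of combining such functions in base R, without extra packages: length() and unique() give the number of different values, and tapply() repeats that per group. The sketch reuses the reprex data from the thread; the object names sites, n_wt and richness are made up for this example, and the axis labels are the ones proposed in the question.

```r
# Number of different wt values in mtcars
n_wt <- length(unique(mtcars$wt))

# Reprex data from the thread, rebuilt as a plain data frame
sites <- data.frame(
  Site_ID = c("Greece1", "Greece1", "Greece1", "Muritz_2", "Muritz_2",
              "Muritz_2", "Muritz_2", "Spain4", "Spain4", "UK4"),
  Genus   = c("Stenolophus", "Apion", "Cryptocephallus", "Apion", "Apion",
              "Apion", "Coccinula", "Microlestes", "Amara", "Amara"),
  species = c("teutonus", "rugicolle", "moreie", "pratense", "pratense",
              "pratense", "viridica", "minutulus", "anthobia", "aenea")
)

# Full species name = genus + species epithet
sites$gen_spec <- paste(sites$Genus, sites$species)

# Number of different species names per site (a named vector, one value per site)
richness <- tapply(sites$gen_spec, sites$Site_ID,
                   function(x) length(unique(x)))
richness

# Bar plot of species number per site
barplot(richness, xlab = "examined plot", ylab = "species number")
```

Since barplot() accepts a named vector directly, no formula interface is needed here; the site names become the bar labels.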
