ggplot multiple variables

Hersh · February 4, 2022, 3:35pm

Hello everyone,
I'm having trouble making graphical presentations in R.

My dataset represents stations as well as species names.
I have a station column (station 1, station 2, ...) and several species columns; each species column holds the count of that species at each station.

Here is a representation of my dataset:

I would like one graph per station, each showing the different species and their counts (without showing the species whose count is 0).

I tried several approaches with hist(), group_by, etc., but I can't seem to get what I want.
It is possible in Excel, but making a graph for each station takes far too long.

thank you very much

dvetsch75 · February 4, 2022, 4:33pm

I would recommend first pivoting your data to follow tidy principles - then you can do something like this:

library(dplyr)
library(tidyr)
library(ggplot2)

df <- tribble(
    ~station, ~unidentified, ~a.atlanticus, ~a.glacialis, ~a.medius, ~a.olrikii,
    1, NA, 3, NA, NA, NA,
    2, 7, NA, 8, 3, 2,
    3, 3, 4, NA, NA, NA,
    4, NA, 1, 6, 5, NA,
    5, 23, NA, NA, 5, 2,
    6, 4, NA, 15, 2, 1
)

# Reshape to one row per station/species pair, then draw one panel per station
df %>% 
    pivot_longer(
        cols = unidentified:a.olrikii,
        names_to = 'species',
        values_to = 'number'
    ) %>% 
    ggplot(aes(x = species, y = number)) +
    geom_col(aes(fill = species)) +
    facet_wrap(~station)

Hersh · February 4, 2022, 10:15pm

Hi, thank you very much for your answer, that's what I wanted!

On the other hand, the table I posted is only a portion of my dataset. In fact, I have about 40 stations and about 80 species in total.
Would you know how to avoid retyping the values in the code and simply use the column names? Also, do you know how I could make each graph display only the species that are present?

Thanks for answering!
Robin

For example, in the example I gave, only atlanticus is present at station 1, but the code still displays the other species with a value of 0.
With so many species, it will be unreadable to have all of them shown in every graph.

I have the same dataset but organized in another way, maybe easier to code:

FJCC · February 4, 2022, 11:07pm

Here is a small adjustment to @dvetsch75 's excellent answer. The first part, where the data are manually entered is there only because we do not have any of your data. If you already have the data in a data frame, you can use that instead.

library(tidyr)
library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tibble)
#> Warning: package 'tibble' was built under R version 4.1.2
df <- tribble(
  ~station, ~unidentified, ~a.atlanticus, ~a.glacialis, ~a.medius, ~a.olrikii,
  1, NA, 3, NA, NA, NA,
  2, 7, NA, 8, 3, 2,
  3, 3, 4, NA, NA, NA,
  4, NA, 1, 6, 5, NA,
  5, 23, NA, NA, 5, 2,
  6, 4, NA, 15, 2, 1
)

df %>% 
  pivot_longer(
    cols = unidentified:a.olrikii,
    names_to = 'species',
    values_to = 'number'
  ) %>% 
  filter(!is.na(number)) |> 
  ggplot(aes(x = species, y = number)) +
  geom_col(aes()) +
  facet_wrap(~station, scales = "free_x") +
  theme(axis.text.x = element_text(angle = 90,vjust = 0.5))

Created on 2022-02-04 by the reprex package (v2.0.1)
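For the larger table (about 40 stations and 80 species), the species columns do not have to be listed by name. Here is a minimal sketch, assuming the only non-species column is called station. The small counts table and its values are made up for illustration. Writing cols = -station pivots every other column, and the filter also drops explicit zeros as well as NA values.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for the real table: one row per station,
# one column per species (values invented for illustration).
counts <- tribble(
  ~station, ~a.atlanticus, ~a.glacialis, ~a.medius,
  1, 3, 0, NA,
  2, NA, 8, 3,
  3, 4, NA, 0
)

long_counts <- counts %>%
  # "-station" means "every column except station"
  pivot_longer(cols = -station, names_to = "species", values_to = "number") %>%
  # keep only species that were actually observed at a station
  filter(!is.na(number), number > 0)

species_plot <- ggplot(long_counts, aes(x = species, y = number)) +
  geom_col() +
  facet_wrap(~station, scales = "free_x") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
```

With scales = "free_x", each panel keeps only the species that remain for that station after filtering.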

Hersh · February 5, 2022, 12:29pm

Hi, Thank you very much for your answer and your time.

Unfortunately, this gives me an error message:
"Error: data must be a data frame, or other object coercible by fortify(), not an S3 object with class mts/ts.
In addition: Warning message:
In is.na(number): is.na() applied to non-(list or vector) of type 'closure'"

Here is my code:
Adult %>%
pivot_longer(
cols = A.atlanticus:T.sp.,
names_to = "species",
values_to = "number"
) %>%
filter(!is.na(number)) +
ggplot(aes(x = species, y = number)) +
geom_col(aes()) +
facet_wrap(~Station, scales = "free_x") +
theme(axis.text.x = element_text(angle=90,vjust=0.5))

Adult is my dataset which I imported into R like this:
Adult <- read.csv("Station-AdultFish/station-AdulFish-sp.csv", header=TRUE, sep=";")

I also replaced "|> " with the pipe %>% because it gave me an error for "|> "

I also have this warning message in R:
"Did you accidentally pass aes() to the data argument?
In addition: Warning message:
In is.na(number):
is.na() applied to non-(list or vector) of type 'closure'"

It's weird because last night I tested @dvetsch75's code and it worked..

Thank you very much and sorry for all these questions
Hersh

dvetsch75 · February 5, 2022, 1:58pm

Looks like you moved from %>% to + one step too early in your pipeline. Between filter and ggplot, I think you wanted %>% instead.

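And for what it's worth, @FJCC is using the native pipe |> provided by newer versions of base R (R 4.1.0 and later; it is also closer to what other programming languages, like Julia, use), whereas I'm still in the habit of using magrittr's %>%. For a simple pipeline like this one they behave the same way, although they are not identical in every detail. If |> gave you an error, your R version may be older than 4.1.0.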
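To make the fix concrete, here is a sketch of the corrected pipeline. The Adult table below is a made-up stand-in with only two of the species columns; the real data would come from read.csv() as in the post above. dplyr also needs to be loaded so that filter() is dplyr's version rather than stats::filter(), which works on time series. That might be related to the mts/ts part of the error message, though that is only a guess.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up stand-in for the imported data (column names taken from the post,
# values invented for illustration).
Adult <- data.frame(
  Station = c("C004", "C010", "C012"),
  A.atlanticus = c(3, NA, 4),
  T.sp. = c(NA, 2, 1)
)

adult_plot <- Adult %>%
  pivot_longer(
    cols = A.atlanticus:T.sp.,
    names_to = "species",
    values_to = "number"
  ) %>%
  filter(!is.na(number)) %>%               # %>% here: data is still being passed along
  ggplot(aes(x = species, y = number)) +   # + only after ggplot() has been called
  geom_col() +
  facet_wrap(~Station, scales = "free_x") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
```

The rule of thumb: use a pipe between data-manipulation steps, and + between ggplot2 layers.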

Hersh · February 5, 2022, 10:01pm

Thanks a lot for your answer!
Indeed, I had made that small error, thank you very much!
However, it still doesn't seem to work. I don't know if it's just because there are too many panels at once, but I don't see any bars in the bar plot at all:

this is only part of the graph

It is accompanied by this error message: "Error in diff.default(xscale): VECTOR_ELT() can only be applied to a 'list', not a 'character'"
How would you display only a few stations? Is there an extra line to add, or a selection to make before plotting, to display just the stations I want?

Thanks

dvetsch75 · February 6, 2022, 1:23am

How many stations do you have? If you have more than about 7-10, you may want to think about using ggsave to save your plot, and make sure the dimensions you specify are quite large. If you use facet_wrap with a variable that has many unique values, the plot becomes squished and hard to read.

Hersh · February 7, 2022, 8:42am

Hello and thank you again for your answers!

I have about forty stations
How do you use the ggsave function?
Otherwise I can also display for example only 10 stations at the beginning, then the next 10 and so on

How can I select only certain stations from the station column?

thank you very much

dvetsch75 · February 7, 2022, 1:50pm

Here are the docs for ggsave. If you wanted to try to plot all of the stations together, you could make the width and height arguments very large.
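As a rough sketch of what that could look like: the data below are invented (40 stations, 2 species each), and the file name and dimensions are only placeholders to adjust.

```r
library(ggplot2)

# Invented data: 40 stations with 2 species each.
demo <- data.frame(
  station = rep(1:40, each = 2),
  species = rep(c("sp_a", "sp_b"), times = 40),
  number = rep(c(3, 5), times = 40)
)

big_plot <- ggplot(demo, aes(x = species, y = number)) +
  geom_col() +
  facet_wrap(~station)

# Write to a temporary file; a generous canvas gives each panel some room.
out_file <- file.path(tempdir(), "stations.png")
ggsave(out_file, plot = big_plot, width = 40, height = 30, units = "in", dpi = 72)
```

By default ggsave refuses dimensions above 50 inches; passing limitsize = FALSE lifts that check if you really need a bigger image.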

If you wanted to subset your data to 10 stations at a time, you could either:

Explicitly name the stations you want to see each time, by saying

df  %>% 
   filter(station %in% c('C004', 'C010', ...))

Or, if you don't care which 10 you see at a time, only that you see 10, you could do something like:

df %>%
    filter(row_number() <= 10)

Assuming you have 40 stations exactly, and you want to facet by 10 at a time, this is probably how I would do it:

list_of_dataframes <- lapply(
    1:4,
    function(x) {
        start_row_num <- ((x - 1) * 10) + 1
        end_row_num <- x * 10
        df %>% 
            filter(
                between(row_number(), start_row_num, end_row_num)
            )
    }
)

This gives you a list of dataframes, where list_of_dataframes[[1]] is the first 10 stations, list_of_dataframes[[2]] is the next 10, etc.
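One caveat: row_number() counts rows, so this gives ten stations per chunk only when there is exactly one row per station (the wide layout). Here is a sketch of an alternative that groups by station value instead, assuming a column named station. The data and the names long_df, station_ids, id_chunks and dfs_by_chunk are illustrative only.

```r
library(dplyr)

# Invented long-format data: 40 stations with 3 rows each.
long_df <- data.frame(
  station = rep(sprintf("S%02d", 1:40), each = 3),
  species = rep(c("sp_a", "sp_b", "sp_c"), times = 40),
  number = 1
)

station_ids <- unique(long_df$station)

# Assign station IDs to consecutive groups of 10 (group 1, 1, ..., 2, 2, ...).
id_chunks <- split(station_ids, ceiling(seq_along(station_ids) / 10))

# One data frame per group of stations, keeping all rows of each station.
dfs_by_chunk <- lapply(
  id_chunks,
  function(ids) filter(long_df, station %in% ids)
)
```

This also copes with a station count that is not a multiple of 10; the last chunk is simply shorter.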

Hersh · February 7, 2022, 3:01pm

Thank you very much!
I think the second option is closer to what I wanted to do.
How do you integrate it into the code? Do we run this line before launching the code for the histogram, or is it a line to add inside the code for the histogram?

Thank you

dvetsch75 · February 7, 2022, 3:28pm

I would keep working with lists. So if you ran the above code and have list_of_dataframes, and you then want to plot it, you would do something like this:

list_of_plots <- lapply(
    list_of_dataframes,
    function(df) {
        # df is each dataframe in the list of dataframes
        df %>% 
        pivot_longer(
            cols = unidentified:a.olrikii,
            names_to = 'species',
            values_to = 'number'
        ) %>% 
        ggplot(aes(x = species, y = number)) +
        geom_col(aes(fill = species)) +
        facet_wrap(~station)
    } # End of function
) # End of lapply

I would really recommend doing some reading about the apply family of functions (or read the docs for purrr). Functions like lapply and purrr::map really simplify processes like this and can help avoid ugly loops.
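For comparison, here is a sketch of the same idea with purrr::map(), plus purrr::iwalk() to save one file per plot. The two small data frames and the file names are invented for illustration.

```r
library(purrr)
library(ggplot2)

# Invented list of already-long data frames, one per group of stations.
dfs <- list(
  first = data.frame(
    station = c(1, 1, 2),
    species = c("sp_a", "sp_b", "sp_a"),
    number = c(3, 1, 4)
  ),
  second = data.frame(
    station = c(3, 4),
    species = c("sp_b", "sp_a"),
    number = c(2, 6)
  )
)

# map() plays the role of lapply(): one plot per data frame.
plots <- map(dfs, function(d) {
  ggplot(d, aes(x = species, y = number)) +
    geom_col(aes(fill = species)) +
    facet_wrap(~station)
})

# iwalk() passes each plot together with its list name, used here in the file name.
out_dir <- tempdir()
iwalk(plots, function(p, name) {
  ggsave(file.path(out_dir, paste0("stations_", name, ".png")),
         plot = p, width = 8, height = 6)
})
```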

Hersh · February 11, 2022, 8:44am

Hi @dvetsch75 ,
First of all thank you again for your previous answers which helped me a lot.
I'm reviving this topic because the organization of my dataset has changed.
My previous dataset was organized online, I transformed it because I thought it would be easier to use.
But I realize that it would be easier to start directly from the basic dataset, so I save time and limit errors.
My basic dataset is organized in columns :

As you see, all species are in one column, number of individuals all in one column.
I would like to be able to make the same graphs as the others.
The problem here is that instead of selecting a column by name, I have to select the values that interest me within one column, while the counts associated with them are in another column.

I'm struggling to explain so I hope you have roughly understood how my dataset is organized.
Once again here I present to you a "summary" of the dataset because it is much too long.
But for example, I would first like to select only Zone A1, where I represent the proportion of Types
Then still in A1, I would like to select only Adults in "Type" and represent the distribution of Species.
Then I would like to see Adult and Larvae represented at the same time, etc.

This is just to show you how I would like to manipulate my data.

thank you very much for your time
Hersh

nirgrahamuk · February 11, 2022, 10:07am

Hello,
I'm sure you shared this image with the best intentions, but perhaps you didn't realise what it implies.
If someone wished to use example data to test code against, they would type it out from your screenshot...

This is very unlikely to happen, and so it reduces the likelihood you will receive the help you desire.
Therefore please see this guide on how to reprex data. Key to this is use of either datapasta, or dput() to share your data as code

FAQ: How to do a minimal reproducible example (reprex) for beginners (Guides & FAQs)

A minimal reproducible example consists of the following items:
A minimal dataset, necessary to reproduce the issue
The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages.

Let's quickly go over each one of these with examples:

Minimal Dataset (Sample Data)
You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame

head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.…
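As a small illustration of the dput() route, here is a sketch with an invented data frame standing in for a few rows of the real table (sample_df and rebuilt are just example names).

```r
# Invented example data standing in for a few rows of the real table.
sample_df <- data.frame(
  zone = c("A1", "A1", "A2"),
  type = c("Adult", "Larvae", "Adult"),
  number = c(2, 5, 7)
)

# dput() prints R code that recreates the object; that printed code is
# what gets pasted into a post instead of a screenshot.
dput(head(sample_df, 3))

# Round trip: evaluating the printed code gives the same data back.
rebuilt <- eval(parse(text = capture.output(dput(sample_df))))
```

Anyone reading the post can then assign the pasted structure(...) call to a name and run code against it directly.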

dvetsch75 · February 14, 2022, 7:09pm

I think you should be using multiple plots involving facet_*. Try and organize your thoughts around what data you are trying to present in each collection of plots, and keep in mind that sometimes it's better to write longer, more boring, and more explicit code.

Here's how I would do what you described:

library(dplyr)
library(ggplot2)
df <- tribble(
    ~zone, ~type, ~station, ~species, ~number,
    'A1', 'Adult', 1, 'Atlanticus', 2,
    'A1', 'Adult', 1, 'Olrikii', 1,
    'A1', 'Larvae', 2, 'Medius', 5,
    'A2', 'Larvae', 1, 'Glacialis', 7,
    'A2', 'Larvae', 2, 'Unidentified', 3, 
    'A2', 'Adult', 2, 'Glacialis', 2, 
    'A2', 'Larvae', 2, 'Medius', 4, 
    'A3', 'Zoo', 1, 'Capilatta', 17,
    'A3', 'Adult', 3, 'Olrikii', 1
)

# Proportion of types per zone
df %>% 
    group_by(zone, type) %>% 
    summarize(
        zone_type_sum = sum(number),
        .groups = 'drop_last' # You can skip this, it is the default behavior, but it helps make your code more transparent in what it is doing
    ) %>% 
    mutate(
        zone_sum = sum(zone_type_sum),
        proportion = zone_type_sum / zone_sum
    ) %>% 
    ggplot(aes(x = type, y = proportion)) + 
    geom_col() + 
    facet_wrap(~zone)

# Distribution of Species for adults 

df %>% 
    filter(type == 'Adult') %>% 
    group_by(zone, species) %>% 
    summarize(
        num_per_species = sum(number)
    ) %>% 
    mutate(
        num_per_zone = sum(num_per_species),
        proportion = num_per_species / num_per_zone
    ) %>% 
    ggplot(aes(x = species, y = proportion)) + 
    geom_col(aes(fill = species)) +
    facet_wrap(~zone)

# And you can follow a similar approach as needed

There are certainly ways to write more concise code than the above, but I think the above forces you to think about your analysis in a more cogent way, i.e. each time you want to do something, you have to explicitly specify the following:

What is my dataset?
What relationship am I trying to communicate?
What steps do I need to take to transform the dataset to include those relationships, if they are not there already?
How do I plot my data?
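As one more sketch along the same lines, here is how Adults and Larvae could be shown side by side for zone A1 only. The obs table is invented and uses the same column layout as the example above; a1_counts and a1_plot are just illustrative names.

```r
library(dplyr)
library(ggplot2)

# Invented observations, same column layout as the example above.
obs <- tribble(
  ~zone, ~type, ~station, ~species, ~number,
  'A1', 'Adult', 1, 'Atlanticus', 2,
  'A1', 'Adult', 1, 'Olrikii', 1,
  'A1', 'Larvae', 2, 'Medius', 5,
  'A1', 'Larvae', 2, 'Olrikii', 3,
  'A2', 'Larvae', 1, 'Glacialis', 7
)

# Keep one zone and the two types of interest, then total counts per type and species.
a1_counts <- obs %>%
  filter(zone == 'A1', type %in% c('Adult', 'Larvae')) %>%
  group_by(type, species) %>%
  summarize(number = sum(number), .groups = 'drop')

# Dodged bars put Adult and Larvae next to each other for each species.
a1_plot <- ggplot(a1_counts, aes(x = species, y = number, fill = type)) +
  geom_col(position = 'dodge')
```

Because the data are already in long format, selecting a zone or a type is just a filter() on the relevant column; no pivoting is needed.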

system · March 7, 2022, 7:09pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.