Need help to create advanced plot with group_by

zorro · July 12, 2019, 11:17am

Hi

I have the following data frame as a sample

VAPublicIP2 = data.frame(Session = c("Apr2016", "Feb2017", "Jan2016", "Mar2018", "Mar2017", "Dec2016", "May2018", "Nov2018", "Oct2016", "Sep2017") , 
                         Plugin = c( "1234567", "7353565", "5553565", "7353565", "5553565", "7353565", "5553565", "5553565", "1234567", "5553565"))

VAPublicIP2
   Session  Plugin
1  Apr2016 1234567
2  Feb2017 7353565
3  Jan2016 5553565
4  Mar2018 7353565
5  Mar2017 5553565
6  Dec2016 7353565
7  May2018 5553565
8  Nov2018 5553565
9  Oct2016 1234567
10 Sep2017 5553565

My goal is to plot a graph with year on the X axis, the number of top 10 Plugin on the Y axis, and Plugin ID shown by color in the plot. I am thinking the code should look something like this:

ggplot(data = VAPublicIP2) + 
  geom_point(mapping = aes(x = *SessionYear*, y = *Number of Top 10 Plugin*, color = Plugin))

As you can see, the values for X and Y must be derived from the sample data, and this is what I am able to do so far:

To get one SessionYear and to count plugin
VAPublicIP2[grepl("2017",VAPublicIP2$Session, ignore.case = FALSE),] %>% count(Plugin)

but then... how do I get this for every year and combine them? I know group_by may help but I can't seem to fit it into this code.

The X axis should be in increasing order of Session Year like this: 2016 2017 2018. So for 2016 it is actually combining Apr2016, Jan2016, Dec2016 and Oct2016. The same applies to other Sessions.

The Y axis for 2016 will have the number of Plugins for 1234567, 5553565, and 7353565 (all in 2016). The same goes for the other Session Years.

The plot should later have a line connecting the same Plugin across Session Years.

Please help. Thanks.

ron · July 12, 2019, 12:08pm

Hi,

Something like:

library(tidyverse)

VAPublicIP2 = data.frame(Session = c("Apr2016", "Feb2017", "Jan2016", "Mar2018", "Mar2017", "Dec2016", "May2018", "Nov2018", "Oct2016", "Sep2017") , 
                         Plugin = c( "1234567", "7353565", "5553565", "7353565", "5553565", "7353565", "5553565", "5553565", "1234567", "5553565"))

DF <- VAPublicIP2 %>%
    mutate(year = substring(Session, first = 4)) %>%
    group_by(year,Plugin) %>%
    summarise(n=n())

ggplot(DF, aes(x = year, y = n, colour = Plugin)) + geom_point() + geom_line()

Note, untested code. You may want to convert year to integer and ungroup DF, depending on whether you will re-use it for other purposes and whether there are any missing years.
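
A minimal sketch of that note, assuming the sample data from the question: year is converted to integer so that ggplot treats the x axis as continuous (which should also let geom_line join the points of each Plugin), and the summary is ungrouped before plotting. The scale_x_continuous(breaks = ...) call is my own addition to keep the axis labels at whole years.

```r
library(dplyr)
library(ggplot2)

VAPublicIP2 <- data.frame(
  Session = c("Apr2016", "Feb2017", "Jan2016", "Mar2018", "Mar2017",
              "Dec2016", "May2018", "Nov2018", "Oct2016", "Sep2017"),
  Plugin  = c("1234567", "7353565", "5553565", "7353565", "5553565",
              "7353565", "5553565", "5553565", "1234567", "5553565")
)

DF <- VAPublicIP2 %>%
  # in "Apr2016"-style values the year starts at character 4
  mutate(year = as.integer(substring(Session, first = 4))) %>%
  group_by(year, Plugin) %>%
  summarise(n = n()) %>%
  # drop the leftover grouping so DF behaves like a plain data frame later on
  ungroup()

p <- ggplot(DF, aes(x = year, y = n, colour = Plugin)) +
  geom_point() +
  geom_line() +
  # assumption: one axis break per year present in the data
  scale_x_continuous(breaks = sort(unique(DF$year)))
p
```

If a year is missing for some Plugin, the line simply skips from the previous year to the next one; whether that is what you want depends on the real data.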

Ron.

zorro · July 15, 2019, 8:06am

Hi Ron,

Thanks for the code, it works.

The challenge now is with the real data, where the Session values are not that uniform. It could be April2016, October2018 or even Apr2016-v2. How can I achieve the same result?

The other thing is that the real data also has other columns. There is a column "Risk" which I want to use in facet_wrap, like this: facet_wrap(~ Risk, nrow = 2). However, I get an error:

Error: At least one layer must contain all faceting variables: `Risk`.
* Plot is missing `Risk`
* Layer 1 is missing `Risk`
* Layer 2 is missing `Risk`

I believe the reason is that "Risk" is no longer in DF after summarise.
How do I do the facet_wrap with "Risk"?

ron · July 15, 2019, 9:23am

Hi @zorro,

Here's an updated example.

library(tidyverse)

VAPublicIP2 = data.frame(Session = c("Apr2016-v2", "Feb_2017", "123_Jan2016", "March2018", "March 2017", "Dec2016 2222", "May2018-v2", "November 2018", "Oct2016", "Sep2017") , 
                        Plugin = c( "1234567", "7353565", "5553565", "7353565", "5553565", "7353565", "5553565", "5553565", "1234567", "5553565"),
                        Risk = rep(c("high", "low"), times = 5))

DF <- VAPublicIP2 %>%
   mutate(year = sub(".*(\\d{4}).*", "\\1", Session)) %>%
   group_by(Risk, year,Plugin) %>%
   summarise(n=n())

ggplot(DF, aes(x = year, y = n, colour = Plugin)) + geom_point() + geom_line() + facet_wrap(~ Risk, nrow = 2)
#> geom_path: Each group consists of only one observation. Do you need to
#> adjust the group aesthetic?
#> geom_path: Each group consists of only one observation. Do you need to
#> adjust the group aesthetic?

Created on 2019-07-15 by the reprex package (v0.2.1)

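The warnings you can see in my reprex are most likely not about the small sample: year is a character column, so every year/Plugin combination ends up in its own group and geom_line has nothing to connect. Setting group = Plugin in aes(), or converting year to integer, should make them go away.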

Adding Risk to the group_by carries it through into DF and allows you to facet by it.
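
To tie those two points together, here is a sketch of the same faceted plot with an explicit group aesthetic. With a discrete x axis ggplot2 otherwise treats each year/Plugin pair as a separate group, so group = Plugin is what lets geom_line join the points within each facet. Two things here are my own assumptions rather than part of the reprex: the stricter 20xx-only pattern (so that "Dec2016 2222" lands in 2016) and the ungroup() at the end.

```r
library(dplyr)
library(ggplot2)

VAPublicIP2 <- data.frame(
  Session = c("Apr2016-v2", "Feb_2017", "123_Jan2016", "March2018", "March 2017",
              "Dec2016 2222", "May2018-v2", "November 2018", "Oct2016", "Sep2017"),
  Plugin  = c("1234567", "7353565", "5553565", "7353565", "5553565",
              "7353565", "5553565", "5553565", "1234567", "5553565"),
  Risk    = rep(c("high", "low"), times = 5)
)

DF <- VAPublicIP2 %>%
  # assumption: all years are 20xx, so only those are captured
  mutate(year = sub(".*(20\\d{2}).*", "\\1", Session)) %>%
  # Risk has to be a grouping variable to survive summarise()
  group_by(Risk, year, Plugin) %>%
  summarise(n = n()) %>%
  ungroup()

# group = Plugin: one line per Plugin inside each Risk facet
p <- ggplot(DF, aes(x = year, y = n, colour = Plugin, group = Plugin)) +
  geom_point() +
  geom_line() +
  facet_wrap(~ Risk, nrow = 2)
p
```

A Plugin that shows up in only one year of a facet still gets a point but no line, which is expected.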

The sub command uses a regular expression (regex). The pattern matches a four-digit sequence \\d{4} surrounded by sequences of any characters. Because the leading .* is greedy, it is the last four-digit sequence in the string that gets captured, not the first. Wrapping \\d{4} in parentheses, (\\d{4}), lets you refer to it in the replacement argument of sub using \\1 (1 since it is the first thing enclosed in parentheses in your regex; in other situations you may have more than one set of parentheses in a regex).

So this always picks the last instance of a four-digit sequence in your Session variable. If you had 1234-April2014, it would grab 2014, but for April2014-1234 it would grab 1234 (and "Dec2016 2222" in my example ends up as 2222). It could be made slightly more specific, e.g. if you know that all years are after 1999, you could use ".*(20\\d{2}).*" as your regex. Or ".*((19|20)\\d{2}).*" would limit it to 19xx or 20xx digit strings.
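
A quick way to see what a pattern really captures is to run it on a few awkward values. The sketch below compares the sub() call with stringr::str_extract(), which returns the first match of a pattern (and NA when there is none). The vector name sessions and the three test strings are just for illustration.

```r
library(stringr)

# hypothetical awkward Session values
sessions <- c("Dec2016 2222", "123_Jan2016", "1234-April2014")

# sub() with a greedy leading .* keeps the LAST four-digit run
with_sub <- sub(".*(\\d{4}).*", "\\1", sessions)
with_sub
#> [1] "2222" "2016" "2014"

# str_extract() keeps the FIRST four-digit run
first_run <- str_extract(sessions, "\\d{4}")
first_run
#> [1] "2016" "2016" "1234"

# restricting the pattern to 19xx/20xx makes the order matter less
first_year <- str_extract(sessions, "(19|20)\\d{2}")
first_year
#> [1] "2016" "2016" "2014"
```

Since str_extract() gives NA for strings without a match, unmatched Session values are easy to spot, whereas sub() hands back the original string unchanged in that case.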

Hope this helps.

Ron.

system · August 5, 2019, 9:23am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.