grouping data in pivot_longer function

Hi, I have flight delay dataset with different airlines in 1 columns and different delay resons columns as shown below

I want to create bar charts by different delay reson and by carrier airlines to compare the delay reasons and hence need to group by them .
I am getting total delay time not grouped by airline and reason.

I am using pivot_longer function and below is my code:
'''
df2 %>%
pivot_longer(cols=contains("DELAY"),
names_to='REASONS',
values_to='DELAYS') %>%
summarise(DELAYS = sum(DELAYS))%>%
group_by('REASONS','OP_UNIQUE_CARRIER')
'''

What am I doing wrong and how to fix it.

Thanks,
Blyzz

The group_by() function should be before the summarise() function. Also, you do not need to have quotation marks around the column names in the group_by() function.

It should be something like below
Thanks!

df2 %>%
     pivot_longer(cols = contains("DELAY"), names_to = "REASONS", values_to = "DELAYS") %>%
    group_by(UNIQUE_CARRIER, REASONS) %>%
   summarise(DELAYS = sum(DELAYS))

As the first two replies have pointed out, the group_by function will need to go before the summarize function, but I wanted to provide a little context as to why this is. First, after you use pivot_longer, this is what your table will look like this:

df %>%
  pivot_longer(cols = contains("DELAY"), names_to = "REASONS", values_to = "DELAYS")

# A tibble: 20 x 3
   unique_carrier REASONS             DELAYS
   <chr>          <chr>                <dbl>
 1 UA             carrier_delay            7
 2 UA             weather_delay            0
 3 UA             nas_delay               12
 4 UA             security_delay           0
 5 UA             late_aircraft_delay      0
 6 UA             carrier_delay            0
 7 UA             weather_delay            0
 8 UA             nas_delay               15
 9 UA             security_delay           0
10 UA             late_aircraft_delay      0
11 UA             carrier_delay            0
12 UA             weather_delay            0
13 UA             nas_delay               74
14 UA             security_delay           0
15 UA             late_aircraft_delay      0
16 UA             carrier_delay            0
17 UA             weather_delay            0
18 UA             nas_delay                8
19 UA             security_delay           0
20 UA             late_aircraft_delay     53

If we went ahead and used summarize right after this, we would only receive the total of delays as you said, not per airline or delay reason. This is because when the summarize() command is being used on the above table, it is creating a column called DELAYS based on the sum of the DELAYS column. The table is not being grouped by anything, so it looks at the full column. Also, there is only one column here so there is nothing to group by.

df %>%
  pivot_longer(cols = contains("DELAY"), names_to = "REASONS", values_to = "DELAYS") %>%
  summarise(DELAYS = sum(DELAYS))
# A tibble: 1 x 1
  DELAYS
   <dbl>
1    169

To achieve the result you want, we need to first group by the columns we want and then summarize within those groups. When we group the table, under the hood the columns that are not in the group_by() statement are broken up into chunks corresponding to the groups. In this case, the columns that have similar carriers and reasons will be together. Then, when summarize() is used, it will then calculate the specified operation, here calculating the sum of the delays, within each group that was created under the hood by R.

df %>%
  pivot_longer(cols = contains("DELAY"), names_to = "REASONS", values_to = "DELAYS") %>%
  group_by(unique_carrier, REASONS) %>%
  summarise(DELAYS = sum(DELAYS))
# A tibble: 5 x 3
# Groups:   unique_carrier [1]
  unique_carrier REASONS             DELAYS
  <chr>          <chr>                <dbl>
1 UA             carrier_delay            7
2 UA             late_aircraft_delay     53
3 UA             nas_delay              109
4 UA             security_delay           0
5 UA             weather_delay            0
1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.