Summarise Function

Iman_456 · March 30, 2024, 7:43am

Hey everyone, I'm a beginner in R and am using the "starwars" dataset. Are there other efficient ways to add the sum of the height column to the sum of the mass column besides the one below? This was the easiest and most efficient way I could think of. Thank you.

My code:

starwars %>%
  drop_na(height, mass) %>%
  summarise(height = sum(height),
            mass = sum(mass),
            height_mass = height + mass) %>%
  View()

Result:

[Screenshot of the result table; image not included]

dromano · March 30, 2024, 1:37pm

It depends — why do you want to combine them?

Iman_456 · March 30, 2024, 7:42pm

There is no particular "goal." I was just working with the summarise, separate, and unite functions and with the idea of combining two columns. So I was wondering whether there are other efficient ways to add the totals of the two columns together, other than the way I did it. I think my way is pretty efficient, but it's always nice to know other methods.
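One other dplyr spelling, offered only as a sketch: across() applies sum() to both columns in a single call, and mutate() then adds the two totals. It assumes dplyr and tidyr are installed; the object name totals is just for illustration.

```r
library(dplyr)
library(tidyr)

# Sum both columns in one across() call, then add the two totals
totals <- starwars %>%
  drop_na(height, mass) %>%
  summarise(across(c(height, mass), sum)) %>%
  mutate(height_mass = height + mass)

totals
```

This is not faster than the original; it mainly saves typing when many columns need the same summary.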

jrkrideau · March 31, 2024, 2:44am

A completely different way of doing the same thing, using the {data.table} package.

suppressMessages(library(data.table))
suppressMessages(library(tidyverse))
DT <- as.data.table(starwars)

DT2 <- na.omit(DT, c("height", "mass"))
DT3 <- DT2[, .(xx = sum(height),
               yy = sum(mass))][, zz := xx + yy]

DT3

I don't know if we can call it more efficient here, but {data.table} can be more efficient in many cases.

nedallo · April 1, 2024, 1:06am

Hi,
you can use the mutate function from dplyr if you like, as follows:

library(dplyr)

df <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(5, 6, 7, 8)
)

You can then use mutate to create a new column containing the row-by-row sum of columns x and y:

df <- df %>% mutate(sum_xy = x + y)
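To connect this with the original question, here is a small sketch of how mutate() and summarise() differ: mutate() keeps every row and adds a per-row sum, while summarise() collapses the data to a single total. The names row_sums and total are made up for the example.

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 3, 4), y = c(5, 6, 7, 8))

# mutate() keeps all four rows and adds a per-row sum
row_sums <- df %>% mutate(sum_xy = x + y)

# summarise() collapses to one row; totalling the per-row sums
# gives the same number as sum(x) + sum(y)
total <- row_sums %>% summarise(total_xy = sum(sum_xy))

total$total_xy == sum(df$x) + sum(df$y)
```

With no missing values the two routes agree, so which one to use depends on whether the per-row sums are needed as well.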

Hope this helps.

Kind Regards

gbravo · April 3, 2024, 7:40am

It is unclear what you mean by efficient: from a coding perspective or a computational one? Anyway, what you did looks pretty efficient to me (from both perspectives). Of course, you could do it in one step if you are not interested in the separate sums of height and mass as well:

starwars %>%
  drop_na(height, mass) %>%
  summarise(
    height_mass = sum(height) + sum(mass)
  ) %>%
  print()

jrkrideau · April 3, 2024, 11:24am

Both actually. I don't work with very large data sets so I, personally, don't really benefit all that much on the computational side but there are reports of very impressive gains in processing time in some but not all functions versus base R or {dplyr}. Reading and writing a .csv file with fread() & fwrite() gives impressive time savings over read.csv & write.csv for example.

{data.table} is also reported to be considerably more memory efficient in some operations.
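For anyone who wants to try the fread()/fwrite() comparison on their own machine, a rough sketch follows. The data are random numbers generated only for the test, the row count is arbitrary, and the timings will vary by system, so none are quoted here.

```r
library(data.table)

# Hypothetical test data: half a million rows of random numbers
n <- 5e5
dat <- data.frame(id = seq_len(n), value = rnorm(n))

csv_base <- tempfile(fileext = ".csv")
csv_dt   <- tempfile(fileext = ".csv")

# Writing: base R versus data.table
system.time(write.csv(dat, csv_base, row.names = FALSE))
system.time(fwrite(dat, csv_dt))

# Reading: base R versus data.table
system.time(from_base <- read.csv(csv_base))
system.time(from_dt   <- fread(csv_dt))
```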

From a coding perspective, data.table syntax is often less verbose than base R or dplyr. Here is a very simple example.

dat1$gsq <- dat1$gini^2 # base

dat1[, gsq := gini^2] # data.table

The character count difference is not huge in this case but noticeable. It can be quite impressive in more complicated statements.

One thing I have learned is that there can be severe conflicts between {data.table} and {tidyverse}, and I am a great fan of many parts of {tidyverse}, especially {lubridate} and {ggplot2}.

One should always load {data.table} before {tidyverse}.
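As a sketch of what such a conflict looks like in practice: both {data.table} and {dplyr} export first(), last() and between(), and {lubridate} shares names such as month() and year() with {data.table}. Whichever package is attached last masks the other, so naming the package explicitly removes any doubt. The vector x is just an illustration.

```r
suppressMessages(library(data.table))
suppressMessages(library(tidyverse))

x <- c(3, 1, 2)

# Naming the package makes it explicit which first() is called,
# regardless of load order
data.table::first(x)
dplyr::first(x)

# List every name on the search path that is defined more than once
conflicts(detail = TRUE)
```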

jrkrideau · April 3, 2024, 11:43am

You are right. Apparently, I was thinking it was sensible to preserve the height and mass sums though I don't see why I was thinking that.

gbravo · April 3, 2024, 12:08pm

I sometimes work with large datasets but have never tried anything beyond the standard vectorization done by base R (and inherited by the tidyverse) for such simple operations. Of course, there are many ways to parallelize the computation if you need to save time. On the other hand, you're right in pointing out that data.table::fread() is much quicker than read.csv() or readr::read_csv(), although I prefer to use the Parquet format from arrow when dealing with large datasets.

dromano · April 3, 2024, 12:29pm

Out of curiosity, what does that mean?

gbravo · April 3, 2024, 1:15pm

Arrow is a collection of tools to handle large datasets, using the Parquet format to store data on the disk. You can find more here
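A minimal sketch of a Parquet round trip, assuming the {arrow} package is installed. The file is written to a temporary path, and only three atomic columns of starwars are kept to keep the example simple (the dataset also contains list columns).

```r
library(arrow)
library(dplyr)

# Keep only a few atomic columns for simplicity
sw <- starwars %>% select(name, height, mass)

# Write to a temporary Parquet file and read it back
path <- tempfile(fileext = ".parquet")
write_parquet(sw, path)
sw_back <- read_parquet(path)

# The same summary as in the original question, on the data read back
sw_back %>%
  filter(!is.na(height), !is.na(mass)) %>%
  summarise(height_mass = sum(height) + sum(mass))
```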

jrkrideau · April 3, 2024, 1:23pm

Just for fun, here are what I think are near-equivalent versions of the code for the original question with the starwars data. Note that I am not timing the conversion of starwars to data.table format, as I am assuming that would be the standard working format.

suppressMessages(library(data.table))
suppressMessages(library(tidyverse))
library(tictoc)

tic()
starwars %>%
  drop_na(height, mass) %>%
  summarise(
    height_mass = sum(height) + sum(mass)
  ) %>%
  print()
toc()   


DT <-  as.data.table(starwars)

tic()
DT2 <- na.omit(DT, c("height", "mass"))  # keep the result in DT2, which the next line uses
DT2[, sum(height) + sum(mass)]
toc()
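A single tic()/toc() run on a dataset this small mostly measures noise. One way to get steadier numbers, sketched here on the assumption that the {microbenchmark} package is installed, is to repeat each expression many times; the labels dplyr and data_table and the object name timings are arbitrary.

```r
suppressMessages(library(data.table))
suppressMessages(library(tidyverse))
library(microbenchmark)

DT <- as.data.table(starwars)

# Run each version 100 times and summarise the distribution of timings
timings <- microbenchmark(
  dplyr = starwars %>%
    drop_na(height, mass) %>%
    summarise(height_mass = sum(height) + sum(mass)),
  data_table = na.omit(DT, c("height", "mass"))[, sum(height) + sum(mass)],
  times = 100
)

timings
```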

dromano · April 3, 2024, 1:53pm

Thanks for the link, @gbravo!
