Summarise Function

Hey everyone, beginner to R here. I am using the "starwars" dataset. Are there any other efficient ways to combine the sum of the height column with the sum of the mass column besides the one below? This was the easiest and most efficient way I could think of. Thank you.

My code:

library(tidyverse)

starwars %>%
  drop_na(height, mass) %>%   # drop rows where height or mass is NA
  summarise(height = sum(height),
            mass = sum(mass),
            height_mass = height + mass) %>%
  View()

Result:

[Screenshot: the resulting one-row table with columns height, mass, and height_mass]

It depends — why do you want to combine them?


There is no particular "goal." I was just working with the summarise, separate, and unite functions, and with the concept of combining two columns. So I was wondering if there were any other efficient ways to add the totals of the two columns together other than the way I did it, which I think is pretty efficient, but it's always nice to know other methods.

A completely different way of doing the same thing, using the {data.table} package:

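suppressMessages(library(data.table))
suppressMessages(library(tidyverse))  # provides the starwars data (via dplyr)

DT <- as.data.table(starwars)

# Drop rows where height or mass is NA
DT2 <- na.omit(DT, cols = c("height", "mass"))

# Sum each column, then add a column holding the combined total
DT3 <- DT2[, .(xx = sum(height),
               yy = sum(mass))][, zz := xx + yy]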

DT3

I don't know if we can call it more efficient here, but {data.table} can be more efficient in many cases.

Hi,
you can use the mutate function in dplyr if you want, as follows:

library(dplyr)

df <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(5, 6, 7, 8)
)

You can then use mutate to create a new column containing the row-wise sum of columns x and y:

df <- df %>% mutate(sum_xy = x + y)


Hope this helps.

Kind Regards

It is unclear what you mean by efficient: from a coding perspective or a computational one? Anyway, what you did looks pretty efficient to me (from both perspectives). Of course, you could have done it in one step if you are not interested in the separate sums of height and mass as well:

starwars %>%
  drop_na(height, mass) %>%
  summarise(
    height_mass = sum(height) + sum(mass)
  ) %>%
  print()
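
One thing worth checking when trying other methods: drop_na(height, mass) removes a row if either column is missing, which is not the same as calling sum(..., na.rm = TRUE) on each column separately. Here is a minimal base R sketch of the difference. The toy data frame and its values are made up for illustration and merely stand in for starwars.

```r
# Hypothetical toy data standing in for starwars (values are made up)
toy <- data.frame(
  height = c(100, NA, 150, 200),
  mass   = c(10, 20, NA, 40)
)

# Same logic as drop_na(height, mass): keep only rows where both are present
complete <- toy[complete.cases(toy[, c("height", "mass")]), ]
total_complete <- sum(complete$height) + sum(complete$mass)

# Different logic: ignore NAs column by column
total_na_rm <- sum(toy$height, na.rm = TRUE) + sum(toy$mass, na.rm = TRUE)

print(total_complete)  # 350: rows 1 and 4 only
print(total_na_rm)     # 520: every non-missing value counts
```

Which one is "right" depends on whether a character with a known height but unknown mass should contribute to the total.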

Both, actually. I don't work with very large data sets, so I personally don't benefit all that much on the computational side, but there are reports of very impressive gains in processing time for some, though not all, functions versus base R or {dplyr}. Reading and writing a .csv file with fread() and fwrite() gives impressive time savings over read.csv() and write.csv(), for example.

{data.table} is also reported to be considerably more memory efficient in some operations.

From a coding perspective, data.table syntax is often less verbose than base R or dplyr. Here is a very simple example.

dat1$gsq <- dat1$gini^2 # base

dat1[, gsq := gini^2] # data.table

The character count difference is not huge in this case but noticeable. It can be quite impressive in more complicated statements.

One thing I have learned is that there can be severe conflicts between {data.table} and {tidyverse}, and I am a great fan of many parts of {tidyverse}, especially {lubridate} and {ggplot2}.

One should always load {data.table} before {tidyverse}.
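
A related option, sketched here under the assumption that both {data.table} and {lubridate} are installed: when two attached packages export a function with the same name (year() is one example), the `pkg::fun()` form makes it explicit which one is called, regardless of load order. The date used is arbitrary.

```r
d <- as.Date("2024-03-30")  # arbitrary example date

# Only run if both packages are installed; nothing is attached with library()
if (requireNamespace("data.table", quietly = TRUE) &&
    requireNamespace("lubridate", quietly = TRUE)) {
  y_dt  <- data.table::year(d)   # data.table's version
  y_lub <- lubridate::year(d)    # lubridate's version
  print(c(y_dt, y_lub))
}
```

This does not replace the load-order advice above; it is just a way to avoid relying on it in scripts where both packages are attached.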


You are right. Apparently, I was thinking it was sensible to preserve the height and mass sums though I don't see why I was thinking that.

I sometimes work with large datasets but have never tried anything beyond the standard vectorization done by base R (and inherited by the tidyverse) for such simple operations. Of course, there are many ways to parallelize the computation if you need to save time. On the other hand, you're right in pointing out that data.table::fread() is much quicker than read.csv() or readr::read_csv(), although I prefer to use the Parquet format from arrow when dealing with large datasets.

Out of curiosity, what does that mean?

Arrow is a collection of tools to handle large datasets, using the Parquet format to store data on the disk. You can find more here

Just for fun. I think I have almost the same coding equivalents for the original question with the starwars data. Note I am not including the conversion of starwars to data.table format as I am assuming it is the standard working format.

suppressMessages(library(data.table))
suppressMessages(library(tidyverse))
library(tictoc)

tic()
starwars %>%
  drop_na(height, mass) %>%
  summarise(
    height_mass = sum(height) + sum(mass)
  ) %>%
  print()
toc()   


DT <-  as.data.table(starwars)

tic()
DT2 <- na.omit(DT, c("height", "mass"))
DT2[, (sum(height) + sum(mass))][]
toc()
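
A caveat on timings like these: on a small dataset such as starwars, a single tic()/toc() run mostly measures fixed overhead and noise. A rough base R sketch of repeating an operation so the elapsed time becomes measurable is shown below. The data frame, its size, and the repetition count are arbitrary assumptions, not part of the comparison above.

```r
# Hypothetical larger data frame (random values, arbitrary size)
set.seed(1)
toy <- data.frame(height = runif(1e5), mass = runif(1e5))

# Repeat the operation many times and time the whole loop with base R
timing <- system.time(
  for (i in 1:100) {
    total <- sum(toy$height) + sum(toy$mass)
  }
)

print(timing[["elapsed"]])  # total seconds for 100 repetitions
```

The same pattern can wrap the dplyr and data.table versions to compare them on equal footing.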

Thanks for the link, @gbravo!
