How to perform a row-wise calculation in a tibble

junghoonshin · May 26, 2020, 6:29am

I have a tibble with m rows and n columns (all containing numeric values) and an m-length numeric vector. I want to divide the ith row of the tibble by the ith element of the vector (i ranges from 1 to m). What would be the most efficient way to do this?

nirgrahamuk · May 26, 2020, 8:45am

# iris has 150 rows and 5 columns.
# Let's make a vector with 150 entries to divide each petal length by.
# In this case I'll make it 2 * petal length,
# so we can see that the calculation really happens (every result should be 0.5).

dim(iris)
head(iris)

(myvec <- 2* iris$Petal.Length)

(iris$new_calc <- iris$Petal.Length / myvec)

junghoonshin · May 26, 2020, 8:52am

Perhaps I should have been more specific. What I want is to divide all columns in the ith row by the ith element of the vector, not just one column. Below is my example code to do this.

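library(dplyr)     # provides tibble() and mutate_all()
library(magrittr)  # provides the %<>% assignment pipe
my_data = tibble(a=1:10, b=1:10*2, c=1:10*3)
my_vector = 1:10
my_data %<>% mutate_all(~./my_vector)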

This code performs exactly what I want, but I guess it's not the most efficient way to do this when my_data is very large because the calculation is repeated along the columns. Is there any alternative way?

nirgrahamuk · May 26, 2020, 8:57am

I don't understand this comment. Either you want to perform a calculation on each cell of your original table, or you don't?

junghoonshin · May 26, 2020, 9:07am

I do. What I meant was this: If I do this,

my_data %<>% mutate_all(~./my_vector)

I think it's the same as doing this.

my_data %<>% mutate(a=a/my_vector, b=b/my_vector, c=c/my_vector, ...)

Thus if my_data has 10 thousand columns, the vectorized division calculation should also be performed 10 thousand times, which made me guess this is not the most efficient way to do this particular job.

nirgrahamuk · May 26, 2020, 9:12am

Sorry, I think we are at an impasse.
If you want 10,000 columns divided, I don't see another option than writing code that divides them all.

Unless you are hinting that you expect certain columns to be duplicates, in which case you might optimise to skip these or borrow the values from similar columns. But in the general case, no, there's no way to do all the calculations you said you wish to do without actually doing them.

However, it is the case that the tidyverse/dplyr trades away some performance for its friendly syntax.
So you could do all the divisions faster by using the data.table package, which has its own syntax that you can learn. Also, there have lately been improvements to the dtplyr package (note the extra t between the d and the p), which lets you write your instructions in dplyr syntax but uses data.table to do the calculations.

I think you would see a speed-up with the data.table back end. However, it would still be performing numerous divisions (as it seems that's what you require).
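
As a rough sketch of what the data.table route could look like (the example data just mirrors my_data from above, and the name dt is made up for illustration; I have not benchmarked this, so treat any speed-up as something to measure on your own data):

```r
library(data.table)

# Hypothetical example data, mirroring my_data from earlier in the thread
dt <- data.table(a = 1:10, b = 1:10 * 2, c = 1:10 * 3)
my_vector <- 1:10

# Divide every column by my_vector, updating the table by reference.
# Each column is still divided once; only the machinery around it changes.
cols <- names(dt)
dt[, (cols) := lapply(.SD, function(x) x / my_vector), .SDcols = cols]

head(dt)
```

Note that this still loops over the columns with lapply(), so the number of divisions is unchanged.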

junghoonshin · May 26, 2020, 9:59am

Thank you for your quick and informative answers!

nirgrahamuk · May 26, 2020, 10:00am

You're very welcome. And if you find a particular solution you favour, you can come back and share the knowledge.

martin.R · May 26, 2020, 10:20am

This is a lot more straightforward:

my_data = matrix(c(1:10, 1:10*2, 1:10*3), nrow = 10, ncol = 3)
my_vector = 1:10
my_data <- my_data/my_vector

nirgrahamuk · May 26, 2020, 10:36am

Of course, the matrix data structure would be much faster. Doh!

junghoonshin · May 26, 2020, 11:00am

Thank you. The best solution seems to be to convert my_data into a matrix first and then perform the calculation.
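
For reference, a minimal sketch of that conversion, assuming all columns are numeric and the vector has one entry per row (the name result is only for illustration):

```r
library(tibble)

my_data <- tibble(a = 1:10, b = 1:10 * 2, c = 1:10 * 3)
my_vector <- 1:10

# as.matrix() turns the all-numeric tibble into a numeric matrix.
# Dividing by a vector of length nrow recycles it down each column,
# so row i ends up divided by my_vector[i].
result <- as_tibble(as.matrix(my_data) / my_vector)

result
```

The column names survive the round trip, so result can be used wherever my_data was.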

Leon · May 26, 2020, 12:35pm

Something like this @junghoonshin?

m = 10
n = 6
X = matrix(data = round(rnorm(m*n), 1), nrow = m, ncol = n)
v = round(rnorm(m), 1)
Y = t(sapply(seq(1, m), function(i){
  return(X[i,] / v[i])
}))

Hope it helps

martin.R · May 26, 2020, 1:10pm

X/v is all that's required in this example. No loops, no apply functions.
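
The reason is that R stores matrices column by column and recycles the shorter vector, so when v has one entry per row, element i of v lines up with row i in every column. A small check, assuming a made-up nonzero v to avoid dividing by zero, that the three approaches agree:

```r
set.seed(1)  # arbitrary seed, only for reproducibility
m <- 10
n <- 6
X <- matrix(round(rnorm(m * n), 1), nrow = m, ncol = n)
v <- 1:m     # hypothetical nonzero divisor, one entry per row

Y_recycle <- X / v                                   # plain recycling
Y_sweep   <- sweep(X, 1, v, "/")                     # explicit row-wise sweep
Y_loop    <- t(sapply(seq_len(m), function(i) X[i, ] / v[i]))  # row loop

stopifnot(isTRUE(all.equal(Y_recycle, Y_sweep)),
          isTRUE(all.equal(Y_recycle, Y_loop)))
```

This shortcut relies on the vector matching the number of rows. If the vector had one entry per column instead, sweep(X, 2, v, "/") would be the equivalent.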

Leon · May 26, 2020, 1:40pm

Right you are - Nice!
