Sum by row names

dicsonsantos · April 28, 2021, 8:41pm

Hi everyone, I'm freshly using R and I'm having some problem with my code.

I have a huge data frame (df1) of some economic data like this:

state    account       t1           t2           t3           t4          sum
<list>    <list>  <int> <list> <int> <list> <int> <list> <int> <list> <int> <list>
st1      account1      1            2            3            1            7
st1      account2      2            3            1            2            8
st1      account3      1            2            1            3            7 
st1      account4      1            1            1            1            4
st2      account1      0            4            6            1            11
st2      account2      1            1            2            2            6
st2      account3      2            3            1            2            8
st2      account4      3            5            1            2            11
 .          .          .            .            .            .            .
 .          .          .            .            .            .            .
 .          .          .            .            .            .            .

And another dataframe (df2) with some results of (df1) as:

state    account     value
<list>    <list>  <int> <list>
st1      result_t1     1            
st1      result_t2     2            
st1      result_t3     1            
st1      result_t4     1            
st2      result_t1     0            
st2      result_t2     1            
st2      result_t3     2            
st2      result_t4     3            
 .          .          .                        
 .          .          .                    
 .          .          .

I need to create a new df (df3) to do some operations between the accounts over time (t1 to t4), like:

result_t1 = account1_t1 - account2_t1 - account3_t1 + account4_t1
result_t2 = account1_t2 - account2_t2 - account3_t2 + account4_t2
result_t3 = account1_t3 - account2_t3 - account3_t3 + account4_t3
result_t4 = account1_t4 - account2_t4 - account3_t4 + account4_t4

I tried something like this, but did not worked:

result~ x1 - x2 - x3 + x4 

x1 <- df1$Account("Account1")
x2 <- df1$Account("Account2")
x3 <- df1$Account("Account3")
x4 <- df1$Account("Account4")

After the operation I need to check if the result in the new lines of (df3) are equal to the result in (df2), and this I really don't know begin to solve this.

The desirable output would be something like the data of the operations of (df3), the values of (df2) and one column to say if they're equal or not, like:

(from    (from     (from         (from
  df2)      df2)       df2)         df3)
state    account     value        value        equal?
<list>    <list>  <int> <list> <int> <list>    <list>
st1      result_t1     1            1            yes
st1      result_t2     2            2            yes
st1      result_t3     1            0            no
st1      result_t4     1            1            yes
st2      result_t1     0            0            yes
st2      result_t2     1            1            yes
st2      result_t3     2            2            yes
st2      result_t4     3            3            yes
 .          .          .            .             .
 .          .          .            .             .
 .          .          .            .             .

So, anyone knows how to help me to solve this?

williaml · April 28, 2021, 10:00pm

Can you make the example reproducible? It will make it easier to provide an answer.

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

dicsonsantos · April 29, 2021, 8:26pm

Hey @williaml, thanks for giving me the tip, I think that will be easier now, here's the example:

df1 that is a huge data base that I alredy have:

df1 <- data.frame(stringsAsFactors = TRUE,
                 State = c("RJ","RJ","RJ","RJ","SP","SP","SP","SP"),
                 Account = c("revenue","rev_transfers","contribution_rev",
                             "expenses_transfers","revenue", "rev_transfers",
                             "contribution_rev","expenses_transfers"),
                 value_t1 = c(5, 1, 4, 9, 4, 7, 4, 6),
                 value_t2 = c(6, 3, 5, 7, 1, 4, 2, 6),
                 value_t3 = c(9, 8, 0, 1, 2, 7, 8, 3),
                 value_t4 = c(7, 2, 2, 8, 9, 4, 5, 1),
                 sum_t = c(sum_t <- df1$value_t1 + df1$value_t2 + df1$value_t3 + df1$value_t4)
                  )

df2 its another data base that I alredy have:

df2 <- data.frame(stringsAsFactors = TRUE,
                  State = c("RJ","RJ","RJ","RJ","SP","SP","SP","SP"),
                  Account = c("result_t1","result_t2","result_t3",
                              "result_t4","result_t1", "result_t2",
                              "result_t3","result_t4"),
                  result_df2 = c(-3,-8,1,7,9,2,-1,-2)                 
                  )

I'll separate the problems in 2 parts to try to make it easier:

PART 1

What I need to do is to create one formula that will do basic operations, like:

result_df3 <- df1$value_t1 - df1$value_t2 - df1$value_t3 + df1$value_t4

But the my problem is that I need to make R do this operations calling the rows by its names, because of the size of the data base, like:

result_df3 <- df1$revenue - df1$rev_transfers - df1$contribution_rev + df1$expenses_transfers

And I don't know how to do this operations calling by the row's name, only by the column's name.

PART 2
2. With this formula, I need to create a new df (df3), and compare with the results of df2, using some kind of if loop code to say if they're equal or not, like:


df3 <- data.frame(stringsAsFactors = FALSE,
                  State = c("RJ","RJ","RJ","RJ","SP","SP","SP","SP"),
                  Account = c("result_t1","result_t2","result_t3",
                              "result_t4","result_t1", "result_t2",
                              "result_t3","result_t4"),
                  result_df2,
                  result_df3 = c(result_df3 <- df1$value_t1 - df1$value_t2 - df1$value_t3 + df1$value_t4),
                  equal = c(
                    for(State in df3){
                      if (result_df2 == result_df3){
                      print("yes")
                    } else {
                      print("no")
                    }})
                    
                  )

And my if is returning me this:

1: In if (result_df2 == result_df3) {:
   the condition has length> 1 and only the first element will be used

Do you or anyone know how to solve this?

williaml · April 29, 2021, 10:21pm

There is probably a better way of doing this, but this is an answer for part one if I understood correctly. I assume that this is df3 as well. The numbers for part 2 can be compared pretty easily using dplyr, but not sure where they come from.

library(tidyverse)

df1 <- data.frame(stringsAsFactors = TRUE,
                  State = c("RJ","RJ","RJ","RJ","SP","SP","SP","SP"),
                  Account = c("revenue","rev_transfers","contribution_rev",
                              "expenses_transfers","revenue", "rev_transfers",
                              "contribution_rev","expenses_transfers"),
                  value_t1 = c(5, 1, 4, 9, 4, 7, 4, 6),
                  value_t2 = c(6, 3, 5, 7, 1, 4, 2, 6),
                  value_t3 = c(9, 8, 0, 1, 2, 7, 8, 3),
                  value_t4 = c(7, 2, 2, 8, 9, 4, 5, 1)
) %>% 
  mutate(sum_t = value_t1 + value_t2 + value_t3 + value_t4) # I couldn't get yours to work, but this will work


df1 %>% 
  pivot_longer(-(State:Account), names_to = "names", values_to = "values") %>% 
  pivot_wider(names_from = Account, values_from = values) %>% 
  mutate(new_value = revenue - rev_transfers - contribution_rev + expenses_transfers)


# A tibble: 10 x 7
   State names    revenue rev_transfers contribution_rev expenses_transfers new_value
   <fct> <chr>      <dbl>         <dbl>            <dbl>              <dbl>     <dbl>
 1 RJ    value_t1       5             1                4                  9         9
 2 RJ    value_t2       6             3                5                  7         5
 3 RJ    value_t3       9             8                0                  1         2
 4 RJ    value_t4       7             2                2                  8        11
 5 RJ    sum_t         27            14               11                 25        27
 6 SP    value_t1       4             7                4                  6        -1
 7 SP    value_t2       1             4                2                  6         1
 8 SP    value_t3       2             7                8                  3       -10
 9 SP    value_t4       9             4                5                  1         1
10 SP    sum_t         16            22               19                 16        -9

dicsonsantos · May 2, 2021, 8:57pm

Hey, @williaml, thank you this solved my problem.

But, when I'm summing the columns, there's some Nans, to solve I've search and tried this:

mutate(sum_t = value_t1 + value_t2 + value_t3 + value_t4,na.rm=TRUE)

But the "na.rm=true" just give me one new column named na.rm with a lot of true (lol), do you know how to solve this?

williaml · May 2, 2021, 10:37pm

Do you mean something like this?

df <- tibble(a = c(1:5),
             b = c(1:5),
             c = c(1, 2, 3, NA, NA))

df %>% 
  rowwise() %>% 
  mutate(d = sum(a,b,c, na.rm = TRUE)) %>% 
  ungroup() # to escape rowwise

# A tibble: 5 x 4
      a     b     c     d
  <int> <int> <dbl> <dbl>
1     1     1     1     3
2     2     2     2     6
3     3     3     3     9
4     4     4    NA     8
5     5     5    NA    10

dicsonsantos · May 3, 2021, 6:14pm

This solved but one more problem appeared: the values in the new column are all with the same using my df. Reading and making some tests with my df in Excel I found that the result in the column is the sum of the operations in all the rows, do you know why this is happening?

williaml · May 3, 2021, 8:53pm

Probably best if you put this in a new question with a reproducible example.

system · May 24, 2021, 8:54pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.