Data Frame dividing column values by the respective row value

RH12 · November 19, 2020, 9:16pm

Dear All,

I have a data frame with one date column and 8 columns with stock prices. I am trying to divide all column values by their respective value of a particular row (i.e. by a price at a specific date) to normalize the data. Unfortunately, I have not figured out how to do this yet (the best solution would be a solution where the specific date of the row can be selected, such as 2020-03-02, to quickly change the base values to normalize the data if needed).

Thanks!

pieterjanvc · November 19, 2020, 9:30pm

Hi,

It's not entirely clear what you like to do, so maybe if my suggestion below is not what you are looking for, create a reprex following the guide below. A reprex consists of the minimal code and data needed to recreate the issue/question you're having. You can find instructions how to build and share one here:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

You can use the group_by() from dplyr to perform actions on certain rows like this

library(dplyr)

#Generate some random data
set.seed(10)
myData = data.frame(
  date = as.Date(sample(c("2020-11-19", "2020-11-18"), 10, replace = T)),
  val1 = 1:10, val2 = runif(10))

myData
#>          date val1       val2
#> 1  2020-11-19    1 0.65165567
#> 2  2020-11-19    2 0.56773775
#> 3  2020-11-18    3 0.11350898
#> 4  2020-11-18    4 0.59592531
#> 5  2020-11-18    5 0.35804998
#> 6  2020-11-19    6 0.42880942
#> 7  2020-11-18    7 0.05190332
#> 8  2020-11-18    8 0.26417767
#> 9  2020-11-19    9 0.39879073
#> 10 2020-11-19   10 0.83613414

#Apply functions by group (in this case date)
myData %>% group_by(date) %>%
  mutate(val1 = scale(val1)[,1],
         val2 = (val2 - min(val2)) / (max(val2) - min(val2)))
#> # A tibble: 10 x 3
#> # Groups:   date [2]
#>    date          val1   val2
#>    <date>       <dbl>  <dbl>
#>  1 2020-11-19 -1.14   0.578 
#>  2 2020-11-19 -0.892  0.386 
#>  3 2020-11-18 -1.16   0.113 
#>  4 2020-11-18 -0.675  1     
#>  5 2020-11-18 -0.193  0.563 
#>  6 2020-11-19  0.0991 0.0686
#>  7 2020-11-18  0.772  0     
#>  8 2020-11-18  1.25   0.390 
#>  9 2020-11-19  0.842  0     
#> 10 2020-11-19  1.09   1

^{Created on 2020-11-19 by the reprex package (v0.3.0)}

Hope this helps,
PJ

RH12 · November 19, 2020, 9:54pm

Hi @pieterjanvc

Perhaps I did not express myself clearly enough. Let's have a look at the following example data frame, which is similar to the original:

data.frame(
      StockA = c(15, 20, 26, 10, 5),
      StockB = c(27, 16, 15, 10, 23),
                            date = as.factor(c("2020-03-01",
                                               "2020-03-02","2020-03-03",
                                               "2020-03-04","2020-03-04"))
                    )

What I try to do is to divide all column values of the columns StockA and StockB by the values in a certain row. For example, I want all values in the columns StockA and StockB to be divided by the values in row 2020-03-02 (i.e. 20 and 16 respectively). Therefore the values in this row are normalized to 1 and the other values are adjusted accordingly. But I did not find any code for this, because I do not want to select the corresponding lines for the code by [x,] but by explicitly mentioning the date of the row. In addition, the original data frame is much larger, so it is not possible to do it manually by just dividing each column by the row value.

Hopefully it is clearer now and thanks in advance.

andresrcs · November 20, 2020, 12:00am

Is this what you mean?

library(dplyr)

sample_df <- data.frame(
    StockA = c(15, 20, 26, 10, 5),
    StockB = c(27, 16, 15, 10, 23),
    date = as.Date(c("2020-03-01",
                       "2020-03-02","2020-03-03",
                       "2020-03-04","2020-03-04"))
)

sample_df %>% 
    mutate(across(.cols = starts_with("stock"),
                  .fns = ~ . / sample_df %>%
                      filter(date == '2020-03-02') %>%
                      pull(cur_column())
                  )
           )
#>   StockA StockB       date
#> 1   0.75 1.6875 2020-03-01
#> 2   1.00 1.0000 2020-03-02
#> 3   1.30 0.9375 2020-03-03
#> 4   0.50 0.6250 2020-03-04
#> 5   0.25 1.4375 2020-03-04

^{Created on 2020-11-20 by the reprex package (v0.3.0.9001)}

RH12 · November 20, 2020, 9:07am

Hi @andresrcs

Yes, that's exactly what I'm looking for, thank you. There is just one small problem left:
Is it possible to multiply all values by 100, i.e. that the normalized values are 100 instead of 1 and all other values adjust accordingly?

I very much apologise for my question but I am still a RStudio beginner.

andresrcs · November 20, 2020, 11:07am

You simply have to multiply

sample_df %>% 
    mutate(across(.cols = starts_with("stock"),
                  .fns = ~ . / sample_df %>%
                      filter(date == '2020-03-02') %>%
                      pull(cur_column()) * 100)
    )

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:

system · November 27, 2020, 11:07am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.