Comparison of columns

Mara_M · February 26, 2021, 3:35pm

Hello,
probably my problem is easily solvable for many of you, however I have not been working with R Studio for long and need some help.

Here is my problem:
I have two columns of a relatively large data set. The first is the retail price and the second is the sale price. Now I want to compare these two columns and know by how many percent the sales price is higher than the manufacturer price in each case.
Since the dataset has a lot of numbers to compare and R studio probably can't give me the percentage change for each case, it would be helpful if there were percentage ranges.
For example, in 200 cases there was an increase between 50% and 100%, 140 cases with an increase between 100% and 150% and so on.

I am grateful for any tips and thank you in advance.

StatSteph · February 26, 2021, 3:54pm

The best place to start when posting a question is with a reproducible example. It helps if you show a little bit of your data for us to give the best answer and not very generic answers. See a guide here: FAQ: How to do a minimal reproducible example ( reprex ) for beginners

lars · February 26, 2021, 4:35pm

Hello Mara,

I fully agree with the given advice to provide a reprex, but I also like to point you to section 20.4.3 Scalars and recycling rules which explains how R is capable of vectorised operations.

The example given below contains 1.000.000 records and calculates per case the price and percentage difference.

library(tidyverse)
set.seed(1)

df <- tibble(retail_price = (runif(10^6) * 100) %>% round(2),
             sales_price = (retail_price * (1 + runif(10))) %>% round(2)
             )

df %>% glimpse()
#> Rows: 1,000,000
#> Columns: 2
#> $ retail_price <dbl> 26.55, 37.21, 57.29, 90.82, 20.17, 89.84, 94.47, 66.08...
#> $ sales_price  <dbl> 30.27, 63.09, 99.05, 99.14, 21.51, 144.90, 133.03, 113...

df <- df %>%
  mutate(diff_price = sales_price - retail_price,
         diff_perc = sales_price/retail_price - 1,
         )

df
#> # A tibble: 1,000,000 x 4
#>    retail_price sales_price diff_price diff_perc
#>           <dbl>       <dbl>      <dbl>     <dbl>
#>  1        26.6         30.3       3.72    0.140 
#>  2        37.2         63.1      25.9     0.696 
#>  3        57.3         99.0      41.8     0.729 
#>  4        90.8         99.1       8.32    0.0916
#>  5        20.2         21.5       1.34    0.0664
#>  6        89.8        145.       55.1     0.613 
#>  7        94.5        133.       38.6     0.408 
#>  8        66.1        113.       46.9     0.710 
#>  9        62.9        100.       37.0     0.589 
#> 10         6.18        10.2       3.97    0.642 
#> # ... with 999,990 more rows

df %>% 
  ## count occurences for each diff_perc
  count(diff_perc, sort = TRUE)
#> # A tibble: 64,895 x 2
#>    diff_perc     n
#>        <dbl> <int>
#>  1    0.0667  5389
#>  2    0.643   3001
#>  3    0.696   2699
#>  4    0.643   1813
#>  5    0.613   1702
#>  6    0.408   1604
#>  7    0.613   1472
#>  8    0.696   1048
#>  9    0.729    817
#> 10    0.589    615
#> # ... with 64,885 more rows

^{Created on 2021-02-26 by the reprex package (v1.0.0)}

Hope this helps you further.

Mara_M · March 1, 2021, 9:56am

Hey,
thanks for your replies!
I solved the problem with excel.

system · March 22, 2021, 9:56am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.