Creating multiple variables efficiently

Vodstrup · May 20, 2020, 9:48am

Hello there!

I'm decently new to R and hope you can help me! Thanks in advance.

Keeping it short:

I have a dataset with a lot of variables.

I have one with years from 2010-2018.
I have several with income, taxes and so on for each year. Each of these also has a variable which shows what currency it is.
Then I have a lot of variables with the exchange rate of the currencies is a given year. Ie a whole variable called "Currencies11", one called "currencies12" and so on.

Id like to calculate the income, taxes and so on for each year, but I can't seem to find an effective way to do this.

Can you help me?

Best regards

siddharthprabhu · May 20, 2020, 9:59am

Hi @Vodstrup, welcome to RStudio Community.

It would be much easier to help you if we could see what your data looks like. You could share your data in an easy-to-copy format by posting the output of dput(df) where df is the name of the data frame. If your data has too many rows you can restrict it to a small sample with dput(head(df, n = xx)) where xx is the number of rows you want to extract.

If your data is private, please create a minimal reproducible example by following the guide below.

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

Vodstrup · May 20, 2020, 10:13am

Like this?

r1420_ny

A tibble: 2,029,569 x 27

 KOB Erklaering Aarets_res Res_foer_skat Valuta aar   Daekningsmaaned~ Hovedbranche Ansatte Regionskode IVS   Branche

1 353 Revision -499 -666 DKK 2012 12 467400 D Region Hov~ APS Taerti~
2 353 Revision -464 -619 DKK 2011 12 467400 D Region Hov~ APS Taerti~
3 353 Revision -65 -72 DKK 2010 12 467400 D Region Hov~ APS Taerti~
4 676 Revision NA NA DKK 2013 12 464210 I Region Syd~ Øvri~ Taerti~
5 676 Revision NA NA DKK 2011 12 464210 I Region Syd~ Øvri~ Taerti~
6 676 Revision NA NA DKK 2010 12 464210 I Region Syd~ Øvri~ Taerti~
7 687 Ingen -5 -5 DKK 2013 12 477110 B Region Hov~ APS Taerti~
8 687 Ingen 17 23 DKK 2012 12 477110 B Region Hov~ APS Taerti~
9 687 Revision 9 12 DKK 2011 12 477110 B Region Hov~ APS Taerti~
10 687 Revision -5 -3 DKK 2010 12 477110 B Region Hov~ APS Taerti~

... with 2,029,559 more rows, and 15 more variables: Nchar , DB07 , Valutacut , Valuta.y ,

Valuta09 , Valuta10 , Valuta11 , Valuta12 , Valuta13 , Valuta14 , Valuta15 ,

Valuta16 , Valuta17 , Valuta18 , Valuta19

siddharthprabhu · May 20, 2020, 10:33am

No, that is not an easy-to-copy format. Please re-read the instructions in my previous post.

Since your data set is huge, maybe you could share just the first 20 or so rows with dput(head(r1420_ny, n = 20)).

Vodstrup · May 20, 2020, 10:51am

Sorry! I accidently printed the dataset below the other output. This should be it: (Note that the "Valuta" is currency and although all these might be "1" they are not for all observations)

And thanks!

structure(list(KOB = c(353, 353, 353, 676, 676, 676, 687, 687, 
687, 687, 802, 802, 880, 880, 880, 921, 921, 921, 988, 988), 
    Erklaering = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 5L, 5L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Revision", 
    "Udvidet", "Review", "Assistance", "Ingen"), class = "factor"), 
    Aarets_res = c(-499, -464, -65, NA, NA, NA, -5, 17, 9, -5, 
    -210, -147, 59574, 63310, 55724, -17, 84, 31, 1822, 982), 
    Res_foer_skat = c(-666, -619, -72, NA, NA, NA, -5, 23, 12, 
    -3, -210, -147, 60609, 64540, 56781, -21, 111, 42, 2354, 
    1289), Valuta = c("DKK", "DKK", "DKK", "DKK", "DKK", "DKK", 
    "DKK", "DKK", "DKK", "DKK", "DKK", "DKK", "DKK", "DKK", "DKK", 
    "DKK", "DKK", "DKK", "DKK", "DKK"), aar = c("2012", "2011", 
    "2010", "2013", "2011", "2010", "2013", "2012", "2011", "2010", 
    "2011", "2010", "2012", "2011", "2010", "2012", "2011", "2010", 
    "2011", "2010"), Daekningsmaaneder = c(12, 12, 12, 12, 12, 
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12
    ), Hovedbranche = c(467400, 467400, 467400, 464210, 464210, 
    464210, 477110, 477110, 477110, 477110, 649900, 649900, 463300, 
    463300, 463300, 475220, 475220, 475220, 999999, 999999), 
    Ansatte = c("D", "D", "D", "I", "I", "I", "B", "B", "B", 
    "B", "B", "B", "B", "B", "B", "C", "C", "C", "B", "B"), Regionskode = c("Region Hovedstaden", 
    "Region Hovedstaden", "Region Hovedstaden", "Region Syddanmark", 
    "Region Syddanmark", "Region Syddanmark", "Region Hovedstaden", 
    "Region Hovedstaden", "Region Hovedstaden", "Region Hovedstaden", 
    "Region Syddanmark", "Region Syddanmark", "Region Midtjylland", 
    "Region Midtjylland", "Region Midtjylland", "Region Midtjylland", 
    "Region Midtjylland", "Region Midtjylland", "Region Midtjylland", 
    "Region Midtjylland"), IVS = c("APS", "APS", "APS", "Øvrige", 
    "Øvrige", "Øvrige", "APS", "APS", "APS", "APS", "APS", "APS", 
    "Øvrige", "Øvrige", "Øvrige", "APS", "APS", "APS", "A/S", 
    "A/S"), Branche = c("Taertier", "Taertier", "Taertier", "Taertier", 
    "Taertier", "Taertier", "Taertier", "Taertier", "Taertier", 
    "Taertier", "Taertier", "Taertier", "Taertier", "Taertier", 
    "Taertier", "Taertier", "Taertier", "Taertier", "Taertier", 
    "Taertier"), Nchar = c(6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
    6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L), DB07 = c("46", 
    "46", "46", "46", "46", "46", "47", "47", "47", "47", "64", 
    "64", "46", "46", "46", "47", "47", "47", "99", "99"), Valutacut = c("K", 
    "K", "K", "K", "K", "K", "K", "K", "K", "K", "K", "K", "K", 
    "K", "K", "K", "K", "K", "K", "K"), Valuta.y = c("Danish Krone", 
    "Danish Krone", "Danish Krone", "Danish Krone", "Danish Krone", 
    "Danish Krone", "Danish Krone", "Danish Krone", "Danish Krone", 
    "Danish Krone", "Danish Krone", "Danish Krone", "Danish Krone", 
    "Danish Krone", "Danish Krone", "Danish Krone", "Danish Krone", 
    "Danish Krone", "Danish Krone", "Danish Krone"), Valuta09 = c(1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 
    Valuta10 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1), Valuta11 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1), Valuta12 = c(1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), Valuta13 = c(1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 
    Valuta14 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1), Valuta15 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1), Valuta16 = c(1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), Valuta17 = c(1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 
    Valuta18 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1), Valuta19 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), row.names = c(NA, -20L), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"))

siddharthprabhu · May 20, 2020, 11:03am

Thank you, that helps. Now could you let us know what you'd like to calculate? You mentioned income & taxes for each year in your first post. But since the data set isn't in English, it'd help to know which variables are of interest.

Vodstrup · May 20, 2020, 12:16pm

Sorry for the slow replies!

I have the year of the data "aar", what currency it is given in "Valuta" for the observations. Then the different variables "Valuta 11", "Valuta12" and so on is the exchange rate of 2011, 2012 and so on.
"Aarets_res" is the year-end result of a company. (there's multiple relevant variables that I'd like to do the same with)

As so Id love to multiple Valuta 11 with Aarets_res for the year 2011. The problem is that "Valuta 11" is its' own variable. I then need to multiply Valuta 12 with Aarets_res for the year 2012 and so on.

This would be really neat to do for Aarets_res, res_foer_skat and other variables (that indicate a number)

Thanks a lot for helping!

Vodstrup · May 20, 2020, 12:58pm

I tried the following:

if (r1420_ny$aar==2017) {r1420_ny$Aarets_res_kr <- r1420_ny$Aarets_resr1420_ny$Valuta17
} else if (r1420_ny$aar==2016) {r1420_ny$Aarets_res_kr <- r1420_ny$Aarets_resr1420_ny$Valuta16
} else if (r1420_ny$aar==2015) {r1420_ny$Aarets_res_kr <- r1420_ny$Aarets_resr1420_ny$Valuta15
} else if (r1420_ny$aar==2014) {r1420_ny$Aarets_res_kr <- r1420_ny$Aarets_resr1420_ny$Valuta14
} else if (r1420_ny$aar==2013) {r1420_ny$Aarets_res_kr <- r1420_ny$Aarets_resr1420_ny$Valuta13
} else if (r1420_ny$aar==2012) {r1420_ny$Aarets_res_kr <- r1420_ny$Aarets_resr1420_ny$Valuta12
} else if (r1420_ny$aar==2011) {r1420_ny$Aarets_res_kr <- r1420_ny$Aarets_resr1420_ny$Valuta11
} else {r1420_ny$Aarets_res_kr <- r1420_ny$Aarets_resr1420_ny$Valuta10}

However it seems very inconvinient and doesn't multiple with the correct year

I get the errormessage:
"In if (r1420_ny$aar == 2017) { :
the condition has length > 1 and only the first element will be used
2: In if (r1420_ny$aar == 2016) { :
the condition has length > 1 and only the first element will be used
3: In if (r1420_ny$aar == 2015) { :
the condition has length > 1 and only the first element will be used
4: In if (r1420_ny$aar == 2014) { :
the condition has length > 1 and only the first element will be used
5: In if (r1420_ny$aar == 2013) { :
the condition has length > 1 and only the first element will be used
6: In if (r1420_ny$aar == 2012) { :
the condition has length > 1 and only the first element will be used"

siddharthprabhu · May 20, 2020, 1:16pm

This calculation would be much easier if the data set was in tidy format with Valutaxx variables in rows instead of columns.

The code below applies this transformation to the sample data before proceeding with the multiplication. The results are stored in the variables Aarets_res_new and Res_foer_skat_new.

library(tidyverse)

# Code to generate `data` not shown.

data %>% 
  pivot_longer(cols = matches("Valuta\\d{2}")) %>% 
  separate(name, into = c("valuta", "year_num"), sep = 6) %>% 
  filter(aar == str_c("20", year_num)) %>% 
  select(-c(valuta, year_num)) %>% 
  mutate(Aarets_res_new = Aarets_res * value, Res_foer_skat_new = Res_foer_skat * value) %>% 
  print(width = Inf)
# A tibble: 20 x 19
     KOB Erklaering Aarets_res Res_foer_skat Valuta aar   Daekningsmaaneder Hovedbranche Ansatte Regionskode        IVS    Branche 
   <dbl> <fct>           <dbl>         <dbl> <chr>  <chr>             <dbl>        <dbl> <chr>   <chr>              <chr>  <chr>   
 1   353 Revision         -499          -666 DKK    2012                 12       467400 D       Region Hovedstaden APS    Taertier
 2   353 Revision         -464          -619 DKK    2011                 12       467400 D       Region Hovedstaden APS    Taertier
 3   353 Revision          -65           -72 DKK    2010                 12       467400 D       Region Hovedstaden APS    Taertier
 4   676 Revision           NA            NA DKK    2013                 12       464210 I       Region Syddanmark  Øvrige Taertier
 5   676 Revision           NA            NA DKK    2011                 12       464210 I       Region Syddanmark  Øvrige Taertier
 6   676 Revision           NA            NA DKK    2010                 12       464210 I       Region Syddanmark  Øvrige Taertier
 7   687 Ingen              -5            -5 DKK    2013                 12       477110 B       Region Hovedstaden APS    Taertier
 8   687 Ingen              17            23 DKK    2012                 12       477110 B       Region Hovedstaden APS    Taertier
 9   687 Revision            9            12 DKK    2011                 12       477110 B       Region Hovedstaden APS    Taertier
10   687 Revision           -5            -3 DKK    2010                 12       477110 B       Region Hovedstaden APS    Taertier
11   802 Revision         -210          -210 DKK    2011                 12       649900 B       Region Syddanmark  APS    Taertier
12   802 Revision         -147          -147 DKK    2010                 12       649900 B       Region Syddanmark  APS    Taertier
13   880 Revision        59574         60609 DKK    2012                 12       463300 B       Region Midtjylland Øvrige Taertier
14   880 Revision        63310         64540 DKK    2011                 12       463300 B       Region Midtjylland Øvrige Taertier
15   880 Revision        55724         56781 DKK    2010                 12       463300 B       Region Midtjylland Øvrige Taertier
16   921 Revision          -17           -21 DKK    2012                 12       475220 C       Region Midtjylland APS    Taertier
17   921 Revision           84           111 DKK    2011                 12       475220 C       Region Midtjylland APS    Taertier
18   921 Revision           31            42 DKK    2010                 12       475220 C       Region Midtjylland APS    Taertier
19   988 Revision         1822          2354 DKK    2011                 12       999999 B       Region Midtjylland A/S    Taertier
20   988 Revision          982          1289 DKK    2010                 12       999999 B       Region Midtjylland A/S    Taertier
   Nchar DB07  Valutacut Valuta.y     value Aarets_res_new Res_foer_skat_new
   <int> <chr> <chr>     <chr>        <dbl>          <dbl>             <dbl>
 1     6 46    K         Danish Krone     1           -499              -666
 2     6 46    K         Danish Krone     1           -464              -619
 3     6 46    K         Danish Krone     1            -65               -72
 4     6 46    K         Danish Krone     1             NA                NA
 5     6 46    K         Danish Krone     1             NA                NA
 6     6 46    K         Danish Krone     1             NA                NA
 7     6 47    K         Danish Krone     1             -5                -5
 8     6 47    K         Danish Krone     1             17                23
 9     6 47    K         Danish Krone     1              9                12
10     6 47    K         Danish Krone     1             -5                -3
11     6 64    K         Danish Krone     1           -210              -210
12     6 64    K         Danish Krone     1           -147              -147
13     6 46    K         Danish Krone     1          59574             60609
14     6 46    K         Danish Krone     1          63310             64540
15     6 46    K         Danish Krone     1          55724             56781
16     6 47    K         Danish Krone     1            -17               -21
17     6 47    K         Danish Krone     1             84               111
18     6 47    K         Danish Krone     1             31                42
19     6 99    K         Danish Krone     1           1822              2354
20     6 99    K         Danish Krone     1            982              1289

The important consideration however is performance. The pivot_longer() operation will temporarily grow your data frame by a factor of 10 (resulting in 20 million rows) before filter() reduces it back to the original size.

You may want to try this out on a subset of your data and check how long it takes to run.

Vodstrup · May 20, 2020, 1:38pm

This worked! Thanks a lot!

Would it be too much to as in raw terms the steps of the code?

best regards

nirgrahamuk · May 20, 2020, 2:01pm

Might be a usecase for the library(dtplyr), and the lazy_dt function.
To get benefits of speed from a data.table back-end with the advantage of nice dplyr syntax.

siddharthprabhu · May 20, 2020, 3:11pm

Sure. Here's what's happening:

pivot_longer() pivots any column with a name matching the pattern Valuta following by 2 digits (i.e. Valuta09, Valuta10 etc.) into long form. This results in two columns name and value where name is the column's original name and value is the number it contained.
Then we use separate() to split the column name at the 6th character. This allows us to extract the digit portion of the name.
With filter() we only retain rows where the year in aar matches the digit portion we extracted in step 2 (after appending 20 to it in order to create a YYYY year). This achieves the year matching.
Now we just need to multiply the variables of interest with the value column which is done in the mutate() step.

system · May 27, 2020, 3:11pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.