I would like to create a new variable in a dataframe, essentially using the equivalent of a "sumif function" in excel.
I have a dataframe which looks like this (but is much larger):
Country Year Number of IMF programs during the year
AFG 2000 1
ARG 2000 1
AFG 2001 0
ARG 2001 1
I would like to create a new variable "number of IMF programs over the period": the sum of the number of programs in the country, over the period.
In the example, we can see from the column "IMF programs during the year" that Afghanistan had 1 program over the period (1 in 2000, 0 in 2001), and Argentina had 2.
so the table would look like this:
Country Year IMF programs during the year N of programs over the period
BRA 2000 1 1
ARG 2000 1 2
BRA 2001 0 1
ARG 2001 1 2
I have seen tutorials, but they only show how to create subsets of observations matching certain criteria, and then calculate sums within the subsets. I would be able to do this, but it is not what I need. I would need another variable...
Would you be able to provide me with some guidance ?
I do not understand where the values in the new column come from. Are they just a running seq within each year independent of the value in "IMF programs during the year"?
Instead of using group_by and summarize I used group_by and mutate.
DF <- data.frame(Country = c("B", "A", "B", "A"),
Year = c(2000, 2000, 2001, 2001),
IMF_Prg = c(1,1,0,1))
DF
#> Country Year IMF_Prg
#> 1 B 2000 1
#> 2 A 2000 1
#> 3 B 2001 0
#> 4 A 2001 1
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
DF %>% group_by(Country) %>% mutate(PrgInPeriod = sum(IMF_Prg))
#> # A tibble: 4 x 4
#> # Groups: Country [2]
#> Country Year IMF_Prg PrgInPeriod
#> <fct> <dbl> <dbl> <dbl>
#> 1 B 2000 1 1
#> 2 A 2000 1 2
#> 3 B 2001 0 1
#> 4 A 2001 1 2
Thank you so much for the effort you've put in, it's incredible and I'm very grateful.
It nearly works, as it does create a new column with the sum of programs... but it sums programs for all countries !
hence the last column is:
PrgInPeriod
4
4
4
4
Do not worry about it though, I have found a (suboptimal, but good enough) solution:
I use
to see the number of programs per country, and then I will manually enter them into a vector (which I will then use as a column) when creating my dataframe.
datayolo <- data.frame(Country=c("AFG", "ARG", "AFG", "ARG"),
Year=c(2000,2000,2001,2001),
`IMF Programs during the year`=c(1,1,0,1),
check.names=FALSE)
datayolo[["N of programs over the period"]] <- tapply(datayolo[["IMF Programs during the year"]],
datayolo$Country,
sum)[datayolo$Country]
datayolo
#> Country Year IMF Programs during the year N of programs over the period
#> 1 AFG 2000 1 1
#> 2 ARG 2000 1 2
#> 3 AFG 2001 0 1
#> 4 ARG 2001 1 2
Those are backticks around the first IMF Programs during the year so it works inside the data.frame() function, check.names = FALSE keeps dots from replacing the spaces in the name.
The others are standard double-quotes.
Since the OQ is for "sumif", the third argment to tapply is "sum", but any other function could be used, e.g. mean, median, sd.