I am trying to create a discrete variable from two existing variables. The new variable will take only four values. The names of the relevant variables and the conditions for generating the new variable are:
New Variable Name: GDPLife
Existing variables: GDPpercapita and LifeExpectancy
Conditions:
GDPLife = 1 if GDPpercapita > 10000 and LifeExpectancy > 70
GDPLife = 2 if GDPpercapita > 10000 and LifeExpectancy <= 70
GDPLife = 3 if GDPpercapita < 10000 and LifeExpectancy > 70
GDPLife = 4 if GDPpercapita < 10000 and LifeExpectancy <= 70
I would also like to mention that there are some missing values in the variable GDPpercapita, and it's okay if the new variable shows NA in those cases.
In addition, I was wondering whether it is possible to add labels to the values of GDPLife. For example, I would like to add the following labels.
If GDPLife = 1 then the label will be "High Income, High Life Expectancy"
If GDPLife = 2 then the label will be "High Income, Low Life Expectancy"
If GDPLife = 3 then the label will be "Low Income, High Life Expectancy"
If GDPLife = 4 then the label will be "Low Income, Low Life Expectancy"
I would suggest always providing your data to your potential helpers. It is much easier for them to help you if they have your data. Having said that, it seems that you are using the gapminder data or a similar dataset. Since I don't have your data, I'm just going to roll with gapminder!
As @cderv suggested, the case_when() function in the dplyr package is your friend for what you are trying to do. Regarding your question about labels, he suggested using factors, which is the right way to attach labels to values. However, my personal advice would be to create an additional column with your labels.
library(gapminder)
library(dplyr)

# Load the dataset
data("gapminder")
my_data <- gapminder

my_data <-
  my_data %>%
  mutate(
    # Numeric code 1-4; rows matching no condition
    # (missing values, or gdpPercap exactly 10000) get NA
    gdp_life = case_when(
      gdpPercap > 10000 & lifeExp > 70  ~ 1,
      gdpPercap > 10000 & lifeExp <= 70 ~ 2,
      gdpPercap < 10000 & lifeExp > 70  ~ 3,
      gdpPercap < 10000 & lifeExp <= 70 ~ 4
    ),
    # Text label stored in a separate column
    label = case_when(
      gdp_life == 1 ~ "High income, high life expectancy",
      gdp_life == 2 ~ "High income, low life expectancy",
      gdp_life == 3 ~ "Low income, high life expectancy",
      gdp_life == 4 ~ "Low income, low life expectancy"
    )
  )

my_data
# A tibble: 1,704 x 8
   country     continent  year lifeExp      pop gdpPercap gdp_life label
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>    <dbl> <chr>
 1 Afghanistan Asia       1952    28.8  8425333      779.        4 Low income, low life expectancy
 2 Afghanistan Asia       1957    30.3  9240934      821.        4 Low income, low life expectancy
 3 Afghanistan Asia       1962    32.0 10267083      853.        4 Low income, low life expectancy
 4 Afghanistan Asia       1967    34.0 11537966      836.        4 Low income, low life expectancy
 5 Afghanistan Asia       1972    36.1 13079460      740.        4 Low income, low life expectancy
 6 Afghanistan Asia       1977    38.4 14880372      786.        4 Low income, low life expectancy
 7 Afghanistan Asia       1982    39.9 12881816      978.        4 Low income, low life expectancy
 8 Afghanistan Asia       1987    40.8 13867957      852.        4 Low income, low life expectancy
 9 Afghanistan Asia       1992    41.7 16317921      649.        4 Low income, low life expectancy
10 Afghanistan Asia       1997    41.8 22227415      635.        4 Low income, low life expectancy
# ... with 1,694 more rows
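If you would rather go with the factor approach that @cderv suggested, here is a minimal sketch of how it could look. It uses a small made-up data frame: the `toy` data, its values, and the column name `GDPLife_f` are my own assumptions for illustration, while the variable names and labels are taken from your question. It also shows what happens with missing values: when GDPpercapita is NA, none of the conditions is TRUE, so case_when() returns NA, and the factor is NA as well.

```r
library(dplyr)

# Hypothetical toy data: column names follow the question, values are made up
toy <- tibble(
  GDPpercapita   = c(12000, 15000, 800, 500, NA),
  LifeExpectancy = c(75, 65, 72, 40, 60)
)

toy <- toy %>%
  mutate(
    # Numeric code 1-4; rows matching no condition become NA
    GDPLife = case_when(
      GDPpercapita > 10000 & LifeExpectancy > 70  ~ 1,
      GDPpercapita > 10000 & LifeExpectancy <= 70 ~ 2,
      GDPpercapita < 10000 & LifeExpectancy > 70  ~ 3,
      GDPpercapita < 10000 & LifeExpectancy <= 70 ~ 4
    ),
    # Factor version: the codes 1-4 become levels carrying the labels
    GDPLife_f = factor(
      GDPLife,
      levels = 1:4,
      labels = c(
        "High Income, High Life Expectancy",
        "High Income, Low Life Expectancy",
        "Low Income, High Life Expectancy",
        "Low Income, Low Life Expectancy"
      )
    )
  )

toy
```

Running `count(toy, GDPLife_f)` afterwards is a quick way to check how many rows landed in each group. Note that, with the conditions as written, a GDPpercapita of exactly 10000 matches none of them and would also end up as NA; changing one of the comparisons to `>=` or `<=` would cover that case if you need it.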
@cderv suggested the book R for Data Science in his post and I would like to emphasize his suggestion.