I am trying to create a discrete variable from two existing variables. The new variable will take only four values. The variable names and the conditions for generating the new variable are:

New Variable Name: GDPLife
Existing variables: GDPpercapita and LifeExpectancy

GDPLife = 1 if GDPpercapita > 10000 and LifeExpectancy > 70
GDPLife = 2 if GDPpercapita > 10000 and LifeExpectancy <= 70
GDPLife = 3 if GDPpercapita < 10000 and LifeExpectancy > 70
GDPLife = 4 if GDPpercapita < 10000 and LifeExpectancy <= 70

I would also like to mention that there are some missing values in the variable GDPpercapita, and it's okay if the new variable shows NA in those cases.

In addition, is it possible to add labels to the values of GDPLife? For example, I would like to add the following labels.

If GDPLife =1 then the label will be "High Income, High Life Expectancy"
If GDPLife =2 then the label will be "High Income, Low Life Expectancy"
If GDPLife =3 then the label will be "Low Income, High Life Expectancy"
If GDPLife =4 then the label will be "Low Income, Low Life Expectancy"

Are you using the tidyverse already?

You may find dplyr::case_when useful for recoding a variable based on conditions.

For the labelling you could use factors.
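
As a rough sketch of the factor idea, assuming the new variable already holds the codes 1 to 4 (the vector `gdp_life_codes` below is made-up toy data, not your dataset), base R's `factor()` can attach the labels from the question:

```r
# Hypothetical toy codes standing in for the GDPLife column (NA = missing)
gdp_life_codes <- c(1, 4, 2, NA, 3)

# levels gives the underlying codes, labels gives the text shown for each code
gdp_life_factor <- factor(
  gdp_life_codes,
  levels = 1:4,
  labels = c(
    "High Income, High Life Expectancy",
    "High Income, Low Life Expectancy",
    "Low Income, High Life Expectancy",
    "Low Income, Low Life Expectancy"
  )
)

# Print the labelled values; missing codes stay NA
print(gdp_life_factor)

# The integer codes are still recoverable from the factor
as.integer(gdp_life_factor)
```

The factor keeps the code and the label together in one column, which is handy for tables and plots.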

Hey @Naveed,

I would suggest always providing your data to your potential helpers. It is much easier for them to help you if they have it :slight_smile: Having said that, it seems that you are using the gapminder data or a similar dataset. Since I don't have your data, I'm just going to roll with gapminder!

As @cderv suggested, the case_when() function in the dplyr package is your friend for what you are trying to do. For the labels, he suggested factors, which is the right way to create them. However, my personal advice would be to create an additional column with your labels.


# Load the packages and the dataset
library(dplyr)
library(gapminder)

my_data <- gapminder

my_data <-
  my_data %>%
  mutate(
    # Numeric code; rows that match none of the conditions get NA
    gdp_life = case_when(
      gdpPercap > 10000 & lifeExp > 70 ~ 1,
      gdpPercap > 10000 & lifeExp <= 70 ~ 2,
      gdpPercap < 10000 & lifeExp > 70 ~ 3,
      gdpPercap < 10000 & lifeExp <= 70 ~ 4
    ),
    # Text label stored in a separate column
    label = case_when(
      gdp_life == 1 ~ "High income, high life expectancy",
      gdp_life == 2 ~ "High income, low life expectancy",
      gdp_life == 3 ~ "Low income, high life expectancy",
      gdp_life == 4 ~ "Low income, low life expectancy"
    )
  )

my_data


A tibble: 1,704 x 8
   country     continent  year lifeExp      pop gdpPercap gdp_life label                          
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>    <dbl> <chr>                          
 1 Afghanistan Asia       1952    28.8  8425333      779.        4 Low income, low life expectancy
 2 Afghanistan Asia       1957    30.3  9240934      821.        4 Low income, low life expectancy
 3 Afghanistan Asia       1962    32.0 10267083      853.        4 Low income, low life expectancy
 4 Afghanistan Asia       1967    34.0 11537966      836.        4 Low income, low life expectancy
 5 Afghanistan Asia       1972    36.1 13079460      740.        4 Low income, low life expectancy
 6 Afghanistan Asia       1977    38.4 14880372      786.        4 Low income, low life expectancy
 7 Afghanistan Asia       1982    39.9 12881816      978.        4 Low income, low life expectancy
 8 Afghanistan Asia       1987    40.8 13867957      852.        4 Low income, low life expectancy
 9 Afghanistan Asia       1992    41.7 16317921      649.        4 Low income, low life expectancy
10 Afghanistan Asia       1997    41.8 22227415      635.        4 Low income, low life expectancy
... with 1,694 more rows
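
Regarding the missing values mentioned in the question: gapminder has none, so here is a small sketch on made-up data (the tibble `toy` is hypothetical; the column names are taken from the question). When no condition evaluates to TRUE, case_when() returns NA, and that covers a missing GDPpercapita as well as a value of exactly 10000, which the stated conditions do not include.

```r
library(dplyr)

# Hypothetical toy data; column names follow the original question
toy <- tibble(
  GDPpercapita   = c(12000, 12000, 5000, 5000, NA, 10000),
  LifeExpectancy = c(75, 65, 75, 65, 72, 80)
)

toy <- toy %>%
  mutate(
    # Same four conditions as in the question; unmatched rows become NA
    GDPLife = case_when(
      GDPpercapita > 10000 & LifeExpectancy > 70  ~ 1,
      GDPpercapita > 10000 & LifeExpectancy <= 70 ~ 2,
      GDPpercapita < 10000 & LifeExpectancy > 70  ~ 3,
      GDPpercapita < 10000 & LifeExpectancy <= 70 ~ 4
    )
  )

# The last two rows (missing GDP, GDP exactly 10000) come out as NA
print(toy)
```

If a GDP of exactly 10000 should belong to one of the groups, changing `>` to `>=` or `<` to `<=` in the relevant conditions would be one way to handle it.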

@cderv suggested the book R for Data Science in his post, and I would like to second that suggestion.


