Replacing multiple values in a column r

ShelbyLab · September 20, 2020, 6:48am

I am trying to write a function that takes two arguments: a continent and the name of a column in a dataframe. The function should calculate the mean of that column for the given continent and use it to replace the NAs in that column for that continent. However, the replacement step keeps failing. I have tried several approaches, such as replace, replace_na and mutate, but each one produces an error I cannot get rid of. The code works when it is not in a function, but as soon as I wrap it in a function I get the error below.

df<-structure(list(location = c("Algeria", "Angola", "Benin", "Botswana", 
"Burkina Faso", "Burundi"), iso_code = c("DZA", "AGO", "BEN", 
"BWA", "BFA", "BDI"), continent = c("Africa", "Africa", "Africa", 
"Africa", "Africa", "Africa"), date = c("2020-09-02", "2020-09-02", 
"2020-09-02", "2020-09-02", "2020-09-02", "2020-09-02"), total_cases = c(44833, 
2654, 2145, 1733, 1375, 445), new_cases = c(339, 30, 0, 9, 5, 
0), new_cases_smoothed = c(372.143, 53, 4.286, 24.429, 3.286, 
2.143), total_deaths = c(1518, 108, 40, 6, 55, 1), new_deaths = c(8, 
1, 0, 0, 0, 0), new_deaths_smoothed = c(8.857, 0.857, 0.143, 
0.429, 0, 0), total_cases_per_million = c(1022.393, 80.751, 176.934, 
736.937, 65.779, 37.424), new_cases_per_million = c(7.731, 0.913, 
0, 3.827, 0.239, 0), new_cases_smoothed_per_million = c(8.487, 
1.613, 0.354, 10.388, 0.157, 0.18), total_deaths_per_million = c(34.617, 
3.286, 3.299, 2.551, 2.631, 0.084), new_deaths_per_million = c(0.182, 
0.03, 0, 0, 0, 0), new_deaths_smoothed_per_million = c(0.202, 
0.026, 0.012, 0.182, 0, 0), population = c(43851043, 32866268, 
12123198, 2351625, 20903278, 11890781), population_density = c(17.348, 
23.89, 99.11, 4.044, 70.151, 423.062), median_age = c(29.1, 16.8, 
18.8, 25.8, 17.6, 17.5), aged_65_older = c(6.211, 2.405, 3.244, 
3.941, 2.409, 2.562), aged_70_older = c(3.857, 1.362, 1.942, 
2.242, 1.358, 1.504), gdp_per_capita = c(13913.839, 5819.495, 
2064.236, 15807.374, 1703.102, 702.225), extreme_poverty = c(0.5, 
NA, 49.6, NA, 43.7, 71.7), cardiovasc_death_rate = c(278.364, 
276.045, 235.848, 237.372, 269.048, 293.068), diabetes_prevalence = c(6.73, 
3.94, 0.99, 4.81, 2.42, 6.05), female_smokers = c(0.7, NA, 0.6, 
5.7, 1.6, NA), male_smokers = c(30.4, NA, 12.3, 34.4, 23.9, NA
), handwashing_facilities = c(83.741, 26.664, 11.035, NA, 11.877, 
6.144), hospital_beds_per_thousand = c(1.9, NA, 0.5, 1.8, 0.4, 
0.8), life_expectancy = c(76.88, 61.15, 61.77, 69.59, 61.58, 
61.58)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))


fun1 <- function(cont, column)
{
  countries<-df%>%
    filter(continent == cont)
  
  m<-mean(countries[[column]],na.rm=T)

    df[,column]<-ifelse(is.na(df[,column]) & df$continent==cont,m,(df[,column]=df[,column]))
}

fun1("Europe","median_age")

Error:
Error during wrapup: Can't recycle input of size 208 to size 1.
Error: no more error handlers available (recursive errors?); invoking 'abort' restart

smichal · September 20, 2020, 10:53am

Hi @ShelbyLab,

If I understand your code correctly, you would like to replace the NAs in each column with that continent's mean value. Is that correct?
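On the error itself: one likely contributor (an assumption, not verified against the full 208-row data) is that `df[, column]` on a tibble returns a one-column tibble rather than a plain vector, so `ifelse()` does not receive the vectors it expects. In addition, the function only modifies its local copy of `df` and never returns it. Below is a minimal base R sketch that avoids both issues by extracting the column with `[[`. The function name `fill_na_with_continent_mean` and the toy data are made up for illustration.

```r
# Hypothetical helper: fill NAs in `column` for one continent with that
# continent's mean, and return the modified data frame.
fill_na_with_continent_mean <- function(data, cont, column) {
  # rows belonging to the requested continent (an NA continent counts as no match)
  in_cont <- !is.na(data$continent) & data$continent == cont
  # `[[` returns a plain vector, for tibbles as well as data frames
  m <- mean(data[[column]][in_cont], na.rm = TRUE)
  # overwrite only the missing values inside that continent
  to_fill <- in_cont & is.na(data[[column]])
  data[[column]][to_fill] <- m
  data
}

# toy data, invented for this example
toy <- data.frame(
  continent  = c("Africa", "Africa", "Africa", "Europe", "Europe"),
  median_age = c(20, NA, 30, 40, NA)
)

filled <- fill_na_with_continent_mean(toy, "Africa", "median_age")
filled$median_age
# 20 25 30 40 NA
```

Because the function returns the modified data, the result has to be assigned back, e.g. `df <- fill_na_with_continent_mean(df, "Europe", "median_age")`.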

I'm adding a second continent just to check the output

library(tidyverse)
# duplicate the rows as a second continent ("Europe") to check the result later
df2 <- df %>% 
  union_all(df %>% 
              mutate(continent = "Europe",
                     location = stringi::stri_rand_strings(6, 10)) %>%
              # multiply all numbers by two to get different results for two different continents
              mutate_if(is.numeric, ~.x * 2))

Let's have a look at a column with NAs

df2 %>% 
  group_by(continent) %>% 
  select(extreme_poverty)

# A tibble: 12 x 2
# Groups:   continent [2]
   continent extreme_poverty
   <chr>               <dbl>
 1 Africa                0.5
 2 Africa               NA  
 3 Africa               49.6
 4 Africa               NA  
 5 Africa               43.7
 6 Africa               71.7
 7 Europe                1  
 8 Europe               NA  
 9 Europe               99.2
10 Europe               NA  
11 Europe               87.4
12 Europe              143.

I would recommend splitting the tbl by continent

df2_nested <- df2 %>% 
  group_nest(continent)
df2_nested

  continent                data
  <chr>     <list<tbl_df[,28]>>
1 Africa               [6 x 28]
2 Europe               [6 x 28]

And then I would apply a proper function to each sub-tbl.

Let's try explicitly first, i.e. without a function

sub_df <- df2_nested$data[[1]]  
sub_df %>% 
  # replace the NAs in one vector column by the mean of the column (excl. NAs)
  mutate(extreme_poverty = replace_na(extreme_poverty, mean(extreme_poverty, na.rm = TRUE))) %>%
  # select only a few columns to check
  select(location, extreme_poverty)

# A tibble: 6 x 2
  location     extreme_poverty
  <chr>                  <dbl>
1 Algeria                  0.5
2 Angola                  41.4 (was NA)
3 Benin                   49.6
4 Botswana                41.4 (was NA)
5 Burkina Faso            43.7
6 Burundi                 71.7

Now, let's try more columns using mutate_at

sub_df %>% 
  # apply a function to the columns listed in 'vars()'
  mutate_at(vars(extreme_poverty, female_smokers),
            function(.x) {replace_na(.x, mean(.x, na.rm = TRUE))}) %>% 
  # select only a few columns to check
  select(location, extreme_poverty, female_smokers)
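
With dplyr 1.0.0 or later, `mutate_at()` is superseded by `across()`. Here is a sketch of the same step, assuming that dplyr version is installed. The small tibble `sub_small` is a made-up name and reuses the Africa values of the two columns from the data above, so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# the six African rows, reduced to the columns needed here
sub_small <- tibble(
  location        = c("Algeria", "Angola", "Benin", "Botswana", "Burkina Faso", "Burundi"),
  extreme_poverty = c(0.5, NA, 49.6, NA, 43.7, 71.7),
  female_smokers  = c(0.7, NA, 0.6, 5.7, 1.6, NA)
)

filled_small <- sub_small %>%
  # across() selects the columns, the lambda fills each column's NAs with its mean
  mutate(across(c(extreme_poverty, female_smokers),
                ~ replace_na(.x, mean(.x, na.rm = TRUE))))
filled_small
```

The lambda is the same as the one passed to `mutate_at()`; only the way the columns are selected changes.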

And this is the function to be applied to each continent sub-df

replace_na_in_columns <- function(continent_sub_df, vars_columns) {
  continent_sub_df %>% 
    mutate_at(vars_columns,
              # short form of the above
              ~replace_na(.x, mean(.x, na.rm = TRUE))) 
}

# check function
sub_df %>% 
  replace_na_in_columns(vars(extreme_poverty, female_smokers)) %>% 
  select(location, extreme_poverty, female_smokers)

# A tibble: 6 x 3
  location     extreme_poverty female_smokers
  <chr>                  <dbl>          <dbl>
1 Algeria                  0.5           0.7 
2 Angola                  41.4           2.15
3 Benin                   49.6           0.6 
4 Botswana                41.4           5.7 
5 Burkina Faso            43.7           1.6 
6 Burundi                 71.7           2.15

And, finally, this is how to apply the function to the nested tbl from the beginning

result_nested <- df2_nested %>% 
  # create a new column with mutate() by applying our replace_na_... function
  # to each element (row) of the list column 'data'
  mutate(CLEANED_CONTINENT_SUB_DATA = map(
    data,
    ~replace_na_in_columns(
      continent_sub_df = .x,
      # apply to all columns from new_cases to life_expectancy
      vars_columns = vars(new_cases:life_expectancy)
    )
  ))
result_nested

# A tibble: 2 x 3
  continent                data CLEANED_CONTINENT_SUB_DATA
  <chr>     <list<tbl_df[,28]>> <list>                    
1 Africa               [6 x 28] <tibble [6 x 28]>         
2 Europe               [6 x 28] <tibble [6 x 28]>

Unnest the new sub-df

result <- result_nested %>% 
  select(-data) %>% 
  unnest(CLEANED_CONTINENT_SUB_DATA)
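
The nesting step is not strictly required: inside a grouped `mutate()`, `mean()` is computed per group, so the same replacement can be sketched without `group_nest()` and `map()`. This again assumes dplyr 1.0.0 or later for `across()`. The toy tibble and the name `result_grouped` are invented for the example.

```r
library(dplyr)
library(tidyr)

# toy data with two continents, invented for this example
toy2 <- tibble(
  continent       = c("Africa", "Africa", "Africa", "Europe", "Europe", "Europe"),
  extreme_poverty = c(1, NA, 3, 10, NA, 30),
  female_smokers  = c(NA, 2, 4, NA, 20, 40)
)

result_grouped <- toy2 %>%
  group_by(continent) %>%
  # inside a grouped mutate, mean() only sees the rows of the current continent
  mutate(across(c(extreme_poverty, female_smokers),
                ~ replace_na(.x, mean(.x, na.rm = TRUE)))) %>%
  ungroup()
result_grouped
```

Each NA is replaced by the mean of its own continent (for example, the Africa gap in `extreme_poverty` becomes 2 and the Europe gap becomes 20), and the original row order and column layout are kept.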

system · October 11, 2020, 10:53am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.
