Imputation to normalize variable got error of missing values

Sophialai · February 2, 2020, 1:26am

Hi R-lovers,

I am doing a school project and trying to clean up the data by normalizing a variable with a large value range and grouping it into two or three classes. The original missing values of 'avg_income' were already imputed with its mean, as shown in the image above, so it no longer has any missing values.
But when I run this function to group 'avg_income' into 'High' and 'Low' categories, it generates a lot of missing values.
Could anyone explain why this happens?
Below is my code, followed by an image of the summary of 'avg_income' after I run the function:

combine.avg_income <- function(x){
  if (is.na(x)){return(NA)}
  else if(x>47315){return('High')}
  else {return('Low')}
}

df$avg_income <- sapply(df$avg_income, combine.avg_income)
summary(avg_income)

[Image: summary of 'avg_income' after running the function]

Thanks in advance,
Sophia

technocrat · February 2, 2020, 2:12am

Hi. A couple of preliminaries: screenshots are seldom very helpful, while a reproducible example with data (actual or representative) attracts more answers. See the FAQ: What's a reproducible example (`reprex`) and how do I do one? Think of it as the human equivalent of R's lazy evaluation.

Also, there's a FAQ: Homework Policy

Let's simulate your problem by reducing it to the basics.

You already have a variable, df$avg_income, that's been scrubbed of NAs. I'm going to create a proxy from some made-up data and illustrate a tidy solution.

require(charlatan)
#> Loading required package: charlatan
require(dplyr)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
require(tibble)
#> Loading required package: tibble
phony <- enframe(ch_integer(n = 50000, min = 4940, max = 2000001)) 
phony %>% select(-name) %>% rename(income = value) -> phony
mid <- median(phony$income)
phony %>% mutate(category = ifelse(income < mid, "low","high"))
#> # A tibble: 50,000 x 2
#>     income category
#>      <dbl> <chr>   
#>  1 1724785 high    
#>  2 1094812 high    
#>  3 1322170 high    
#>  4 1308088 high    
#>  5 1066873 high    
#>  6 1386093 high    
#>  7 1616536 high    
#>  8  180569 low     
#>  9  123062 low     
#> 10 1004668 high    
#> # … with 49,990 more rows

Created on 2020-02-01 by the reprex package (v0.3.0)

ch_integer() doesn't respect set.seed(), so it will return a different result each time. And, of course, you don't need it with your real data.
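
For a version that gives the same result on every run, here is a minimal base R sketch of the same idea. It is an illustration under stated assumptions, not a diagnosis of the original data: the data frame is made up with sample(), the cutoff 47315 is taken from the question, and the new column name income_group is hypothetical. Writing the labels to a new column keeps avg_income numeric, so it can still be summarised or regrouped later.

```r
# Made-up data standing in for df$avg_income (hypothetical values).
set.seed(42)
df <- data.frame(
  avg_income = sample(4940:2000001, size = 1000, replace = TRUE)
)

# Cutoff taken from the original question.
cutoff <- 47315

# Vectorised grouping, stored in a new column (income_group is a
# hypothetical name) so that avg_income itself stays numeric.
df$income_group <- ifelse(df$avg_income > cutoff, "High", "Low")

# Checks: look for NAs in both columns and count rows per group.
anyNA(df$avg_income)
anyNA(df$income_group)
table(df$income_group, useNA = "ifany")
```

If table() reports NAs on real data, running anyNA() and class() on the source column just before the grouping step may help show whether it is still numeric and complete at that point.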

Sophialai · February 2, 2020, 2:33am

Hi technocrat,

Thanks a lot for the quick response.
I will test out your code and see whether it works for my case.
And thanks for the great tips on posting questions. I will definitely follow them the next time I have a question.

Good night,
Sophia
