Imputation to normalize variable got error of missing values

technocrat · February 2, 2020, 2:12am

Hi. A couple of preliminaries: screenshots are seldom very helpful, while a reproducible example (see the FAQ: What's a reproducible example (`reprex`) and how do I do one?) with data (actual or representative) attracts more answers. Think of it as the human equivalent of R's lazy evaluation.

Also, there's a FAQ: Homework Policy.

Let's simulate your problem by reducing it to the basics.

You already have a variable, df$avg_income, that's been scrubbed of NAs. I'm going to create a proxy from some made-up data and illustrate a tidy solution.

require(charlatan)
#> Loading required package: charlatan
require(dplyr)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
require(tibble)
#> Loading required package: tibble
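# Simulate 50,000 incomes; enframe() returns a tibble with columns name and value
phony <- enframe(ch_integer(n = 50000, min = 4940, max = 2000001))
# Drop the index column and rename value to income
phony <- phony %>% select(-name) %>% rename(income = value)
# Split at the median: below it is "low", at or above it is "high"
mid <- median(phony$income)
phony %>% mutate(category = ifelse(income < mid, "low", "high"))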
#> # A tibble: 50,000 x 2
#>     income category
#>      <dbl> <chr>   
#>  1 1724785 high    
#>  2 1094812 high    
#>  3 1322170 high    
#>  4 1308088 high    
#>  5 1066873 high    
#>  6 1386093 high    
#>  7 1616536 high    
#>  8  180569 low     
#>  9  123062 low     
#> 10 1004668 high    
#> # … with 49,990 more rows

Created on 2020-02-01 by the reprex package (v0.3.0)

ch_integer() doesn't respect set.seed(), so it will return a different result each time. And, of course, you don't need it with your real data.
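
If you want the made-up data to come out the same on every run, here is a minimal sketch that swaps ch_integer() for base R's sample(), which does respect set.seed(). The seed value 42 and the use of sampling with replacement are my assumptions, not part of the reprex above, and the categorizing step is written in base R so the block runs without any packages.

```r
# Assumption: base R's sample() stands in for charlatan::ch_integer(),
# drawing with replacement from the same min/max range used above.
set.seed(42)  # any fixed seed makes the draw repeatable

phony <- data.frame(
  income = sample(4940:2000001, size = 50000, replace = TRUE)
)

# Same split as before: below the median is "low", otherwise "high"
mid <- median(phony$income)
phony$category <- ifelse(phony$income < mid, "low", "high")

# Quick check of how many rows fall in each category
table(phony$category)
```

The same mutate() call from the reprex would work on this data frame too; only the data-generating line needs to change to get repeatable results.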