Replacement value function for handling text data

ronnie34 · February 5, 2018, 1:25pm

Hello,
I have the following code for a function that replaces the missing values in a given dataset (df) using a function (mvf):
foo <- function(df, mvf) {
  # Make data frame to clean
  df_cln <- df

  # Find the values to replace
  replacements <- sapply(df, mvf, na.rm = TRUE)

  # Loop over the columns, and replace the missing values
  for (col in seq_len(ncol(df_cln))) {
    # Get the replacement
    replacement <- replacements[col]

    # Get the positions of the missing values in the column
    missing_vals <- is.na(df_cln[, col])

    # Replace the missing values
    df_cln[missing_vals, col] <- replacement
  }

  # Return the cleaned data
  df_cln
}

The problem I am having is finding a function (mvf) that will replace the missing values with the mean of the column that the missing value is found in, so that it can work with my above foo() function.
Is there any help that could be suggested for this please?

cderv · February 5, 2018, 2:04pm

Without your df and mvf, I am not quite sure I understand the question. Can you provide them?

As a solution, or maybe just a hint if I misunderstood, here is a dummy example you can reproduce showing how to fill the NAs in a table with a value calculated for each column.

- Find the replacement values with a function applied to each column.
- Feed this named list to replace_na to replace each NA in each column with the corresponding replacement value.

library(tidyverse)

dummy <- tibble::tribble(
  ~ V1, ~ V2, ~ V3,
  1, NA, 3,
  NA, 2, 17,
  5, 8, NA,
)

mean_replace <- purrr::map(dummy, mean, na.rm = TRUE)
# a named list of replacement
str(mean_replace)
#> List of 3
#>  $ V1: num 3
#>  $ V2: num 5
#>  $ V3: num 10

dummy %>%
  tidyr::replace_na(mean_replace)
#> # A tibble: 3 x 3
#>      V1    V2    V3
#>   <dbl> <dbl> <dbl>
#> 1  1.00  5.00  3.00
#> 2  3.00  2.00 17.0 
#> 3  5.00  8.00 10.0

Created on 2018-02-05 by the reprex package (v0.1.1.9000).

ronnie34 · February 5, 2018, 2:17pm

So a dummy dataset would be

      X1     X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   <int> <int>        <dbl>       <dbl>        <dbl>       <dbl>   <chr>
 1    20    20          5.1         3.8          1.5         0.3  setosa
 2    21    21          5.4         3.4          1.7         0.2  setosa
 3    22    22          5.1         3.7          1.5         0.4  setosa
 4    23    23          4.6          NA          1.0         0.2  setosa
 5    24    24          5.1         3.3          1.7         0.5  setosa
 6    25    25          4.8         3.4          1.9         0.2  setosa
 7    26    26          5.0         3.0          1.6         0.2  setosa
 8    27    27          5.0         3.4          1.6         0.4  setosa

and mvf is the function I am having trouble finding.
The output should come from
foo(irismissing,meanreplacement)

irismissing[20:100,]

# A tibble: 81 x 7
      X1     X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   <int> <int>        <dbl>       <dbl>        <dbl>       <dbl>   <chr>
 1    20    20          5.1         3.8          1.5         0.3  setosa
 2    21    21          5.4         3.4          1.7         0.2  setosa
 3    22    22          5.1         3.7          1.5         0.4  setosa
 4    23    23          4.6        xbar          1.0         0.2  setosa
 5    24    24          5.1         3.3          1.7         0.5  setosa
 6    25    25          4.8         3.4          1.9         0.2  setosa
 7    26    26          5.0         3.0          1.6         0.2  setosa
 8    27    27          5.0         3.4          1.6         0.4  setosa

where xbar is the mean of the Sepal.Width column excluding the NA term

ronnie34 · February 5, 2018, 2:24pm

An example of the missing value function I have at the moment is

replacement1<-function(df){
  df[,which(colSums(is.na(df))>0)][is.na(df[,which(colSums(is.na(df))>0)])]=mean(df[,which(colSums(is.na(df))>0)],na.rm=T)
  return(df)
}

ronnie34 · February 5, 2018, 2:27pm

I tried generating another foo() function which is the following:

foo<-function(data,fun){
  return(fun(data))
}
replacement1<-function(df){
  df[,which(colSums(is.na(df))>0)][is.na(df[,which(colSums(is.na(df))>0)])]=mean(df[,which(colSums(is.na(df))>0)],na.rm=T)
  return(df)
}
T1<-foo(irismissing,replacement1)

However, when I ran this in my console, the dataset still contained the missing values, and I received a warning stating

Warning message:
In mean.default(df[, which(colSums(is.na(df)) > 0)], na.rm = T) :
  argument is not numeric or logical: returning NA

cderv · February 5, 2018, 2:28pm

Can you edit your question and reformat it so that it is more readable for us, separating text and code?

Use `some code` to produce inline code, or put ``` above and below your code, like this

```
some code
```

to produce

some code

Thank you!

Also, did you try applying replace_na in some way? This function lets you replace NA values in columns.

ronnie34 · February 5, 2018, 2:39pm

I've made the edits you suggested, but for some reason they're not appearing in the form I edited them in.
I also tried applying that but didn't have any luck.

cderv · February 5, 2018, 2:44pm

You have to put the triple backtick (```) above and below your code.
When you are writing a question in Discourse, you have a preview. Check the preview to see the changes before saving the edit.

Moreover, there is a button to do that too.

Select your paragraph of code and click on that button.

The idea is to provide something close to a reprex to help us help you. It is what I did above. It will allow me to copy paste your code more easily.

Have you tried to play with replace_na or not? Does it do what you want? If not, what is missing?

cderv · February 5, 2018, 2:49pm

I just saw that your question here is a duplicate of this other one that you posted:
https://forum.posit.co/t/missing-value-function

There is no need to ask the same question several times. I am cross-referencing the two questions so that they are linked.

ronnie34 · February 5, 2018, 2:50pm

I posted the question again since it wasn't clear in the other post. Sorry, my mistake.

cderv · February 5, 2018, 2:52pm

If I understand correctly, you can create your missing value function with replace_na. I showed you an example above.
What don't you understand?

ronnie34 · February 5, 2018, 2:53pm

The problem is that I'm trying to answer the question without using any libraries such as the tidyverse.

cderv · February 5, 2018, 3:18pm

That is the kind of critical information you have to state at the beginning. We can't guess...

Why do you not want to use libraries that exist to make your life easier? Sometimes you want to avoid dependencies, but when doing an analysis it is usually not worth it. Moreover, libraries like dplyr are optimized for performance and stability.

With that in mind, now about your code. Some advice:
When you are trying to debug something, try doing it step by step.

Is df[,which(colSums(is.na(df))>0)] working?
Is df[,which(colSums(is.na(df))>0)][is.na(df[,which(colSums(is.na(df))>0)])] working?
...

You will encounter error messages that will help you understand.
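
To make the step-by-step idea concrete, here is a minimal sketch on a toy data frame (`toy` and its columns are made up for illustration, they are not your `irismissing`). It assumes that, as seems to be the case with your data, the selection passed to `mean()` is still a data frame rather than a single numeric vector:

```r
# Hypothetical toy data with missing values in two columns
toy <- data.frame(a = c(1, NA, 5), b = c(NA, 2, 8))

# Step 1: which columns contain at least one NA?
na_cols <- which(colSums(is.na(toy)) > 0)
na_cols

# Step 2: what does the selection return? Still a data frame, not a vector
sub <- toy[, na_cols]
class(sub)

# Step 3: mean() on a data frame only warns and returns NA;
# capture the warning text so it can be inspected
msg <- tryCatch(mean(sub, na.rm = TRUE),
                warning = function(w) conditionMessage(w))
msg

# Step 4: the mean has to be computed per column instead
col_means <- sapply(sub, mean, na.rm = TRUE)
col_means
```

Step 3 should surface the same "argument is not numeric or logical" warning you saw in your console, which points at the right-hand side of your assignment as the part to fix.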

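Here, you are trying to replace all the NAs in all the columns in one step. An easier approach is to do it column by column:

  apply(df, 2, function(vec) {
    vec[is.na(vec)] <- mean(vec, na.rm = TRUE)
    vec  # return the filled column, not the value of the assignment
  })

This snippet applies a function to each column (margin 2). The function takes a column, finds its NA values, and replaces them with the mean of that column. Note that apply() converts the data frame to a matrix first, so it only behaves as intended when every column is numeric.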
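
Here is a base R sketch of the same column-by-column idea that keeps the result a data frame and leaves non-numeric columns such as Species untouched. The names `fill_na_with_mean` and `toy` are hypothetical, and skipping non-numeric columns is my assumption about what you want:

```r
# Hypothetical helper: replace NAs in every numeric column by that column's mean
fill_na_with_mean <- function(df) {
  # Only numeric columns have a meaningful mean
  num_cols <- vapply(df, is.numeric, logical(1))

  # Fill each numeric column and write it back; other columns are left as-is
  df[num_cols] <- lapply(df[num_cols], function(vec) {
    vec[is.na(vec)] <- mean(vec, na.rm = TRUE)
    vec
  })
  df
}

# Toy data, made up for illustration
toy <- data.frame(
  V1 = c(1, NA, 5),
  V2 = c(NA, 2, 8),
  Species = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

filled <- fill_na_with_mean(toy)
filled
```

Using lapply() on the data frame avoids the matrix conversion that apply() performs, so a character column does not interfere with the numeric ones.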

In your code there are some issues with the dimensions and with the way you are trying to replace the values. I won't go into detail, but you have to take care that the right-hand side (RHS) you compute matches the left-hand side (LHS) you want to assign it to. When I try your code, the LHS throws an error.
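
As a small illustration of matching the two sides, here is a sketch that assumes a toy data frame with only numeric columns (`num` is a made-up name). The logical matrix from is.na() selects the missing cells column by column, so the right-hand side has to supply one value per missing cell, in that same order:

```r
# Hypothetical all-numeric toy data
num <- data.frame(V1 = c(1, NA, 5), V2 = c(NA, 2, 8))

# LHS: a logical matrix with the same dimensions as num, TRUE where a value is missing
na_pos <- is.na(num)

# RHS: each column mean repeated once per missing cell in that column,
# so its length equals the number of cells selected on the LHS
rhs <- rep(colMeans(num, na.rm = TRUE), times = colSums(na_pos))

# Check that both sides have the same number of elements before assigning
stopifnot(length(rhs) == sum(na_pos))

num[na_pos] <- rhs
num
```

This only works when every column is numeric, since colMeans() needs numeric input. With a mixed data frame, the column-by-column approach is the safer route.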

Hope it is clearer to you now.