Hi everyone,
I have a data frame with an NA value and I need to remove it.
I tried functions like "na.omit", "is.na", "complete.cases" and "drop_na" from tidyr.
All of these functions work, but the problem is that they remove the whole row.
For example:
> library(tidyr)
> DF <- data.frame(x = c(1, 2, 3, 7, 10), y = c(0, 10, 5, 5, 12), z = c(NA, 33, 22, 27, 35))
> DF %>% drop_na(y)
   x  y  z
1  1  0 NA
2  2 10 33
3  3  5 22
4  7  5 27
5 10 12 35
> DF %>% drop_na(z)
   x  y  z
2  2 10 33
3  3  5 22
4  7  5 27
5 10 12 35
With these functions, I'm removing all the values in row 1.
What I want is to remove only the NA value from column z without deleting the values for x and y in that row. Maybe something like the table below, or some way of masking the value. Later I need to do a PCA, and I can't afford to lose such important data in x and y.
   x  y  z
1  1  0
2  2 10 33
3  3  5 22
4  7  5 27
5 10 12 35
I hope I explained my problem clearly enough.
Thanks in advance
I already googled it a lot, but all the solutions involve removing the column/row or replacing the NA with 0 or with the mean.
Your code works, but for me zero is a real value. That's why I was hoping there is a way to take out the NA values without replacing them with 0 or any other value.
Thanks a lot for your response.
I second Anirban's comment. NA stands for Not Available and is the way to represent a blank in R. You can't have columns of different lengths in a data frame or a matrix.
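To make that concrete, here is a small sketch using the DF from the first post. It only illustrates the point about column lengths; the names z_obs and z_mean are made up for this example.

```r
# Example data from the original post
DF <- data.frame(x = c(1, 2, 3, 7, 10),
                 y = c(0, 10, 5, 5, 12),
                 z = c(NA, 33, 22, 27, 35))

# Inside the data frame the NA has to stay: every column needs 5 entries
nrow(DF)

# Outside the data frame, z can be pulled out as a shorter vector
z_obs <- DF$z[!is.na(DF$z)]
z_obs

# Many summary functions can skip the NA on request, leaving x and y untouched
z_mean <- mean(DF$z, na.rm = TRUE)
z_mean
```

The shorter vector can't be put back next to x and y, which is why the NA acts as the placeholder in the data frame itself.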
Because I need to do a 3D PCA. I don't know why, but I have a problem with my NA values there.
For example, if I compute a Spearman correlation on a table containing NA values there is no problem, everything works. But when I run the PCA, I get an error because of the NA values. That's why I asked whether it is possible to remove them, or whether there is any other solution.
Apparently there is a package called missMDA which can handle "PCA with NA", but I have never used it!
PCA can take the correlation matrix as an argument. So, if you already have that, say R based on Spearman's correlation, you can try with princomp(covmat = R).
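A minimal sketch of that idea, assuming the Spearman matrix is computed with pairwise-complete observations (that choice is an assumption, the thread doesn't say how the correlation was computed). It reuses the DF from the first post; pca_cor is just an illustrative name.

```r
# Example data from the original post
DF <- data.frame(x = c(1, 2, 3, 7, 10),
                 y = c(0, 10, 5, 5, 12),
                 z = c(NA, 33, 22, 27, 35))

# Spearman correlation; each pair of columns uses the rows where both are present,
# so x and y still use all 5 rows
R <- cor(DF, method = "spearman", use = "pairwise.complete.obs")

# PCA from the correlation matrix alone (no raw data is passed in)
pca_cor <- princomp(covmat = R)

pca_cor$sdev       # standard deviations of the components
loadings(pca_cor)  # variable loadings
```

Two caveats: since only the matrix is supplied, princomp() can't return scores for the individual rows, and a pairwise-complete correlation matrix is not guaranteed to be positive semi-definite, in which case princomp() stops with an error.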
I can't add anything specific since I don't use PCA, but as a general piece of R advice, I'd encourage you to reframe your question @Amonda. It isn't necessarily that you need to get rid of NA values, but rather that you need to understand how PCA handles missing data and go from there. It seems like you're treating NA values as a nuisance or a bug, when they're very much a feature.
It sounds like you've got at least two avenues here so far: the missMDA package, and passing a Spearman correlation matrix to princomp(covmat = R).
It would be great if you could give both options a try and, depending on what you find, create a reprex (reproducible example) similar to what you had in your initial post. The reprex package would be helpful now that you're potentially adding other packages. You can find details on creating reprexes in the reprex package documentation.
I'm not a statistician, so my understanding of PCA is very vague. As far as I know, when you deal with missing values (NAs) in general, you have two basic options: delete the observation (i.e. the whole row) or impute the missing values. For the latter there are several methods to choose from, but I can't advise on which is more suitable for PCA.
Here is why you cannot just remove a value from a variable without removing the whole observation where the value is:
PCA is based on linear algebra: it works only with matrices and vectors, i.e. numerical variables. Because you are working with matrices, you can't remove a value from one variable while keeping the other variables in that row.
Even if a function exists that can deal with missing values for PCA, it will most likely still remove the whole observation in order to decompose your matrix.
Because PCA works with matrices, it assumes that you are providing a filled rectangle with r rows and n columns.
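As a small check of the "filled rectangle" point, the sketch below passes the DF from the first post straight to prcomp() and captures the error instead of stopping. The variable name msg is only for illustration.

```r
# Example data from the original post
DF <- data.frame(x = c(1, 2, 3, 7, 10),
                 y = c(0, 10, 5, 5, 12),
                 z = c(NA, 33, 22, 27, 35))

# prcomp() on the raw data frame: the NA in z makes the decomposition fail
msg <- tryCatch(
  {
    prcomp(DF)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

msg  # the error message raised by prcomp()
```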
Not knowing your data, I have no opinion on imputing.
Most PCA routines drop the entire row (or stop with an error) if there is even one missing value. The choice is to impute a value or delete the row. Imputing a single value is generally accepted in a large data set. Imputing multiple values makes people more uneasy. Try missingdata.org, or there are hundreds of sites that you can find by searching for the key words: imputing, missing, data. If you still need to remove NA you could convert all data to text, replace NA with a blank or a period, and then convert back to numeric. This is the brute force approach and will work if the people are creative when entering data.
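As a rough illustration of the "impute" side of that choice, the sketch below fills the single NA in z with the column mean and then runs prcomp(). Mean imputation is just the simplest option to show, not a recommendation for any particular data set, and DF_imp and pca_imp are names invented for this example.

```r
# Example data from the original post
DF <- data.frame(x = c(1, 2, 3, 7, 10),
                 y = c(0, 10, 5, 5, 12),
                 z = c(NA, 33, 22, 27, 35))

# Work on a copy so the original NA is still available for comparison
DF_imp <- DF

# Replace the missing z with the mean of the observed z values
DF_imp$z[is.na(DF_imp$z)] <- mean(DF_imp$z, na.rm = TRUE)

# With no NA left, prcomp() keeps all 5 rows, including x and y from row 1
pca_imp <- prcomp(DF_imp, scale. = TRUE)
summary(pca_imp)
nrow(pca_imp$x)
```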
Your trouble is that 'NA' designates a missing value. So, anything you replace it with must either still indicate a missing value, or be a value. You have no real alternatives. So, you can either fill in an estimated value (say the column average) or use some more sophisticated approach to estimating the missing value. R treats NA as a missing value.
Various routines in R deal with NAs in different ways, so your best approach is not to get fussy about the data if it is otherwise correct. Instead look at the commands you plan to use for your PCA. If you are employing prcomp(), look at the "na.action" section in help.
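For what it's worth, here is a sketch of what that looks like with the DF from the first post. In prcomp(), the na.action argument belongs to the formula interface, so the call below uses a one-sided formula; the object names are only illustrative.

```r
# Example data from the original post
DF <- data.frame(x = c(1, 2, 3, 7, 10),
                 y = c(0, 10, 5, 5, 12),
                 z = c(NA, 33, 22, 27, 35))

# na.omit: the row with the NA is dropped before the PCA is computed
pca_omit <- prcomp(~ x + y + z, data = DF, na.action = na.omit, scale. = TRUE)
nrow(pca_omit$x)  # scores for the 4 complete rows

# na.exclude: same fit, but the scores are padded back to 5 rows,
# with NA in the row that was left out
pca_excl <- prcomp(~ x + y + z, data = DF, na.action = na.exclude, scale. = TRUE)
pca_excl$x
```

Either way the incomplete row does not contribute to the components; na.exclude only keeps the row positions lined up with the original data.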