Good morning, dear all.
I'm writing because I've just started working with R for a biomedical application, and I'm analyzing a large dataset with many outliers and missing values. I visualized the outliers in columns 93 to 166 with a boxplot and extracted them with the following calls:
boxplot(DATAFRAME[, 93:166])
outvals <- boxplot(DATAFRAME[, 93:166], plot = FALSE)$out  # more than 1000 outliers in total
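As a side note, `$out` alone only gives the outlying values, not the columns they come from. A minimal sketch of how they could be attributed to columns, using the `group` and `names` components that `boxplot()` also returns. The data frame `toy` is a made-up stand-in for DATAFRAME[, 93:166]:

```r
# Toy data (hypothetical), standing in for DATAFRAME[, 93:166]
toy <- data.frame(a = c(1, 2, 3, 4, 100),
                  b = c(10, 11, 12, 13, 14),
                  c = c(-50, 5, 6, 7, 8))

bp <- boxplot(toy, plot = FALSE)

# bp$group holds, for each value in bp$out, the index of its column;
# wrapping in factor() keeps columns with zero outliers in the table
per_column <- table(factor(bp$names[bp$group], levels = bp$names))
print(per_column)
```

This at least shows whether the outliers are concentrated in a few variables or spread over all of them.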
At this point I have two questions:
- Apart from the boxplot method above, what is the state of the art for outlier detection on large datasets? Do I have to analyze the data variable by variable?
- How can I remove all of these outliers, or replace their values with NA, in a single step? I don't feel comfortable discarding such a large number of values, and at the moment I can only remove them one by one, so I would like to know of a faster procedure.
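To make the second question concrete, here is a rough sketch of the kind of procedure I have in mind, assuming the default boxplot rule (values beyond 1.5 times the interquartile range from the hinges) is an acceptable definition of an outlier. The helper `outliers_to_na` and the data frame `toy` are hypothetical names, not part of base R or of my data:

```r
# Hypothetical helper: replace values flagged by the boxplot rule with NA
outliers_to_na <- function(x, coef = 1.5) {
  # boxplot.stats() applies the same rule that boxplot() uses by default
  out <- boxplot.stats(x, coef = coef)$out
  x[x %in% out] <- NA
  x
}

# Toy data (hypothetical), standing in for DATAFRAME[, 93:166]
toy <- data.frame(a = c(1, 2, 3, 4, 100),
                  b = c(10, 11, 12, 13, 14))

# Apply the helper to every column while keeping the data frame structure
cleaned <- toy
cleaned[] <- lapply(toy, outliers_to_na)
print(cleaned)
```

On the real data the same idea would presumably be `DATAFRAME[93:166] <- lapply(DATAFRAME[93:166], outliers_to_na)`, which treats each variable separately and leaves existing NA values untouched.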
I'm sorry if these questions are basic, but I've only just started and would really like to improve my knowledge of data science.
Thanks in advance, have a nice day.