Outliers treatment methods for big data in RStudio

ilaria.cotellessa · April 9, 2020, 10:38am

Good morning, everyone.
I'm writing because I've just started working with R for a biomedical application, and I'm analyzing a large dataset with many outliers and missing values. I visualized the outliers with a boxplot and identified them with the following calls:

boxplot(DATAFRAME[, 93:166])
outvals <- boxplot(DATAFRAME[, 93:166], plot = FALSE)$out  # more than 1000 outlier values

At this point I have two questions:

Apart from the boxplot approach above, what is the state of the art for outlier detection on large datasets? Do I have to analyze the data variable by variable?
How can I remove all of these outliers, or replace their values with NA? I don't feel comfortable removing such a large number of outliers, and at the moment I can only remove them one by one, so I would like to know a faster procedure.

I'm sorry for the basic questions, but I've only just started and would really like to improve my knowledge of data science!
Thanks in advance, have a nice day.

fabianromero · April 9, 2020, 11:38pm

Hello @ilaria.cotellessa

I'm not sure I understood your question. I have used this code to hide outliers in a ggplot2 boxplot; note that it only affects the plot and leaves the data unchanged:

geom_boxplot(outlier.shape = NA)
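
If the aim is to change the data rather than the plot, the outliers can be replaced with NA column by column instead of one by one. Below is a minimal sketch, assuming the same 1.5 × IQR rule that boxplot() applies by default. The helper `replace_outliers_with_na` and the `toy` data frame are hypothetical names made up for illustration; they are not part of any package or of the original dataset.

```r
# Hypothetical helper: set values outside the boxplot whiskers to NA.
replace_outliers_with_na <- function(x, coef = 1.5) {
  if (!is.numeric(x)) return(x)              # leave non-numeric columns untouched
  out <- boxplot.stats(x, coef = coef)$out   # values beyond coef * IQR from the hinges
  x[x %in% out] <- NA
  x
}

# Toy data standing in for DATAFRAME[, 93:166] (assumption: all columns numeric)
toy <- data.frame(a = c(1, 2, 3, 4, 100),
                  b = c(10, 11, 12, 13, 14))

cleaned <- toy
cleaned[] <- lapply(toy, replace_outliers_with_na)

# How many values were set to NA in each column
colSums(is.na(cleaned))

# On the real data, the same idea would look like:
# DATAFRAME[93:166] <- lapply(DATAFRAME[93:166], replace_outliers_with_na)
```

Replacing with NA keeps the rows, so the other variables of each observation are preserved. Whether a flagged value is a genuine error or a real extreme measurement still has to be judged per variable; the rule only flags candidates.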

Good luck
