Suppose I have a dataset with 6 features: x1, ..., x6.
10% of the data have all 6 features observed (full_data).
The remaining 90% contain missing values (about 30% of all the values are missing). The missing values are scattered across all features. Let's call this miss_data.
All of these features are informative and should be included in a model to predict patient outcome. I intend to build a model on the 6 features to predict patient outcome (say, a Cox proportional hazards model).
My strategy is:
- Train an autoencoder (AE) on miss_data to learn the distribution of the 6 features and their relationships to each other.
- Use the trained AE to impute its own training dataset (miss_data) --> imputed_data
- Fit a Cox proportional hazards model on imputed_data: model = coxph(Surv(time, status) ~ x1 + x2 + x3 + x4 + x5 + x6, data = imputed_data)
- Test the model on full_data. Note that I haven't touched full_data up to this point.
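To make the steps concrete, here is a minimal R sketch of the workflow on simulated data. Several things in it are assumptions and not part of my real setup: the data are made up, a column-median imputer stands in for the trained AE (the sketch only shows where the AE would plug in), the helper names `feats`, `mask`, `impute_fit` and `cidx` are hypothetical, and the concordance index is just one possible way to "test" the model. The simulated split does not reproduce the exact 10%/90% proportions.

```r
library(survival)

set.seed(1)

# Simulated cohort (hypothetical; the numbers are not from my real data).
n <- 1000
feats <- paste0("x", 1:6)
X <- matrix(rnorm(n * 6), ncol = 6, dimnames = list(NULL, feats))

# Event times depend on all six features; censoring is independent.
lp_true <- drop(X %*% rep(0.3, 6))
event_time <- rexp(n, rate = exp(lp_true))
cens_time <- rexp(n, rate = 0.3)

# Knock out about 30% of the feature values in 90% of the rows.
miss_rows <- sample(n, 0.9 * n)
mask <- matrix(FALSE, n, 6)
mask[miss_rows, ] <- matrix(runif(length(miss_rows) * 6) < 0.3, ncol = 6)
X[mask] <- NA

dat <- data.frame(X,
                  time = pmin(event_time, cens_time),
                  status = as.integer(event_time <= cens_time))

# Split into fully observed rows and rows with at least one missing value.
is_complete <- complete.cases(dat[, feats])
full_data <- dat[is_complete, ]
miss_data <- dat[!is_complete, ]

# Steps 1-2: "train" the imputer on miss_data only, then impute miss_data.
# ASSUMPTION: column medians are a placeholder for the autoencoder.
impute_fit <- sapply(miss_data[, feats], median, na.rm = TRUE)
imputed_data <- miss_data
for (f in feats) {
  idx <- is.na(imputed_data[[f]])
  imputed_data[[f]][idx] <- impute_fit[[f]]
}

# Step 3: fit the Cox model on the imputed data.
model <- coxph(Surv(time, status) ~ x1 + x2 + x3 + x4 + x5 + x6,
               data = imputed_data)

# Step 4: evaluate on the untouched full_data via the concordance index.
# reverse = TRUE because a higher linear predictor means shorter survival.
full_data$lp <- predict(model, newdata = full_data, type = "lp")
cidx <- concordance(Surv(time, status) ~ lp, data = full_data, reverse = TRUE)
print(cidx$concordance)
```

Note that full_data is only used in the last step, so neither the imputer nor the Cox model sees it during fitting.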
Can someone comment on this strategy? Is it fine to treat the data like this?
Thank you very much!