Can I use imputed data as training data and then validate the model on real-world data?

Suppose I have a dataset with 6 features: x1, ..., x6.
In 10% of the data, all 6 features are fully available (full_data).
The remaining 90% contains missing values (about 30% of all the values). The missing values are scattered across all features. Let's call it miss_data.

All of these features are informative, and they should all be included in a model that predicts patient outcome. I intend to build a model on the 6 features to predict patient outcome (say, a Cox proportional hazards model).

My strategy is:

  1. Train an autoencoder (AE) using miss_data to learn the distribution patterns of the 6 features and their relationships to each other.
  2. Use the trained AE to impute its own training dataset (miss_data) --> imputed_data
  3. Fit a Cox proportional hazards model on imputed_data: model = coxph(Surv(time, status) ~ x1 + x2 + x3 + x4 + x5 + x6, data = imputed_data)
  4. Test the model on full_data. Note that I haven't touched full_data up to this point.
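
To make steps 2 to 4 concrete, here is a minimal sketch on simulated data. Several things in it are assumptions for illustration only: the data and effect sizes are made up, simple mean imputation stands in as a placeholder for the autoencoder, and Harrell's concordance index is just one possible way to score the predictions on full_data.

```r
library(survival)

set.seed(1)
n <- 1000
feats <- paste0("x", 1:6)

# Simulated features and survival times (all values here are made up).
X <- matrix(rnorm(n * 6), ncol = 6, dimnames = list(NULL, feats))
beta <- c(0.5, -0.4, 0.3, 0.2, -0.3, 0.4)
event_time <- rexp(n, rate = exp(drop(X %*% beta)))
cens_time <- rexp(n, rate = 0.3)
dat <- data.frame(X,
                  time = pmin(event_time, cens_time),
                  status = as.integer(event_time <= cens_time))

# 10% complete rows (full_data); 90% with about 30% of feature values missing.
is_full <- seq_len(n) <= 0.1 * n
full_data <- dat[is_full, ]
miss_data <- dat[!is_full, ]
for (f in feats) {
  miss_data[[f]][runif(nrow(miss_data)) < 0.3] <- NA
}

# Step 2 (placeholder): mean imputation stands in for the trained autoencoder.
imputed_data <- miss_data
for (f in feats) {
  col_mean <- mean(miss_data[[f]], na.rm = TRUE)
  imputed_data[[f]][is.na(imputed_data[[f]])] <- col_mean
}

# Step 3: fit the Cox model on the imputed data only.
model <- coxph(Surv(time, status) ~ x1 + x2 + x3 + x4 + x5 + x6,
               data = imputed_data)

# Step 4: score the untouched full_data with the fitted model.
eval_df <- data.frame(time = full_data$time,
                      status = full_data$status,
                      lp = predict(model, newdata = full_data, type = "lp"))

# reverse = TRUE because a higher linear predictor means shorter survival.
c_index <- concordance(Surv(time, status) ~ lp, data = eval_df,
                       reverse = TRUE)$concordance
print(c_index)
```

In this sketch full_data is never used for imputation or model fitting, so the concordance is computed on data the model has not seen.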

Can someone comment on this strategy? Is it fine to treat the data like this?

Thank you very much!

Are you looking to do inference or prediction?

My ultimate goal is prediction: I want to test how well the Cox model predicts outcomes in full_data.
