Can I use imputed data as training data and then validate the model on real-world data?

lkhangkv1995 · February 9, 2022, 7:52am

Suppose I have a dataset with 6 features: x1, ..., x6.
Only 10% of the rows have all 6 features available (full_data).
The remaining 90% contain missing values (about 30% of all the values). The missing values are scattered across all features. Let's call this miss_data.

All of these features are informative and should be included in a model to predict patient outcomes. I intend to build a model on the 6 features to predict patient outcome (say, a Cox proportional hazards model).

My strategy is:

1. Train an autoencoder (AE) on miss_data to learn the distribution of the 6 features and their relationships to each other.
2. Use the trained AE to impute its own training dataset (miss_data) --> imputed_data
3. Fit a Cox proportional hazards model on imputed_data: model = coxph(Surv(time, status) ~ x1 + x2 + x3 + x4 + x5 + x6, data = imputed_data)
4. Test the model on full_data. Note that I haven't touched full_data so far.
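For concreteness, the steps above could be sketched in R roughly as follows. This is only an illustration under loud assumptions: the data are simulated, the missingness pattern is invented, and a simple per-column median imputer stands in for the autoencoder (it is not an AE). Only `coxph`, `Surv`, and the names `full_data`, `miss_data`, and `imputed_data` come from the post; everything else is hypothetical.

```r
library(survival)

set.seed(1)
n <- 500
feat <- paste0("x", 1:6)

# Simulated stand-in for the real data (assumption: not the poster's data)
X <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(X) <- feat
lp <- 0.5 * X$x1 - 0.5 * X$x2              # assumed true linear predictor
dat <- cbind(X,
             time = rexp(n, rate = exp(lp)),
             status = rbinom(n, 1, 0.8))   # 1 = event, 0 = censored (toy censoring)

# Remove 1 to 3 feature values in 90% of rows, so 10% of rows stay complete
miss_rows <- sample(n, size = 0.9 * n)
for (i in miss_rows) {
  dat[i, sample(feat, sample(1:3, 1))] <- NA
}

# Split by completeness, as in the post
complete <- complete.cases(dat[, feat])
full_data <- dat[complete, ]
miss_data <- dat[!complete, ]

# Steps 1-2: placeholder imputer (column medians learned from miss_data only);
# the autoencoder described in the post would go here instead
med <- sapply(miss_data[, feat], median, na.rm = TRUE)
imputed_data <- miss_data
for (f in feat) {
  imputed_data[[f]][is.na(imputed_data[[f]])] <- med[[f]]
}

# Step 3: fit the Cox model on the imputed rows
model <- coxph(Surv(time, status) ~ x1 + x2 + x3 + x4 + x5 + x6,
               data = imputed_data)

# Step 4: evaluate on the untouched complete rows (concordance index)
cidx <- concordance(model, newdata = full_data)$concordance
print(cidx)
```

In this sketch the imputer never sees `full_data`, which matches the intent that the complete rows remain untouched until testing.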

Can someone comment on this strategy? Is it fine to treat the data like this?

Thank you very much!

Max · February 9, 2022, 12:41pm

Are you looking to do inference or prediction?

lkhangkv1995 · February 9, 2022, 2:24pm

My ultimate goal is prediction: I want to test how well the Cox model predicts outcomes in full_data.
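The thread does not say which performance metric is meant. One common choice for Cox models is Harrell's concordance index, so here is a minimal base-R sketch of it, assuming higher risk scores should correspond to shorter survival times. The function name `harrell_c` and the toy vectors are hypothetical and exist only for illustration.

```r
# Harrell's C: among comparable pairs, the fraction in which the subject
# with the earlier observed event also has the higher predicted risk.
harrell_c <- function(time, status, risk) {
  concordant <- 0
  comparable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # A pair is comparable when subject i has an observed event
      # strictly before subject j's event or censoring time
      if (status[i] == 1 && time[i] < time[j]) {
        comparable <- comparable + 1
        if (risk[i] > risk[j]) {
          concordant <- concordant + 1
        } else if (risk[i] == risk[j]) {
          concordant <- concordant + 0.5   # ties in risk count as half
        }
      }
    }
  }
  concordant / comparable
}

# Toy example: 4 subjects, the third is censored
time   <- c(2, 4, 6, 8)
status <- c(1, 1, 0, 1)
risk   <- c(3, 2, 2.5, 1)
c_toy <- harrell_c(time, status, risk)
print(c_toy)  # 0.8: 4 of the 5 comparable pairs are ordered correctly
```

A value of 0.5 corresponds to random ordering and 1 to perfect ordering, so on full_data the risk scores would come from the fitted model's linear predictor.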
