Hello. I have a question about modelling best practices when it comes to using a predictor that only applies to a given subpopulation of my dataset.
To be more specific, I am trying to predict result_A of a patient based on result_B. Here I also use other predictors such as age, gender and so on that are also available for all rows (patients) in the table.
However, I also want to incorporate previous_result_A to take into account clinical history which is obviously super important. The thing is not all rows will have previous_result_A, because not all patients would have done the test before. I did a bit of snooping around here and am under the impression that all I need to do is to create another column with 0 for rows that don't have previous_result_A and 1 otherwise, and then fill in the blanks in previous_result_A with 0 or some constant.
As a beginner, I'd just like a bit of clarification that that is indeed the way to proceed. And also whether that approach is applicable to all models. Thanks.
Good question; settling this early will save a lot of headaches. Take the simplest case:
Y \sim X + \epsilon
in which the response variable Y depends only on the treatment variable X plus some error term \epsilon.
X can be continuous, categorical, ordinal or binary.
There is a crucial difference between a binary 1/0 and a continuous 1/0: one is a logical value and the other is numeric. For previous_result_A, it's the difference between TRUE/FALSE (did or did not have a test) and 14/0, where 14 means the patient had the test and the result was 14, and 0 means the patient did not have the test but is recorded as if the result were 0.
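To make the logical-versus-numeric distinction concrete, here is a minimal sketch in plain Python. The patient rows, the column names had_previous_test and previous_result_A_filled, and the fill value of 0 are all assumptions made for illustration; none of them come from the original data.

```python
# Hypothetical patient rows; None marks a patient with no earlier test.
patients = [
    {"age": 54, "previous_result_A": 14.0},
    {"age": 61, "previous_result_A": None},
    {"age": 47, "previous_result_A": 9.5},
]

FILL_VALUE = 0.0  # assumed placeholder, not a real measurement


def add_history_columns(rows, fill=FILL_VALUE):
    """Return copies of the rows with a logical flag and a filled numeric column."""
    out = []
    for row in rows:
        prev = row["previous_result_A"]
        new_row = dict(row)
        # Logical value: did the patient have the test at all?
        new_row["had_previous_test"] = prev is not None
        # Numeric value: the measured result, or the placeholder if there is none.
        new_row["previous_result_A_filled"] = fill if prev is None else prev
        out.append(new_row)
    return out


encoded = add_history_columns(patients)
for row in encoded:
    print(row["had_previous_test"], row["previous_result_A_filled"])
```

The flag column is what lets a model tell a placeholder 0 apart from a measured value; without it, the second patient would look like someone who was tested and scored 0.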
Without knowing the overall study design, it's hard to say more than that X:boolean may or may not be useful. If it does discriminate among the study population in a useful way, then you subset to the patients who have a previous test and use the continuous previous_result_A value in a separate model.
Thank you for your response. All my predictors are continuous (except gender obviously), and they only have positive (non zero) results. previous_result_A will also be continuous and positive but will be blank when a patient hasn't done that test before.
I'm wondering if you could elaborate a bit on what you meant by:
"if it does discriminate among the study population" - what I understand you to mean here is that if all that matters is whether clinical history exists, then previous_result_A is simply a Boolean 1 or 0 and all is good; however, if the actual value matters, then
"subset and use the continuous value in a separate model" - is that correct?
I'm new to modelling and unsure how to even get started with the second part (or what it really means). Are there any resources you could point me towards, or helpful advice you could share?
Let previous_result_A be a continuous predictor X_i in the model Y \sim X_i + \dots + X_j + \epsilon. As the model is examined in stepwise fashion, the analyst will either discard or keep X_i depending on its significance at the chosen \alpha level and its explanatory power in terms of the test statistic.
Applying that model to the subset with TRUE previous_result_A and using the continuous values of X_i, a decision can be made whether the variable has explanatory power.
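As a rough sketch of that subset step, the plain-Python example below keeps only the rows with a previous result and fits an ordinary least-squares line of result_A on previous_result_A. It is simplified to a single predictor; the data, the function name fit_on_tested_subset, and the returned keys are hypothetical, and a real analysis with age, gender and the other predictors would normally use a regression library instead.

```python
import math

# Hypothetical data; None marks patients with no previous test.
rows = [
    {"previous_result_A": 10.0, "result_A": 11.0},
    {"previous_result_A": None, "result_A": 13.0},
    {"previous_result_A": 12.0, "result_A": 12.5},
    {"previous_result_A": 14.0, "result_A": 15.0},
    {"previous_result_A": None, "result_A": 9.0},
    {"previous_result_A": 16.0, "result_A": 16.5},
]


def fit_on_tested_subset(data):
    """Simple least-squares fit using only rows that have a previous result."""
    subset = [r for r in data if r["previous_result_A"] is not None]
    n = len(subset)
    if n < 3:
        raise ValueError("need at least 3 tested patients to estimate a slope and its error")
    xs = [r["previous_result_A"] for r in subset]
    ys = [r["result_A"] for r in subset]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("previous_result_A is constant in the subset")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares give R^2, the share of variance explained.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - sse / sst if sst > 0 else float("nan")
    # t statistic for the slope, to be compared against a t distribution with n - 2 df.
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    t_stat = slope / se_slope if se_slope > 0 else math.inf
    return {"n": n, "slope": slope, "intercept": intercept,
            "r_squared": r_squared, "t_stat": t_stat}


fit = fit_on_tested_subset(rows)
print(fit)
```

A large t statistic (relative to the chosen \alpha) together with a meaningful R^2 would suggest that the continuous value carries explanatory power within the tested subset; the untested patients are not part of this fit at all.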