Hi,
I have a question about the procedure with the low representation of several levels of categorical variable.
I've build logit model which try to anwser question Does an internship during the period of education affect the probability of employment until 3/6/12 months after the end of education.
My independent variable is build from question : Did you have any form of internship during your education?
And there are 6 possibilities:
no
optional internship
compulsory intership
volunteer work
5 yes, during practical vocational training as part of classes
work in line with the field of education
7.work incompatible with the field of education
As you can see on screenshot there is a good representation of option nr. 1,3,5 and the other are quite low.
decoded these categorical variables into binary variables and built such a model
logit1_3_mies = glm(data$stan_do_3_mies ~ praktyk_zajecia + staz_obow + wolontariat + staz_n_obow + prac_zgod + prac_niezgod, data = data,family = 'binomial')
So the base level is anwser "no" but there are 3 binary variable with representation <100. I think it affects model a lot but I am not sure if I should just delete it. What would be the base level then? All missing categories? Cause I want to check diffrence betweeen intership and no intership. Would be nice to keep optional intership but the presention is so low
I did 12 models, for 3months period, 6months period and 12months and divided them into whole data vs only higher education and then all 6 models into with control variable and without control variable ( that was current state, sex, education level, parents education) and the results are:
For model with only this variabe there is statistical significance for all categories ( inlduing obligatory and compulsory)
For models with control variables only obligatory intership is statistcial diffrent from 0
But the question is if such a small levels don't biased my Logit.
I am focusing more on statistical side of this casue I am intrested only with estiamtors and marginal efects but prediction accuracy of my model is sth around 60% without control variables and 75% with control variables.
The questions are levels with low representation, I feel like deleting this part from dateset is losing some information. Maybe some ROSE library and duplication? But hmm feels weird to duplicate a group of 12-100 people into 3000 I've seen on YT where someone use ROSE library to make 1000 people into 50000 feels more legit.
BTW. That's my bachelor thesis so I want to do it right