Categorical variables - not enough representation

staniach21 · July 31, 2023, 10:30pm

Hi,
I have a question about the procedure with the low representation of several levels of categorical variable.
I've build logit model which try to anwser question Does an internship during the period of education affect the probability of employment until 3/6/12 months after the end of education.

My independent variable is build from question : Did you have any form of internship during your education?
And there are 6 possibilities:

no
optional internship
compulsory intership
volunteer work
5 yes, during practical vocational training as part of classes
work in line with the field of education
7.work incompatible with the field of education

As you can see on screenshot there is a good representation of option nr. 1,3,5 and the other are quite low.
decoded these categorical variables into binary variables and built such a model
logit1_3_mies = glm(data$stan_do_3_mies ~ praktyk_zajecia + staz_obow + wolontariat + staz_n_obow + prac_zgod + prac_niezgod, data = data,family = 'binomial')

So the base level is anwser "no" but there are 3 binary variable with representation <100. I think it affects model a lot but I am not sure if I should just delete it. What would be the base level then? All missing categories? Cause I want to check diffrence betweeen intership and no intership. Would be nice to keep optional intership but the presention is so low

staniach21 · August 1, 2023, 10:11pm

Is there anyone that can help me with stastistical knowladge?

startz · August 1, 2023, 10:40pm

Just go ahead and keep all the categories and see what happens. Then do a test to see if optional and compulsory are both different from zero.

staniach21 · August 1, 2023, 11:31pm

I did 12 models, for 3months period, 6months period and 12months and divided them into whole data vs only higher education and then all 6 models into with control variable and without control variable ( that was current state, sex, education level, parents education) and the results are:

For model with only this variabe there is statistical significance for all categories ( inlduing obligatory and compulsory)
For models with control variables only obligatory intership is statistcial diffrent from 0

But the question is if such a small levels don't biased my Logit.

I am focusing more on statistical side of this casue I am intrested only with estiamtors and marginal efects but prediction accuracy of my model is sth around 60% without control variables and 75% with control variables.

The questions are levels with low representation, I feel like deleting this part from dateset is losing some information. Maybe some ROSE library and duplication? But hmm feels weird to duplicate a group of 12-100 people into 3000 I've seen on YT where someone use ROSE library to make 1000 people into 50000 feels more legit.
BTW. That's my bachelor thesis so I want to do it right

startz · August 1, 2023, 11:48pm

Small levels don't bias the results. Getting unbiased results depends on the sample size, not the values of individual variables.

system · August 22, 2023, 11:49pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.