Aim: To predict companies' credit ratings

Training Data: Internal data with financial numbers and financial ratios spanning across 3 years

Training Data Target Variable: Credit Rating with 20 discrete values

Training Data Remarks: Contains missing data

Scoring Data: External data from various data sources with financial numbers and financial ratios spanning across 3 years

Scoring Data Remarks: Contains more missing data than the Training Data; how much depends on the data source

Which method should I use to predict credit rating? Logistic regression comes to mind first. However, there are missing values in both the training and scoring data, so a lot of imputation would be needed and the model may not be accurate. I can accept predicting Credit Rating in 3 groups:

Group 1: A to F

Group 2: G to L

Group 3: M to V

rather than predicting the 20 discrete values from A to V.

I think getting good accuracy from a 20-class model will be challenging.
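The three-group target described above could be derived from the original ratings with a small mapping. This is only a sketch: it assumes each rating is stored as a single letter between A and V, and `rating_group` is a hypothetical helper name, not something from an existing library.

```python
def rating_group(rating: str) -> int:
    """Map a single-letter rating to group 1 (A-F), 2 (G-L) or 3 (M-V)."""
    letter = rating.strip().upper()
    # Assumption: ratings are single letters in the range A..V.
    if len(letter) != 1 or not ("A" <= letter <= "V"):
        raise ValueError(f"unexpected rating: {rating!r}")
    if letter <= "F":
        return 1
    if letter <= "L":
        return 2
    return 3


# Boundary ratings of each group, used as a quick sanity check.
ratings = ["A", "F", "G", "L", "M", "V"]
groups = [rating_group(r) for r in ratings]
print(groups)  # [1, 1, 2, 2, 3, 3]
```

Keeping the original 20-value column alongside the grouped one makes it easy to try both targets later.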

Can anyone advise me?

Your model choice is OK. As for the missing values, you can use interpolation to fill them in, or simply exclude those observations.

Tree-based models tend to work relatively well for classification. The number of classes is not as important as the number of examples per class. If you have a good number of examples for each class, then predicting 20 classes should be fine.
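A quick way to check the examples-per-class point is to count the labels before training. The sketch below uses made-up labels, and the `min_count=30` threshold is an arbitrary assumption for illustration, not a rule.

```python
from collections import Counter


def sparse_classes(labels, min_count=30):
    """Return the classes that have fewer than `min_count` examples."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_count}


# Hypothetical target column: two well-populated ratings and one rare one.
labels = ["A"] * 50 + ["B"] * 40 + ["C"] * 5
print(sparse_classes(labels))  # {'C': 5}
```

If many of the 20 ratings show up in this list, merging them into coarser groups is probably the safer target.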

If I remember correctly, `xgboost` can work with missing data out of the box, so you can try working with it and then compare results with data imputation.

Also, keep in mind that you are working with time data, so the train/test split should be based on time (e.g., the last 3 months of data are the test set and the rest is the training set). Otherwise, there is a high chance of data leakage.
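A time-based split like the one described above can be sketched as follows. The rows, the column names and the cut-off year are hypothetical; the point is only that everything before the cut-off goes to training and everything from the cut-off onward goes to testing.

```python
def split_by_year(rows, first_test_year):
    """Put rows before `first_test_year` in train and the rest in test."""
    train = [r for r in rows if r["year"] < first_test_year]
    test = [r for r in rows if r["year"] >= first_test_year]
    return train, test


# Hypothetical yearly records for one company.
rows = [
    {"company": "A", "year": year, "revenue": 100 + (year - 2013)}
    for year in range(2013, 2018)
]

train, test = split_by_year(rows, first_test_year=2017)
print(len(train), len(test))  # 4 1
```

A random split would instead mix later years into the training set, which is where the leakage risk comes from.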

Hi Peter_Griffin,

Can you advise which model, such as GBM or xgboost, I should use? The model should be able to handle missing values reasonably well. I would rather avoid interpolation or excluding observations.

Hi mishabalyasin,

Thanks for the quick reply.

I will leave missing data as they are and will not do any data imputation. I will give xgboost a try. Can you give more details on why you recommend xgboost?

As for the time data you mentioned, my data are all yearly financial numbers for different years. For example, company A has total revenue and other financial numbers for 2013-2017. I think there should be no issue, since the scoring data set will be another set of companies with the same set of yearly financial numbers.

`xgboost` along with deep learning probably cover around 90% of winning solutions on Kaggle.

It tends to give relatively good performance "out of the box", and there are many parameters you can tune for your specific problem.