New to modeling in R - Which model to choose?

My dataset is built around likelihood to default at the end of the month. I have been messing around with some glm() modeling to determine the probability of a yes/no outcome based on an initial set of input variables, but I do not know if this translates to my actual scenario, where the variables change every day leading up to the end of the month.

A customer may be very unlikely to default based on the initial variables, but as the month goes on that could change to very likely. However, in my existing glm() testing I am always using the initial variables from that first day (call it the first day of the month). Is there a way with glm() to factor in how a customer's values change day over day leading up to the end of the month, so I get a true probability based on all days so far, or do I need to expand to a different model type?
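For context, here is roughly the kind of snapshot-only fit I have been running; the variable names and data below are made up:

```r
# Hypothetical example data: one row per customer, predictors measured on
# day 1 of the month, plus the end-of-month default flag (0/1).
set.seed(1)
n <- 500
dat <- data.frame(
  balance_day1     = rlnorm(n, meanlog = 7, sdlog = 1),  # made-up predictor
  days_past_due_d1 = rpois(n, lambda = 2),               # made-up predictor
  defaulted        = rbinom(n, 1, 0.1)                   # end-of-month outcome
)

# Logistic regression on the first-day snapshot only
fit <- glm(defaulted ~ balance_day1 + days_past_due_d1,
           family = binomial(link = "logit"), data = dat)

# Predicted probability of default for each customer, from day-1 values alone
dat$p_default <- predict(fit, type = "response")
```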

Hey,
Time-series models are designed to handle data that changes over time. You can model the likelihood of default as a time series and use techniques like an autoregressive integrated moving average (ARIMA) model, seasonal-trend decomposition by loess (STL), or exponential smoothing.
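Not a full workflow, just a rough sketch of what those techniques can look like in base R, assuming you first aggregate the portfolio into a single daily default-rate series; all numbers below are simulated:

```r
# Hypothetical daily series: share of the portfolio flagged as past due on
# each of ~180 days (simulated numbers, weekly pattern baked in).
set.seed(2)
daily_rate <- ts(0.08 + 0.02 * sin(2 * pi * (1:180) / 7) + rnorm(180, 0, 0.01),
                 frequency = 7)  # frequency = 7 to expose a weekly cycle

# STL: seasonal-trend decomposition by loess
decomp <- stl(daily_rate, s.window = "periodic")
plot(decomp)

# ARIMA: fit a simple (1,0,1) model and forecast the next 30 days
fit_arima <- arima(daily_rate, order = c(1, 0, 1))
predict(fit_arima, n.ahead = 30)

# Exponential smoothing via Holt-Winters
fit_hw <- HoltWinters(daily_rate)
predict(fit_hw, n.ahead = 30)
```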

Reframe the question in terms of the outcome variable, default at the end of the month, conventionally symbolized Y, given the treatment (independent) variables, conventionally symbolized X_1, X_2, \dots, X_k:

P[Y=1 \mid X_1, X_2, \dots, X_k]

Now, make the simplifying assumption that X_1 \dots X_k are assessed only on the day before month's end. From past observational studies of customer populations, you are able to estimate the odds ratio for Y given specified values of X_1 \dots X_k. You have a proportion of customers with identical values of X_1 \dots X_k and elevated odds ratios for default.
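As a rough sketch of that estimation step, with made-up data and variable names, those odds ratios can come straight out of a fitted glm():

```r
# Hypothetical historical data: month-end predictor values plus the default
# flag, for customers from past months. Names and effects are invented.
set.seed(3)
hist_dat <- data.frame(
  balance       = rlnorm(1000, meanlog = 7, sdlog = 1),
  days_past_due = rpois(1000, lambda = 2)
)
hist_dat$defaulted <- rbinom(1000, 1, plogis(-3 + 0.3 * hist_dat$days_past_due))

fit <- glm(defaulted ~ balance + days_past_due, family = binomial,
           data = hist_dat)

exp(coef(fit))     # per-unit odds ratios for each predictor
exp(confint(fit))  # 95% confidence intervals on those odds ratios

# Estimated odds of default for one customer with specified (made-up) values
new_customer <- data.frame(balance = 12000, days_past_due = 5)
p <- predict(fit, newdata = new_customer, type = "response")
p / (1 - p)        # convert probability to odds
```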

These customers divide into four groups:

  • Those whose current values of X_1 \dots X_k were steady throughout the month
  • Those who began at a substantially higher risk of default but steadily improved to arrive at the current values of X_1 \dots X_k
  • Those who began at a substantially lower risk of default but steadily deteriorated to arrive at the current values of X_1 \dots X_k
  • The remainder, whose path to the current values of X_1 \dots X_k was highly erratic: some fell from their highs and some jumped from their lows on the day before

In other words, does history matter? And if so, how? Should the four groups receive differing treatment?
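One crude way to probe that inside glm() itself, sketched here with made-up data, is to derive a trajectory label per customer from the daily records and let it enter the model alongside the month-end value:

```r
# Hypothetical daily panel: one row per customer per day, with a single
# made-up risk driver `x`; higher x is taken to mean higher default risk.
set.seed(4)
n_cust <- 300
n_days <- 30
panel <- data.frame(
  id  = rep(seq_len(n_cust), each = n_days),
  day = rep(seq_len(n_days), times = n_cust),
  x   = rnorm(n_cust * n_days)
)

# Per-customer trajectory summaries: month-end level, linear trend, volatility
feats <- do.call(rbind, lapply(split(panel, panel$id), function(d) {
  data.frame(id    = d$id[1],
             x_end = d$x[d$day == n_days],
             slope = coef(lm(x ~ day, data = d))["day"],
             volat = sd(diff(d$x)))
}))

# Crude trajectory groups: steady / improving / deteriorating / erratic
feats$group <- with(feats, ifelse(volat > quantile(volat, 0.75), "erratic",
                           ifelse(slope >  0.02, "deteriorating",
                           ifelse(slope < -0.02, "improving", "steady"))))

# Simulated outcome so the example runs end to end
feats$defaulted <- rbinom(n_cust, 1, plogis(-2 + 0.8 * feats$x_end))

# Does history matter? Let the trajectory group shift, and interact with,
# the effect of the month-end value.
fit_hist <- glm(defaulted ~ x_end * group, family = binomial, data = feats)
summary(fit_hist)
```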

That would be the purpose of looking at the dailies to develop

P[Y=1 \mid X_{1,\,t_1 \dots t_T}, X_{2,\,t_1 \dots t_T}, \dots, X_{k,\,t_1 \dots t_T}]

where t_1 \dots t_T index the days of the month, using time series modeling.
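For the most literal glm() reading of that expression, before reaching for a dedicated time-series model, every daily value of a driver can enter as its own column; the sketch below uses made-up data and a single driver x:

```r
# Hypothetical data: 30 daily values of one made-up driver per customer,
# each day entering the model as its own column (wide format).
set.seed(5)
n_cust <- 300
n_days <- 30
daily_x <- matrix(rnorm(n_cust * n_days), nrow = n_cust,
                  dimnames = list(NULL, paste0("x_day", seq_len(n_days))))

wide <- as.data.frame(daily_x)
wide$defaulted <- rbinom(n_cust, 1, plogis(-2 + 0.8 * daily_x[, n_days]))

# Default probability conditioned on the entire daily history of x
fit_wide <- glm(defaulted ~ ., family = binomial, data = wide)

# With many correlated daily columns this overfits quickly, which is why the
# history is usually compressed (trend, volatility, recent level) or handled
# with a dedicated time-series / panel model instead.
summary(fit_wide)
```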
