Structuring data for prediction when multiple rows belong to the same group

Hi all,

I'm not sure if this is the right place for this question, apologies if I'm out of order.

I am trying to predict horse racing outcomes. I have a set of features for each horse in a race, and a results data set that tells me which horse won.

I am a little confused about how best to structure the data so that the predicted win probabilities for a particular race sum to 1.

In all other machine learning tasks I have done prior to this, each prediction is independent of others, but in this case a bunch of rows all belong to the same 'group'.

Any ideas will be highly appreciated. Thanks very much.

I thought a bit about it, and I don't think it is a given that the predicted probabilities in a race will sum to 1 if each horse is scored independently.

You can use your predictors and just do binary classification to predict whether a given horse is going to win (1) or not (0). Then you'll have a probability for each horse, and if you want, you can group by race and calculate softmax probabilities so that they sum to 1. It's a simplification, of course, but it seems to me that it can be a good approximation of your domain.
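To make the group-by-race step concrete, here is a minimal sketch using only the standard library. The function name `softmax_by_race`, the race and horse ids, and the scores are all made up for illustration. It assumes the scores are raw classifier outputs (for example log-odds); if you only have probabilities, dividing each one by the race total is a simpler alternative.

```python
import math
from collections import defaultdict


def softmax_by_race(rows):
    """Turn per-horse scores into win probabilities that sum to 1 per race.

    rows: iterable of (race_id, horse_id, score) tuples, where score is
    assumed to be a raw classifier output such as a log-odds value.
    Returns a dict keyed by (race_id, horse_id).
    """
    by_race = defaultdict(list)
    for race_id, horse_id, score in rows:
        by_race[race_id].append((horse_id, score))

    probs = {}
    for race_id, entries in by_race.items():
        # Subtract the race maximum before exponentiating for numerical stability.
        top = max(score for _, score in entries)
        exps = [(horse_id, math.exp(score - top)) for horse_id, score in entries]
        total = sum(value for _, value in exps)
        for horse_id, value in exps:
            probs[(race_id, horse_id)] = value / total
    return probs


# Hypothetical scores for two races.
rows = [
    ("race_1", "A", 1.2),
    ("race_1", "B", 0.3),
    ("race_1", "C", -0.5),
    ("race_2", "D", 0.0),
    ("race_2", "E", 0.0),
]
win_probs = softmax_by_race(rows)
```

The ranking within a race is unchanged by this step; only the scale changes so that each race's probabilities add up to 1.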

In real life, whether a horse wins of course depends on the other horses, so you should find a way to encode information about the other horses in the race. Something similar happens in football, hockey and any other sport: the probability that a given team wins on a given day depends on the opposing team, so it seems very similar to your problem.
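One way to encode the rest of the field is to add features measured relative to the other runners in the same race. The sketch below is only an illustration: the `rating` column, the derived feature names, and the helper `add_field_features` are hypothetical, and which relative features actually help is something you would need to test on your own data.

```python
from collections import defaultdict
from statistics import mean


def add_field_features(rows, feature):
    """Add features that describe each horse relative to its race field.

    rows: list of dicts, each with a "race_id" key and a numeric `feature` key.
    Returns new dicts; the input rows are not modified.
    """
    by_race = defaultdict(list)
    for row in rows:
        by_race[row["race_id"]].append(row)

    enriched_rows = []
    for race_rows in by_race.values():
        values = [row[feature] for row in race_rows]
        field_mean = mean(values)
        field_best = max(values)
        for row in race_rows:
            enriched = dict(row)
            # How far above or below the average runner this horse is.
            enriched[f"{feature}_vs_field_mean"] = row[feature] - field_mean
            # Distance to the strongest runner in the race (0 for the best horse).
            enriched[f"{feature}_gap_to_best"] = field_best - row[feature]
            enriched["field_size"] = len(race_rows)
            enriched_rows.append(enriched)
    return enriched_rows


# Hypothetical data: one race with three runners and a made-up "rating" feature.
horses = [
    {"race_id": "race_1", "horse_id": "A", "rating": 80},
    {"race_id": "race_1", "horse_id": "B", "rating": 70},
    {"race_id": "race_1", "horse_id": "C", "rating": 60},
]
with_field = add_field_features(horses, "rating")
```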

Does that make sense?


Maybe, instead of predicting whether a horse wins, you could predict the time it takes for a horse to complete the race. Then the winner would be the horse with the shortest time.


I think @nwerth has the correct idea. And then if you really want to get probabilities you can decide on an appropriate distribution for the times of each horse and do some bootstrap sampling to estimate the proportion of races each horse would win.
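As a rough sketch of that idea, the snippet below simulates many races by drawing each horse's finish time from a distribution and counting how often each horse is fastest. Everything specific here is an assumption for illustration: the normal distribution, the means and standard deviations, the independence between horses, and the helper name `simulate_win_probs`. Strictly speaking this is simulation from a fitted distribution rather than resampling observed times.

```python
import random


def simulate_win_probs(predicted, n_sims=10_000, seed=0):
    """Estimate win probabilities by simulating finish times.

    predicted: dict mapping horse_id -> (mean_time, std_time), assumed to come
    from a finish-time model. Times are drawn independently from a normal
    distribution, which is a simplifying assumption.
    """
    rng = random.Random(seed)
    wins = {horse_id: 0 for horse_id in predicted}
    for _ in range(n_sims):
        times = {
            horse_id: rng.gauss(mu, sigma)
            for horse_id, (mu, sigma) in predicted.items()
        }
        # The winner of a simulated race is the horse with the shortest time.
        winner = min(times, key=times.get)
        wins[winner] += 1
    return {horse_id: count / n_sims for horse_id, count in wins.items()}


# Hypothetical predicted finish times in seconds: (mean, standard deviation).
predicted_times = {"A": (60.0, 1.0), "B": (61.0, 1.0), "C": (62.5, 1.5)}
sim_probs = simulate_win_probs(predicted_times)
```

Because exactly one horse wins each simulated race, the estimated probabilities sum to 1 by construction.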

Thanks very much for your response. The softmax approach sounds like a very good and logical way to do it. I completely agree with your point about encoding information about the field for every horse as well, and I'm already doing that. Cheers.

Thanks @nwerth and @dstander for your responses. I completely agree with your approaches. Unfortunately, I don't have finish time data to model against, which is why I'm having to think about modelling win/not.