Saturday, May 1, 2010

Data Modeling the Kentucky Derby

The topic of the Kentucky Derby bubbled up in a conversation with a co-worker, and we talked about the odds of each horse winning. I have been taking a data mining class, and we are working through a lot of data modeling techniques and algorithms. Many of these are easy to run in a wonderful open-source Java application called WEKA, which is "a collection of machine learning algorithms for data mining tasks".

It dawned on me that it would be a good exercise to model data from the Kentucky Derby and try to predict the winner from a selected set of variables. Rick Janava has done some neural net analysis in the past to pick winners in six-furlong (0.75 mile) claiming races; the Kentucky Derby, by comparison, is run over 1.25 miles. Of his approach he reports: ".. so far in the first 300 races, 39% of the winners have been predicted at odds which average better than 4.5 to 1".
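To get a feel for what those quoted numbers would mean, here is a rough back-of-the-envelope sketch in Python. It is not Janava's method. It assumes flat staking (the same stake on every race) and treats every winning pick as paying exactly 4.5 to 1, which an average does not guarantee. The function name `flat_stake_return` is made up for illustration.

```python
def flat_stake_return(win_rate: float, odds_to_one: float, stake: float = 1.0) -> float:
    """Expected net profit per bet under flat staking.

    Assumption: every winning bet pays exactly `odds_to_one` to 1,
    and every losing bet loses the full stake.
    """
    expected_winnings = win_rate * odds_to_one * stake
    expected_losses = (1.0 - win_rate) * stake
    return expected_winnings - expected_losses


# Figures quoted above: 39% winners at average odds of 4.5 to 1.
expected = flat_stake_return(win_rate=0.39, odds_to_one=4.5)

# Win rate needed just to break even at 4.5 to 1 (hypothetical flat-odds case).
break_even = 1.0 / (4.5 + 1.0)

print(f"Expected profit per unit staked: {expected:.3f}")
print(f"Break-even win rate at 4.5 to 1: {break_even:.1%}")
```

Under these simplifying assumptions the break-even win rate is roughly 18%, so a 39% hit rate at those odds would be comfortably profitable. Real results depend on the odds of each individual winner, not just the average.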

Janava's work provides a good starting point for identifying the more significant variables to include in the model. Modeling the Kentucky Derby should also be a good exercise in finding out which WEKA learning algorithm or algorithms work best on this sort of data set. Hopefully by the next Kentucky Derby I will have some results to show for this problem.
