If we're being honest, most of our March Madness brackets imploded in some way over the last few weeks and are now hanging on by a thread with the Final Four just days away.
However, Will Geoghegan's men's bracket is still intact and sits in the top 0.2% of more than 14 million brackets on ESPN after he trained a machine learning model to fill it out for the Big Dance.
"I think it's cool that something like this can work well. Because we look at March Madness and we see all the craziness and all the upsets that no one saw coming, like Oral Roberts and UCLA," Geoghegan said. "But, at the end of the day, it's two one seeds and a two seed in the Final Four. And so these analytics can still be successful even in such a kind of volatile format as March Madness."
This isn't the first time the former professional runner, who now works in the computer science industry, has done something like this, but it's possibly the most success he's had with a sports machine learning model. Close to seven years ago, Geoghegan created a model to draft his fantasy football team.
It worked until Adrian Peterson, whom the model selected first, was suspended for the season.
"I've always liked kind of applying this stuff to things like sports because anything with a lot of data that's available, you can usually make a good model," Geoghegan said. "Sports and data definitely go hand in hand in this."
A few years later, he trained a machine learning model to fill out a March Madness bracket; however, it wasn't as successful as this year's because of overfitting. The model was too specific and complicated, so it learned the data he gave it really well but didn't generalize to games it hadn't seen.
"No matter how, you know, perfectly tuned your model is, these are still games that are being played and there's a huge element of randomness," Geoghegan said. "Not randomness from the player's perspective necessarily but from the model's perspective. Sometimes the worst team will win, and that's just how it goes. The biggest takeaway was just making kind of a good, general model that didn't try too hard to get everything right but just has a good kind of high-level map of where things stand."
Taking what he learned from his previous code and models, Geoghegan used AdaBoost, which he said is essentially "an algorithm for combining a collection of relatively weak predictors into a single strong predictor." He pulled data from the Massey Ratings instead of using player or game-level data.
Essentially, the model aggregated the opinions of experts who create the college basketball rankings. It used the seeds and the various ranking systems as weak predictors with training data going back to 2003.
"It's able to kind of find the relationships between them in a way to combine all of them into one kind of rating system," Geoghegan said. "If you get really into the math, you can prove that it's guaranteed to do better than the best single rating system."
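For readers curious what "combining weak predictors" looks like in practice, here is a minimal sketch of the AdaBoost idea in plain Python. It is not Geoghegan's code: the eight games, the three rating systems and every name in it are made up for illustration, and each "weak predictor" is simply one system's vote on which team it ranked higher.

```python
import math

# Hypothetical toy data: each row is one past game, each column is one rating
# system's vote (+1 = it ranked team A higher, -1 = it ranked team B higher).
# Results are +1 if team A actually won, -1 otherwise. None of this is real data.
VOTES = [
    ( 1,  1,  1), (-1,  1,  1), ( 1, -1,  1), ( 1,  1, -1),
    (-1, -1, -1), ( 1, -1, -1), (-1,  1, -1), (-1, -1,  1),
]
RESULTS = [1, 1, 1, 1, -1, -1, -1, -1]


def fit_adaboost(votes, results, rounds):
    """Return a list of (alpha, system_index, sign) weighted weak predictors."""
    n = len(votes)
    weights = [1.0 / n] * n          # start with every game equally important
    ensemble = []
    for _ in range(rounds):
        # Pick the rating system (or its inverse) with the lowest weighted error.
        best = None
        for j in range(len(votes[0])):
            for sign in (1, -1):
                err = sum(w for w, row, y in zip(weights, votes, results)
                          if sign * row[j] != y)
                if best is None or err < best[0]:
                    best = (err, j, sign)
        err, j, sign = best
        if err >= 0.5:               # no better than a coin flip: stop
            break
        err = max(err, 1e-10)        # avoid dividing by zero for a perfect system
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, sign))
        # Raise the weight of games this predictor got wrong, lower the rest.
        weights = [w * math.exp(-alpha * y * sign * row[j])
                   for w, row, y in zip(weights, votes, results)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, row):
    """Weighted vote of the chosen systems: +1 means 'pick team A'."""
    score = sum(alpha * sign * row[j] for alpha, j, sign in ensemble)
    return 1 if score >= 0 else -1


def accuracy(predictions, results):
    """Fraction of games where the prediction matched the result."""
    return sum(p == y for p, y in zip(predictions, results)) / len(results)


ensemble = fit_adaboost(VOTES, RESULTS, rounds=3)
combined = accuracy([predict(ensemble, row) for row in VOTES], RESULTS)
single = max(accuracy([row[j] for row in VOTES], RESULTS) for j in range(3))
print(f"best single system: {single:.2f}, combined: {combined:.2f}")
```

On this contrived data, each system alone misses two of the eight games, while the weighted combination gets all eight right. That is a result on the same games the sketch was fit to, so it shows the mechanics and says nothing about how well any real bracket would do.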
Within three hours, his model and bracket were set, and when he compared the result to the bracket he had filled out by hand, the picks were logical and not outrageous. Geoghegan said none of the picks really made him scratch his head too much.
And it worked. The model correctly predicted Rutgers over Clemson, USC over Kansas, Arkansas in the Elite Eight and Houston in the Final Four. The biggest miss, as with most brackets, was UCLA's overtime upset of Alabama.
The model also didn't predict Cinderella-esque teams like UCLA or Oral Roberts. The data stops with the end of the conference championships, so if a team, like those or Oregon State, suddenly gets hot in the tournament, the model most likely won't predict that.
The model originally predicted that Baylor would beat Gonzaga, 69–57. However, it now thinks the Zags will be crowned the national champions and become the first undefeated men's college basketball team since Indiana during the 1975–76 season.
In the future, Geoghegan is planning to use more data with a similar approach, since this system only looked at how teams were rated going into the tournament rather than how ratings changed throughout the season.
"I've always been into programming. There's a creative aspect to it, where you're starting with a blank file, and you're creating something," Geoghegan said. "And I think it's really cool on the data side to be able to take megabytes worth of ones and zeros and turn it into useful predictions about the future and about the world.
"Obviously, March Madness isn't as high impact as a lot of other applications of this stuff. But it's turning data into useful insights about the world we live in."