Survival Analysis: A Bracket Strategy
By John Ezekowitz, Harvard Sports Analysis Collective, and Luke Winn, SI.com
The image above is a network of all 68 NCAA tournament teams: where their regular-season schedules intersected, and who won (arrows out) or lost (arrows in) each game. It's a sight to behold -- but it also has the power to help predict what'll happen over the next three weeks. Read on, and we'll explain.
One thing we should be able to agree on, in a week where we all have different-looking brackets: The NCAA tournament is not like the regular season. Games are played on neutral courts for higher stakes. The goal is no longer to amass a "body of work" for the selection committee, but rather to survive and advance. It follows that the best forecasting models would be those that treat the NCAA tournament as a unique setting, rather than a collection of regular-season games, and try to assess each team's likelihood of survival.
While studying a statistical model typically used in clinical drug trials for a public health class at Harvard, John had an idea: What if he applied a similar "survival analysis" to teams in the NCAA tournament, to assess their risk of falling out of the bracket at each stage? The model he created, when retroactively applied to the past five NCAA tournaments, was able to outperform projections by kenpom.com and teamrankings.com, with an average of 44 correct picks per bracket and three out of five champions correctly identified (see chart below).
[Chart: NCAA tournament projections, 2007-2011]
The Survival Model doesn't ignore efficiency: it uses kenpom.com's adjusted offensive and defensive efficiencies, plus the site's strength-of-schedule ratings, as the "control" variables for its initial ordering of teams. The Survival Model then makes adjustments based on data that it found to correlate with NCAA tournament success. The four significant factors were:
• Consistency: how little a team's efficiency margin varied from game-to-game.
• Experience: a team's returning minute percentage multiplied by the number of NCAA tournament games in which it appeared last season.
• Out-Degree Network Centrality: This is where the spiderweb at the top of the post comes into play. The number of games a team played against NCAA tournament teams (network centrality) and the number of games it won against NCAA tournament teams (its out degree, or arrows running away from its network node) were significant. Different values were assigned to home, road and neutral wins within the network.
• The negative interaction of the Experience and Out-Degree Centrality variables. They get multiplied together to account for diminishing returns, so the model doesn't overestimate a team with a ton of experience and NCAA tournament games.
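To make the factors above concrete, here is a minimal Python sketch of how the consistency, experience and interaction inputs could be computed for one team. It is an illustration, not John's actual code: the function names are hypothetical, and using the standard deviation of game-by-game efficiency margin as the measure of consistency is our assumption.

```python
from statistics import pstdev


def consistency(efficiency_margins):
    """Spread of a team's game-by-game efficiency margin.

    Assumption: spread is measured as the population standard deviation.
    A lower number means a more consistent team.
    """
    return pstdev(efficiency_margins)


def experience(returning_minutes_pct, prior_tourney_games):
    """Returning-minutes share times last season's NCAA tournament games."""
    return returning_minutes_pct * prior_tourney_games


def experience_x_centrality(experience_value, out_degree):
    """Interaction term: the two variables multiplied together."""
    return experience_value * out_degree
```

A team that returns half of its minutes and played four tournament games last season would get an experience value of 0.5 x 4 = 2.0 under this sketch.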
Take another look at the network image at the top of the post, in which teams' nodes are sized according to their seed. You'll see one "isolate," or team that didn't face a single opponent from the NCAA tournament field all season. That's No. 14-seeded South Dakota State. In the five-year testing sample, isolates failed to win a single NCAA tournament game, although no isolate has ever been seeded as high as the Jackrabbits. The consistent teams that are major hubs of connectivity -- like Kentucky and Ohio State -- have the lowest risk of early failure.
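The network idea can be sketched in a few lines of Python. Given the tournament field and a list of regular-season results, the snippet below counts each team's games against the field, adds up its weighted wins (its out degree) and flags any isolates. The function name, the input format and especially the home/neutral/road weights are placeholders of ours; the values the model actually assigns are not given here.

```python
# Hypothetical location weights: the real values are not published in this post.
LOCATION_WEIGHT = {"home": 0.8, "neutral": 1.0, "road": 1.2}


def network_stats(field, games):
    """Summarize the schedule network among tournament teams.

    field: iterable of team names in the tournament.
    games: iterable of (winner, loser, winner_location) tuples, where
           winner_location is "home", "neutral" or "road".
    Returns (games_played, out_degree, isolates).
    """
    games_played = {team: 0 for team in field}
    out_degree = {team: 0.0 for team in field}
    for winner, loser, location in games:
        # Only games between two tournament teams belong to the network.
        if winner in games_played and loser in games_played:
            games_played[winner] += 1
            games_played[loser] += 1
            out_degree[winner] += LOCATION_WEIGHT[location]
    # An isolate never faced another team from the field.
    isolates = sorted(t for t, n in games_played.items() if n == 0)
    return games_played, out_degree, isolates
```

With a three-team field of "A", "B" and "C" and a single relevant result, ("A", "B", "road"), team "C" comes back as the lone isolate.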
Using all of this data -- the efficiency control variables, along with the consistency and the interactive experience/centrality numbers -- John used the Cox Proportional Hazards model to rank the tournament teams 1-68 based on their relative risk of failure. Using these rankings, we set about filling out the 2012 bracket. All picks but three in the opening round went in favor of the lower-risk team; to account for upsets, we knocked out the highest-risk No. 4 seed (Michigan), No. 5 seed (Temple, as long as it faces Cal), and No. 6 seed (San Diego State). This is the bracket the Survival Method produced:
The bracket is a bit chalky, in part because the selection committee paired the model's favorite No. 12s, Virginia Commonwealth and Long Beach State, with its favorite No. 5s, Wichita State and New Mexico. The Survival Method's biggest upset pick in the third round is Florida over Missouri, but it had a low degree of certainty on the Tigers, who had the highest rate of "variance" of any No. 1 or 2 seed.
That means Missouri's projected rate of survival was all over the map -- anywhere from flaming out in the third round to making a run to the title game. The model is not saying that the Tigers are doomed, just that it doesn't know what to make of them. Head over to Harvard Sports Analysis Collective for John's full explanation of the Survival Method, and his 1-68 ranking of this year's tourney teams.
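For readers who want a feel for the ranking step, here is a rough Python sketch. In a Cox Proportional Hazards model, a team's hazard relative to the baseline is exp(b1*x1 + b2*x2 + ...), where the x values are its inputs and the b values are fitted coefficients. The code below assumes the coefficients have already been fitted; the function names and any numbers passed in are purely illustrative, not the model's real estimates.

```python
import math


def relative_risk(features, coefs):
    """Hazard relative to baseline under a Cox model: exp(sum of b * x)."""
    return math.exp(sum(coefs[name] * features[name] for name in coefs))


def rank_by_risk(teams, coefs):
    """Order team names from lowest to highest relative risk of elimination."""
    return sorted(teams, key=lambda name: relative_risk(teams[name], coefs))


def pick(team_a, team_b, teams, coefs):
    """Default bracket rule: advance the lower-risk team."""
    return min((team_a, team_b), key=lambda name: relative_risk(teams[name], coefs))
```

This covers only the default "take the lower-risk team" rule; the three hand-picked upsets described above would sit on top of it as manual overrides.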