BCS Analysis, Part One

Mr Pac Ten
Posted Feb 27, 2010


Collegefootballnews’ Matthew Smith Looks at the BCS, what’s wrong with it and how to fix it, part one: Computer Rankings

It seems like almost every year there is a controversy involving the BCS rankings, either in terms of who gets to go to the national title game, who gets the at-large berths, or sometimes both. Over the course of a few articles, I’ll be looking at the problems with the BCS system and suggesting some practical, reasonable ways to make it better. This isn’t about overhauling the system and turning it into a plus-one or playoff (whether that’s a better way to go is another debate entirely); this is about making the current system better.

The first section will be about the computer ranking systems, what’s wrong with them and what needs to be done to fix them. Before you get started reading this article, I strongly recommend that you take a look at the article well-known sports statistician Bill James wrote on Slate a little over a year ago, link here. You may agree with his conclusions, or you may disagree with them, but he does a great job of laying out many of the issues that the system faces from the perspective of someone who is a heck of a lot more experienced at dealing with sports statistics than I am.

The way I see it, there are four fundamental issues with the current computer rankings arrangement, with what I consider to be the most important at the top:

1) A major lack of clarity and transparency about how each individual model actually works
2) Little to no audit trail
3) No apparent evaluation process of the models
4) A major lack of model diversity

Incidentally, here are the links to the ranking systems, as well as the overview from the BCS:
BCS Page
Sagarin
Anderson & Hester
Billingsley
Colley Matrix
Massey
Wolfe

To take the issues in order:
1) A major lack of clarity and transparency about how each individual model actually works
Let’s start with Sagarin. A quick look at his web page told me absolutely nothing about how his “ELO Chess” system worked. I had the bright idea of looking up the system on Google, and found the relevant entry on Wikipedia, link here. Apparently it would be too much trouble for someone who’s helping to move around millions of dollars between teams and conferences to bother being explicit and clear about what he’s actually doing. Fortunately I could find the details on the internet, though that’s assuming he hasn’t made any changes to the basic ELO methodology; no one seems to know one way or the other.
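
For reference, the basic Elo update is short enough to write down. The sketch below is the generic textbook version, not Sagarin's actual system: the 400-point scale and K-factor of 32 are the common chess defaults, and the function names are made up for illustration. Whether his BCS ratings use these values, or any other adjustments, is exactly the thing nobody outside seems to know.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Chance that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def elo_update(winner, loser, k=32.0):
    """Return (new_winner, new_loser) ratings after one game.

    Only the result matters; the score margin never enters the formula.
    k and scale are assumed defaults, not Sagarin's published values.
    """
    gain = k * (1.0 - expected_score(winner, loser))
    return winner + gain, loser - gain


# Two evenly rated teams: the winner gains k/2 = 16 points, the loser drops 16.
print(elo_update(1500.0, 1500.0))
```

Note that an upset moves the ratings more than an expected win does, because the gain is scaled by how unlikely the result was.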

Next up: Anderson & Hester. You can dig around their website as long as you want; you won’t find a single thing that explicitly says what they are or aren’t doing, beyond the fact that they don’t consider margin or any kind of preseason rankings. You’ll find some vague language about how they incorporate conference rankings, but nothing close to any real detail about how that might actually work.

Next up: Billingsley. Whatever you may think of his actual methodology (I’m not a fan), give him credit: it’s fairly clear what the heck he’s doing, as he gives an explanation of his system and how it works. He even provides this link to the data which feeds his system, as well as week to week calculations, which seem to fill in the holes from his system description.

Next up: Colley Matrix. He gives both a brief explanation of his system and how it works, as well as a more detailed explanation. He even provides links to online casinos! Wait… wasn’t he not supposed to do that? You know, since the NCAA (and presumably the BCS too) discourages that sort of thing? Anyway, the most relevant point is that he actually shows what he’s doing in terms of how his model works.
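
To show what that kind of openness buys you, here is a minimal sketch of the Colley method as it is generally described: each team's rating r solves the linear system (2 + games played) * r_i - sum of opponents' ratings = 1 + (wins - losses) / 2. The function name, team names and results are hypothetical, and this is a plain-Python illustration rather than Colley's own code.

```python
def colley_ratings(teams, games):
    """Solve the Colley system C r = b; games is a list of (winner, loser)."""
    idx = {t: i for i, t in enumerate(teams)}
    n = len(teams)
    # Start from C = 2*I and b = 1, then add each game's contribution.
    C = [[2.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [1.0] * n
    for winner, loser in games:
        w, l = idx[winner], idx[loser]
        C[w][w] += 1.0
        C[l][l] += 1.0
        C[w][l] -= 1.0
        C[l][w] -= 1.0
        b[w] += 0.5
        b[l] -= 0.5
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    aug = [row + [rhs] for row, rhs in zip(C, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return {t: aug[idx[t]][n] / aug[idx[t]][idx[t]] for t in teams}


# Hypothetical three-team round robin: A beat B and C, B beat C.
print(colley_ratings(["A", "B", "C"], [("A", "B"), ("B", "C"), ("A", "C")]))
```

In this round robin the ratings come out near 0.7, 0.5 and 0.3, and they always average 0.5. Anyone with the game results can rerun this and check the published numbers, which is the whole point.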

Next up: Massey. He gives a decent description of his “real” system (which does use margin), as well as a note elsewhere on his site that explains that his BCS system is like his “real” system, but without using margin (it’s unclear whether it factors in home/away splits, though). Ultimately, it’s there, for the most part… you just have to dig a while before you can figure it out.

Next up: Wolfe. He gives a brief explanation of his system and how it works, and then sends you to a list of academic papers that apparently are supposed to fully explain the “Bradley-Terry” model. I gave up on those, but digging around the internet found me this link that apparently explains in more detail how the model works… I think.
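
For what it's worth, the textbook Bradley-Terry model is simple to state: each team gets a strength p, and the chance that A beats B is p_A / (p_A + p_B). The sketch below fits those strengths with the standard iterative update (new strength = wins divided by the sum of 1 / (p_self + p_opponent) over every game played). It is an illustration under assumptions: the function name, iteration count and sample results are hypothetical, it only behaves sensibly when every team has at least one win and one loss, and nothing here says how Wolfe handles undefeated teams or any other detail.

```python
def bradley_terry(teams, games, iterations=200):
    """Fit Bradley-Terry strengths; games is a list of (winner, loser)."""
    wins = {t: 0 for t in teams}
    for winner, _ in games:
        wins[winner] += 1
    strength = {t: 1.0 for t in teams}
    for _ in range(iterations):
        updated = {}
        for t in teams:
            # Sum 1 / (p_winner + p_loser) over every game this team played.
            denom = sum(1.0 / (strength[w] + strength[l])
                        for w, l in games if t in (w, l))
            updated[t] = wins[t] / denom
        # Rescale so strengths average 1; only the ratios carry meaning.
        total = sum(updated.values())
        strength = {t: s * len(teams) / total for t, s in updated.items()}
    return strength


# Hypothetical series: A beat B twice, B beat A once.
p = bradley_terry(["A", "B"], [("A", "B"), ("A", "B"), ("B", "A")])
print(p["A"] / (p["A"] + p["B"]))  # modeled chance that A beats B
```

With A winning two of three, the fitted chance that A beats B lands at two in three, which is a handy sanity check that the fit is doing what it claims.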

So one system (A&H) provides zero detail behind what they’re doing; two systems (Sagarin, Wolfe) sort of describe what they’re doing, and let you dig around the internet if you actually want to figure out what’s going on; one system (Massey) sort of explains his system, as long as you’re willing to dig around his own site; and two systems (Billingsley, Colley) actually provide pretty good descriptions that are easily accessible.

Maybe I’m just expecting too much, but I consider that to be completely unacceptable. Not just some, not just most, but ALL OF THE SYSTEMS should make it easy for anyone to find explanations of what they’re actually doing. Vague descriptions of what they’re trying to do, needing to google mathematical model details to get any kind of idea what’s going on, or having to dig through a website just to find out how a model works… it’s amazing, and really disturbing, that with so much money at stake the BCS systems (and by extension, the BCS itself, since they’re supposed to be in charge) can’t be bothered to clearly explain their systems, much less show their work.

How in the world is anyone supposed to have faith in the system when it’s organized in such a haphazard, unprofessional manner? If the BCS wants to have any credibility when it comes to the models it uses, the first thing it absolutely needs to do is provide, on its own site, clear, detailed and accurate descriptions of each of the models and how they work. This is a no-brainer, considering the millions of dollars at stake, as well as the congressional hearings about the BCS.

2) Little to no audit trail
This ties to the first point. To be taken seriously, the BCS needs to create clear audit trails, not just for some of their models, but for all of them. Right now, I have zero clue whether anyone from the BCS or anyone else has ever audited the models, their inputs or their results, much less whether they currently audit all the models on a regular basis. Some of the models say they use Wolfe’s data, which is accessible online, but for all we know one of them could have missed a couple of rows of data, or had a broken formula in the model that swapped South Carolina for Southern California, or had any number of other things go wrong.

I don’t know how easy it would be to arrange, but the BCS needs to clearly state that the data, methodologies and results have been audited by independent, professional data auditors, or provide full working models of each of the six systems every week for the public to have the opportunity to do audits themselves, or preferably both. This is less of a no-brainer than providing clear model descriptions to the public, because of the money involved in hiring auditors, hosting large models for public downloads, or both, but again, considering the amount of money at stake, this really needs to happen.

3) No apparent evaluation process of the models
As far as I can tell, there is no organized system in place to monitor systems and their methodologies to make sure that the systems make sense, and the results are reasonable. Right now, the people running the systems basically need to send the BCS their results, watch them get posted, and that’s that.

If any of the models give results that look especially wacky, there could be a public outcry, and then maybe it would get replaced, but if that’s really how the oversight and evaluation “process” works, then there is a huge problem. There should be real standards used to evaluate models on an ongoing basis, and they should be made public.

I’m not saying that there is one specific set of standards that I think the BCS has to use; there are plenty of ideas that could be used, from predictive (how well do the week 9 model outputs predict week 10 winners) to retro-dictive (how well do the week 9 model outputs tie to the results from weeks 1 – 9) to simply comparing the model outputs to the various human polls (almost certainly NOT a good idea, as the human polls have substantial flaws of their own). You could even create an evaluation process where each model maker picks from a list of evaluation standards prior to every year, which has the advantage of building diversity among model types, with some being more predictive and others aligned to various other standards. The important point isn’t so much what specific process gets used, but rather that some sort of process gets used, because right now there’s no public standard, and I strongly suspect no official, organized standard at all, even privately used by the BCS.
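
The predictive and retro-dictive checks described above come down to the same simple count, just run against different sets of games. A minimal sketch, assuming a model's output is a table of team ratings and a "pick" is simply the higher-rated team; the function name, team names, ratings and results are all invented for illustration:

```python
def pick_accuracy(ratings, games):
    """Share of games won by the team the ratings had higher.

    ratings: {team: rating}; games: list of (winner, loser) pairs.
    """
    correct = sum(1 for winner, loser in games
                  if ratings[winner] > ratings[loser])
    return correct / len(games)


# Hypothetical week 9 ratings and results, invented for illustration only.
week9_ratings = {"Team A": 0.81, "Team B": 0.64, "Team C": 0.42, "Team D": 0.30}
games_weeks_1_to_9 = [("Team A", "Team B"), ("Team B", "Team C"),
                      ("Team D", "Team C"), ("Team A", "Team D")]
games_week_10 = [("Team B", "Team A"), ("Team C", "Team D")]

# Retro-dictive: how well do week 9 ratings tie to games already played?
retrodictive = pick_accuracy(week9_ratings, games_weeks_1_to_9)
# Predictive: how well do the same ratings call the next week's winners?
predictive = pick_accuracy(week9_ratings, games_week_10)
print(retrodictive, predictive)
```

Run over several seasons for each model, a pair of numbers like these would give the BCS a public, repeatable yardstick, whichever standard it decides to favor.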

I should clarify this point slightly. I’m saying that there should be some sort of standard to make sure that the models are “good” (to whatever definition). It isn’t necessary that a model be the “best” at whatever standard, be it predictive, retro-dictive, or whatever; if you force a model to exactly conform to some set standard, then you’ve defined the model, and lost diversity (thanks to Ken Massey for making this point). However, if a model is supposed to be accurate on a retro-dictive standard and it turns out to be substantially less accurate than, say, the arbitrary standard of “‘pick’ the team that ended up with the best record; if they had the same record, ‘pick’ the home team” (completely ripped off from TMQ’s NFL columns), then maybe it’s not a good model and needs to be replaced. And if the model-makers have no idea what sort of metric to use because they have no idea what their model is really supposed to be doing, then maybe it’s not a good model and needs to be replaced.
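
That baseline is easy to compute, which is part of its appeal as a floor. A minimal sketch, assuming “best record” means final winning percentage and that each game is stored as a hypothetical (home team, away team, home_won) tuple; the function name and sample results are made up for illustration:

```python
def baseline_accuracy(games):
    """Retro-dictive accuracy of 'pick the better final record, else home'.

    games: list of (home, away, home_won) tuples, with home_won a bool.
    """
    wins, played = {}, {}
    for home, away, home_won in games:
        for team in (home, away):
            played[team] = played.get(team, 0) + 1
            wins.setdefault(team, 0)
        wins[home if home_won else away] += 1
    correct = 0
    for home, away, home_won in games:
        home_pct = wins[home] / played[home]
        away_pct = wins[away] / played[away]
        # Equal records fall back to the home team.
        pick_home = home_pct >= away_pct
        if pick_home == home_won:
            correct += 1
    return correct / len(games)


# Hypothetical results: A wins twice at home, then loses at B.
sample = [("A", "B", True), ("A", "C", True), ("B", "A", True)]
print(baseline_accuracy(sample))
```

Here the rule gets two of the three games right, missing only A's road loss. A retro-dictive model that can't clear this bar on real results has some explaining to do.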

And on the same note, one thing which would increase the credibility and usefulness of the computer ranking systems enormously is a requirement for the people running the models to provide ongoing commentary about their results. There should be a mandate for the models to not just spit out results, but also to be able to defend any results which “look weird” to the public.

If a model thinks that Team X is severely underrated by the human polls, being able to explain that it’s because they’ve been red-hot lately, or because they’ve had a nightmarish schedule that’s not being given enough weight, or because their two losses were both by less than a field goal in overtime, or whatever, turns a model from just producing values on a table into becoming a useful part of the public discourse on evaluating teams. It turns a weakness into a strength.

Moreover, every model out there should be required to explain and defend their results towards the end of the regular season, both the final result set and the one before (so as to actually inform the debate rather than just spit out a series of numbers). Whether that means they’d be required to talk about just their top 2 (for the title race), their top 10 (for BCS at-large bids), or something else, I don’t know. What I do know is that they should be required to do at least something towards the end of the year. Again, it turns the weakness of a system that just spits out numbers (and who the hell knows whether they’re right, reasonable, etc.) into the strength of a system that meaningfully contributes to the debate. And if that means pay the model-makers more money, or if it means hire a couple of technical writers to help them put those details together, then it’s worth it.

4) A major lack of model diversity
If you read Bill James’ piece, linked at the top, this is more or less tied to his biggest concern, or as he put it: “the ground rules of the calculations are irrational and prevent the statisticians from making any meaningful contribution.” I’m not quite as adamant as he is that any modeling system which ignores margin, home/away etc. is fundamentally broken, but I do think that it’s a fundamental problem that NONE of the systems are allowed to consider these aspects. It’s an especially worrisome problem because ignoring margin biases the computers against teams that have relatively weak schedules and relatively high margins of victory, which generally means the teams from non-BCS conferences, like Boise St, Utah, TCU, etc. Considering the anti-trust issues that are being raised, is it really a good idea for the BCS to be so obviously biased in this direction?

For what it’s worth, I like the fact that the BCS has some systems that explicitly devalue margin, because even though it hinders accuracy in evaluating teams, it does reward sportsmanship, or at least not reward bad sportsmanship. That said, throwing it out completely in all models is flat-out silly.

If you don’t believe me, just ask the guys who are actually running BCS computers. Sagarin, on his own website, says: “In ELO CHESS, only winning and losing matters; the score margin is of no consequence, which makes it very ‘politically correct’. However it is less accurate in its predictions for upcoming games than is the PURE POINTS, in which the score margin is the only thing that matters.” Translation: I’m publishing and putting my name on a crappy system, but I’m OK with this because I get to be part of the process and have people look at this other system of mine which I do believe in.

Massey isn’t nearly as explicit in his contempt of the system he’s using for the BCS (and may well not feel that way to nearly the same degree), but it’s telling that he’s built a long description of his “real” model (the one that considers score and venue, not just who won and lost), and didn’t bother building one for the BCS model.

The other four seem to be happy with not using margin, but it’s telling that one of the six, by his own words, seems to think it’s BS, and another sure isn’t banging the “no-margin” drum. And these are the people who are actually running BCS models.

Fixing this problem goes hand in hand with the idea of continually evaluating systems. There should be a variety of useful systems out there, and there should be a defined, ongoing process for evaluating them, and selecting which ones get used and which ones don’t.

If the BCS thinks that using margin isn’t a good idea, they should prove it. Take the no-margin models they’re currently using, and compare them to models that do use margin (a good starting point would be Sagarin and Massey’s own “real” models). Look at, say, the last four years’ results, and see how they do in predictive, retro-dictive, or whatever other standards might be useful. Come up with an answer, one way or the other, as to what works best. And make the process public, so the world can see how it works.

Ultimately, the BCS suffers from an enormous credibility problem. Even if it can bypass any Congressional action, as long as the public doesn’t have faith in it, it will continue to suffer from protests, and it will continue to be a constant source of controversy. The first step towards creating that faith is to fix the part of it which is understood the least, the computer rankings.

The computer models are far from the only thing that needs fixing, of course, but they comprise one area which clearly does need to be fixed, which for the most part can be fixed reasonably easily and controversy-free (though parts 3 and 4 might be a bit tough), and which at least has the potential to make a big positive impact on public opinion. If the BCS can turn an opaque, seemingly arbitrary process which almost no one understands into a clear, accurate process which most people do understand, that’s a huge step forward. And if they can make the computer processes actually contribute meaningfully to the evaluation process instead of just being cogs in the machine, that would be another huge step forward. Now they need to summon the will to make it happen.

Mr Pac-10's 2009 Blog

Questions, comments or suggestions? Email me at cfn_ms@hotmail.com