There seem to be three reasonable options:
- Sort by mu with a cutoff based on variance or number of games
- Sort by mu-k*sigma for some k between 1 and 3
- Implement the isotropic leaderboard's algorithm and be done with it
I'm pretty sure that nobody wants the fourth option, sorting by mu with no cutoff.
I actually do want option number four. The only problem with it is that the system is probably quite wrong: playing one game, no matter how well you do, shouldn't get you a high enough rating to sit at number one on the leaderboard. That's actually an empirical question, though - does this make for a better rating system than the alternative or not? You all react against the mu sort because you think it should probably be worse. I tend to agree with that line of thought, but it's an empirical question, and if we're right about it, that really just means the whole rating system is bad. Well, I suppose I would probably prefer an 'active' leaderboard, such that you fall off after a certain period of inactivity, but still, I definitely want a mu sort.
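To make the options concrete, here is how the different sort keys treat a toy leaderboard (the players and numbers are invented for illustration): a lucky newcomer tops a raw mu sort, while the cutoff and the conservative sort each handle them differently.

```python
# Toy leaderboard entries: (name, mu, sigma, games_played). All invented.
players = [
    ("veteran",  30.0, 1.0, 2000),
    ("newcomer", 35.0, 7.0,    1),   # one lucky win, huge uncertainty
    ("regular",  28.0, 1.5,  300),
]

# Option 4: raw mu sort, no cutoff -- the newcomer tops the board.
by_mu = sorted(players, key=lambda p: -p[1])

# Option 1: mu sort with a minimum-games cutoff (threshold is arbitrary).
MIN_GAMES = 20
by_mu_cutoff = sorted((p for p in players if p[3] >= MIN_GAMES),
                      key=lambda p: -p[1])

# Option 2: conservative mu - k*sigma sort; k = 3 as discussed below.
k = 3
by_conservative = sorted(players, key=lambda p: -(p[1] - k * p[2]))
```

The cutoff hides the newcomer entirely; the conservative sort keeps them visible but ranks them last until their sigma shrinks.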
If I understand correctly, the purpose of sorting by mu-k*sigma is the same as that of implementing a cutoff. In both cases, the goal is to keep the top of the leaderboard from being filled up by mediocre players who have been lucky in a small number of games. Either option deviates from a rating system's one truly objective goal: estimating the probability that any given player beats another.
Well, I don't actually think this is the goal of either, particularly of the mu-k*sigma sort, where I think the goal is to spur more playing. But you're quite right on the goal of a rating system, and this is really why I would want a mu sort - you are expected to be better than every player below you and worse than every player above you. This isn't the case with the current system.
Microsoft Research appears to advocate the mu-k*sigma approach,
If they have any research saying this, it's market research. Seriously, they put this in their general information about the system, but you don't see it in the scholarly papers, and there's really no statistical backing for it.
but they don't take a strong stance on what k should be used. Using any k>0 means sorting players by a deliberate underestimate of their actual skill, but the degree of that underestimate varies with k. With k=3, a player's rank derives from the skill level that we're 99.9% confident is below their actual skill. To me, that seems a little excessive and possibly unfair to new players. This is what the leaderboard on drunkensailor.org is doing now, and it seems to be what Goko does as well.
We actually still don't really know what Goko does. For sure they have some uncertainty thing such that playing more helps your rating, but for all I know it actually gets folded into a single rating number and not separated out as a mu and sigma kind of thing.
Isotropic used k=1.
Actually, iso used k = 3. You might be confused because the numbers they showed were mu+/-3*sigma, so it looks like they just subtracted the two numbers. But the second number displayed was 3*sigma, not just sigma.
In other words, a player's rank derived from the skill level that was 84% certain to be below their actual skill level.
Except that this assumes that players' skills are normally distributed, which isn't true. But I've covered this.
This is still conservative, but not nearly as brutal to new/lucky players as mu-3*sigma. Iso also used an unusually high starting uncertainty: sigma=mu instead of sigma=mu/3. I'm not sure what the motivation for this was, but it explains why Iso had "levels" as high as 53 and as low as -35, while mine runs from 29 to -3.
As I've explained above, you have this wrong, because they displayed 3*sigma and not sigma. But actually yours running from -3 to 29 is not something in your favor - if the ratings were actually normally distributed, then based on your number of players you would have a much bigger range (of mu!) than you do.
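Under the Gaussian assumption being disputed above, the confidence figures both sides quote come straight from the standard normal CDF; a quick check:

```python
from math import erf, sqrt

def cdf(x):
    """Standard normal CDF: P(Z < x)."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# Confidence that true skill exceeds mu - k*sigma, *if* the posterior
# really is Gaussian (the caveat above applies):
confidence = {k: cdf(k) for k in (1, 2, 3)}
# k=1 -> ~84%, k=2 -> ~97.7%, k=3 -> ~99.87%
```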
Finally, note that sigma appears to converge to 0.80 with my standard TrueSkill implementation. A great many of the experienced players have sigmas between 0.79 and 0.82, and the lowest sigma in all of Goko is 0.78. On Iso, uncertainties never seem to have gotten below 6.5, and they didn't converge nearly as uniformly.
They don't actually converge to 0.80; it's just very hard to get lower than that by playing the way people actually do. I would have to look, but you would either need higher draw rates, or you'd need to do something like play very weak players a lot and win a lot. It comes down to how the updating equations work: there's enough uncertainty in each game that you can't get below this in practice. I don't think sigmas should ever go down to 0, though, because you really can't ever be totally sure of someone's skill. Anyway, iso's were higher because they incremented upward a little bit with every day that passed (and with every game? I can't recall exactly), which meant that to get them very low, you not only needed to do what you need to do for your system, you needed to play a heckuva lot, all the time.
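The floor described here falls out of the standard two-player TrueSkill update equations. A sketch of the mechanism, using assumed default parameters (not the actual Goko or iso constants) and ignoring draws: sigma shrinks after every game, but the dynamics term tau re-inflates it first, so the two eventually balance.

```python
from math import erf, exp, pi, sqrt

def pdf(x):
    """Standard normal density."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# v and w are the usual TrueSkill mean/variance correction factors
# for a win with draw margin 0.
def v(t):
    return pdf(t) / cdf(t)

def w(t):
    return v(t) * (v(t) + t)

MU0, SIGMA0 = 25.0, 25.0 / 3   # TrueSkill defaults (assumed here)
BETA = 25.0 / 6                # per-game performance noise
TAU = 25.0 / 300               # dynamics: uncertainty re-added each game

def winner_update(mu, sigma, mu_opp, sigma_opp):
    """Two-player TrueSkill update for the winner."""
    s2 = sigma ** 2 + TAU ** 2             # inflate sigma before the game
    c2 = 2 * BETA ** 2 + s2 + sigma_opp ** 2
    c = sqrt(c2)
    t = (mu - mu_opp) / c
    mu_new = mu + (s2 / c) * v(t)
    s2_new = s2 * (1 - (s2 / c2) * w(t))
    return mu_new, sqrt(s2_new)

# Beat an evenly matched opponent 1000 times: sigma drops quickly at
# first, then flattens at a floor where the per-game shrinkage just
# cancels the tau added before each game -- it never reaches zero.
mu, sigma = MU0, SIGMA0
history = []
for _ in range(1000):
    mu, sigma = winner_update(mu, sigma, mu, SIGMA0)
    history.append(sigma)
```

With a larger tau (as in iso's daily increments) the floor sits correspondingly higher.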
Anyway, the real thing to me is that the proof is in the pudding. You go with the system that best measures things, and the only way we have of telling that is its predictions, so you go with the system that predicts best. Since you are actually only making predictions centred on the value of mu, that is what you should be sorting by.
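To see why the predictions are centred on mu: in the standard TrueSkill win-probability formula, the sigmas only pull the predicted probability toward 50%, they never flip which player is favoured. A sketch, with an assumed default beta:

```python
from math import erf, sqrt

BETA = 25.0 / 6   # TrueSkill's default performance noise (an assumption)

def cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def p_win(mu_a, sigma_a, mu_b, sigma_b):
    """Predicted probability that player A beats player B (draws ignored)."""
    c = sqrt(2 * BETA ** 2 + sigma_a ** 2 + sigma_b ** 2)
    return cdf((mu_a - mu_b) / c)

# Whoever has the higher mu is always the predicted winner; larger
# sigmas just push the prediction closer to a coin flip.
edge_confident = p_win(26, 1, 25, 1)   # small sigmas: sharper prediction
edge_uncertain = p_win(26, 8, 25, 8)   # big sigmas: nearer 50%, still > 50%
```

So a mu sort and a "sort by predicted favourite" are the same ordering, which is the argument being made here.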
Edit: Incidentally, it has been suggested at points in the past that I make such comments in a self-serving way. That is pretty clearly not the case here: if I have counted correctly, this change would help me relative to 8 players, leave me unchanged relative to 139 (besides myself), and hurt me relative to 7,153.