Goko's rating system is swingy as hell. Since human skill changes only slowly over time, any estimate of it that's all over the place is horrible.
There are a few problems with this: 1) we don't actually know if it's swingy, really, because we don't know how much 5 points or 50 points or 500 points mean (admittedly, this isn't exactly a point in the system's favor)
We do know this from looking at the relative rankings. To give an example: earlier today I was briefly above Stef in the Goko rankings after winning a few games in a row, i.e. according to their rating system I would be a favorite in our next match-up, even though we have both played thousands of games on their site, over a hundred of them against each other. Goko's conclusion is clearly absurd, because there's no doubt in my (or isotropish's) mind that Stef is the better player. For some more examples, see this post by Andrew.
Well, I can go rebut that post if you want. Also, isotropish had Stef below you for a while several days ago, so there's some doubt in its mind (and even if we ignore that, it has you two within about one standard deviation right now, so it thinks there's a reasonable chance). Further, I guess I think players' skills move faster than you do, because I certainly wouldn't cite thousands of games as if they're all relevant. E.g., since the beginning of the month, you've played 99 games, 13 against Stef. He's played 168. You're 7-6 against him this month. Last month, you were 12-8 against him. No games in January. 1-3 in December. November, 3-2. Based on these results, anyway, you're pretty clearly better than him. I don't know what hundreds of games from ancient times you're banking on to support your "he's better than me", at least heads-up.
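For what it's worth, the month-by-month records quoted above can simply be tallied (a quick sketch — the per-month numbers come from the paragraph above, the aggregation is mine):

```python
# Head-to-head records quoted above, as (wins, losses) per month
records = {
    "this month": (7, 6),
    "last month": (12, 8),
    "December": (1, 3),
    "November": (3, 2),
}

wins = sum(w for w, _ in records.values())
losses = sum(l for _, l in records.values())
total = wins + losses

print(f"aggregate record: {wins}-{losses} ({wins / total:.1%} over {total} games)")
```

That's a winning record over the recent stretch, which is all the argument above leans on.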
2) While skill (presumably, though I don't have better than anecdotal evidence) doesn't change rapidly over time for most players, this doesn't mean that the best estimate of said skill isn't going to move around a bit
It should move around a bit when it has little evidence, but when it has hundreds or even thousands of games on you including a ton of recent data points, it should be changing very conservatively. And Goko's system isn't just moving around a bit, it's bouncing wildly to the tune of white noise.
When there are a ton of recent data points, it ought to move towards what those say, ignoring to some extent the older data. And again, you don't know how much it's moving around, because you only see points, and you don't know how much points are worth. It could be that all this random fluctuation is just bumping between 50.001% and 49.999%. Yes, you have rank data, but you don't have how MUCH it favors anyone against anyone else.
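To make the "you don't know how much points are worth" point concrete: Goko hasn't published any mapping from point gaps to win probabilities, which is exactly the problem. Purely for illustration, here's what various gaps would imply under an assumed Elo-style logistic curve with the conventional scale of 400 — both the functional form and the scale are guesses, not anything Goko has documented:

```python
def win_prob(point_diff, scale=400):
    """Win probability for the higher-rated player under an assumed
    Elo-style logistic mapping; `scale` is a guess, not Goko's value."""
    return 1 / (1 + 10 ** (-point_diff / scale))

for diff in (5, 50, 500):
    print(f"{diff:>4}-point gap -> {win_prob(diff):.1%} favorite")
```

Under these assumptions a 5-point swing barely moves the implied probability off 50%, while 500 points would be a heavy favorite — but with a different scale the same point swings could mean almost anything, which is the point.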
3) I don't actually agree with your assessment that their system is 'swingy as hell' - it actually doesn't seem to move that much at all to me. On the other hand, the isotropish ratings seem INCREDIBLY sluggish.
We clearly have very different expectations here, because I think the isotropish ratings are much too volatile still. After having played thousands of games and a sufficient volume recently, it shouldn't be possible to change by a few levels within a single day: the prior of your skill having changed significantly over a very short time-span should be close to zero (by the nature of skill acquisition and decay), so that any significant deviation from expectation over a small sample should be judged as a fluke and thus only very slightly affect ratings.
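The "prior of your skill having changed should be close to zero" idea can be made concrete with a toy one-dimensional Gaussian tracker (my sketch, not isotropish's actual TrueSkill machinery): when the assumed per-game drift in true skill is tiny, the posterior variance collapses after many games, and even a wildly surprising result barely moves the estimate.

```python
def update(mu, var, obs, obs_var=100.0, drift_var=0.01):
    """One Kalman-style Gaussian update of a skill estimate.
    drift_var is the assumed per-game variance of true skill change;
    setting it near zero is what makes the estimate conservative."""
    var += drift_var                   # skill may have drifted slightly
    gain = var / (var + obs_var)       # how much to trust the new result
    mu += gain * (obs - mu)
    var *= (1 - gain)
    return mu, var

mu, var = 25.0, 70.0    # vague starting estimate
for _ in range(1000):   # long history of results around true skill 30
    mu, var = update(mu, var, 30.0)

before = mu
mu, var = update(mu, var, -50.0)  # one wildly surprising game
print(f"one upset moved the estimate by {mu - before:.3f}")
```

All the constants here are made up; the takeaway is only the shape of the behavior — with near-zero drift variance, a single shock game shifts a well-established estimate by well under a point.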
Well, when I lose something like 15 out of 20 against someone (who was ranked reasonably high to start with) and it still has me at like 80% against them, I have to assume isotropish is too stodgy.
To be more clear here, the math doesn't really back you up. If my model has it as a 2% chance Bob beats Tim in any game, and Bob beats Tim 10 games in a row, there's one chance in something on the order of 10^17 of that happening. Your model is wrong, and needs to move. I don't care if you have 10k games, your ratings need to move significantly. Obviously, this is a pretty dramatic example, but even in more realistic scenarios you can pretty quickly get to things that are 1 in a thousand or worse, very very easily. Now it's possible that you just had that random luck pop up, but I think it's more likely that the players' skills weren't accurately recorded, possibly because of at least some degree of skill change.
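The arithmetic behind both examples is easy to check — the 2% figure is the hypothetical above, and the 15-of-20 at 80% is the scenario from a few paragraphs up:

```python
from math import comb

# Bob, a 2% underdog per game, winning 10 straight:
p_streak = 0.02 ** 10

# An 80% favorite winning 5 or fewer of 20 games (i.e. losing 15 of 20):
p_tail = sum(comb(20, k) * 0.8**k * 0.2**(20 - k) for k in range(6))

print(f"10 straight wins at 2%:       about 1 in {1 / p_streak:.0e}")
print(f"<= 5 wins of 20 at 80%:       about 1 in {1 / p_tail:,.0f}")
```

The second number — roughly one in several million — is why "your model is wrong and needs to move" is the sensible reading of that 15-of-20 run, rather than writing it off as variance.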
To make this a testable prediction: I predict that a running 30-day average of the isotropish ratings will be a significantly better predictor of the outcome of match-ups between players than the ratings as they currently are.
You think a rating taken as an average of the last 30 days of isotropish ratings will be a significantly better predictor of the outcome of match-ups between players than... the current isotropish ratings? There are lots of holes here that would need to be filled before this can be considered a testable prediction. First, how are you averaging? Arithmetic mean of mu and arithmetic mean of sigma? Do you have the historical data to calculate this? How do you time-average the data, since it updates in real time? Over what time period are you taking the measurement? Perhaps most important, how do you want to define "better predictor"? You need some kind of error function.
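On the "better predictor" point: the standard choices would be something like log-loss or Brier score over a fixed evaluation window. A sketch of the comparison machinery — the probabilities and outcomes below are made up for illustration, not actual rating data:

```python
from math import log

def log_loss(preds, outcomes):
    """Mean negative log-likelihood of the observed outcomes; lower is better."""
    return -sum(log(p if o else 1 - p) for p, o in zip(preds, outcomes)) / len(preds)

def brier(preds, outcomes):
    """Mean squared error of the predicted probabilities; lower is better."""
    return sum((p - o) ** 2 for p, o in zip(preds, outcomes)) / len(preds)

# Hypothetical win probabilities from the two candidate systems for the
# same five games (1 = first player won), purely to show the comparison.
outcomes = [1, 0, 1, 1, 0]
current  = [0.70, 0.60, 0.55, 0.80, 0.50]  # current isotropish ratings
averaged = [0.65, 0.45, 0.60, 0.75, 0.40]  # 30-day-averaged ratings

print("log-loss:", log_loss(current, outcomes), "vs", log_loss(averaged, outcomes))
print("brier:   ", brier(current, outcomes), "vs", brier(averaged, outcomes))
```

Whichever system scores lower on the chosen error function over a pre-registered window of games would win the bet; picking the function (and the window) up front is exactly the hole that needs filling.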