Topic: Goko ratings drift -- Now with data! (Read 13923 times)

ashersky · « **Reply #25 on:** June 21, 2013, 04:17:36 am »

Quote from: WanderingWinder on June 20, 2013, 11:51:17 pm

Quote from: dnkywin on June 20, 2013, 11:15:45 pm
Will Goko ever release how they calculate ratings? That would be _really_ helpful.
It's apparently too complicated to possibly write down.

Is there no way you math whizzes can reverse engineer the formula by having two players create new accounts and play a specific number of games a day?

DStu · « **Reply #26 on:** June 21, 2013, 04:57:10 am »

Quote from: ashersky on June 21, 2013, 04:17:36 am

Is there no way you math whizzes can reverse engineer the formula by having two players create new accounts and play a specific number of games a day?

That would imply that it is possible to write it down, and we all know that it isn't...

Seriously, it should at least be possible to fit TrueSkill parameters, and from how well they fit one could say more. In principle, there are infinitely many ways to set up such a model, especially if you want to account for unreasonable choices, which I assume we must. The more degrees of freedom the parameter space you are searching has, the more difficult that gets.

But it couldn't hurt to have the data. It would be best to have many (~100) games from one day, because that ignores whatever they do overnight; the overnight adjustment would be better fitted separately once you know what they usually do. Resigning at turn 1 is fine (or is it?). Can one automate this with a browser extension?

WanderingWinder · « **Reply #27 on:** June 21, 2013, 08:08:29 am »

Quote from: DStu on June 21, 2013, 04:57:10 am

Quote from: ashersky on June 21, 2013, 04:17:36 am
Is there no way you math whizzes can reverse engineer the formula by having two players create new accounts and play a specific number of games a day?
That would imply that it is possible to write it down, and we all know that it isn't...

Seriously, it should at least be possible to fit TrueSkill parameters, and from how well they fit one could say more. In principle, there are infinitely many ways to set up such a model, especially if you want to account for unreasonable choices, which I assume we must. The more degrees of freedom the parameter space you are searching has, the more difficult that gets.

But it couldn't hurt to have the data. It would be best to have many (~100) games from one day, because that ignores whatever they do overnight; the overnight adjustment would be better fitted separately once you know what they usually do. Resigning at turn 1 is fine (or is it?). Can one automate this with a browser extension?

The second half of this: are you talking about setting up our own system, or reverse engineering theirs? I don't think we should be so married to TrueSkill as a system, but OK, whatever. But I really don't see why we would try to force Goko's system into a TrueSkill framework.

Exactly reverse engineering their system would be very hard. But getting a good approximation should be possible. The biggest things needed are to work out how much is gained just by playing (some kind of uncertainty decrease we can't see) and what the expected win% is for a given point differential, which is probably only roughly possible.

DStu · « **Reply #28 on:** June 21, 2013, 08:13:29 am »

I was thinking about taking TS, maybe with some additional degrees of freedom (not only Gaussian measures, but also trying something else), and trying to fit the parameters to their data.

Edit: But of course you are right, hitting theirs exactly will be hard, as there are tons of degrees of freedom, and it may well be that it's not one simple process but a stack of patches.

But in the end it's "just" fitting two functions from \R^4->\R^2, namely
ratingUpgradeIfYouWin( yourRating, yourUncertainty, opponentRating, opponentUncertainty) -> (yourNewRating, yourNewUncertainty) and
ratingUpgradeIfYouLose( the same )

The biggest problem is that you do not observe Rating and Uncertainty, only the single number Level(rating, uncertainty), but this should be manageable given enough data, especially if we assume that Level = Rating - a*Uncertainty for some constant a.

Edit2: And of course this assumes that rating and uncertainty are the only parameters that Goko uses to model your skill, but I have seen nothing that makes me think otherwise.
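
A minimal Python sketch of the setup described above. Everything in it is hypothetical: the Elo-style update rule, the constants `a`, `k`, `scale` and `shrink`, and all the names are illustrative stand-ins, not Goko's method. It only shows the shape of the problem: two hidden numbers per player go in, two come out, and an outside observer sees nothing but the level.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    rating: float       # hidden from observers
    uncertainty: float  # hidden from observers


def level(s: Skill, a: float = 3.0) -> float:
    """Assumed display rule from the post: Level = Rating - a * Uncertainty."""
    return s.rating - a * s.uncertainty


def expected_score(me: Skill, opp: Skill, scale: float = 400.0) -> float:
    """Elo-style logistic expectation (an assumption, not Goko's curve)."""
    return 1.0 / (1.0 + 10 ** ((opp.rating - me.rating) / scale))


def rating_update(me: Skill, opp: Skill, won: bool,
                  k: float = 0.1, shrink: float = 0.98) -> Skill:
    """Toy update: the step scales with own uncertainty, which shrinks every game."""
    surprise = (1.0 if won else 0.0) - expected_score(me, opp)
    return Skill(me.rating + k * me.uncertainty * surprise,
                 me.uncertainty * shrink)


fresh = Skill(5000.0, 500.0)
after_win = rating_update(fresh, fresh, won=True)
after_loss = rating_update(fresh, fresh, won=False)
```

With these made-up constants, two identical fresh accounts both see their level rise after one game, winner and loser alike, because the uncertainty penalty shrinks. Fitting the real system would mean searching for update functions and a constant `a` that reproduce the observed level changes.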

ashersky · « **Reply #29 on:** June 21, 2013, 08:35:02 am »

Quote from: DStu on June 21, 2013, 08:13:29 am

I was thinking about taking TS, maybe with some additional degrees of freedom (not only Gaussian measures, but also trying something else), and trying to fit the parameters to their data.

Edit: But of course you are right, hitting theirs exactly will be hard, as there are tons of degrees of freedom, and it may well be that it's not one simple process but a stack of patches.

But in the end it's "just" fitting two functions from \R^4->\R^2, namely
ratingUpgradeIfYouWin( yourRating, yourUncertainty, opponentRating, opponentUncertainty) -> (yourNewRating, yourNewUncertainty) and
ratingUpgradeIfYouLose( the same )

The biggest problem is that you do not observe Rating and Uncertainty, only the single number Level(rating, uncertainty), but this should be manageable given enough data, especially if we assume that Level = Rating - a*Uncertainty for some constant a.

Edit2: And of course this assumes that rating and uncertainty are the only parameters that Goko uses to model your skill, but I have seen nothing that makes me think otherwise.

This is what I meant. Whatever all that up there means. (English majors ftw!)

I think the easiest way to collect data is two or more dummy accounts and lots of resignations.

WanderingWinder · « **Reply #30 on:** June 23, 2013, 09:08:44 am »

To me, the biggest thing I want to know is the shape of the distribution which is underlying the expected winrates of the players, i.e. how often is someone with a 1 point rating advantage expected to win, with a 10 point edge, a 100 point edge, 1000, etc.

ragingduckd: If you have the programming skills, time, and data I hope you do, there is a way we could get a good approximation of this. The first thing you would need to do is break down all the games in your sample by the rating difference of the players before the game (i.e. the higher-rated player had a 1, 3, or 264 point advantage). Because their system takes on so many values, and depending on your data set, you might bucket this, so everyone from 0-4 is in the same group, 5-9 in a group, etc. 5 point buckets or 10 point or 15 or 20 or whatever your data tells you. We want to have big enough buckets to get a reasonably high sample size (I dunno, maybe 50ish games - off the top of my head) in each bucket, but we also want the buckets to be as small as possible, obviously.

Okay, then within each bucket, you split the games up based on who won - the higher-rated guy, or the lower-rated guy. Then, for the higher-rated guy, take all the wins, and find how many rating points he gained, on average, from winning. Then do the same for all the losses. Divide these (average loss change divided by average win change), and that will give you the odds for the higher player to win - do they need to win 3:1 or 2:1 or whatever, in order to maintain their rating. This can be converted into an expected winning percentage (3:1 means 75%, for example).

Oh, and of course, you will want to do the same, but separately, for all the lower-rated players. For any rating difference, it should be pretty close to matching (if the higher-rated needs to win 2:1, then the lower would only need 1:2). The differing uncertainties will make this not be exactly so, but that should at least mostly wash out over a large enough sample. The increasing uncertainties are a bigger problem, but hopefully we will be able to control for that after we have this data, or anyway we would at least have an approximate shape.

Of course, this is a lot of work, so I don't expect you to do it really. But that is how you would do it, if anyone wants to.
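
The steps above can be sketched in Python roughly as follows. The input format (tuples of pre-game rating lead, whether the higher-rated player won, and that player's rating change), the function name, and the default bucket size and minimum sample are assumptions made for illustration; only the procedure comes from the post. The key identity is that a player holds their rating steady when p * gain = (1 - p) * loss, so the required win:loss odds are loss / gain.

```python
from collections import defaultdict


def implied_win_probabilities(games, bucket_size=10, min_games=50):
    """Estimate the win probability the rating system implies per rating gap.

    games: iterable of (rating_diff, higher_won, higher_change), where
      rating_diff   = higher-rated player's pre-game lead (>= 0),
      higher_won    = True if the higher-rated player won,
      higher_change = that player's rating change from the game.
    Returns {bucket_lower_bound: implied win probability of the higher-rated player}.
    """
    buckets = defaultdict(lambda: {"win": [], "loss": []})
    for diff, higher_won, change in games:
        lower_bound = int(diff // bucket_size) * bucket_size
        key = "win" if higher_won else "loss"
        buckets[lower_bound][key].append(abs(change))

    result = {}
    for lower_bound, b in sorted(buckets.items()):
        n = len(b["win"]) + len(b["loss"])
        if n < min_games or not b["win"] or not b["loss"]:
            continue  # too little data to say anything for this bucket
        avg_gain = sum(b["win"]) / len(b["win"])
        avg_loss = sum(b["loss"]) / len(b["loss"])
        if avg_gain == 0:
            continue  # cannot form odds from a zero average gain
        odds = avg_loss / avg_gain         # wins needed per loss to break even
        result[lower_bound] = odds / (1.0 + odds)
    return result
```

Running the same function on the lower-rated players' changes gives the cross-check described above: the two implied probabilities for a bucket should sum to roughly 1 if uncertainties wash out.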

DG · « **Reply #31 on:** June 23, 2013, 09:40:30 am »

And remember that the rating system still has errors! I'm pretty sure that the rating change is sometimes half what it should be.

Watno · « **Reply #32 on:** June 23, 2013, 01:09:16 pm »

You mean exactly half? I wonder how they managed to program this, but it helps explain why it can't be put into a formula.

WanderingWinder · « **Reply #33 on:** June 23, 2013, 01:31:05 pm »

Quote from: DG on June 23, 2013, 09:40:30 am

And remember that the rating system still has errors! I'm pretty sure that the rating change is sometimes half what it should be.

What makes you think that?

The only bugs I can tell for sure are that it sometimes gives 0 change when it obviously ought not.

DG · « **Reply #34 on:** June 23, 2013, 01:37:20 pm »

I play a lot of 3 player games against the bots and the ranking is usually +18 to +24 for a win. Sometimes it is +9 and there's no good reason for it other than it being an error. It's been like that ever since beta. If there's one error then there might always be more as well.

WanderingWinder · « **Reply #35 on:** June 23, 2013, 01:45:20 pm »

Quote from: DG on June 23, 2013, 01:37:20 pm

I play a lot of 3 player games against the bots and the ranking is usually +18 to +24 for a win. Sometimes it is +9 and there's no good reason for it other than it being an error. It's been like that ever since beta. If there's one error then there might always be more as well.

Well, the 0 thing seems like a much more reasonable error to me. The +18 to +24 down to +9 there could be a lot of reasons for, depending not only on the others' ratings, but also their uncertainties. Some of this is down to how Goko is showing things. If there is someone who is rated 6000 but has 1500 points subtracted off that for uncertainty vs someone who is rated 5000 but only 500 points off for uncertainty... well, in the former case, you beat a pretty good player, and in the latter, you beat someone we're pretty sure is significantly worse than you. I'm not saying that this is right, but I wouldn't jump to thinking it had to be an error.
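
The arithmetic behind that example, as a tiny Python sketch. It assumes the displayed number is simply the rating minus an uncertainty penalty, which is this thread's working guess and not a confirmed Goko formula; the function name is made up.

```python
def displayed(rating: float, uncertainty_penalty: float) -> float:
    """Assumed display rule: what players see is rating minus a penalty."""
    return rating - uncertainty_penalty


strong_but_uncertain = displayed(6000, 1500)  # 4500
weaker_but_certain = displayed(5000, 500)     # 4500
```

Both opponents show the same number on screen, yet a system that tracks the hidden rating could reasonably award different amounts for beating each of them.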

DG · « **Reply #36 on:** June 23, 2013, 02:23:43 pm »

Quote

The +18 to +24 down to +9 there could be a lot of reasons for, depending not only on the others' ratings, but also their uncertainties.

If there are mathematical anomalies in the rating system that allow rating changes to be 100% out then that itself is a problem even if it is not a bug. It will also significantly handicap any attempt to model the ratings system.

WanderingWinder · « **Reply #37 on:** June 23, 2013, 02:30:58 pm »

Quote from: DG on June 23, 2013, 02:23:43 pm

Quote
The +18 to +24 down to +9 there could be a lot of reasons for, depending not only on the others' ratings, but also their uncertainties.

If there are mathematical anomalies in the rating system that allow rating changes to be 100% out then that itself is a problem even if it is not a bug. It will also significantly handicap any attempt to model the ratings system.

ANY reasonable rating system should allow differences of 100% or more. If I beat Stef, I should certainly gain more than twice as many points as if I had beaten the lowest ranked player on the server, no?

DG · « **Reply #38 on:** June 23, 2013, 03:09:01 pm »

I mean 100% variance between the results you get if you beat the banker bot in consecutive games. That's what I'm seeing.

WanderingWinder · « **Reply #39 on:** June 23, 2013, 03:39:05 pm »

Quote from: DG on June 23, 2013, 03:09:01 pm

I mean 100% variance between the results you get if you beat the banker bot in consecutive games. That's what I'm seeing.

The same opponent? The only issue I can see there is if, because it's a bot, it was rated much much worse the second game you played it, because it had lost a lot of games to other players in the meantime. But okay, this seems unlikely, and yeah, that is a problem. Maybe you can't write the formula down because they add a somewhat random amount, and... yeah?

heron · « **Reply #40 on:** June 23, 2013, 04:52:06 pm »

Perhaps in three player games it normally gives you +9 for each player/bot you beat, but sometimes one of the bots gets the +0 error but the other one doesn't.

Warfreak2 · « **Reply #41 on:** June 23, 2013, 08:23:28 pm »

It may be weighted according to what cards are in the kingdom; some cards are higher variance than others.

Kirian · « **Reply #42 on:** June 23, 2013, 08:40:40 pm »

Quote from: Warfreak2 on June 23, 2013, 08:23:28 pm

It may be weighted according to what cards are in the kingdom; some cards are higher variance than others.

Goko is not that smart. Nor is that a good reason to change the rating system.

Warfreak2 · « **Reply #43 on:** June 24, 2013, 07:19:01 am »

They did say it was too complicated to write down, and it would explain why you get different rating increases from beating the same bots successively... where did I suggest changing anything?

Kirian · « **Reply #44 on:** June 24, 2013, 07:57:38 am »

Let me rephrase: taking the board into account would be a terrible ranking system. I guess I was implying that doing so would be a change from sanity.

Warfreak2 · « **Reply #45 on:** June 24, 2013, 09:24:47 am »

I don't see why; a rating system is supposed to predict not only which player will win, but with what probability. That probability depends on the board, since some cards create more opportunity for the weaker player to get lucky than others do (looking at you, Treasure Map and Tournament).

WanderingWinder · « **Reply #46 on:** June 24, 2013, 01:47:22 pm »

I don't think that taking the board into account would prima facie be a theoretical problem. But trying to determine what the probability for each board is - there are astronomically far too many boards out there to do an accurate job. You could do it card-by-card, influencing an overall number, but this will be only very roughly accurate, and I really don't know that it will be a help rather than a hindrance (in comparison to just finding an overall average). I mean, treasure map doesn't actually increase variance much at all I think, tournament maybe, sometimes. But how about village? On a board with no terminals, it probably increases variance. But on a board with a big-time engine, it probably decreases it. So overall?

tl;dr hypothetically fine, but no rating system is nearly this good yet, focus on getting the basics first.

DStu · « **Reply #47 on:** June 24, 2013, 01:51:03 pm »

Quote from: WanderingWinder on June 24, 2013, 01:47:22 pm

I don't think that taking the board into account would prima facie be a theoretical problem. But trying to determine what the probability for each board is - there are astronomically far too many boards out there to do an accurate job. You could do it card-by-card, influencing an overall number, but this will be only very roughly accurate, and I really don't know that it will be a help rather than a hindrance (in comparison to just finding an overall average). I mean, treasure map doesn't actually increase variance much at all I think, tournament maybe, sometimes. But how about village? On a board with no terminals, it probably increases variance. But on a board with a big-time engine, it probably decreases it. So overall?

tl;dr hypothetically fine, but no rating system is nearly this good yet, focus on getting the basics first.

Exactly this. You also get >200 more parameters for your model just to account for single cards, each of which is itself only an average over every board that features that card. And an average over all skill levels.

Warfreak2 · « **Reply #48 on:** June 24, 2013, 03:52:28 pm »

I didn't say it would be easy to do correctly, or even sensible to try! However, it's possible Goko is trying anyway, which would explain why games between the same players, with the same result, could produce different ratings swings.

ragingduckd · « **Reply #49 on:** June 24, 2013, 06:30:14 pm »

Quote from: WanderingWinder on June 24, 2013, 01:47:22 pm

I don't think that taking the board into account would prima facie be a theoretical problem. But trying to determine what the probability for each board is - there are astronomically far too many boards out there to do an accurate job. You could do it card-by-card, influencing an overall number, but this will be only very roughly accurate, and I really don't know that it will be a help rather than a hindrance (in comparison to just finding an overall average). I mean, treasure map doesn't actually increase variance much at all I think, tournament maybe, sometimes. But how about village? On a board with no terminals, it probably increases variance. But on a board with a big-time engine, it probably decreases it. So overall?

I started trying this, and it did indeed deliver some modest improvement in the prediction accuracy. I also got relative card variances that seemed plausible to me: Minion was noisy, Chancellor was irrelevant, etc.

I was working with Elo and adjusting the logistic curve exponent based on the kingdom variance. I tried some rather primitive ways of calculating kingdom variance from card variance: kvar = max(cvar), kvar = sqrt(mean(cvar^2)), etc -- measures that were designed to overweight the noisiest cards in the kingdom. I was measuring accuracy using binomial deviance.
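
A rough Python sketch of the pieces named above. The two `kvar` aggregations follow the formulas in the post; how `kvar` enters the logistic curve (here it stretches the Elo scale, so `kvar = 1` is ordinary Elo) is a guess at the coupling, and the deviance uses one common definition (mean negative log10 likelihood). All function names and the 400-point scale are assumptions.

```python
import math


def kvar_max(cvar):
    """Kingdom variance as the noisiest card's variance: kvar = max(cvar)."""
    return max(cvar)


def kvar_rms(cvar):
    """Kingdom variance as a root mean square: kvar = sqrt(mean(cvar^2))."""
    return math.sqrt(sum(v * v for v in cvar) / len(cvar))


def win_probability(r_a, r_b, kvar=1.0, base_scale=400.0):
    """Elo logistic curve whose exponent is flattened on noisier kingdoms.

    Dividing the exponent by kvar is a hypothetical coupling: a larger
    kvar pulls the prediction toward 50%.
    """
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / (base_scale * kvar)))


def binomial_deviance(predictions, outcomes, eps=1e-12):
    """Mean of -[y*log10(p) + (1-y)*log10(1-p)]; lower is better."""
    total = 0.0
    for p, y in zip(predictions, outcomes):
        p = min(max(p, eps), 1.0 - eps)  # keep log10 finite
        total -= y * math.log10(p) + (1 - y) * math.log10(1.0 - p)
    return total / len(outcomes)
```

Both aggregations overweight the noisiest cards relative to a plain mean, which matches the stated design goal. Fitting the per-card values would then mean minimizing `binomial_deviance` over historical games.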

The big problem and the reason I abandoned the effort was that it was awfully computationally intensive. I could only work with really small samples. There's probably some much better algorithm than my first attempt, but as you say, this is second-order stuff.

PS: I am indeed interested in the shape of the win probability curve, as per your earlier post. I haven't gotten around to doing the sort of analysis you suggest for Goko ratings, but I plan to eventually. I'll also post the data I've collected in some useable form here so others can try too. Unfortunately, I seem to have broken my database so everything else is on hold. That's also why the log search is down.
