Re: predictive value of "level".
Level (defined as meanskill - sigma) isn't supposed to be a predictive measure anyway, so why would it be predictive? It's just notational shorthand. I mean, a guy whose estimated rank is 30+/-20 is a lot different from a guy whose estimated rank is 15+/-5, but "level" treats them the same. Of course it's not going to be as good a predictor as, say, mean skill.
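To make the shorthand concrete, here's a tiny sketch (the two example players are the ones above):

```python
# "Level" as defined in this thread: a conservative one-number summary,
# not a predictor.
def level(mu, sigma):
    return mu - sigma

# An uncertain player and a well-measured player get the same level,
# even though our best guess of their skills (30 vs. 15) differs a lot.
a = level(30, 20)  # -> 10
b = level(15, 5)   # -> 10
```

Same level, very different predictions, which is the whole point.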
Re: arbitrary cutoffs
There is this notion of "consistency" in estimators: you have some parameter t you want to estimate, and you have some set of n observations that you want to estimate it from. You generally want the following: as n increases, your estimate gets closer to t. Dropping observations from the past from consideration guarantees that your estimators will not be consistent. This is a pretty bad thing, considering that you don't really get any offsetting benefit from it.
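A back-of-the-envelope way to see this (per-game noise value is made up): if skill is fixed and each game is an equally noisy observation of it, the error of the sample mean shrinks like 1/sqrt(n), but a fixed window of the last k games is stuck at 1/sqrt(k) no matter how long you play.

```python
import math

SIGMA_OBS = 5.0  # assumed per-game observation noise

def std_err(n):
    # standard error of the mean of n equally noisy observations
    return SIGMA_OBS / math.sqrt(n)

err_full = std_err(10000)  # full history: 0.05, and still shrinking
err_window = std_err(25)   # last-25 window: 1.0, stuck there forever
```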
Re: decaying the past
There are two things going on here, which it seems some posters are missing. Suppose first that skill levels are fixed, like people never improve or get worse, and our goal is to correctly assess everyone's skill level in an asymptotically consistent sense. Then we should not drop anything from the past, and all observations should be equally weighted. Then the problem is basically simple except for what prior beliefs we have about the population of players.
Okay, but skill levels aren't fixed, so we have to do something else. The Glicko solution is basically to increase the variance of the prior on each player over time. This naturally decays the impact of older games. This is the "gamma" that rspeer mentioned, as far as I can tell. Now applying this solution, but only on days you play, has a totally counter-intuitive effect on rankings. Consider two guys A and B, who on day 0 have identical mu/sigma rankings. Then A goes off to study for the bar exam, while B plays a game a day for the next two months, during which time his results are right in line with his previous ranking. This system will claim, implausibly, that we are MORE uncertain about B's ranking than A's, which doesn't make any sense at all.
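As I understand it, the Glicko-style inflation looks roughly like this (the gamma value is made up, and this ignores the update from game results, which is the part that shrinks sigma back down):

```python
import math

GAMMA = 0.5  # assumed per-day skill drift rate

def inflate(sigma, days_elapsed):
    # Prior variance grows linearly with elapsed *calendar* time,
    # whether or not the player played.
    return math.sqrt(sigma**2 + GAMMA**2 * days_elapsed)

# Player A sits out 60 days: his uncertainty properly grows.
sigma_a = inflate(5.0, 60)  # about 6.32

# Under the "only on days you play" variant, A's sigma would instead
# stay frozen at 5.0 while active player B's kept moving -- which is
# exactly the backwards result described above.
```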
Now, rspeer mentioned a problem about players playing badly at first and not being able to dig themselves out of the hole fast enough. To my mind, this isn't a problem: our best estimate of their level is what it is under the parameters of the model, so meh. But I think his comment reflects a prior belief about the distribution of skill levels that is a) not accounted for in the model and b) probably true. That belief is that the rate of change of "true" skill levels is much higher for players with low rankings than it is for players with high rankings. To me this is obviously true: when you suck, it's easy to become marginally competent, just read dominionstrategy.com. When you are mediocre, it is harder but not impossible to become strong. When you are strong, it is difficult to become elite OR to become mediocre. When you are elite, it's hard to move anywhere. The higher your meanskill ranking is, the lower the variance of the drift of your meanskill, regardless of the variance of your meanskill.
So if this is really the problem you are trying to solve with all this tweaking of the system, then just use a non-uniform gamma based on meanskill. Problem solved, and in a nice Bayesian way.
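A minimal sketch of what I mean (all the constants here are illustrative, not fitted to anything):

```python
def gamma(mu, gamma_max=1.0, gamma_min=0.1, scale=25.0):
    # Skill-dependent drift rate: high gamma for low-ranked players
    # (their true skill moves fast), low gamma for elite players
    # (their true skill barely drifts). Linear interpolation between
    # gamma_max at mu = 0 and gamma_min at mu = scale.
    t = min(max(mu / scale, 0.0), 1.0)
    return gamma_max + t * (gamma_min - gamma_max)
```

You'd then plug gamma(mu) into the usual variance-inflation step in place of the single global gamma, so a beginner's rating stays mobile while an elite player's rating is treated as stable.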