I made some corrections to my original post. Sorry for any confusion.
As to your question about these revisions: mu-sigma isn't an accepted standard, and neither is mu-3*sigma. OK, I have seen mu-3*sigma a fair bit, but it is 'common' only because Microsoft has pushed it as the standard for their TrueSkill, which is, in my estimation, mostly a way of trying to get people to play more, so as to get higher profits.
IF you believe that these are reasonable estimates of the participants' actual skill (something that seems quite suspect to me, actually), then mu-3*sigma gives a 99.865% chance that the player's skill is at least that level; 2*sigma gives a 97.7% chance, and 1*sigma gives an 84.1% chance. But the more important thing is that these bounds are one-sided: you could just as easily add the sigmas and have very good odds of the true skill being beneath that level. Really, I don't see any reason not to just go with straight-up mu, which is the central number and the system's 'best guess', if you want a single number for the rating.
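Those percentages come straight from the standard normal CDF, since mu - k*sigma covers everything above the k-sigma lower tail. A minimal sketch (not the TrueSkill package itself, and assuming the system's posterior really is Normal(mu, sigma^2)):

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# P(skill >= mu - k*sigma) = Phi(k), by symmetry of the normal.
for k in (1, 2, 3):
    print(f"mu - {k}*sigma: {normal_cdf(k):.3%} one-sided confidence")
```

Note that the same arithmetic run on mu + k*sigma gives the matching upper bound, which is why the one-sidedness matters.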
I don't fully understand the details of TrueSkill or Heungsub Lee's Python implementation, nor do I really plan to. The package is open-source, though, and I'm happy to implement variations.
It would be particularly nice to have a means of calibrating parameters and comparing predictive accuracy, if anyone is willing to contribute that code. I believe WW has described how to do this somewhere.
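By "comparing predictive accuracy" I mean something like the following sketch: score each parameter setting by the log loss of its pre-game win probabilities against the actual outcomes, and prefer the setting with the lower loss. All names and numbers below are made up for illustration, not from any real rating run.

```python
from math import log

def log_loss(predictions, outcomes):
    """Mean negative log-likelihood of the outcomes; lower is better."""
    eps = 1e-12  # guard against log(0) on overconfident predictions
    total = 0.0
    for p, won in zip(predictions, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total += -log(p) if won else -log(1.0 - p)
    return total / len(predictions)

# Hypothetical: two parameter settings scored over the same five games.
outcomes  = [1, 0, 1, 1, 0]               # 1 = predicted favourite won
setting_a = [0.9, 0.2, 0.7, 0.8, 0.4]
setting_b = [0.99, 0.01, 0.6, 0.9, 0.1]
print(log_loss(setting_a, outcomes), log_loss(setting_b, outcomes))
```

The nice property of log loss is that it punishes overconfident predictions hard, which is exactly the failure mode I complain about further down.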
I eventually dug up a paper which gives, well, not a perfect explanation of the system, but enough that I now have a good feel for the distribution they're using, and I figure I could probably get a pretty good idea of how their updating works if I cared to. If there are serious questions, someone here can probably answer them.
As for the parameters, I am looking into what curve is going to work best for this, but it is fairly far down my priority list at the moment, and moreover I am trying to write the program in a very general way, so that I can use it for many different endeavors (not just Dominion). For sure I will give an update when I have one, but I suspect this will be months...
One thing to note is that no matter what they are doing with the beta factor, you still end up with normal distributions, and, well, I have my doubts about the normal fitting well here. Eh, maybe it does. But I would at least try higher values of beta (relative to the mu and sigma you are using). Basically, as described above, what this does is lessen the impact of any particular rating difference.
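To see the damping effect concretely: the usual TrueSkill-style two-player win probability (ignoring draws) is Phi((mu1 - mu2) / sqrt(2*beta^2 + sigma1^2 + sigma2^2)), so a bigger beta inflates the denominator and pulls any fixed rating gap back toward a 50/50 prediction. A sketch, with illustrative numbers rather than anyone's actual ratings:

```python
from math import erf, sqrt

def win_probability(mu1, sigma1, mu2, sigma2, beta):
    """P(player 1 beats player 2) under the standard normal performance model."""
    denom = sqrt(2.0 * beta**2 + sigma1**2 + sigma2**2)
    z = (mu1 - mu2) / denom
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))  # Phi(z)

# Same 5-point skill gap, increasing beta: the predicted edge shrinks.
for beta in (25 / 6, 25 / 3, 25 / 2):
    print(f"beta={beta:5.2f}: {win_probability(30, 5, 25, 5, beta):.1%}")
```

(25/6 is the default beta in the usual mu=25, sigma=25/3 setup; the larger values are the sort of experiment I'm suggesting.)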
Oh, and for more evidence that this system is REALLY wrong: Stef vs. Mic Q is bad enough (sure, Stef has Mic Q's number so far, so that sort of matches, but I seriously have to believe that is basically luck), but if we go down to the number-100 player on the list, we see... Stef favored to win just over 98%(!) of the time! I mean, folks, he is good, but he isn't *that* good.