Topic: In judgment of “Judgment Matches” (Read 3241 times)

JW · « **on:** July 10, 2018, 10:37:29 am »

Judgment matches were pioneered by SamE. See his thread and the judgment match rules.

One conceptual flaw with judgment matches is that they are not necessarily transitive. Minion beat Groundskeeper in judgment match #1 (that was a relatively small sample, but let’s suppose that it wasn’t). Minion might lose a judgment match to Soothsayer, because Soothsayer as a junker is well-positioned vs. Minion. If that happened, would it mean that Groundskeeper is going to lose a judgment match to Soothsayer? Not necessarily! The more recent judgment matches seem to recognize this issue, and are comparing cards with more similar functions (Advisor v Alchemist, for example), which may help to mitigate this.

There’s also an issue with the way judgment matches are counted. If players agree that a card is favored, the board is counted as a win for that card. If they don’t agree, the board is played out and the winner of the game gets the point. But even if players can accurately judge which card a board favors, that says nothing about how strongly that card is favored. Assigning a full point to the agreed-on favored card is equivalent to assuming that if the game were played out, there’s no chance that the less favored option would win.

Consider a hypothetical “Silly Smithy”: it’s like regular Smithy, but when you play it you roll a 100-sided die. If it comes up 100, +$1. This card is nearly always going to be favored over Smithy (one exception I can think of: if Smithy is the Obelisk pile, which probably shouldn't be allowed in a judgment match at all). So a judgment match will give it a nearly perfect win over Smithy. However, its advantage in actual play is very minor. If you played all of the games out, “Silly Smithy” would have only a very small advantage vs. Smithy. A ranking system should rate Smithy and “Silly Smithy” about the same. And if you played "Silly Smithy" in a judgment match against any non-Smithy card, it would achieve essentially the same result as Smithy.
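A rough numeric sketch of that gap, under the counting rule as described in this post. Every number here is made up for illustration: the function names, the 99% "agreed" fraction and the 50.1% true win rate are assumptions, not measurements.

```python
def judgment_score(agreed: float, p_contested: float = 0.5) -> float:
    """Score if every agreed-on board counts as a full win for the favored card.

    agreed: fraction of boards both players call for the favored card (assumed).
    p_contested: favored card's win rate on boards that get played out (assumed).
    """
    return agreed * 1.0 + (1.0 - agreed) * p_contested


def played_out_score(agreed: float, p_agreed: float, p_contested: float = 0.5) -> float:
    """Expected win rate if the agreed-on boards were actually played.

    p_agreed: favored card's true win rate on the agreed-on boards (assumed).
    """
    return agreed * p_agreed + (1.0 - agreed) * p_contested


# Hypothetical "Silly Smithy" vs. Smithy: favored on 99% of boards, but barely.
print(round(judgment_score(0.99), 4))
print(round(played_out_score(0.99, 0.501), 4))
```

With these inputs the full-point rule reports roughly 99.5%, while playing every board out would give about 50.1%.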

This leads to my constructive proposal for how to measure card strength, taken from this thread: If two very good (and similarly good) players play a match where one is allowed to gain a card and the other isn't, how big an advantage will the player who can gain the card have?

Ideally the players play an infinite number of games (varying who is and isn't allowed to gain the card, as well as the starting player), play equally well on average regardless of whether they're allowed to gain the card, and the player who can gain the card doesn't rely on the fact that their opponent can't. An actual cage match will be an approximation to that ideal.

This measure also more closely approximates real games of Dominion than a judgment match. In a judgment match, when the best strategy would have both players gaining both Minion and Groundskeeper, you’ve required two changes for a judgment match: one player doesn’t gain Minion, and the other player doesn’t gain Groundskeeper. In comparison, this measure of card strength only requires one player not to gain one card, as if one player (perhaps inexplicably) decided to ignore that card.

Lastly, even this measure of card strength will not tell you how to be a better Dominion player. To do that, you need to understand why a card is strong or weak, on what types of boards you should or shouldn’t gain it, and how to play a strategy involving that card. Putting cards in order of strength is fun, but it’s never going to tell you that.

markus · « **Reply #1 on:** July 10, 2018, 11:12:50 am »

Quote from: JW on July 10, 2018, 10:37:29 am

This leads to my constructive proposal for how to measure card strength, taken from this thread: If two very good (and similarly good) players play a match where one is allowed to gain a card and the other isn't, how big an advantage will the player who can gain the card have?

I think the problem is simply the sample size. It's obvious that having access to the card is better than not, so it's all about how much better it is. If you want to pin that down to an interval of length 0.1 (e.g. 55%-65% win rate), you need about 400 games. And even that would not be very precise, so you probably want more than 1,000 games.

samath · « **Reply #2 on:** July 10, 2018, 01:35:51 pm »

Hey JW! Thanks for watching and offering this criticism! I think your link in my video thread is incorrect, but I appreciate the thoughtfulness of not cluttering that thread.

Let me take your criticisms in order:

1) Transitivity. I agree that they aren't necessarily transitive, but such triples are interesting as well! I love it when a Dominion board presents a Rock-Paper-Scissors choice, because you have to remain flexible to be able to react to your opponent. (For a real-game example, see Game #4 of my Interesting Games of the Week #2. It was a single terminal Sentry/Groundskeeper board, and my opponent's Nomad Camp was beating my Mountebank until I switched it out for a pair of Swamp Hags.) I expect most of these intransitive triples would take the form of Attack versus Reaction versus Engine component, where the Attack would beat the Engine component, but the normally weak Reaction would be especially strong in the presence of the Attack. These and other sorts of specific interactions mean that I'm just not going to bother testing pairs where the direct interaction is significant, which probably includes most Attack-Reaction pairs.

So yeah, we're not going to be able to get out a total ordering of all cards this way. That's okay, though -- there are already plenty of other edge cases that this format can't handle, like Embargo, Smugglers and so on. But there's still plenty we can compare when the cards don't directly interact.

2) Counting. I think you misunderstand the process we go through at the beginning of each game. Basically, we only skip the board if we think that one of the cards is so clearly better that it'll win close to 100% of the time. This actually happens fairly frequently. If it's a narrow margin, we go ahead and play it out. Silly Smithy versus Smithy would always be a narrow margin, so we'd always play it out and eventually it'd converge to 50.1% to 49.9% or whatever (ignoring the Neither games). Of course, that would be really boring so I wouldn't do that, but that's how we treat boards where one card has a narrow margin, even if we're fairly certain which card has that narrow margin.

3) Playing cards against not being able to gain them. This is essentially what we do whenever one of the players in the Judgment Match wants to champion Neither. Basically, your proposal seems to be to just play every card against Neither (or "Not", I guess) and compare their winning percentages. It's a little different since we do require that a player championing a card intends to actually gain the card; if they don't, we would be recording our Neither wins as ties between the card and Neither/Not.

Of course, the problem is that those games are fairly boring, to play and to watch, and would be even more so if we didn't pick out the games that one of us thought Neither would have a good chance. It's a lot more fun to play and to watch closer games where the outcome is more uncertain, rather than trying to discern whether the player with access to the card has an 80% or 90% win rate.

4) Sample size. markus's numbers are a bit high, though admittedly it depends on what he means by "pinning down" a probability. The way I interpret that, it's saying that the standard deviation of your estimate is 5%, so you'd report it as 60% +/- 5%. But the variance of the binomial distribution B(n, p) is np(1-p), so the variance of the average is p(1-p)/n. Since p(1-p) is at most 1/4, we can upper bound this variance by 1/(4n). If we want that to be less than 1/400, which would correspond to a standard deviation of 5%, we just need 100 games, not 400.

Moreover, if we're fine with a standard deviation of 10% (e.g. a range like 60-80%), then we need just 25 games by the same logic. That's around two judgment matches with the way we've been playing them. In other words, the standard deviation of our outcomes should be around 15%-20% for the closer ones (e.g. Page versus Peasant), and lower the more lopsided the match was (since p(1-p) is lower when p is closer to 0 or 1). So, for instance, our estimate of Minion's winning percentage over Groundskeeper would be something like 67% +/- 20%, but that of Menagerie's winning percentage over Wishing Well would be something like 90% +/- 10%.
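The arithmetic above can be checked with a small sketch. It assumes games are independent with a fixed win probability `p`; the function name `win_rate_std` is made up for this example.

```python
import math


def win_rate_std(p: float, n: int) -> float:
    """Standard deviation of the observed win fraction over n independent games.

    Uses Var(average) = p(1-p)/n, which follows from Var(B(n, p)) = np(1-p).
    """
    return math.sqrt(p * (1.0 - p) / n)


# Worst case p = 0.5, for the game counts mentioned in the post.
for n in (25, 100):
    print(n, round(win_rate_std(0.5, n), 3))

# A lopsided matchup has a smaller spread at the same n.
print(round(win_rate_std(0.9, 25), 3))
```

At p = 0.5 this gives 10% for 25 games and 5% for 100 games, matching the bound of 1/(4n) on the variance.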

markus · « **Reply #3 on:** July 10, 2018, 05:32:38 pm »

My "pinning down" is +/-2 standard deviations.
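A quick sketch of how the two game counts in this thread fit together, assuming independent games and the worst case p = 0.5. The helper `games_needed` and its parameter names are hypothetical, introduced only for this example.

```python
import math


def games_needed(half_width: float, z: float, p: float = 0.5) -> int:
    """Games needed so that z standard deviations of the win fraction fit in half_width.

    Solves z * sqrt(p(1-p)/n) <= half_width for n.
    """
    n = z * z * p * (1.0 - p) / half_width ** 2
    # Round first so floating-point noise does not bump the count up by one.
    return math.ceil(round(n, 9))


print(games_needed(0.05, z=2))  # +/- 2 standard deviations inside +/- 5%
print(games_needed(0.05, z=1))  # +/- 1 standard deviation inside +/- 5%
print(games_needed(0.10, z=1))  # +/- 1 standard deviation inside +/- 10%
```

Under these assumptions the three calls give 400, 100 and 25 games, so the two estimates differ only in how many standard deviations the interval is meant to cover.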

JW · « **Reply #4 on:** July 10, 2018, 05:59:09 pm »

Quote from: samath on July 10, 2018, 01:35:51 pm

2) Counting. I think you misunderstand the process we go through at the beginning of each game. Basically, we only skip the board if we think that one of the cards is so clearly better that it'll win close to 100% of the time. This actually happens fairly frequently. If it's a narrow margin, we go ahead and play it out. Silly Smithy versus Smithy would always be a narrow margin, so we'd always play it out and eventually it'd converge to 50.1% to 49.9% or whatever (ignoring the Neither games). Of course, that would be really boring so I wouldn't do that, but that's how we treat boards where one card has a narrow margin, even if we're fairly certain which card has that narrow margin.

Apologies for the confusion. In that case, I think you are skipping way too many boards! It's a rare board that you skip where I'd guess that the favored card would win >=90 percent of games. Though it would be less exciting playing out games where everyone agrees that the favored card is substantially favored, as you mention in the context of "card" vs. "neither" matches.

Quote

3) Playing cards against not being able to gain them. This is essentially what we do whenever one of the players in the Judgment Match wants to champion Neither. Basically, your proposal seems to be to just play every card against Neither (or "Not", I guess) and compare their winning percentages. It's a little different since we do require that a player championing a card intends to actually gain the card; if they don't, we would be recording our Neither wins as ties between the card and Neither/Not.

Of course, the problem is that those games are fairly boring, to play and to watch, and would be even more so if we didn't pick out the games that one of us thought Neither would have a good chance. It's a lot more fun to play and to watch closer games where the outcome is more uncertain, rather than trying to discern whether the player with access to the card has an 80% or 90% win rate.

As you mention, in testing a card against "nothing" you could declare a game a tie if neither player would want to gain that card even if given the opportunity (e.g., the board is clearly Beggar-Gardens, so no one will get Smithy). That would save some time. You could also declare some boards to be wins for a card against "nothing" if that card is very heavily favored, in the same way as in judgment matches (and with the same potential issues).

samath · « **Reply #5 on:** July 14, 2018, 01:39:58 am »

Quote from: JW on July 10, 2018, 05:59:09 pm

Apologies for the confusion. In that case, I think you are skipping way too many boards! It's a rare board that you skip where I'd guess that the favored card would win >=90 percent of games.

I disagree! Do you have any specific boards you think we shouldn't have skipped? I'd be happy to replay the games (with you or someone else) if you think they'd actually be competitive.

faust · « **Reply #6 on:** July 14, 2018, 02:54:43 am »

Quote from: samath on July 14, 2018, 01:39:58 am

Quote from: JW on July 10, 2018, 05:59:09 pm
Apologies for the confusion. In that case, I think you are skipping way too many boards! It's a rare board that you skip where I'd guess that the favored card would win >=90 percent of games.
I disagree! Do you have any specific boards you think we shouldn't have skipped? I'd be happy to replay the games (with you or someone else) if you think they'd actually be competitive.

I only started watching Page vs Peasant, and immediately the first board is one where I think Page has nowhere near a >= 90% win chance. Mostly it comes down to who gets to play multiple Bridges first, and Page and Peasant have the same speed (Peasant actually being somewhat faster due to not clogging your deck). I mean the Warrior attack is problematic for Peasant, but it takes a while before you can play multiple.

samath · « **Reply #7 on:** July 14, 2018, 04:12:16 am »

Quote from: faust on July 14, 2018, 02:54:43 am

I only started watching Page vs Peasant, and immediately the first board is one where I think Page has nowhere near a >= 90% win chance. Mostly it comes down to who gets to play multiple Bridges first, and Page and Peasant have the same speed (Peasant actually being somewhat faster due to not clogging your deck). I mean the Warrior attack is problematic for Peasant, but it takes a while before you can play multiple.

Peasant has no chance on that board. Warrior can trash Provinces, which basically means it can overcome any deficit, and it's fairly easy to get through the deck with a Spice Merchant or two early and piling up on Courtyards. With no natural village on the board, the best Peasant is hoping for is +Action on Courtyard, which is basically what Page gets and more.

That said, if anyone is not convinced and wants to play it out five times with me, I'd be happy to try it out. Just message me on Discord.

faust · « **Reply #8 on:** July 14, 2018, 05:05:55 am »

Quote from: samath on July 14, 2018, 04:12:16 am

Quote from: faust on July 14, 2018, 02:54:43 am
I only started watching Page vs Peasant, and immediately the first board is one where I think Page has nowhere near a >= 90% win chance. Mostly it comes down to who gets to play multiple Bridges first, and Page and Peasant have the same speed (Peasant actually being somewhat faster due to not clogging your deck). I mean the Warrior attack is problematic for Peasant, but it takes a while before you can play multiple.
Peasant has no chance on that board. Warrior can trash Provinces, which basically means it can overcome any deficit, and it's fairly easy to get through the deck with a Spice Merchant or two early and piling up on Courtyards. With no natural village on the board, the best Peasant is hoping for is +Action on Courtyard, which is basically what Page gets and more.

That said, if anyone is not convinced and wants to play it out five times with me, I'd be happy to try it out. Just message me on Discord.

Obviously playing 5 times is not near enough to determine whether a card has a >= 90% win rate.

EDIT: Also, by the time Warriors would be able to trash Provinces, the game is likely over.
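To put a number on the five-game point above, here is a minimal sketch assuming independent games with a fixed win probability. The name `prob_sweep` and the candidate win rates are chosen only for illustration.

```python
def prob_sweep(p: float, n: int = 5) -> float:
    """Probability that a card with per-game win rate p wins all n independent games."""
    return p ** n


# How often a 5-0 result shows up for a few assumed true win rates.
for p in (0.6, 0.7, 0.9):
    print(p, round(prob_sweep(p), 3))
```

Under these assumptions a 70% favorite still sweeps five games about 17% of the time, and a 90% favorite fails to sweep about 41% of the time, so a 5-0 or 4-1 result alone separates those cases only weakly.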

samath · « **Reply #9 on:** July 14, 2018, 12:41:05 pm »

Quote from: faust on July 14, 2018, 05:05:55 am

Obviously playing 5 times is not near enough to determine whether a card has a >= 90% win rate.

Well playing five times is certainly better than playing zero. Again, I’m happy to put my money where my mouth is if anyone actually thinks Peasant has a chance on that board. More than just the random trials, actually playing the games gives us a feel for the winning percentage — you can tell if you’re just a turn behind or if the other strategy is just far better. I think it’s plenty sufficient for this estimation task.

faust · « **Reply #10 on:** July 14, 2018, 01:24:09 pm »

Quote from: samath on July 14, 2018, 12:41:05 pm

Quote from: faust on July 14, 2018, 05:05:55 am
Obviously playing 5 times is not near enough to determine whether a card has a >= 90% win rate.
Well playing five times is certainly better than playing zero. Again, I’m happy to put my money where my mouth is if anyone actually thinks Peasant has a chance on that board. More than just the random trials, actually playing the games gives us a feel for the winning percentage — you can tell if you’re just a turn behind or if the other strategy is just far better. I think it’s plenty sufficient for this estimation task.

Well I could play, but I will say in advance that Peasant losing each time would not convince me that I am wrong.

samath · « **Reply #11 on:** July 14, 2018, 01:41:12 pm »

Quote from: faust on July 14, 2018, 01:24:09 pm

Well I could play, but I will say in advance that Peasant losing each time would not convince me that I am wrong.

Of course, Peasant would have to not just lose but lose convincingly. If you or anyone else wants to play the games, just message me (ideally in Discord, but forum messaging also works) and we can find a good time to do so.

JW · « **Reply #12 on:** July 14, 2018, 02:12:15 pm »

Quote from: samath on July 14, 2018, 01:39:58 am

Quote from: JW on July 10, 2018, 05:59:09 pm
Apologies for the confusion. In that case, I think you are skipping way too many boards! It's a rare board that you skip where I'd guess that the favored card would win >=90 percent of games.
I disagree! Do you have any specific boards you think we shouldn't have skipped? I'd be happy to replay the games (with you or someone else) if you think they'd actually be competitive.

Game 1 and game 8 of the Butcher-Altar match come to mind.

samath · « **Reply #13 on:** July 14, 2018, 08:08:23 pm »

Quote from: JW on July 14, 2018, 02:12:15 pm

Game 1 and game 8 of the Butcher-Altar match come to mind.

I will defend Butcher winning 90%+ for either of those games. Both had a great Gold gainer in Leprechaun, and Butcher's ability to turn Golds into Provinces seems crucial, in addition to its usual cheaper cost and ability to mill Provinces if necessary. Same offer applies: If anyone wants to actually play them out with me, I'd be happy to give it a try; just message me. If I had to pick between the two games, the first one seems less clear cut.

markus · « **Reply #14 on:** July 15, 2018, 07:40:13 am »

The Peasant-Page board with Bridge is not so much about trashing your opponent's Provinces, but who is the first to play a bunch of Bridges. And Champion is faster than Teacher, so I think that will be a clear >90%.

The first Butcher-Altar board doesn't want to use Leprechaun as a Gold gainer but Dismantle. I'm convinced Dismantle would win a judgement match vs Leprechaun there.

Dominion Strategy Forum
