Hey JW! Thanks for watching and offering this criticism! I think your link in my video thread is incorrect, but I appreciate the thoughtfulness of not cluttering that thread.
Let me take your criticisms in order:
1)
Transitivity. I agree that these comparisons aren't necessarily transitive, but intransitive triples are interesting as well! I love it when a Dominion board presents a Rock-Paper-Scissors choice, because you have to remain flexible to be able to react to your opponent. (For a real-game example, see Game #4 of my Interesting Games of the Week #2. It was a single terminal Sentry/Groundskeeper board, and my opponent's Nomad Camp was beating my Mountebank until I switched it out for a pair of Swamp Hags.) I expect most of these intransitive triples would take the form of Attack versus Reaction versus Engine component, where the Attack would beat the Engine component, but the normally weak Reaction would be especially strong in the presence of the Attack. These and other sorts of specific interactions mean that I'm just not going to bother testing pairs where the direct interaction is significant, which probably includes most Attack-Reaction pairs.
So yeah, we're not going to be able to get out a total ordering of all cards this way. That's okay, though -- there are already plenty of other edge cases that this format can't handle, like Embargo, Smugglers and so on. But there's still plenty we can compare when the cards don't directly interact.
2)
Counting. I think you misunderstand the process we go through at the beginning of each game. Basically, we only skip the board if we think that one of the cards is so clearly better that it'll win close to 100% of the time. This actually happens fairly frequently. If it's a narrow margin, we go ahead and play it out. Silly Smithy versus Smithy would always be a narrow margin, so we'd always play it out and eventually it'd converge to 50.1% to 49.9% or whatever (ignoring the Neither games). Of course, that would be really boring so I wouldn't do that, but that's how we treat boards where one card has a narrow margin, even if we're fairly certain which card has that narrow margin.
3)
Playing cards against not being able to gain them. This is essentially what we do whenever one of the players in the Judgment Match wants to champion Neither. Basically, your proposal seems to be to just play every card against Neither (or "Not", I guess) and compare their winning percentages. It's a little different since we do require that a player championing a card intends to actually gain the card; if they don't, we would be recording our Neither wins as ties between the card and Neither/Not.
Of course, the problem is that those games are fairly boring, both to play and to watch, and would be even more so if we didn't pick out the games where one of us thought Neither would have a good chance. It's a lot more fun to play and to watch closer games where the outcome is more uncertain, rather than trying to discern whether the player with access to the card has an 80% or 90% win rate.
4)
Sample size. markus's numbers are a bit high, though admittedly it depends on what he means by "pinning down" a probability. The way I interpret that, it's saying that the standard deviation of your estimate is 5%, so you'd report it as 60% +/- 5%. But the variance of the binomial distribution B(n, p) is np(1-p), so the variance of the average (the observed win fraction) is p(1-p)/n. Since p(1-p) is at most 1/4, we can upper bound this variance by 1/(4n). If we want that to be at most 1/400, which corresponds to a standard deviation of 5% (0.05 squared is 1/400), we just need 100 games, not 400.
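For anyone who wants to check that arithmetic, here's a minimal sketch in Python. The function name `games_needed` is just something I made up for illustration, and it uses exact fractions so that rounding doesn't nudge the answer up by one. It only encodes the bound above (variance of the win fraction is p(1-p)/n), nothing fancier.

```python
import math
from fractions import Fraction

def games_needed(target_sd, p="1/2"):
    """Smallest n such that sqrt(p(1-p)/n) <= target_sd.

    Pass numbers as strings (e.g. "0.05") so they are read exactly.
    The default p = 1/2 is the worst case, where p(1-p) = 1/4.
    """
    sd = Fraction(target_sd)
    p = Fraction(p)
    return math.ceil(p * (1 - p) / sd**2)

print(games_needed("0.05"))   # 100
print(games_needed("0.10"))   # 25
print(games_needed("0.025"))  # 400
```

Note that the same bound does give 400 games if the target standard deviation is 2.5% rather than 5%, so the answer really does hinge on what "pinning down" means.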
Moreover, if we're fine with a standard deviation of 10% (e.g. a range like 60-80%), then we need just 25 games by the same logic. That's around two judgment matches with the way we've been playing them. Since a single match is only about half that many games, the standard deviation of our outcomes should be around 15%-20% for the closer ones (e.g. Page versus Peasant), and lower the more lopsided the match was (since p(1-p) is lower when p is closer to 0 or 1). So, for instance, our estimate of Minion's winning percentage over Groundskeeper would be something like 67% +/- 20%, but that of Menagerie's winning percentage over Wishing Well would be something like 90% +/- 10%.
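To make the "lopsided matches are less noisy" point concrete, here's a rough sketch that plugs an observed record into sqrt(p(1-p)/n). The helper name `estimate_sd` and the win/loss records below are hypothetical, chosen only to land near the percentages above; they are not the actual match records.

```python
import math

def estimate_sd(wins, games):
    """Standard deviation of the observed win fraction, sqrt(p(1-p)/n),
    using the observed fraction wins/games as a stand-in for p."""
    p = wins / games
    return math.sqrt(p * (1 - p) / games)

# Hypothetical records, for illustration only.
print(round(estimate_sd(4, 6), 3))   # 0.192  (67% win rate, about +/- 20%)
print(round(estimate_sd(9, 10), 3))  # 0.095  (90% win rate, about +/- 10%)
print(round(estimate_sd(6, 12), 3))  # 0.144  (even split over 12 games)
```

Plugging the observed fraction in for p is itself an approximation with this few games, so treat these as ballpark error bars rather than exact ones.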