I made some corrections to my original post. Sorry for any confusion.
As to your question about these revisions: mu-sigma isn't an accepted standard, and neither is mu-3*sigma. OK, I have seen mu-3*sigma a fair bit, but it is 'common' only because Microsoft has pushed it as the standard for their TrueSkill, which is, in my estimation, mostly a way of trying to get people to play more, so as to get higher profits.
IF you believe that these are reasonable estimates of the participants' actual skill (something that seems quite suspect to me, actually), then mu-3*sigma gives a 99.865% chance that the player's skill is at least that level; 2*sigma gives a 97.7% chance, and 1*sigma gives an 84.1% chance. But the more important thing is that these bounds are one-sided: you could just as easily add the sigmas and have very good odds of the true skill being beneath that level. Really, I don't see any reason not to just go with straight-up mu, which is the central number and the system's 'best guess', if you want a single number for the rating.
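Those percentages come straight from the standard normal CDF, since mu - k*sigma covers everything above the k-sigma lower tail. A minimal sketch (not the TrueSkill package itself, and assuming the system's posterior really is Normal(mu, sigma^2)):

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# P(skill >= mu - k*sigma) = Phi(k), by symmetry of the normal.
for k in (1, 2, 3):
    print(f"mu - {k}*sigma: {normal_cdf(k):.3%} one-sided confidence")
```

Note that the same arithmetic run on mu + k*sigma gives the matching upper bound, which is why the one-sidedness matters.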
I don't fully understand the details of TrueSkill or Heungsub Lee's Python implementation, nor do I really plan to. The package is open-source, though, and I'm happy to implement variations.
It would be particularly nice to have a means of calibrating parameters and comparing predictive accuracy, if anyone is willing to contribute that code. I believe WW has described how to do this somewhere.
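By "comparing predictive accuracy" I mean something like the following sketch: score each parameter setting by the log loss of its pre-game win probabilities against the actual outcomes, and prefer the setting with the lower loss. All names and numbers below are made up for illustration, not from any real rating run.

```python
from math import log

def log_loss(predictions, outcomes):
    """Mean negative log-likelihood of the outcomes; lower is better."""
    eps = 1e-12  # guard against log(0) on overconfident predictions
    total = 0.0
    for p, won in zip(predictions, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total += -log(p) if won else -log(1.0 - p)
    return total / len(predictions)

# Hypothetical: two parameter settings scored over the same five games.
outcomes  = [1, 0, 1, 1, 0]               # 1 = predicted favourite won
setting_a = [0.9, 0.2, 0.7, 0.8, 0.4]
setting_b = [0.99, 0.01, 0.6, 0.9, 0.1]
print(log_loss(setting_a, outcomes), log_loss(setting_b, outcomes))
```

The nice property of log loss is that it punishes overconfident predictions hard, which is exactly the failure mode I complain about further down.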
I eventually dug up a paper which gives, well, not a perfect explanation of the system, but enough that I now have a good feel for the distribution they're using, and I figure I could probably get a pretty good idea of how their updating works if I cared to. If there are serious questions, someone here can probably answer them.
As for the parameters, I am looking into what curve is going to work best for this, but it is fairly far down my priority list at the moment, and moreover I am trying to write the program in a very general way, so that I can use it for many different endeavors (not just Dominion). For sure I will give an update when I have one, but I suspect this will be months...
One thing to note is that no matter what they are doing with the beta factor, you still end up with normal distributions, and, well, I have my doubts about the normal fitting well here. Eh, maybe it does. But I would at least try higher values of beta (relative to the mu and sigma you are using). Basically, as described above, what this does is lessen the impact of any particular rating difference.
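To see the damping effect concretely: the usual TrueSkill-style two-player win probability (ignoring draws) is Phi((mu1 - mu2) / sqrt(2*beta^2 + sigma1^2 + sigma2^2)), so a bigger beta inflates the denominator and pulls any fixed rating gap back toward a 50/50 prediction. A sketch, with illustrative numbers rather than anyone's actual ratings:

```python
from math import erf, sqrt

def win_probability(mu1, sigma1, mu2, sigma2, beta):
    """P(player 1 beats player 2) under the standard normal performance model."""
    denom = sqrt(2.0 * beta**2 + sigma1**2 + sigma2**2)
    z = (mu1 - mu2) / denom
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))  # Phi(z)

# Same 5-point skill gap, increasing beta: the predicted edge shrinks.
for beta in (25 / 6, 25 / 3, 25 / 2):
    print(f"beta={beta:5.2f}: {win_probability(30, 5, 25, 5, beta):.1%}")
```

(25/6 is the default beta in the usual mu=25, sigma=25/3 setup; the larger values are the sort of experiment I'm suggesting.)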
Oh, and for more evidence that this system is REALLY wrong: Stef vs. Mic Q is bad enough (sure, Stef has Mic Q's number so far, so that sort of matches, but I seriously have to believe that is basically luck), but if we go down to the number-100 player on the list, we see... Stef favored to win just over 98%(!) of the time! I mean, folks, he is good, but he isn't *that* good.