2. the Goko vs Isotropish analysis I still haven't posted, because it'll lead immediately to 3. the debate over optimized TrueSkill parameters for Isotropish.
When I'm finished not dealing with those, I'll try to find some time to not deal with this.
I spent some time thinking about this last week. There are objective measures for judging the quality of predictions; the Brier score is one that seemed pretty logical to me after I gave it some thought. The TrueSkill Python module will provide a prediction, so in theory it's pretty straightforward to feed the big dataset into two TrueSkill systems set up with different parameters. Depending on how long a run takes, you could even try several different parameter sets to see which gives the best predictions.
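To make that concrete, here's a minimal sketch of scoring TrueSkill predictions with the Brier score. I'm not relying on any particular helper from the trueskill package here; the win probability is just the standard normal-CDF formula over the two players' (mu, sigma) values, using the default beta = 25/6, and the function names are my own:

```python
import math

# Default TrueSkill performance-noise parameter (mu=25, sigma=25/3, beta=25/6).
BETA = 25.0 / 6.0

def win_probability(mu_a, sigma_a, mu_b, sigma_b, beta=BETA):
    """P(player A beats player B) under the TrueSkill model: the
    performance difference is Gaussian with mean mu_a - mu_b and
    variance 2*beta^2 + sigma_a^2 + sigma_b^2."""
    delta = mu_a - mu_b
    denom = math.sqrt(2 * beta**2 + sigma_a**2 + sigma_b**2)
    return 0.5 * (1.0 + math.erf(delta / (denom * math.sqrt(2))))

def brier_score(predictions):
    """Mean squared error between predicted win probabilities and actual
    outcomes (1 = predicted player won, 0 = lost). Lower is better;
    always guessing 0.5 scores 0.25."""
    return sum((p - o) ** 2 for p, o in predictions) / len(predictions)
```

Comparing two parameter sets then amounts to replaying the same game log through each system, collecting (predicted probability, outcome) pairs just before each rating update, and seeing which system's Brier score comes out lower.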
If it would be of use, I'm a programmer and could probably work on that myself if you make the data available. I managed to connect to your DB with the guest login you posted on another forum, but it looked like the necessary tables weren't available to guests.
Great! Yes, what I've done so far has been based on a similar metric:
Statistical Deviance, which is the same basic measure but with a bunch of logs involved. I'm no expert, but I understand that Deviance is well accepted in stats/econometrics circles, and variations on it are used in many different contexts. It's also (almost) the metric that FIDE used for its chess rating system competition back in 2011.
I'd be happy to add you as a collaborator on the not-yet-public project I've been using for this stuff:
https://github.com/aiannacc/Trueskill-Analysis. I've wanted to keep it private until I'm confident that it's bug-free and easy to download and run. Since this issue is certain to spark a lot of debate, I want that debate to at least be well informed.
As for running time: comparing two different TrueSkill systems on the ~1 million Pro games played on Goko to date takes a couple of hours on an Intel i7 920 @ 2.67 GHz. Almost all of that time goes into the TrueSkill rating updates, specifically the error-function integral approximations that TrueSkill requires. That's the speed when running in Python (which is slow), but the work goes through the scipy module, which outsources most of the heavy lifting to Fortran, so there's probably not much time to be saved by switching languages. There is a lot to be gained from running parameter sets in parallel, though, depending on the number of physical cores in your CPU.
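Since each parameter set replays the whole game log independently, the sweep is embarrassingly parallel. Here's a rough sketch of that structure; `evaluate` is a hypothetical stand-in (in the real run it would replay the full log through a TrueSkill environment with those parameters and return its mean deviance), and the specific beta/tau grid values are made up for illustration:

```python
import itertools

def evaluate(beta, tau):
    """Hypothetical stand-in for replaying the game log through a
    TrueSkill environment built with (beta, tau) and returning its
    mean deviance. This dummy just scores distance from an arbitrary
    optimum so the sketch runs on its own."""
    return (beta - 4.0) ** 2 + (tau - 0.08) ** 2

def grid_search(betas, taus):
    # Each (beta, tau) pair is a fully independent replay, so this
    # comprehension can be swapped for multiprocessing.Pool.map to
    # spread the work across physical cores with no shared state.
    results = [((b, t), evaluate(b, t))
               for b, t in itertools.product(betas, taus)]
    return min(results, key=lambda r: r[1])

best_params, best_score = grid_search((3.0, 4.0, 5.0), (0.05, 0.08, 0.12))
```

Note that in CPython the parallelism has to come from processes rather than threads (the GIL serializes pure-Python threads), which is why Pool.map is the natural fit here.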
Contact me if you'd like access to my github project and current analysis. I'll make you a project collaborator. Same for anyone else who's interested in this.