We want these ranks to be correlated to "stones" difference, i.e. handicaps.
We can measure a system's accuracy, as wms pointed out, by measuring its predictive power.
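To make "predictive power" concrete, here is a minimal sketch of one common way to score it: average log-loss of the model's pre-game win probabilities against the actual results. (This particular metric is my choice for illustration, not necessarily what wms had in mind.)

```python
import math

def predictive_log_loss(games):
    """Average negative log-likelihood of observed results.
    `games` is a list of (p, outcome) pairs: p is the model's
    pre-game probability that player A wins, outcome is 1 if A
    actually won and 0 otherwise. Lower is better; a model that
    always says 50% scores ln 2 (about 0.693)."""
    total = 0.0
    for p, outcome in games:
        total += -math.log(p if outcome else 1.0 - p)
    return total / len(games)
```

Two rating systems can then be compared on the same game record: the one with the lower log-loss predicted the results better.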
I believe that the best we can currently do is the following:
First of all, we need good data. Good data need to
- have standard conditions for each game,
- have as few isolated subpopulations as possible, and
- cover the whole handicap range.
To achieve that, I believe that a go server should derive its ratings only from the games of an ongoing tournament with standard settings. The pairing should be completely random each round, and full handicap given.
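The pairing scheme described above could look something like the following sketch. The rank convention (one number per player, dan positive, kyu negative) and the handicap cap of 9 stones are illustrative assumptions, not a fixed proposal.

```python
import random

def pair_round(players):
    """Randomly pair players for one round of the ongoing tournament
    and assign full handicap: the rank gap in stones, capped at 9
    (0 means an even game). `players` maps a name to a numeric rank.
    With an odd number of players, the last shuffled player sits out."""
    names = list(players)
    random.shuffle(names)  # completely random pairing each round
    pairings = []
    for a, b in zip(names[::2], names[1::2]):
        # the stronger player takes white; handicap equals the rank gap
        white, black = (a, b) if players[a] >= players[b] else (b, a)
        handicap = min(9, int(round(players[white] - players[black])))
        pairings.append((white, black, handicap))
    return pairings
```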
Second, we need a good model. A good model needs to
- use the data,
- have good predictive power,
- use as few non-data parameters as possible.
The following assessments are educated guesses; please take them as motivation for further research.
I believe that all points-based models are bad. By points-based I mean WBaduk, IGS, Elo, Glicko. The reason for this assessment is that they all use many arbitrary parameters that have no basis in the data. For example, the EGF system (a modified Elo) has at least three parameters that are completely arbitrary (a, con, and epsilon, plus some rules about rating resets). They seem to work, but how well actually depends on things that go beyond the pure game data: more or less isolated subpopulations, different strength improvements by region, and different numbers of new players all make any hope of getting everything right for the whole of Europe futile. Another big problem is that once a result has been entered and has effected a rating change, it is forgotten: at each point, the history is discarded.
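The history-discarding nature of these systems is easy to see in code. A minimal Elo-style update (the k-factor here plays the same role as the EGF's con; 32 and 400 are conventional illustrative values):

```python
import math

def elo_update(r_winner, r_loser, k=32):
    """One incremental Elo-style update. k is exactly the kind of
    arbitrary non-data parameter criticised above. Note that the
    function returns only two new numbers: once they are stored,
    the game that produced them is forgotten."""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta
```

After an upset between two equally rated players, each rating moves by k/2 and the system retains no memory of who beat whom.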
The KGS decayed history model is a big improvement, because it does not discard the data after processing. The continuous reassessment of a player's strength based on all games in his recent history has several advantages:
- It increases cohesion between subpopulations, because each single game continues to hold the two players together over a long time. This means that this model needs far fewer games to arrive at a good approximation.
- Players do not need to fear anything from other players with unclear ranks, because games are not judged by the ranks at the time of play, but by the players' most current ranks. Bad rankings correct themselves automatically without any adverse effect. (This is why I believe that [xx?]-players should not be discriminated against as is currently done on KGS, by the way.)
However, this model also has a flaw: it assumes that a player's real strength is a single value, and in determining that value it merely forgets old data gradually so that the value can move.
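The decayed-history idea can be sketched as follows. The exponential weighting, the half-life, and the Bradley-Terry game model here are illustrative assumptions on my part, not KGS's actual formula; the point is that one single strength value is fitted to all recent games, with older games weighing less.

```python
import math

def decayed_strength(games, now, half_life=45.0):
    """Estimate a single strength value from a player's recent games.
    `games` is a list of (opponent_rating, score, day) with score 1
    for a win, 0 for a loss. Each game's weight halves every
    `half_life` days; the weighted maximum-likelihood strength is
    found by Newton iteration on the log-likelihood."""
    r = 0.0
    for _ in range(50):
        num = den = 0.0
        for opp, score, day in games:
            w = 0.5 ** ((now - day) / half_life)      # decay weight
            p = 1.0 / (1.0 + math.exp(opp - r))       # predicted win prob
            num += w * (score - p)                    # gradient term
            den += w * p * (1.0 - p)                  # curvature term
        r += num / den                                # Newton step
    return r
```

Note how the flaw shows up: however the weights decay, the output is still one number, so genuine strength changes can only be expressed by forgetting.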
This is where Whole-History Rating (WHR) comes in: it assumes that the real strength changes over time, and thus regards it not as a single value, but as a continuous function of time. Compared to the KGS system, it doesn't just add a new point at the end of the rank graph; it wiggles the whole line of the graph to fit all the data.
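A toy version of that idea for a single player against opponents of known, fixed strength (a real WHR implementation updates all players jointly and uses Newton's method; this gradient-ascent version and its parameter values are simplifying assumptions of mine):

```python
import math

def whr_one_player(games, w2=0.1, steps=5000, lr=0.02):
    """Sketch of the Whole-History-Rating idea. `games` is a list of
    (day, opponent_rating, score) with score 1 for a win, 0 for a
    loss. The player gets a separate strength variable on each day
    he played, tied together by a Wiener-process prior of variance
    w2 per day. Every game pulls on the whole trajectory, so a new
    result wiggles the entire line, not just its end."""
    days = sorted({d for d, _, _ in games})
    idx = {d: i for i, d in enumerate(days)}
    r = [0.0] * len(days)
    for _ in range(steps):
        grad = [0.0] * len(days)
        for d, opp, score in games:           # game likelihood terms
            i = idx[d]
            p = 1.0 / (1.0 + math.exp(opp - r[i]))
            grad[i] += score - p
        for i in range(len(days) - 1):        # Wiener prior terms
            var = w2 * (days[i + 1] - days[i])
            g = (r[i + 1] - r[i]) / var
            grad[i] += g
            grad[i + 1] -= g
        r = [ri + lr * gi for ri, gi in zip(r, grad)]
    return dict(zip(days, r))
```

With a 1-1 record on day 0 and a 2-1 record on day 10, the day-0 estimate ends up above zero: the later wins pull the earlier part of the curve up through the prior, which is exactly the "wiggling" described above.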
When you think about it, it seems obvious that this is the better model.
I believe that if we gather better data, as outlined at the start of this post, we may be able to see the differences between the predictive powers of our models more clearly. If I were to design a go server's ranking system now, I would use the ongoing tournament together with an implementation of WHR.