Whole History Rating

Harleqin · #1

It was recently brought to my attention that a rating system exists which does not fake a development by gradually forgetting earlier data, but instead forms a model of a complete rank development. It is called "Whole History Rating" and its description is available at Rémi Coulom's homepage.

I have also read someone claim that it also had some drawbacks.

I would like to hear about these drawbacks.

As you know, the EGF has a lot of tournament data which could be fed to a WHR algorithm. Is it feasible to test this?

This might be a major improvement over the current systems.

Phelan · #2

Given that all the tournament data is accessible online on the EGD website, it's probably feasible to test. How would you analyze the data to know which is better, though?

daniel_the_smith · #3

I think the normal way to evaluate such things is to feed it half the data and see how well it predicts the other half. If I recall correctly, the authors do that in the paper with data from some server.
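A minimal sketch of that kind of hold-out test, assuming the games are sorted chronologically and that each rating system can be wrapped as a function returning a win probability. The toy games, the `ratings` dictionary and the logistic `elo_predict` are hypothetical stand-ins, not EGD data or the paper's code:

```python
import math

def mean_log_likelihood(predict, games):
    """Average log-probability that `predict` assigned to what actually happened.

    games: list of (winner, loser) pairs; predict(a, b) -> P(a beats b).
    Closer to 0 is better; always guessing 50% scores log(0.5).
    """
    return sum(math.log(predict(w, l)) for w, l in games) / len(games)

def split_by_time(games, fraction=0.5):
    """Split a chronologically ordered game list into (training, held-out)."""
    cut = int(len(games) * fraction)
    return games[:cut], games[cut:]

# Hypothetical toy data: chronological (winner, loser) pairs.
games = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("B", "C"), ("A", "C")]
train, held_out = split_by_time(games)

# Stand-in for ratings fitted on `train` by whichever system is under test.
ratings = {"A": 2100.0, "B": 2000.0, "C": 1900.0}

def elo_predict(a, b):
    """Logistic win probability on an Elo-like scale (assumed model)."""
    return 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))

def coin_flip(a, b):
    """Baseline that knows nothing about the players."""
    return 0.5

print(mean_log_likelihood(elo_predict, held_out))
print(mean_log_likelihood(coin_flip, held_out))
```

The system whose held-out score is closer to zero predicted the unseen games better. Comparing two rating algorithms would mean fitting each on `train` only and scoring both on the same `held_out` games.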

RobertJasiek · #4

Does it make sense to use real world rating data at all to evaluate a specific rating system / algorithm or should one not rather first require the real world data to meet certain minimal quality criteria? E.g., each player to have played a particular minimal number of games per period of time. E.g., all players to be not isolated in regional subpopulations. E.g., all held tournaments are submitted to rating evaluation.

Should one take a set of well selected players (playing enough, dan players only) and take information from only well selected tournaments (class A, even games only)?

Or is a model player population always a better test data set because one can easily let all its players fit reasonable quality criteria?

Harleqin · #5

RobertJasiek wrote:

Does it make sense to use real world rating data at all to evaluate a specific rating system / algorithm or should one not rather first require the real world data to meet certain minimal quality criteria? E.g., each player to have played a particular minimal number of games per period of time. E.g., all players to be not isolated in regional subpopulations. E.g., all held tournaments are submitted to rating evaluation.

(I guess you mean "real world game data", not "real world rating data".)

The main problem from the missing quality you are mentioning arises when the rating algorithm regards newer data as corrections rather than modeling a development. In the first case, there have to be a lot of contacts between different populations in order to get enough corrections for rating alignment, while in the second, just a few should suffice. The case of only few rated games of a player is similar.

It would be very desirable to get more and more connected data, but we have to work with what we get. Therefore, I think that testing with the real game data will give the most relevant results with respect to perhaps replacing the current EGF rating algorithm.

These are all just working hypotheses, of course. However, as the flaws of the current algorithm are widely known, this could be a major improvement. I would like to test it.

Quote:

Should one take a set of well selected players (playing enough, dan players only) and take information from only well selected tournaments (class A, even games only)?

Or is a model player population always a better test data set because one can easily let all its players fit reasonable quality criteria?

I do not think that one should leave out handicap games. The go rank system is built upon handicap, and if we want that correlation to hold, we need to use as many handicap games as possible. This could also help to understand the effects of handicap better.

It might be interesting to also test subpopulations both of players and of games to see the effects of assumed quality differences. Testing the whole data is the most important thing, though, and should not be missed.

pwaldron · #6

RobertJasiek wrote:

Does it make sense to use real world rating data at all to evaluate a specific rating system / algorithm or should one not rather first require the real world data to meet certain minimal quality criteria?

Test data should be of the same quality as the expected actual data. If the plan, for example, is to use the system to rate all EGF events, then all EGF tournament data should be used. If the system falls apart when it meets questionable data then it isn't robust enough to be deployed in the real world.

pwaldron · #7

Harleqin wrote:

I have also read someone claim that it also had some drawbacks.

As it is described in Remi's paper, the system was tested with even games played on KGS over a period of time. Limitations and questions that I could see:

* As described the system doesn't deal with handicap games

* The system initializes everyone to a rating of 0, which means that when a strong player enters the rating system for the first time it takes a while to get an accurate estimate of the strength. The stronger the player the stronger the effect.

* The system has only been tested with online data. Players are much more active online than they are in face-to-face tournaments. For systems rating face-to-face tournament data, a good deal of a player's improvement can occur out of sight of the rating system, and it's not clear how well it will keep up.

* The system uses a particular model for the expected winning percentage as a function of rating difference. This is mathematically convenient, but it's not clear how well it matches reality. The US Chess Federation changed its model some years ago after finding that it didn't work well in the cases of large rating differences.
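To make the last point concrete: as far as I understand the paper, the model in question is the Bradley-Terry model, which on the Elo scale gives the logistic curve below. The function name and the sample rating differences are my own choices for illustration:

```python
def expected_score(rating_diff):
    """P(higher-rated player wins) under the logistic model on the Elo scale."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400.0))

# Expected winning percentage at a few rating differences.
for diff in (0, 100, 200, 400, 800):
    print(diff, round(expected_score(diff), 3))
```

The question is whether real results at the large differences (400, 800) actually follow this curve, which is something the existing game data could be used to check.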

Harleqin · #8

Thanks for the input, pwaldron! That is what I wanted to see. As far as I can see, these are limitations that can be worked out.

pwaldron wrote:

* As described the system doesn't deal with handicap games

I am sure that handicap can be incorporated into the formula.

Quote:

* The system initializes everyone to a rating of 0, which means that when a strong player enters the rating system for the first time it takes a while to get an accurate estimate of the strength. The stronger the player the stronger the effect.

The solution for this would be to either move the initialization of a player to a point in the past, e.g. to an estimate when he has actually started Go, or to initialize the player at his claimed rank.

Quote:

* The system has only been tested with online data. Players are much more active online than they are in face-to-face tournaments. For systems rating face-to-face tournament data, a good deal of a player's improvement can occur out of sight of the rating system, and it's not clear how well it will keep up.

This is what the testing is about. The current system does not fare well with this problem, and this is the main motivation behind seeking an alternative.

Quote:

* The system uses a particular model for the expected winning percentage as a function of rating difference. This is mathematically convenient, but it's not clear how well it matches reality. The US Chess Federation changed its model some years ago after finding that it didn't work well in the cases of large rating differences.

Again, this is a matter of testing and, if needed, developing better models. We have the advantage that we already have collected a lot of data, so we do not have to wait for the data to trickle in while doing a live evaluation.

I think that the main point here is that the algorithm models the whole development instead of faking it by gradually forgetting "old" data. This seems to be the right direction for an improvement, even if some formulae have to be adjusted.

pwaldron · #9

Harleqin wrote:

Quote:

* The system uses a particular model for the expected winning percentage as a function of rating difference. This is mathematically convenient, but it's not clear how well it matches reality. The US Chess Federation changed its model some years ago after finding that it didn't work well in the cases of large rating differences.

Again, this is a matter of testing and, if needed, developing better models. We have the advantage that we already have collected a lot of data, so we do not have to wait for the data to trickle in while doing a live evaluation.

This may be a little more difficult than it sounds. The model that is currently in use produces a computational problem that is extremely quick to solve. In his paper Remi reported that adding new games was extremely fast, but changing the model would probably break that feature. On the other hand, if you're interested in doing ratings updates once a week the computational time is probably no big deal.

I gave a little bit of thought to converting the AGA's algorithm to a whole-history approach and decided it was probably doable, but haven't gone much beyond that yet.

prokofiev · #10

This looks like a good approach, given that the computing time isn't so huge (less than 5 minutes to do the KGS data from 2000 to 2005). Once you're using Bayes' Theorem the content is in choosing how much data to keep, the particular Bayesian priors (initial guesses for probability distributions), and simplifying assumptions like a model for rating differences (assumptions regarding e.g. "if A beats B 2 times out of 3 and B beats C 2 times out of 3, how often does A beat C" which should probably be decided upon after looking at training data).
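For what it's worth, a Bradley-Terry style model answers the "A beats B, B beats C" question by multiplying odds. A small sketch (the helper names are mine, and whether real go results behave this way is exactly what would need checking against data):

```python
def odds(p):
    """Convert a win probability into odds in favour."""
    return p / (1.0 - p)

def prob(o):
    """Convert odds in favour back into a win probability."""
    return o / (1.0 + o)

p_ab = 2.0 / 3.0  # A beats B 2 times out of 3
p_bc = 2.0 / 3.0  # B beats C 2 times out of 3

# Under Bradley-Terry, odds(A over C) = odds(A over B) * odds(B over C).
p_ac = prob(odds(p_ab) * odds(p_bc))
print(p_ac)
```

So 2:1 against B and 2:1 against C's opponent B combine to 4:1, i.e. A would be expected to beat C about 4 times out of 5.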

As pwaldron mentions, I guess the paper does seem to have a particular model for rating differences which helps in making the computations reasonably quick to update (though I didn't look at details to see why this is).

The go community seems to have made its own assumption regarding rating differences. That is, if A gives B two stones (for each to have an equal chance of winning) and B gives C two stones, then A gives C four stones. Presumably this isn't exactly correct, but perhaps even so it's a reasonable thing to build in to the model?
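One simple way to build this in would be to treat each handicap stone as a fixed rating offset. This is only a sketch under that assumption: the function, the value of 100 points per stone and the example ratings are hypothetical, and it ignores any dependence of stone value on rank:

```python
def handicap_win_probability(r_strong, r_weak, stones, points_per_stone=100.0):
    """P(stronger player wins) when giving `stones` handicap stones.

    Assumes each stone cancels a fixed number of rating points and that the
    remaining difference follows the usual logistic curve.
    """
    diff = r_strong - r_weak - stones * points_per_stone
    return 1.0 / (1.0 + 10 ** (-diff / 400.0))

# Hypothetical ratings two ranks apart each.
a, b, c = 2300.0, 2100.0, 1900.0

print(handicap_win_probability(a, b, 2))  # A gives B two stones
print(handicap_win_probability(b, c, 2))  # B gives C two stones
print(handicap_win_probability(a, c, 4))  # A gives C four stones
```

With a constant offset per stone, the "two plus two makes four" assumption holds automatically: all three games come out even. Any rank-dependent stone value would break that additivity, which is one thing the tournament data could test.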

I guess the next step is to understand the conversion from handicap stones to even game win percentage (caveat: presumably this is highly dependent on rank, i.e. a 10 kyu beating a 5 kyu at an even game can happen but a 2 dan beating a 7 dan at an even game must be almost unheard of).

I guess one should work out (and/or guess at) some data on this, couple it with the handicap assumptions above, and compare it to the rating assumptions made in the paper to see how plausible those are.

Two other thoughts:

- Other than predictive ability on the large scale, one should also evaluate a rating system for its "fairness." I'm not exactly sure what I mean by this, or whether the system in the paper exhibits this more/less than others, but it's something to consider. What I mean is some combination of: understandability by users, equality of predictive ability (i.e. it doesn't do well in general by really messing up on some players or some such), ability to forgive bad stretches with time, and similar concerns.

- I'm confused by the example rating graph for CrazyStone in the paper. It seems to predict the large rise in CrazyStone's rating during one period of inactivity but not during another. That is, is that graph not "the rating this system would give CrazyStone at each point in time if we had it running and updating" but rather something bizarre like "what CrazyStone's rating seems to most likely have been at each point in time given the later data as well"? (Is that what is meant by "a posteriori" in the paper?)

RobertJasiek · #11

If the real world data are bad, then the real world system should be replaced. If the model data are bad, then they should be replaced. A good rating system should be designed for a nice world. I do not see much sense in tuning model rating systems for bad real worlds. Both the world and the models must be good or else be replaced to become good!

Harleqin · #12

RobertJasiek wrote:

If the real world data are bad, then the real world system should be replaced.

Please suggest an alternative to the current culture of weekend tournaments across Europe.

Quote:

If the model data are bad, then they should be replaced.

The data are game results. See above.

Quote:

A good rating system should be designed for a nice world. I do not see much sense in tuning model rating systems for bad real worlds. Both the world and the models must be good or else be replaced to become good!

A good system naturally covers all corner cases without further effort.

If every Go player played a fixed number of rated games per month under controlled and equal conditions, with a good distribution of handicap games and against a random selection from a wide variety of opponents, then the current Elo system might suffice. Since this is not the case, we need something better.

RobertJasiek · #13

I understand that the input data are game results.

I do not suggest forcing players and organizers to change the existence and nature of tournaments - with one exception: each tournament should report to the ratings commission.

What I suggest is that insignificant data are ignored. If a player has too few game results altogether or over some defined periods, then those games are to be ignored. If players of a country have too few games against players from other countries, then the games of the players from the country are ignored.

It defies your dream but insignificance should be taken into account instead of being overlooked.

Sverre · #14

RobertJasiek wrote:

What I suggest is that insignificant data are ignored. If a player has too few game results altogether or over some defined periods, then those games are to be ignored. If players of a country have too few games against players from other countries, then the games of the players from the country are ignored.

So you would recommend that players who lack the time to play in tournaments regularly get no ratings? I'm not sure if this would be popular with a lot of casual players.

prokofiev · #15

RobertJasiek wrote:

I understand that the input data are game results.

I do not suggest forcing players and organizers to change the existence and nature of tournaments - with one exception: each tournament should report to the ratings commission.

What I suggest is that insignificant data are ignored. If a player has too few game results altogether or over some defined periods, then those games are to be ignored. If players of a country have too few games against players from other countries, then the games of the players from the country are ignored.

It defies your dream but insignificance should be taken into account instead of being overlooked.

I think this is already part of the model. Or rather, not ignoring such games but giving them less weight. The model finds (or rather approximates) the rating graphs of each player most likely given the games played (and given the various assumptions). Thus the variance of the rating of a player you play a game against should play a role.

pwaldron · #16

RobertJasiek wrote:

It defies your dream but insignificance should be taken into account instead of being overlooked.

It also defies mathematical theorems. It is always better to have more information (in the form of game results). Your belief to the contrary is irrelevant.

prokofiev · #17

prokofiev wrote:

- I'm confused by the example rating graph for CrazyStone in the paper. It seems to predict the large rise in CrazyStone's rating during one period of inactivity but not during another. That is, is that graph not "the rating this system would give CrazyStone at each point in time if we had it running and updating" but rather something bizarre like "what CrazyStone's rating seems to most likely have been at each point in time given the later data as well"? (Is that what is meant by "a posteriori" in the paper?)

Answering my own question (apologies):

The second "quote" above is in fact correct, but this is not a bug, it's a feature. The model seeks better & better approximations of the whole rating graph because it takes into account the likelihood of the ratings varying (e.g. slowly varying is more likely than quickly). To get a better approximation now, a better approximation in the past is desired too.
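Here is a toy version of that idea for a single player, to show why later games change the estimate of earlier ratings. It is only a sketch under my own assumptions: opponents have known fixed ratings, ratings are on a natural-log scale, the weak pull toward 0 is added just to keep the problem bounded, and plain gradient ascent stands in for whatever optimizer the paper actually uses:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_history(games, n_steps, walk_var=1.0, prior_var=25.0, lr=0.05, iters=5000):
    """Jointly most likely rating at every time step for one player.

    games: list of (time_step, opponent_rating, won) with won in {0, 1}.
    walk_var: how much the rating is allowed to drift between steps.
    """
    r = [0.0] * n_steps
    for _ in range(iters):
        # Weak Gaussian prior centred on 0 (assumption, keeps ratings finite).
        grad = [-x / prior_var for x in r]
        # Evidence from game results: observed score minus expected score.
        for t, opp, won in games:
            grad[t] += won - sigmoid(r[t] - opp)
        # Random-walk prior: neighbouring time steps should have similar ratings.
        for t in range(n_steps - 1):
            d = (r[t + 1] - r[t]) / walk_var
            grad[t] += d
            grad[t + 1] -= d
        r = [x + lr * g for x, g in zip(r, grad)]
    return r

# Mostly losses at step 0, no games at steps 1-3, mostly wins at step 4.
games = [(0, 0.0, 0), (0, 0.0, 0), (0, 0.0, 1),
         (4, 0.0, 1), (4, 0.0, 1), (4, 0.0, 0)]
history = fit_history(games, n_steps=5)
print([round(x, 2) for x in history])
```

The steps with no games still get a rating, and it rises through the inactive period, because a gradual climb is the most likely path between the early losses and the later wins. A system that only updates forward in time could not have shown that rise until the later games arrived.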

(Also, that isn't really what "a posteriori" refers to in the paper.)

pwaldron · #18

prokofiev wrote:

Also, that isn't really what "a posteriori" refers to in the paper.

The posterior function is a statistical term. It represents an updated probability based on what you knew before (called the prior function), modified by some new information (in this case game results).
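A tiny numerical illustration of prior and posterior, assuming a discrete grid of candidate ratings (natural-log scale), a flat prior, and an opponent of known rating 0. All of these are choices made for the example, not taken from the paper:

```python
import math

def posterior_on_grid(grid, prior, results, opponent=0.0):
    """Apply Bayes' rule on a discrete grid of candidate ratings.

    results: sequence of 1 (win) or 0 (loss) against the same opponent.
    """
    unnormalized = []
    for rating, p in zip(grid, prior):
        win_p = 1.0 / (1.0 + math.exp(-(rating - opponent)))
        likelihood = 1.0
        for won in results:
            likelihood *= win_p if won else 1.0 - win_p
        unnormalized.append(p * likelihood)  # prior times likelihood
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
prior = [0.2] * 5                      # before any games: no preference
post = posterior_on_grid(grid, prior, results=[1, 1, 1])
print([round(p, 3) for p in post])
```

After three wins the probability mass has shifted toward the higher candidate ratings. That shifted distribution is the posterior, and it would serve as the prior for the next batch of results.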

prokofiev · #19

pwaldron wrote:

prokofiev wrote:

Also, that isn't really what "a posteriori" refers to in the paper.

The posterior function is a statistical term. It represents an updated probability based on what you knew before (called the prior function), modified by some new information (in this case game results).

Thanks. I'd realized the meaning, but still didn't connect the term with prior!

RobertJasiek · #20

Sverre, there are these possibilities: a) yes, b) give them pseudo-ratings that are shown for their pleasure but otherwise ignored, c) use a rating system that calculates only local ratings anyway.

prokofiev, I want something stronger than weak confidence parameters, which are a makeshift measure.

pwaldron, maybe in theory there are "more information is better" theorems, but currently rating systems are so far from perfect that a more modest approach makes it easier to design better systems. Once we have them, one can still come back to the low-confidence, sparse-data noise and see whether it can already be explained well.
