Whole History Rating

yoyoma · Post by **yoyoma** » Tue Jun 22, 2010 8:46 pm

RobertJasiek wrote:My first statistics book had a nice example: Estimate the distance between two towns. First you take a rough look: "The next town is about 10km afar." The you measure your town's mediaeval wall: "It is 30cm thick". Now you conclude: "The distance is 10km + 30cm = 10.0003km."

I would say it this way: The next town is 10km +/- 2km. Then measure wall is 30cm. Now: The next town is 10.0003km +/- 2km. Mathematically it works just fine.

Turning to examples of go ratings, if a player has only played 2 games, an even game win against a 30k, and an even game loss to a 2k, then the rating system can say he is 16k +/- 14 ranks.

If a different player played 1000 games, all even games against 16k players, and won 50% lost 50%, then the rating system can say his is 16k +/- 0.2 ranks.

So no games are thrown out. Of course the system will have less confidence in ratings of players with less games. AGA's system publishes a number related to the confidence for all players.

RobertJasiek · Post by **RobertJasiek** » Tue Jun 22, 2010 10:09 pm

the rating system can say he is 16k +/- 14 ranks.

If the rating systems did say it (in terms of rating points), that would be an improvement. Strange confidence values instead say too little to the reader.

Harleqin · Post by **Harleqin** » Wed Jun 23, 2010 2:33 am

Liisa wrote:Mathematical rating system should not have any direct and fixed relationship with kyuu-dan ranks (that are subjective honorary titles). If we try to force that relationship, it will just decrease the reliability of the mathematical system.

The ranks are just labels attached to certain values of the model. They do not change the model.

(We play handicap games in tournaments only when we are beginner double digit kyuus!)

I think that this is a very lamentable recent development.

And the good thing of plain and simple Elo is that even though we cannot deduce from Elo exact probability of beating specific opponent, we can always put players in very specific order within a certain subpopulation. (This is the reason why GoR works like magic!)

You describe a use and an outcome and that should be a reason? Magic indeed...

And there are always enough traffic between subpopulations (e.g. via EGC) so that we can calibrate them to match roughly each other if that is necessary.

What is the average rating improvement of finnish players at the London Open? If calibration was so fast, it should approach 0, no? The problem is that calibration can only propagate through later games. If a population is 40 ELO points underrated and 5% of them go to a big foreign tournament, they bring home 40 points each. In theory, this should mean that the population is afterwards only 38 points underrated, but in order to actually distribute these points, a lot of games have to be played (and the players bringing the points will naturally not be very inclined to do so).

But I agree that history approach has its merits. The best way is to calculate simultaneously normal Elo and rating that includes enough history (a year or so to the past) and put both figures to the same graph.

I think that you have not yet looked at the idea of the algorithm in question. It is not about "including some history".

I guess that I cannot expect everyone who wants to discuss this to read that paper, so an explanation will have to be given in this thread. I shall look into that.

Liisa · Post by **Liisa** » Wed Jun 23, 2010 6:53 am

Harleqin wrote:
And there are always enough traffic between subpopulations (e.g. via EGC) so that we can calibrate them to match roughly each other if that is necessary.
What is the average rating improvement of finnish players at the London Open? If calibration was so fast, it should approach 0, no? The problem is that calibration can only propagate through later games. If a population is 40 ELO points underrated and 5% of them go to a big foreign tournament, they bring home 40 points each. In theory, this should mean that the population is afterwards only 38 points underrated, but in order to actually distribute these points, a lot of games have to be played (and the players bringing the points will naturally not be very inclined to do so).

If we see that subpopulation's rating is off by 38 points after the LOGC, then we can add 38 Elo points to entire active sub population. In practice we can add 12 points (30%) and then look how much subpopulation is still underrated after next year. If this kind of comparison is applied once a year, soon enough we will get acceptable differences between subpopulations or better yet that subpopulations will stay in sync. This is not that hard.

I guess that I cannot expect everyone who wants to discuss this to read that paper, so an explanation will have to be given in this thread. I shall look into that.

That would be nice. Because it is difficult to understand any key points of WHR from the paper. What is WHR about in practice? Exactly how many games/months from past you would like to take in consideration? Real world resembling examples are always nice.

daniel_the_smith · Post by **daniel_the_smith** » Wed Jun 23, 2010 7:44 am

Liisa wrote: If we see that subpopulation's rating is off by 38 points after the LOGC, then we can add 38 Elo points to entire active sub population. In practice we can add 12 points (30%) and then look how much subpopulation is still underrated after next year. If this kind of comparison is applied once a year, soon enough we will get acceptable differences between subpopulations or better yet that subpopulations will stay in sync. This is not that hard.

Do you... plan to do that by hand? For each subpopulation? And how do you even identify a subpopulation? How can you tell how over/underrated a subpopulation is? And how could you keep such a system impartial? I concede it may be possible, but I don't see how you can claim it isn't hard...

Besides, WHR basically does that for you, only in a much better way than arbitrarily giving subpopulations 30% bonuses...

Liisa · Post by **Liisa** » Wed Jun 23, 2010 12:16 pm

daniel_the_smith wrote:Besides, WHR basically does that for you

how?

daniel_the_smith · Post by **daniel_the_smith** » Wed Jun 23, 2010 12:29 pm

As I understand it, WHR works backwards as well as forwards in time. So it should distribute those rating points back amongst the isolated pool retroactively, with no further games necessary.

Although, now that I think about it more, I don't know that it would distribute significantly more points than those few players won (as you were suggesting).

Harleqin · Post by **Harleqin** » Wed Jun 23, 2010 3:06 pm

WHR does not distribute points.

daniel_the_smith · Post by **daniel_the_smith** » Wed Jun 23, 2010 3:08 pm

Maybe I should read the paper again. I read it a very long time ago...

yoyoma · Post by **yoyoma** » Wed Jun 23, 2010 5:11 pm

RobertJasiek wrote:
the rating system can say he is 16k +/- 14 ranks.
If the rating systems did say it (in terms of rating points), that would be an improvement. Strange confidence values instead say too little to the reader.

AGA already does something like this. My AGA rating is 3.354528, with a sigma of 0.276734.

prokofiev · Post by **prokofiev** » Wed Jun 23, 2010 8:44 pm

Here's a summary of the paper and a few other rating systems for those interested. (Caveat: I perused the paper and a few other things when I made the comments above a few days ago and may have minor misunderstandings). My apologies for the presumption inherent in a long post. I've left out formulae except for one simple one in the next paragraph as presumably you should just go to the paper if you desire those.

Elo: Each player has a single numerical rating (i.e. no variance or confidence interval is used). Win percentage in the model of player 1 with rating r_1 against player 2 with rating r_2 is r_1/(r_1+r_2) (so e.g. if A beats B 2/3 the time and B beats C 2/3 the time, then r_A = 2r_B = 4r_C and it's assumed A beats C 4/5 the time). This ratings assumption seems to be called the Bradley-Terry model (from some 1952 paper of theirs) and is only moderately relevant to Elo's workings. Glicko and WHR use this as well more crucially. Ratings are updated based on a simple formula based on rating difference, with a maximum adjustment set. The points lost by one player equal those gained by the other player, so one really does have "shifting around" of rating points between somewhat isolated populations.

Main benefit: Easy to see what your rating will be after the next game (formula is simple).

Glicko (roughly Elo + confidence/variance) (Is this what AGA uses?): This is (an approximation of) an actual statistical model using Bayesian reasoning. The same rating assumption is used, but ratings have a "ratings deviation" assigned as well. The model doesn't assume your rating varies in time (technically it assumes that it doesn't, but the particular way the model is approximated plus the following hack makes this assumption less severe), but there's a hack that's added on that gradually increases your variance if you don't play for a while.

The basic idea is, given an assumed prior distribution of ratings (here assumed to be normal I think) and a collection of game results, find the most likely ratings of each player, as well as a measure of the uncertainty in this. (If this idea doesn't make sense, look at the wikipedia article on Bayes theorem or on Bayesian analysis.)

The approximation & trick: Actually finding the most likely ratings would take a lot of computations (but possibly these days not really too much time?) so an approximation is used. Whenever there's a game, we update the ratings by what is in some sense "one iteration of Newton's method." The fact that this makes for a simple formula (it looks similar to the Elo one, but the variances play a role, essentially) and the fact that this yields a good approximation are both due to the particular rating assumption mentioned above. We don't go back and reupdate based on previous games iteratively, but the approximation will still be reasonably good. Ratings deviations are also updated after each game (formula pretty simple too).

Example: If a player of high variance plays a player of low variance with an equal rating, the player of high variance will have their rating change by a lot based on their game but the player of low variance will have their rating change only by a small amount. In particular, total rating points are not preserved (this is good). (Note that I think the average rating is in fact at least approximately preserved when you take each players' ratings distribution into account [i.e. average in the sense of expected value of the average of the distributions].)

Benefits: Still possible to see how your rating will change, but formula a bit more complicated. Takes into account uncertainties in other players' ratings and avoids a major issue with different populations having too many/not having enough ratings points in the population. (For example: If a player from a small community with high uncertainty in their ratings plays an outside player with a more certain rating, his rating will adjust a lot and moreover his rating deviation will drop a good amount too, so he'll both have brought a lot of rating difference back to his community and he'll have increased ability to move his community's ratings due to the smaller ratings deviation from playing a low uncertainty player.)

WHR: (An approximation to) a statistical model. Works very similarly to Glicko, but also keeps track of previous results at all times and uses them to update not only your current rating, but its approximation of your rating at previous points in time. It uses the assumption that ratings vary in time by Brownian motion (think: how a gas particle moves around), so not too much at any one time, but can drift over time. The basic idea is, given the assumptions on the distribution of ratings (same as in Glicko I think) and the way ratings vary in time, to find the most likely ratings history for each player. (Caveat: Your most likely ratings history at some point in the past can be different from your most likely ratings history even up to that point after you've played more games.)

The approximation of this "most likely history": after each game, update the entire rating history (plus uncertainties) by one iteration of Newton's method. The formula for doing this is simple given the circumstances (again due to the ratings assumptions) and will give a good approximation, but is complicated by the fact that you need to keep track of all the previous ratings and update them too, and there's mild interplay between the ratings at different times (coming from the assumption that ratings are unlikely to vary too sharply). It might be plausible to compute your next "current rating" (i.e. not the whole history, but just today's) after one game, but more will get difficult say with just a calculator at the tournament.

Also: The fact that the history is nicely dealt with (i.e. there's no hack to make ratings sort of vary in time as in Glicko) makes it reasonable (i.e. in Glicko it would breaking that hack) to every once in a while do a big "ratings update" in which more iterations are run for each player, again using the whole history of course. This will make the approximation of the "most likely history" even better.

Benefits: No hack to make ratings vary with time, meaning your current rating shouldn't lag as much if you've improved not too long ago. Increased accuracy from taking the whole history into account with a plausible statistical model. Increased ability to run extra iterations to make ratings more accurate, which will, among other things, further benefiting the handling of mostly isolated populations.

Life In 19x19

Whole History Rating

Re: Whole History Rating

Re: Whole History Rating

Re: Whole History Rating

Re: Whole History Rating

Re: Whole History Rating

Re: Whole History Rating

Re: Whole History Rating

Re: Whole History Rating

Re: Whole History Rating

Re: Whole History Rating

Re: Whole History Rating