 Post subject: Re: Whole History Rating
Post #41 Posted: Wed Jun 23, 2010 8:44 pm 
Here's a summary of the paper and a few other rating systems for those interested. (Caveat: I perused the paper and a few other things when I made the comments above a few days ago, so I may have minor misunderstandings.) My apologies for the presumption inherent in a long post. I've left out formulae except for one simple one in the next paragraph; if you want those, you should just go to the paper.

Elo: Each player has a single numerical rating (i.e. no variance or confidence interval is used). In the model, the win probability of player 1 with rating r_1 against player 2 with rating r_2 is r_1/(r_1+r_2), so e.g. if A beats B 2/3 of the time and B beats C 2/3 of the time, then r_A = 2r_B = 4r_C and it's assumed A beats C 4/5 of the time. (Strictly, these r's are "strengths"; the familiar Elo number is a logarithmic rescaling of them.) This ratings assumption is called the Bradley-Terry model (from a 1952 paper of theirs) and is only moderately relevant to Elo's workings; Glicko and WHR use it as well, and more crucially. Ratings are updated via a simple formula based on the rating difference, with a maximum adjustment set. The points lost by one player equal those gained by the other, so one really does have "shifting around" of rating points between somewhat isolated populations.
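To make that concrete, here's a minimal Python sketch of the Bradley-Terry win probability and an Elo-style update. This isn't from the paper; the K factor of 32 and the function names are just illustrative choices.

[code]
def bt_win_prob(r1, r2):
    """Bradley-Terry: probability a player of strength r1 beats one of strength r2."""
    return r1 / (r1 + r2)

def elo_update(r1, r2, score, k=32):
    """Elo-style update on the usual logarithmic scale.

    Here r1, r2 are Elo ratings; the Bradley-Terry strength is 10**(r/400),
    so the expected score is 1 / (1 + 10**((r2 - r1)/400)).
    score is 1 if player 1 won, 0 if they lost; k caps the maximum adjustment.
    """
    expected = 1.0 / (1.0 + 10 ** ((r2 - r1) / 400.0))
    delta = k * (score - expected)
    return r1 + delta, r2 - delta   # what one player gains, the other loses

# The example above: A beats B 2/3 of the time, B beats C 2/3 of the time.
rA, rB, rC = 4.0, 2.0, 1.0          # strengths with r_A = 2 r_B = 4 r_C
print(bt_win_prob(rA, rB))          # 0.666...
print(bt_win_prob(rA, rC))          # 0.8
[/code]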

Main benefit: Easy to see what your rating will be after the next game (formula is simple).

Glicko (roughly Elo + confidence/variance) (is this what the AGA uses?): This is (an approximation of) an actual statistical model using Bayesian reasoning. The same rating assumption is used, but each rating also has a "ratings deviation" attached. The model doesn't let your rating vary in time (technically it assumes it doesn't, but the particular way the model is approximated, plus the following hack, makes this assumption less severe): there's a hack added on that gradually increases your variance if you don't play for a while.

The basic idea is, given an assumed prior distribution of ratings (here assumed to be normal, I think) and a collection of game results, to find the most likely rating for each player, along with a measure of the uncertainty in it. (If this idea doesn't make sense, look at the Wikipedia article on Bayes' theorem or on Bayesian analysis.)

The approximation & trick: Actually finding the most likely ratings would take a lot of computation (though possibly, these days, not really too much time?), so an approximation is used. Whenever there's a game, we update the ratings by what is, in some sense, "one iteration of Newton's method." The fact that this makes for a simple formula (it looks similar to the Elo one, but the variances play a role) and the fact that it yields a good approximation are both due to the particular rating assumption mentioned above. We don't go back and re-update based on previous games iteratively, but the approximation will still be reasonably good. Ratings deviations are also updated after each game (that formula is pretty simple too).

Example: If a player of high variance plays a player of low variance with an equal rating, the high-variance player's rating will change a lot based on the game, but the low-variance player's rating will change only a little. In particular, total rating points are not preserved (this is good). (Note that I think the average rating is in fact at least approximately preserved when you take each player's rating distribution into account, i.e. average in the sense of the expected value of the average of the distributions.)
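For concreteness, here's the single-game Glicko-1 update (Glickman's published formulas, as best I know them; real servers batch games into rating periods and add the inactivity hack mentioned above). The starting numbers (ratings 1500, deviations 300 and 50) are just illustrative, and the demo reproduces the example in the previous paragraph.

[code]
import math

Q = math.log(10) / 400.0

def g(rd):
    """Attenuation factor: the more uncertain the opponent, the less the game tells us."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q * Q * rd * rd / math.pi ** 2)

def glicko_update(r, rd, r_opp, rd_opp, score):
    """One-game Glicko-1 update for the player (r, rd).

    score is 1 for a win, 0.5 for a draw, 0 for a loss.
    Returns the new rating and new ratings deviation.
    """
    e = 1.0 / (1.0 + 10 ** (-g(rd_opp) * (r - r_opp) / 400.0))   # expected score
    d2 = 1.0 / (Q * Q * g(rd_opp) ** 2 * e * (1.0 - e))          # information from this game
    new_rd = math.sqrt(1.0 / (1.0 / rd ** 2 + 1.0 / d2))         # deviation always shrinks
    new_r = r + Q * new_rd ** 2 * g(rd_opp) * (score - e)
    return new_r, new_rd

# Equal ratings, one uncertain player (RD 300), one well-established player (RD 50):
print(glicko_update(1500, 300, 1500, 50, 1))   # uncertain player's rating jumps a lot
print(glicko_update(1500, 50, 1500, 300, 0))   # established player's rating barely moves
[/code]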

Benefits: Still possible to see how your rating will change, though the formula is a bit more complicated. It takes into account the uncertainty in other players' ratings and avoids a major issue with different populations having too many or too few rating points. (For example: if a player from a small community with highly uncertain ratings plays an outside player with a more certain rating, his rating will adjust a lot, and his ratings deviation will drop a good amount too. So he brings a lot of rating difference back to his community, and, having played a low-uncertainty player, his smaller ratings deviation gives him more ability to move his community's ratings.)

WHR: (An approximation to) a statistical model. It works very similarly to Glicko, but also keeps all previous results around and uses them to update not only your current rating but also its estimate of your rating at previous points in time. It assumes that ratings vary in time by Brownian motion (think: how a gas particle moves around), so not too much at any one moment, but they can drift over time. The basic idea is, given the assumptions on the distribution of ratings (same as in Glicko, I think) and on how ratings vary in time, to find the most likely rating history for each player. (Caveat: the most likely estimate of your rating at some point in the past can change after you've played more games.)

The approximation of this "most likely history": after each game, update the entire rating history (plus uncertainties) by one iteration of Newton's method. The formula for doing this is simple given the circumstances (again due to the ratings assumptions) and gives a good approximation, but it's complicated by the fact that you need to keep track of all the previous ratings and update them too, and there's mild interplay between the ratings at different times (coming from the assumption that ratings are unlikely to vary too sharply). It might be plausible to compute your next "current rating" (i.e. just today's, not the whole history) after one game, but doing much more gets difficult with, say, just a calculator at the tournament.
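To pin down what the "most likely history" is maximizing, here's a toy Python sketch of the objective for a single player with the opponents' ratings held fixed: Bradley-Terry terms for each game plus Wiener-process (Brownian motion) terms that penalize sharp rating changes. The refinement loop is crude numerical gradient ascent just to keep the code short; the real algorithm uses Newton's method over the whole history (exploiting, if I recall, the fact that the Hessian for one player's history is tridiagonal). The ratings here are on a natural-log scale, and the variance w2 and the toy data are made up.

[code]
import math

def log_posterior(times, ratings, games, w2):
    """Log-posterior of one player's rating history, opponents' ratings held fixed.

    times, ratings : the player's rating at each time point (strength = exp(rating)).
    games          : list of (time_index, opponent_rating, won).
    w2             : Wiener-process variance per unit time (how fast ratings may drift).
    """
    lp = 0.0
    # Bradley-Terry likelihood of the game results.
    for idx, r_opp, won in games:
        p_win = 1.0 / (1.0 + math.exp(r_opp - ratings[idx]))
        lp += math.log(p_win if won else 1.0 - p_win)
    # Wiener-process prior: consecutive ratings shouldn't jump too sharply.
    for i in range(1, len(times)):
        var = w2 * (times[i] - times[i - 1])
        diff = ratings[i] - ratings[i - 1]
        lp += -0.5 * diff * diff / var - 0.5 * math.log(2.0 * math.pi * var)
    return lp

def refine(times, ratings, games, w2, steps=200, lr=0.05, eps=1e-6):
    """Crude MAP refinement by numerical gradient ascent over the whole history."""
    ratings = list(ratings)
    for _ in range(steps):
        for i in range(len(ratings)):
            bumped = ratings[:i] + [ratings[i] + eps] + ratings[i + 1:]
            grad = (log_posterior(times, bumped, games, w2)
                    - log_posterior(times, ratings, games, w2)) / eps
            ratings[i] += lr * grad
    return ratings

# Toy example: three time points; a loss early on, then two wins, all vs. opponents rated 0.
times = [0.0, 1.0, 2.0]
games = [(0, 0.0, False), (1, 0.0, True), (2, 0.0, True)]
print(refine(times, [0.0, 0.0, 0.0], games, w2=0.2))  # early rating pulled down, later ones up
[/code]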

Also: the fact that the history is handled properly (i.e. there's no hack to make ratings sort of vary in time, as in Glicko) makes it reasonable (in Glicko it would mean breaking that hack) to do a big "ratings update" every once in a while, in which more iterations are run for each player, again using the whole history of course. This makes the approximation of the "most likely history" even better.

Benefits: No hack to make ratings vary with time, meaning your current rating shouldn't lag as much if you've improved recently. Increased accuracy from taking the whole history into account with a plausible statistical model. The ability to run extra iterations to make ratings more accurate, which will, among other things, further benefit the handling of mostly isolated populations.
