We want these ranks to be correlated to "stones" difference, i.e. handicaps.
We can measure a system's accuracy, as wms pointed out, by measuring its predictive power.
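To make "predictive power" concrete, here is a minimal sketch of one common way to score it: average log-loss of the model's pre-game win probabilities against the actual results. (This particular metric is my choice for illustration, not necessarily what wms had in mind.)

```python
import math

def predictive_log_loss(games):
    """Average negative log-likelihood of observed results.
    `games` is a list of (p, outcome) pairs: p is the model's
    pre-game probability that player A wins, outcome is 1 if A
    actually won and 0 otherwise. Lower is better; a model that
    always says 50% scores ln 2 (about 0.693)."""
    total = 0.0
    for p, outcome in games:
        total += -math.log(p if outcome else 1.0 - p)
    return total / len(games)
```

Two rating systems can then be compared on the same game record: the one with the lower log-loss predicted the results better.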
I believe that the best we can currently do is the following:
First of all, we need good data. Good data need to
- have standard conditions for each game,
- have as few isolated subpopulations as possible, and
- cover the whole handicap range.
To achieve that, I believe that a go server should derive its ratings only from the games of an ongoing tournament with standard settings. The pairing should be completely random each round, and full handicap given.
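The pairing scheme described above could look something like the following sketch. The rank convention (one number per player, dan positive, kyu negative) and the handicap cap of 9 stones are illustrative assumptions, not a fixed proposal.

```python
import random

def pair_round(players):
    """Randomly pair players for one round of the ongoing tournament
    and assign full handicap: the rank gap in stones, capped at 9
    (0 means an even game). `players` maps a name to a numeric rank.
    With an odd number of players, the last shuffled player sits out."""
    names = list(players)
    random.shuffle(names)  # completely random pairing each round
    pairings = []
    for a, b in zip(names[::2], names[1::2]):
        # the stronger player takes white; handicap equals the rank gap
        white, black = (a, b) if players[a] >= players[b] else (b, a)
        handicap = min(9, int(round(players[white] - players[black])))
        pairings.append((white, black, handicap))
    return pairings
```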
Second, we need a good model. A good model needs to
- use the data,
- have good predictive power,
- use as few non-data parameters as possible.
The following assessments are educated guesses; please take them as motivation for further research.
I believe that all points-based models are bad. By points-based I mean WBaduk, IGS, Elo, Glicko. The reason for this assessment is that they all use many arbitrary parameters that have no basis in the data. For example, the EGF system (a modified Elo) has at least three parameters that are completely arbitrary (a, con, and epsilon, plus some rules about rating resets). They seem to work, but how well actually depends on things that go beyond the pure game data: more or less isolated subpopulations, different strength improvements by region, and different numbers of new players all make any hope of getting everything right for the whole of Europe futile. Another big problem is that once a result has been entered and has effected a rating change, it is forgotten: at each point, the history is discarded.
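The history-discarding nature of these systems is easy to see in code. A minimal Elo-style update (the k-factor here plays the same role as the EGF's con; 32 and 400 are conventional illustrative values):

```python
import math

def elo_update(r_winner, r_loser, k=32):
    """One incremental Elo-style update. k is exactly the kind of
    arbitrary non-data parameter criticised above. Note that the
    function returns only two new numbers: once they are stored,
    the game that produced them is forgotten."""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta
```

After an upset between two equally rated players, each rating moves by k/2 and the system retains no memory of who beat whom.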
The KGS decayed history model is a big improvement, because it does not discard the data after processing. The continuous reassessment of a player's strength based on all games in his recent history has several advantages:
- It increases cohesion between subpopulations, because each single game continues to hold the two players together over a long time. This means that this model needs far fewer games to arrive at a good approximation.
- Players do not need to fear anything from other players with unclear ranks, because games are not judged by the ranks at the time of play, but by the players' most current ranks. Bad rankings correct themselves automatically without any adverse effect. (This is why I believe that [xx?]-players should not be discriminated against as is currently done on KGS, by the way.)
However, this model also has a flaw: it assumes that a player's real strength is a single value, and in determining that value it merely forgets old data gradually so that the value can move.
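The decayed-history idea can be sketched as follows. The exponential weighting, the half-life, and the Bradley-Terry game model here are illustrative assumptions on my part, not KGS's actual formula; the point is that one single strength value is fitted to all recent games, with older games weighing less.

```python
import math

def decayed_strength(games, now, half_life=45.0):
    """Estimate a single strength value from a player's recent games.
    `games` is a list of (opponent_rating, score, day) with score 1
    for a win, 0 for a loss. Each game's weight halves every
    `half_life` days; the weighted maximum-likelihood strength is
    found by Newton iteration on the log-likelihood."""
    r = 0.0
    for _ in range(50):
        num = den = 0.0
        for opp, score, day in games:
            w = 0.5 ** ((now - day) / half_life)      # decay weight
            p = 1.0 / (1.0 + math.exp(opp - r))       # predicted win prob
            num += w * (score - p)                    # gradient term
            den += w * p * (1.0 - p)                  # curvature term
        r += num / den                                # Newton step
    return r
```

Note how the flaw shows up: however the weights decay, the output is still one number, so genuine strength changes can only be expressed by forgetting.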
This is where Whole-History Rating (WHR) comes in: it assumes that the real strength changes over time, and thus regards it not as a single value, but as a continuous function of time. Compared to the KGS system, it doesn't just add a new point at the end of the rank graph; it wiggles the whole line of the graph to fit all the data.
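A toy version of that idea for a single player against opponents of known, fixed strength (a real WHR implementation updates all players jointly and uses Newton's method; this gradient-ascent version and its parameter values are simplifying assumptions of mine):

```python
import math

def whr_one_player(games, w2=0.1, steps=5000, lr=0.02):
    """Sketch of the Whole-History-Rating idea. `games` is a list of
    (day, opponent_rating, score) with score 1 for a win, 0 for a
    loss. The player gets a separate strength variable on each day
    he played, tied together by a Wiener-process prior of variance
    w2 per day. Every game pulls on the whole trajectory, so a new
    result wiggles the entire line, not just its end."""
    days = sorted({d for d, _, _ in games})
    idx = {d: i for i, d in enumerate(days)}
    r = [0.0] * len(days)
    for _ in range(steps):
        grad = [0.0] * len(days)
        for d, opp, score in games:           # game likelihood terms
            i = idx[d]
            p = 1.0 / (1.0 + math.exp(opp - r[i]))
            grad[i] += score - p
        for i in range(len(days) - 1):        # Wiener prior terms
            var = w2 * (days[i + 1] - days[i])
            g = (r[i + 1] - r[i]) / var
            grad[i] += g
            grad[i + 1] -= g
        r = [ri + lr * gi for ri, gi in zip(r, grad)]
    return dict(zip(days, r))
```

With a 1-1 record on day 0 and a 2-1 record on day 10, the day-0 estimate ends up above zero: the later wins pull the earlier part of the curve up through the prior, which is exactly the "wiggling" described above.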
When you think about it, it seems obvious that this is the better model.
I believe that if we gather better data, as outlined at the start of this post, we may be able to see the differences between the predictive powers of our models more clearly. If I were to design a go server's ranking system now, I would use the ongoing tournament together with an implementation of WHR.