A Curious Case Study in KGS Ranks

Mef · Post by **Mef** » Tue Mar 25, 2014 7:54 am

uPWarrior wrote:It's funny how Robert just proposed removing all handicap games from the calculation while in a different topic I proposed that only handicap games should be considered so we don't rely on arbitrary win percentages.

Someone with a stronger math background than myself could probably come up with a better answer for what the rating system thinks is ideal, but I would think that the best case would be for all players to have an even distribution of games across the whole range of handicaps the system aims to predict. On KGS that would mean 7.69% (1/13) of your games at each of giving H6, H5, H4, etc. This would leave approximately 23% of your games as having no handicap (i.e. either even or ±1 stone). Also, you would probably want to fix the cultural affinity for using 0.5 komi and make it reverse komi.

RBerenguel · Post by **RBerenguel** » Tue Mar 25, 2014 7:57 am

I ran a simulation, just for fun. As Herman points out, the system is deflationary. To compensate, I use a closed pool of players. Each player is given a rank from 0 to 40 (9d to 30k, so to say) and an "inner rank," which models its real strength. So, for example, a player can be (4, 8), i.e. rank 4 with inner rank 8, so he should eventually lose rank. To calculate game results, I use the difference in inner rank between the two players and an Elo-like winning probability, and I only consider games between players at most 1 rank apart. The ranks are drawn from a normal distribution with mean mu = 20 and sigma = 2/3*mu. The population (and ranks) are corrected so that the min is 0 and the max is 40.

To plot and display, I use the percentiles in the difference between "inner rank" and "system rank." The results with a pool of just 100 players and 50000 games (real games played: 49847) roughly look like:

Code:

Simulation: 100 * gauss(mu=20) for T=50000 steps having 49847 games played (bot stands for bottom):

          top 1%    top 10%    top 25%  top 33.3%     median  bot 33.3%    bot 25%    bot 10%     bot 1% 
start      40.00      29.16      22.98      20.82      14.88      10.37       8.22       2.81       0.02 
mid         4.66       4.04       2.77       2.16       1.65       0.96       0.74       0.26       0.00 
final       3.14       2.38       1.71       1.50       1.21       0.88       0.72       0.26       0.00

Or graphically,

[Image: Screen Shot 2014-03-25 at 15.51.56.png]

Even such a simple ranking model has a big flaw (assuming a closed pool of players, sure): it takes an awful lot of games to get to a "real strength," and even with 150k playthroughs (I just simulated this to check) the worst result is almost 2 "stones" off (the median is half a stone off).
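The setup described in this post could be sketched roughly as below. This is not the original simulation code: the per-game rank step `STEP`, the logistic scale `SCALE`, the clamping used to keep ranks in 0..40, and the way opponents are picked are all assumptions, since the post does not spell them out.

```python
import random

STEP = 0.1    # ASSUMPTION: rank change per game; the post does not state it
SCALE = 1.0   # ASSUMPTION: inner-rank gap that gives 10:1 winning odds

def clamp(x, lo=0.0, hi=40.0):
    """Keep a rank inside the 0 (9d) .. 40 (30k) range."""
    return max(lo, min(hi, x))

def win_probability(inner_a, inner_b):
    """Elo-like chance that A beats B; a lower number means a stronger player."""
    return 1.0 / (1.0 + 10.0 ** ((inner_a - inner_b) / SCALE))

def percentile(values, q):
    """Nearest-rank percentile of values, q in [0, 100]."""
    ordered = sorted(values)
    return ordered[round(q / 100.0 * (len(ordered) - 1))]

def simulate(n_players=100, steps=50_000, mu=20.0, seed=0):
    """Return (start errors, final errors, games played) for a closed pool."""
    rng = random.Random(seed)
    sigma = 2.0 / 3.0 * mu
    inner = [clamp(rng.gauss(mu, sigma)) for _ in range(n_players)]
    system = [clamp(rng.gauss(mu, sigma)) for _ in range(n_players)]
    start = [abs(i - s) for i, s in zip(inner, system)]
    played = 0
    for _ in range(steps):
        a = rng.randrange(n_players)
        # Only opponents within one system rank are eligible.
        nearby = [b for b in range(n_players)
                  if b != a and abs(system[a] - system[b]) <= 1.0]
        if not nearby:
            continue  # nobody close enough, so no game this step
        b = rng.choice(nearby)
        played += 1
        a_wins = rng.random() < win_probability(inner[a], inner[b])
        winner, loser = (a, b) if a_wins else (b, a)
        system[winner] = clamp(system[winner] - STEP)  # stronger = lower number
        system[loser] = clamp(system[loser] + STEP)
    final = [abs(i - s) for i, s in zip(inner, system)]
    return start, final, played

if __name__ == "__main__":
    start, final, played = simulate()
    print("games played:", played)
    for label, errs in (("start", start), ("final", final)):
        print(label, [round(percentile(errs, q), 2) for q in (99, 90, 75, 50, 25, 10, 1)])
```

The numbers it prints will not match the table above, because the seed and the assumed constants differ from whatever the original run used.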

Polama · Post by **Polama** » Tue Mar 25, 2014 8:46 am

Mef wrote:
illluck wrote:That seems like a demonstration of immobile rank to me - 6:236 and only dropping a fifth of a stone is pretty ridiculous.
To put this in perspective, this is equivalent to a normal player who plays 2 games/day having a 4-game losing streak in a day.

Nope, not equivalent. Plugging these numbers into a binomial calculator:

If we expect a 41% win rate, the probability of losing at least 236 games out of 242 'by chance' is about 10^-45.

If we factor in the ~17,000 opportunities for that streak, we're still around, call it, 10^-40.

For a player going 2 games a day, that's 365 games in the 6-month span. If we say he went 0-4, that's 12%. Given 361 4-game spans, that's essentially a given to occur (1-(10^-20) or so?). We'd be extremely surprised if a 41% player didn't have a 4-game losing streak in 365 games, and even more surprised if he had a 6/242 streak in a 17,000-game span.

Wins are streaky by nature, so the probability will be higher in practice. But still, 10^-40 is roughly your odds of being dealt a royal flush in poker, 7 hands in a row.

Put another way, auto-resigning most games, he was probably, what? 30 kyu? So the fact that the system thought he'd only fallen 1/5 of a stone was extremely wrong. We know he was much worse than that. He demonstrated it over a very significant number of games. Which, as I understand it, is the most common complaint about the KGS rating system: that it overestimates (in this case, vastly overestimates) how much variation can be explained away by chance as the number of games played increases.
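The figures in this post can be re-derived with a few lines of Python. This is only a sketch: it takes the 41% win rate and the 17,000 starting points from the post and treats games as independent. The function names are made up for illustration. The last helper computes the 4-game-streak chance exactly instead of treating the 361 overlapping windows as independent; it should come out less extreme than the 10^-20 shortcut, but still tiny enough that the streak is a near certainty.

```python
from math import comb

def binom_tail(n, k_min, p):
    """P(X >= k_min) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))

def prob_no_losing_run(n_games, run_len, p_loss):
    """Exact P(no run of run_len straight losses in n_games independent games)."""
    # state[r] = probability of currently sitting on a losing run of length r
    state = [1.0] + [0.0] * (run_len - 1)
    for _ in range(n_games):
        nxt = [0.0] * run_len
        nxt[0] = sum(state) * (1 - p_loss)   # a win resets the run
        for r in range(run_len - 1):
            nxt[r + 1] = state[r] * p_loss   # a loss extends the run
        state = nxt                          # runs that reach run_len drop out
    return sum(state)

def royal_flush_streak(hands):
    """Chance of being dealt a royal flush in each of `hands` 5-card deals."""
    return (4 / comb(52, 5)) ** hands

if __name__ == "__main__":
    p_loss = 1 - 0.41                        # win rate taken from the post
    tail = binom_tail(242, 236, p_loss)
    print("lose >= 236 of 242:       ", tail)
    print("x 17,000 starting points: ", 17_000 * tail)
    print("lose 4 in a row:          ", p_loss**4)
    print("no 0-4 run, window shortcut:", (1 - p_loss**4) ** 361)
    print("no 0-4 run, exact:          ", prob_no_losing_run(365, 4, p_loss))
    print("7 royal flushes in a row: ", royal_flush_streak(7))
```

Multiplying by 17,000 is a union bound, so it overstates the chance slightly, which only strengthens the argument.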

RBerenguel · Post by **RBerenguel** » Tue Mar 25, 2014 8:54 am

Polama wrote:
Mef wrote:
illluck wrote:That seems like a demonstration of immobile rank to me - 6:236 and only dropping a fifth of a stone is pretty ridiculous.
To put this in perspective, this is equivalent to a normal player who plays 2 games/day having a 4-game losing streak in a day.
Nope, not equivalent. Plugging these numbers into a binomial calculator:

If we expect a 41% win rate, the probability of losing at least 236 games out of 242 'by chance' is about 10^-45.

If we factor in the ~17,000 opportunities for that streak, we're still around, call it, 10^-40.

For a player going 2 games a day, that's 365 games in the 6-month span. If we say he went 0-4, that's 12%. Given 361 4-game spans, that's essentially a given to occur (1-(10^-20) or so?). We'd be extremely surprised if a 41% player didn't have a 4-game losing streak in 365 games, and even more surprised if he had a 6/242 streak in a 17,000-game span.

Wins are streaky by nature, so the probability will be higher in practice. But still, 10^-40 is roughly your odds of being dealt a royal flush in poker, 7 hands in a row.

Put another way, auto-resigning most games, he was probably, what? 30 kyu? So the fact that the system thought he'd only fallen 1/5 of a stone was extremely wrong. We know he was much worse than that. He demonstrated it over a very significant number of games. Which, as I understand it, is the most common complaint about the KGS rating system: that it overestimates (in this case, vastly overestimates) how much variation can be explained away by chance as the number of games played increases.

Can't this just be explained by history inertia? It may be statistically relevant, but the KGS ranking system (IIRC, it's been a while since I checked it) is almost a predictor-corrector system (sorry for the term, it's used in numerical analysis, for example): it relies heavily on history to predict the rank, probably correcting once more data points are available. Sure, a huge losing streak is significant, and current, but the historical weight says otherwise, and dampens the current "error."

Polama · Post by **Polama** » Tue Mar 25, 2014 9:14 am

RBerenguel wrote: Can't this just be explained by history inertia? It may be statistically relevant, but the KGS ranking system (IIRC, it's been a while since I checked it) is almost a predictor-corrector system (sorry for the term, it's used in numerical analysis, for example): it relies heavily on history to predict the rank, probably correcting once more data points are available. Sure, a huge losing streak is significant, and current, but the historical weight says otherwise, and dampens the current "error."

The algorithm's choice can be explained by history inertia. But the actual performance can't be. If you view a rank as a fixed, static thing and you hit a 200-loss streak, the best you can do is throw your hands up, say "that was weird!" and adjust your prediction down slightly. But this streak clearly demonstrates that this account's ability is not static, and that the previous 17,000 games are no longer particularly meaningful. When we're at 10^-40 probability, it's significantly more likely that, say, the person suffered extreme head trauma than that they're having a bad day.

The model may work better with humans. But this case is a demonstration that at extreme numbers of games it can no longer respond to absurdly strong signals of a change in rank.

Now, it may be that there's an explicit time mechanism, and that if this account were left to run for a month it would eventually plummet rapidly to 30 kyu. That would be sensible, because the most likely case seems to be that somebody else logged into this account today. You'd want measures from multiple days to be certain. But if we're just looking at game results, the effect should definitely be way, way, way stronger.

RBerenguel · Post by **RBerenguel** » Tue Mar 25, 2014 9:32 am

Polama wrote:
RBerenguel wrote: Can't this just be explained by history inertia? It may be statistically relevant, but the KGS ranking system (IIRC, it's been a while since I checked it) is almost a predictor-corrector system (sorry for the term, it's used in numerical analysis, for example): it relies heavily on history to predict the rank, probably correcting once more data points are available. Sure, a huge losing streak is significant, and current, but the historical weight says otherwise, and dampens the current "error."
The algorithm's choice can be explained by history inertia. But the actual performance can't be. If you view a rank as a fixed, static thing and you hit a 200-loss streak, the best you can do is throw your hands up, say "that was weird!" and adjust your prediction down slightly. But this streak clearly demonstrates that this account's ability is not static, and that the previous 17,000 games are no longer particularly meaningful. When we're at 10^-40 probability, it's significantly more likely that, say, the person suffered extreme head trauma than that they're having a bad day.

The model may work better with humans. But this case is a demonstration that at extreme numbers of games it can no longer respond to absurdly strong signals of a change in rank.

Now, it may be that there's an explicit time mechanism, and that if this account were left to run for a month it would eventually plummet rapidly to 30 kyu. That would be sensible, because the most likely case seems to be that somebody else logged into this account today. You'd want measures from multiple days to be certain. But if we're just looking at game results, the effect should definitely be way, way, way stronger.

I assume the 6-month history is log or 1/exp weighted, so that the first (oldest) game used barely signals anything. So I guess after 3-4 days of losing everything, the rank may start plummeting, faster and faster (since as soon as you start losing to lower-ranked players the rank will fall faster). The weights used, history timings and other artifacts essentially determine the volatility of rank within the system, but from what I saw of the formula (as I say, it's been a looong time since I checked it), it's essentially as easy to go up as it is to go down (as Mef explains with "low 4d" "high 4d" numbers in one of the linked threads).
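To get a feel for how much a short recent streak could count under that kind of weighting, here is a small sketch. It is not the KGS formula: the exponential form, the decay constant `tau_days`, and the even spacing of games over 180 days are all assumptions made for illustration.

```python
from math import exp

def recent_weight_share(n_total, n_recent, span_days=180.0, tau_days=45.0):
    """Fraction of total history weight carried by the n_recent newest games.

    ASSUMPTIONS: games are evenly spaced over span_days and a game that is
    age days old gets weight exp(-age / tau_days). tau_days is hypothetical.
    """
    ages = [span_days * i / n_total for i in range(n_total)]  # 0 = newest game
    weights = [exp(-age / tau_days) for age in ages]
    return sum(weights[:n_recent]) / sum(weights)

if __name__ == "__main__":
    print("unweighted share:", 242 / 17_000)
    print("weighted share:  ", recent_weight_share(17_000, 242))
```

With these made-up constants the 242 newest games carry several times more weight than their unweighted 242/17,000 share, but still only a small fraction of the total, which fits the guess that the rank would only start to move after a few days of losses.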

Pippen · Post by **Pippen** » Tue Mar 25, 2014 9:44 am

Mef wrote:KGS's rating system aims to provide the most accurate rank it can with all data available. It aims to do the best job of predicting the probable outcome between any two players and any handicap (though in practice it only accepts feedback from games H6 or less).

Tygem's rating system does not make any predictions. It does not handle handicap games. It does not make any attempt to ensure proper rank spacing. It suffers from large amounts of noise being introduced by players setting their own ranks. Under an ideal set of assumptions (all ranks properly spaced, all players properly ranked, etc) you still expect to spend 30% of your time at the wrong rank. Tygem's rating system has a place in the go world and many people find it fun. Accurately assessing your go strength and comparing yourself on a fixed scale to a pool of larger players isn't it.

Yes, a Tygem rank x covers a wider range. On KGS this rank x would be spread over the two ranks that cover that range. But on Tygem, too, you will not see a 5D beating a 7D in more than 2 out of 10 games, so these ranks are accurate, just not as accurate as on KGS, and therefore not so fitting for ranked HC games (but not inadequate either). Of course there is the exception of players who start out on Tygem and self-rank, but you gotta consider the thousands of players on Tygem. IMO this sheer mass "cleans" things up.

I do believe that after 100 games Tygem gives you a rank that compares you accurately to the big player pool. KGS does it more precisely, but at the cost of fun, thrill & motivation. Because believe it or not: knowing that the next game will decide your promotion does give you chills you will not have at KGS^^. It's like comparing regular season games to playoff games in sports.

RobertJasiek · Post by **RobertJasiek** » Tue Mar 25, 2014 9:56 am

In/deflation of the global player population: this is a problem of every rating system, because of players entering and leaving and because some of the players improve. It is possible to add global corrections for that to every system, incl. my quick draft of one.

Handicaps: I dislike rating handicap games because one needs to make pretty arbitrary assumptions that by far not all players will meet.

HermanHiddema · Post by **HermanHiddema** » Tue Mar 25, 2014 10:18 am

@Robert: So we add anchors or whatever to stabilize the rating.

Next problems:

1. Wildly inaccurate ranks change slowly. If a new player enters as 10k, but is actually 1d, he could go 50-0 and still be only 5k.

2. Constant volatility. The chances that a 15k will have improved a rank after playing 50 games are larger than those of a 5k improving after 50 games, which are again larger than a 5d improving after 50 games, but your ranking system does not reflect that.
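Point 1 is simple arithmetic once a fixed step per win is assumed. The sketch below uses the +0.1 rank per win figure quoted elsewhere in this thread, ignores losses, and assumes the player starts at the bottom of his entry rank; the helper names are made up for illustration.

```python
WINS_PER_RANK = 10  # 0.1 rank per win, the figure quoted in the thread

def rank_to_index(rank):
    """'10k' -> -10, '1k' -> -1, '1d' -> 0, '2d' -> 1 (one step per rank)."""
    number, kind = int(rank[:-1]), rank[-1]
    return -number if kind == "k" else number - 1

def index_to_rank(index):
    """Inverse of rank_to_index."""
    return f"{-index}k" if index < 0 else f"{index + 1}d"

def wins_needed(start, target):
    """Straight wins needed to climb from start to target, ignoring losses."""
    return (rank_to_index(target) - rank_to_index(start)) * WINS_PER_RANK

def rank_after_wins(start, wins):
    """Displayed rank after an unbeaten run, counting whole ranks only."""
    return index_to_rank(rank_to_index(start) + wins // WINS_PER_RANK)

if __name__ == "__main__":
    print(rank_after_wins("10k", 50))   # 5k
    print(wins_needed("10k", "1d"))     # 100
```

So under that assumption a real 1d who enters as 10k needs 100 straight wins to reach his rank, and sits at 5k after the first 50.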

RBerenguel · Post by **RBerenguel** » Tue Mar 25, 2014 10:35 am

HermanHiddema wrote:@Robert: So we add anchors or whatever to stabilize the rating.

Next problems:

1. Wildly inaccurate ranks change slowly. If a new player enters as 10k, but is actually 1d, he could go 50-0 and still be only 5k.

2. Constant volatility. The chances that a 15k will have improved a rank after playing 50 games are larger than those of a 5k improving after 50 games, which are again larger than a 5d improving after 50 games, but your ranking system does not reflect that.

You beat me to it. We need anchors to stabilise the ranks, add predictions (essentially to estimate this hidden "inner strength") and we end with a ranking system very similar or equivalent to KGS's

RobertJasiek · Post by **RobertJasiek** » Tue Mar 25, 2014 11:24 am

HermanHiddema wrote:So we add anchors or whatever to stabilize the rating.

No anchors and none of their artificial problems. I prefer a global method to balance in/deflation.

1. Wildly inaccurate ranks change slowly.

New players: they choose an appropriate rank. If they guessed badly, they reset their initial rank.

Fast improving players: +0.1 per win is fast enough.

2. Constant volatility.

Great. This reflects reality.

The chances that a 15k will have improved a rank after playing 50 games are larger than those of a 5k improving after 50 games, which are again larger than a 5d improving after 50 games, but your ranking system does not reflect that.

Great. It should not reflect that. The system is designed to be very volatile, and sufficiently volatile for 15k or 5d.

RBerenguel wrote:We need anchors to stabilise the ranks, add predictions (essentially to estimate this hidden "inner strength") and we end with a ranking system very similar or equivalent to KGS's

Anchors and predictions are NOT needed to stabilise the (volatile) ranks within the global population. Instead one can use an assumption for a global distribution. The system would be very different from KGS, because global stabilisation does not, need not, and should not prevent each player's possible great volatility.

HermanHiddema · Post by **HermanHiddema** » Tue Mar 25, 2014 11:30 am

RobertJasiek wrote:
2. Constant volatility.
Great. This reflects reality.

You really think that a 15 kyu and a 5 dan improve at the same rate?

RobertJasiek · Post by **RobertJasiek** » Tue Mar 25, 2014 11:32 am

We seem to have a misunderstanding about what "constant" refers to :) You: players of different ranks have different volatility (yes!). I: a rating system can be kept simpler by using a constant volatility regardless of rank.

EDITED

HermanHiddema · Post by **HermanHiddema** » Tue Mar 25, 2014 11:36 am

RobertJasiek wrote:Of course not. But I consider it an overkill to treat them differently. I think rating systems can and should be as simple as possible.

Perhaps, then, you should have said "this does not reflect reality, but I am willing to sacrifice accuracy for simplicity".

RBerenguel · Post by **RBerenguel** » Tue Mar 25, 2014 12:16 pm

RobertJasiek wrote:
HermanHiddema wrote:So we add anchors or whatever to stabilize the rating.
No anchors and none of their artificial problems. I prefer a global method to balance in/deflation.

1. Wildly inaccurate ranks change slowly.
New players: they choose an appropriate rank. If they guessed badly, they reset their initial rank.

Fast improving players: +0.1 per win is fast enough.

2. Constant volatility.
Great. This reflects reality.

The chances that a 15k will have improved a rank after playing 50 games are larger than those of a 5k improving after 50 games, which are again larger than a 5d improving after 50 games, but your ranking system does not reflect that.
Great. It should not reflect that. The system is designed to be very volatile, and sufficiently volatile for 15k or 5d.

RBerenguel wrote:We need anchors to stabilise the ranks, add predictions (essentially to estimate this hidden "inner strength") and we end with a ranking system very similar or equivalent to KGS's
Anchors and predictions are NOT needed to stabilise the (volatile) ranks within the global population. Instead one can use an assumption for a global distribution. The system would be very different from KGS, because global stabilisation does not, need not, and should not prevent each player's possible great volatility.

This seems to imply that the method you suggest may be: consider the previous 4 games (for instance, just a much smaller sample than 6 months) as rank estimation dampeners (to keep volatility slightly under control) and an estimation of the "inner strength" as valid, current rank. Is this close to your idea?

Life In 19x19
