Oddities in KGS ranking system

wms · Post by **wms** » Tue Aug 23, 2011 9:25 am

KGS is assuming that all ranks are stable, but the end result is the same as assuming that you improve with your opponents when you aren't playing.

When I did the research for the current KGS rank system, I did code up a system that would assume each player was improving at a constant rate, and it would try to compute the slope for your rank that best fit the available data. It made the rank system a lot more complex and run a lot slower, but in the end it made the system no better at predicting the outcome of future games (which is what I used as my metric for accuracy), so I took that algorithm out.

uPWarrior · Post by **uPWarrior** » Tue Aug 23, 2011 11:12 am

daniel_the_smith wrote:
uPWarrior wrote:The problem is that adding a new stone has less and less impact as the number of stones grows.

E.g.:
7d vs 2d at 5 handi and -6.5 komi, the expected win rate is 50%.
However, 7d vs 2d at 6 handi and -6.5 komi, the expected win rate is 79% for black.

Does anyone believe that a single new handicap stone could produce this difference in winning percentages?
Sure, I could believe it, but I'd also like to know where your figures came from

http://senseis.xmp.net/?KGSRatingMath

See expected win rates given rank differences.
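
For reference, the 79% figure is what a logistic win-probability curve gives when one rank of difference is left over after handicap. This is a minimal sketch, assuming the formula has the form P = 1 / (1 + e^(-k*d)) with k around 1.3 for strong dan players. The function name and the value of k are assumptions made here for illustration; the linked page is the authority for the actual constants.

```python
import math

def expected_win_rate(rank_diff, k=1.3):
    """Win probability for the side favoured by `rank_diff`, the effective
    rank difference (in stones) left over after handicap and komi.

    k=1.3 is an assumed steepness for strong dan players; it is not
    taken from this thread.
    """
    return 1.0 / (1.0 + math.exp(-k * rank_diff))

# Handicap exactly matches the rank gap: effective difference 0.
print(round(expected_win_rate(0.0), 2))  # 0.5
# One extra handicap stone: effective difference of one rank.
print(round(expected_win_rate(1.0), 2))  # 0.79
```

Under these assumptions, the jump from 50% to 79% is simply what one full rank is worth on the steep part of the curve.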

flOvermind · Post by **flOvermind** » Fri Aug 26, 2011 6:52 am

xed_over wrote: KGS tries to make the assumption that even if you're not playing on KGS, that you're still playing somewhere and improving. And it does that by comparing you with your past opponents.

That's not true.

But KGS makes the assumption that it does not "know" your rank or the rank of your opponents for sure at any point in time. If its guess of the rank of a player changes (e.g. because it gets more data, that is, the player plays more games), that means that the previous guess must have been wrong. So it needs to adjust everything calculated with this "wrong guess", including the ratings of opponents that haven't played in a while.

So the problem is actually the other way round: Because it assumes the rank of players *does not change* in time, it has to correct your rating all the time as the ranks of others change. And because the rank of the average kyu player will rather go up than down, that typically results in a rank drift upwards when you don't play. Note that this problem can't be solved by simply inserting the assumption that players improve at a constant rate. It would still have to re-calculate everything whenever the rank changes, only now with a linear backwards interpolation instead of a simple constant. And with both players having the same assumed improvement rate, that wouldn't actually change anything.

IGS for example has the opposite problem: It assumes that its knowledge of the current rank is perfect, and any change to this number reflects a real change in skill, which is of course ridiculous. That assumption is actually a lot worse than the KGS assumption. That way, every misranked player hurts the long-term accuracy of the rating system, while in the KGS system there are only short-term effects, and the system will adapt with time as it gathers more data.

The solution to both problems, as has already been mentioned: WHR.
This system both assumes that ratings change over time, and that the knowledge of the "true" rating is always just a guess that can change later due to more data being available.

Kaya.gs · Post by **Kaya.gs** » Tue Aug 30, 2011 10:22 am

wms wrote:KGS is assuming that all ranks are stable, but the end result is the same as assuming that you improve with your opponents when you aren't playing.

When I did the research for the current KGS rank system, I did code up a system that would assume each player was improving at a constant rate, and it would try to compute the slope for your rank that best fit the available data. It made the rank system a lot more complex and run a lot slower, but in the end it made the system no better at predicting the outcome of future games (which is what I used as my metric for accuracy), so I took that algorithm out.

I understand that the KGS rating system is more sophisticated than Wbaduk's, for example. I am not a ratings master at all, and I should start understanding a lot more about these things.

First of all, how do we know a rating system is accurate? How can we compare accuracy between KGS and Wbaduk?

I do believe Wbaduk has a larger sample of players, which means it should show less inaccuracy. However, they have the issue that from 3d to weak 7d players are almost the same strength, and then inside 7d you feel a 2-stone difference.
I don't know why that happens.

I think the KGS rating system feels very good from, say, 8k to 2d. From 3d up it starts to feel a little funky, but it's probably due to the lack of players. I can say that from 6d up, I have a certain disbelief in the ranks.

Back when KGS showed up, I remember that IGS (like Wbaduk) required you to play 20 games to get a solid rank, and that sucked.
But what I can't stand on KGS is that accounts get heavy. It feels like you are carrying a cross for all your previous losses, which is why people constantly make new accounts. In Wbaduk, maybe because of all the games needed to get a solid rank, I only have 1 account and I haven't met anyone trying to make a second one. I've never been unhappy with my Wbaduk rating, and it has moved a lot over time.

My feeling with history-based rating is that it tries to assign you a rank basically on average. So if you lose to 7d and beat 5d you are 6d. But the truth is that sometimes you play like 7d, and sometimes like 5d.

With point-based systems, as you play you approach the strength you have right now, not the average, which I think is natural and better.
Example:
KGS: I play and lose X games against 7d, then lose X games against 6d. Then I win X games against 5d and win X games against 6d. Given reasonable time frames, I would probably end up at 6d.
Wbaduk: I play and lose X games against 7d, then I go down and lose X games against 6d. After that I'm 5d. Then I win X games, get to 6d, win X games, get to 7d.

I can't give hard examples with numbers off the top of my head, but I think this gets my point across.

What do you guys think?

I keep promising I will make the thread about rating; it will be up soon.

HermanHiddema · Post by **HermanHiddema** » Tue Aug 30, 2011 10:47 am

Playing strength varies enormously depending on all sorts of conditions, such as thinking time, alcohol, lack of sleep, or whatever. Any rating system that tries to capture that playing strength in a single number is guaranteed to be inaccurate in that respect. That's why a rating system like Glicko also reports a deviation. So a 4kyu with deviation of 2 is 95% likely to play with a strength between 6kyu and 2kyu. That does not mean there is some precise actual strength between 2kyu and 6kyu that they really are. Rather, it means that even though their playing strength varies, the playing strength in any one game is very likely to be between those values.

But of course, for all sorts of purposes, from determining handicap to sorting players, you very much need a single number.

Now I think that often, players themselves are very much aware when their own strength is likely to be better or worse than their average. That is why people create separate accounts for blitz, or for playing casually instead of seriously. They don't want games that are likely to be bad to damage their rating too much.

Now there may be some ways to work around this issue, based on the player's own knowledge. Here's a few ideas:

Allow a player to secretly mark a game as "bad" before their first move. If they mark it as such, it will count less heavily for the rating (say, only 50%). This way, if you're tired, drunk or otherwise not in great shape, you can play with your main account with less chance of damage to your rating.

Allow a player to earn the right to a temporary promotion. For example: If a player wins 4 games in a row, they get one "promotion credit", with which they can start a single game at one rank higher than their usual rank. This is invisible to the opponent. Such a game, because it is played at the normal handicap for one rank higher, gives a player a chance to gain rating more quickly. I think many players would be really psyched to earn and play such games.

Allow a player to request a reevaluation every X games (say 50). If they do this, their next 3 games count more strongly for their rating. This allows a player who feels that his rating is lagging to quickly gain some points. The game counts for the opponent's rating as usual, not extra.

hyperpape · Post by **hyperpape** » Tue Aug 30, 2011 11:00 am

HermanHiddema wrote:Allow a player to earn the right to a temporary promotion. For example: If a player wins 4 games in a row, they get one "promotion credit", with which they can start a single game at one rank higher than their usual rank. This is invisible to the opponent. Such a game, because it is played at the normal handicap for one rank higher, gives a player a chance to gain rating more quickly. I think many players would be really psyched to earn and play such games.

From a UI perspective, this could get weird. How would ranks be displayed when this happens?

HermanHiddema wrote:Allow a player to request a reevaluation every X games (say 50). If they do this, their next 3 games count more strongly for their rating. This allows a player who feels that his rating is lagging to quickly gain some points. The game counts for the opponent's rating as usual, not extra.

This is a neat idea, but I wonder how many players would take to automatically requesting a reevaluation.

wms · Post by **wms** » Tue Aug 30, 2011 11:04 am

Kaya.gs wrote:First of all, how do we know a rating system is accurate? How can we compare accuracy between KGS and Wbaduk?

Basically, to determine the accuracy of a rating system, you feed in all but the last month's worth of games. Then you take the ranks it gives you, and use those ranks to determine who should win each game in the last month. More games predicted correctly means a better system.

If you want to get even better, instead of just predicting the win/loss of each game, have the rank system predict the probability of each player winning the game. Then you score points according to the sum of the logs of the probability of each outcome, and compare systems to see which is better. (This is equivalent to comparing the product of all win probabilities, but computing the product of the win probabilities will underflow if you have a lot of games, so summing the logs is more practical).

You can also, for example, score only the handicap games, to see how well your system computes proper handicaps. This was important to me so I used that as a second metric.
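
The hold-out procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Game` record and the `fit` and `predict_black_wins` hooks are hypothetical names invented here, standing in for whatever rating system is being tested. It is not the code that was used for KGS.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Game:
    played_on: date
    black: str
    white: str
    handicap: int
    black_won: bool

def holdout_accuracy(games, cutoff, fit, predict_black_wins, handicap_only=False):
    """Fraction of games on or after `cutoff` whose winner is predicted
    correctly by ranks fitted only on games before `cutoff`.

    `fit(games)` returns whatever rank data the system produces and
    `predict_black_wins(ranks, game)` returns True or False. Both are
    hypothetical hooks for the rating system under test.
    """
    train = [g for g in games if g.played_on < cutoff]
    test = [g for g in games if g.played_on >= cutoff]
    if handicap_only:
        # Second metric: score only the handicap games.
        test = [g for g in test if g.handicap > 0]
    if not test:
        raise ValueError("no games in the test window")
    ranks = fit(train)
    correct = sum(predict_black_wins(ranks, g) == g.black_won for g in test)
    return correct / len(test)
```

Passing `handicap_only=True` gives the second metric, which looks only at how well the system sets handicaps.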

Once I built a system that would take any algorithm and spit out a score on how well it did, I was able, in the space of just a couple of weeks of tuning and tweaking, to come up with a system that to me works extremely well. It does have quirks, but all rating systems do, and the quirks (e.g., your rank moves even when you don't play) aren't things that bother me very much, while I'm very happy with the accuracy.

Edit: For an example, you have two games, A vs. B and C vs. D. If A and C won, and system1 said A has a 50% chance of winning, while C has a 60% chance, then system1 gets ln(0.5)+ln(0.6) = -1.204 points. If system2 said A had an 80% chance of winning and C had a 40% chance of winning, then system2 gets ln(0.8)+ln(0.4) = -1.139 points. System2 has a higher score, so system2 is the better rank system. (Note that the scores will always be negative, because probabilities are always less than 1, so whichever system is closer to a score of 0 is the better one).
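
The arithmetic in that example can be checked with a few lines of Python. This is a minimal sketch; the function name `log_score` is made up here, and each list holds the probability a system assigned to the side that actually won.

```python
import math

def log_score(winner_probabilities):
    """Sum of natural logs of the probability a system gave to each actual
    winner. Scores are at most 0, and closer to 0 is better."""
    return sum(math.log(p) for p in winner_probabilities)

system1 = log_score([0.5, 0.6])  # A given 50%, C given 60%
system2 = log_score([0.8, 0.4])  # A given 80%, C given 40%
print(round(system1, 3), round(system2, 3))  # -1.204 -1.139
```

Note that `math.log` raises `ValueError` for a probability of 0, so a system that calls an outcome impossible and is then wrong cannot be scored at all in this sketch.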

Kaya.gs · Post by **Kaya.gs** » Tue Aug 30, 2011 11:10 am

HermanHiddema wrote:Playing strength varies enormously depending on all sorts of conditions, such as thinking time, alcohol, lack of sleep, or whatever. Any rating system that tries to capture that playing strength in a single number is guaranteed to be inaccurate in that respect. That's why a rating system like Glicko also reports a deviation. So a 4kyu with deviation of 2 is 95% likely to play with a strength between 6kyu and 2kyu. That does not mean there is some precise actual strength between 2kyu and 6kyu that they really are. Rather, it means that even though their playing strength varies, the playing strength in any one game is very likely to be between those values.

But of course, for all sorts of purposes, from determining handicap to sorting players, you very much need a single number.

Now I think that often, players themselves are very much aware when their own strength is likely to be better or worse than their average. That is why people create separate accounts for blitz, or for playing casually instead of seriously. They don't want games that are likely to be bad to damage their rating too much.

Now there may be some ways to work around this issue, based on the player's own knowledge. Here's a few ideas:

Allow a player to secretly mark a game as "bad" before their first move. If they mark it as such, it will count less heavily for the rating (say, only 50%). This way, if you're tired, drunk or otherwise not in great shape, you can play with your main account with less chance of damage to your rating.

Allow a player to earn the right to a temporary promotion. For example: If a player wins 4 games in a row, they get one "promotion credit", with which they can start a single game at one rank higher than their usual rank. This is invisible to the opponent. Such a game, because it is played at the normal handicap for one rank higher, gives a player a chance to gain rating more quickly. I think many players would be really psyched to earn and play such games.

Allow a player to request a reevaluation every X games (say 50). If they do this, their next 3 games count more strongly for their rating. This allows a player who feels that his rating is lagging to quickly gain some points. The game counts for the opponent's rating as usual, not extra.

This feels way too complex. And users playing with the rating system makes me feel uneasy. It should be global and simple.

Remember that KGS makes you think about your bad games, but in Wbaduk you don't worry as much. Yes, you are likely to lose, but who cares, you can get it back with a single victory, and a likely one. A loss on KGS feels like it's there to drag you down forever, or 6 months, which is pretty much the same.

I like the concept of a fixed number of games. 14 victories = rank up. Makes it veeeery predictable. But I worry that being so unsophisticated, it will give inaccurate results, and people will find it too plain.

Then again, I don't know what "accurate" means. What is the price of inaccuracy? Winning easy games and losing hard games all the time?

wms · Post by **wms** » Tue Aug 30, 2011 11:14 am

Kaya.gs wrote:Then again, I don't know what "accurate" means. What is the price of inaccuracy? Winning easy games and losing hard games all the time?

Possible costs of inaccuracy, different inaccurate systems will have different sets of problems:

* Some players will be overranked, making all their games very hard. Others will be underranked, making their games very easy.
* Handicap games will be consistently very easy (or very hard) for white to win
* Clusters of players who play each other a lot can drift away from the population, meaning that if you play somebody from a group of friends, then even though your ranks are equal the game could be an easy win for one or the other of you.

There are others, but that's off the top of my head.

Mef · Post by **Mef** » Tue Aug 30, 2011 11:54 am

Kaya.gs wrote: A loss on KGS feels like it's there to drag you down forever, or 6 months, which is pretty much the same.

I like the concept of a fixed number of games. 14 victories = rank up. Makes it veeeery predictable. But I worry that being so unsophisticated, it will give inaccurate results, and people will find it too plain.

Then again, I don't know what "accurate" means. What is the price of inaccuracy? Winning easy games and losing hard games all the time?

Just because something feels a certain way doesn't mean that feeling is true, or even rational. (=

Last time I crunched the numbers (things may have changed since then, but I would imagine they are still close), the worst-case scenario if you play games at a consistent rate was: ~40% of your rating comes from games within the last month, and ~80% from games within the last three months.
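
Those two figures are roughly what a plain exponential decay would produce. This is a sketch under stated assumptions: a constant playing rate, a game's weight shrinking by a fixed factor per month, and no cutoff. The retention factor of 0.6 is picked only because it reproduces the ~40% figure; it is not taken from the KGS implementation.

```python
def recent_weight_share(months, monthly_retention=0.6):
    """Share of total rating weight carried by games played within the last
    `months` months, assuming a constant playing rate and weights that decay
    exponentially with no cutoff. monthly_retention is an assumed value.
    """
    return 1.0 - monthly_retention ** months

print(round(recent_weight_share(1), 2))  # 0.4
print(round(recent_weight_share(3), 2))  # 0.78
```

With any decay of this shape, old losses fade quickly: under these assumptions a game from six months ago carries under 5% of the weight of a game played today.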

Also, I think it would be reasonable to ask up front -- Do you want handicap games to be meaningful? If you are trying to have properly spaced handicap games, a simple winning streak formula is unlikely to give you reasonable results. So much so in fact, that it would be a poor idea to allow handicap games to be considered for rank. At that point I would recommend discarding the traditional rank structure anyway, because what does being three stones stronger than someone even mean if you can't give them three stones? If you are merely predicting even game winning chances, perhaps just give an elo style rating.

Kaya.gs · Post by **Kaya.gs** » Tue Aug 30, 2011 4:08 pm

wms wrote:
Once I built a system that would take any algorithm and spit out a score on how well it did, I was able, in the space of just a couple of weeks of tuning and tweaking, to come up with a system that to me works extremely well. It does have quirks, but all rating systems do, and the quirks (e.g., your rank moves even when you don't play) aren't things that bother me very much, while I'm very happy with the accuracy.

Edit: For an example, you have two games, A vs. B and C vs. D. If A and C won, and system1 said A has a 50% chance of winning, while C has a 60% chance, then system1 gets ln(0.5)+ln(0.6) = -1.204 points. If system2 said A had an 80% chance of winning and C had a 40% chance of winning, then system2 gets ln(0.8)+ln(0.4) = -1.139 points. System2 has a higher score, so system2 is the better rank system. (Note that the scores will always be negative, because probabilities are always less than 1, so whichever system is closer to a score of 0 is the better one).

Possible costs of inaccuracy, different inaccurate systems will have different sets of problems:

* Some players will be overranked, making all their games very hard. Others will be underranked, making their games very easy.
* Handicap games will be consistently very easy (or very hard) for white to win
* Clusters of players who play each other a lot can drift away from the population, meaning that if you play somebody from a group of friends, then even though your ranks are equal the game could be an easy win for one or the other of you.

There are others, but that's off the top of my head.

Your expertise here is very much appreciated! It is simple: one system is more accurate if it can predict results better.

I am a practical man, but I believe in theory also. The idea of comparing different rating systems and tweaking them makes me think of making this open source.

I will talk to Polly right away about making a project on GitHub that runs the statistics and can potentially have plug-and-play systems. This would allow us to compare different settings and also make it easy to tweak.

karaklis · Post by **karaklis** » Tue Aug 30, 2011 9:55 pm

Kaya.gs wrote: I do believe Wbaduk has a larger sample of players, which means it should show less inaccuracy. However, they have the issue that from 3d to weak 7d players are almost the same strength, and then inside 7d you feel a 2-stone difference.
I don't know why that happens.

In the dan ranks, WBaduk is more or less ok, but in the kyu ranks, below about 2k it is completely crap.

In spite of its weaknesses the KGS ranking system seems to be the most accurate among the common realtime go servers especially in these areas.

danielm · Post by **danielm** » Wed Aug 31, 2011 2:00 am

HermanHiddema wrote: Allow a player to earn the right to a temporary promotion. For example: If a player wins 4 games in a row, they get one "promotion credit", with which they can start a single game at one rank higher than their usual rank. This is invisible to the opponent. Such a game, because it is played at the normal handicap for one rank higher, gives a player a chance to gain rating more quickly. I think many players would be really psyched to earn and play such games.

This reminds me of one thing I read about the StarCraft 2 league system, where it would put a player up against a player from a stronger league occasionally to test their skill.

The concept of this seems very promising to me, as it should solve the issue of very long winning (or losing) streaks without making ranks too volatile. E.g. if a 4k wins four games in a row (or sooner), the account could be marked as 4k+ (or something else to avoid confusion with the IGS +, or it doesn't have to be visibly marked at all), meaning that the player is still considered a 4k, but will play the next game(s) as a 3k handicap-wise to test his strength more severely. Perhaps this would only apply to automatching, and the default handicap suggestions in manual games.

This could also happen the other way around with players playing one handicap stone weaker (4k- playing as 5k), which might make serious slumps more frustrating, but at the same time might also help to recover if the player regains confidence from playing truly weaker players in even games.

In chess, the lack of handicaps has the advantage that one can increase (or ruin...) one's rating quite fast by playing significantly higher or lower rated players, and this concept would bring some of that to go without losing the advantages of proper handicap games. While it would more often lead to non-proper handicap games, that is not necessarily a bad thing, as the rating system will take those differences into account of course (and there is nothing wrong in essence with occasionally playing an easy or hard game; after all, chess players do this almost every single time they play).

It might be harshest on the opponents of e.g. a 4k+ player, because they stand a lot to lose from losing against a 4k in an even game who might actually be stronger, but I'm sure that can be balanced out with some math geekery.

E.g. rating change could be less severe for the opponents of such a "tested" player, or corrected afterwards if the rating of the 4k+ actually changes (which I believe something like WHR would do anyway?).

Mef · Post by **Mef** » Wed Aug 31, 2011 3:55 am

danielm wrote: It might be harshest on the opponents of e.g. a 4k+ player, because they stand a lot to lose from losing against a 4k in an even game who might actually be stronger, but I'm sure that can be balanced out with some math geekery. E.g. rating change could be less severe for the opponents of such a "tested" player, or corrected afterwards if the rating of the 4k+ actually changes (which I believe something like WHR would do anyway?).

For Whole-history and Decayed-history (KGS) there is no penalty for helping an underranked player get promoted, as ultimately the promotion is figured in with the ranking calculations. For incremental systems like Elo, or "win X number of games to promote", helping an underranked person earn a promotion requires a bit of altruism, as the risk/reward scenario is more one-sided.

shapenaji · Post by **shapenaji** » Wed Aug 31, 2011 4:46 am

Herman: You know, I've often wondered if the best approach to rank is to track a player's distribution and mean, and then use Bayes' theorem to update the distribution based on the distribution of their defeated and victorious opponents.

Then you could look at each player's unique distribution, rather than just assuming everybody has a normal-distribution...
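
A rough sketch of what such an update could look like on a discrete grid of ranks. Everything here is an assumption made for illustration: the logistic win curve, the steepness `k`, and the simplification of treating the opponent's rank as a known point value rather than a distribution.

```python
import math

def bayes_update(grid, prior, opponent_rank, won, k=1.0):
    """One Bayes-theorem update of a discrete rank distribution.

    grid: candidate ranks for the player (in stones)
    prior: probabilities over grid, summing to 1
    opponent_rank: treated as a known point value (a simplification)
    k: assumed steepness of the logistic win curve
    """
    posterior = []
    for rank, p in zip(grid, prior):
        p_win = 1.0 / (1.0 + math.exp(-k * (rank - opponent_rank)))
        likelihood = p_win if won else 1.0 - p_win
        posterior.append(p * likelihood)
    total = sum(posterior)
    return [p / total for p in posterior]

def mean(grid, dist):
    """Expected rank under a discrete distribution."""
    return sum(r * p for r, p in zip(grid, dist))

grid = [x / 2 for x in range(-8, 9)]   # -4.0 .. 4.0 in half-stone steps
prior = [1 / len(grid)] * len(grid)    # flat prior, no normality assumed
after_win = bayes_update(grid, prior, opponent_rank=0.0, won=True)
after_loss = bayes_update(grid, after_win, opponent_rank=1.0, won=False)
```

Because the distribution is stored point by point, it can take any shape the results push it towards, which is the appeal over assuming a normal distribution up front.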

Life In 19x19
