A Curious Case Study in KGS Ranks

Comments, questions, rants, etc, that are specifically about KGS go here.
Mef
Lives in sente
Posts: 852
Joined: Fri Apr 23, 2010 8:34 am
Rank: KGS [-]
GD Posts: 428
Location: Central Coast
Has thanked: 201 times
Been thanked: 333 times

Re: A Curious Case Study in KGS Ranks

Post by Mef »

arndt wrote:I'm coming in late, and I haven't read the preceding discussion in detail, but has it been stated what the ranks of the users challenging the bot during its resigning period were? If they were 15k to 10k, why should it plummet to 30k rather than 15k?

Sorry if this was already asked and answered.



Bots will typically play a wide range of players at a wide range of handicaps. For the most part this bot played 5k-18k players, anywhere from giving to taking 6 handicap stones. The distribution of games between white and black was roughly equal, and for the most part the handicaps were the defaults assigned by KGS.

That said, you are correct: looking at just the win/loss rate over those games, you would expect this bot to be placed around 15k (3 stones below the 12k it was ranked at while its rating was dropping)
hashimoto
Beginner
Posts: 4
Joined: Mon Feb 24, 2014 10:47 pm
GD Posts: 0
Been thanked: 1 time

Re: A Curious Case Study in KGS Ranks

Post by hashimoto »

Mef wrote:Now, moving on to your hypothetical we are venturing into the part of these discussions I dislike -- Instead of having real data and real results being analyzed, we have a made-up situation, with vague "data" being presented, theorized behavior being speculated on, and then the rating system criticized for it.


Yes, you seem to prefer real data and real results... as long as those results fit within the narrow model of how you think ranking should behave. Of course you can find data to explain how the KGS ranking system works. You can't seem to see past the idea that the model itself may not be desirable for some people.
RBerenguel
Gosei
Posts: 1585
Joined: Fri Nov 18, 2011 11:44 am
Rank: KGS 5k
GD Posts: 0
KGS: RBerenguel
Tygem: rberenguel
Wbaduk: JohnKeats
Kaya handle: RBerenguel
Online playing schedule: KGS on Saturday I use to be online, but I can be if needed from 20-23 GMT+1
Location: Barcelona, Spain (GMT+1)
Has thanked: 576 times
Been thanked: 298 times
Contact:

Re: A Curious Case Study in KGS Ranks

Post by RBerenguel »

RobertJasiek wrote:Thanks, but these data can be influenced by how frequently players attend tournaments or play regularly on servers, which might be lower for beginner ranks.


Histogram of GoR for players with more than 5 tournaments, 40 breaks (4149 players)

[Attachment: Screen Shot 2014-03-26 at 17.15.08.png]

Multimodal anyway
Geek of all trades, master of none: the motto for my blog mostlymaths.net
hyperpape
Tengen
Posts: 4382
Joined: Thu May 06, 2010 3:24 pm
Rank: AGA 3k
GD Posts: 65
OGS: Hyperpape 4k
Location: Caldas da Rainha, Portugal
Has thanked: 499 times
Been thanked: 727 times

Re: A Curious Case Study in KGS Ranks

Post by hyperpape »

I can't speak for Mef, but I'm perfectly willing to accept that some people would trade away a portion of the predictive power that KGS has for the predictability that other systems have. I don't think that trade-off is absolutely wrong, though it's not the one I'd make. But I also expect them to be clear about how their alternative might work and that this is the trade they're making. I also expect them to be just as accurate about what KGS actually does.

Re: A Curious Case Study in KGS Ranks

Post by Mef »

hashimoto wrote:
Mef wrote:Now, moving on to your hypothetical we are venturing into the part of these discussions I dislike -- Instead of having real data and real results being analyzed, we have a made-up situation, with vague "data" being presented, theorized behavior being speculated on, and then the rating system criticized for it.


Yes, you seem to prefer real data and real results... as long as those results fit within the narrow model of how you think ranking should behave. Of course you can find data to explain how the KGS ranking system works. You can't seem to see past the idea that the model itself may not be desirable for some people.



I prefer real data and real results, period. When real data is unavailable, modeling and analysis are a reasonable substitute. I prefer having clearly stated, objectively measurable goals with which to evaluate a rating system. I strongly dislike vague hypotheticals and baseless speculation.

I understand completely that different people want different things out of rating systems, and I frequently acknowledge that. KGS's system strives for accuracy over noise. Many have stated they prefer noise because to them it is more fun. There's nothing wrong with that; it's just not a goal of KGS's rating system.

In KGS's subforum, discussing KGS's rating system, I assume (unless otherwise stated) that we are evaluating systems based on prediction accuracy, because I assume the rating system's aim is well known.

If we wanted to establish other objectively measurable goals for a rating system, I would be happy to evaluate it with those in mind, but for the here and now I chose these.
Polama
Lives with ko
Posts: 248
Joined: Wed Nov 14, 2012 1:47 pm
Rank: DGS 2 kyu
GD Posts: 0
Universal go server handle: Polama
Has thanked: 23 times
Been thanked: 148 times

Re: A Curious Case Study in KGS Ranks

Post by Polama »

Mef wrote:Now, moving on to your hypothetical we are venturing into the part of these discussions I dislike -- Instead of having real data and real results being analyzed, we have a made-up situation, with vague "data" being presented, theorized behavior being speculated on, and then the rating system criticized for it.


Ok, let's stay away from hypotheticals. What we strictly, factually know is that over 242 games this account was at least 3 stones weaker, potentially more depending on the exact nature of the bug. The length of the streak was such that bad luck is completely out of the question as an explanation. We can consider the previous 17,000 games, but again, the streak was long enough that we can very easily see that these are distinct distributions. Any statistician looking at the results would state that there's no longer a connection between the earlier record and the newest record.

I think the only possible conclusion without bringing in hypothetical factors is that the algorithm was wrong in this case. Given the time series of results, I would think a student in a statistics class would not be marked correct for estimating the win% of the next game against an 11 kyu at 30%, or whatever the lowest rank reached would suggest.
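As a sanity check, the improbability of a streak like this can be computed with an exact binomial tail. The win count below is hypothetical (the exact record over the 242 games isn't given in the thread); the point is only how fast the tail vanishes for a correctly rated player who should win about half of even, handicap-adjusted games.

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Hypothetical numbers: a correctly rated player should win ~50% of
# 242 rated games; suppose the bot won only 80 of them.
tail = binom_cdf(80, 242, 0.5)
print(f"P(<= 80 wins in 242 games at 50%): {tail:.3g}")
```

A tail probability this small is what "bad luck is completely out of the question" means quantitatively.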

Re: A Curious Case Study in KGS Ranks

Post by RBerenguel »

Polama wrote:
Mef wrote:Now, moving on to your hypothetical we are venturing into the part of these discussions I dislike -- Instead of having real data and real results being analyzed, we have a made-up situation, with vague "data" being presented, theorized behavior being speculated on, and then the rating system criticized for it.


Ok, let's stay away from hypotheticals. What we strictly, factually know is that over 242 games this account was at least 3 stones weaker, potentially more depending on the exact nature of the bug. The length of the streak was such that bad luck is completely out of the question as an explanation. We can consider the previous 17,000 games, but again, the streak was long enough that we can very easily see that these are distinct distributions. Any statistician looking at the results would state that there's no longer a connection between the earlier record and the newest record.

I think the only possible conclusion without bringing in hypothetical factors is that the algorithm was wrong in this case. Given the time series of results, I would think a student in a statistics class would not be marked correct for estimating the win% of the next game against an 11 kyu at 30%, or whatever the lowest rank reached would suggest.


A student in statistics wouldn't look at the data and say, "hey, this player is a sucker now!" Instead he'd fit hundreds of players' results and games with an ARMA or ARIMA process (for instance), and dismiss the error in this particular case as "well, fit happens."
Geek of all trades, master of none: the motto for my blog mostlymaths.net
skydyr
Oza
Posts: 2495
Joined: Wed Aug 01, 2012 8:06 am
GD Posts: 0
Universal go server handle: skydyr
Online playing schedule: When my wife is out.
Location: DC
Has thanked: 156 times
Been thanked: 436 times

Re: A Curious Case Study in KGS Ranks

Post by skydyr »

RBerenguel wrote:
Polama wrote:Ok, let's stay away from hypotheticals. What we strictly, factually know is that over 242 games this account was at least 3 stones weaker, potentially more depending on the exact nature of the bug. The length of the streak was such that bad luck is completely out of the question as an explanation. We can consider the previous 17,000 games, but again, the streak was long enough that we can very easily see that these are distinct distributions. Any statistician looking at the results would state that there's no longer a connection between the earlier record and the newest record.

I think the only possible conclusion without bringing in hypothetical factors is that the algorithm was wrong in this case. Given the time series of results, I would think a student in a statistics class would not be marked correct for estimating the win% of the next game against an 11 kyu at 30%, or whatever the lowest rank reached would suggest.


A student in statistics wouldn't look at the data and say, "hey, this player is a sucker now!" Instead he'd fit hundreds of players' results and games with an ARMA or ARIMA process (for instance), and dismiss the error in this particular case as "well, fit happens."


I would additionally point out that, from a long-term perspective, the data from the loss streak is a fluke and should not be counted heavily. The corollary to "the rank should have dropped at least X stones" is that as soon as the 12-hour issue was over, the new rank would be equally far off from the presumably correct rank it held before the problem occurred. I suppose you could argue that it would then be a feature of the proposed system's volatility that the rank goes back up relatively quickly, but it seems better not to have the huge rating discrepancy in the first place. If you look at the 24 hours before the occurrence and the 24 hours after, the rating system as it is seems significantly more correct than a proposed more volatile one.

Looking at the somewhat different rating systems used on DGS and the old OGS (I'm not sure about the new one), they both suffer from problems when a player stops playing and loses some large number of games by timeout over the weeks they are gone. Their rank drops 10 or more stones at a blow, and when they start playing again, the act of fighting their way back up to their old rank destabilises the entire ranking system to some degree, as all the ranks get corrected, and may end up skewing it in one direction or another over time.

Going with the assumption that rank differences should be relatively predictive of game outcomes, why is this a good thing? And as Mef mentioned, if you don't think rank differences should be relatively predictive of game outcomes, you should be looking at a different system, or asking whether you actually need to worry about rank at all, rather than at the KGS one, which has this explicit goal.

Re: A Curious Case Study in KGS Ranks

Post by Polama »

RBerenguel wrote:A student in statistics wouldn't look at the data and say, "hey, this player is a sucker now!" Instead he'd fit hundreds of players' results and games with an ARMA or ARIMA process (for instance), and dismiss the error in this particular case as "well, fit happens."


I'm not a statistician so I can't speak with any authority, but I think an advanced statistical model would view this case as a meaningful shift, as an extreme outlier, or at least as not conforming to the expected distribution.

skydyr wrote:I would additionally point out that, from a long-term perspective, the data from the loss streak is a fluke and should not be counted heavily. The corollary to "the rank should have dropped at least X stones" is that as soon as the 12-hour issue was over, the new rank would be equally far off from the presumably correct rank it held before the problem occurred.


My point is that I think people are underestimating how powerful a signal a loss streak of this magnitude really is. I understand we have priors that say "people don't drop multiple stones all at once," but this should overwhelm those priors. We wouldn't expect this sort of streak by chance even with trillions of go players. Something definitely happened above and beyond a bad day. In this case it was counteracted the next day, but I see no reason to assume such a shift will inherently be followed immediately by a recovery. I'd bet there has never been a streak with even a hundredth of this improbability in an established, non-robot account.
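To make "overwhelm those priors" concrete, here is a toy likelihood-ratio calculation. The win count and the two win rates are invented stand-ins, not real data from the account:

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical: 80 wins in 242 games.  Under "strength unchanged" the
# bot should win ~50%; under "dropped several stones" perhaps ~33%.
like_unchanged = binom_pmf(80, 242, 0.50)
like_dropped = binom_pmf(80, 242, 0.33)

# Likelihood ratio (Bayes factor) in favour of the "dropped" hypothesis.
bayes_factor = like_dropped / like_unchanged
print(f"Bayes factor for 'dropped': {bayes_factor:.3g}")
```

Even a prior of 1000:1 against a sudden multi-stone drop is swamped by a ratio this large; that is the sense in which the data should overwhelm the prior.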

Now, this case is a bizarre one. It's an extreme edge case. I'm fine with the algorithm not handling it well. I find it interesting for its extremeness, and it's not something you should draw conclusions from for general players. My point was merely that I wouldn't hold this up as an example of the algorithm getting an extreme case right. I think it got an extreme case wrong in the way I'd expect it to get it wrong.

Re: A Curious Case Study in KGS Ranks

Post by RBerenguel »

Polama wrote:
RBerenguel wrote:A student in statistics wouldn't look at the data and say, "hey, this player is a sucker now!" Instead he'd fit hundreds of players' results and games with an ARMA or ARIMA process (for instance), and dismiss the error in this particular case as "well, fit happens."


I'm not a statistician so I can't speak with any authority, but I think an advanced statistical model would view this case as a meaningful shift, as an extreme outlier, or at least as not conforming to the expected distribution.


I'm no statistician either, but I know a little about it (and I know people who are into statistical modelling). Models, as such, are general. Outliers? Well, they are outliers. A model only needs to fit most of the subjects; if it does a good job with most subjects, it is a good model. There's probably a better model than KGS's current one (one that takes into account history, weights, fast improvement, etc.), but it is probably too hard to find to be worth the effort.
Geek of all trades, master of none: the motto for my blog mostlymaths.net

Re: A Curious Case Study in KGS Ranks

Post by Polama »

RBerenguel wrote:
Polama wrote:
RBerenguel wrote:A student in statistics wouldn't look at the data and say, "hey, this player is a sucker now!" Instead he'd fit hundreds of players' results and games with an ARMA or ARIMA process (for instance), and dismiss the error in this particular case as "well, fit happens."


I'm not a statistician so I can't speak with any authority, but I think an advanced statistical model would view this case as a meaningful shift, as an extreme outlier, or at least as not conforming to the expected distribution.


I'm no statistician either, but I know a little about it (and I know people who are into statistical modelling). Models, as such, are general. Outliers? Well, they are outliers. A model only needs to fit most of the subjects; if it does a good job with most subjects, it is a good model. There's probably a better model than KGS's current one (one that takes into account history, weights, fast improvement, etc.), but it is probably too hard to find to be worth the effort.


Agreed. Models usually aren't judged on how they handle outliers, although there are obviously differences in how well they do. In some fields outliers are exactly what you're most interested in, though that's not the case here.

But the performance of the KGS algorithm on an extreme outlier is specifically what this thread was created about. And although I agree it isn't particularly important, it's interesting. There's clearly disagreement on how the rating of this performance should have gone. I'm arguing that, given the full KGS records and asked about this account at that point in time, I certainly wouldn't say it was off by less than a stone. That's not intended to mean that it's a bad algorithm, just that in the case under discussion, I disagree with the conclusion.
Bantari
Gosei
Posts: 1639
Joined: Sun Dec 06, 2009 6:34 pm
GD Posts: 0
Universal go server handle: Bantari
Location: Ponte Vedra
Has thanked: 642 times
Been thanked: 490 times

Re: A Curious Case Study in KGS Ranks

Post by Bantari »

hashimoto wrote:You can't seem to see past the idea that the model itself may not be desirable for some people.

This seems to be the crux of the argument here. Yes, some people prefer fun over accuracy and predictive power, and such people have, for example, Tygem to have fun on. However, some people prefer accuracy and predictive power over wildly inaccurate ratings, and such people have, for example, KGS to play on.

I see absolutely no reason why all servers should be the same, catering to one specific group of people and making sure *those* selected people have more fun. It is a big world, and there is certainly room for a few *different* approaches. Especially since it seems that the preference of one over the other is purely subjective in the short run.

As for rank stability and inertia, I think both systems have advantages and disadvantages. For improperly ranked players, the KGS system offers much quicker adjustment than Tygem (as already stated). For properly rated players, under the more volatile system a single winning or losing streak (like a bad or good day) can dislodge them from their proper rank much faster, and thus make them "improperly" rated more easily. Wait... OK, it seems one system has more advantages than the other.

Somebody called Tygem ratings a roulette. It's fun, it's fast-paced, and it's exciting, and so there is a place for it in the world. Just like there is a place for arcade games and shoot-'em-up galleries and maybe even peep-holes.

Personally, I think accuracy and predictive power are more valuable than the cheap thrills of seeing the numbers by your name change daily. But that's just me. Or is it?

How about real-world ratings? Let's say RJ is 5d and on the verge of being invited to a prestigious tournament based on this rank. But look, there is a 4d player who just won 20 games in a row against his friend, and now he is invited instead, as a 6d. Ha ha ha, very exciting! So in reality, if you adopt a +/-0.1-per-game rating system, you will have to include all kinds of weights, checks, balances, and factors just to make it behave more sensibly, more like the current system (or like the KGS system).
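To illustrate why a bare +/-0.1 rule needs those checks and balances, compare it with an Elo-style update in which the gain scales with how surprising a win is. The ranks, k-factor, and scale constant below are all invented for illustration:

```python
def elo_gain(r_player: float, r_opp: float, k: float = 0.1) -> float:
    """Elo-style gain for a win: k times (1 - expected score).

    One rating unit is treated as one rank; the divisor 1.0 in the
    exponent is an arbitrary scale constant for this sketch.
    """
    expected = 1 / (1 + 10 ** ((r_opp - r_player) / 1.0))
    return k * (1 - expected)

# A 4d beats the same (hypothetical) much weaker friend, a 2d, 20 times.
flat_rule = 0.1 * 20                                    # flat rule: +2.0 ranks
elo_rule = sum(elo_gain(4.0, 2.0) for _ in range(20))   # barely moves

print(f"flat: +{flat_rule:.2f} ranks, elo-style: +{elo_rule:.3f} ranks")
```

The flat rule promotes the 4d past the 5d on a streak of expected wins; the expected-score weighting is exactly the kind of "check and balance" that has to be bolted on to make a volatile system behave sensibly.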

Arcade games and cheap thrills are good on a server, and I am glad such a server exists for those who like this kind of stuff. But this simply cannot be the *only* model we use, and not even the main one. This is what I think, even though I am not going to get into all this math stuff; I have enough of that at work without playing with it in my free time.
- Bantari
______________________________________________
WARNING: This post might contain Opinions!!
RobertJasiek
Judan
Posts: 6272
Joined: Tue Apr 27, 2010 8:54 pm
GD Posts: 0
Been thanked: 797 times
Contact:

Re: A Curious Case Study in KGS Ranks

Post by RobertJasiek »

Bantari, my proposal is not meant for accurate real world ranks.

Re: A Curious Case Study in KGS Ranks

Post by Bantari »

RobertJasiek wrote:Bantari, my proposal is not meant for accurate real world ranks.

This is obvious.

What you need to explain is why you would want such a system, which is not good enough for the real world, to be implemented on every server.
And before you object - if you only want it on *some* servers, it already is - on Tygem, no?
So I don't get what the fuss is about. Just play there and be happy as a clam.

You certainly have to admit that there *is* room for a major server with a more accurate, real-world-like rating system.
Or do you not want to admit that, and this is the point of contention?
- Bantari
______________________________________________
WARNING: This post might contain Opinions!!

Re: A Curious Case Study in KGS Ranks

Post by RobertJasiek »

Bantari wrote:What you need to explain is why you would want such a system, which is not good enough for the real world, to be implemented on every server.
And before you object - if you only want it on *some* servers, it already is - on Tygem, no?
So I don't get what the fuss is about. Just play there and be happy as a clam.

You certainly have to admit that there *is* room for a major server with a more accurate, real-world-like rating system.


There are also other reasons why I do not play much on other servers, such as strongly disliking having to use a different piece of software for every server.

There are other reasons to like KGS, so I want the worst part of KGS (the rating system) to improve, so that I can better enjoy the good features of KGS.

Whether my rating proposal or something similar is adopted on KGS is not so important. It is also a thought model meant to encourage overcoming the excessive rating stability that affects quite a few players. I have made other proposals that were rejected, but it is not the specific proposal that matters; it is the aim of overcoming the problem.

The proposal is a rough draft; I do not mind if it is improved, completed, or changed to also model accuracy to some reasonable extent, etc.

I have not said that one system must be used on all servers. You have made this up.

There is room for a server with real-world ratings. In fact, there is so much room that such a server does not even remotely exist. Don't even try to pretend that KGS would be such a server; that is ridiculous. On KGS, equally KGS-ranked players can easily be 5 real-world ranks apart.

There is also room for a server with accurate ratings, i.e., where almost equal ratings imply a great likelihood of 50% winning chances (in even games with non-integer komi). As before, which "accuracy" do you paint on KGS? Your dream of how accurate it should be? Accuracy can be measured by letting pairs of KGS players also play real-world games and assessing their winning percentages.

My system (when worked out to have global, non-deflationary stability) would have much greater volatility, but I am not at all convinced it would have lower accuracy. Rather, I think that, on average for every particular player, it would have greater accuracy, because it can correct his temporarily wrong ratings much more quickly.
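This volatility/accuracy trade-off can be sketched in a toy simulation. To be clear, this is not KGS's actual algorithm: the logistic win-probability curve, the step sizes, and the assumption that opponents are matched to the current rating are all made up for illustration.

```python
import random

def mean_abs_error(k: float, games: int = 400, seed: int = 1) -> float:
    """Toy rating tracker: the player's true rank drops 3 stones mid-run.

    Returns the mean absolute rating error over the run.  A larger step
    size k gives a noisier rating that re-converges faster after the drop.
    """
    rng = random.Random(seed)
    rating, total_err = 0.0, 0.0
    for g in range(games):
        true_rank = 0.0 if g < games // 2 else -3.0
        # Overrated players lose more than half their games (logistic model).
        p_win = 1 / (1 + 10 ** ((rating - true_rank) / 2.0))
        result = 1.0 if rng.random() < p_win else 0.0
        rating += k * (result - 0.5)  # opponents matched to rating => expect 50%
        total_err += abs(rating - true_rank)
    return total_err / games

for k in (0.05, 0.4):
    print(f"k={k}: mean |rating error| = {mean_abs_error(k):.2f}")
```

With a small k the rating is smooth but lags the 3-stone drop for a long time; with a large k it is noisy but re-converges within a handful of games. Which k wins on average error depends on how often real strength actually jumps, which is exactly the crux of the disagreement in this thread.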