A Curious Case Study in KGS Ranks

Mef · Post by **Mef** » Wed Mar 26, 2014 6:40 pm

mitsun wrote:
Mef wrote: The KGS rating system aims to answer the latter question (predict the outcome of the next game), at the necessary expense of the former question (describe the result of the previous game).
Hmm, I thought I understood the KGS rating system until I read this. I would have said that the KGS rating system is designed to accurately describe the results of the previous games, with the assumption that this allows it to predict the outcome of the next game.

On the subject of a player whose rank changes drastically and discontinuously, that is an unusual case which violates the assumptions of the rating model, and I don't think it is particularly interesting to see how KGS or any other rating system copes with this anomaly.

This is one of those where perhaps you run into the definitions game, but what I mean is this:

KGS's rating is always how it predicts you will play your next game. This is one of the reasons why things like rank drift, etc. occur. In spite of new knowledge learned, it never goes back to alter a previous prediction made. For instance, in the situation that spawned this thread, if you wanted a better descriptive system you would take the set of known data and fit a changing rank to it (probably ending up with a model that has a good fit for a 24-hour step change, then reverting).
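To make the "descriptive" idea concrete, here is a rough sketch of fitting a one-day step change after the fact. It is not how KGS works: it uses a plain daily win rate as a stand-in for rank, and every daily record except the 6-263 day is made up for illustration.

```python
import math

def log_likelihood(wins, games, p):
    """Binomial log-likelihood of `wins` in `games` at win rate p (constant term omitted)."""
    p = min(max(p, 1e-9), 1 - 1e-9)  # clamp to avoid log(0)
    return wins * math.log(p) + (games - wins) * math.log(1 - p)

# Hypothetical (wins, games) per day; only the 6-263 day comes from the thread.
days = [(50, 100), (48, 100), (6, 269), (52, 100), (49, 100)]

total_wins = sum(w for w, _ in days)
total_games = sum(g for _, g in days)

# Model A: one constant win rate for the whole period.
p_const = total_wins / total_games
ll_const = sum(log_likelihood(w, g, p_const) for w, g in days)

# Model B: a baseline win rate, plus one day with its own rate (the step change).
def step_fit(step_day):
    step_wins, step_games = days[step_day]
    base_p = (total_wins - step_wins) / (total_games - step_games)
    ll = log_likelihood(step_wins, step_games, step_wins / step_games)
    ll += sum(log_likelihood(w, g, base_p)
              for i, (w, g) in enumerate(days) if i != step_day)
    return ll

# Try each day as the candidate step and keep the best-fitting one.
best_day = max(range(len(days)), key=step_fit)
ll_step = step_fit(best_day)

print(f"constant model log-likelihood: {ll_const:.1f}")
print(f"step model log-likelihood:     {ll_step:.1f} (step on day index {best_day})")
```

The step model can look back and single out the bad day, which is exactly what a purely predictive rating never does.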

moboy78 · Post by **moboy78** » Wed Mar 26, 2014 7:36 pm

I don't really care about all the math and statistics in this thread thus far, but from what I see and have seen on this forum, I think it's safe to say that while the kgs rating system is mathematically sound, it generally makes people annoyed. I think I speak for just about every go player I've met when I say that if I have a winning streak that lasts for an extended period of time, I would want my actual rank, not my rating, to reflect my increase in strength. That holds even if I play a lot of games on kgs a week (I realize that "a lot" is a vague term but it doesn't take a genius to think of a number of games big enough to qualify as "a lot" for that period of time), which as I understand kgs's rating system would make each win count for less than a win for someone who plays fewer games. If I can go around thrashing everyone my rank (let's assume for the sake of argument that I'm not really playing many people outside of my rank during this proposed winning streak), then it seems to me that I'd need to give those who were once considered my peers a handicap to keep the games fair, therein making me a rank above them.

I can speak from experience that this doesn't really happen on kgs. I remember when I was just a little ways away from becoming a 3 kyu on kgs (assuming my rank graph and the distances between ranks can be believed), I played quite a lot of games. I'd usually play just about every day of the week, and would always try to play enough go games a day so that I'd have won more games than I'd have lost (which, given the fact that I lost far fewer games than I won, meant my record on a "bad day" would look something like 2-1 in my favor). At that particular point in time I ended up getting a 12 or 14 game winning streak (I don't really remember which it was) and got promoted to 3 kyu. But by that time I was barely a 3 kyu, and after my streak ended I was almost a 4 kyu once more. My rank graph had barely changed at all, even though I could tell I'd gotten much stronger than I was before (and I'm not tooting my own horn; I was told this by others). This, understandably, really irked me.

I realize that my example might not be very scientific, but I do think it highlights a feeling many on kgs share.

And I also think that Mef's original example of GNUgo2 is absolutely worthless, because to go 6-263 in a single day clearly shows that the player has gotten weaker at the end of the day than when the day started. I understand that that losing streak was caused by the bot's owner rather than the bot itself, but a human player would have no such excuse. KGS's ranking system would've failed to punish a human player for such a long streak of losses, nor would it have properly rewarded a player for a similarly long streak of wins.

uPWarrior · Post by **uPWarrior** » Wed Mar 26, 2014 7:46 pm

skydyr wrote:
uPWarrior wrote: I don't have data but I think most people would agree both things apply: stones are not fully transitive and stronger players play less swingy games.

The second fact wouldn't impact the rating system at all, if a 7k wins 50% of the games against a 4k then he should be 6k and the same would be true for the 7d. (how easy it is to actually win 50% of the games against a player 3 stones stronger wouldn't have to be modeled at all)
Transitivity could be a problem, you could try to model that distribution instead of the player distribution as this one would not be biased (your player distribution model depends on your own ranking system, while the win/loss ratio does not). Or you could just ignore the fact that high handicaps aren't transitive as they are so rare anyway..
If a 7k is winning 50% of 3 stone games against a 4k, and losing 50% of them, why would you assume their rank should be increased? I suspect I've misunderstood your argument.

I wanted to write "if a 7k wins 50% of the games against a 4k with 2 stones" but forgot the last part.

Polama · Post by **Polama** » Thu Mar 27, 2014 7:24 am

Mef wrote:What we have at the core is two questions: "How strong was the bot performing at a given time?" vs. "How do you expect the bot to perform on its next game?"

Many people are worried about the former and this is related to what Polama is calculating. The performance of the bot on that day was clearly well below 11k. This is very easy to show with very high statistical certainty.

The other question is related, however it is not the same. Likewise, when you calculate the expected result it is also not the same.

Well summarized!

If we were to look for analogies, the closest we will probably find to something like this is a sports injury. If a player is injured, their performance may suffer a sudden drastic drop, but you would not expect this to be representative of how they will be expected to perform if and when they recover.

I'd also been thinking about that analogy, with baseball season starting up. This general topic is debated endlessly there: if a good player has a terrible year or an injury, what do we expect from him the next year? Sometimes they recover fully, sometimes they don't.

The KGS rating system aims to answer the latter question (predict the outcome of the next game), at the necessary expense of the former question (describe the result of the previous game). This of course always implies there is a bit of regression to the mean ever-present in all of its calculations.

I mostly just enjoyed reasoning through the math and its implications, but if I came to any conclusion it's that the streak was too extreme to assume a bounce-back was imminent. Modeling explicitly by time, sure, one day is a blip. But modeling by game, 260 isn't, even out of 17K. To switch to the sports metaphor:

If a leadoff hitter usually bats .300 and goes .200 in a month span, I'm going to predict that he'll be right back to .300 next month. That sort of variation occurs. We should trust his track record.

If he instead goes .015 in a month, I don't expect him to immediately jump back to .300. If I'm the manager, I'm not going to bat him leadoff until he demonstrates he can hit again over at least a week or two. The signal that something has fundamentally changed is just too strong to ignore. Maybe he does return to full form tomorrow. But until he demonstrates some change, I think predicting an imminent bounce-back is too aggressive, that you'd get more predictions right by saying "ok, he's not very good right now and won't play well next game".

I agree the formula is probably reasonable for normal, human levels of variation. But at these levels of play and variation it looks overly stubborn in its insistence for hundreds of games at a time that the next one will be different. Ok, the next one. Ok the next one...
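For a sense of how strong that signal is, here is a back-of-the-envelope check on the 6-263 day quoted earlier in the thread. The 50% win rate is my assumption for a correctly rated bot playing even or properly handicapped games; it is not a number taken from KGS.

```python
import math

def binomial_tail_at_most(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

wins, losses = 6, 263      # the bot's record for the day, as quoted in the thread
games = wins + losses
assumed_p = 0.5            # assumption: a correctly rated bot wins about half its games

# Chance of doing this badly or worse if nothing about the bot had changed.
tail = binomial_tail_at_most(wins, games, assumed_p)
print(f"P(at most {wins} wins in {games} games at p={assumed_p}) = {tail:.3g}")
```

Under that assumption the probability is vanishingly small, which is why I read the streak as "something changed" rather than as ordinary variation.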

RBerenguel · Post by **RBerenguel** » Thu Mar 27, 2014 8:46 am

Polama wrote:
Mef wrote:What we have at the core is two questions: "How strong was the bot performing at a given time?" vs. "How do you expect the bot to perform on its next game?"

Many people are worried about the former and this is related to what Polama is calculating. The performance of the bot on that day was clearly well below 11k. This is very easy to show with very high statistical certainty.

The other question is related, however it is not the same. Likewise, when you calculate the expected result it is also not the same.
Well summarized!

If we were to look for analogies, the closest we will probably find to something like this is a sports injury. If a player is injured, their performance may suffer a sudden drastic drop, but you would not expect this to be representative of how they will be expected to perform if and when they recover.
I'd also been thinking about that analogy, with baseball season starting up. This general topic is debated endlessly there: if a good player has a terrible year or an injury, what do we expect from him the next year? Sometimes they recover fully, sometimes they don't.

The KGS rating system aims to answer the latter question (predict the outcome of the next game), at the necessary expense of the former question (describe the result of the previous game). This of course always implies there is a bit of regression to the mean ever-present in all of its calculations.
I mostly just enjoyed reasoning through the math and its implications, but if I came to any conclusion it's that the streak was too extreme to assume a bounce-back was imminent. Modeling explicitly by time, sure, one day is a blip. But modeling by game, 260 isn't, even out of 17K. To switch to the sports metaphor:

If a leadoff hitter usually bats .300 and goes .200 in a month span, I'm going to predict that he'll be right back to .300 next month. That sort of variation occurs. We should trust his track record.

If he instead goes .015 in a month, I don't expect him to immediately jump back to .300. If I'm the manager, I'm not going to bat him leadoff until he demonstrates he can hit again over at least a week or two. The signal that something has fundamentally changed is just too strong to ignore. Maybe he does return to full form tomorrow. But until he demonstrates some change, I think predicting an imminent bounce-back is too aggressive, that you'd get more predictions right by saying "ok, he's not very good right now and won't play well next game".

I agree the formula is probably reasonable for normal, human levels of variation. But at these levels of play and variation it looks overly stubborn in its insistence for hundreds of games at a time that the next one will be different. Ok, the next one. Ok the next one...

One of my most "staty" friends works on models for insurance. He once told me about large deviations theory (there is a Wikipedia article on it). It's relatively close to this idea of seeing such numbers and wondering "WTF" while building a relevant model for it.

Mef · Post by **Mef** » Thu Mar 27, 2014 3:34 pm

Polama wrote:To switch to the sports metaphor:

If a leadoff hitter usually bats .300 and goes .200 in a month span, I'm going to predict that he'll be right back to .300 next month. That sort of variation occurs. We should trust his track record.

If he instead goes .015 in a month, I don't expect him to immediately jump back to .300. If I'm the manager, I'm not going to bat him leadoff until he demonstrates he can hit again over at least a week or two. The signal that something has fundamentally changed is just too strong to ignore. Maybe he does return to full form tomorrow. But until he demonstrates some change, I think predicting an imminent bounce-back is too aggressive, that you'd get more predictions right by saying "ok, he's not very good right now and won't play well next game".

I agree the formula is probably reasonable for normal, human levels of variation. But at these levels of play and variation it looks overly stubborn in its insistence for hundreds of games at a time that the next one will be different. Ok, the next one. Ok the next one...

I'm on a phone right now so I can't give this the response I want to, but since you brought in the baseball reference, I couldn't resist. The player you are likely looking for in this analogy is Craig Counsell. He was a career .250 hitter (over 4000+ AB) who a couple years ago out of the blue went 0 for 45. After he finally snapped his slump he hit pretty much back at his career line.

To bring it back to the bot, projecting it to be 1-2 stones lower is basically the go equivalent of sending it back to AAA (expecting a sub .200 average)
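For the curious, a quick sketch of how unlikely an 0 for 45 is for a .250 hitter. It treats every at-bat as an independent trial at the career average, which is an assumption of mine and a simplification, and it takes 4000 as the career at-bat count (the lower bound of "4000+").

```python
career_average = 0.250    # from the post: career .250 hitter
hitless_at_bats = 45      # from the post: 0 for 45
career_at_bats = 4000     # from the post: "4000+ AB", lower bound used here

# Assumption: each at-bat is independent with hit probability equal to the career average.
p_streak = (1 - career_average) ** hitless_at_bats

# Expected number of 45 at-bat windows in the career that contain no hits.
# Windows overlap, but expectations still add, so this is a fair rough count.
windows = career_at_bats - hitless_at_bats + 1
expected_streaks = windows * p_streak

print(f"P(0 for {hitless_at_bats} in a given stretch) = {p_streak:.3g}")
print(f"expected hitless stretches over the career  = {expected_streaks:.3g}")
```

So by the naive numbers it was a wildly unlikely slump, and he still went right back to his career line afterwards.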

Life In 19x19
