Mef wrote: What we have at the core is two questions: "How strong was the bot performing at a given time?" vs. "How do you expect the bot to perform on its next game?"
Many people are worried about the former and this is related to what Polama is calculating. The performance of the bot on that day was clearly well below 11k. This is very easy to show with very high statistical certainty.
The other question is related, but it is not the same. Likewise, the calculation of the expected result is not the same either.
Well summarized!
If we were to look for analogies, the closest we will probably find to something like this is a sports injury. If a player is injured, their performance may suffer a sudden drastic drop, but you would not expect this to be representative of how they will perform if and when they recover.
I'd also been thinking about that analogy, with baseball season starting up. This general topic is debated endlessly there: if a good player has a terrible year or an injury, what do we expect from him the next year? Sometimes they recover fully, sometimes they don't.
The KGS rating system aims to answer the latter question (predict the outcome of the next game), at the necessary expense of the former question (describe the result of the previous game). This of course implies that a bit of regression to the mean is ever-present in all of its calculations.
I mostly just enjoyed reasoning through the math and its implications, but if I came to any conclusion it's that the streak was too extreme to assume a bounce-back was imminent. Modeling explicitly by time, sure, one day is a blip. But modeling by game, 260 games isn't, even out of 17K. To switch to the sports metaphor:
If a leadoff hitter usually bats .300 and goes .200 in a month span, I'm going to predict that he'll be right back to .300 next month. That sort of variation occurs. We should trust his track record.
If he instead goes .015 in a month, I don't expect him to immediately jump back to .300. If I'm the manager, I'm not going to bat him leadoff until he demonstrates he can hit again over at least a week or two. The signal that something has fundamentally changed is just too strong to ignore. Maybe he does return to full form tomorrow. But until he demonstrates some change, I think predicting an imminent bounce-back is too aggressive, and that you'd get more predictions right by saying "ok, he's not very good right now and won't play well next game".
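To put rough numbers on the difference between those two slumps, here is a quick back-of-the-envelope sketch. It assumes about 100 at-bats in the month and treats each at-bat as an independent coin flip with a true .300 chance of a hit; both are simplifications I'm making up for illustration, and the helper name binom_cdf is just mine. With 100 at-bats, .200 is 20 hits, and .015 is rounded up to 2 hits to be generous to the hitter.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed directly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

AT_BATS = 100     # assumed at-bats in a month (hypothetical)
TRUE_AVG = 0.300  # the hitter's established level

# Chance that a true .300 hitter bats .200 or worse over the month.
p_slump = binom_cdf(20, AT_BATS, TRUE_AVG)

# Chance that a true .300 hitter gets 2 hits or fewer (about .015).
p_collapse = binom_cdf(2, AT_BATS, TRUE_AVG)

print(f"P(.200 or worse)  = {p_slump:.4f}")
print(f"P(2 hits or fewer) = {p_collapse:.3e}")
```

Under those assumptions the .200 month comes out at a percent or two, which is rare for one player but happens all the time across a league. The .015 month is many orders of magnitude less likely, so "he's still a .300 hitter and just got unlucky" stops being a believable explanation.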
I agree the formula is probably reasonable for normal, human levels of variation. But at these levels of play and variation it looks overly stubborn in its insistence, for hundreds of games at a time, that the next one will be different. Ok, the next one. Ok, the next one...