Can We Stop Calling Kata "scoreMean" Points?

spook · Post by **spook** » Thu Dec 12, 2019 9:51 am

Here is a full chart of winrate, playouts and score estimation.

[Attachment: for this game.jpg, a chart of winrate, playouts and score estimation]

If we want to verify the correctness of the score estimations,
then I think it may be better to work our way back,
i.e. to verify if the drop at 257 is justifiable.

Bill Spight · Post by **Bill Spight** » Thu Dec 12, 2019 11:10 am

Thanks, spook.

Very nice graphs. I'm not sure exactly what's what, though.

Bill Spight · Post by **Bill Spight** » Thu Dec 12, 2019 11:27 am

spook wrote:If we want to verify the correctness of the score estimations,
then I think it may be better to work our way back,
i.e. to verify if the drop at 257 is justifiable.

Interesting that the score estimation graph starts to drop at White 248, where White cuts the Black stones off with sente. As for Black 257, it may be cleaner for Black to play the double atari. Then if White connects, Black makes 2 eyes. The score remains the same with perfect play, OC.

lightvector · Post by **lightvector** » Thu Dec 12, 2019 12:48 pm

Okay, at the cost of several percent of selfplay efficiency I think I can do this. I'll try it out (adding the new target to predict score difference between current and even).

If it works, then I think there's no more issue, right? For example, it should then consistently estimate the value of passing the first move of the game as losing somewhere from 13 to 15 points since that would be the komi adjustment needed to make the game fair after that (essentially to become new second player). So once this output exists, if it says 10 points lead, then it should actually mean 10 points lead (i.e. a player is 10 points above what would make a fair game), up to the bot's best ability to judge.

There will be one slight detail: if the search tree strongly expects you to try a specific move imminently that loses points to gain winning chances, it would report the lead given its expectation that you will play that move. That's not an easy detail to fix, but it should only happen in specific planned tactics; the reported value would not reflect any general anticipation of giving up points over the game as a whole.

Assuming this works, would this mostly satisfy everyone?

Edit: if I'm doing this, I'm probably going to have it simply replace the old prediction output, the old one won't exist any more. Which I think is desirable, since different runs of KataGo seem to randomly be more aggressive/conservative by small amounts, and those small amounts add up over the course of the game to actually give somewhat different values for scoreMean as it currently is. Having a "points difference from fair" should be more consistently anchored and stable between versions.

Bill Spight · Post by **Bill Spight** » Thu Dec 12, 2019 1:33 pm

lightvector wrote:Okay, at the cost of several percent of selfplay efficiency I think I can do this. I'll try it out (adding the new target to predict score difference between current and even).

If it works, then I think there's no more issue, right? For example, it should then consistently estimate the value of passing the first move of the game as losing somewhere from 13 to 15 points since that would be the komi adjustment needed to make the game fair after that (essentially to become new second player). So once this output exists, if it says 10 points lead, then it should actually mean 10 points lead (i.e. a player is 10 points above what would make a fair game), up to the bot's best ability to judge.

It definitely sounds like it's worth a try.

And thanks to jlt for his posts on this topic.

I expect that this will give better temperature estimates by xela's method, too.

It will be interesting to see how it reacts to sente, as well.

The main problem with static points evaluation has been identifying sente. If everything is gote it's relatively easy.

If KataGo can be used to identify sente reasonably well, then we can use it to estimate static points evaluation, which may well be easier for humans to learn than final score estimation. That and xela's temperature estimates could lead to rules of thumb that humans can use during actual play.

Gomoto · Post by **Gomoto** » Thu Dec 12, 2019 1:49 pm

I do not know yet which I will prefer. I like the current score, and it is not obvious to me what advantage the new one will have.

In any case I will continue to call it points.

Gomoto · Post by **Gomoto** » Thu Dec 12, 2019 2:05 pm

The universe is built only on statistically working physics. Why this fear of statistical values? There may be nothing more available to you in the end. Why all this striving for "real" points, which will not be achieved in any case until the end of the game?

Don't you "logic" thinkers and bean counters realize what you are missing out on? Get rid of your small, self-enforced burka vision slot. You are like a bunch of reactionary physicists still denying quantum mechanics, although you see the fruits of its applications all around you.

Feels good to rant a little bit.

Bill Spight · Post by **Bill Spight** » Thu Dec 12, 2019 2:13 pm

Gomoto wrote:I do not know yet which I will prefer. I like the current score, and it is not obvious to me what advantage the new one will have.

In any case I will continue to call it points.

I thought that lightvector planned to estimate the median final score instead of the mean. The median ought to be less affected by heroic efforts that lose points.
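To make the mean-versus-median point concrete, here is a minimal sketch with made-up numbers. The ten final scores are purely hypothetical and do not come from KataGo; they only show how a few desperate, point-losing attempts drag the mean down while the median stays at the typical result.

```python
import statistics

# Hypothetical final scores from ten games played out from one position:
# eight quiet losses by 4 points, two desperate attempts that lose by 30.
final_scores = [-4] * 8 + [-30] * 2

mean_score = statistics.mean(final_scores)      # pulled down by the two big losses
median_score = statistics.median(final_scores)  # stays at the typical result

print(mean_score, median_score)
```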

xela · Post by **xela** » Thu Dec 12, 2019 3:21 pm

spook wrote:Here is a full chart of winrate, playouts and score estimation.

Neat! The three lines on the score estimates graph: do the upper and lower represent error bounds? This isn't the sort of Lizzie screen shot that's usual around here. What software did you use to make these graphs?

xela · Post by **xela** » Thu Dec 12, 2019 3:25 pm

lightvector wrote:Okay, at the cost of several percent of selfplay efficiency I think I can do this. I'll try it out (adding the new target to predict score difference between current and even). :rambo:

If it works, then I think there's no more issue, right?

The number of people who have serious issues with the status quo seems to be approximately 1, so I'm not convinced there's a real problem that needs solving. (Still willing to be corrected on that point...) Still, it would be interesting to carry out this experiment if you have nothing better to do. We can run both versions on the same positions and see how well the two types of score estimates do or don't correlate.

"At the cost of several percent of selfplay efficiency" -- you mean it will take slightly longer to train this model, but you don't expect a significant impact on playing strength either way?

emerus · Post by **emerus** » Thu Dec 12, 2019 4:24 pm

uberdude wrote: When you (emerus) talk of points do you mean:
- minimal guaranteed territory (i.e. even if opponent gets the gote endgames in the area you still get these points). I think Myungwan Kim 9p tended to count like this in his videos and called it "confirmed territory".
- expected local territory (i.e. if an endgame move is your sente but opponent's gote you assume you get the sente, if gote for both then split the difference, if ambiguous, or boundaries are not pure endgame but have life and death and aji implications with other areas then very hard)
- expected territory plus a point quantification of the value of influence (e.g. projecting 2 points of territory in front of a wall), which is essentially what I was trying to do in counting the early game position at viewtopic.php?p=243147#p243147, but with simplifying assumptions of similar stones cancelling out so the absolute value is off, just the difference.
- something else?

For example, how many points is a lone 4-4 stone? Or a 3-4 stone? Or a 3-4 5-3 shimari? In terms of guaranteed territory a 4-4 has 0 points. Whilst a 3-3 has maybe 4 points. But in terms of "quantification of value on the same scale as points" as in the third definition a 4-4 is obviously similar to that 3-3 if not a little better.

Any and all of those.

The point is that 'points' is a regularly used Go term. It is already muddy enough. Why would you want to even use such a muddy term for a new evaluation value anyway?

Coming up with a better, clearer way to refer to scoreMean (or whatever it may evolve into) cannot be a bad thing. It comes at no cost either.

Bill Spight · Post by **Bill Spight** » Thu Dec 12, 2019 4:25 pm

xela wrote:
lightvector wrote:Okay, at the cost of several percent of selfplay efficiency I think I can do this. I'll try it out (adding the new target to predict score difference between current and even).

If it works, then I think there's no more issue, right?
The number of people who have serious issues with the status quo seems to be approximately 1, so I'm not convinced there's a real problem that needs solving. (Still willing to be corrected on that point...) Still, it would be interesting to carry out this experiment if you have nothing better to do. We can run both versions on the same positions and see how well the two types of score estimates do or don't correlate.

jlt showed how to estimate the median score under the current setup for any position of interest.

That should yield a better temperature estimate, as well. OC, to do it for every position would be quite a chore.

Bill Spight · Post by **Bill Spight** » Thu Dec 12, 2019 4:30 pm

emerus wrote:The point is that 'points' is a regularly used Go term. It is already muddy enough. Why would you want to even use such a muddy term for a new evaluation value anyway?

Coming up with a better, clearer way to refer to scoreMean (or whatever it may evolve into) cannot be a bad thing. It comes at no cost either.

As you indicate, it's not like points is a clear term, anyway. Which is why Berlekamp came up with the term, count, for the current, static estimate. All of these territory estimates are expressed in terms of points, no problem there. IMHO, current points and final points, or the like, would be fine for differentiating the two estimates.

Bantari · Post by **Bantari** » Thu Dec 12, 2019 4:58 pm

I am not sure if this is what the OP is trying to say, but as I understand it, the question is: when KataGo (or any AI) says you are "ahead by 20", does it mean "points" as we humans understand "points"? Or not? If yes, no worries. If no, this might hurt people using this metric to evaluate errors.

I am not sure if anybody has done this, but it seems a simple experiment would help here: present an AI with a position and let it evaluate. Then let it play this position against itself and see if the final result is anywhere near the evaluation, point-wise. Repeat a number of times, and we will have something to talk about, I think.

If the final results are close or identical to the evaluation, we can call them "points". If not, we might want to call them something else.
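As a rough illustration, the experiment could be organized like the sketch below. The two callbacks `evaluate` and `play_out` are hypothetical wrappers that you would have to write around whatever engine you use; they are not part of any real KataGo API.

```python
import statistics

def compare_estimate_to_selfplay(evaluate, play_out, position, n_games=100):
    """Return (estimate, mean, median) of final scores for one position.

    evaluate(position) -> float : hypothetical wrapper returning the bot's
                                  score estimate for the position
    play_out(position) -> float : hypothetical wrapper that plays one
                                  self-play game from the position and
                                  returns the final score
    """
    estimate = evaluate(position)
    # Play the same position out repeatedly and collect the final scores.
    results = [play_out(position) for _ in range(n_games)]
    return estimate, statistics.mean(results), statistics.median(results)
```

If the estimate sits close to the mean or median over many games, that is an argument for calling it "points"; a large, systematic gap would be an argument against.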

PS>
In other words, in the common-sense human understanding: if both players play a perfect game for a while, then one makes a 5-point mistake, and then they continue to play a perfect game, we expect the result to be 5 points, adjusted for komi. This is also how we can, potentially, measure the size of a mistake retroactively.

Does the same apply to AI and its evaluation? I think this is what would interest me here.

lightvector · Post by **lightvector** » Thu Dec 12, 2019 5:55 pm

Suppose you play the bot against itself 100 times and you find that on average it loses by 20 points in some position (winning a few games barely, losing most games by a lot). Suppose that 20 points was precisely what the bot had given as its "final score difference estimate" in that position. Great, right?

Suppose you dig further into the example and determine that actually, if the bot had just played move X, it would lose only by about 4 points - the resulting endgame is stable, and although it's not clear how to play it exactly optimally, it's highly clear that it's not going to vary by more than +/- 1 point under any reasonable lines of play. If you had 4 more points, then you'd have 50-50 winning chances playing move X. And the bot also agrees. The *reason* why the bot did not play move X and instead chose Y was that X led to an easy and predictable loss, whereas move Y is a complex and uncertain move that gives some slim winning chances instead of zero, but on average seems to lead to a much bigger loss.

So we have the state of affairs:

A: In the sense of self-play games -> you are on average down by 20 points (since the bot plans for move Y).
B: In the sense of points you'd need to have 50-50 chances -> you are "down" by only 4 points (since if only you had 4 more points, the position would become fair).

If you had to choose just one of A or B to be reported to you, which value would you prefer to have?

I think B is more useful.

Consider: suppose you discovered you had made a macroendgame blunder a couple moves earlier that led to you getting precisely 4 points less than you could have gotten in some particular area, with no other differences - no lingering aji or ko threat differences, same player gets sente, etc. So there is a very intuitive sense in which that blunder loses precisely 4 points. Now, if you had asked the bot prior to that mistake, it would have said:

A: In the sense of self-play games -> you are on average down by 0 points (because now it plans to choose move X, not move Y).
B: In the sense of points you need to have 50-50 chances -> you are on average down by 0 points (because the game is fair as-it-is).

If you're using A then you might get the impression that the blunder "loses 20 points" since before the mistake the estimated difference is 0, and afterward the estimated difference is -20. If you are using B, then the difference before and after is 4 points, as expected. And if you really did want A, you could just wait until after move Y is played, then B will join along with A in saying you are down by 20.

So generally it seems to me B should be more useful and differences or changes in B are more likely to be "objective" and consistent with other measures. For example B should increase/decrease precisely by 1 whenever komi increases/decreases by 1, whereas A in general would not. Passing in the opening should result in B decreasing by bot's-believed-fair-komi * 2 whereas A could decrease by a different value.
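The difference between A and B can be sketched with a toy model. Everything here is an assumption made for illustration: the outcome distributions for moves X and Y are invented to match the example above, the bot is modelled as simply picking the move with the highest win probability, and `estimate_b` only handles the case where the player is behind. This is not how KataGo computes either value.

```python
# Toy outcome distributions (hypothetical numbers chosen to match the example).
# Each entry is (probability, final score from the player's point of view).
MOVES = {
    "X": [(0.5, -4.5), (0.5, -3.5)],  # stable endgame, loses by about 4
    "Y": [(0.1, 2.5), (0.9, -22.5)],  # gamble: rare narrow win, usual big loss
}

def win_prob(outcomes, bonus=0):
    """Probability of winning if the player were given `bonus` extra points."""
    return sum(p for p, score in outcomes if score + bonus > 0)

def mean_score(outcomes):
    """Expected final score of a move."""
    return sum(p * score for p, score in outcomes)

def chosen_move(bonus=0):
    """The toy bot plays whichever move maximizes its win probability."""
    return max(MOVES, key=lambda m: win_prob(MOVES[m], bonus))

def estimate_a():
    """A: average final score, given the move the bot actually plans to play."""
    return mean_score(MOVES[chosen_move()])

def estimate_b(max_bonus=60):
    """B: minus the smallest whole-point bonus that gives 50-50 chances."""
    for bonus in range(max_bonus + 1):
        if max(win_prob(o, bonus) for o in MOVES.values()) >= 0.5:
            return -bonus
    raise ValueError("no bonus up to max_bonus makes the position fair")

print(chosen_move(), estimate_a(), estimate_b())
```

In this toy model the bot gambles on Y as things stand, so A reflects the large average loss, while B only asks how many points would make the position fair, which is decided by the stable line X.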

Now, just like A, it is not possible to always estimate B perfectly - the neural net will still be imperfect. And, as I mentioned in my previous post, even if the neural net is trained to estimate B, the fact that MCTS is layered on top will introduce some "A-like" behavior into the result regarding short-term plans. So in the above case, if the MCTS actually sees down to wanting to play Y, rather than the neural net merely anticipating wanting to make a "Y-like" move further in the future, then even a B-trained KataGo will say -20 points.

But it seems to me that moving towards estimating B should be more useful in general. Even if MCTS will be "A-like" in the short term, it should be helpful to get rid of the "A-like" behavior in the estimate that the neural net anticipates long into the future beyond what MCTS can see. The long-term part is usually the part that actually has the impact, particularly in the opening. For example, KataGo says passing in the opening is -20 points instead of -14 points (14 ~= 2x KataGo's believed fair komi) not because the MCTS actually sees a short-term plan to lose 6 points in the future to improve its winning chances, but because the neural net anticipates giving up on average 6 points to improve winning chances way off in the future, presumably during midgame fighting.

So, my thought is to try to make KataGo estimate B instead. And, I could also continue estimating A too, but it would be extra overhead in the search to carry both around, so my inclination is to just not have A once we have B. Unless people think it should keep reporting both? Thoughts?

xela wrote: "At the cost of several percent of selfplay efficiency" -- you mean it will take slightly longer to train this model, but you don't expect a significant impact on playing strength either way?

Well, taking longer to train *is* an impact on playing strength - it means that for any fixed amount of training, it would end up weaker by the amount corresponding to having trained for a few percent less time. But yeah, so long as my setup works well at all, I don't think it will cost too much more than that.
