On the accuracy of winrates

For discussing go computing, software announcements, etc.
moha
Lives in gote
Posts: 311
Joined: Wed May 31, 2017 6:49 am
Rank: 2d
GD Posts: 0
Been thanked: 45 times

Re: On the accuracy of winrates

Post by moha »

On second thought, there is at least one particular problem with bot winrates if they are used for anything outside the bot's own search. Because of the way pure MCTS works (averaging), winrates can only change slowly (and the later in the search, the slower any potential change is).

In peaceful positions this should be OK, just a refinement of the raw NN evaluations. But suppose there are two candidates A and B, with estimates of 70% and 65% after a few thousand visits. Then suddenly a tesuji/refutation is found below A (so B is actually the only move). With further search, A's winrate starts to decrease (as losing lines start to average into it), but it only moves towards 30% (its correct value) slowly. (Depending on the remaining analysis limit, the bot may even still play A knowing its refutation, and even after A has fallen below B in value, if the remaining visits are not enough for B to also overcome its visit disadvantage.)

So the returned value MAY be in the middle of a slow but significant change, lagging behind the search's current knowledge (and thus fairly arbitrary), with no obvious indication that this is happening. Maybe UIs could display small red/green down/up arrows (like stock tickers) beside moves under such reconsideration. It should also be possible to compare the estimate distribution to the visit distribution, and guess the stability of the current eval from that.
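The lag is easy to see with a back-of-envelope calculation. The numbers below mirror the hypothetical in the post (70% after 2000 visits, true value 30%); this is a toy model of pure averaging, not LZ code:

```python
def lagging_average(prior_visits, prior_value, true_value, extra_visits):
    """Running mean of a node's winrate after `extra_visits` further
    rollouts that all return `true_value`."""
    total = prior_visits * prior_value + extra_visits * true_value
    return total / (prior_visits + extra_visits)

# Move A sits at 70% after 2000 visits, but its true value is 30%.
for n in (0, 500, 2000, 8000):
    print(n, round(lagging_average(2000, 0.70, 0.30, n), 3))
```

Even after doubling the visits the mean has only fallen to 50%, and after 8000 more it is still at 38%, well above the correct 30%.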
User avatar
Waylon
Dies in gote
Posts: 24
Joined: Sat May 14, 2016 1:30 am
GD Posts: 0
Location: Vienna, Austria
Has thanked: 686 times
Been thanked: 11 times

Re: On the accuracy of winrates

Post by Waylon »

moha wrote:On second thought, there is at least one particular problem with bot winrates if they are used for anything outside the bot's own search. Because of the way pure MCTS works (averaging), winrates can only change slowly (and the later in the search, the slower any potential change is).

In peaceful positions this should be OK, just a refinement of the raw NN evaluations. But suppose there are two candidates A and B, with estimates of 70% and 65% after a few thousand visits. Then suddenly a tesuji/refutation is found below A (so B is actually the only move). With further search, A's winrate starts to decrease (as losing lines start to average into it), but it only moves towards 30% (its correct value) slowly. (Depending on the remaining analysis limit, the bot may even still play A knowing its refutation, and even after A has fallen below B in value, if the remaining visits are not enough for B to also overcome its visit disadvantage.)

So the returned value MAY be in the middle of a slow but significant change, lagging behind the search's current knowledge (and thus fairly arbitrary), with no obvious indication that this is happening. Maybe UIs could display small red/green down/up arrows (like stock tickers) beside moves under such reconsideration. It should also be possible to compare the estimate distribution to the visit distribution, and guess the stability of the current eval from that.
Building the search tree and selecting the move to play are two different tasks that could be treated differently.

MCTS seems to work fine for guiding the search. But to avoid the problem of insensitivity to sudden changes at the leaf nodes, one could use a version of the alpha-beta algorithm to select the move to play in the root position.

Of course one must be careful not to go to the other extreme: propagating the values from the leaf nodes back to the root with alpha-beta could make the move selection too sensitive to a single wrong evaluation. A possible solution would be to consider only "reliable" nodes, i.e. nodes with at least a certain number of evaluated child nodes below them.
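A minimal sketch of that idea, with a made-up tree and a made-up `MIN_VISITS` threshold (not any engine's actual code): the tree is still built by MCTS, but the root move is chosen by a minimax backup that only trusts sufficiently visited nodes.

```python
MIN_VISITS = 100  # made-up reliability threshold

class Node:
    def __init__(self, value, visits, children=()):
        self.value = value      # mean winrate from the root player's view
        self.visits = visits
        self.children = list(children)

def backed_up(node, maximizing):
    """Minimax over 'reliable' children (>= MIN_VISITS visits); fall back
    to the plain MCTS mean where no child is reliable."""
    reliable = [c for c in node.children if c.visits >= MIN_VISITS]
    if not reliable:
        return node.value
    pick = max if maximizing else min
    return pick(backed_up(c, not maximizing) for c in reliable)

# Move A averages 0.70, but its one reliable reply refutes it (0.30);
# move B averages 0.65 and holds up under both of its reliable replies.
a = Node(0.70, 3000, [Node(0.30, 500), Node(0.80, 20)])
b = Node(0.65, 2500, [Node(0.66, 400), Node(0.64, 600)])

# Plain MCTS means would prefer A; the minimax backup prefers B.
best = max((a, b), key=lambda m: backed_up(m, maximizing=False))
```

The visit threshold is exactly the guard against the "single wrong evaluation" problem: the 0.80 outlier under A is ignored because it has only 20 visits.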
chut
Dies in gote
Posts: 23
Joined: Sun May 20, 2018 5:47 am
GD Posts: 0
Has thanked: 7 times
Been thanked: 3 times

Re: On the accuracy of winrates

Post by chut »

LZ tried to escape a ladder and failed.
Games between AQ-GO (white) and LZ (black).
Both are the Android versions, running on the same phone. There are threads in this forum on how to set up both.
LZ is set to 10 sec/move.
Screenshot_2018-08-22-01-10-19-838_cn.ezandroid.aqgo.png
10 sec/move on Android is not a lot of computing power for tree search, but that is the sort of timing a human will tolerate when playing against a computer.
AQ has menu options that say "show ladder capture" and "show ladder escape", which means AQ has built-in logic for ladders.

I am convinced that LZ needs similar controls. I don't play a lot, as I am more interested in the algorithm, but I have already encountered enough failed ladders in LZ.
Uberdude
Judan
Posts: 6727
Joined: Thu Nov 24, 2011 11:35 am
Rank: UK 4 dan
GD Posts: 0
KGS: Uberdude 4d
OGS: Uberdude 7d
Location: Cambridge, UK
Has thanked: 436 times
Been thanked: 3718 times

Re: On the accuracy of winrates

Post by Uberdude »

Bill et al,
Here is a github thread in which people are making Elf or LZ play against itself from the same position (ones posted by the Russian Go Federation twitter with extreme Elf viewpoints after a human joseki) to test the accuracy of winrates. Quick summary: Elf v1 gave a position 4%, but in a 50-game match at 1.6k visits it won 22%. In the one we discussed here, for which Elf v1 gave black 1%, the latest LZ 20b gave 25%, and in a 170-game match black won 22%, much closer. Of course it's possible Elf's win% would be closer to the match result if the match were played with Elf's engine rather than LZ's, and at a gazillion playouts per move.
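As a sanity check on the sample sizes involved: a 50-game match has a wide margin of error all by itself. A quick Wilson score interval (standard binomial statistics, nothing engine-specific) for 11 wins in 50 games (22%):

```python
import math

def wilson_interval(wins, games, z=1.96):
    """95% Wilson score interval for a match winrate."""
    p = wins / games
    denom = 1 + z * z / games
    centre = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return centre - half, centre + half

lo, hi = wilson_interval(11, 50)  # 11/50 = 22%
print(lo, hi)  # roughly 13% to 35%
```

So a 50-game match comfortably rules out Elf's 4% estimate, but cannot distinguish 22% from, say, 26%.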
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: On the accuracy of winrates

Post by Bill Spight »

Uberdude wrote:Bill et al,
Here is a github thread in which people are making Elf or LZ play against itself from the same position (ones posted by the Russian Go Federation twitter with extreme Elf viewpoints after a human joseki) to test the accuracy of winrates. Quick summary: Elf v1 gave a position 4%, but in a 50-game match at 1.6k visits it won 22%. In the one we discussed here, for which Elf v1 gave black 1%, the latest LZ 20b gave 25%, and in a 170-game match black won 22%, much closer. Of course it's possible Elf's win% would be closer to the match result if the match were played with Elf's engine rather than LZ's, and at a gazillion playouts per move.
Many thanks. :D

OC, playouts per move matter. I would want at least 10k visits, myself. But I doubt that would overcome an 18% difference. ;)

Another thing at work is the statistical phenomenon of regression to the mean. That is, we should expect positions chosen because they are extreme to produce less extreme results. Still, the degree of regression is quite shocking. :shock:

Edit: And testing Elf's projections, which are based upon its own self-play, should not have used Leela's self-play. As dfan pointed out somewhere recently, since Leela is weaker than Elf, Leela's self-play results should be closer to 50%.
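Regression to the mean is easy to demonstrate with a toy simulation (all numbers below are made up for illustration): give each position a true winrate, add estimation noise, and then select positions *because* their estimates are extreme.

```python
import random

random.seed(0)  # fixed seed so the toy experiment is repeatable

# Each position gets a true winrate; the engine's estimate adds noise.
true_rates = [random.uniform(0.2, 0.8) for _ in range(10_000)]
estimates = [t + random.gauss(0, 0.1) for t in true_rates]

# Select the 100 positions with the most extreme (lowest) estimates.
# The selected noise mostly points downward, so the true values sit
# closer to 50% than the estimates do: regression to the mean.
picked = sorted(range(10_000), key=lambda i: estimates[i])[:100]
avg_estimate = sum(estimates[i] for i in picked) / 100
avg_true = sum(true_rates[i] for i in picked) / 100
print(round(avg_estimate, 3), round(avg_true, 3))
```

The average true winrate of the selected positions comes out noticeably higher than their average estimate, even though the noise itself is unbiased.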
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
Uberdude
Judan
Posts: 6727
Joined: Thu Nov 24, 2011 11:35 am
Rank: UK 4 dan
GD Posts: 0
KGS: Uberdude 4d
OGS: Uberdude 7d
Location: Cambridge, UK
Has thanked: 436 times
Been thanked: 3718 times

Re: On the accuracy of winrates

Post by Uberdude »

Bill Spight wrote: Edit: And testing Elf's projections, which are based upon its own self-play, should not have used Leela's self-play.
They did use Elf's network, converted for use in the LZ engine. The source code for the Elf engine is available, but apparently it's really hard to compile; I don't know if anyone has managed it yet. How much difference the LZ vs. Elf engine makes is an open question to me (but it should matter less than the weights!).
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: On the accuracy of winrates

Post by Bill Spight »

Uberdude wrote:
Bill Spight wrote: Edit: And testing Elf's projections, which are based upon its own self-play, should not have used Leela's self-play.
They did use Elf's network, converted for use in the LZ engine. The source code for the Elf engine is available, but apparently it's really hard to compile; I don't know if anyone has managed it yet. How much difference the LZ vs. Elf engine makes is an open question to me (but it should matter less than the weights!).
OIC. Thanks. :)
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: On the accuracy of winrates

Post by Bill Spight »

As far as the number of visits goes, my preliminary results suggest that with a setting of 100k, Leela Zero's margin of error is at least 3%. With only 1600 visits, God only knows!
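For comparison, here is the naive calculation one might try, pretending the per-visit value samples are i.i.d. (a rough sketch with a made-up standard deviation; real MCTS visits are strongly correlated, so this badly underestimates the true margin):

```python
import math

def naive_margin(n_visits, sd=0.3, z=1.96):
    """Back-of-envelope 95% margin for an averaged winrate, pretending
    the per-visit value samples are i.i.d. with standard deviation `sd`
    (a made-up figure)."""
    return z * sd / math.sqrt(n_visits)

print(round(naive_margin(1_600), 4))    # 0.0147
print(round(naive_margin(100_000), 4))  # 0.0019
```

The naive bound at 100k visits is about 0.2%, an order of magnitude below the 3%+ observed above, which suggests the real error is dominated by the search's correlated behaviour rather than by sampling noise.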
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
moha
Lives in gote
Posts: 311
Joined: Wed May 31, 2017 6:49 am
Rank: 2d
GD Posts: 0
Been thanked: 45 times

Re: On the accuracy of winrates

Post by moha »

Elf's value head is known to be much sharper, more sensitive to slight advantages or disadvantages. Would these results still seem incorrect if we interpreted the Elf estimate as "the probability that a perfect player would win against a perfect player" from here? That value cannot be determined, and can only be 0 or 1.

Actual LZ or even Elf play has some randomness in its moves, so such practical win percentages are expected to be closer to 50%. On this interpretation, if the engine estimate is close to the practical win percentage, that may even mean it is less accurate. (There is no point in it including this huge random factor. Btw, it would also be interesting to run the playout test with such randomness disabled.)

The only problem with this viewpoint is that, AFAIK, neither Elf's nor any other bot's training targeted the perfect solution. :) And it seems hard to imagine how that direction could be reached indirectly, without direct training data (maybe at the price of a huge training slowdown, or maybe simply using less randomness in self-play could have a slightly similar effect).

In any case, directly comparing bot self-play winrates to bot estimates seems incorrect; these are two completely different things. At the very least, the former depends heavily on the bot's move-randomness configuration, so comparing against it has reduced meaning.
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: On the accuracy of winrates

Post by Bill Spight »

moha wrote:Elf's value head is known to be much sharper, more sensitive to slight advantages or disadvantages. Would these results still seem incorrect if we interpreted the Elf estimate as "the probability that a perfect player would win against a perfect player" from here?
That depends upon the meaning of probability.
That value cannot be determined, and can only be 0 or 1.
If that is your meaning of probability, then your proposed interpretation is impossible. In that case, better to interpret winrates as assuming errors. The question is whose errors. From what I hear, they are the bot's errors in self-play.
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
moha
Lives in gote
Posts: 311
Joined: Wed May 31, 2017 6:49 am
Rank: 2d
GD Posts: 0
Been thanked: 45 times

Re: On the accuracy of winrates

Post by moha »

I meant that the actual, observable (if observing it were possible) value of "winner with perfect play" is 0 or 1.

The point is that predicting/estimating this is completely different from predicting self-play results (which are almost always much closer to 50%).

By "probability" I meant the best guess from all available information.
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: On the accuracy of winrates

Post by Bill Spight »

IMO, the people who produce winrates should define the term.
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
Tryss
Lives in gote
Posts: 502
Joined: Tue May 24, 2011 1:07 pm
Rank: KGS 2k
GD Posts: 100
KGS: Tryss
Has thanked: 1 time
Been thanked: 153 times

Re: On the accuracy of winrates

Post by Tryss »

Bill Spight wrote:As far as the number of visits goes, my preliminary results suggest that with a setting of 100k, Leela Zero's margin of error is at least 3%. With only 1600 visits, God only knows!
Actually, it should be more accurate the closer you are to the training parameters.
Bill Spight wrote:IMO, the people who produce winrates should define the term.
It's "just" a metric of who's ahead. As this experiment show, it's not exactly the probability of winning. But this would be really hard to calculate, and wouldn't be much more usefull than what we actually have. "The probability this exact network win against itself at x visits in this exact position" is not much more interesting than what have now. LZ winrate seems close enough (the difference between 22% or 26% is kinda irrelevant)
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: On the accuracy of winrates

Post by Bill Spight »

Tryss wrote:
Bill Spight wrote:IMO, the people who produce winrates should define the term.
It's "just" a metric of who's ahead.
That's what I thought when I started this thread, but apparently for the Zero bots it actually is an estimate of the winning percentage from the current position. But this seems to be a matter of dispute, at least here.
As this experiment shows, it's not exactly the probability of winning, but that would be really hard to calculate, and wouldn't be much more useful than what we actually have: "the probability this exact network wins against itself at x visits in this exact position" is not much more interesting than what we have now.
Except that humans want to use winrate differences to say whether certain plays are likely errors, and how bad the errors are.
The LZ winrate seems close enough (the difference between 22% and 26% is kind of irrelevant)
Vanitas vanitatum, omnia vanitas.
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
Tryss
Lives in gote
Posts: 502
Joined: Tue May 24, 2011 1:07 pm
Rank: KGS 2k
GD Posts: 100
KGS: Tryss
Has thanked: 1 time
Been thanked: 153 times

Re: On the accuracy of winrates

Post by Tryss »

Bill Spight wrote:That's what I thought when I started this thread, but apparently for the Zero bots it actually is an estimate of the winning percentage from the current position.
This metric is derived from game results, but it's interpolated data: you feed the self-play positions and their results to the network, and it tries to fit itself to those data.

One thing that may have an impact: the network is trained on games played by older networks. But it's hard to say how much impact that has.
Bill Spight wrote:Except that humans want to use winrate differences to say whether certain plays are likely errors, and how bad the errors are.
We can already do that, and that's how LZ uses these winrates too. They don't need to be truly accurate for this, just monotonic and consistent enough (and "nice" enough).

For example, if the LZ winrate as a function of the true winrate looks something like this:

[image: plot of LZ winrate as a function of the true winrate]

Then it's perfectly usable by human players.
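The point that only monotonicity matters can be shown with a small sketch: any strictly increasing distortion of the true winrate leaves the move ranking unchanged. The calibration curve below is made up (a logistic curve that mildly compresses the tails toward 50%), purely for illustration.

```python
import math

def calibration(p):
    """A made-up strictly increasing map from true winrate to reported
    winrate (logistic, mildly compressing the tails toward 50%)."""
    return 1 / (1 + math.exp(-4 * (p - 0.5)))

true_rates = {"A": 0.30, "B": 0.55, "C": 0.62}
reported = {m: calibration(p) for m, p in true_rates.items()}

rank_true = sorted(true_rates, key=true_rates.get, reverse=True)
rank_reported = sorted(reported, key=reported.get, reverse=True)
print(rank_true == rank_reported)  # True: the ranking survives the distortion
```

The reported numbers may be biased, but the move ordering, which is mostly what a human reviewer wants, survives any strictly increasing distortion.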