On the accuracy of winrates
-
moha
- Lives in gote
- Posts: 311
- Joined: Wed May 31, 2017 6:49 am
- Rank: 2d
- GD Posts: 0
- Been thanked: 45 times
Re: On the accuracy of winrates
On second thought there is at least one particular problem with bot winrates, if used for anything else outside the bot's search. Because of the way pure MCTS works (averaging), winrates can only change slowly (the later in the search the slower potential changes are).
In peaceful positions this should be ok, just refining raw NN evaluations. But suppose there are two candidates A and B, with estimates 70% and 65% after a few thousand visits. Then suddenly a tesuji/refutation move is found below A (actually B is the only move). Now with further search A's winrate starts to decrease (as losing lines start to average into it) but will only move towards 30% (its correct value) slowly. (Depending on the remaining analysis limit the bot may even still play A knowing its refutation, and even after it fell below B in value, if further visits are not enough to also overcome its visit disadvantage.)
So the returned value MAY be in the middle of a slow but significant change, lagging behind current knowledge (thus quite random), but there is no obvious indication of this. Maybe UIs could display small red/green down/up arrows (like stock prices) beside moves under such reconsideration. It could also be possible to compare estimate distribution to visit distribution, and guess the stability of current eval from this.
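The lag can be made concrete with a little arithmetic. The sketch below assumes a plain running average over visits and, as a simplification, that every visit after the refutation is found evaluates A at its "correct" 30%. The function name and the 3000-visit starting point are illustrative only and are not taken from any engine.

```python
def averaged_winrate(prior_visits, prior_mean, new_value, extra_visits):
    """Running average after `extra_visits` further visits that all return `new_value`."""
    total = prior_mean * prior_visits + new_value * extra_visits
    return total / (prior_visits + extra_visits)


# A stood at 70% after 3000 visits; then the refutation is found and every
# later visit (by assumption) scores A at 30%.
for extra in (500, 1000, 3000, 27000):
    rate = averaged_winrate(3000, 0.70, 0.30, extra)
    print(f"{extra:>6} extra visits -> A shown at {rate:.1%}")
```

Under these assumptions A is still displayed at 50% after as many new visits as old ones, and it takes nine times the original visit count to come within four points of 30%.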
- Waylon
- Dies in gote
- Posts: 24
- Joined: Sat May 14, 2016 1:30 am
- GD Posts: 0
- Location: Vienna, Austria
- Has thanked: 686 times
- Been thanked: 11 times
Re: On the accuracy of winrates
moha wrote:
On second thought there is at least one particular problem with bot winrates, if used for anything else outside the bot's search. Because of the way pure MCTS works (averaging), winrates can only change slowly (the later in the search the slower potential changes are).
In peaceful positions this should be ok, just refining raw NN evaluations. But suppose there are two candidates A and B, with estimates 70% and 65% after a few thousand visits. Then suddenly a tesuji/refutation move is found below A (actually B is the only move). Now with further search A's winrate starts to decrease (as losing lines start to average into it) but will only move towards 30% (its correct value) slowly. (Depending on the remaining analysis limit the bot may even still play A knowing its refutation, and even after it fell below B in value, if further visits are not enough to also overcome its visit disadvantage.)
So the returned value MAY be in the middle of a slow but significant change, lagging behind current knowledge (thus quite random), but there is no obvious indication of this. Maybe UIs could display small red/green down/up arrows (like stock prices) beside moves under such reconsideration. It could also be possible to compare estimate distribution to visit distribution, and guess the stability of current eval from this.
Building the search tree and selecting the move to play are two different tasks that could be treated differently.
MCTS seems to work fine for guiding the search. But to avoid the problem of insensitivity to sudden changes at the leaf nodes, one could use a version of the alpha-beta algorithm to select the move to play in the root position.
Of course one must be careful not to go to the other extreme: propagating the values from the leaf nodes back to the root with alpha-beta could make the move selection too sensitive to a single wrong evaluation. A possible solution could consider only "reliable" leaf nodes, i.e. nodes with at least a certain number of evaluated child nodes below them.
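A rough sketch of that idea, under stated assumptions: the tree below is hand-built, the `Node` class and both functions are hypothetical, "reliable" is approximated by a visit-count threshold, and the backup is plain minimax without pruning (pruning would change the cost, not the value returned). All values are from the root player's point of view.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    mean_value: float          # MCTS-averaged winrate, from the root player's view
    visits: int
    children: List["Node"] = field(default_factory=list)


def backed_up_value(node, min_visits, root_player_to_move):
    """Minimax value over children with at least `min_visits` visits.

    Falls back to the node's own averaged value when no child is reliable.
    """
    reliable = [c for c in node.children if c.visits >= min_visits]
    if not reliable:
        return node.mean_value
    values = [backed_up_value(c, min_visits, not root_player_to_move) for c in reliable]
    return max(values) if root_player_to_move else min(values)


def select_move(root, min_visits):
    """Pick the root child with the best backed-up value (opponent moves next)."""
    return max(root.children, key=lambda c: backed_up_value(c, min_visits, False))


# Hand-built example: A averages 70%, but one well-visited reply refutes it.
refutation = Node("tesuji", 0.30, 400)
other_reply = Node("ordinary reply", 0.76, 2600)
a = Node("A", 0.70, 3000, [refutation, other_reply])
b = Node("B", 0.65, 1000)
root = Node("root", 0.70, 4000, [a, b])

print(select_move(root, min_visits=100).name)                 # minimax choice
print(max(root.children, key=lambda c: c.mean_value).name)    # averaged choice
```

The threshold cuts both ways: raising `min_visits` above 400 in this example makes the refutation "unreliable" again and the selection falls back to A.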
-
chut
- Dies in gote
- Posts: 23
- Joined: Sun May 20, 2018 5:47 am
- GD Posts: 0
- Has thanked: 7 times
- Been thanked: 3 times
Re: On the accuracy of winrates
LZ tried to escape a ladder and failed.
Games between AQ-GO (w) and LZ (b)
Both are the Android versions running on the same phone. There are threads on how to set up both in this forum.
LZ is set to 10 sec/move. 10 sec/move on Android is not a lot of computing power for tree search, but that is the sort of timing that a human will tolerate when playing against a computer.
AQ has menu options that say "show ladder capture" and "show ladder escape", which means AQ has built-in logic for ladders.
I am convinced that LZ needs a similar control. I don't play a lot as I am more interested in the algorithm. But I have already encountered enough failed ladders in LZ.
-
Uberdude
- Judan
- Posts: 6727
- Joined: Thu Nov 24, 2011 11:35 am
- Rank: UK 4 dan
- GD Posts: 0
- KGS: Uberdude 4d
- OGS: Uberdude 7d
- Location: Cambridge, UK
- Has thanked: 436 times
- Been thanked: 3718 times
Re: On the accuracy of winrates
Bill et al,
Here is a GitHub thread in which people are making Elf or LZ play against itself from the same position (ones posted by the Russian Go Fed Twitter with extreme Elf viewpoints after a human joseki) to test the accuracy of winrates. Quick summary: Elf v1 gave a position 4%, but in a 50-game match at 1.6k visits it won 22%. In the one we discussed here, which Elf v1 gave 1% for Black, the latest LZ 20b gave 25% and in a 170-game match won 22%, much closer. Of course it's possible Elf's win% would be closer to a match result if the match was with Elf's engine, not LZ's, and at a gazillion playouts per move.
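To get a feel for how much of such a gap could be sampling noise, here is a back-of-the-envelope sketch. It assumes the games are independent and uses the normal-approximation (Wald) interval around the reported 22% win fraction; `wald_interval` is an illustrative helper, not something from the GitHub thread.

```python
import math


def wald_interval(win_fraction, games, z=1.96):
    """Approximate 95% interval for a win fraction observed over `games` games."""
    se = math.sqrt(win_fraction * (1.0 - win_fraction) / games)
    return win_fraction - z * se, win_fraction + z * se


# Pair each match length with the engine estimate it is being compared to.
for label, games, estimate in (("50-game match", 50, 0.04), ("170-game match", 170, 0.25)):
    low, high = wald_interval(0.22, games)
    inside = low <= estimate <= high
    print(f"{label}: {low:.3f} to {high:.3f}, estimate {estimate:.2f} inside: {inside}")
```

On these assumptions the 4% estimate falls well outside the interval for the 50-game match, while the 25% estimate sits comfortably inside the interval for the 170-game match.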
-
Bill Spight
- Honinbo
- Posts: 10905
- Joined: Wed Apr 21, 2010 1:24 pm
- Has thanked: 3651 times
- Been thanked: 3373 times
Re: On the accuracy of winrates
Uberdude wrote:
Bill et al,
Here is a GitHub thread in which people are making Elf or LZ play against itself from the same position (ones posted by the Russian Go Fed Twitter with extreme Elf viewpoints after a human joseki) to test the accuracy of winrates. Quick summary: Elf v1 gave a position 4%, but in a 50-game match at 1.6k visits it won 22%. In the one we discussed here, which Elf v1 gave 1% for Black, the latest LZ 20b gave 25% and in a 170-game match won 22%, much closer. Of course it's possible Elf's win% would be closer to a match result if the match was with Elf's engine, not LZ's, and at a gazillion playouts per move.
Many thanks.
OC, playouts per move matter. I would want at least 10k visits, myself. But I doubt if that would overcome an 18% difference.
Another thing at work is the statistical phenomenon of regression to the mean. That is, we should expect positions chosen because they are extreme to produce less extreme results. Still, the degree of regression is quite shocking.
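A toy simulation shows the selection effect in isolation. Everything in it is an assumption made for illustration: true winrates are spread uniformly, each estimate is the truth plus Gaussian noise with a standard deviation of 10 points, and "extreme" means an estimate below 5%. It says nothing about the size of the effect in the actual Elf positions.

```python
import random

rng = random.Random(0)


def clamp(x):
    """Keep a winrate inside [0, 1]."""
    return min(1.0, max(0.0, x))


# Hypothetical population: true winrates spread evenly, estimate = truth + noise.
positions = []
for _ in range(20000):
    true_rate = rng.random()
    estimate = clamp(true_rate + rng.gauss(0.0, 0.10))
    positions.append((true_rate, estimate))

# Keep only positions picked *because* the estimate looks extreme.
extreme = [(t, e) for t, e in positions if e < 0.05]

mean_estimate = sum(e for _, e in extreme) / len(extreme)
mean_true = sum(t for t, _ in extreme) / len(extreme)
print(f"selected {len(extreme)} positions")
print(f"mean estimate {mean_estimate:.3f}, mean true winrate {mean_true:.3f}")
```

The selected positions have a higher true winrate on average than their estimates suggest, purely because they were chosen for having low estimates.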
Edit: And testing Elf's projections, based upon its own self play, should not have used Leela's self play. As dfan pointed out somewhere recently, since Leela is weaker than Elf, Leela's self play results should be closer to 50%.
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins
Visualize whirled peas.
Everything with love. Stay safe.
-
Uberdude
- Judan
- Posts: 6727
- Joined: Thu Nov 24, 2011 11:35 am
- Rank: UK 4 dan
- GD Posts: 0
- KGS: Uberdude 4d
- OGS: Uberdude 7d
- Location: Cambridge, UK
- Has thanked: 436 times
- Been thanked: 3718 times
Re: On the accuracy of winrates
Bill Spight wrote: Edit: And testing Elf's projections, based upon its own self play, should not have used Leela's self play.
They did use Elf's network, converted for use in the LZ engine. The source code for the Elf engine is available, but apparently it's really hard to compile; I don't know if anyone has managed yet. How much difference the LZ vs Elf engine makes is an open question to me (but it should at least be less than the weights!).
-
Bill Spight
- Honinbo
- Posts: 10905
- Joined: Wed Apr 21, 2010 1:24 pm
- Has thanked: 3651 times
- Been thanked: 3373 times
Re: On the accuracy of winrates
Bill Spight wrote: Edit: And testing Elf's projections, based upon its own self play, should not have used Leela's self play.
Uberdude wrote: They did use Elf's network, converted for use in the LZ engine. The source code for the Elf engine is available, but apparently it's really hard to compile; I don't know if anyone has managed yet. How much difference the LZ vs Elf engine makes is an open question to me (but it should at least be less than the weights!).
OIC. Thanks.
-
Bill Spight
- Honinbo
- Posts: 10905
- Joined: Wed Apr 21, 2010 1:24 pm
- Has thanked: 3651 times
- Been thanked: 3373 times
Re: On the accuracy of winrates
As far as the number of visits goes, my preliminary results suggest that with a setting of 100k, Leela Zero's margin of error is at least 3%. With only 1600 visits, God only knows!
-
moha
- Lives in gote
- Posts: 311
- Joined: Wed May 31, 2017 6:49 am
- Rank: 2d
- GD Posts: 0
- Been thanked: 45 times
Re: On the accuracy of winrates
Elf's value head is known to be much sharper, more sensitive to slight advantages or disadvantages. Would these results still seem incorrect if we interpret Elf's estimate as "probability that a perfect player would win against a perfect player" from here? That value cannot be determined but can only be 0 or 1.
Actual LZ or even Elf play has some randomness in its moves, so such practical win percentages are expected to be closer to 50%. On the above interpretation, if the engine estimate is close to the practical win percentage it may even mean it is less accurate. (No point for it to include this huge random factor. Btw it would also be interesting to do the playout test with such randomness disabled.)
The only problem with this viewpoint is that AFAIK neither Elf nor any other bot's training targeted the perfect solution.
And it seems hard to imagine how that direction would be possible indirectly, without direct training data (maybe at the price of a huge training slowdown, or maybe simply using less randomness in selfplay could have a slightly similar effect).
In any case, directly comparing bot selfplay winrates to bot estimates seems incorrect - these are two completely different things. At the very least the former depends heavily on bot move randomness configuration, so comparing to it has reduced meaning.
-
Bill Spight
- Honinbo
- Posts: 10905
- Joined: Wed Apr 21, 2010 1:24 pm
- Has thanked: 3651 times
- Been thanked: 3373 times
Re: On the accuracy of winrates
moha wrote: Elf value head is known to be much sharper, more sensitive to slight advantages or disadvantages. Would these results still seem incorrect if we interpret Elf estimate as "probability that a perfect player would win against a perfect player" from here?
That depends upon the meaning of probability.
moha wrote: That value cannot be determined but can only be 0 or 1.
If that is your meaning of probability, then your proposed interpretation is impossible. In that case, better interpret winrates as assuming errors. The question is, whose errors. From what I hear they are the bot's errors in self play.
-
moha
- Lives in gote
- Posts: 311
- Joined: Wed May 31, 2017 6:49 am
- Rank: 2d
- GD Posts: 0
- Been thanked: 45 times
Re: On the accuracy of winrates
I meant that the actual, observable (if that were possible) value for "winner with perfect play" is 0 or 1.
The point is, predicting/estimating this is completely different to predicting selfplay results (which are almost always much closer to 50%).
With "probability" I meant the best guess from all available information.
-
Bill Spight
- Honinbo
- Posts: 10905
- Joined: Wed Apr 21, 2010 1:24 pm
- Has thanked: 3651 times
- Been thanked: 3373 times
Re: On the accuracy of winrates
IMO, the people who produce winrates should define the term.
-
Tryss
- Lives in gote
- Posts: 502
- Joined: Tue May 24, 2011 1:07 pm
- Rank: KGS 2k
- GD Posts: 100
- KGS: Tryss
- Has thanked: 1 time
- Been thanked: 153 times
Re: On the accuracy of winrates
Bill Spight wrote: As far as the number of visits goes, my preliminary results suggest that with a setting of 100k, Leela Zero's margin of error is at least 3%. With only 1600 visits, God only knows!
Actually, it should be more accurate the closer you are to the training parameters.
Bill Spight wrote: IMO, the people who produce winrates should define the term.
It's "just" a metric of who's ahead. As this experiment shows, it's not exactly the probability of winning. But this would be really hard to calculate, and wouldn't be much more useful than what we actually have. "The probability that this exact network wins against itself at x visits in this exact position" is not much more interesting than what we have now. LZ winrate seems close enough (the difference between 22% or 26% is kinda irrelevant).
-
Bill Spight
- Honinbo
- Posts: 10905
- Joined: Wed Apr 21, 2010 1:24 pm
- Has thanked: 3651 times
- Been thanked: 3373 times
Re: On the accuracy of winrates
Bill Spight wrote: IMO, the people who produce winrates should define the term.
Tryss wrote: It's "just" a metric of who's ahead.
That's what I thought when I started this thread, but apparently for the Zero bots it actually is an estimate of the winning percentage from the current position. But this seems to be a matter of dispute, at least here.
Tryss wrote: As this experiment shows, it's not exactly the probability of winning. But this would be really hard to calculate, and wouldn't be much more useful than what we actually have. "The probability that this exact network wins against itself at x visits in this exact position" is not much more interesting than what we have now.
Except that humans want to use winrate differences to say whether certain plays are likely errors, and how bad the errors are.
Tryss wrote: LZ winrate seems close enough (the difference between 22% or 26% is kinda irrelevant).
Vanitas vanitatum, omnia vanitas.
-
Tryss
- Lives in gote
- Posts: 502
- Joined: Tue May 24, 2011 1:07 pm
- Rank: KGS 2k
- GD Posts: 100
- KGS: Tryss
- Has thanked: 1 time
- Been thanked: 153 times
Re: On the accuracy of winrates
Bill Spight wrote: That's what I thought when I started this thread, but apparently for the Zero bots it actually is an estimate of the winning percentage from the current position.
This metric is derived from game results. But it's interpolated data. You feed the self-play positions and the results to the network, and it tries to fit itself to these data.
One thing that may have an impact: the network is trained on games played by older networks. But it's hard to say how much impact it has.
Bill Spight wrote: Except that humans want to use winrate differences to say whether certain plays are likely errors, and how bad the errors are.
We can already do that, and that's how LZ uses these winrates too. It doesn't need to be truly accurate for this, just monotonic and consistent enough (and "nice" enough).
For example, if LZ's winrate as a function of the true winrate looks something like this:
Then it's perfectly usable by human players.
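A minimal sketch of why monotonicity is enough for ranking moves. The S-shaped `sharpened` curve is a hypothetical stand-in chosen for illustration (it is not the curve from the post and not any bot's actual calibration), and the move names and winrates are made up.

```python
def sharpened(p, k=3.0):
    """Hypothetical S-shaped distortion of a true winrate p (monotonic for k > 0)."""
    return p ** k / (p ** k + (1.0 - p) ** k)


# Made-up true winrates for four candidate moves.
true_winrates = {"move A": 0.22, "move B": 0.45, "move C": 0.50, "move D": 0.61}
reported = {move: sharpened(p) for move, p in true_winrates.items()}

rank_true = sorted(true_winrates, key=true_winrates.get, reverse=True)
rank_reported = sorted(reported, key=reported.get, reverse=True)

for move in rank_reported:
    print(f"{move}: true {true_winrates[move]:.2f}, reported {reported[move]:.2f}")
print("same ordering:", rank_true == rank_reported)
```

The reported numbers are more extreme than the true ones, yet the ordering of the moves is unchanged, which is all that is needed to pick the best move. The sizes of the differences between moves are distorted, though, which is the part that matters when judging how bad an error is.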