On the accuracy of winrates

moha · Post by **moha** » Thu Aug 09, 2018 1:45 pm

dfan wrote:I am currently working professionally on a technique to allow neural networks to output an amount of confidence in their results as well as the results themselves

To an extent a binary output does this in itself. If the net would have no idea it would output 0.5 as this minimizes the loss. For some classification tasks if the outputs are kept separate with individual activations, this behaviour is quite noticable. A board evaluation is somewhat similar (70% winrate -> +0.4 in the asymetry scale -> the confidence in we are winning).

Bill Spight · Post by **Bill Spight** » Thu Aug 09, 2018 1:57 pm

dfan wrote:
Bill Spight wrote:I bow to your superior knowledge, but aren't the bots trained only on complete games?
Yes. (Resigning is allowed.)

Furthermore, the bots do not choose their plays based solely on winrates.
Right, there's sort of a chicken and egg thing going on. The bot's network is trying to emulate the thinking of a stronger bot (which consists of itself armed with tree search). Its value output is trying to predict the result of that stronger bot's play. So it is always a little behind itself, in some sense, as is usually the case in reinforcement learning.

Convergence of winrates may be guaranteed in infinite time, but, while not a side effect, it is not the main effect, or goal of training. Pardon me if I am skeptical of finite results.
I think what you are challenging me to show (and are rightfully skeptical of) is different from what I thought I was showing. I didn't intend to claim that a win probability of 62.4% actually means that the engine playing itself would win 624 games out of 1000. I just mean that although 1) the win probabilities that are being trained are moving targets, and 2) they are targets that the network is trying to learn to hit, not its actual outputs (both big caveats), it is still true that the units of those targets really are "fraction of times this bot would win against itself". In your opening post of this thread I thought you were trying to argue that the win rates were much more abstract than that (although, rereading your post now, I may have been putting words in your mouth).

Thanks for your post. Very helpful.

Just think of me as being 25 years behind the times.

People are responding to my winrate claims, which may be incorrect or poorly expressed. Fine.

I am still willing to take winrate estimates of Leela (but not Leela Zero or certainly not Elf) of an ongoing game at move 100 between two evenly matched amateur dans or pros as the basis for a moderate bet, if I can bet on the projected winner. I think that they are too close to 50%.

What I am hoping is for people to generate multiple winrate estimates for games and to come up with error estimates. I think that information would be very helpful to people using bots to review games and joseki.

Gomoto · Post by **Gomoto** » Thu Aug 09, 2018 2:18 pm

I manage without any problems to reach around 70% in 6 tournament games in row and loose all the six games with one silly tactical mistake. So at least for me 70% winrate has not much meaning

. Or it has the meaning I have to stop playing thousands of games and have to start doing thousands of problems if I want to improve any further reliable. Pros tell me you have a good feeling, you will win the next one. They tell me this 6 times in a row

.

I will enjoy go anyway. If I win or loose, does not matter.

moha · Post by **moha** » Fri Aug 10, 2018 1:10 am

Following up on binary outputs: even if we interpret NN value outputs as confidence values, it seems quite easy to get precise masurement of the accuracy/relevancy of them.

Just take the last few million selfplay records from LZ, where each move/position is labeled by search results (and hopefully value net evaluation as well, though I'm not sure - maybe they only record visit counts for each move? would be a pity). Then create a graph with a few dozen data points, like actual game win% of cases where B had 50%, 51%, 52% and so on (maybe subdivide further for game phase, move ranges 0-20, 40-60 ...) The resulting graph should tell you everything about accuracy in LZ<>LZ games (not in LZ<>nonLZ or nonLZ<>nonLZ games though, OC).

Javaness2 · Post by **Javaness2** » Fri Aug 10, 2018 2:13 am

In chess I believe the evaluations are done in terms of centipawns. This can be translated into actual pieces on the board. The classic values being Pawn=100, Bishop/Knight=300, Rook=500, Queen=900. The evaluation has a material basis.

In go, the evaluation (winrate) has no material basis, or cannot be translated to one. This differs completely from the human approach to evaluation. As a result, most of us must have a hard time understanding what the hell a computer is spitting out at terminal. Dropping 4% doesn't correspond to a fixed points value on the board. Can the AI of today ever translate their winrates into material values, or can they co-display material value estimates in their output?

I suspect that they cannot, thus I personally struggle to trust the accuracy of their winrates in early parts of the game.
I also feel that AI is also going to lack value in terms of instruction until such an approach can really exist.

lightvector · Post by **lightvector** » Fri Aug 10, 2018 6:15 am

Javaness2 wrote:In chess I believe the evaluations are done in terms of centipawns. This can be translated into actual pieces on the board. The classic values being Pawn=100, Bishop/Knight=300, Rook=500, Queen=900. The evaluation has a material basis.

In go, the evaluation (winrate) has no material basis, or cannot be translated to one. This differs completely from the human approach to evaluation. As a result, most of us must have a hard time understanding what the hell a computer is spitting out at terminal. Dropping 4% doesn't correspond to a fixed points value on the board. Can the AI of today ever translate their winrates into material values, or can they co-display material value estimates in their output?

I suspect that they cannot, thus I personally struggle to trust the accuracy of their winrates in early parts of the game.
I also feel that AI is also going to lack value in terms of instruction until such an approach can really exist.

For a fixed neural network, (e.g. choose and fix a specific one of the Leela Zero networks for use in Lizzie), so long as the neural network is not too vastly stronger than me (e.g. Elf), I find I can readily develop a sense of what the neural network's winrate corresponds to over the course of a few game reviews with it of slow-paced games I played. I bet you can too if you try the same.

For slower-paced games where I have time to think as well as to count a few times over the course of the game, like everyone else, I have my own sense of "how much" I'm ahead. And for me, that "how much" feeling is definitely nonlinear in points. If we're headed into endgame and I've counted that I'm ahead by 5 points taking into account who has sente, against an equally-ranked opponent that feels to me like quite a solid buffer and hard to lose, whereas if it's still midgame (e.g. I'm ahead by 5 points in solid territory, and center access and influence and development potential all seem roughly equitable, but there are still invasions and fighting yet to happen), then it's really anybody's game still.

Over the course of doing several game reviews with a fixed neural net, I find I can pretty readily associate the neural network's winrates with my own sense of aheadness that I felt during the game. I'll find that when I felt so-and-so-much ahead the neural net will typically say numbers from 80%-90%, and when I felt so-and-so-much slighly-behind the neural network will typically be saying numbers from 30%-40%, etc. Sometimes it will say numbers that that are very different and outside of that range, because my own evaluation is just way off and I was misjudging something, because of course the bot is much stronger than me. But on average through repeated interaction it seems pretty easy to me to develop this intuitive correspondence. Then, when the bot reports a percentage change like 5%, I have a very intuitive "how much" that is - it's 1/6 of the the amount of "aheadness" that I typically associate with the bot saying 80% versus an even game 50%.

The trick is that you have to fix the neural network, and you have to be using it actively reviewing your own games so that your intuition calibrates to it. Different neural networks from different sources (like ELF vs Leela Zero) will have very different confidence scales, unsurprisingly typically the stronger neural nets will give values that are much more extreme for any given fixed "amount of advantage", because of course the stronger a player and opponent both are, the more a fixed amount of advantage is likely to result in the advantaged player winning.

So that does mean that when you see *other* people make posts saying this or that bot spit out this or that winrate, it's hard to interpret, because what the number means depends heavily on what neural network they're using, and it may easily not be one that you've used enough yourself to get a feel for. That's definitely unfortunate, and means communication about winrates on a forum like this often rightly feel adrift and ungrounded. But as for simply reviewing your own games, winrates from the current neural networks are already quite usable once you calibrate yourself to one particular network, just don't use one that's so much stronger than you that you can't do that (e.g. ELF network shooting up to saying 95% win the moment someone makes a small opening error).

mhlepore · Post by **mhlepore** » Fri Aug 10, 2018 9:53 am

Sorry I'm joining this conversation a little late...

Bill Spight wrote: If at some point in the game Black is estimated to have a winrate of 55%, how confident are we that Black is really ahead?

Perhaps I am getting too hung up on a word, but to me, for Black to be "really ahead" suggests the game has been solved. We aren't estimating who is ahead - Black is really ahead. Yet if it has been solved, we would see a winrate of 1 or 0, as someone mentioned earlier.

Implicit in the winrate, therefore, is the idea that the game hasn't been solved and that maybe this question cannot be answered to everyone's complete satisfaction.

Bill Spight wrote: Then if White makes a play that increases Black's estimated winrate by 3%, how confident are we that White has made a mistake?

I think this may depend on where you are in the game. At some point (perhaps in yose) the winrate will spike close to 1 or collapse close to 0 when things become certain. Suppose White has a bad position (low winrate) at move 100. It plays its best after that, but cannot overcome its bad position and loses. I would imagine Black's winrate will rise after W makes the best moves she can, if for no other reason than we are getting to the end of the game and White's chances to turn the game around are disappearing.

Bill Spight · Post by **Bill Spight** » Fri Aug 10, 2018 11:00 am

mhlepore wrote:Sorry I'm joining this conversation a little late...

Glad to have your input.

Bill Spight wrote: If at some point in the game Black is estimated to have a winrate of 55%, how confident are we that Black is really ahead?
Perhaps I am getting too hung up on a word, but to me, for Black to be "really ahead" suggests the game has been solved. We aren't estimating who is ahead - Black is really ahead. Yet if it has been solved, we would see a winrate of 1 or 0, as someone mentioned earlier.

People use fuzzy language like ahead, behind, having the edge, having chances, etc., all the time. Furthermore, people may disagree in their assessments, even experts. Which means that people make mistakes. So player A may say, Black is ahead and then Player B may say, no, Black is really behind. Usually we do not have an objective and practical way to decide whether A or B is right, but in this case we may. Let suitably strong and matched bots play the game out from that position many times. If White usually wins, then Black was not really ahead. If you will, this is a very weak form of solving the game.

Bill Spight wrote: Then if White makes a play that increases Black's estimated winrate by 3%, how confident are we that White has made a mistake?
I think this may depend on where you are in the game. At some point (perhaps in yose) the winrate will spike close to 1 or collapse close to 0 when things become certain. Suppose White has a bad position (low winrate) at move 100. It plays its best after that, but cannot overcome its bad position and loses.

I agree that whether we want to call a play that loses 3% in winrate a mistake may differ in different parts of the game. In the recent past, MCTS bots' winrate differences in the endgame have definitely been peculiar. My impression is that the current best NN bots are better in that regard, but I don't really know.

I would imagine Black's winrate will rise after W makes the best moves she can, if for no other reason than we are getting to the end of the game and White's chances to turn the game around are disappearing.

IIUC, in theory that is supposed to happen only some of the time. E.g., if at move 200 Black has a winrate of 80% then 20% of the time Black's winrate should drop to 0 by the end of the game.

Edit: Both of these questions may also be addressed if we have error estimates. So if Black is estimated to be 55% ahead, with an average error of 10%, I would hardly say that Black was really ahead. But if the average error was 1% I would be willing to offer my opinion that Black is really ahead. And if White's play increased Black's winrate by 3% my confidence that it was an error would depend upon the error estimates of both winrates. And the error estimates should generally drop as the end of the game approaches.

Bill Spight · Post by **Bill Spight** » Fri Aug 10, 2018 11:29 am

Javaness2 wrote:In chess I believe the evaluations are done in terms of centipawns. This can be translated into actual pieces on the board. The classic values being Pawn=100, Bishop/Knight=300, Rook=500, Queen=900. The evaluation has a material basis.

In go, the evaluation (winrate) has no material basis, or cannot be translated to one. This differs completely from the human approach to evaluation. As a result, most of us must have a hard time understanding what the hell a computer is spitting out at terminal. Dropping 4% doesn't correspond to a fixed points value on the board. Can the AI of today ever translate their winrates into material values, or can they co-display material value estimates in their output?

I suspect that they cannot, thus I personally struggle to trust the accuracy of their winrates in early parts of the game.
I also feel that AI is also going to lack value in terms of instruction until such an approach can really exist.

In the not too distant past, some MCTS bots evaluated the game in terms of points. As I understand, they were not as successful as those that evaluated the game in terms of winrates. In my considered opinion, evaluation by points requires the concept of temperature, as well. For instance, suppose that you are 2.5 points behind but it is your move. If the temperature is 7 you have good chances to win, but if it is 3 you do not.

I know of no reason that neural networks cannot make evaluations in terms of points and temperature, but winrates continue to be successful, so who is going to give such an approach a try?

moha · Post by **moha** » Fri Aug 10, 2018 1:20 pm

Bill Spight wrote:Both of these questions may also be addressed if we have error estimates. So if Black is estimated to be 55% ahead, with an average error of 10%, I would hardly say that Black was really ahead.

I don't think it is reasonable to expect a deviation term from the SAME SOURCE as the winrate (estimated probability of winning). If a bot would think it's chances are 55% with high potential error, he could adjust that towards 50%. It seems provable that for any given set of information, the winning probability (it's best estimate) always collapse to a single number.

If you are interested in how good these estimates are, their practical correlation to actual game outcomes in the long run, you could measure this nicely as I suggested earlier. But with such accuracy table available, you or the bot itself could again adjust the bot's winrates/guesses (depending on game phase etc. if you have such data), so it would again collapse to a single (corrected) net probability.

It would be different if the bot would evaluate to an estimate of score (some already do this btw), a normal-ish distribution with a deviation term. This deviation (and more extra data) would be meaningful for some decision making (like small sure win is better than high EV with high deviation). But even then you could calculate / collapse to a win probability in the end - a single "goodness" value is necessary for sorting your options to choose the best one.

Bill Spight · Post by **Bill Spight** » Fri Aug 10, 2018 1:47 pm

moha wrote:
Bill Spight wrote:Both of these questions may also be addressed if we have error estimates. So if Black is estimated to be 55% ahead, with an average error of 10%, I would hardly say that Black was really ahead.
I don't think it is reasonable to expect a deviation term from the SAME SOURCE as the winrate (estimated probability of winning). If a bot would think it's chances are 55% with high potential error, he could adjust that towards 50%. It seems provable that for any given set of information, the winning probability (it's best estimate) always collapse to a single number.

I don't think so. That is, even with an estimated winrate, you can also have an error term. However, they are not giving an error term, so there we are.

I am suggesting other ways of getting error estimates.

Jan.van.Rongen · Post by **Jan.van.Rongen** » Fri Aug 10, 2018 1:48 pm

I am also joining late in this discussion. I ran a lot of tests using various configuration of bots.

Test 1: top pro game. Analysed by Leela Zero network #157, but using the Ray machine (because it knows about ladders). Go review partner was used to run the analysis; GPU was a 1050 or 1080 Ti.

Code: Select all

playouts black_mean (sd)    white_mean (sd)
1600    -3.24 (4.84)        -2.94 (4.61)
6400    -1.78 (2.32)        -2.08 (2.80)
25600   -0.99 (1.81)        -1.29 (2.14)_
51200   -0.40 (1.39)        -0.72 (1.48)
102400  -0.30 (1.11)        -0.62 (1.50)
409600  -0.39 (1.32)        -0.72 (1.6)
1000000 -0.38 (1.22)        -0.70 (1.44)

What this means that with 1600 playouts the black moves were valued 3.24% (on average) below the choice of the bot and that the standard deviation of these differences was 4.84.

So for this game the evaluation was unstable until about 50,000 playouts.

Edit 2018-08-12: ran the 1M playouts per move last night. Still no sign of further convergence.

Test 2: 5d amateurs game, same setup

Code: Select all

playouts black_mean (sd)    white_mean (sd)
1600    -2.46 (5.41)        -2.9 (5.07)
51200   -1.65 (3.04)        -2.12 (3.98)
409600  -1.48 (3.26)        -1.94 (3.99)

Calvin Clark · Post by **Calvin Clark** » Fri Aug 10, 2018 2:29 pm

Part of the problem is the mindset of trying to solve this with one AI. More things would be possible with multiple strengths exploring the same position. A position that is won for Ke Jie is not necessarily won for me. I don't care if a bot says it thinks it has a won position. I want it to simulate what I would do, or an opponent of my level would do.

Bill Spight · Post by **Bill Spight** » Fri Aug 10, 2018 2:41 pm

Calvin Clark wrote:Part of the problem is the mindset of trying to solve this with one AI. More things would be possible with multiple strengths exploring the same position. A position that is won for Ke Jie is not necessarily won for me. I don't care if a bot says it thinks it has a won position. I want it to simulate what I would do, or an opponent of my level would do.

Yes, winrates depend upon strength.

moha · Post by **moha** » Fri Aug 10, 2018 2:49 pm

Bill Spight wrote:
moha wrote:
Bill Spight wrote:Both of these questions may also be addressed if we have error estimates. So if Black is estimated to be 55% ahead, with an average error of 10%, I would hardly say that Black was really ahead.
I don't think it is reasonable to expect a deviation term from the SAME SOURCE as the winrate (estimated probability of winning). If a bot would think it's chances are 55% with high potential error, he could adjust that towards 50%. It seems provable that for any given set of information, the winning probability (it's best estimate) always collapse to a single number.
I don't think so. That is, even with an estimated winrate, you can also have an error term.

Could you give an example of such a dual evaluation where it is not possible to collapse it to a single probability of winning?

Life In 19x19

On the accuracy of winrates

Re: On the accuracy of winrates

Re: On the accuracy of winrates

Re: On the accuracy of winrates

Re: On the accuracy of winrates

Re: On the accuracy of winrates

Re: On the accuracy of winrates

Re: On the accuracy of winrates

Re: On the accuracy of winrates

Re: On the accuracy of winrates

Re: On the accuracy of winrates

Re: On the accuracy of winrates

Re: On the accuracy of winrates

Re: On the accuracy of winrates

Re: On the accuracy of winrates

Re: On the accuracy of winrates