On the accuracy of winrates

Bill Spight · **#41**

Jan.van.Rongen wrote:

The Win_raw is thus of limited interest; it's like playing with 1 playout.

Based upon the writing of I. J. Good, I think that we can say that Win_raw is like playing with many playouts. (How many is another question.)

moha · **#42**

Bill Spight wrote:

moha wrote:

What I say is that even if you can make the theoretical distinction, from a bot's perspective there is no viable way to maintain or make use of an error term,

The bot doesn't use the error term. We do.

But you still expect it from the bot, which seems unreasonable.

Quote:

Edit: The fact that bots do not always choose the play with the best winrate shows that they have different ways of dealing with winrate uncertainty. If they needed error terms, they would calculate them.

I still don't see how. If a bot had a winrate estimate and an error term, it could transform those to a better estimate with no error term. Why would it keep and use the lower quality estimate in this case? And the error term doesn't apply to the corrected estimate.

The bot could instead calculate a wider representation of the position internally (including estimated score, ev WITH deviation, for example). And calculate the winrate from those only at the last step, for comparison operations. This would still result in winrates with no error term, but it could also output those internal states (for human curiosity only) which would include some deviation information (but not about the final winrate). Btw something like this actually happens in the DM algorithm where the search itself is controlled by neural nets (instead of winrates the tree contains some NN-handled unknown blackbox state representation, with NN transformations calculating the effects of visits.)
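
A minimal sketch of that "last step", under a loudly stated assumption: the internal state is a predicted final score lead with a standard deviation, and the lead is treated as normally distributed. Neither the normal model nor the function name `winrate_from_score` comes from any actual bot; they are only for illustration.

```python
import math

def winrate_from_score(mean_lead: float, std_dev: float) -> float:
    """Probability that the final score lead is positive.

    Assumes (hypothetically) that the lead is normally distributed
    with the given mean and standard deviation, in points.
    """
    if std_dev <= 0:
        # No spread at all: the outcome is decided by the sign of the lead.
        if mean_lead > 0:
            return 1.0
        if mean_lead < 0:
            return 0.0
        return 0.5
    z = mean_lead / std_dev
    # Standard normal CDF evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Same expected lead, different deviation, different winrate.
calm = winrate_from_score(3.0, 2.0)
wild = winrate_from_score(3.0, 10.0)
```

The deviation is consumed in the conversion: the resulting winrate is a single number with no error term attached, which is the point made above.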

Bill Spight · **#43**

moha wrote:

Bill Spight wrote:

moha wrote:

What I say is that even if you can make the theoretical distinction, from a bot's perspective there is no viable way to maintain or make use of an error term,

The bot doesn't use the error term. We do.

But you still expect it from the bot, which seems unreasonable.

Who says I expect error estimates from the bots? They are trained to play better, not necessarily to make better evaluations. Although they use evaluations in making their decisions.

But people use them in reviews to evaluate positions and plays. However, when they do, they do not know how accurate the evaluations are.

Quote:

Edit: The fact that bots do not always choose the play with the best winrate shows that they have different ways of dealing with winrate uncertainty. If they needed error terms, they would calculate them.

I still don't see how. If a bot had a winrate estimate and an error term, it could transform those to a better estimate with no error term.

Does not follow. But as I said, there is a literature about the estimation of probabilities. I started this thread to address the practical question that players face concerning the accuracy of estimated winrates, not to debate whether bots could produce perfect winrate estimates.

Gomoto · **#44**

In the pre-AI phase, when I had to evaluate an unknown/unclear position I asked the strongest player (in my club, or as the next step my pro database).

Today I still ask the strongest player (AI).

Nothing changed (the answers are not accurate, but for my practical purposes it still works quite well).

chut · **#45**

Quote:

I still don't see how. If a bot had a winrate estimate and an error term, it could transform those to a better estimate with no error term. Why would it keep and use the lower quality estimate in this case? And the error term doesn't apply to the corrected estimate.

The error term, if it exists, would be a measure of how confident we are in a certain move. That could be used to guide MCTS to focus more on the more uncertain branch, right?

moha · **#46**

chut wrote:

Quote:

I still don't see how. If a bot had a winrate estimate and an error term, it could transform those to a better estimate with no error term. Why would it keep and use the lower quality estimate in this case? And the error term doesn't apply to the corrected estimate.

The error term, if it exists, would be a measure of how confident we are in a certain move. That could be used to guide MCTS to focus more on the more uncertain branch, right?

Winrate already means confidence in winning, and uncertainty is measured by visit counts. So we already have [estimate,visits] which MCTS is based on. You can, for example, adjust down the weight of the just performed visit based on some error term, or adjust its resulting value estimate as above. In both cases the error term is assimilated. And in the end the answer to whether A>B will come in the form of an ultimate estimate from search, which contains all known information - and this is what the user sees (with visit totals). To get an idea about the accuracy of this you need further, external information.
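
To make the "adjust down the weight of the visit" idea concrete, here is a rough sketch. The class `NodeStats` and the weighting rule `1 / (1 + error)` are made up for illustration and are not taken from any existing engine.

```python
from dataclasses import dataclass

@dataclass
class NodeStats:
    """Per-node [estimate, visits] pair, with visits allowed to be fractional."""
    weighted_value_sum: float = 0.0
    weight_sum: float = 0.0  # acts as the "effective" visit count

    def update(self, value: float, error: float = 0.0) -> None:
        # Hypothetical rule: a visit with a larger error term counts for less.
        weight = 1.0 / (1.0 + error)
        self.weighted_value_sum += weight * value
        self.weight_sum += weight

    @property
    def estimate(self) -> float:
        # Weighted mean of the backed-up values; 0.5 before any visit.
        if self.weight_sum == 0.0:
            return 0.5
        return self.weighted_value_sum / self.weight_sum

stats = NodeStats()
stats.update(0.8)             # confident evaluation, full weight
stats.update(0.2, error=3.0)  # shaky evaluation, weight 0.25
```

After these two updates the node holds an effective visit count of 1.25 rather than 2, and the shaky value moves the estimate much less than the confident one. The error term has been absorbed into the [estimate,visits] pair and no longer exists separately.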

Btw I would guess if you test the actual correlation to game outcomes you may find it is reasonably accurate near 50% and 100%, and not necessarily linear but monotonic in between. So 75% may be somewhat off from 0.75, but still measurably better than 70%. The correctness of this relative estimate difference is what bot strength is based on.

chut · **#47**

moha wrote:

Winrate already means confidence in winning, and uncertainty is measured by visit counts. So we already have [estimate,visits] which MCTS is based on. You can, for example, adjust down the weight of the just performed visit based on some error term, or adjust its resulting value estimate as above. In both cases the error term is assimilated. And in the end the answer to whether A>B will come in the form of an ultimate estimate from search, which contains all known information - and this is what the user sees (with visit totals). To get an idea about the accuracy of this you need further, external information.

If I understand it correctly, the visit counts are based on the probability of a certain move being played according to the network weights, i.e. a more probable move will get evaluated more by MCTS. But this is contrary to how humans evaluate the uncertainty of a move. The 'obvious' moves (or the most probable moves) are the ones we are more certain of, either winning or losing. It is the less obvious moves that we consider more risky and are less certain of.

In a ladder or a capturing race situation the winning rate is deemed unknown (or zero certainty) until we read out the situation fully. I think the uncertainty factor is a meta-level evaluation that is lacking in the current MCTS. It is the reason why very strong bots like LZ still fall flat in situations that are obvious to humans.

Tryss · **#48**

That's contradictory: in a ladder, all the moves are obvious, so we should be certain of the situation, but at the same time, we have zero certainty about the situation? :scratch:

moha · **#49**

There are two reasons search tries a move: it was not tried enough before (exploration of relatively uncertain moves), or because it looked good at earlier tries (exploitation of current knowledge - most of the effort). NN just helps search stats initialize to a good guess at each new node.
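
A small sketch of how those two reasons can be combined into one selection score, using a simplified form of the PUCT rule described for AlphaGo Zero (mean value plus a prior-weighted exploration bonus that shrinks with visits). The dict layout and the constant `c_puct=1.5` are assumptions for the example, not values from any particular bot.

```python
import math

def puct_select(children, c_puct=1.5):
    """Return the index of the child to visit next.

    children: list of dicts with keys 'prior' (NN policy probability),
    'visits' and 'value_sum' (sum of backed-up values). Layout is hypothetical.
    """
    total_visits = sum(child["visits"] for child in children)
    best_index, best_score = 0, -math.inf
    for index, child in enumerate(children):
        # Exploitation: the mean value seen so far (0 if never visited).
        q = child["value_sum"] / child["visits"] if child["visits"] > 0 else 0.0
        # Exploration: large for moves with a good prior but few visits.
        u = c_puct * child["prior"] * math.sqrt(total_visits) / (1 + child["visits"])
        if q + u > best_score:
            best_index, best_score = index, q + u
    return best_index

# A well-explored move versus an untried one with the same prior.
children = [
    {"prior": 0.5, "visits": 100, "value_sum": 50.0},
    {"prior": 0.5, "visits": 0, "value_sum": 0.0},
]
next_move = puct_select(children)
```

Here the untried move is picked because its exploration bonus dominates; once it has collected visits, the bonus shrinks and the mean value takes over.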

MCTS's idea of uncertainty is a move with a relatively low visit count. Your idea seems to be a kind of sudden death situation, a ladder that either wins or loses the game. IMO those are not simply uncertain (still 50%), they relate to the "quiescence" idea of chess - that the just reached leaf pos should not be evaluated at all, only searched further. AFAIK this approach is not currently in use in go (maybe it fits better to the more structured search models prevalent in chess?). In go the NN (or rollouts in Master's days) does take on the evaluation of even very dynamic positions (how and how well is another question).

Note though that if you do return some kind of quiescence/error info, using it to decrease the weight of such visits as I wrote above leads to reasonable behaviour: directing further visits here to reduce the uncertainty (which MCTS "sees" from the abnormally slowly increasing counts).

chut · **#50**

Humans recognize a ladder and the life/death of a group (big or small) as a top level evaluation. As I see it, this level of reasoning is absent in the current architecture for go. If there is no help from this level of reasoning, the MCTS will never know that a certain branch needs to be pushed deep, and it is extremely hard for the NN to learn the ladder pattern unless we give MCTS a LOT of time for the pattern to emerge.

I think this is why even the mighty ELF is susceptible to failed ladders. I got LZ to fall into the ladder trap a few times, and LZ went berserk. I do believe that we need a meta level guidance system for MCTS.

chut · **#51**

I think there is definitely a missing link here. The intermediate patterns in the middle of a ladder/capturing race have no meaning. It is only to the board position when the ladder is fully played out that we can attribute a meaningful winrate. When I got LZ to play out a failed ladder, that is exactly what it did, by treating every intermediate position as independent in and of itself, and the MCTS went berserk.

To use a human analogy the NN would be like our intuition, it is a pattern recognition engine that accumulates experience through millions of games. The MCTS would be like our calculating/counting brain. With the current architecture the NN is dominant and the MCTS is in a strictly subordinate position. I think we need a meta level control that can tell the MCTS when to suppress the NN and guide its tree search strategy. Maybe some hand coded logic to recognize groups in danger and play out the variations. This could be a deep learning project itself.

yakcyll · **#52**

A small digression: as far as I can remember, MCTS produces fully random playouts once the candidate move is selected. Is it different for AGZ or Leela?

Tryss · **#53**

Yes. LZ or AlphaGo don't do simulation (random plays until the end of the game); this is replaced by the value network's evaluation of the position.

moha · **#54**

yakcyll wrote:

A small digression: as far as I can remember, MCTS produces fully random playouts once the candidate move is selected. Is it different for AGZ or Leela?

MC rollouts <> MCTS (tree search). Rollouts mean quick playouts to the end (last used in Master). MCTS means trying various things out of order (unlike rigid minimax-derived methods in chess), based on a weighted random scheme where more interesting lines get more attention.

Btw Master's rollouts were not full-random but half-random (guided by some simple handcrafted policy). They couldn't use NN based rollouts for performance reasons, but it was suggested multiple times that if such would be possible it would give much better evaluations than a single value net. In fact, "let's play it out" a few times is the best known evaluation method (depending on playout quality oc).

dfan · **#55**

chut wrote:

I think we need a meta level control that can tell the MCTS when to suppress the NN and guide its tree search strategy. Maybe some hand coded logic to recognize groups in danger and play out the variations. This could be a deep learning project itself.

Note that it is not hard to add ladder logic to a program like AlphaGo Zero, and in fact the original AlphaGo had it. (They didn't add logic to the tree search; they just added more handcrafted features to the neural net input.) It was removed in AlphaGo Zero because DeepMind wanted to show that mastery was achievable without any help about game strategy from humans. The other public AlphaGo Zero-inspired programs (Leela Zero, ELF OpenGo) have followed its lead, perhaps because they wanted to start by reproducing AlphaGo Zero's results. It would not be hard in principle to put such logic back in, and perhaps some private projects already have.

moha · **#56**

dfan wrote:

to add ladder logic to a program like AlphaGo Zero, and in fact the original AlphaGo had it. ... It would not be hard in principle to put such logic back in, and perhaps some private projects already have.

There is also a seemingly more elegant solution: don't add back "ladder capture" / "ladder escape", just add intermediate info like "color and distance of closest stone" in all directions. Then let the net make whatever use it can from this. One nice feature of NNs is that they can "grow around" (like a tree) and use any kind of info externally provided. Such scans just ease the problems with network connectivity over distances, are not really go knowledge.
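
As a rough illustration of such a scan (not taken from any actual program), the sketch below walks outward from a point in eight directions and records the color and distance of the first stone met. The board encoding and the name `closest_stones` are assumptions for the example; a real net input would turn the result into feature planes.

```python
EMPTY, BLACK, WHITE = 0, 1, 2

# Four straight and four diagonal directions as (row step, column step).
DIRECTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1),
              (-1, -1), (-1, 1), (1, -1), (1, 1)]

def closest_stones(board, row, col):
    """Map each direction to (color, distance) of the nearest stone.

    board is a square list of lists holding EMPTY, BLACK or WHITE.
    A direction that reaches the edge without a stone maps to (EMPTY, None).
    """
    size = len(board)
    result = {}
    for d_row, d_col in DIRECTIONS:
        r, c, distance = row + d_row, col + d_col, 1
        found = (EMPTY, None)
        while 0 <= r < size and 0 <= c < size:
            if board[r][c] != EMPTY:
                found = (board[r][c], distance)
                break
            r, c, distance = r + d_row, c + d_col, distance + 1
        result[(d_row, d_col)] = found
    return result

# Tiny 5x5 example: scan from the center point.
board = [[EMPTY] * 5 for _ in range(5)]
board[0][2] = BLACK  # straight up, two steps away
board[4][4] = BLACK  # down-right diagonal, two steps away
board[2][4] = WHITE  # straight right, two steps away
scan = closest_stones(board, 2, 2)
```

The diagonal entries are the ones that would carry ladder-relevant information over long distances, without the code knowing anything about ladders as such.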

Vance · **#57**

Bill Spight wrote:

By now it is well understood that the winrates calculated by today's top bots are not actually win rates of anything known. For instance, if at a certain point in a game White is said to have a winrate of 60%, that does not mean that if we play out the game 10,000 times under certain known conditions, White will win approximately 6,000 times. So we cannot test estimated winrates against actual play and determine the accuracy (error rates) of those estimates.

But we do need to be able to measure the accuracy of those estimates. If at some point in the game Black is estimated to have a winrate of 55%, how confident are we that Black is really ahead? Then if White makes a play that increases Black's estimated winrate by 3%, how confident are we that White has made a mistake? If we compare two plays and one of them has a winrate 1.2% better than the other, how confident are we that it is the better play?

"Winrate" here means something like "estimated probability of winning when playing against itself from this position". Which is more or less like the gut feeling we might have for the position, or like betting odds for the game.

For any particular position it's true that you can't expect to get 60% wins out of 10,000 playouts. For example, if it's a complex midgame position where a dragon might live or die, the estimate may be 60% for black, but in 10 more moves it might turn out that white is actually likely to win.

To get a more accurate evaluation (as related to perfect play) for a particular position, you of course need a stronger bot. Or more thinking time. The stronger it is, the closer the estimate would get to 0% or 100%, until the position is completely solved.

However, as a "winrate" that estimate would only be accurate for the stronger players. To get better "winrate" estimates for human games, you'd need a bot that plays at a particular level, and has been trained to predict human moves at that level.

If the question is how to test whether the estimates are actually good at predicting the self-play results, the obvious idea is testing over a lot of different positions.

For example, find 1000 different positions where the estimate is 60% for black, and play them all out. If black actually wins 60% of the games, the estimates were pretty good. If black wins 80%, the estimates were useful, but very conservative. If white wins 60%, well, there is something wrong.
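
How far from 60% the result has to be before it means anything can be estimated with the usual binomial standard error. The sketch below assumes the 1000 games are independent and uses a normal approximation with z = 1.96 (roughly a 95% band); the function name `bucket_check` is made up for the example.

```python
import math

def bucket_check(estimate, wins, games, z=1.96):
    """Compare an observed win fraction with the estimated winrate.

    Returns (observed, (low, high), consistent), where (low, high) is the
    band in which the observed fraction should fall if the estimate is right.
    Assumes independent games and a normal approximation to the binomial.
    """
    observed = wins / games
    std_err = math.sqrt(estimate * (1.0 - estimate) / games)
    low, high = estimate - z * std_err, estimate + z * std_err
    return observed, (low, high), low <= observed <= high

# The three cases from the paragraph above, with 1000 positions each.
good = bucket_check(0.60, 600, 1000)
conservative = bucket_check(0.60, 800, 1000)
wrong = bucket_check(0.60, 400, 1000)
```

With 1000 games the band around 60% is only about three percentage points wide on each side, so both the 80% and the 40% result fall clearly outside it.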

Bill Spight · **#58**

When I wrote the first note, I was unaware that "winrate" did not mean the results of random or semi-random playouts, but was intended to estimate the results of the bot's self play.

How well the winrate does that is still an open question.

Vance · **#59**

Well, that's what I addressed in the last point. Collect samples, have the bot play them out, compare actual results with the estimates.

moha · **#60**

Vance wrote:

Collect samples, have the bot play them out, compare actual results with the estimates.

There should be no need to play out, just make a detailed/parametrized correlation table from LZ's last million of selfplay games. Except they don't seem to record winrate estimates (only visit counts) in the training data, and selfplay sgfs are not annotated AFAIK (if kept at all). :scratch:
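
If such records did exist, the table could be built along these lines. The input format, a list of (estimated winrate, whether that side won) pairs, is purely hypothetical, as is the sample data; as noted above, the actual training data does not seem to contain winrate estimates.

```python
from collections import defaultdict

def calibration_table(records, n_bins=10):
    """Group (estimate, won) pairs into winrate bins.

    Returns a list of (bin_low, bin_high, games, observed_win_fraction),
    sorted by bin, skipping empty bins.
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [games, wins]
    for estimate, won in records:
        # An estimate of exactly 1.0 goes into the top bin.
        index = min(int(estimate * n_bins), n_bins - 1)
        counts[index][0] += 1
        counts[index][1] += int(won)
    return [
        (index / n_bins, (index + 1) / n_bins, games, wins / games)
        for index, (games, wins) in sorted(counts.items())
    ]

# Made-up sample: three positions estimated in the 60s, one in the 90s.
records = [(0.62, True), (0.65, True), (0.68, False), (0.91, True)]
table = calibration_table(records)
```

Each row can then be read as "positions estimated between low and high were actually won this often", which is the correlation being asked for.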
