I have the impression that the term "winrate" is used for different things:
- WRraw: the winrate estimated by the raw neural network. For a human this would correspond to picking the most intuitive move and trying to guess at a glance the chances of winning.
- WRn: the winrate estimated after n playouts, where n is a large number (given Jan van Rongen's tests above, n=50000 would be a good compromise between accuracy and computation time). For a human, this would correspond to estimating winrate after deep reading.
- WRtrue: the limit when N tends to infinity of the proportion of won games when N test matches are run starting from the position.
The best way to estimate WRtrue would be to run a large number N of test matches and calculate the proportion of won games (this is what AlphaZero did to create its teaching tool). Call this proportion p. The estimate of WRtrue is p, and we can estimate the error by 2 sqrt(p(1-p)/N) (if we want a confidence interval of about 95%).
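As a minimal sketch of this procedure: the function below runs N matches from a fixed position and returns the estimate p together with the 2 sqrt(p(1-p)/N) half-width of the ~95% confidence interval. The `play_match` callable is a hypothetical stand-in for a real engine self-play match (here simulated by a coin flip with a known true winrate, just to illustrate the arithmetic):

```python
import random

def estimate_winrate(play_match, N, seed=0):
    """Estimate WRtrue by running N test matches from a position.

    play_match(rng) is a hypothetical callable returning True on a win.
    Returns (p, half_width), where half_width = 2*sqrt(p*(1-p)/N),
    i.e. an approximate 95% confidence interval is p +/- half_width.
    """
    rng = random.Random(seed)
    wins = sum(play_match(rng) for _ in range(N))
    p = wins / N
    half_width = 2 * (p * (1 - p) / N) ** 0.5
    return p, half_width

# Toy stand-in for an engine match: win with true probability 0.6.
p, err = estimate_winrate(lambda rng: rng.random() < 0.6, N=10_000)
```

With N = 10,000 the half-width is about 0.01, which shows why so many matches are needed for a tight estimate of WRtrue.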
However, we usually don't do that because it's too computationally expensive, so we use WRraw or WRn (or WRm for a smaller number m, like m=1000) as estimates. WRn takes more time to compute, but is probably a better estimate than WRraw.
So currently we consider that WRn is a good estimator of WRtrue, but we don't know how large the error |WRn - WRtrue| can be. It might be possible in the future to train a computer to give a good estimate of this error, but for the moment we can't do that. Perhaps |WRn - WRraw| gives an idea of the magnitude of the error, but some tests would be needed to determine whether this is true.
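One way such a test could be set up, as a sketch with hypothetical data: over a set of test positions where WRraw, WRn, and (expensively) WRtrue have all been measured, compute the correlation between the cheap proxy |WRn - WRraw| and the real error |WRn - WRtrue|. A high correlation would support using the proxy as an error indicator. The winrate lists below are invented for illustration:

```python
def error_proxy_correlation(wr_raw, wr_n, wr_true):
    """Pearson correlation between |WRn - WRraw| (the cheap proxy)
    and |WRn - WRtrue| (the real error), over test positions."""
    proxy = [abs(n - r) for n, r in zip(wr_n, wr_raw)]
    real = [abs(n - t) for n, t in zip(wr_n, wr_true)]
    mp = sum(proxy) / len(proxy)
    mr = sum(real) / len(real)
    cov = sum((a - mp) * (b - mr) for a, b in zip(proxy, real))
    var_p = sum((a - mp) ** 2 for a in proxy)
    var_r = sum((b - mr) ** 2 for b in real)
    return cov / (var_p * var_r) ** 0.5

# Made-up winrates for three positions, chosen so that the proxy
# happens to match the real error exactly (correlation 1.0).
c = error_proxy_correlation(
    wr_raw=[0.45, 0.50, 0.55],
    wr_n=[0.50, 0.60, 0.70],
    wr_true=[0.55, 0.70, 0.85],
)
```

In a real test the correlation would of course fall somewhere in [-1, 1], and only a value well above zero would justify the proxy.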