I have the impression that the term "winrate" is used for different things:
- WRraw: the winrate estimated by the raw neural network. For a human this would correspond to picking the most intuitive move and trying to guess at a glance the chances of winning.
- WRn: the winrate estimated after n playouts, where n is a large number (given Jan van Rongen's tests above, n=50000 would be a good compromise between accuracy and computation time). For a human, this would correspond to estimating winrate after deep reading.
- WRtrue: the limit when N tends to infinity of the proportion of won games when N test matches are run starting from the position.
The best way to estimate WRtrue would be to run a large number N of test matches and calculate the proportion of won games (this is what AlphaZero did to create its teaching tool). Call this proportion p. The estimate of WRtrue is p, and we can estimate the error by 2 sqrt(p(1-p)/N) (if we want a confidence interval of about 95%).
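As a minimal sketch of this procedure: the function below runs N matches from a fixed position and returns the estimate p together with the 2 sqrt(p(1-p)/N) half-width of the ~95% confidence interval. The `play_match` callable is a hypothetical stand-in for a real engine self-play match (here simulated by a coin flip with a known true winrate, just to illustrate the arithmetic):

```python
import random

def estimate_winrate(play_match, N, seed=0):
    """Estimate WRtrue by running N test matches from a position.

    play_match(rng) is a hypothetical callable returning True on a win.
    Returns (p, half_width), where half_width = 2*sqrt(p*(1-p)/N),
    i.e. an approximate 95% confidence interval is p +/- half_width.
    """
    rng = random.Random(seed)
    wins = sum(play_match(rng) for _ in range(N))
    p = wins / N
    half_width = 2 * (p * (1 - p) / N) ** 0.5
    return p, half_width

# Toy stand-in for an engine match: win with true probability 0.6.
p, err = estimate_winrate(lambda rng: rng.random() < 0.6, N=10_000)
```

With N = 10,000 the half-width is about 0.01, which shows why so many matches are needed for a tight estimate of WRtrue.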
However, we usually don't do that because it's too computationally expensive, so we use WRraw or WRn (or WRm for a smaller number m, like m=1000) as estimates. WRn takes more time to compute, but is probably a better estimate than WRraw.
So currently we consider that WRn is a good estimator of WRtrue, but we don't know how large the error |WRn - WRtrue| can be. It might be possible in the future to train a computer to give a good estimate of this error, but for the moment we can't do that. Perhaps |WRn - WRraw| gives an idea of the magnitude of the error, but some tests would be needed to determine whether this is true.
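One way such a test could be set up, as a sketch with hypothetical data: over a set of test positions where WRraw, WRn, and (expensively) WRtrue have all been measured, compute the correlation between the cheap proxy |WRn - WRraw| and the real error |WRn - WRtrue|. A high correlation would support using the proxy as an error indicator. The winrate lists below are invented for illustration:

```python
def error_proxy_correlation(wr_raw, wr_n, wr_true):
    """Pearson correlation between |WRn - WRraw| (the cheap proxy)
    and |WRn - WRtrue| (the real error), over test positions."""
    proxy = [abs(n - r) for n, r in zip(wr_n, wr_raw)]
    real = [abs(n - t) for n, t in zip(wr_n, wr_true)]
    mp = sum(proxy) / len(proxy)
    mr = sum(real) / len(real)
    cov = sum((a - mp) * (b - mr) for a, b in zip(proxy, real))
    var_p = sum((a - mp) ** 2 for a in proxy)
    var_r = sum((b - mr) ** 2 for b in real)
    return cov / (var_p * var_r) ** 0.5

# Made-up winrates for three positions, chosen so that the proxy
# happens to match the real error exactly (correlation 1.0).
c = error_proxy_correlation(
    wr_raw=[0.45, 0.50, 0.55],
    wr_n=[0.50, 0.60, 0.70],
    wr_true=[0.55, 0.70, 0.85],
)
```

In a real test the correlation would of course fall somewhere in [-1, 1], and only a value well above zero would justify the proxy.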