Bill Spight wrote:
Suppose that we have a winrate estimate for Black in a given position with White to play of 87%. What does that mean? Well, it is said that it means that Black has an estimated 87% probability of winning the game. That's ambiguous. First, there are two main schools of probability, frequentist and Bayesian. The Bayesian school is the oldest and newest. The frequentist school replaced it in the late 19th to early 20th century, but Bayesian probability made a comeback in the late 20th century. Both are in use today. The frequentist school says that a single event has no fractional probability; it either occurs or it doesn't. The probability is 0 or 1. By that thinking the margin of error of the 87% estimate is 87%. That's no help.
But suppose that we were to play out the game several times from that position with White to play, where the AI plays both sides with the same randomized, non-deterministic strategy. Then sometimes Black will win and sometimes White will win. These results will give us a second and different, statistical estimate of Black's winrate under those conditions of play. Furthermore, we can calculate the standard error of that winrate estimate and from that derive a margin of error within which we expect that the actual winrate lies. Typically we find the margin of error by multiplying the standard error by 2. We expect that the actual winrate will lie within that margin of error approximately 95% of the time. Obviously, this is not how the program makes its original winrate estimate. If the difference between the program's original winrate estimate and the statistical estimate is greater than the latter's margin of error, we get suspicious of the original estimate.
But note that we have only calculated the margin of error of the statistical estimate, not the margin of error of the original winrate estimate, which is what we want. What I want, anyway. We have found 1 error estimate, that's all.
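As a side note before my main point: to make the arithmetic in that last quoted paragraph concrete, here is a minimal sketch of the standard-error calculation with made-up numbers, assuming a simple binomial model for the repeated playouts (my own illustration, not anything any bot actually reports):
Code:
import math

# Suppose we replay the position under the randomized strategy n times
# and Black wins k of those games (the numbers here are made up).
n = 100
k = 81

p_hat = k / n                                 # statistical winrate estimate
std_err = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of a proportion
margin = 2 * std_err                          # about 2 standard errors ~ 95%

print(f"{p_hat:.2f} +/- {margin:.2f}")        # prints: 0.81 +/- 0.08
# An original estimate of 0.87 lies inside [0.73, 0.89], so this check alone
# would not make it look suspicious.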
There's a subtle but important point that might not be obvious to people.
Suppose I perform two self-play training runs, run A and run B, where the bot from run A plays with some amount of randomness and the bot from run B, except perhaps for the early opening, plays entirely deterministically. And yes, you can do this with AlphaZero-style training! As long as there is sufficient randomization in the variety of positions and trajectories you feed the bot as input, the whole self-play training process works even if, at any particular moment in time, the bot itself behaves deterministically in those positions. Basically, this is because the neural net can't tell the difference; either way it sees the same thing: a never-ending stream of positions, never repeating (except for the very early opening), each one labeled as "win" or "loss", with no way to tell counterfactually whether another alternative would have been possible.
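To illustrate that last point, here is a toy sketch (a stand-in game, not Go, and not any real training pipeline) of what the training data looks like from the net's point of view: positions paired with the eventual result, with nothing recording whether the moves that produced them were sampled or chosen deterministically.
Code:
import random

# A toy stand-in for a game (not Go): a "position" is the list of moves so
# far, the game ends after 10 moves, and a fixed parity rule decides it.
def game_over(position):
    return len(position) >= 10

def final_result(position):
    return "win" if sum(position) % 2 == 0 else "loss"

def stochastic_move(position, rng):        # "bot A": randomized choice
    return rng.choice([0, 1, 2])

def deterministic_move(position, rng):     # "bot B": always the same choice
    return (sum(position) * 7) % 3

def self_play_game(choose_move, start, rng):
    """Return the (position, result) training pairs from one self-play game."""
    position, history = list(start), []
    while not game_over(position):
        history.append(tuple(position))
        position.append(choose_move(position, rng))
    return [(pos, final_result(position)) for pos in history]

rng = random.Random(0)
start = [rng.choice([0, 1, 2]) for _ in range(3)]   # randomized early opening
print(self_play_game(stochastic_move, start, rng)[:2])
print(self_play_game(deterministic_move, start, rng)[:2])
# Either way the net sees the same kind of thing: positions, each labeled
# "win" or "loss", with no record of what else could have been played.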
So say I produce a final bot A and bot B that are both about equally strong. For any particular choice of search depth and other settings, bot B is deterministic, but if you were to play a game against either one, you likely would not be able to tell the difference between A and B. Perhaps the programmer hard-codes some "artificial" randomization into the first 20 moves, so that even across several games you normally wouldn't be able to tell. And if you used them for game analysis, both of them would still report the same fuzzy winrates, like 30%, or 75%, or whatever (yes, bot B would still report such winrates even though it learned from data produced via "deterministic" self-play).
Now we consider applying this procedure:
Quote:
But suppose that we were to play out the game several times from that position with White to play, where the AI plays both sides with the same randomized, non-deterministic strategy. Then sometimes Black will win and sometimes White will win. These results will give us a second and different, statistical estimate of Black's winrate under those conditions of play.
The two bots vary drastically on the metric of this "second and different statistical estimate". For bot A, you'll get some fuzzy percentage. For bot B, you'll get either 0 or 1 - because it doesn't have a randomized, non-deterministic strategy. Despite that, the winrate is still meaningful and about the same from the user's perspective - maximizing it still gives you superhuman play, and in many cases it looks about the same as bot A's winrates. And across bot B's "90%" positions, most of them do indeed play out to be wins. Nonetheless, if you took any single "90%" position for bot B, playing that position out repeatedly would either win 100% of the time or lose 100% of the time, so the "margin of error" on that position by this metric would be... well... either 10% or 90%.
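As a toy sketch of that contrast (with stand-in functions, not real bots): replaying the same position many times gives bot A a fuzzy empirical winrate with a sensible-looking standard error, while bot B's replays are all identical, so its empirical winrate is exactly 0 or 1 and the standard error is exactly 0 - even though both bots report "90%" on the position.
Code:
import math, random

N = 100
REPORTED = 0.90                  # what both bots display for this position

def replay_bot_a(rng):
    """Stand-in for bot A: its move choices vary, so the outcome varies."""
    return 1 if rng.random() < REPORTED else 0

def replay_bot_b():
    """Stand-in for bot B: identical moves every replay, identical outcome."""
    return 1                     # this particular line happens to be a win

rng = random.Random(0)
for name, results in [("A", [replay_bot_a(rng) for _ in range(N)]),
                      ("B", [replay_bot_b() for _ in range(N)])]:
    p = sum(results) / N
    se = math.sqrt(p * (1 - p) / N)
    print(f"bot {name}: empirical {p:.2f}, 2*SE {2 * se:.2f}, "
          f"gap to reported {abs(p - REPORTED):.2f}")
# bot A: a fuzzy empirical winrate near 0.90, small margin, small gap.
# bot B: empirical exactly 1.00 (or 0.00), margin 0.00, gap 0.10 (or 0.90).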
The takeaway of this thought experiment is to make it starker that this metric - "how often does the bot actually win if played out repeatedly from this position" - cannot, alone, be the whole story of what you want (very much agreeing with what Bill said). Despite varying drastically on this metric, the two bots are very similar in many of the ways that matter from a practical user perspective. For both of them, the user is faced with the question "how much can I trust the bot's evaluations", and for both bots those winrates are often about the same. What this thought experiment shows is that in some cases, what you're measuring like this has less to do with "how much should I trust those winrates I see on my screen" and more to do with irrelevant details of the bot's implementation that don't significantly affect those winrates.
Okay, maybe this metric still tells you something useful in practice with actual bots, even if it's not the whole story. That's why I brought it up myself as a starting point. But still, the question arises: what is the right metric, and how do you measure it?
This is partly why I dislike the phrasing "the margin of error". Repeatedly campaigning for research on "the margin of error" linguistically, loosely, gives the impression to people listening in that "the margin of error" is a concrete thing that is already agreed upon, unique, and well-defined - that it's known how to compute it, and it's just up to those darned programmers to finally start doing so and reporting it. When actually, as Robert Jasiek pointed out, all we have that would be tractable are different proxies, with no bright-line or canonical choice of which one to pick or how good they are.
Instead of "
we should do more research into ways of the margin of error of bot winrates"...
I like the phrasing "
we should do more research into ways of quantifying or measuring the uncertainty of bot winrates".
And instead of "
in this situation, I estimate the bot's margin of error is X"...
I like "
in {situations like this, openings, endgames, this exact case I tested}, on average the bot {fluctuates, changes-its-mind, disagrees, ...} with {itself, other bots, the empirical winrate from rolling out the game repeatedly,...} by X" (in each case, pick words that make it less ambiguous how you got X and what it represents).
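For concreteness, one of the many possible proxies phrased that way could be computed roughly like this - a sketch only, where eval_winrate and rollout are hypothetical stand-ins for however you query the bot and play a position out, not any real bot's API:
Code:
def average_disagreement(positions, eval_winrate, rollout, n_rollouts=50):
    """Average |reported winrate - empirical rollout winrate| over positions.

    eval_winrate(pos) -> the bot's displayed winrate for the position.
    rollout(pos)      -> 1 or 0, the result of playing the game out once.
    Both are hypothetical stand-ins, to be filled in with a real bot.
    """
    total = 0.0
    for pos in positions:
        empirical = sum(rollout(pos) for _ in range(n_rollouts)) / n_rollouts
        total += abs(eval_winrate(pos) - empirical)
    return total / len(positions)

# The resulting claim has the form: "on these 200 midgame positions, the
# bot's reported winrate differs from the empirical winrate over 50 rollouts
# by 0.04 on average" - which says how X was obtained and what it represents,
# instead of calling X "the margin of error".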
Right now, a nontrivial part of the task isn't ready for the software engineers. Rather, it's up to the people who happen to be both Go players and skilled mathematicians to come up with proposals for the mathematical quantity to be measured that will most likely correspond to what their Go intuitions want to see. Or at least, if the latter don't have clear proposals yet, then it shouldn't be a surprise that the former haven't coded it up yet.
