Leela Elf applied to the Metta vs. Ben David game
I think that the first steps to take are as indicated above. But Ales Cieply has graciously made available two Leela Elf analyses (rsgf files) of the now infamous Metta - Ben David game (viewtopic.php?p=234293#p234293). One file is set for 100k rollouts for the whole game; the other is set for 200k rollouts, starting with move 30. I'm not sure what the rollout number means, as I have not been able to make the reported rollouts in the files add up to either number. But the 200k rollout file does, I guess, twice as much exploration as the 100k rollout file.
Now, since the rollout settings differ, we cannot do the basic comparison I have suggested above. But, on the assumption that the 200k rollout winrates are more accurate than the 100k rollout winrates, we can take the difference between the winrates for each position as an estimate of the error of the 100k rollout winrate.
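To make the idea concrete, here is a minimal sketch of that error estimate. The function name, the dictionary layout, and the sign convention (100k minus 200k) are my assumptions, and the winrates are made-up placeholders, not values from the rsgf files.

```python
def estimate_errors(winrates_100k, winrates_200k):
    """Estimate the 100k error per move as (100k winrate - 200k winrate).

    Both arguments map move number -> Black's winrate in percent.
    Only moves present in both analyses are compared.
    """
    shared_moves = sorted(set(winrates_100k) & set(winrates_200k))
    return {m: winrates_100k[m] - winrates_200k[m] for m in shared_moves}


# Placeholder numbers for illustration only, NOT taken from the game.
w100 = {29: 48.0, 30: 52.5, 31: 51.0, 32: 47.5}
w200 = {30: 50.0, 31: 51.0, 32: 49.0}  # the 200k file starts at move 30

errors = estimate_errors(w100, w200)
print(errors)
```

A positive value means the 100k winrate is higher than the 200k winrate for that position; a negative value means it is lower.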
Well, it certainly is plausible that the 200k winrates are more accurate than the 100k winrates, but do we have any evidence of that? Indeed, we do.
Suppose that a position has features that lead the 100k rollout winrate to overestimate the probability that Black wins. One move is not likely to change the position so much that the 100k rollout winrate will be correct or an underestimate. Its winrate for the next position is likely to also be an overestimate. Thus, if the 200k winrates are sufficiently more accurate than the 100k winrates, the sign of the difference between the two should usually stay the same between successive plays. (Of course, you can flip the argument. The same would hold true if the 100k rollout winrates were underestimates. But still, if persistent features of the board lead to misestimates, the signs of consecutive differences should tend to remain the same.)
I calculated the differences for moves 30 - 166. (The last play was 165, but Elf calculated estimates for move 166.) That yields 135 consecutive differences. If the signs of the differences changed randomly, there should be approximately 67.5 sign changes. There were two zero differences. One came in the middle of a change from plus to minus (plus-zero-minus); the other went plus-zero-plus. If we count the latter as one half of a double sign change, we get 45 sign changes. If not, we get 44. Either way, that's at least 22.5 fewer sign changes than expected if they were random.
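For anyone who wants to repeat the count, here is a rough sketch of the bookkeeping. The function names and the flag for the plus-zero-plus case are my own inventions, and the short list of differences is made up; only the figures 135, 45 and 44 come from the count above.

```python
def count_sign_changes(diffs, count_zero_touch=True):
    """Count sign changes in a sequence of winrate differences.

    Zeros are skipped when comparing signs, so plus-zero-minus is one change.
    A same-sign run through a zero (plus-zero-plus or minus-zero-minus)
    counts as one change only if count_zero_touch is True.
    """
    changes = 0
    last_sign = None      # sign of the most recent nonzero difference
    zero_between = False  # was there a zero since that difference?
    for d in diffs:
        if d == 0:
            zero_between = True
            continue
        sign = 1 if d > 0 else -1
        if last_sign is not None:
            if sign != last_sign:
                changes += 1
            elif zero_between and count_zero_touch:
                changes += 1
        last_sign = sign
        zero_between = False
    return changes


def expected_if_random(n_comparisons):
    """Expected sign changes if each comparison flips with probability 1/2."""
    return n_comparisons / 2


# Made-up differences for illustration: signs are + + - 0 - +
example = [1.0, 2.0, -1.0, 0.0, -2.0, 3.0]
print(count_sign_changes(example, count_zero_touch=True))
print(count_sign_changes(example, count_zero_touch=False))

# Figures from the count above: 135 comparisons, 45 (or 44) observed changes.
shortfall = expected_if_random(135) - 45
print(shortfall)
```

The flag just switches between the two ways of counting the plus-zero-plus case, which is where the 45 versus 44 comes from.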