Engine Tournament

jann · Post by **jann** » Tue Sep 10, 2019 10:05 am

Bill Spight wrote:How does the winning percentage of the stronger player change with, say, doubling the playouts? My guess is that there is an optimal number of playouts to discriminate between players.

As the issue linked by jlt (and other similar reports) shows, there is no real upper optimum (within reasonable limits) - more search means more difference. (The artifact at extremes, like at very low playouts is a different effect - if the engine is only allowed to look at 5-10 nodes the rewards and penalties in choosing even a single one (in)efficiently can be very different).

Doubling the playouts is a bit tricky question, because going from 100 to 200 is NOT completely the same utility-wise as going from 300 to 600. But still, past tests often shown remarkably consistent Elo gains per playout multiplication.

as0770 wrote:Bills "joke" will help you understand why the winning chances will shift in direction to 50% with more playouts. Here it is your part to prove me wrong.

Just look at the data above.

The only question is: Does the number of playouts affect the statistical significance

As you can see the stronger engine is expected to win more games under high-search conditions. For the weaker net to win a 400 game match by a chance upset, he needs the noise / random deviation to overcome the strengthwise expected advantage of the stronger player. Random deviation is constant for 400 games, the advantage of the stronger player is bigger with more playouts, hence the probability of getting the winner/stronger side wrong is less for the same number of games.

as0770 · Post by **as0770** » Tue Sep 10, 2019 1:43 pm

jann wrote:Just look at the data above.

OMG. You completely mixed up different issues.

The data are related to the strength of nets of different sizes. Of course their strength depends on the number of playouts.

The issue of this topic is only the statistical significance of results...

jann · Post by **jann** » Tue Sep 10, 2019 2:06 pm

The data is about how strength DIFFERENCE (thus the expected win%) between the same two nets changes with increasing playouts (for both nets, proportionally). There are data points even for nets of the same sizes (both here and elsewhere).

as0770 · Post by **as0770** » Wed Sep 11, 2019 9:04 am

jann wrote:The data is about how strength DIFFERENCE (thus the expected win%) between the same two nets changes with increasing playouts (for both nets, proportionally). There are data points even for nets of the same sizes (both here and elsewhere).

Exactly. And the data say nothing about statistical significance related to the number of games.

Bill Spight · Post by **Bill Spight** » Wed Sep 11, 2019 9:09 am

BTW, has anyone come up with an explanation for the U shape graphs? Thanks.

jann · Post by **jann** » Wed Sep 11, 2019 10:28 am

Bill Spight wrote:BTW, has anyone come up with an explanation for the U shape graphs? Thanks.

Note that it's not complete U shape, only the low extreme below ~10 nodes behaves differently. As I wrote my guess is granularity (which makes small errors or differences impossible, and changes the rewards and penalties for selection).

With only a few nodes in total, you cannot get the order or move priority "slightly" wrong. Looking at a single useless node wastes 20% of the whole search if the total playouts is 5. Another thing to consider that if you choose the order of A and B inefficiently, this may cut off B completely if only a few nodes examined, while normally it only delays looking at B.

as0770 · Post by **as0770** » Wed Sep 11, 2019 10:30 am

Bill Spight wrote:BTW, has anyone come up with an explanation for the U shape graphs? Thanks.

I'd say with a few playouts the factor for the smaller net is higher because in a Monte Carlo search the number of playouts below 100 don't help much. The number of playouts must be high enough to get a statistical significant result. The same at the end of the scale. If you have a statistical significant result even more playouts don't help much either. So there is an area inbetween where the smaller/weaker net can compensate less knowledge with more playouts.

But I am not an expert and didn't look closely at it

Bill Spight · Post by **Bill Spight** » Wed Sep 11, 2019 11:47 am

as0770 wrote:
Bill Spight wrote:BTW, has anyone come up with an explanation for the U shape graphs? Thanks.
I'd say with a few playouts the factor for the smaller net is higher because in a Monte Carlo search the number of playouts below 100 don't help much. The number of playouts must be high enough to get a statistical significant result. The same at the end of the scale. If you have a statistical significant result even more playouts don't help much either. So there is an area inbetween where the smaller/weaker net can compensate less knowledge with more playouts.

But I am not an expert and didn't look closely at it

Well, statistical significance is a matter of convention, not skill at go, so I don't see how that matters.

as0770 · Post by **as0770** » Wed Sep 11, 2019 11:57 am

Bill Spight wrote:
as0770 wrote:
Bill Spight wrote:BTW, has anyone come up with an explanation for the U shape graphs? Thanks.
I'd say with a few playouts the factor for the smaller net is higher because in a Monte Carlo search the number of playouts below 100 don't help much. The number of playouts must be high enough to get a statistical significant result. The same at the end of the scale. If you have a statistical significant result even more playouts don't help much either. So there is an area inbetween where the smaller/weaker net can compensate less knowledge with more playouts.

But I am not an expert and didn't look closely at it
Well, statistical significance is a matter of convention, not skill at go, so I don't see how that matters.

Monte Carlo Search plays random moves to finish the game and usees the win/loss statistics to evaluate moves. With "statistical significant result" I mean the results of the Monte Carlo Search.

Mike Novack · Post by **Mike Novack** » Wed Sep 11, 2019 12:11 pm

I'll try to make that clearer.

The programs based upon MonteCarlo search are based upon the concept that IF (in a given position) the win rate of a number of random playouts is higher following move A than following move B then move A is a better move than move B.

Statistical relevance (the need for enough of these random playouts) is because otherwise the observed difference (win % following A higher than win % following B) might just be chance, or worse,worng just by chance.

Bill Spight · Post by **Bill Spight** » Wed Sep 11, 2019 12:37 pm

Bill Spight wrote:
as0770 wrote:
Bill Spight wrote:BTW, has anyone come up with an explanation for the U shape graphs? Thanks.
I'd say with a few playouts the factor for the smaller net is higher because in a Monte Carlo search the number of playouts below 100 don't help much. The number of playouts must be high enough to get a statistical significant result. The same at the end of the scale. If you have a statistical significant result even more playouts don't help much either. So there is an area inbetween where the smaller/weaker net can compensate less knowledge with more playouts.

But I am not an expert and didn't look closely at it
Well, statistical significance is a matter of convention, not skill at go, so I don't see how that matters.

as0770 wrote:jjMonte Carlo Search plays random moves to finish the game and usees the win/loss statistics to evaluate moves. With "statistical significant result" I mean the results of the Monte Carlo Search.

Mike Novack wrote:I'll try to make that clearer.

The programs based upon MonteCarlo search are based upon the concept that IF (in a given position) the win rate of a number of random playouts is higher following move A than following move B then move A is a better move than move B.

Statistical relevance (the need for enough of these random playouts) is because otherwise the observed difference (win % following A higher than win % following B) might just be chance, or worse,worng just by chance.

I'm not up on the internals of today's bots, but, IIRC, statistical arguments in the N-armed bandit solution have to do with choosing which node of the game tree to expand.

Anyway, in simple estimation, the more instances you have the greater the convergence. The U shape indicates that something else is going on. My first guess is that with sufficient playouts new moves are uncovered which alter the evaluation, and that stronger programs are better at taking that difference in evaluation into account. But that does not explain a sustained divergence in the winning percentage of the stronger program. {shrug}

lightvector · Post by **lightvector** » Thu Sep 12, 2019 6:16 am

At the high end, my intuition is strongly that the playout advantage needed for a "weaker" net to match a sufficiently "stronger" neural net should continue to increase almost arbitrarily high the further you go.

Basically, this comes from the "stronger" neural net being better at evaluation. And in Go, time horizons are long enough that no feasible amount of search can save you in many cases if you are fundamentally misjudging a position. E.g. if you think a given territory is larger than it actually is because you miss straightforward endgame aji, or that a group is alive when actually the opponent has a throw-in ko and is just waiting to build threats and/or for the ko to be valuable enough, then you're in trouble. If it's early in the game, it could take all the way until the later midgame or endgame before it becomes worth the opponent's time to exploit it, and search obviously cannot read that far, so search cannot help you correct the misjudgment.

So the intuitive mental model I have is that you have some distribution of mistakes that vary widely in how fixable-by-search they are, immediate tactical blunders, versus short-term misjudgments that will reveal their impact quickly, versus medium-term misjudgments that can still be corrected with very deep reading, versus long-term misjudgments that are unfixable. At lower playouts, the weaker net sometimes wins games because the stronger net makes tactical blunders sometimes, and only a small playout advantage is sufficient to reach equality because that will greatly decrease the weaker net's blunder rate compared to the stronger net. As you increase playouts enough, more and more of the tactical mistakes get fixed, increasingly often leaving just the errors in judgment, which aren't fixable by search. And therefore the playout advantage for the weaker net needed to reach 50-50 increases almost arbitrarily, because as fewer and fewer of the mistakes become easily fixable by search, the weaker net keeps needing more and more and more of a playout factor to be able to fix enough of them to compenstate the fact that it's also making a constant unfixable rate of judgment errors compared to the stronger net.

I saw these curves a long time ago, so the results aren't news to me, but at the time I first read it, the surprising part of the "U shape" was the ratio being high at the beginning, which is of course not explained by the above intuition. I'm still pretty confident in the above intuitive explanation for the right side of the "U shape", It's very likely an entirely different phenomenon responsible for the left half. (Which I have some intuitions about now too but I won't touch on since they're harder to put into words for me).

Keep in mind that none of the above is a logical proof or anything. Indeed, it would probably not be hard to construct a strong neural net at low playouts that gets *weaker* the more you search, flagrantly violating the above argument! In the extreme, take a strong neural net and negate its value input, so that as it searches, it directly prefers bad results to good ones. At low playouts, it will be dominated by its initial move intuitions and priors (the "policy" output), which are still likely good moves and will usually beat the weaker net at similarly low playouts. But as it searches, those priors will get corrupted by the horrendous evaluations, leading it to increasingly prefer worse moves after a little search. (I haven't actually tested this, but I'm pretty sure something like this would happen). So the above is just an intuition about neural nets we train in practice, not something that is true by necessity.

Bill Spight · Post by **Bill Spight** » Thu Sep 12, 2019 7:28 am

lightvector wrote:Basically, this comes from the "stronger" neural net being better at evaluation. And in Go, time horizons are long enough that no feasible amount of search can save you in many cases if you are fundamentally misjudging a position. E.g. if you think a given territory is larger than it actually is because you miss straightforward endgame aji, or that a group is alive when actually the opponent has a throw-in ko and is just waiting to build threats and/or for the ko to be valuable enough, then you're in trouble. If it's early in the game, it could take all the way until the later midgame or endgame before it becomes worth the opponent's time to exploit it, and search obviously cannot read that far, so search cannot help you correct the misjudgment.

Just to note that the examples given are things that humans, by our propensity for depth first local search, can and do uncover early in the game.

Edit: Let me add that, because of the nature of go, in which pieces do not move, local reading often remains pertinent for many moves.

jann · Post by **jann** » Thu Sep 12, 2019 10:45 am

There's actually another and very simple possible explanation why the stronger side's win% expectation keeps on moving further away from 50% with more playouts for both sides: the compounding effect.

Being stronger usually means that a look at a position gives slightly more correct move ordering, the first (highest policy) moves are slightly more often better objectively. Let's say this advantage worth 1.01 (in whatever scale). Now let's say that in a low-search condition the average search depth is 3, while in a high-search condition it is 8. This means this advantage is raised to different powers, since it manifests at ALL depth levels (from top to leaf). So the stronger side's search looks at more relevant answers to more relevant answers to more relevant moves etc. The deeper the search, the greater the efficiency/relevance advantage is (by the time a leaf node reached, eg. raised from 1.03 to 1.083 above).

(Edit: Since leaf nodes are the most numerous, their weight is high in the averages, thus their relevance strongly affect the accuracy of even the top evaluations.)

as0770 · Post by **as0770** » Thu Sep 12, 2019 11:06 pm

Bill Spight wrote:Anyway, in simple estimation, the more instances you have the greater the convergence.

For every engine you can say the more playouts, the stronger. But the benefit is not linear. For above mentioned reason there is less benefit with a very small and a very high number of playouts.

The graph compares engines with a different number of playouts. The benefit form 5 -> 10 po is likely less then from 500 -> 1000 po. The benefit from 500->1000 po is likely higher than from 500.000 to 1.000.000 po. This would explain the u-shape.

Life In 19x19

Engine Tournament

Re: Engine Tournament

Re: Engine Tournament

Re: Engine Tournament

Re: Engine Tournament

Re: Engine Tournament

Re: Engine Tournament

Re: Engine Tournament

Re: Engine Tournament

Re: Engine Tournament

Re: Engine Tournament

Re: Engine Tournament

Re: Engine Tournament

Re: Engine Tournament

Re: Engine Tournament

Re: Engine Tournament