Engine Tournament

as0770 · Post by **as0770** » Sun Sep 08, 2019 11:19 am

jann wrote:
as0770 wrote:There is nothing like a random factor.
Without random factor the stronger net would always win (and the games may even be identical).

Now I get what you mean. But the stronger net won't necessarily win. The chance that it wins all games is e.g. 60%, but there is also a chance of e.g. 40% that it loses all games. The random factor is part of the match condition.

jann wrote:A winrate of e.g. 54% may go up to 58% with quadruple playouts. This 58% yields slightly more statistical mass from the same number of games (because each sample weighs nearly 1, while at very low playouts game results are more random and thus weigh less than 1, i.e. carry less information).

In a match between A and B it might happen that A wins with x playouts, and B with x*4 playouts. Then A is stronger in games with x playouts and B in games with x*4 playouts. The winning chances can be different for all kinds of match conditions, and the results of a match are only valid for its own match condition. The statistical significance for every match condition depends exclusively on the number of games.

jann · Post by **jann** » Sun Sep 08, 2019 12:16 pm

as0770 wrote:The chance that it wins all games is e.g. 60%, but there is also a chance of e.g. 40% that it loses all games. The random factor is part of the match condition.

This is not just what I meant. For the stronger net to lose, there still must be some random factor - something that can go against it. Without such, and without even being unlucky, it won't lose.

as0770 wrote:In a match between A and B it might happen that A wins with x playouts, and B with x*4 playouts. Then A is stronger in games with x playouts and B in games with x*4 playouts.

No, this is quite unlikely. It was observed that higher playouts usually match the results of lower playouts, only with increased differences.

as0770 wrote:The statistical significance for every match condition depends exclusively on the number of games.

No, it also depends on the quality/representativeness of the games. A less representative / more random game sample can be thought of as having an N% chance of being replaced by a random value (thus resulting in a lower number of effective samples).
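
One way to make the "replaced by a random value" picture concrete is the minimal sketch below. It assumes a simple mixture model that is not spelled out in the thread: with probability `noise` a game result is a fair coin flip, otherwise the stronger side wins with probability `p_true`. The function names, the 58% example and the 2-sigma target are illustrative assumptions only.

```python
import math

def observed_winrate(p_true: float, noise: float) -> float:
    """Winrate seen when a fraction `noise` of results is replaced by coin flips."""
    return (1.0 - noise) * p_true + noise * 0.5

def games_needed(p_obs: float, z: float = 2.0) -> int:
    """Games needed so that the expected lead over 50% equals z standard deviations.

    Normal approximation with the null variance of 0.25 per game:
    n * (p_obs - 0.5) = z * sqrt(0.25 * n)  =>  n = (z / (2 * (p_obs - 0.5)))**2
    """
    edge = p_obs - 0.5
    if edge <= 0:
        raise ValueError("p_obs must be above 0.5")
    n = (z / (2.0 * edge)) ** 2
    # Small tolerance so floating-point dust does not push an exact value up by one.
    return math.ceil(n - 1e-9)

clean = observed_winrate(0.58, noise=0.0)  # every game is informative
noisy = observed_winrate(0.58, noise=0.5)  # half of the results are coin flips
print(f"clean: winrate {clean:.2f}, games needed {games_needed(clean)}")
print(f"noisy: winrate {noisy:.2f}, games needed {games_needed(noisy)}")
```

Under this assumed model, replacing half of the results by coin flips halves the edge over 50% and roughly quadruples the number of games needed for the same confidence, which is one reading of "lower number of effective samples".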

Bill Spight · Post by **Bill Spight** » Sun Sep 08, 2019 12:41 pm

There is an argument that increasing the number of playouts makes the result of each engine less representative, not more so. Taken to the extreme, with enough playouts each engine plays perfectly. From the results you can't tell which engine is which.

as0770 · Post by **as0770** » Mon Sep 09, 2019 12:43 pm

jann wrote:
as0770 wrote:The chance that it wins all games is e.g. 60%, but there is also a chance of e.g. 40% that it loses all games. The random factor is part of the match condition.
This is not just what I meant. For the stronger net to lose, there still must be some random factor - something that can go against it. Without such, and without even being unlucky, it won't lose.

If Engine A wins 60% against Engine B, it is supposed to be stronger. If there is no random factor, the Engines will play the same game again and again and one engine will win all games. Hence the chance, that the stronger Engine A loses all games, is 40%.

On the other hand, the random factor affects the result very much. A high random factor might force the stronger engine to play moves that it doesn't like, and it may even lose because of that if the weaker engine handles this better. So it is a very important point for interpreting the results of a match: you will get completely different results just by changing the random factor, and its influence is unpredictable.

jann wrote:
In a match between A and B it might happen that A wins with x playouts, and B with x*4 playouts. Then A is stronger in games with x playouts and B in games with x*4 playouts.
No, this is quite unlikely.

Just because it is unlikely doesn't mean you can ignore it. We are talking about determining the strength of an engine and how many games you need to get a statistically significant result. If you think you know the outcome of a match, you don't need to play it. And as soon as you fix the time per move instead of the number of playouts, you will find matchups of small vs. big nets where the small net wins with little time and the big net wins with more time.

The same might happen with nets of equal size: one net understands ladders, while the other needs a certain number of playouts to calculate ladders.

jann wrote:It was observed that higher playouts usually match the results of lower playouts, only with increased differences.

With every match you evaluate the strength in a specific condition. What you are talking about is maybe an effect when matching similar engine nets. It might be different with other types of engines. In fact the opposite is true as Bill easily proved. But that is not at all the point. The only point I am talking about is the statistical significance of a result.

That means if you have a result of A vs. B of 220:180, the chance that B is stronger under these match conditions is still > 2%, regardless of the number of playouts.
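
The "> 2%" figure can be checked with an exact one-sided binomial tail. This is a minimal sketch, assuming independent games without draws and modelling "A is not stronger" as a fair coin (winrate exactly 50%); the function name is made up for the illustration. Strictly speaking the number it prints is a p-value, the chance of seeing at least 220 wins from equally strong engines, rather than the probability that B is stronger.

```python
from math import comb

def tail_at_least(wins: int, games: int, p: float = 0.5) -> float:
    """P(X >= wins) for X ~ Binomial(games, p), summed exactly."""
    return sum(
        comb(games, k) * p**k * (1.0 - p) ** (games - k)
        for k in range(wins, games + 1)
    )

p_value = tail_at_least(220, 400)
print(f"P(at least 220 wins out of 400 | equal strength) = {p_value:.4f}")
```

Nothing in this calculation refers to the number of playouts; it only uses the score and the number of games.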

And btw, good points don't begin with "it was observed that..."

jann wrote:
as0770 wrote:The statistical significance for every match condition depends exclusively on the number of games.
No, it also depends on the quality/representativeness of the games. A less representative / more random game sample can be thought of as having an N% chance of being replaced by a random value (thus resulting in a lower number of effective samples).

The outcome of a game is like rolling a die: the result depends on probabilities. It won't change the outcome if you roll the die harder. The results of a game with more playouts are more important for us, no doubt, but to get a statistically significant result you don't need fewer games than with few playouts. This is simple mathematics.

jann · Post by **jann** » Mon Sep 09, 2019 8:15 pm

as0770 wrote:If there is no random factor, the Engines will play the same game again and again and one engine will win all games. Hence the chance, that the stronger Engine A loses all games, is 40%.

Here you are contradicting yourself, still unconsciously thinking there is some random factor, there are still "chances".

as0770 wrote:In fact the opposite is true as Bill easily proved.

He was joking.

(If I really need to spell this out: that artifact only happens at the very end of the scale, where there is really no random factor and no competition anymore.)

jann wrote:A less representative / more random game sample can be thought of as having an N% chance of being replaced by a random value (thus resulting in a lower number of effective samples).
as0770 wrote:The outcome of a game is like rolling a die: the result depends on probabilities. It won't change the outcome if you roll the die harder. The results of a game with more playouts are more important for us, no doubt, but to get a statistically significant result you don't need fewer games than with few playouts.

Pls reread what I wrote. It's not that with more playouts you need fewer games (sample weight nearing 1); it's that with too few playouts you need more (sample weight <1).

Last attempt to help you understand: suppose there is a special match condition where engine strengths only have a minimal effect on results, who wins each game is almost completely random, but the stronger engine still has some tiny advantage. You play 400 games. The significance of the result will be very small: sd (from the random part) will still be +-10, and the useful, informative part will be dwarfed beside it (+-1 or so). You don't want to measure random noise, you want to measure a signal ("more important for us"). Statistical significance means how unlikely it is that your results were caused by pure chance instead of that signal.
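
The +-10 figure is the binomial standard deviation of the win count over 400 coin-flip games. A short sketch of the noise-versus-signal comparison follows; the 50.25% winrate is a hypothetical value picked here so that the expected surplus comes out at about one game, matching the "+-1 or so" above, and the function names are illustrative.

```python
import math

def win_count_sd(games: int, p: float = 0.5) -> float:
    """Standard deviation of the number of wins over `games` independent games."""
    return math.sqrt(games * p * (1.0 - p))

def expected_surplus(games: int, p: float) -> float:
    """Expected number of wins above the 50% mark."""
    return games * (p - 0.5)

games = 400
noise = win_count_sd(games)               # spread caused by pure chance
signal = expected_surplus(games, 0.5025)  # hypothetical tiny advantage
print(f"noise sd = {noise:.1f} games, signal = {signal:.1f} games, "
      f"signal/noise = {signal / noise:.2f}")
```

With a signal-to-noise ratio this small, a 400-game score says almost nothing about which side is stronger, whichever side happens to come out ahead.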

Bill Spight · Post by **Bill Spight** » Mon Sep 09, 2019 8:25 pm

jann wrote:
In fact the opposite is true as Bill easily proved.
He was joking. (If I really need to spell this out: that artifact only happens at the very end of the scale, where there is really no random factor and no competition anymore.)

Actually, I was not joking. I do not know enough to engage in this debate. But there is a problem with talking about representativeness without defining it. My point was really made earlier. Search is part of the strength of a program. Making no search does not make much sense, because how search is done is one of the characteristics of today's top programs. But if you are comparing two programs without specifying their search parameters, which includes number of playouts, then what are you saying?

jann · Post by **jann** » Mon Sep 09, 2019 8:46 pm

Bill Spight wrote:Making no search does not make much sense, because how search is done is one of the characteristics of today's top programs. But if you are comparing two programs without specifying their search parameters, which includes number of playouts, then what are you saying?

That restricting search to too small an amount makes the results less informative and closer to random, and thus less reliable for choosing the stronger side. In typical cases the observed winrate between A and B increases with the amount of search allowed (not linearly oc).

Bill Spight · Post by **Bill Spight** » Mon Sep 09, 2019 9:13 pm

jann wrote:
Bill Spight wrote:Making no search does not make much sense, because how search is done is one of the characteristics of today's top programs. But if you are comparing two programs without specifying their search parameters, which includes number of playouts, then what are you saying?
That restricting search to too small an amount makes the results less informative and closer to random, and thus less reliable for choosing the stronger side. In typical cases the observed winrate between A and B increases with the amount of search allowed (not linearly oc).

But when you change the search parameters, you change the program being compared. You can't say that more search makes a program more what it is, unless you have defined the program that way.

LZ with 200k playouts is different from LZ with 100k playouts. It is stronger. And it is stronger not just because of randomness. I have shown, with a version of Leela a few years ago, that the strength difference is not random. You can use the non-randomness to identify likely errors by the version with fewer playouts.

jann · Post by **jann** » Mon Sep 09, 2019 11:04 pm

I don't see why that would go against what I wrote, or against the basic statistical phenomenon that the weaker side is significantly more likely to win a 100-game match (by chance) under low-search conditions than under high-search conditions. (Provided that the change of conditions can be performed in an unbiased way, which in practice is only possible if the engines are similar and the amount of search allowed is the same for them. Also, usually more search = less randomness, and a player is not rigidly defined by a fixed search amount; even for humans there are things like time controls.)
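
How much the chance of an upset shrinks can be sketched with the 54% and 58% winrates mentioned earlier in the thread, used here as stand-ins for low-search and high-search conditions. The sketch assumes independent games without draws and counts a 50:50 tie as not a win for the weaker side; the helper name is made up for this illustration.

```python
from math import comb

def upset_probability(p_strong: float, games: int = 100) -> float:
    """Probability that the weaker side wins more than half of `games`."""
    p_weak = 1.0 - p_strong
    majority = games // 2 + 1  # smallest win count that is a strict majority
    return sum(
        comb(games, k) * p_weak**k * p_strong ** (games - k)
        for k in range(majority, games + 1)
    )

for label, p in (("low search", 0.54), ("high search", 0.58)):
    print(f"{label}: stronger side wins {p:.0%} of games, "
          f"upset chance over 100 games = {upset_probability(p):.3f}")
```

Whether more search really moves the per-game winrate from 54% towards 58% for a given pair of engines is the empirical question; the sketch only shows what follows for a 100-game match if it does.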

Bill Spight · Post by **Bill Spight** » Mon Sep 09, 2019 11:28 pm

jann wrote:I don't see why that would go against what I wrote, or against the basic statistical phenomenon that the weaker side is significantly more likely to win a 100-game match (by chance) under low-search conditions than under high-search conditions. (Provided that the change of conditions can be performed in an unbiased way, which in practice is only possible if the engines are similar and the amount of search allowed is the same for them. Also, usually more search = less randomness.)

Well, I do not think that your claim is precise enough, nor do I see any evidence. It may well be that, given two similar neural net programs, there are search conditions that distinguish between them best with regard to strength. But that is an empirical question. It is not proven, and IMHO, not plausible, that simply increasing the number of playouts will always provide better discrimination between the programs. You need to demonstrate that claim.

Edit: I am not sure, but is this your claim? Given two similar neural net programs playing a match against each other with a certain number of games (10,000 maybe?), the more playouts you give each program, the more games the stronger program will win.

If so, that's a demonstrable claim. Within practical limits, OC.

jann · Post by **jann** » Tue Sep 10, 2019 4:50 am

Bill Spight wrote:Edit: I am not sure, but is this your claim? Given two similar neural net programs playing a match against each other with a certain number of games (10,000 maybe?), the more playouts you give each program, the more games the stronger program will win.

If so, that's a demonstrable claim. Within practical limits, OC.

Roughly yes (and the consequence regarding statistical significance). Exceptions are possible oc (and the effect is nowhere near that linear), but in reality the same effect would still often occur under less restricted conditions, you just cannot observe it (performing an unbiased experiment will not be possible if the two sides start with different amounts of search, for example). Btw this is not my claim but actual LZ stats.

Bill Spight · Post by **Bill Spight** » Tue Sep 10, 2019 8:03 am

jann wrote:
Bill Spight wrote:Edit: I am not sure, but is this your claim? Given two similar neural net programs playing a match against each other with a certain number of games (10,000 maybe?), the more playouts you give each program, the more games the stronger program will win.

If so, that's a demonstrable claim. Within practical limits, OC.
Roughly yes (and the consequence regarding statistical significance). Exceptions are possible oc (and the effect is nowhere near that linear), but in reality the same effect would still often occur under less restricted conditions, you just cannot observe it (performing an unbiased experiment will not be possible if the two sides start with different amounts of search, for example). Btw this is not my claim but actual LZ stats.

Thanks.

How does the winning percentage of the stronger player change with, say, doubling the playouts? My guess is that there is an optimal number of playouts to discriminate between players. What about games between players with sizable differences? Are there graphs somewhere of the systematic effects? Thanks.

jlt · Post by **jlt** » Tue Sep 10, 2019 8:46 am

See this post of Friday9i, written on 6 Oct. 2018:
https://github.com/leela-zero/leela-zero/issues/1914

Comparison curves between two nets tend to be U-shaped: the stronger net is much better than the weaker one at low (<10) playouts, or at high (>1000 or >10000) playouts.

Bill Spight · Post by **Bill Spight** » Tue Sep 10, 2019 9:09 am

jlt wrote:See this post of Friday9i, written on 6 Oct. 2018:
https://github.com/leela-zero/leela-zero/issues/1914

Comparison curves between two nets tend to be U-shaped: the stronger net is much better than the weaker one at low (<10) playouts, or at high (>1000 or >10000) playouts.

Many thanks.

I see that the graph is supposedly about game analysis instead of game play. Even now, as I have indicated elsewhere, I would not trust any bot for analysis below 10k. Maybe a broader search setting would be better for game analysis, I dunno. Edit: The point being that a broad search can uncover good plays that are not originally considered good enough to explore.

as0770 · Post by **as0770** » Tue Sep 10, 2019 9:24 am

jann wrote:
as0770 wrote:If there is no random factor, the Engines will play the same game again and again and one engine will win all games. Hence the chance, that the stronger Engine A loses all games, is 40%.
Here you are contradicting yourself, still unconsciously thinking there is some random factor, there are still "chances".

First, this was an answer to "The stronger engine won't lose" which is wrong. I don't see a contradiction. Before playing a game you don't know the outcome. You can estimate the chances by playing with other conditions or against other engines.

If Engine A wins a 1000-playout match 100:0 and Engine B wins a 1001-playout match 100:0, which one is stronger?

jann wrote:
In fact the opposite is true as Bill easily proved.
He was joking. (If I really need to spell this out: that artifact only happens at the very end of the scale, where there is really no random factor and no competition anymore.)

You don't seem to have much experience with AI games. Usually the "draw factor" rises with more calculation time. This has been shown in many cases with A/B search AI engines. I see no reason why this should be different in a Monte Carlo search. Bill's "joke" will help you understand why the winning chances will shift towards 50% with more playouts. Here it is your part to prove me wrong.

jann wrote:
jann wrote:A less representative / more random game sample can be thought of as having an N% chance of being replaced by a random value (thus resulting in a lower number of effective samples).
as0770 wrote:The outcome of a game is like rolling a die: the result depends on probabilities. It won't change the outcome if you roll the die harder. The results of a game with more playouts are more important for us, no doubt, but to get a statistically significant result you don't need fewer games than with few playouts.
Pls reread what I wrote. It's not that with more playouts you need fewer games (sample weight nearing 1); it's that with too few playouts you need more (sample weight <1).

Please explain the difference between "with few playouts you need more games" and "with many playouts you need fewer games".

jann wrote:Last attempt to help you understand:

Thanks, but what you say doesn't explain anything in question.

jann wrote:suppose there is a special match condition where engine strengths only have a minimal effect on results, who wins each game is almost completely random, but the stronger engine still has some tiny advantage. You play 400 games. The significance of the result will be very small: sd (from the random part) will still be +-10, and the useful, informative part will be dwarfed beside it (+-1 or so). You don't want to measure random noise, you want to measure a signal ("more important for us"). Statistical significance means how unlikely it is that your results were caused by pure chance instead of that signal.

I didn't say anything different. The only question is: does the number of playouts affect the statistical significance? And no matter what you do or say: for every condition there is a winning chance.

And my last attempt to help you understand: The result of A vs. B of 220:180 means the probability that A is better is 97%. No matter how many playouts.
