Engine Tournament

jann · Post by **jann** » Fri Sep 13, 2019 4:43 am

Even this graph shows the same curve for nets of the same sizes starting with the same number of playouts, where the strength difference is smaller thus the low-mid point is at playout parity (1:1).

Also as I mentioned there were more similar reports, some of them compared dozens of nets (mostly the same size) and also found bigger Elo difference between the same two nets at the same playouts with more playouts.

as0770 · Post by **as0770** » Fri Sep 13, 2019 6:44 am

jann wrote:Even this graph shows the same curve for nets of the same sizes starting with the same number of playouts, where the strength difference is smaller thus the low-mid point is at playout parity (1:1).

Well, I see 2 graphs of nets with the same size; in one you can make out some kind of U shape, in the other you can't. Not much of a sample... Of course the same effect might occur with nets of the same size but different strength, but you can't deny that it is at least "less distinctive".

as0770 · Post by **as0770** » Sat Sep 14, 2019 2:08 am

as0770 wrote:
jann wrote:Even this graph shows the same curve for nets of the same sizes starting with the same number of playouts, where the strength difference is smaller thus the low-mid point is at playout parity (1:1).
Well, I see 2 graphs of nets with the same size; in one you can make out some kind of U shape, in the other you can't. Not much of a sample... Of course the same effect might occur with nets of the same size but different strength, but you can't deny that it is at least "less distinctive".

Thinking about that... If the hypothesis is correct, it has nothing to do with the size of the net, but with the factor by which the number of playouts must differ to make the engines play at the same strength. The closer this factor is to 1, the less distinctive the U shape. And I think you can interpret the graphs in this way.

jann · Post by **jann** » Sat Sep 14, 2019 2:50 am

The size of the net and the low parity factor are closely related (larger nets are stronger but slower). Same-size nets tend to be closer in strength, which is why the curve is less steep. And again, there were plenty of other tests done beyond the single linked graph.

as0770 · Post by **as0770** » Sat Sep 14, 2019 2:53 am

jann wrote:The size of the net and the low parity factor are closely related (larger nets are stronger but slower). Same-size nets tend to be closer in strength, which is why the curve is less steep. And again, there were plenty of other tests done beyond the single linked graph.

And where is the contradiction to what I wrote?

q30 · Post by **q30** » Sat Sep 14, 2019 4:28 am

as0770 wrote:
jann wrote:
as0770 wrote:The chance it would win all games is e.g. 60% but there is also a chance it will lose all games by e.g. 40%. The random factor is part of the match condition.
This is not just what I meant. For the stronger net to lose, there still must be some random factor - something that can go against it. Without such, and without even being unlucky, it won't lose.
If Engine A wins 60% against Engine B, it is supposed to be stronger. If there is no random factor, the Engines will play the same game again and again and one engine will win all games. Hence the chance, that the stronger Engine A loses all games, is 40%.

On the other hand, the random factor will affect the result very much: a high random factor might force the stronger engine to play moves that it doesn't like, and it may even lose because of that if the weaker engine handles it better. So it is a very important point for interpreting the results of a match; you will get completely different results just by changing the random factor, and its influence is unpredictable.

jann wrote:
In a match between A and B it might happen that A wins with x playouts, and B with x*4 playouts. Then A is stronger in games with x playouts and B in games with x*4 playouts.
No, this is quite unlikely.
Just because it is unlikely doesn't mean you can ignore it. We are talking about determining the strength of an engine and how many games you need to get a statistically significant result. If you think you know the outcome of a match, you don't need to play it. And as soon as you set the time per move instead of the number of playouts, you will find that there are small vs. big net matchups where the small net wins with little time and the big net wins with more time.

The same might happen with nets of equal size, one net understands ladders, the other one needs a special amount of playouts to calculate ladders.

jann wrote:It was observed that higher playouts usually match the results of lower playouts, only with increased differences.
With every match you evaluate the strength in a specific condition. What you are talking about is maybe an effect when matching similar engine nets. It might be different with other types of engines. In fact the opposite is true as Bill easily proved. But that is not at all the point. The only point I am talking about is the statistical significance of a result.

That means if you have a result of A vs. B of 220:180, the chance that B is stronger under these match conditions is still > 2%, regardless of the number of playouts.
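As a rough check of the 220:180 figure, here is a minimal sketch. It assumes that "the chance that B is stronger" is read as a one-sided tail probability under the hypothesis that both engines are equally strong, and that games are independent with no draws. The function names are made up for illustration.

```python
import math

def normal_tail(wins: int, games: int) -> float:
    """Normal approximation to P(X >= wins) when each game is a fair coin flip."""
    mean = games / 2
    sd = math.sqrt(games) / 2
    z = (wins - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

def exact_tail(wins: int, games: int) -> float:
    """Exact binomial P(X >= wins) for a fair coin, using integer arithmetic."""
    favourable = sum(math.comb(games, k) for k in range(wins, games + 1))
    return favourable / 2 ** games

print(f"normal approximation: {normal_tail(220, 400):.4f}")
print(f"exact binomial:       {exact_tail(220, 400):.4f}")
```

Both values come out a little above 2%, and nothing in the calculation depends on the number of playouts, only on the score and the number of games.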

And btw, good points don't begin with "it was observed that..." ;-)

jann wrote:
as0770 wrote:The statistical significance for every match condition depends exclusively on the number of games.
No, it also depends on the quality/representativeness of the games. A less representative / more random game sample can be thought of as having an N% chance of being replaced by a random value (thus resulting in a lower number of effective samples).
The outcome of a game is like rolling a die, and the result depends on probabilities. It won't change the outcome if you roll the die harder. The results of games with more playouts are more important for us, no doubt, but to get a statistically significant result you don't need fewer games than with few playouts. This is simple mathematics.
...
The data are related to the strength of nets of different sizes. Of course their strength depends on the number of playouts.

The issue of this topic is only the statistical significance of results...

When I wrote that your tests are "synthetic" with such small amounts of thinking time (playouts), and you answered that my tests aren't "statistically significant" with so few games, I replied that your tests aren't "practically significant", because they can give different results than sparring under real time controls. I also wrote (I don't remember whether there or in a PM) that for pure MC engines, as the time (playouts) per move --> 0, the play --> random and the match result --> 50%/50% regardless of engine strength (though the stronger engine can still score <50% because of statistical deviation).
I don't know whether the same U-shaped curves as in the data above (given in visits; not all curves even there are U-shaped) would appear if the x-axis were time (with constant PC performance) or playouts per move and the y-axis were win %, let alone for other neural nets and engines. But if such curves cross the 50% line, the results will depend on the number of playouts not only quantitatively but also qualitatively...

The number of playouts must be high enough to get a statistically significant result.

I am glad that you understood the main idea...

q30 · Post by **q30** » Sat Sep 14, 2019 4:43 am

jann wrote:The size of the net and the low parity factor are closely related (larger nets are stronger but slower). Same-size nets tend to be closer in strength, which is why the curve is less steep. And again, there were plenty of other tests done beyond the single linked graph.

The larger nets are stronger only potentially, because they are "thinking slower" not only when playing, but also when learning (example).

as0770 · Post by **as0770** » Sat Sep 14, 2019 4:50 am

q30 wrote:
The number of playouts must be high enough to get a statistically significant result.
I am glad that you understood the main idea...

I am sorry to say that, but once again you didn't understand at all... This was related to Monte Carlo Tree search...

You can't participate in such discussions with Google Translator.

q30 · Post by **q30** » Sat Sep 14, 2019 5:44 am

as0770 wrote:
q30 wrote:
The number of playouts must be high enough to get a statistically significant result.
I am glad that you understood the main idea...
I am sorry to say that, but once again you didn't understand at all... This was related to Monte Carlo Tree search...

You can't participate in such discussions with Google Translator.

Almost all engines with neural nets use MC search too (and use its results for resigning); for example, in LZ: neural nets - visits (and nneval win values), MC - playouts (and win %)...

I write without any translator, but for some words and expressions I use https://www.translate.ru.

as0770 · Post by **as0770** » Sat Sep 14, 2019 6:19 am

q30 wrote:Almost all engines with neural nets use MC search too (and use its results for resigning); for example, in LZ: neural nets - visits (and nneval win values), MC - playouts (and win %)...

You still didn't understand. It was related to "Monte Carlo Tree search" and not to "engines that use Monte Carlo Tree search".

q30 · Post by **q30** » Sat Sep 14, 2019 9:24 am

as0770 wrote:
q30 wrote:Almost all engines with neural nets use MC search too (and use its results for resigning); for example, in LZ: neural nets - visits (and nneval win values), MC - playouts (and win %)...
You still didn't understand. It was related to "Monte Carlo Tree search" and not to "engines that use Monte Carlo Tree search".

So you still think that, even for pure MC engines, it is more "statistically significant" to minimize the engines' thinking time (down to 0 in the limit) and maximize the number of games (up to infinity in the limit) in order to get a real idea of the engines' strength ratio, don't you?

as0770 · Post by **as0770** » Sat Sep 14, 2019 10:02 am

q30 wrote:
as0770 wrote:
q30 wrote:Almost all engines with neural nets use MC search too (and use its results for resigning); for example, in LZ: neural nets - visits (and nneval win values), MC - playouts (and win %)...
You still didn't understand. It was related to "Monte Carlo Tree search" and not to "engines that use Monte Carlo Tree search".
So you still think that, even for pure MC engines, it is more "statistically significant" to minimize the engines' thinking time (down to 0 in the limit) and maximize the number of games (up to infinity in the limit) in order to get a real idea of the engines' strength ratio, don't you?

If you use little time/playouts, you determine the strength at little time/playouts. If you want to know the strength at much time/playouts, you have to play with much time/playouts. In both cases you need the same number of games to get a statistically significant result. Quite simple, isn't it?

jann · Post by **jann** » Sun Sep 15, 2019 3:46 am

as0770 wrote:In both cases you need the same number of games to get a statistically significant result.

jann wrote:As you can see the stronger engine is expected to win more games under high-search conditions. For the weaker net to win a 400 game match by a chance upset, he needs the noise / random deviation to overcome the strengthwise expected advantage of the stronger player. Random deviation is constant for 400 games, the advantage of the stronger player is bigger with more playouts, hence the probability of getting the winner/stronger side wrong is less for the same number of games.

Your basic oversight is only worrying about the absolute margin of error. But statistical significance is about the proportion between the signal to be observed and the margin of error, ie. the relative error.
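The point about relative error can be illustrated with a small sketch. It assumes independent games without draws, and the 55% and 60% win rates are hypothetical numbers chosen for illustration, not measurements from this thread. `upset_probability` is a made-up helper name.

```python
import math

def upset_probability(p_stronger: float, games: int) -> float:
    """Exact binomial probability that the stronger side wins fewer than half the games.

    p_stronger is the assumed true per-game win rate of the stronger side.
    Intended for match sizes of a few hundred games (larger sizes may overflow floats).
    """
    q = 1.0 - p_stronger
    return sum(
        math.comb(games, k) * p_stronger ** k * q ** (games - k)
        for k in range(0, (games + 1) // 2)
    )

# Same number of games, same absolute noise, different signal.
for p in (0.55, 0.60):
    print(f"true win rate {p:.2f}: upset chance in 400 games = {upset_probability(p, 400):.6f}")
```

With the number of games held at 400, the chance of the weaker side winning the match shrinks as the assumed true win rate moves away from 50%, which is the sense in which a larger strength gap makes the same sample more conclusive.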

as0770 · Post by **as0770** » Sun Sep 15, 2019 6:04 am

jann wrote:Your basic oversight is only worrying about the absolute margin of error.

Indeed, this was the subject of debate. Some think that you can measure the strength with a few games as long as the quality is good enough.

jann wrote:As you can see the stronger engine is expected to win more games under high-search conditions. For the weaker net to win a 400 game match by a chance upset, he needs the noise / random deviation to overcome the strengthwise expected advantage of the stronger player. Random deviation is constant for 400 games, the advantage of the stronger player is bigger with more playouts, hence the probability of getting the winner/stronger side wrong is less for the same number of games.

That's because the strength difference seems to be bigger in high-search than in low-search conditions (up to a point). The reason for that is a completely different topic.

jann · Post by **jann** » Mon Sep 16, 2019 3:40 am

as0770 wrote:Some think that you can measure the strength with a few games as long as the quality is good enough.

I haven't seen such claims, but that's wrong as well. Confidence requires both quality and quantity - a fair amount of representative samples.

The reason I phrased it as low-quality games needing more samples (and not the reverse) is that the two directions are asymmetric. The effect of amplifying the signal is not necessarily always strong enough, may have rare exceptions, etc., so it's better to just think of sample weights ~= 1 if the playouts are decent enough. But the opposite is different. If there is even a POTENTIAL that your samples become more random, your signal weakens and the expected result falls closer to 50%, so you already need more samples.
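A minimal sketch of this "more random samples need more games" argument follows. It assumes the dilution model mentioned earlier in the thread (each game has some chance of being replaced by a coin flip), a normal approximation, and a 2-sigma threshold. The function names and the 60% / 50% figures are illustrative assumptions only.

```python
import math

def diluted_win_rate(p: float, random_share: float) -> float:
    """Observed win rate if a fraction random_share of games is decided by a fair coin flip."""
    return 0.5 + (1.0 - random_share) * (p - 0.5)

def games_needed(p: float, z: float = 2.0) -> int:
    """Approximate games needed for the expected score to sit z standard errors above 50%."""
    edge = p - 0.5
    if edge <= 0:
        raise ValueError("p must be above 0.5")
    return math.ceil(z * z * p * (1.0 - p) / (edge * edge))

clean = 0.60                           # hypothetical true win rate with decent playouts
noisy = diluted_win_rate(clean, 0.5)   # half of the games turned into coin flips
print(games_needed(clean), games_needed(noisy))
```

Under these assumptions, halving the edge over 50% roughly quadruples the number of games needed for the same confidence.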
