Engine Tournament
Re: Engine Tournament
Even this graph shows the same curve for nets of the same sizes starting with the same number of playouts, where the strength difference is smaller thus the low-mid point is at playout parity (1:1).
Also, as I mentioned, there were more reports like this. Some of them compared dozens of nets (mostly of the same size) and also found that the Elo difference between the same two nets, at equal playouts for both, gets bigger with more playouts.
-
as0770
- Lives with ko
- Posts: 180
- Joined: Sun Jun 26, 2016 8:07 am
- Rank: Beginner
- GD Posts: 0
- Has thanked: 15 times
- Been thanked: 23 times
Re: Engine Tournament
jann wrote: Even this graph shows the same curve for nets of the same sizes starting with the same number of playouts, where the strength difference is smaller thus the low-mid point is at playout parity (1:1).
Well, I see 2 graphs of nets with the same size; in one you can make out some kind of U-shape, in the other you can't. Not much of a sample... Of course the same effect might happen with nets of the same size but different strength, but you can't deny that it is at least "less distinctive".
-
as0770
Re: Engine Tournament
jann wrote: Even this graph shows the same curve for nets of the same sizes starting with the same number of playouts, where the strength difference is smaller thus the low-mid point is at playout parity (1:1).
as0770 wrote: Well, I see 2 graphs of nets with the same size; in one you can make out some kind of U-shape, in the other you can't. Not much of a sample... Of course the same effect might happen with nets of the same size but different strength, but you can't deny that it is at least "less distinctive".
Thinking about that... If the hypothesis is correct, it has nothing to do with the size of the net, but with the factor by which the number of playouts must differ to make the engines play at the same strength. The closer that factor is to 1, the less distinctive the U-shape. And I think you can interpret the graphs in this way.
Re: Engine Tournament
The size of the net and the low parity factor are closely related (larger nets are stronger but slower). Same-size nets tend to be closer in strength, that's why the curve is less steep. And again, there were plenty of other tests done beyond the single linked graph.
-
as0770
Re: Engine Tournament
jann wrote: The size of the net and the low parity factor are closely related (larger nets are stronger but slower). Same-size nets tend to be closer in strength, that's why the curve is less steep. And again, there were plenty of other tests done beyond the single linked graph.
And where is the contradiction to what I wrote?
-
q30
- Lives with ko
- Posts: 145
- Joined: Sat Aug 13, 2016 8:23 am
- Rank: 30 kyu
- GD Posts: 0
- Has thanked: 1 time
- Been thanked: 1 time
Re: Engine Tournament
When I wrote that your tests are "synthetic" with such small amounts of thinking time (playouts), and you answered that my tests aren't "statistically significant" with so few games, I answered that your tests aren't "practically significant", because they can give different results than sparring under real time control. I also wrote (I don't remember whether there or in PM) that in the case of pure MC engines, as the time (playouts) per move --> 0, the play --> random and the match result --> 50%/50% regardless of engine strength (the stronger engine can even get <50% of wins because of statistical deviation).
as0770 wrote: The chance it would win all games is e.g. 60% but there is also a chance it will lose all games by e.g. 40%. The random factor is part of the match condition.
jann wrote: This is not just what I meant. For the stronger net to lose, there still must be some random factor - something that can go against it. Without such, and without even being unlucky, it won't lose.
as0770 wrote: If Engine A wins 60% against Engine B, it is supposed to be stronger. If there is no random factor, the Engines will play the same game again and again and one engine will win all games. Hence the chance that the stronger Engine A loses all games is 40%.
On the other hand the random factor will affect the result very much: a high random factor might force the stronger engine to play moves that it doesn't like, and it may even lose because of that if the weaker engine can handle that better. So it is a very important point for interpreting the results of a match; you will get completely different results just by changing the random factor, and it is unpredictable what influence it has.
Quote: In a match between A and B it might happen that A wins with x playouts, and B with x*4 playouts. Then A is stronger in games with x playouts and B in games with x*4 playouts.
jann wrote: No, this is quite unlikely.
Just because it is unlikely you can't ignore it. We are talking about determining the strength of an engine and how many games you need to get a statistically significant result. If you think you know the outcome of a match you don't need to play it. And as soon as you set not the number of playouts but the time for each move, you will find that there are matchups of small vs. big nets where with little time the small net wins and with more time the big net.
The same might happen with nets of equal size: one net understands ladders, the other one needs a certain number of playouts to calculate ladders.
jann wrote: It was observed that higher playouts usually match the results of lower playouts, only with increased differences.
With every match you evaluate the strength under specific conditions. What you are talking about is maybe an effect when matching similar engine nets. It might be different with other types of engines. In fact the opposite is true as Bill easily proved. But that is not at all the point. The only point I am talking about is the statistical significance of a result.
That means if you have a result of A vs. B of 220:180, the chance that B is stronger under these match conditions is still > 2%, regardless of the number of playouts.
And btw, good points don't begin with "it was observed that..." ;-)
as0770 wrote: The statistical significance for every match condition depends exclusively on the number of games.
jann wrote: No, it also depends on the quality/representativeness of the games. A less representative / more random game sample can be thought of like having N% chance of being replaced by a random value (thus resulting in lower number of effective samples).
The outcome of a game is like rolling a die and the result depends on probabilities. It won't change the outcome if you roll the die harder. The results of a game with more playouts are more important for us, no doubt, but to get a statistically significant result you don't need fewer games than with few playouts. This is simple mathematics.
...
The data are related to the strength of nets of different sizes. Of course their strength depends on the number of playouts.
The issue of this topic is only the statistical significance of results...
I don't know whether the curves would be U-shaped in the same way as in the data above (which uses visits, and where not all curves are U-shaped either) if the x-axis were time per move (at constant PC performance) or playouts per move and the y-axis were win %, let alone for other neural nets and engines. But if such curves cross the straight 50% line, the results depend on the number of playouts not only quantitatively but also qualitatively...
Quote: The number of playouts must be high enough to get a statistically significant result.
I am glad that you understood the main idea...
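As a rough check of the 220:180 figure quoted above, here is a minimal sketch. It reads the claim as a one-sided binomial test, assuming two equally strong engines, independent games and no draws. Those assumptions and the function name `prob_at_least` belong to this sketch and are not stated in the thread.

```python
from math import comb

def prob_at_least(wins: int, games: int, p: float = 0.5) -> float:
    """Exact binomial tail: P(X >= wins) for X ~ Binomial(games, p)."""
    return sum(
        comb(games, k) * p**k * (1 - p) ** (games - k)
        for k in range(wins, games + 1)
    )

# Assumption: both engines are equally strong (p = 0.5), games are
# independent, and there are no draws.
p_value = prob_at_least(220, 400)
print(f"P(at least 220 wins out of 400 by luck alone) = {p_value:.4f}")
```

The result lands between 2% and 3%, in line with the "> 2%" in the post. Strictly speaking this is the chance that two equal engines produce such a score, which is not quite the same thing as the chance that B is the stronger one.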
-
q30
Re: Engine Tournament
jann wrote: The size of the net and the low parity factor are closely related (larger nets are stronger but slower). Same-size nets tend to be closer in strength, that's why the curve is less steep. And again, there were plenty of other tests done beyond the single linked graph.
Larger nets are only potentially stronger, because they "think slower" not only when playing but also when learning (example).
-
as0770
Re: Engine Tournament
Quote: The number of playouts must be high enough to get a statistically significant result.
q30 wrote: I am glad that you understood the main idea...
I am sorry to say that, but once again you didn't understand at all... This was related to Monte Carlo Tree search...
You can't participate in such discussions with Google Translator.
-
q30
Re: Engine Tournament
Quote: The number of playouts must be high enough to get a statistically significant result.
q30 wrote: I am glad that you understood the main idea...
as0770 wrote: I am sorry to say that, but once again you didn't understand at all... This was related to Monte Carlo Tree search... You can't participate in such discussions with Google Translator.
Almost all engines with neural nets use MC search too (and use its results for resigning); for example, in LZ: neural nets - visits (and nneval win values), MC - playouts (and win %)...
I write without any translator, but for some words and expressions I use https://www.translate.ru.
-
as0770
Re: Engine Tournament
q30 wrote: Almost all engines with neural nets use MC search too (and use its results for resigning); for example, in LZ: neural nets - visits (and nneval win values), MC - playouts (and win %)...
You still didn't understand. It was related to "Monte Carlo Tree search" and not to "engines that use Monte Carlo Tree search".
-
q30
Re: Engine Tournament
q30 wrote: Almost all engines with neural nets use MC search too (and use its results for resigning); for example, in LZ: neural nets - visits (and nneval win values), MC - playouts (and win %)...
as0770 wrote: You still didn't understand. It was related to "Monte Carlo Tree search" and not to "engines that use Monte Carlo Tree search".
So you still think that, even in the case of pure MC engines, it is more "statistically significant" to minimize the engines' thinking time (down to 0 in the limit) and maximize the number of games (up to infinity in the limit) to get a real idea of the engines' strength ratio, don't you?
-
as0770
Re: Engine Tournament
q30 wrote: Almost all engines with neural nets use MC search too (and use its results for resigning); for example, in LZ: neural nets - visits (and nneval win values), MC - playouts (and win %)...
as0770 wrote: You still didn't understand. It was related to "Monte Carlo Tree search" and not to "engines that use Monte Carlo Tree search".
q30 wrote: So you still think that, even in the case of pure MC engines, it is more "statistically significant" to minimize the engines' thinking time (down to 0 in the limit) and maximize the number of games (up to infinity in the limit) to get a real idea of the engines' strength ratio, don't you?
If you use little time/playouts, you determine the strength with little time/playouts. If you want to know the strength with much time/playouts you have to play with much time/playouts. In both cases you need the same number of games to get a statistically significant result. Quite simple, isn't it?
Re: Engine Tournament
as0770 wrote: In both cases you need the same number of games to get a statistically significant result.
jann wrote: As you can see the stronger engine is expected to win more games under high-search conditions. For the weaker net to win a 400 game match by a chance upset, it needs the noise / random deviation to overcome the strengthwise expected advantage of the stronger player. Random deviation is constant for 400 games, the advantage of the stronger player is bigger with more playouts, hence the probability of getting the winner/stronger side wrong is less for the same number of games.
Your basic oversight is only worrying about the absolute margin of error. But statistical significance is about the ratio between the signal to be observed and the margin of error, i.e. the relative error.
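The signal-versus-noise argument can be illustrated with a small calculation. This is only a sketch: the per-game win rates of 55% and 65% are hypothetical numbers chosen for illustration, games are assumed independent with no draws, and `upset_probability` is a name made up here.

```python
from math import comb

def upset_probability(p_stronger: float, games: int = 400) -> float:
    """P(the stronger engine wins fewer than half of the games)."""
    half = games / 2
    return sum(
        comb(games, k) * p_stronger**k * (1 - p_stronger) ** (games - k)
        for k in range(games + 1)
        if k < half
    )

# Hypothetical per-game win rates of the stronger engine.
low_search = upset_probability(0.55)   # small edge, e.g. few playouts
high_search = upset_probability(0.65)  # bigger edge, e.g. many playouts
print(f"upset chance at 55%: {low_search:.4f}")
print(f"upset chance at 65%: {high_search:.2e}")
```

The spread of a 400-game result is about the same in both cases, but the expected lead is three times larger at 65%, so the chance of a wrong winner drops sharply for the same number of games.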
-
as0770
Re: Engine Tournament
jann wrote: Your basic oversight is only worrying about the absolute margin of error.
Indeed, this was the subject of debate. Some think that you can measure the strength with a few games as long as the quality is good enough.
jann wrote: As you can see the stronger engine is expected to win more games under high-search conditions. For the weaker net to win a 400 game match by a chance upset, it needs the noise / random deviation to overcome the strengthwise expected advantage of the stronger player. Random deviation is constant for 400 games, the advantage of the stronger player is bigger with more playouts, hence the probability of getting the winner/stronger side wrong is less for the same number of games.
That's because the strength difference seems to be bigger under high-search than under low-search conditions (up to a point). The reason for that is a completely different topic.
Re: Engine Tournament
as0770 wrote: Some think that you can measure the strength with a few games as long as the quality is good enough.
I haven't seen such claims, but that's wrong as well. Confidence requires both quality and quantity - a fair amount of representative samples.
The reason I phrased it as "low quality games need more samples" (and not the reverse) is that the two directions are asymmetric. The effect of amplifying the signal is not necessarily always strong enough, may have rare exceptions etc., so it's better to just think of sample weights ~= 1 if the playouts are decent enough. But the opposite is different. If there is even a POTENTIAL that your samples become more random, so that your signal weakens and the expected result falls toward 50%, you already need more samples.
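The "N% chance of being replaced by a random value" picture from earlier in the thread can be turned into a toy calculation. The sketch below assumes a fixed share of games is decided by a fair coin flip and uses a normal approximation with a threshold of two standard errors. The 60% base win rate, the threshold and both function names are assumptions made for illustration.

```python
from math import ceil, sqrt

def effective_win_rate(p: float, random_share: float) -> float:
    """Win rate when a share of games is replaced by a fair coin flip."""
    return (1 - random_share) * p + random_share * 0.5

def games_needed(p_eff: float, z: float = 2.0) -> int:
    """Games until the expected lead over 50% equals z standard errors."""
    if p_eff <= 0.5:
        raise ValueError("p_eff must be above 0.5")
    return ceil((z * sqrt(p_eff * (1 - p_eff)) / (p_eff - 0.5)) ** 2)

# Hypothetical base win rate of 60% for the stronger engine.
for share in (0.0, 0.25, 0.5):
    p_eff = effective_win_rate(0.6, share)
    print(f"random share {share:.2f}: win rate {p_eff:.3f}, "
          f"games needed {games_needed(p_eff)}")
```

In this toy model a 50% random share halves the edge over 50% and roughly quadruples the number of games needed, which is the "lower number of effective samples" effect described above.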